001
Test what matters
Your context will determine what risks you should (and shouldn’t!) care about. Spend time upfront to design effective tests for those.
002
Don’t expect test data to be fit for purpose
No one has the “right” test dataset to hand. Human and AI effort is needed to generate realistic, adversarial and edge case test data.
003
Look under the hood
Testing just the outputs may not be enough. Interim touchpoints in the application pipeline can help with debugging and increase confidence.
004
Use LLMs as judges, but with skill and caution
Human-only evals don’t scale. LLMs-as- judges are often necessary, but need careful design and human calibration. Cheaper, faster alternatives exist in some situations
Our pilot participants share their biggest insight from the testing process, and what they believe is required to make AI assurance more accessible and reliable.
Running some tests and computing some numbers, that is the easy part. But knowing what tests to execute and how to interpret the results, that was the hard part.
AIQURIS
How meticulous that companies need to be in not only defining the bad behavior that should not exist in AI applications, but also the good behavior that should exist and what is considered good.
Guardrails AI
We learned that businesses replacing internal processes with LLM tools need outputs that don’t just pass technical accuracy checks, but also precisely match nuanced internal standards. And these standards are often subjective and qualitative, emphasizing engagement, coherence, and suitability for customer-facing communications.
PwC
AI model testing and AI application testing are different. The definition of safety is very different across different use case and domains. For example, the risk in healthcare is different in the risk in finance.
AIDX Tech
Guidelines or best practices on how to build LLM applications so they can be easily tested would be critical. Formal accreditation of vendors and their test approaches could also help in assuring consistency.
Resaro
There’s too much headache over the cost and complexity of mobilizing testing and assurance technology. Democratize access to these testing technologies well beyond the realm of frontier Labs into the economy, including in SMEs.
PRISM Eval
There’s too much headache over the cost and complexity of mobilizing testing and assurance technology. Democratize access to these testing technologies well beyond the realm of frontier Labs into the economy, including in SMEs.
PRISM Eval
Guidelines or best practices on how to build LLM applications so they can be easily tested would be critical. Formal accreditation of vendors and their test approaches could also help in assuring consistency.
Resaro
AI model testing and AI application testing are different. The definition of safety is very different across different use case and domains. For example, the risk in healthcare is different in the risk in finance.
AIDX Tech
We learned that businesses replacing internal processes with LLM tools need outputs that don’t just pass technical accuracy checks, but also precisely match nuanced internal standards. And these standards are often subjective and qualitative, emphasizing engagement, coherence, and suitability for customer-facing communications.
PwC
How meticulous that companies need to be in not only defining the bad behavior that should not exist in AI applications, but also the good behavior that should exist and what is considered good.
Guardrails AI
Running some tests and computing some numbers, that is the easy part. But knowing what tests to execute and how to interpret the results, that was the hard part.
AIQURIS