Testing Real World GenAI Systems

Participating organisations

Global AI Assurance Sandbox Overview

30

Applications

12

Geographies

14

Industries

Insights

001

Test what matters

Your context will determine what risks you should (and shouldn’t!) care about. Spend time upfront to design effective tests for those.

002

Don’t expect test data to be fit for purpose

No one has the “right” test dataset to hand. Human and AI effort is needed to generate realistic, adversarial and edge case test data.

003

Look under the hood

Testing just the outputs may not be enough. Interim touchpoints in the application pipeline can help with debugging and increase confidence.
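
As a rough illustration of checking interim touchpoints, the sketch below records the output of each stage of a toy retrieve-then-generate pipeline so that retrieval can be tested separately from the final answer. Everything here is hypothetical and not taken from any sandbox participant: the `Trace` class, the `retrieve` and `generate` functions, the keyword-matching retrieval, and the sample corpus are stand-ins, and `generate` simply echoes the first document in place of a real model call.

```python
from dataclasses import dataclass, field


@dataclass
class Trace:
    """Holds the output of every pipeline stage for later inspection."""
    stages: dict = field(default_factory=dict)


def retrieve(query, corpus):
    # Toy keyword retrieval: keep documents sharing at least one word with the query.
    words = set(query.lower().split())
    return [doc for doc in corpus if words & set(doc.lower().split())]


def generate(query, docs):
    # Stand-in for an LLM call (assumption): echo the top document, or give up.
    return docs[0] if docs else "I don't know."


def run_pipeline(query, corpus):
    trace = Trace()
    docs = retrieve(query, corpus)
    trace.stages["retrieved"] = docs      # interim touchpoint
    trace.stages["answer"] = generate(query, docs)  # final output
    return trace


corpus = [
    "Refunds are processed within 14 days.",
    "Our office opens at 9am.",
]
trace = run_pipeline("How long do refunds take?", corpus)

# Test the interim stage and the final output separately.
retrieval_ok = trace.stages["retrieved"] == [corpus[0]]
answer_ok = "14 days" in trace.stages["answer"]
print(retrieval_ok, answer_ok)
```

If the final answer is wrong, the recorded `retrieved` stage shows whether the fault lies in retrieval or in generation, which is the debugging benefit the insight describes.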

004

Use LLMs as judges, but with skill and caution

Human-only evals don’t scale. LLMs-as-judges are often necessary, but need careful design and human calibration. Cheaper, faster alternatives exist in some situations.
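
One common way to approach human calibration is to have people label a small sample and measure how closely the LLM judge agrees with them, corrected for chance. The sketch below computes Cohen's kappa for that purpose. It is a minimal example under stated assumptions: the `cohens_kappa` helper, the pass/fail labels, and the sample data are all invented for illustration and do not come from the sandbox.

```python
from collections import Counter


def cohens_kappa(human, judge):
    """Chance-corrected agreement between two equal-length label lists."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    # Fraction of items where the human and the judge gave the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    # Agreement expected if both raters labelled independently at their own base rates.
    expected = sum(
        human_counts[label] * judge_counts[label] for label in human_counts
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical calibration sample: the same eight outputs, labelled twice.
human_labels = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "fail"]
judge_labels = ["pass", "pass", "fail", "fail", "fail", "fail", "pass", "pass"]

kappa = cohens_kappa(human_labels, judge_labels)
print(f"kappa = {kappa:.2f}")
```

Here the two raters agree on six of eight items, but half of that agreement would be expected by chance, so kappa comes out at 0.5. What counts as acceptable agreement depends on the use case and is a judgment call for the team running the test.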

What Do Participants Say

Running some tests and computing some numbers, that is the easy part. But knowing what tests to execute and how to interpret the results, that was the hard part.

Dr Martin Saerbeck

AIQURIS

How meticulous that companies need to be in not only defining the bad behavior that should not exist in AI applications, but also the good behavior that should exist and what is considered good.

Safeer Mohiuddin

Guardrails AI

We learned that businesses replacing internal processes with LLM tools need outputs that don’t just pass technical accuracy checks, but also precisely match nuanced internal standards. And these standards are often subjective and qualitative, emphasizing engagement, coherence, and suitability for customer-facing communications.

Matthew Dodgson

PwC

AI model testing and AI application testing are different. The definition of safety is very different across different use cases and domains. For example, the risk in healthcare is different from the risk in finance.

Yifan Jia

AIDX Tech

Guidelines or best practices on how to build LLM applications so they can be easily tested would be critical. Formal accreditation of vendors and their test approaches could also help in assuring consistency.

Miguel Fernandes

Resaro

There’s too much headache over the cost and complexity of mobilizing testing and assurance technology. Democratize access to these testing technologies well beyond the realm of frontier Labs into the economy, including in SMEs.

Nicolas Miailhe

PRISM Eval

Be a changemaker for trustworthy AI!

© 2026 AI Verify Foundation. All rights reserved.
