Testing Real World GenAI Systems

Partnering 17 organisations deploying GenAI applications
with 16 leading AI testing specialists

Global AI Assurance
Pilot Overview

17

Applications

9

Geographies

10

Industries

Insights from the Pilot

001

Test what matters

Your context will determine what risks you should (and shouldn’t!) care about. Spend time upfront to design effective tests for those.

002

Don’t expect test data to be fit for purpose

No one has the “right” test dataset to hand. Human and AI effort is needed to generate realistic, adversarial and edge case test data.

003

Look under the hood

Testing just the outputs may not be enough. Interim touchpoints in the application pipeline can help with debugging and increase confidence.

004

Use LLMs as judges, but with skill and caution

Human-only evals don’t scale. LLMs-as- judges are often necessary, but need careful design and human calibration. Cheaper, faster alternatives exist in some situations

Pilot Outcomes

Testing Real World GenAI Systems

Case Study Compendium

Starter Kit by IMDA

Fact Sheet

Our pilot participants share their biggest insight from the testing process, and what they believe is required to make AI assurance more accessible and reliable.

Defining “what to test”

Application testing  vs model testing

What’s needed  – an open mindset

What’s needed  – standards and accreditation

What's needed: Tooling and Automation

What Do Participants Say

Running some tests and computing some numbers, that is the easy part. But knowing what tests to execute and how to interpret the results, that was the hard part.

Dr Martin Saerbeck

AIQURIS

How meticulous that companies need to be in not only defining the bad behavior that should not exist in AI applications, but also the good behavior that should exist and what is considered good.

Safeer Mohiuddin

Guardrails AI

We learned that businesses replacing internal processes with LLM tools need outputs that don’t just pass technical accuracy checks, but also precisely match nuanced internal standards. And these standards are often subjective and qualitative, emphasizing engagement, coherence, and suitability for customer-facing communications.

Matthew Dodgson

PwC

AI model testing and AI application testing are different. The definition of safety is very different across different use case and domains. For example, the risk in healthcare is different in the risk in finance.

Yifan Jia

AIDX Tech

Guidelines or best practices on how to build LLM applications so they can be easily tested would be critical. Formal accreditation of vendors and their test approaches could also help in assuring consistency.

Miguel Fernandes

Resaro

There’s too much headache over the cost and complexity of mobilizing testing and assurance technology. Democratize access to these testing technologies well beyond the realm of frontier Labs into the economy, including in SMEs.

Nicolas Miailhe

PRISM Eval

Nicolas Miailhe

PRISM Eval

Miguel Fernandes

Resaro

Yifan Jia

AIDX Tech

Matthew Dodgson

PwC

How meticulous that companies need to be in not only defining the bad behavior that should not exist in AI applications, but also the good behavior that should exist and what is considered good.

Safeer Mohiuddin

Guardrails AI

Running some tests and computing some numbers, that is the easy part. But knowing what tests to execute and how to interpret the results, that was the hard part.

Dr Martin Saerbeck

AIQURIS

Testing Real World GenAI Systems

Partnering 17 organisations deploying GenAI applications
with 16 leading AI testing specialists

Global AI Assurance
Pilot Overview

17

Applications

9

Geographies

10

Industries

Insights from the Pilot

Pilot Outcomes

Testing Real World GenAI Systems

Case Study Compendium

Starter Kit by IMDA

Fact Sheet

Defining “what to test”

Application testing  vs model testing

What’s needed  – an open mindset

What’s needed  – standards and accreditation

What's needed: Tooling and Automation

What Do Participants Say

Dr Martin Saerbeck

Safeer Mohiuddin

Matthew Dodgson

Yifan Jia

Miguel Fernandes

Nicolas Miailhe

Nicolas Miailhe

Miguel Fernandes

Yifan Jia

Matthew Dodgson

Safeer Mohiuddin

Dr Martin Saerbeck

Main Report

Case Studies Report

© 2025 AI Verify Foundation. All rights reserved.

Testing Real World GenAI Systems

Partnering 17 organisations deploying GenAI applications with 16 leading AI testing specialists

Global AI Assurance Pilot Overview

17

Applications

9

Geographies

10

Industries

Insights from the Pilot

Pilot Outcomes

Testing Real World GenAI Systems

Case Study Compendium

Starter Kit by IMDA

Fact Sheet

Defining “what to test”

Application testing vs model testing

What’s needed – an open mindset

What’s needed – standards and accreditation

What's needed: Tooling and Automation

What Do Participants Say

Dr Martin Saerbeck

Safeer Mohiuddin

Matthew Dodgson

Yifan Jia

Miguel Fernandes

Nicolas Miailhe

Nicolas Miailhe

Miguel Fernandes

Yifan Jia

Matthew Dodgson

Safeer Mohiuddin

Dr Martin Saerbeck

Main Report

Case Studies Report

© 2025 AI Verify Foundation. All rights reserved.

Testing Real World GenAI Systems

Partnering 17 organisations deploying GenAI applications
with 16 leading AI testing specialists

Global AI Assurance
Pilot Overview

Application testing  vs model testing

What’s needed  – an open mindset

What’s needed  – standards and accreditation