Testing Real World
GenAI Systems

Partnering 17 organisations deploying GenAI applications
with 16 leading AI testing specialists

Global AI Assurance
Pilot Overview

17

Applications

9

Geographies

10

Industries

Insights from the Pilot

001

Test what matters

Your context will determine what risks you should (and shouldn’t!) care about. Spend time upfront to design effective tests for those.

002

Don’t expect test data to be fit for purpose

No one has the “right” test dataset to hand. Human and AI effort is needed to generate realistic, adversarial and edge case test data.

003

Look under the hood

Testing just the outputs may not be enough. Interim touchpoints in the application pipeline can help with debugging and increase confidence.

004

Use LLMs as judges, but with skill and caution

Human-only evals don’t scale. LLMs-as- judges are often necessary, but need careful design and human calibration. Cheaper, faster alternatives exist in some situations

Pilot Outcomes

Testing Real World GenAI Systems

Case Study Compendium

Starter Kit by IMDA

Our pilot participants share their biggest insight from the testing process, and what they believe is required to make AI assurance more accessible and reliable.

Participants Testimonials