Testing Real-World GenAI Systems

Participating organisations

Global AI Assurance Sandbox Overview

30

Applications

12

Geographies

14

Industries

Insights

001

Test what matters

Your context will determine what risks you should (and shouldn’t!) care about. Spend time upfront to design effective tests for those.

002

Don’t expect test data to be fit for purpose

No one has the “right” test dataset to hand. Human and AI effort is needed to generate realistic, adversarial and edge-case test data.

003

Look under the hood

Testing just the outputs may not be enough. Interim touchpoints in the application pipeline can help with debugging and increase confidence.
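
As a rough illustration of testing interim touchpoints, the sketch below records each stage of a toy two-step pipeline in a trace so that tests can inspect the retrieval step as well as the final answer. The functions `retrieve`, `generate` and `run_pipeline`, and the sample corpus, are hypothetical stand-ins and not taken from any participant's application.

```python
def retrieve(query, corpus):
    """Toy retriever: return documents sharing at least one word with the query."""
    terms = set(query.lower().split())
    return [doc for doc in corpus if terms & set(doc.lower().split())]

def generate(query, docs):
    """Stand-in for a model call: answers only from the retrieved text."""
    return docs[0] if docs else "I don't know."

def run_pipeline(query, corpus):
    """Run both stages and keep every intermediate result in a trace."""
    trace = {"query": query}
    trace["retrieved"] = retrieve(query, corpus)
    trace["answer"] = generate(query, trace["retrieved"])
    return trace

# Hypothetical corpus and query, for illustration only.
corpus = ["Refunds are processed within 14 days.", "Support is open on weekdays."]
trace = run_pipeline("How long do refunds take?", corpus)

# Interim check: did retrieval surface the right document?
retrieval_ok = trace["retrieved"] == [corpus[0]]
# Output check: is the final answer grounded in what was retrieved?
answer_ok = trace["answer"] in trace["retrieved"]
```

If the output check fails while the interim check passes, the fault most likely lies in generation rather than retrieval, which is the kind of debugging signal that output-only testing cannot give.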

004

Use LLMs as judges, but with skill and caution

Human-only evals don’t scale. LLMs-as-judges are often necessary, but need careful design and human calibration. Cheaper, faster alternatives exist in some situations.
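
One common way to approach human calibration is to have people label a small sample, run the LLM judge on the same items, and measure chance-corrected agreement. The sketch below computes Cohen's kappa for that purpose. The label lists are made-up illustration data, and the choice of kappa as the agreement measure is an assumption rather than something the source prescribes.

```python
from collections import Counter

def cohens_kappa(human, judge):
    """Cohen's kappa between two equal-length sequences of labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    # Fraction of items where the two raters agree.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    h_counts, j_counts = Counter(human), Counter(judge)
    expected = sum(h_counts[c] * j_counts[c] for c in h_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical calibration sample: human verdicts vs. LLM-judge verdicts.
human_labels = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "fail"]
judge_labels = ["pass", "pass", "fail", "fail", "fail", "fail", "pass", "pass"]

kappa = cohens_kappa(human_labels, judge_labels)
print(f"kappa = {kappa:.2f}")  # prints "kappa = 0.50"
```

In this sample the judge matches the humans on 6 of 8 items, but half of that agreement is expected by chance, so kappa is 0.5. A low value is a signal to revise the judge prompt or rubric before relying on it at scale.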

What Do Participants Say?