Executive Summary
From Model Safety to Application Reliability
As Generative AI (“GenAI”) transitions from personal productivity tools and consumer-facing chatbots into real-world environments like hospitals, airports and banks, it faces a higher bar on quality and confidence:
- Risk assessments depend heavily on the context of the use case – e.g., there is a lower tolerance for error in a clinical application than in a customer service chatbot.
- Given the higher complexity involved in integrating foundation models with existing data sources, processes and systems, there are more potential points of failure.
However, much of the current work around AI testing focuses on the safety of foundation models, rather than the reliability of end-to-end applications. The Global AI Assurance Pilot was an attempt to address this gap: not through academic research, but by building upon real-life experiences of practitioners around the world.
Learning by doing
The pilot matched 17 deployers of GenAI applications with 16 specialist AI testing firms. These organisations were based in Singapore and 8 other jurisdictions, providing a significant international lens. The primary objective was to surface and codify emerging norms in technical testing of GenAI applications.
The 17 applications were aimed at a mix of internal (12) and external (5) users. There was a human in the loop in most (12) cases. 10 industries were represented, including banking, healthcare and technology. Large Language Models (LLMs) were utilised in a variety of ways in these applications: summarisation, retrieval-augmented generation, data extraction, chatbots, classification, translation, agentic workflows and code generation.
The “what” and “how” of testing GenAI applications
- Deciding what to test (or not!) was a non-trivial exercise. The 3 risks that interested most deployers were accuracy and robustness, use-case-specific regulation and compliance requirements, and content safety.
- Off-the-shelf LLM benchmark test datasets were rarely used to conduct the tests, except to test content safety in external-facing applications. Use-case-specific test datasets were used most often, though many participants supplemented these with adversarial red-teaming or simulation testing for edge-case scenarios.
- The 2 most popular ways to evaluate test results were human review and LLMs-as-judges. Many participants highlighted that while the latter are versatile, scalable and accessible, they carry risks and require mitigating controls.
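To make the LLM-as-judge pattern concrete, the sketch below shows one way such a judge could be set up, together with a simple human-calibration hook as a mitigating control. It is illustrative only: the use of the OpenAI Python SDK, the "gpt-4o-mini" judge model and the 1–5 faithfulness rubric are assumptions made for the example, not tools or settings prescribed by the pilot.

```python
"""Minimal sketch of an LLM-as-judge evaluation with a human-calibration hook.

Assumptions (not from the report): the OpenAI Python SDK as the judge backend,
a "gpt-4o-mini" judge model, and a simple 1-5 faithfulness rubric. Swap in
whichever provider, model and rubric fit your application and risk profile.
"""
import json
import random

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_RUBRIC = """You are grading the output of a GenAI application.
Score the RESPONSE against the SOURCE on faithfulness, from 1 (contradicts
or invents facts) to 5 (fully supported by the source). Reply with JSON:
{"score": <integer 1-5>, "rationale": "<one sentence>"}"""


def judge(source: str, response: str, model: str = "gpt-4o-mini") -> dict:
    """Ask the judge model for a structured verdict on one test case."""
    completion = client.chat.completions.create(
        model=model,
        temperature=0,  # keep grading as deterministic as the API allows
        response_format={"type": "json_object"},  # reduce JSON parsing failures
        messages=[
            {"role": "system", "content": JUDGE_RUBRIC},
            {"role": "user", "content": f"SOURCE:\n{source}\n\nRESPONSE:\n{response}"},
        ],
    )
    return json.loads(completion.choices[0].message.content)


def sample_for_human_review(results: list[dict], rate: float = 0.1) -> list[dict]:
    """Hold out a random slice of judged cases for expert reviewers, so that
    agreement between the LLM judge and human judgement can be measured."""
    k = max(1, int(len(results) * rate))
    return random.sample(results, k)
```

Sampling a slice of judged cases for expert review is one of the simpler mitigating controls: it gives a running estimate of how closely the automated judge tracks human judgement, and flags when the rubric or judge model needs recalibration.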
Getting GenAI testing right: 4 practical recommendations
- Test what matters: Your context will determine which risks you should (or should not!) care about. Spend time upfront to design effective tests for those risks.
- Don’t expect test data to be fit for purpose: No one has the “right” test dataset to hand. Human and AI effort is needed to generate realistic, adversarial and edge-case test data.
- Look under the hood: Testing just the outputs may not be enough. Interim touchpoints in the application pipeline can help with debugging and redundancy. With agentic AI applications, this becomes a necessity (a sketch of such interim checks follows this list).
- Use LLMs as judges, but with skill and caution: Human-only evaluations will not scale. LLMs-as-judges are necessary but require careful design and human calibration. Cheaper, faster and simpler alternatives exist in some situations.
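As a companion to the “look under the hood” recommendation, the sketch below scores an intermediate retrieval step of a hypothetical RAG pipeline separately from the final answer, so that a failure can be localised before it surfaces in the output. The trace structure, the recall metric and the token-overlap grounding signal are illustrative assumptions, not methods mandated by the pilot.

```python
"""Minimal sketch of "looking under the hood": checking the retrieval step of a
hypothetical RAG pipeline separately from the generated answer. The RagTrace
structure and both metrics are illustrative assumptions for this example.
"""
from dataclasses import dataclass


@dataclass
class RagTrace:
    """One recorded run of the pipeline: the question, the document IDs the
    retriever returned, and the answer the LLM generated."""
    question: str
    retrieved_ids: list[str]
    answer: str


def retrieval_recall(trace: RagTrace, expected_ids: set[str]) -> float:
    """Fraction of documents marked relevant by a subject-matter expert that the
    retriever actually surfaced. A low score localises the failure to retrieval,
    even when the final answer reads fluently."""
    if not expected_ids:
        return 1.0
    return len(expected_ids & set(trace.retrieved_ids)) / len(expected_ids)


def answer_overlaps_context(trace: RagTrace, snippets: dict[str, str]) -> float:
    """Crude grounding signal: share of answer tokens that also appear in the
    retrieved snippets. In practice an LLM judge or human reviewer refines this."""
    context_tokens: set[str] = set()
    for doc_id in trace.retrieved_ids:
        context_tokens.update(snippets.get(doc_id, "").lower().split())
    answer_tokens = trace.answer.lower().split()
    if not answer_tokens:
        return 0.0
    return sum(t in context_tokens for t in answer_tokens) / len(answer_tokens)


if __name__ == "__main__":
    trace = RagTrace(
        question="What is the refund window?",
        retrieved_ids=["policy-12"],
        answer="Refunds are accepted within 30 days of purchase.",
    )
    snippets = {"policy-12": "Refunds are accepted within 30 days of purchase with a receipt."}
    print(retrieval_recall(trace, expected_ids={"policy-12", "policy-14"}))  # 0.5: one relevant doc missed
    print(round(answer_overlaps_context(trace, snippets), 2))
```

Recording and scoring each stage in this way also lays the groundwork for agentic applications, where a single output can hide several tool calls and intermediate decisions that each need their own checks.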
The pilot also overwhelmingly reinforced the critical role of human experts at every stage of the GenAI testing lifecycle.
What comes next?
Pilot participants suggested 4 areas for future collaboration:
- Building awareness and sharing emerging best practices around GenAI testing
- Moving towards industry standards around “what to test” and “how to test”
- Creating an accreditation framework that promotes consistency and builds greater confidence in the technical testing/assurance market
- Supporting greater automation for technical testing
The launch of the IMDA GenAI Testing Toolkit for consultation is an initial step towards addressing some of these requests.
The journey towards making GenAI applications reliable in real-world settings has just started. IMDA and AIVF look forward to continued collaboration with AI builders, deployers and testers, as well as policymakers, on this important initiative.