1. Introduction
The AI Verify Foundation (AIVF) is a non-profit subsidiary of Singapore’s Infocomm Media Development Authority (IMDA). Its mission is to support the creation of a trusted AI ecosystem through access to reliable AI testing capabilities.
Together with its parent IMDA, the AIVF launched the Global AI Assurance Pilot in February 2025 to help codify emerging norms and best practices around technical testing of Generative AI (“GenAI”) applications. Existing, real-world GenAI applications were put to the test, pairing the organisations that had deployed them with specialist AI testing firms.
1.1 Rationale
The pilot was motivated by three core beliefs:
- GenAI can have a massive, positive impact on our society and economy – if it is adopted at scale in public and private sector organisations.
- Such “real-world” adoption requires GenAI applications to operate at a much higher level of quality and reliability than the general-purpose models that underpin them.
- The extensive work underway on AI model safety and capability is necessary, but not sufficient, to meet that higher bar.
Large Language Models (LLMs) and their multi-modal equivalents are being adopted extensively as personal productivity tools. However, to have real transformational impact, GenAI must be embedded in the public and private sector organisations that drive critical parts of the economy, such as health, finance, utilities and government services.
Using GenAI in such real-world situations, at scale, raises the quality and reliability bar significantly. Two factors account for this difference: Context and Complexity.
- Unlike a general-purpose LLM chatbot or personal productivity tool, a GenAI-enabled application must operate within the specific context of a use case, organisation, industry and/or socio-cultural expectations. For example, a GenAI application in a healthcare setting may have a far lower tolerance for “hallucination” than one used as an internal employee helpdesk.
- Real-life GenAI applications are also likely to have more layers of complexity. They may use LLMs in conjunction with existing data sources, processes and systems, creating additional potential points of failure beyond the LLM.
Most academic and technology industry efforts around AI testing have tended to focus on model safety and alignment. A shift is required – from the safety of foundation models to the reliability of the end-to-end systems or applications in which they are embedded.
The pilot was an attempt to start enabling that shift – not through new academic research or technical development, but through real-world experience.
1.2 Objectives
The pilot was launched with three target outcomes:
Testing norms for GenAI applications
- Inputs into future standards for technical testing of GenAI applications.
Foundations for a viable assurance market
- Greater awareness of the ways in which external assurance can build trust in GenAI applications and enable adoption at scale.
- A foundation for potential accreditation programmes in the future.
AI testing tool roadmaps
- Inputs into the product roadmaps for open-source and proprietary AI testing software.
- Specific focus areas for AIVF’s Moonshot platform.
1.3 Ground rules
The pilot had the following ground rules:
- The application must involve the use of at least one LLM or multi-modal model
- The application must be live or intended to go live (not a proof of concept)
- The exercise must focus on technical testing (not process compliance)
- Testing should be conducted on the GenAI application (not just the underlying foundation model)
- Testing must be conducted by an external party – i.e., an organisation different from the one that has built and/or deployed the application
IMDA and AIVF sought no access to the actual results of the technical tests. The focus was on understanding the deployer’s risk assessment, the design and implementation of technical tests against those risks, and the lessons learnt from the exercise.