Testing Real World GenAI Systems

Main Report

What’s next?

Pilot participants provided their views on potential areas for future work. Four themes emerged:

  • More training and awareness of the risks
    • On the risks of GenAI systems 
    • On testing and how that needs to become an integral part of the development process
  • Opportunities to share experiences among testing practitioners and the organisations deploying GenAI apps
    • Macro-level (e.g., how to sensitise senior leaders on risk)
    • Specific (e.g., the best metrics to test translation quality)
  • The need for multi-stakeholder engagement around testing – not just with developers but also business leaders, product owners, Subject Matter Experts and risk/compliance teams

“Even non-technical stakeholders have to be part of the AI assurance ecosystem. That is where the opportunity is as well.”

Participants also saw a need for standards around testing practice:

  • Across the test lifecycle: risk assessment, test selection, test execution, test configuration, and result interpretation
  • Should result in interoperable/portable tests and consistency in results (same system, two testers = same outcome)
  • Ideally, also linked to policy/regulation positions where it makes sense (e.g., on the use of automated red-teaming or LLM-as-a-judge)

Some participants suggested the need for standards at a more granular level – e.g., 

  • Individual test metrics (e.g., accuracy of summarisation or translation)
  • Real-world evaluation benchmarks for specific use cases 
  • Machine readable outputs from GenAI systems to support testing automation
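To illustrate the last point, a machine-readable test result might look like the sketch below. The field names and values are purely hypothetical – no standard schema exists yet, which is precisely the gap participants identified:

```python
import json

# Hypothetical record for a single test run (illustrative field names only,
# not an IMDA or AIVF format).
result = {
    "system_under_test": "example-summarisation-app",
    "test_id": "hallucination-check-001",
    "metric": "faithfulness",
    "score": 0.92,
    "threshold": 0.85,
    "passed": True,
}

# Serialising to JSON lets different tools and vendors exchange and
# compare results automatically.
print(json.dumps(result, indent=2))
```

A shared schema along these lines would let results flow directly into dashboards, audit trails, and cross-vendor comparisons without manual re-entry.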

“We need standards around the mechanisms to assess accuracy or safety, so that results from different tools and vendors are comparable.”

  • Accreditation scheme for AI testing/assurance providers (services and software)
  • As a way of ensuring consistency, common assessment standards, and greater confidence among deployers and end-users

“Formal accreditation of vendors and their test approaches could also help in assuring consistency and ensuring a common standard of assessment.”

  • Scalable test environments with stable APIs and broad platform support
  • Democratised access to testing technologies – not just limited to frontier labs, big technology firms or the largest enterprises

“There’s too much headache over the cost and complexity of mobilising testing and assurance technology, particularly for actors who cannot rely on deep LLM expertise or large security budgets.”

IMDA and AIVF will take these inputs into consideration as they shape their roadmap. A few immediate actions are underway.
  • Sharing the outcomes from the Assurance Pilot widely, engaging with AIVF members (200 organisations) and the broader community.
  • Consultation on the IMDA Starter Kit, containing a set of voluntary guidelines that coalesces rapidly emerging best practices and methodologies for app testing. At this stage, the Starter Kit covers 4 risks: hallucination, undesirable content, data disclosure, and vulnerability to adversarial attacks.
  • Incorporation of both the pilot findings and the Starter Kit into the AIVF open source GenAI testing toolkit roadmap.
  • Continuation of the collaboration platform provided by the pilot in a different form – e.g., an assurance clinic. The first members of the next cohort are already on board.

Download the Full 17 Case Studies Report

Get an inside look into real-world testing approaches, industry-specific challenges, and the creative ways participants tackled domain-specific risks.