What’s next?
Pilot participants shared their views on potential areas for future work. Four themes emerged:

- More training and awareness of the risks
  - On the risks of GenAI systems
  - On testing, and how it needs to become an integral part of the development process
- Opportunities to share experiences among testing practitioners and the organisations deploying GenAI apps
  - Macro-level (e.g., how to sensitise senior leaders to risk)
  - Specific (e.g., the best metrics to test translation quality)
- The need for multi-stakeholder engagement around testing – not just with developers but also business leaders, product owners, subject matter experts and risk/compliance teams
Even non-technical stakeholders have to be part of the AI assurance ecosystem. That is where the opportunity is as well.
Fion Lee-Madan – Fairly AI

- Across the test lifecycle: risk assessment, test selection, test execution, test configuration and result interpretation
- Should result in interoperable/portable tests and consistency in results (same system, two testers = same outcome)
- Ideally, also linked to policy/regulation positions where it makes sense (e.g., on the use of automated red-teaming or LLMs as a judge)
Some participants suggested the need for standards at a more granular level, e.g.:
- Individual test metrics (e.g., accuracy of summarisation or translation)
- Real-world evaluation benchmarks for specific use cases
- Machine-readable outputs from GenAI systems to support testing automation
We need standards around the mechanisms to assess accuracy or safety, so that results from different tools and vendors are comparable
Yifan Jia – AIDX

- Accreditation scheme for AI testing/assurance providers (services and software)
  - As a way of ensuring consistency, common assessment standards and greater confidence among deployers and end-users
Formal accreditation of vendors and their test approaches could also help in assuring consistency and ensuring a common standard of assessment
Miguel Fernandes – Resaro AI

- Scalable test environments with stable APIs and broad platform support
- Democratised access to testing technologies – not just limited to frontier labs, big technology firms or the largest enterprises
There is too much headache over the cost and complexity of mobilising testing and assurance technology, particularly for actors who cannot rely on deep LLM expertise or large security budgets.
Nicolas Miailhe – PRISM Eval
IMDA and AIVF will take these inputs into consideration as they shape their roadmap. A few immediate actions are underway.
- Sharing the outcomes from the Assurance Pilot widely, engaging with AIVF members (200 organisations) and the broader community.
- Consultation on the IMDA Starter Kit, a set of voluntary guidelines that consolidates rapidly emerging best practices and methodologies for app testing. At this stage, the Starter Kit covers four risks: hallucination, undesirable content, data disclosure and vulnerability to adversarial attacks.
- Incorporation of both the pilot findings and the Starter Kit into the roadmap for the AIVF open-source GenAI testing toolkit.
- Continuation of the collaboration platform provided by the pilot in a different form – e.g., an assurance clinic. The first members of the next cohort are already on board.
The journey towards making GenAI applications reliable in real-world settings has only just begun. IMDA and AIVF look forward to continued collaboration with AI builders, deployers and testers, as well as policymakers locally and internationally, on this important initiative.