Testing Real World GenAI Systems

Main Report

What’s next?

Pilot participants provided their views on potential areas for future work. Four themes emerged:

  • More training and awareness of the risks
    • On the risks of GenAI systems 
    • On testing and how that needs to become an integral part of the development process
  • Opportunities to share experiences among testing practitioners and the organisations deploying GenAI apps
    • Macro-level (e.g., how to sensitise senior leaders on risk)
    • Specific (e.g., the best metrics to test translation quality)
  • The need for multi-stakeholder engagement around testing – not just with developers but also business leaders, product owners, Subject Matter Experts and risk/compliance teams

“Even non-technical stakeholders have to be part of the AI assurance ecosystem. That is where the opportunity is as well.”

Participants also saw a need for standards around testing practice:

  • Across the test lifecycle: risk assessment, test selection, test execution, test configuration, and result interpretation
  • Should result in interoperable/portable tests and consistency in results (same system, two testers = same outcome)
  • Ideally, also linked to policy/regulation positions where it makes sense (e.g., on the use of automated red-teaming or LLM-as-a-judge)

Some participants suggested the need for standards at a more granular level – e.g., 

  • Individual test metrics (e.g., accuracy of summarisation or translation)
  • Real-world evaluation benchmarks for specific use cases 
  • Machine readable outputs from GenAI systems to support testing automation
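To illustrate the last point, a machine-readable test result might look like the sketch below. The field names and values are purely hypothetical – no standard schema exists yet, which is precisely the gap participants identified:

```python
import json

# Hypothetical record for a single test run (illustrative field names only,
# not an IMDA or AIVF format).
result = {
    "system_under_test": "example-summarisation-app",
    "test_id": "hallucination-check-001",
    "metric": "faithfulness",
    "score": 0.92,
    "threshold": 0.85,
    "passed": True,
}

# Serialising to JSON lets different tools and vendors exchange and
# compare results automatically.
print(json.dumps(result, indent=2))
```

A shared schema along these lines would let results flow directly into dashboards, audit trails, and cross-vendor comparisons without manual re-entry.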

“We need standards around the mechanisms to assess accuracy or safety, so that results from different tools and vendors are comparable.”

  • Accreditation scheme for AI testing/assurance providers (services and software)
  • As a way of ensuring consistency, common assessment standards, and greater confidence among deployers and end-users

“Formal accreditation of vendors and their test approaches could also help in assuring consistency and ensuring a common standard of assessment.”

  • Scalable test environments with stable APIs and broad platform support
  • Democratised access to testing technologies – not just limited to frontier labs, big technology firms or the largest enterprises

“There’s too much headache over the cost and complexity of mobilising testing and assurance technology, particularly for actors who cannot rely on deep LLM expertise or large security budgets.”

IMDA and AIVF will take these inputs into consideration as they shape their roadmap. A few immediate actions are underway.
  • Sharing the outcomes from the Assurance Pilot widely, engaging with AIVF members (200 organisations) and the broader community.
  • Consultation on the IMDA Starter Kit, containing a set of voluntary guidelines that coalesces rapidly emerging best practices and methodologies for app testing. At this stage, the Starter Kit covers 4 risks: hallucination, undesirable content, data disclosure, and vulnerability to adversarial attacks.
  • Incorporation of both the pilot findings and the Starter Kit into the AIVF open source GenAI testing toolkit roadmap.
  • Continuation of the collaboration platform provided by the pilot in a different form – e.g., an assurance clinic. The first members of the next cohort are already on board.

Download the Full 17 Case Studies Report

Get an inside look into real-world testing approaches, industry-specific challenges, and the creative ways participants tackled domain-specific risks.