Generative AI Is Software, And Software Needs Testing

Generative AI, machine learning models, deep learning models … this is all software. And just like any other software, it needs to be tested to make sure that it’s doing the right things and doing them in the right way. Technology leaders are concerned about large language model (LLM) nondeterminism and hallucinations: plausible, but incorrect, outputs. Leaders are also concerned about whether genAI outputs comply with their organization’s culture, ethics, policies, and user experience. Simply putting “humans in the loop” to control the outcomes is an impractical and costly solution, especially when the AI is customer-facing.

Testing GenAI Is Not For The Faint-Hearted

Testing generative AI is not a well-established practice; for many teams, it is not an established practice at all. GenAI is so new that there is not yet enough consolidated experience in the market on good and effective testing practices, and genAI-infused solutions haven’t been deployed at scale in production long enough for the real risks to fully materialize. I’ve already highlighted how critical testing is when generative AI is used to assist developers and software development teams, a use case that we at Forrester have named TuringBots: AI- and genAI-enabled development assistants.

Generative AI is more complex to test than any previous type of software we’ve tested. Before generative AI came to market via ChatGPT, we explored how to test AI-infused applications in two Forrester reports. We recently updated those two documents with everything that we have learned so far about experiments and ideas for testing generative AI. Check out “It’s Time To Get Really Serious About Testing Your AI” (Part One and Part Two). We are well aware of the additional complexity that generative AI brings with hallucinations and nondeterminism, which is why we have planned a dedicated research stream in 2024 on testing generative AI. So stay tuned.

Starting Points

Here are some initial thoughts on how to address testing of AI- and generative AI-infused applications.

Leverage test benchmarks, adversarial testing, and test case prompting for LLMs.

Testing the performance of LLMs and genAI is hard because it means testing diverse natural language properties and expressions, both syntactically and semantically. A variety of free and open source benchmarks and evaluation frameworks exist to test different aspects of language such as safety, ethics, and performance. For example, the OpenAI Moderation API is designed to provide a safer environment for users to interact with OpenAI APIs; it adds a moderation layer that filters out harmful or unsafe content, supporting ethical use of LLMs. Benchmarking should be coupled with human testing. A second approach, similar to traditional software testing, is to specify test properties in prompts with which any correct output must comply. A third option is manual adversarial testing where possible, which can evolve into automated adversarial testing through generative adversarial networks. A minimal sketch of the first two approaches follows below.
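To make this concrete, here is a minimal sketch, assuming the openai Python SDK (v1.x) and an OPENAI_API_KEY environment variable; the model name, the word-limit property, and the helper names are illustrative assumptions rather than a prescribed implementation. It pairs the Moderation API as a safety filter with a test property stated in the prompt and then verified in code.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def is_safe(text: str) -> bool:
    """Safety check: ask the OpenAI Moderation API whether the text is flagged."""
    result = client.moderations.create(input=text).results[0]
    return not result.flagged


def answer_with_property(question: str, max_words: int = 50) -> str:
    """State a test property in the prompt that any correct output must satisfy."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system", "content": f"Answer in at most {max_words} words."},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content


# A simple automated check combining both ideas: the output must pass the
# moderation layer and comply with the property stated in the prompt.
answer = answer_with_property("What is nondeterminism in an LLM?")
assert is_safe(answer), "Moderation API flagged the output as unsafe"
assert len(answer.split()) <= 50, "Output violates the prompted length property"
```

Because LLM outputs are nondeterministic, checks like these are best run repeatedly and across many prompts rather than as a single pass/fail assertion.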

This conceptual graphic shows the tests needed for automatic and autonomous AI-infused applications (AIIAs).

Let’s Work Together To Improve Generative AI Outcomes

If you or your team are gaining good experience and practices or learning about new tools that are effective in helping test generative AI, please reach out to me at dlogiudice@forrester.com. If you are a Forrester client who is instead just starting the journey of learning and planning how to test AI-infused apps, or you need some general insights, please schedule an inquiry or guidance session with me by emailing inquiry@forrester.com. We’re on the road to unlocking a groundbreaking new technology, but if we can’t test it, it will be useless.