Generative AI Is Software, And Software Needs Testing

Generative AI, machine learning models, deep learning models … this is all software. And just like any other software, it needs to be tested to make sure that it’s doing the right things and doing them in the right way. Technology leaders are concerned about large language model (LLM) nondeterminism and hallucinations: plausible, but incorrect, outputs. Leaders are also concerned about whether genAI outputs comply with their organization’s culture, ethics, policies, and user experience. Simply putting “humans in the loop” to control the outcomes is an impractical and costly solution, especially when the AI is customer-facing.

Testing GenAI Is Not For The Faint-Hearted

Testing generative AI is not a well-established practice; for many teams, it is not an established practice at all. GenAI is so new that there is not yet enough consolidated experience in the market on good and effective testing practices, and genAI-infused solutions haven’t been deployed at scale in production long enough for the real risks to fully materialize. I’ve already highlighted how critical testing is when generative AI is used to assist developers and software development teams, a use case that we at Forrester have named TuringBots: AI- and genAI-enabled development assistants.

Generative AI is more complex to test than any previous type of software we’ve tested. Before generative AI came to market via ChatGPT, we explored how to test AI-infused applications in two Forrester reports. We recently updated those two documents with everything that we have learned so far about experiments and ideas for testing generative AI. Check out “It’s Time To Get Really Serious About Testing Your AI” (Part One and Part Two). We are well aware of the additional complexity that generative AI brings with hallucinations and nondeterminism, which is why we have planned a dedicated research stream in 2024 on testing generative AI. So stay tuned.

Starting Points

Here are some initial thoughts on how to address testing of AI- and generative AI-infused applications.

Leverage test benchmarks, adversarial testing, and test case prompting for LLMs.

Testing the performance of LLMs and genAI is hard because it means testing diverse natural language properties and expressions, both syntactically and semantically. A variety of free and open source benchmarks and evaluation frameworks exist to test different aspects of language such as safety, ethics, and performance. For example, the OpenAI Moderation API is designed to provide a safer environment for users to interact with OpenAI APIs; it adds a moderation layer that filters out harmful or unsafe content, supporting ethical use of LLMs. Benchmarking should be coupled with human testing. A second approach, similar to traditional software testing, is to specify test properties in prompts with which any correct output must comply. A third option is manual adversarial testing where possible, which can evolve into automated adversarial testing through generative adversarial networks. A minimal sketch of the first two approaches follows below.
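To make this concrete, here is a minimal sketch, assuming the openai Python SDK (v1.x) and an OPENAI_API_KEY environment variable; the model name, the word-limit property, and the helper names are illustrative assumptions rather than a prescribed implementation. It pairs the Moderation API as a safety filter with a test property stated in the prompt and then verified in code.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def is_safe(text: str) -> bool:
    """Safety check: ask the OpenAI Moderation API whether the text is flagged."""
    result = client.moderations.create(input=text).results[0]
    return not result.flagged


def answer_with_property(question: str, max_words: int = 50) -> str:
    """State a test property in the prompt that any correct output must satisfy."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system", "content": f"Answer in at most {max_words} words."},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content


# A simple automated check combining both ideas: the output must pass the
# moderation layer and comply with the property stated in the prompt.
answer = answer_with_property("What is nondeterminism in an LLM?")
assert is_safe(answer), "Moderation API flagged the output as unsafe"
assert len(answer.split()) <= 50, "Output violates the prompted length property"
```

Because LLM outputs are nondeterministic, checks like these are best run repeatedly and across many prompts rather than as a single pass/fail assertion.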

This conceptual graphic shows the tests needed for automatic and autonomous AI-infused applications (AIIAs).

Let’s Work Together To Improve Generative AI Outcomes

If you or your team are gaining good experience and practices or learning about new tools that are effective in helping test generative AI, please reach out to me at dlogiudice@forrester.com. If you are a Forrester client who is instead just starting the journey of learning and planning how to test AI-infused apps, or you need some general insights, please schedule an inquiry or guidance session with me by emailing inquiry@forrester.com. We’re on the road to unlocking a groundbreaking new technology, but if we can’t test it, it will be useless.