Software takes part in almost everything we do, and yes, we do trust that it works! OK, sometimes it fails and drives us nuts, but in most cases, it does what we expect it to do. How have we learned to trust that software works? Through positive experiences with software that meets our expectations. Well-tested software protects the customer experience, because customers never see the bugs: they get identified and fixed earlier in the development process. Thanks to test automation in its continuous delivery process, Intesa Sanpaolo caught 700 bugs over the last 12 months as it released new features of its flagship mobile application into production. Test automation is thus essential to any modern application delivery process.

Similarly, we are also learning to trust AI, which is increasingly being infused into the software and apps we use every day (although not yet at the scale of traditional software). But there are also plenty of examples of bad experiences with AI that compromise ethics, accountability, and inclusiveness and thus erode trust. For example, as we describe in “No Testing Means No Trust In AI: Part 1,” Compas is a machine-learning- (ML) and AI-infused application (AIIA) that judges in 12 US states use to help decide whether to allow defendants out on bail before trial and to inform the length of prison sentences. Researchers found that Compas often predicted Black defendants to be at a higher risk of recidivism than they actually were, while it often underestimated the risk that white defendants would reoffend.
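To make that failure mode concrete, here is a minimal, purely illustrative sketch of the kind of automated bias check that testing an AIIA can include: it compares false positive rates across demographic groups and flags the model when the gap is too wide. The data, group labels, metric, and threshold below are assumptions made up for the example; they say nothing about how Compas itself was built or evaluated.

```python
# Illustrative bias check: compare false positive rates across groups.
# All data and thresholds below are made up for the example.
import numpy as np

def false_positive_rate(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Share of actual negatives (people who did not reoffend) flagged as high risk."""
    negatives = y_true == 0
    if not negatives.any():
        return 0.0
    return float(np.mean(y_pred[negatives] == 1))

def check_fpr_parity(y_true, y_pred, group, max_gap=0.05):
    """Return True only if the false-positive-rate gap across groups stays within max_gap."""
    rates = {
        str(g): false_positive_rate(y_true[group == g], y_pred[group == g])
        for g in np.unique(group)
    }
    gap = max(rates.values()) - min(rates.values())
    print(f"False positive rate by group: {rates}, gap = {gap:.2f}")
    return gap <= max_gap

# Hypothetical ground truth (1 = reoffended), model output (1 = scored high risk), and group tags.
y_true = np.array([0, 0, 1, 0, 1, 0, 0, 1])
y_pred = np.array([1, 0, 1, 1, 1, 0, 0, 0])
group = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])

if not check_fpr_parity(y_true, y_pred, group):
    print("Bias check failed: the false-positive-rate gap across groups exceeds the threshold.")
```

Run automatically every time the model is retrained, a check like this turns fairness from a one-off audit into a repeatable test.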

Losing trust in AIIAs is a major risk for AI-based innovation. Beyond just working, we also expect intelligent behavior from AI-infused applications: AIIAs must speak, listen, see, and understand. Forrester Analytics data shows that 73% of enterprises claim to be adopting AI to build new solutions in 2021, up from 68% in 2020, which makes testing those AI-infused applications even more critical.

The amount of rigor in testing, and the effort it requires, should be commensurate with the level of business risk the AIIA carries
Stay away from the red zones!

The consequences of not testing AIIAs enough loom even larger for applications in areas that impact life and safety (think self-driving cars and automated factories), cybersecurity, or strategic decision support. And as AI improves and we gradually reduce human intervention, testing AIIAs becomes even more critical.

From another angle, the World Quality Report (WQR) from Capgemini shows that 88% of IT leaders are thinking either of using AI in testing or of testing their AI. And I expect explosive growth in AIIA testing. Here are some of the most common questions I answer for Forrester’s clients:

1) How do I balance the risks of not testing AIIAs enough and of testing them too much?

2) How do I know if I’m testing AI sufficiently?

3) Testing AIIAs takes a village. What roles should be involved?

4) Which of the existing testing practices and skills can I use, and which new ones do I need?

5) Is testing the same for all types of AI involved?

6) What about automating the testing of AIIAs?

7) How does that automation integrate into MLOps (DevOps for AI)? (The sketch below gives a flavor of what that can look like.)
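To give a flavor of questions 6 and 7, here is a minimal, hypothetical sketch of an automated quality gate that a CI/MLOps pipeline could run before promoting a candidate model. The dataset, model, metric, and threshold are stand-ins chosen purely for illustration, not a prescription for any particular toolchain.

```python
# Minimal sketch of an automated model quality gate for a CI/MLOps pipeline.
# Everything here (dataset, model, 0.90 threshold) is an illustrative assumption.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

MIN_ACCURACY = 0.90  # business-defined release threshold (assumed for this example)

def train_candidate():
    """Train a stand-in candidate model and return it with a holdout set."""
    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42
    )
    model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
    return model, X_test, y_test

def test_candidate_meets_accuracy_gate():
    """Pipeline step: block deployment if holdout accuracy falls below the gate."""
    model, X_test, y_test = train_candidate()
    acc = accuracy_score(y_test, model.predict(X_test))
    assert acc >= MIN_ACCURACY, f"Candidate accuracy {acc:.3f} is below the {MIN_ACCURACY} gate"

if __name__ == "__main__":
    test_candidate_meets_accuracy_gate()
    print("Quality gate passed; the candidate model can be promoted.")
```

The same pattern extends to fairness, robustness, and drift checks, and the gate runs on every retraining of the model, just as unit tests run on every code commit.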

If you are among the 73% adopting AI to make your enterprise smarter, faster, and more creative, please read my report, “It’s Time To Get Really Serious About Testing Your AI,” for the answers to those common questions and more.

And I have a question for you, too: How are you testing your AI-infused applications? Please reach out and let me know at dlogiudice@forrester.com.

If you want to hear more about this topic, be sure to catch my presentation at the upcoming Data Strategy & Insights event Nov. 18-19.