Recently, there’s been some very public (and frankly very funny) AI Agent and bot failures.

Like Chipotle’s assistant supporting codegen (since patched): Stop spending money on Claude Code. Chipotle’s support bot is free: : r/ClaudeCode

And in a surreal fashion, Washington State’s callcenter hotline providing Spanish support by speaking English with a Spanish accent: Washington state hotline callers hear AI voice with Spanish accent | AP News

Coinciding with this, other Forrester analysts and I have had a spate of calls where organizations have launched a new AI Agent without testing them.

Put simply, please do not do this.

Please test your AI Agents before launching them – some options on how are below.

 

What do we mean by this?

At minimum – Test all your bots features (and use-cases) yourself.

For any AI Agent, or new feature you’re introducing to it, the minimum effort you should invest is make sure someone has used it as an end-user before this goes live.

This can be as simple as someone on the developer team, or as involved as a dedicated testing group. But you need to make sure someone has actively used your solution – and all its features. This should also be done on an ongoing basis, so when new features are launched, they’re tested too.

This can be time intensive, but as we see with the public cases, not everything works as expected all the time.

In fact, AI can go wrong in more unexpected ways than before. If you can’t ensure features are working as intended, then you might end up on the news.

Please note, this is the minimum possible effort. This is not enough to ensure something won’t go wrong or your application wont fail – this will only catch the most obvious/embarrassing outcomes. A more robust testing practice is recommended.

For more on how Agentic systems fail: Why AI Agents Fail (And How To Fix Them) | Forrester

 

Recommended – Practice red teaming.

A good way to prevent this kind of unexpected permutation is with Red Teaming. Or, intentionally trying to break the bot. We recommend this as a standard practice for your organization.

There are two sides to this – one, traditional, or infosec red teaming. This is focused at finding security exploits. And two, behavioral. This is focused on getting the solution or model to behave in an inappropriate or unintended fashion. It is best to have a practice on both.

At the very least, your team should kick the tires for a day and try as many exploits as possible. Even when you have a governance layer, you must ensure it’s holding up in the wild. Ideally, even post-launch.

For more on the Red Team practice: Use AI Red Teaming To Evaluate The Security Posture Of AI-Enabled Applications | Forrester

For more on standard governance approaches that should be followed: Introducing Forrester’s AEGIS Framework: Agentic AI Enterprise Guardrails For Information Security | Forrester

For specific common governance failures (Source: AIUC): AIUC-1 | The world’s first AI agent standard

For a fun example of what employee-driven red-teaming can look like (Source: Anthropic): Project Vend: Can Claude run a small shop? (And why does that matter?) \ Anthropic

 

Recommended – Test using a testing suite and practice.

Testing an AI Agent system that has agentic capabilities is still an emerging field, but rapid progress is being made. To supplement your testing programs (people whose job is to test your AI tools, applications and agents), testing suites provide additional integrated support. There are two ways to think of testing suites today – synthetic, and ongoing agentic.

Synthetic tests are simple – they test your AI Agent against a sample of pre-created prompts and ideal answers, to act as a “golden set” to test against. This allows you to perform a regression test over time, to validate “does our AI Agent provide the correct responses.”

However, synthetic regression tests are often only performed for an AI Agent after some noteworthy change, like switching out the model used or introducing a number of new use cases. Increasingly, larger testing suites are looking to test automatically, and continuously. Other techniques, like LLM-as-a-judge can provide supplementary runtime supervision.

(Work is coming from Forrester on synthetic testing)

Please note, if you do not have a formal testing program for AI systems, please either hire people for this, or hire a testing services company.

For more on building tests (Source: Anthropic): Demystifying evals for AI agents \ Anthropic

For more on autonomous testing: The Forrester Wave™: Autonomous Testing Platforms, Q4 2025 | Forrester

For how you can make continuous testing work: It’s Time To Get Really Serious About Testing Your AI: Part Two | Forrester

 

Recommended – Test with a representative sample.

The ultimate test of your agents however, will come from your users. They alone determine if you pass or fail. It is in your best interests to make them happy.

The question is – how do we test with real users before production? The answer is a user champion group (or similar convention). Users who have either volunteered themselves, or been selected by you, to test what your Agent is capable of.

This is easier in internal facing use-cases, as employee groups are more straightforward to assemble. However, many customer facing organizations can achieve the same through voluntary test signups.

The risk, is you have users who are over-eager group who don’t make up a representative sample of your user base. In other words, they don’t necessarily represent your average user. This can be avoided through careful group design, or at least, asking users to take on a persona when conducting the test.

If this isn’t possible, at the very least, use a canary test/conditional rollout that can serve as this testbed (though it’s better when its voluntary).

For more on building this user champion group internally: Best Practices For Internal Conversational AI Adoption | Forrester