Building Tests For Conversational AI
The Problem
Recently, we published The Forrester Wave™: Conversational AI Platforms For Employee Services, Q3 2024, ranking how well different conversational AI providers give employees daily support across domains such as IT, HR, finance, and more.
While this report goes into significant depth on current offerings and strategic positioning, I wanted to give some more detail on a particularly useful test I’ve developed to evaluate the question, “How does this conversational AI system stack up against real-life use?”
All conversational AI systems must be configured (or preconfigured) to handle different use cases. Even large language model (LLM)- or generative AI (genAI)-enabled systems need to be prepared to fulfill asks on certain topics. Depending on how you implement one of these systems in your org, it can be difficult to test without a huge number of end users kicking its tires.
So I came up with a simple test you can run to evaluate the capability of your systems to accommodate this “real user language variance.”
What Should You Evaluate?
I designed this test to give you directional guidance on the following:
- Basic intent recognition: Does the system recognize a variety of different questions that it should be able to support?
- Adaptability of intent recognition: Does the system recognize rephrases of the intent?
- Ability to delineate between similar utterances: Does the system get confused on similar, but different, intents?
- Conversational flexibility: Does the system support variances of human language (e.g., misspellings, verbosity, minimal user inputs, misleading user inputs, vague asks, etc.)?
- Failure handling: How does the system handle unknown or unsupported utterances?
- Answer specificity: How does the system answer user questions? Does it recognize multiple intents and answer all of their questions? Does it tailor responses effectively?
- Action taking: Does the system terminate at an answer, or does it automate the user ask? If not, how can your system be extended into full automation of the user ask?
- Answer and action accuracy: Does the system create responses that are accurate to user asks? Does it use the right sources?
- Performance/speed to answer: Does the system respond in an acceptable amount of time to the user?
- Permissions and governance: Does the system recognize what users should and should not know? Does the system effectively enforce its governance rules?
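If you want to keep score as you go, one option is a simple scorecard that records a pass/fail judgment for each criterion on each utterance. The following is a minimal sketch in Python, not a prescribed tool: the criterion keys are my shorthand for the list above, and the `Observation` fields and utterance labels such as "known-1" are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass

# Shorthand keys for the ten criteria listed above (names are illustrative).
CRITERIA = [
    "intent_recognition",
    "rephrase_adaptability",
    "similar_intent_delineation",
    "conversational_flexibility",
    "failure_handling",
    "answer_specificity",
    "action_taking",
    "accuracy",
    "speed",
    "permissions_governance",
]


@dataclass
class Observation:
    utterance_id: str  # hypothetical label, e.g. "known-1" or "unknown-3"
    criterion: str     # one of CRITERIA
    passed: bool       # the tester's judgment call
    note: str = ""


def pass_rates(observations):
    """Return {criterion: fraction passed} for every criterion that was observed."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for obs in observations:
        if obs.criterion not in CRITERIA:
            raise ValueError(f"unknown criterion: {obs.criterion}")
        totals[obs.criterion] += 1
        passes[obs.criterion] += int(obs.passed)
    return {c: passes[c] / totals[c] for c in CRITERIA if totals[c]}


def weakest(observations, n=3):
    """Return up to n observed criteria with the lowest pass rate, worst first."""
    rates = pass_rates(observations)
    return sorted(rates, key=lambda c: rates[c])[:n]
```

The pass rates are only directional, in keeping with the intent of the test: with 10 utterances, a single failure moves a criterion's rate a lot, so treat the output as a pointer to where to look, not a benchmark.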
What Is The Test?
- Create 10 different utterances (user asks) to test against the system, each designed in a specific way to test for different kinds of adaptability.
- This test will require two parties:
  - Admin: the party/person responsible for platform training
  - Tester: the party/person responsible for testing
- Intents will be split among five “known” intents and five “unknown.” This is to simulate real end-user interactions. The admin should not know the utterances developed by the tester, and the tester should not give guidance on what to build into the system outside of the “known” intents.
- At the time of testing, the tester will watch the admin apply the five known intents and, afterward, will provide the admin with the five unknown intents in real time.
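If your platform can be driven programmatically, the round described above could be logged with a few lines of Python. This is a rough sketch under assumptions: `ask` is a hypothetical callable that takes the user's text and returns the system's reply, standing in for whatever interface your platform actually exposes. It is not a real vendor API.

```python
import time


def run_round(ask, known, unknown):
    """Send the known utterances first, then the unknown ones, logging reply and latency.

    `ask` is a hypothetical callable (str -> str) that wraps your platform.
    """
    results = []
    for phase, batch in (("known", known), ("unknown", unknown)):
        for text in batch:
            start = time.perf_counter()
            reply = ask(text)
            elapsed = time.perf_counter() - start
            results.append(
                {"phase": phase, "utterance": text, "reply": reply, "seconds": elapsed}
            )
    return results
```

The log only captures what was said and how long it took. Whether a reply actually recognized the intent, asked a sensible clarifying question, or respected permissions still needs a human read.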
Utterance development recommendations:
- Four of the known and unknown utterances should mirror one another. For example, if known utterance 1 asks about account issues, unknown utterance 1 should ask about an account lockout, possibly related to an account issue.
- One unknown utterance should not be parallel and instead ask for something that might not yet be covered in the platform.
- No utterances should be the same.
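These recommendations are easy to check mechanically before test day. Here is a small Python sketch that encodes them as written (five known, five unknown, four mirrored pairs, no duplicates). The `Utterance` fields, including the `parallels` link, are names I made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Utterance:
    id: str                          # hypothetical label, e.g. "K1" or "U1"
    text: str
    known: bool                      # True if shared with the admin in advance
    parallels: Optional[str] = None  # id of the known utterance this one mirrors


def _normalize(text):
    # Treat case and spacing differences as the same utterance.
    return " ".join(text.lower().split())


def validate_suite(utterances):
    """Return a list of rule violations; an empty list means the suite conforms."""
    problems = []
    known = [u for u in utterances if u.known]
    unknown = [u for u in utterances if not u.known]
    if len(known) != 5 or len(unknown) != 5:
        problems.append(
            f"expected 5 known and 5 unknown utterances, got {len(known)} and {len(unknown)}"
        )
    known_ids = {u.id for u in known}
    mirrored = [u for u in unknown if u.parallels is not None]
    if len(mirrored) != 4:
        problems.append(
            f"expected 4 unknown utterances that mirror a known one, got {len(mirrored)}"
        )
    for u in mirrored:
        if u.parallels not in known_ids:
            problems.append(f"{u.id} mirrors {u.parallels!r}, which is not a known utterance")
    seen = set()
    for u in utterances:
        key = _normalize(u.text)
        if key in seen:
            problems.append(f"duplicate utterance text: {u.text!r}")
        seen.add(key)
    return problems
```

The check covers only the shape of the suite. Whether a mirrored pair is a good parallel (close enough to risk an intent clash, different enough to matter) remains the tester's call.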
Example Utterances
Known/given utterance examples (misspellings are intentional and encouraged):
- “Account issues.”
- “what is our amternity/paternity policies? Do I need to use vacation days for it?”
- “what is my pay rate? Is this right?”
- “Windoes giving me trouble. I’m by the elevator on the second floor, can someone come check it out?”
- “I just started, opened my computer this morning and now I’m getting a lot of weird utterances, is there anything that I missed from HR, or I need to do to reduce how many times I have to login?”
Unknown/hidden utterance examples:
- “hey I’m having issues logging in – what do I need to do to fix it. Normally I’d be able to figure this out myself, but I’ve tried a few different times and done different things here. I also restarted twice, and I think I’m up to date on my OS and drivers. I’m able to log into my computer itself, but its when I get to the other screens, where we have our apps, that I can’t get in. I’m also kind of worried that I may accidentally trigger a lockout or something – I only tried twice and dnot think my account is locked quite yet but who knows.”
  - Objective: I want the system to ask clarifying questions, look up common resolutions, or do additional background work on vague asks. It also should be able to deal with verbosity.
  - Parallels #1, but with increased verbosity and additional asks for troubleshooting.
- “can I taake leave next Thursday? I’ll need the full day because it’s my sisters baby shower and we’ll need some time to get ready! Whats my leave policy?”
  - Objective: I want to ensure that similar intents and policies aren’t accidentally triggered with similar words for different needs.
  - Parallels #2, but asks about general leave instead of parental leave policy. Contains trigger words for the maternity policy (e.g., “baby shower”) in an attempt to cause an intent clash.
- “what is my pay rate? What is the average for my position here?”
  - Objective: I want to ensure that permissions controls are properly enforced.
  - Parallels #3, asking about pay rate but also testing for additional information that should have some governance rules or formal policy associated.
- “Window is open, but nothing is happening.”
  - Objective: I want the system to ask clarifying questions and not jump to either “Windows OS troubleshooting” or “physical window issues.”
  - Parallels #4, but with less clarity.
- “Hey how are you? I hope you’re doing well! I’m new, and was hoping for some onboarding guidance.”
  - Objective: I want the system to adapt to new inputs.
  - Parallels #5 in featuring a new user but diverges by asking about onboarding guidance/policy instead of a technical issue.
Curveball (additional utterance intended to break system):
- “My taco isn’t blue.”
  - Objective: I want the system to ask clarifying questions.
  - Actual issue: User’s OneDrive is no longer synced due to account logout; user needs to log back in.
At the end of the first round of the test, you should have a better idea of where your system breaks down, allowing you to target further improvements.
This should not be the only test you run, or the only time you run it, but it should give you a foundational understanding of your system’s strengths and weaknesses and a baseline to work from.
