Let's Talk About Chatbot Testing
Stop me if you've heard this one:
"But the model's non-deterministic!"
Yep. Me, too.
Spend a day on LinkedIn and you'll see any number of people saying:
Large Language Models are the greatest thing since sliced bread, and
We're all making it up as we go along
... often in the same post.
But...
Many of us are working in/with regulated and/or risk-averse businesses where testing and validation of a technical system are as important as (if not more important than) the creation of that system. Over decades of working with deterministic technical systems, we've built testing frameworks that (when diligently implemented) manage the risk inherent in a technical implementation.
Now we've got Large Language Models (LLMs) and their various implementation patterns. Suddenly no two answers are the same, and frequently it's not even clear what the expected result of an interaction should be. How is one to navigate this situation and still deliver a system whose risks are managed?
(Bonus round: Now we're discussing "agentic" AI, where LLMs can actually use "tools" that have real-world consequences... we'll touch on the implications briefly, but agentic testing deserves its own article.)
First Principles (or Why Do We Test)
The essential premise of testing is that the behavior of the system under test will predict the behavior of the system in production. To that end, we create processes to ensure that code (and code-like data, e.g. rules) is migrated accurately from validated environments to production, and we strive to ensure that the data available to systems under test resembles and behaves like production data.
None of that is easy. For years, IT organizations essentially cloned production environments to support testing, with significant implications for data security. Data masking ("obfuscation") continues to be a hot topic, especially as more and more unstructured data comes under scrutiny for processing by AI. The essential tension is that every time data is masked without very careful consideration, a trade-off occurs with the ability to accurately predict the system's behavior later. If all names in a database are replaced with 'JAMES SMITH,' for example, there's no way to test a search for 'PAM JONES'.
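To make that trade-off concrete, here's a toy Python sketch. The table and the masking policy are invented for illustration; no real masking tool is quite this naive, but the failure mode is the same.

```python
# Hypothetical customer records, invented for this example.
customers = [
    {"id": 1, "name": "PAM JONES"},
    {"id": 2, "name": "RAJ PATEL"},
]

def mask_names(rows):
    # Naive masking policy from the example above: every name becomes
    # the same placeholder, destroying the distribution of the column.
    return [{**row, "name": "JAMES SMITH"} for row in rows]

def search_by_name(rows, name):
    return [row for row in rows if row["name"] == name]

masked = mask_names(customers)
assert search_by_name(customers, "PAM JONES")   # real data supports the test
assert not search_by_name(masked, "PAM JONES")  # the masked copy no longer can
```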
Why does this matter for AI? Because all of these problems still exist in any system more complex than "just chat with the LLM," and now they've been additionally obscured behind a screen of natural-language expression.
Send a Robot to Test a Robot
If the issue is language, then it seems only natural to task another LLM to evaluate a chat response. In fact, it's so obvious there simply must be a bunch of frameworks out there to do it, right? Right? Um....
Perhaps my search-engine fu was not up to the task, but I wasn't able to find anything actively maintained and LLM-savvy when I went hunting. Prove me wrong in the comments!
So I started thinking about what I'd want from a design. Here are a few ideas; I'll likely start a GitHub repository to play with them.
Provide a scripted chat that includes an "expected" final response. Automate the interaction with the LLM-under-test up to that last turn, then capture its actual response.
Feed the expected response and the actual response to an independent judge model. Ask it to opine on whether the actual response is correct based on the expected one.
There may be two flavors of correct: factually correct and contextually correct. In other words, is the answer both accurate and appropriate/complete for the conversation?
Ideally, develop a scoring rubric, akin to how one would rate a potential new employee, and have the judge model "fill it out" by returning structured feedback (e.g. JSON). A rough sketch of the whole harness follows.
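Everything in this sketch is an assumption made for illustration: the `complete(messages=...)` client interface stands in for whatever chat SDK you'd actually use, the rubric fields are just one possible scorecard, and the canned stub exists only so the example runs end to end.

```python
import json

JUDGE_PROMPT = """You are grading a chatbot's final answer.

Expected response:
{expected}

Actual response:
{actual}

Reply with ONLY a JSON object:
{{"factually_correct": true or false,
  "contextually_correct": true or false,
  "score": 1-5,
  "rationale": "<one sentence>"}}"""

def run_scripted_chat(chat_client, turns):
    """Replay the scripted turns against the model-under-test and
    capture its actual final response."""
    return chat_client.complete(messages=turns)

def judge_response(judge_client, expected, actual):
    """Ask an independent judge model for a structured verdict."""
    prompt = JUDGE_PROMPT.format(expected=expected, actual=actual)
    return json.loads(judge_client.complete(
        messages=[{"role": "user", "content": prompt}]))

# Canned stub so the sketch is self-contained; in practice the two
# clients would wrap the model-under-test and a *different* judge model.
class CannedClient:
    def __init__(self, reply):
        self.reply = reply

    def complete(self, messages):
        return self.reply

script = {
    "turns": [{"role": "user", "content": "What is your refund window?"}],
    "expected": "Refunds are accepted within 30 days of purchase.",
}
actual = run_scripted_chat(
    CannedClient("We accept refunds for 30 days, receipt required."),
    script["turns"])
verdict = judge_response(
    CannedClient('{"factually_correct": true, "contextually_correct": true, '
                 '"score": 4, "rationale": "Matches the 30-day policy."}'),
    script["expected"], actual)
assert verdict["score"] >= 3  # a test runner could gate on the scorecard
```

One design choice worth calling out: keeping the judge independent of the model-under-test matters, because grading with the same model (or in the same conversation context) invites it to rationalize the answer it just gave.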
This feels like a 200-level Python coding exercise. Any academics want to assign it and see what happens?