IntellAgent - The multi-agent framework to evaluate your conversational agents
And, next, to optimize them as well
Hi folks! In this blog post, I'm going to present a novel method from a couple of brilliant researchers who developed a system to evaluate how well your conversational agents perform.
The implementation is available to everyone in their open-source repo, which is modular and integrates easily with LangGraph agents. You can find it here: -> Link to the repo
Let’s begin with some motivation:
The AI revolution is transforming our world in ways we couldn't have imagined just a few years ago. What were once simple chatbots that could barely string together coherent responses have evolved into sophisticated digital assistants. These systems now help doctors analyze complex medical records, work alongside financial advisors to manage investment portfolios, and assist teachers in developing personalized learning plans for their students.
These aren't your grandmother's automated phone systems. Today's AI assistants are dynamic agents that can think independently, develop strategies, and adapt to unexpected situations. Imagine having an assistant that remembers every single conversation you've ever had, can connect seemingly unrelated pieces of information across those conversations, seamlessly use various tools to help you, and follow complex rules without missing a beat.
The Challenge: When AI Gets Complex
This is where things get interesting - and complicated. Modern AI assistants need to juggle multiple complex tasks all at once:
First, they need to maintain perfect memory of conversation histories. This isn't just about remembering what was said two minutes ago; it's about recalling that a customer mentioned a nut allergy three weeks ago when they were placing a food order today.
They also need to work with multiple databases simultaneously. Picture an AI juggling different systems at once - checking inventory, pulling up customer records, looking at shipping logs, and verifying pricing, all while maintaining perfect accuracy.
Then there's the challenge of following hundreds of company policies. Take an AI handling insurance claims - it needs to navigate through a maze of claim limits, required documentation, coverage exclusions, regional regulations, and approval workflows, all while staying compliant with industry standards.
These systems also need to work with various tools and external systems. A customer service AI might need to process payments through one system, generate shipping labels through another, and update inventory in a third - all while keeping everything perfectly synchronized.
Security is another critical concern. It's not just about protecting data - it's about knowing exactly what information can be shared, with whom, and under what circumstances. The AI needs to verify identities, manage access permissions, handle sensitive information appropriately, and know when to escalate to human operators.
Through all of this complexity, these systems need to maintain natural, flowing conversations. They need to explain complicated policies in simple terms, walk users through complex processes step-by-step, and handle unexpected questions or requests without missing a beat.
One small mistake - like forgetting to verify identity before sharing sensitive information or misapplying a refund policy - could have serious consequences. So how do we ensure these AI assistants are reliable before we put them to work in the real world?
Why Traditional Testing isn’t Good Enough
Current testing methods typically involve human testers writing specific scenarios and checking if the AI responds correctly. It's like trying to evaluate a chef's abilities by having them cook the same three dishes over and over. Sure, you know they can make those particular dishes, but what about everything else on the menu?
Enter IntellAgent: A Revolution in AI Agent Evaluation
This is where IntellAgent comes in. It's a groundbreaking system that uses AI to test other AI systems, essentially creating thousands of virtual "mystery shoppers" who can test an AI assistant in countless different scenarios. Let's break down exactly how it works.
The Three-Step Testing Process
IntellAgent operates in three main steps, each carefully designed to create comprehensive and realistic tests:
Before diving into the three steps, we must first understand the system's inputs:
Agent Prompt
Contains the agent's operational policies and behavioral guidelines
Can be provided as a comprehensive document detailing all policies the agent should follow
Database Schema (optional)
Provided when the agent needs to interact with databases
Example: Schema for users and orders databases containing customer information
Agent Tools (optional)
A list of tools and capabilities available to the agent
Defines what actions and operations the agent can perform
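To make these inputs concrete, here is a minimal sketch of how the three ingredients might be bundled together in code. The class and field names (AgentUnderTest, prompt, schema, tools) and the policy file are illustrative placeholders, not the repo's actual API:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class AgentUnderTest:
    prompt: str                                # the agent's policies and behavioral guidelines
    schema: Optional[dict] = None              # optional database schema (e.g., users, orders)
    tools: list = field(default_factory=list)  # optional list of callable tools

# Hypothetical usage; the file name, tables, and columns are placeholders.
agent_input = AgentUnderTest(
    prompt=open("agent_policies.md").read(),
    schema={
        "users":  ["user_id", "name", "membership_tier"],
        "orders": ["order_id", "user_id", "status", "total"],
    },
    tools=[],  # e.g., process_refund, update_address
)
```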
Step 1: Setting the Stage
The Policy Graph Builder is the first component of IntellAgent's testing process. It creates what's called a "policy graph" - a structured map of all the rules the AI assistant must follow. Think of it as plotting a network where each point represents a policy (like "verify user identity" or "restrict access to sensitive information"), and the connections between points show how often these policies appear together in real-world situations.
The system starts by analyzing input documents, such as the AI assistant's system prompt or company policy documents. It extracts each policy and assigns it a complexity score that indicates how challenging it is to implement. For each pair of policies, it evaluates how likely they are to occur together, using this information to create connections. For instance, "verify identity" might frequently connect to "process refund" since these actions often happen together in customer service scenarios.
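As a rough illustration of this structure (not the repo's implementation), the policy graph could be modeled with networkx: node attributes hold per-policy complexity scores, and edge weights hold co-occurrence likelihoods. In IntellAgent these values are estimated by an LLM; the numbers below are invented for the example.

```python
import networkx as nx

# Sketch of a policy graph: nodes are policies with complexity scores,
# edges carry the likelihood that two policies co-occur in one conversation.
# All scores and weights here are made up for illustration.
G = nx.Graph()
G.add_node("verify_identity", complexity=2)
G.add_node("process_refund", complexity=4)
G.add_node("update_address", complexity=3)
G.add_node("share_sensitive_info", complexity=5)

G.add_edge("verify_identity", "process_refund", weight=0.8)        # often co-occur
G.add_edge("verify_identity", "share_sensitive_info", weight=0.9)
G.add_edge("process_refund", "update_address", weight=0.3)         # rarely co-occur
```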
The Event Generator then takes this policy graph and creates detailed scenarios that challenge the AI assistant across varying levels of complexity. Starting with a single policy, it "walks" through the graph, selecting additional policies based on their connections and cumulative complexity until it reaches the desired complexity level for the scenario.
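A minimal sketch of such a walk, continuing with the graph G from the previous snippet: start from one policy and keep sampling neighbors in proportion to their edge weights until the accumulated complexity reaches the target. The actual sampler is more sophisticated, but the idea is the same.

```python
import random

def sample_policy_set(G, start, target_complexity):
    """Walk the policy graph, accumulating policies until the summed
    complexity reaches the requested level (illustrative only)."""
    chosen = [start]
    total = G.nodes[start]["complexity"]
    current = start
    while total < target_complexity:
        neighbors = [n for n in G.neighbors(current) if n not in chosen]
        if not neighbors:
            break  # dead end: no unvisited neighbors left
        weights = [G[current][n]["weight"] for n in neighbors]
        current = random.choices(neighbors, weights=weights, k=1)[0]
        chosen.append(current)
        total += G.nodes[current]["complexity"]
    return chosen, total

policies, complexity = sample_policy_set(G, "verify_identity", target_complexity=9)
```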
These aren't just lists of policies - they're narratives. The system might generate a situation where a customer wants to update their delivery address, add a family member to their account, and use a new payment method, with each request corresponding to specific policies.
The Database Setup brings these scenarios to life by constructing a realistic database state for each scenario. For a scenario involving address updates and family member additions, the database might include a detailed customer profile with purchase histories, loyalty points, and existing family connections.
The data is carefully crafted to be internally consistent and contextually relevant. A customer profile might include contact information with realistic formatting and geographical coherence, a chronologically ordered purchase history with transaction dates and receipts, loyalty program details including points balance and membership tier, and support history showing previous customer service interactions.
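For intuition, a generated database state for the address-update scenario might look roughly like the record below. The tables, fields, and values are invented for illustration; the key property is internal consistency (the same user_id ties together the profile, the orders, and the support history).

```python
# Hypothetical, internally consistent database state for one scenario.
scenario_db = {
    "users": [{
        "user_id": "U-1042",
        "name": "Dana Levi",
        "address": "17 Harbor St, Boston, MA",
        "membership_tier": "gold",
        "loyalty_points": 2380,
    }],
    "orders": [
        {"order_id": "O-9001", "user_id": "U-1042", "status": "delivered",  "date": "2024-11-02"},
        {"order_id": "O-9117", "user_id": "U-1042", "status": "in_transit", "date": "2024-12-18"},
    ],
    "support_history": [
        {"user_id": "U-1042", "date": "2024-11-05", "topic": "late delivery"},
    ],
}
```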
Let’s look at the following example:
First, they sample policies in a scenario of certain complexity:
Then, a mock scenario is generated accordingly:
The system then generates a custom database per scenario:
Step 2: The Testing Phase
During the testing phase, a sophisticated AI known as the User Agent acts as the customer. Armed with the scenario details, it interacts with the conversational agent naturally, making requests, providing information, and responding to questions. The User Agent's goal is to push the conversational agent to its limits while maintaining the scenario's logic.
The Testing Environment allows the conversational agent to operate as it would in a live setting, accessing the pre-built database, using its standard tools and APIs, and following its programmed policies. The User Agent and the conversational agent engage in dynamic conversations, with the User Agent adapting its responses to the conversational agent's behavior to simulate the unpredictability of real users.
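Conceptually, the simulation is a turn-taking loop between two LLM-backed roles. The sketch below shows the shape of that loop with placeholder user_agent and chatbot objects; these are not the framework's actual classes.

```python
# Conceptual sketch of the simulation loop; user_agent and chatbot are
# placeholders for LLM-backed components, not IntellAgent's real classes.
def run_simulation(user_agent, chatbot, scenario, max_turns=20):
    transcript = []
    user_message = user_agent.open_conversation(scenario)  # first request, derived from the scenario
    for _ in range(max_turns):
        transcript.append(("user", user_message))
        reply = chatbot.respond(user_message)               # may call tools / query the mock database
        transcript.append(("assistant", reply))
        if user_agent.is_done(reply):
            break
        user_message = user_agent.react(reply)              # adapt to the chatbot's behavior
    return transcript
```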
Step 3: The Evaluation Phase
After each conversation, the Dialog Critic reviews the interaction in detail. It identifies which policies were relevant during the interaction and checks if the conversational agent followed them correctly. It analyzes how the conversational agent handled errors - did it try to recover from mistakes, or did one error lead to more problems?
The Dialog Critic produces a detailed performance report that goes beyond simple pass/fail metrics. It evaluates task completion rates, measuring how efficiently and accurately the conversational agent achieved its goals. It provides a fine-grained analysis of which policies were tested, which policies (if any) were violated, and the reasoning behind each judgment. The report also highlights policy-specific performance, identifying areas where the conversational agent excelled or struggled. Patterns of failure are analyzed to uncover root causes, such as conflicting policies or limitations in the conversational agent's training data.
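The output of this stage can be thought of as a structured report. The dictionary below is a hypothetical rendering of the fields described above (tested policies, violated policies, reasoning); it is not the repo's exact format.

```python
# Hypothetical shape of a critique report; field names are illustrative.
critique_report = {
    "scenario_id": "S-031",
    "complexity_level": 9,
    "tested_policies": ["verify_identity", "process_refund", "update_address"],
    "violated_policies": [
        {
            "policy": "verify_identity",
            "reasoning": "The agent processed the refund before confirming the user's identity.",
        }
    ],
    "task_completed": False,
}
```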
Let’s examine the same generated scenario twice: first with an agent based on GPT-4o, then with one based on GPT-4o-mini, each followed by the critic’s feedback.
GPT-4o as the Agent:
GPT-4o-mini as the Agent:
What Makes This Special
IntellAgent stands out because it can generate endless unique, realistic scenarios, test complex combinations of policies, create complete and consistent test environments, and evaluate performance in detail.
Key Research Findings
When researchers tested IntellAgent with leading AI models like GPT-4o, Claude, and Gemini, they discovered several interesting patterns:
Performance decreased as scenarios became more complex
Different AI models had different "breaking points"
Certain types of policies, particularly those involving user consent, were consistently challenging
Results aligned well with other testing methods, validating IntellAgent's approach
The relative ranking of models shifts across different categories. IntellAgent provides a detailed analysis of the specific policies where the tested chatbot may encounter difficulties:
Why This Matters
As AI systems handle more critical tasks in our lives, ensuring their reliability isn't just important - it's essential. IntellAgent represents a major step forward in making sure AI assistants are trustworthy before they're deployed in real-world situations.
The Future of AI Testing
The researchers have made IntellAgent open-source, allowing anyone to use and improve it. This could lead to more reliable AI systems across all industries, faster development of new features, a better understanding of AI capabilities and limitations, and standardized testing protocols for AI assistants.
For those interested in exploring further, the full research paper is available, and IntellAgent is ready for hands-on experimentation.