AI Testing


The AI Testing module within Agentic AI provides a simulation-based testing framework for evaluating the project's overall performance. It enables the creation of test scenarios with defined personas, success criteria, and configurable conversation parameters. AI-simulated conversations are generated and measured against the specified criteria, producing detailed evaluation reports with per-conversation scoring breakdowns.

This document covers the complete AI Testing workflow: from the test scenario listing page through scenario configuration, test execution history, and evaluation results.


Key Benefits

Automated Quality Assurance at Scale: AI-simulated conversations are generated in configurable volumes, eliminating the need for manual testing and enabling comprehensive coverage across diverse interaction scenarios.

Pre-Deployment Risk Mitigation: Potential issues such as incomplete workflows, language inconsistencies, or unmet business requirements are surfaced in a controlled environment before production deployment.

Persona-Driven Behavioral Coverage: Multiple user archetypes (e.g., impatient, detail-oriented, or constrained communicators) may be assigned to a single scenario, ensuring the agent is evaluated against a realistic range of end-user behaviors.

Measurable, Criteria-Based Evaluation: Explicitly defined success criteria provide quantifiable, per-conversation scoring, replacing subjective assessments with data-driven performance metrics.

Granular Diagnostics & Traceability: Results are presented at multiple levels (overall success rate, per-criterion pass rates, and full conversation transcripts), enabling precise identification of systemic weaknesses and individual points of failure.

Resource Visibility: Token consumption metrics are reported per run, supporting cost monitoring and capacity planning.

Iterative Improvement: A chronological execution history allows successive runs to be compared, verifying that agent modifications yield measurable improvement over time.


1. Test Scenario List

The AI Testing landing page presents all test scenarios within the current project in a card-based layout. It is accessed via the AI Testing tab in the top navigation bar.

Image

1. Create New Test Card: The first card in the layout serves as the entry point for new test creation. Two options are available:

  • A + button for manual scenario creation, which directs the user to the test configuration form.
  • A Generate with AI option, which is intended to allow AI-assisted scenario generation from existing conversations. (Coming soon)

2. Scenario Cards: Each existing test scenario is displayed as an individual card containing:

  • The scenario name
  • A brief description of the scenario
  • Last Evaluation Results, which include:
    • The date of the most recent evaluation run
    • The number of simulated conversations
    • An overall success rate badge (displayed as a percentage with a color-coded indicator)
  • A Run Test button at the bottom of the card for initiating a new test execution

2. Test Configuration

The test configuration page is accessed by selecting the Create New Test card or by editing an existing scenario. The page is organized into two primary tabs: Test and Evaluations. The content described below pertains to the Test tab.

Image

Select Test Type

At the top of the configuration form, two test type options are presented:

| Test Type | Description | Availability |
| --- | --- | --- |
| Simulate Conversations | Defines scenarios and personas to generate AI-simulated conversations and measure performance against success criteria. | Available |
| Test with Historical Data | Randomly selects past conversations from production data to measure how the project performed with real users. | Coming soon |

📝 Note: The Test with Historical Data option is not yet available. When released, it will allow evaluation against real production conversations rather than AI-generated simulations.

Scenario Definition

The scenario is defined through the following fields:

1. Test Name: (required) A text input field for the scenario's display name.

2. Scenario Description: (required) A multi-line text area in which the expected agent behavior is described in detail. The description serves as the guiding prompt for AI-simulated conversations. It should outline the full conversation flow, including the sequence of information gathering, decision points, and expected agent responses.

Example: A scenario description for a travel booking agent might specify that the agent first collects departure city, destination city, and date range, then presents flight options, books the selected flight, suggests hotel options, and finally summarizes the travel plan with date consistency checks.

3. Additional Info: An optional text area for supplementary instructions that provide contextual details for the simulation. This field may include specific data values, constraints, or situational context that the simulated conversation should incorporate (e.g., "Execute the scenario for travel from New York to Paris for the dates March 25–27").
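Taken together, the three fields above form a single scenario record. The sketch below is purely illustrative: the field names are assumptions mirroring the form labels, not a documented API, and the product exposes these values through the configuration form only.

```python
# Illustrative only: a scenario definition as a plain record.
# Field names mirror the configuration form labels, not a documented schema.
scenario = {
    "test_name": "Travel Booking Flow",                       # required
    "scenario_description": (                                 # required
        "The agent first collects departure city, destination city, "
        "and date range, then presents flight options, books the "
        "selected flight, suggests hotel options, and summarizes the "
        "travel plan with date consistency checks."
    ),
    "additional_info": (                                      # optional
        "Execute the scenario for travel from New York to Paris "
        "for the dates March 25-27."
    ),
}

# Both required fields must be non-empty before the scenario can be saved.
missing = [k for k in ("test_name", "scenario_description")
           if not scenario.get(k)]
assert not missing
```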

Persona Configuration

The Persona section defines the behavioral profiles that simulated users will adopt during conversations.

1. + Add From Templates: A dropdown button that provides access to predefined persona templates. These templates offer commonly used behavioral profiles that may be added to the scenario without manual configuration.

2. + Add Persona: A button for creating a custom persona definition from scratch.

Personas define characteristics such as communication style, patience level, decision-making approach, and emotional state. Multiple personas may be assigned to a single scenario.

Examples of persona definitions:

  • An impatient customer who demands quick resolutions, dislikes repeating information, and may express dissatisfaction if the conversation takes too long
  • A user who expects short, clear responses and prefers to see 2–3 options before making a decision based on key criteria such as price and layover duration

💡 Tip: Combining multiple personas within a single scenario enables the evaluation of agent robustness across diverse user behaviors. Persona templates provide a convenient starting point, after which descriptions may be customized to match specific testing objectives.
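The two example personas above can be pictured as plain records attached to one scenario. This sketch is illustrative only; the field names and the round-robin assignment are assumptions, not documented behavior of the simulator.

```python
# Illustrative only: the two example personas as plain records.
# Field names and persona names are assumptions, not a documented schema.
personas = [
    {
        "name": "Impatient Customer",
        "description": (
            "Demands quick resolutions, dislikes repeating information, "
            "and may express dissatisfaction if the conversation takes "
            "too long."
        ),
    },
    {
        "name": "Decisive Comparer",
        "description": (
            "Expects short, clear responses and prefers to see 2-3 options "
            "before deciding on key criteria such as price and layover "
            "duration."
        ),
    },
]

# Multiple personas may be assigned to one scenario; a simulator could
# rotate through them per conversation (round-robin shown as a stand-in).
assignments = [personas[i % len(personas)]["name"] for i in range(4)]
```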

Simulation Parameters

Two numerical fields control the scope of the simulated test run:

| Field | Description | Required |
| --- | --- | --- |
| Max Agent Messages | The maximum number of messages the agent is permitted to send within a single simulated conversation. This value caps the conversation length. | Yes |
| Number of Simulated Conversations | The total number of AI-generated conversations to be produced during the test run. Each conversation is independently simulated. | Yes |

⚠️ Warning: Setting the Max Agent Messages value too low may result in conversations being truncated before the agent completes the intended workflow. It is recommended that this value be set to accommodate the full expected conversation length.
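The warning above amounts to a simple pre-run check. The function below is a hedged sketch of that check, not product behavior; the parameter names and the notion of "expected agent turns" are assumptions introduced for the example.

```python
# Illustrative only: a pre-run sanity check for the two simulation
# parameters. "expected_agent_turns" is an assumed estimate of how many
# agent messages the full workflow needs; it is not a product setting.
def check_simulation_params(max_agent_messages: int,
                            num_conversations: int,
                            expected_agent_turns: int) -> list[str]:
    """Return a list of warnings; an empty list means the values look safe."""
    warnings = []
    if max_agent_messages < expected_agent_turns:
        warnings.append(
            f"Max Agent Messages ({max_agent_messages}) is below the "
            f"expected workflow length ({expected_agent_turns} agent turns); "
            "conversations may be truncated mid-workflow."
        )
    if num_conversations < 1:
        warnings.append("Number of Simulated Conversations must be at least 1.")
    return warnings

# A travel-booking flow expected to take about 9 agent turns:
assert check_simulation_params(12, 10, 9) == []   # headroom, safe
assert check_simulation_params(5, 10, 9)          # truncation warning raised
```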

Success Criteria

The Success Criteria section defines the evaluation metrics against which each simulated conversation is assessed. Criteria may be added manually via + Add Criteria or selected from predefined templates via + Add From Templates.

Each criterion is defined by three fields:

| Field | Description |
| --- | --- |
| Name | A short, descriptive label for the criterion (e.g., "Flight Reservation Completed"). |
| Description | A detailed explanation of what the criterion evaluates. This description is used by the evaluation engine to assess whether the criterion has been met. |
| Type | The scoring type for the criterion. The Integer type is used for numerical scoring. |

💡 Tip: Success criteria descriptions should be as specific as possible, as the evaluation service relies on these descriptions to determine whether each criterion is satisfied. Vague or overly broad descriptions may result in inconsistent scoring.
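The three fields per criterion can be pictured as small records. The sketch below is illustrative only: the criterion names and descriptions are invented examples in the spirit of the travel-booking scenario, and the field names are assumptions, not a documented schema.

```python
# Illustrative only: success criteria as plain records. The Description
# text carries the real evaluation logic, so it should be as specific
# as possible. Names and wording here are invented examples.
criteria = [
    {
        "name": "Flight Reservation Completed",
        "description": (
            "The agent must confirm a specific flight booking, including "
            "flight number, departure date, and user acknowledgement."
        ),
        "type": "Integer",   # numerical scoring
    },
    {
        "name": "Date Consistency",
        "description": (
            "Hotel check-in and check-out dates must match the booked "
            "flight dates stated earlier in the conversation."
        ),
        "type": "Integer",
    },
]

# Every criterion needs all three fields populated.
assert all(c["name"] and c["description"] and c["type"] == "Integer"
           for c in criteria)
```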


3. Test Execution History

Upon navigating to a specific test scenario, the Evaluations tab displays a chronological history of all past test runs.

Image

The execution history is presented as a data table with the following columns:

| Column | Description |
| --- | --- |
| Run Time | The date and time at which the test was executed. Each entry is rendered as a clickable hyperlink that navigates to the corresponding evaluation detail page. |
| Duration | The total elapsed time for the test run (e.g., "6 min 10 s"). |
| Simulated Conversations | The number of conversations that were generated during the run. |
| Success Rate | The overall success rate across all simulated conversations. |
| Status | The terminal state of the run. |

📝 Note: Runs with a Cancelled status may still contain partial results. The Simulated Conversations column indicates how many conversations were completed before cancellation (e.g., a run configured for 10 conversations may show 8 if cancelled early).


4. Evaluation Details

The evaluation detail page provides a comprehensive breakdown of a specific test run's results. It is accessed by selecting a run timestamp from the execution history table.

Image

Performance Summary Cards

Three summary cards are presented at the top of the page, providing an at-a-glance overview of the run's key metrics:

1. Success Criteria Rate: Displays the overall performance score as a large percentage value (e.g., "53%") with the label "Overall Performance."

2. Token Consumption: Reports the total token usage for the test run, broken down into three categories:

| Metric | Description |
| --- | --- |
| Input | The total number of input tokens consumed across all simulated conversations (e.g., "1.1 M"). |
| Cached Input | The number of tokens served from cache (e.g., "110.0 K"). |
| Output | The total number of output tokens generated (e.g., "51.6 K"). |

3. Conversation: Summarizes conversation-level statistics:

| Metric | Description |
| --- | --- |
| Average Agent Messages | The mean number of messages sent by the agent per conversation (e.g., "9"). |
| Number of Conversations | The total count of simulated conversations in the run (e.g., "10"). |
| Average Duration | The mean elapsed time per simulated conversation (e.g., "1 min 26 s"). |
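The abbreviated token figures on the Token Consumption card ("1.1 M", "110.0 K", "51.6 K") follow a common thousands/millions convention. The helper below is a sketch of that display convention as inferred from the examples, not the product's actual formatting code.

```python
# Illustrative only: formatting raw token counts into the abbreviated
# form shown on the Token Consumption card. The rounding convention is
# inferred from the displayed examples, not from a documented spec.
def format_tokens(count: int) -> str:
    if count >= 1_000_000:
        return f"{count / 1_000_000:.1f} M"
    if count >= 1_000:
        return f"{count / 1_000:.1f} K"
    return str(count)

assert format_tokens(1_100_000) == "1.1 M"    # Input
assert format_tokens(110_000) == "110.0 K"    # Cached Input
assert format_tokens(51_600) == "51.6 K"      # Output
```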

Success Criteria Breakdown

Below the summary cards, each success criterion is displayed as an individual horizontal progress bar with its corresponding pass rate percentage.

Simulated Conversations Table

The lower portion of the page contains a detailed data table listing every individual simulated conversation. The table includes the following columns:

| Column | Description |
| --- | --- |
| Score | The conversation's overall score (e.g., "6.1/10"). |
| Duration | The elapsed time for the individual conversation (e.g., "1 min 23 s"). |
| Agent Messages | The number of messages sent by the agent within the conversation. |
| [Per-Criterion Columns] | One column per success criterion, displaying the integer score achieved for that conversation. Column headers correspond to the criterion names defined in the test configuration. |
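One row of this table can be pictured as a small record. The sketch below assumes, for illustration only, that the overall Score is the mean of the per-criterion integer scores; the document does not specify the aggregation, and the criterion names here are invented examples.

```python
# Illustrative only: one conversations-table row as a record. The
# assumption that Score = mean of per-criterion integer scores is made
# for this example; the actual aggregation is not documented.
row = {
    "duration": "1 min 23 s",
    "agent_messages": 9,
    "criterion_scores": {          # invented example criteria
        "Flight Reservation Completed": 8,
        "Hotel Suggested": 4,
        "Date Consistency": 6,
    },
}

scores = row["criterion_scores"].values()
overall = round(sum(scores) / len(scores), 1)
assert overall == 6.0              # would display as "6.0/10"
```

Reading a row this way makes the tip below concrete: a high `overall` can coexist with a zero on one criterion, which is exactly the failure pattern worth drilling into.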

Conversation Detail Panel

Selecting a conversation row opens a side panel displaying the full conversation transcript. The panel is positioned to the right of the conversations table.

Image

1. Conversation Transcript: The full exchange between the AI Agent and the simulated user is rendered as a chat-style interface:

  • Agent messages appear on the left side with a bot avatar icon, displayed in light-colored speech bubbles.
  • User messages appear on the right side with a user avatar icon, displayed in yellow-tinted speech bubbles.

2. AI Agent Handover & Tool Call Details: Handover and tool call events are displayed inline in the conversation panel, indicating how the AI Agent handled the conversation.

💡 Tip: Reviewing individual conversation transcripts alongside their per-criterion scores is valuable for identifying specific points of failure. A conversation with a high overall score but a zero on a particular criterion reveals precisely where the agent deviated from expected behavior.