1. Overview
This document defines the methodology for measuring the success of an Agentic AI system. In Agentic AI, multiple agents collaborate through multi-step reasoning to achieve user goals. The primary unit of measurement is the goal—an end-to-end user objective that may require multiple agents to complete.
2. KPI – Goal Success Rate
The Goal Success Rate is the percentage of goals successfully completed by the Agentic AI system without error or human intervention.
Goal Success Rate (%) = (Successfully Completed Goals ÷ Total Goals Attempted) × 100
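For clarity, the calculation can also be expressed in code. The following is a minimal Python sketch (the function name is illustrative, not part of any shipped tooling):

```python
def goal_success_rate(successful_goals: int, total_goals: int) -> float:
    """Goal Success Rate: percentage of goals completed
    without error or human intervention."""
    if total_goals == 0:
        raise ValueError("at least one goal attempt is required")
    return 100.0 * successful_goals / total_goals

# Worked example from Section 5.2: 102 of 120 goals succeeded
print(goal_success_rate(102, 120))  # 85.0
```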
3. Understanding Goals
In Agentic AI systems, the primary unit of measurement is the goal. A goal represents an end-to-end user objective that may require multiple steps, multiple agents, and dynamic reasoning to complete.
Key characteristics of goals:
- A session may include multiple goals
- Goals encompass the complete user journey from request to resolution
- Multiple agents may collaborate to achieve a single goal
- Goals are evaluated based on outcome quality, not just individual step correctness
Example:
User request: "Transfer €500 to my savings account if my balance is above €1000"
This goal involves: Balance check agent → Condition evaluation → Transfer agent → Confirmation
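One way to picture this decomposition is as an ordered chain of agent steps. The sketch below is purely illustrative; the agent names and data model are hypothetical and not prescribed by this document:

```python
from dataclasses import dataclass, field

@dataclass
class AgentStep:
    agent: str   # which agent handles this step
    action: str  # what the agent is expected to do

@dataclass
class Goal:
    description: str
    steps: list[AgentStep] = field(default_factory=list)

# The transfer example decomposed into its agent pipeline
transfer_goal = Goal(
    description="Transfer EUR 500 to savings if balance is above EUR 1000",
    steps=[
        AgentStep("balance_check_agent", "retrieve current balance"),
        AgentStep("condition_evaluator", "check balance > EUR 1000"),
        AgentStep("transfer_agent", "execute the EUR 500 transfer"),
        AgentStep("confirmation_agent", "confirm the outcome to the user"),
    ],
)
```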
Multi-Goal Sessions:
When a single session involves multiple goals:
- Each goal is evaluated independently
- Goals may be sequential (one after another) or nested (sub-goals within a parent goal)
- Agent transitions and handoffs should be evaluated for correctness
Example Session:
A user asks: "Check my last three transactions and if there's a subscription charge, cancel it."
This contains two distinct goals:
- Goal 1: Retrieve last three transactions → Evaluated separately
- Goal 2: Identify and cancel subscription charge → Evaluated separately
Both goals contribute to the overall session success measurement.
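As an illustration of independent scoring, assume hypothetical 1/0 labels for the two goals above:

```python
# Hypothetical labels for the two goals in the example session;
# each goal is scored independently and both feed the KPI.
session_goals = {
    "retrieve_last_three_transactions": 1,  # succeeded
    "identify_and_cancel_subscription": 0,  # e.g. the cancellation failed
}

session_rate = 100.0 * sum(session_goals.values()) / len(session_goals)
print(session_rate)  # 50.0 -> one of the two goals counts as successful
```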
4. Success Criteria for Goal Evaluation
A goal is marked as "successful (1)" when all of the following conditions are met:
- The user's intended outcome is achieved correctly
- All intermediate steps and agent handoffs are executed without error
- No unnecessary human intervention is required during goal execution
- The final response or action aligns with user expectations
A goal is marked as "unsuccessful (0)" if any of the following occurs:
- The system misunderstands the user's goal
- An agent fails to complete its designated task
- Incorrect data is retrieved, processed, or presented
- The goal is abandoned or requires unplanned escalation
If a goal is partially completed (e.g., information retrieved correctly but final action failed), it should be marked as unsuccessful, but failure points may be documented for diagnostic purposes.
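These rules can be summarized in a small sketch. The parameter names are illustrative; in practice the labeling is a human or tool judgment, not a boolean checklist:

```python
def label_goal(outcome_correct: bool,
               steps_error_free: bool,
               no_unplanned_intervention: bool,
               meets_expectations: bool) -> int:
    """Return 1 only when every success condition holds, else 0.

    A partially completed goal (e.g. data retrieved correctly but the
    final action failed) is labeled 0; the failure point is documented
    separately for diagnostics and does not change the score.
    """
    return int(outcome_correct and steps_error_free
               and no_unplanned_intervention and meets_expectations)

# Partial completion: retrieval fine, but the final action failed -> 0
print(label_goal(False, True, True, False))  # 0
```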
5. How to Measure Success
Success measurement can be performed using two approaches: simulated conversations via the evaluation tool, or evaluation of real production conversations. Both approaches are evaluated against the same predefined success criteria.
5.1. Approach A: Simulated Conversations
Using the evaluation tool, simulated users interact with the Agentic AI based on predefined scenarios. The tool automatically evaluates each conversation against the configured success criteria. Automated evaluation results may be reviewed or audited by either party if required.
Step 1: Define Test Scenario
Create a scenario that describes the user goal to be tested.
Example:
"Customer wants to check their account balance.
Agent should verify identity and provide accurate balance information."
Step 2: Select Persona
Choose or create a persona that defines the simulated user's behavior and communication style (e.g., Patient & Polite, Budget-conscious, Impatient).
Step 3: Define Success Criteria
Specify the criteria that determine whether the goal is successfully completed.
Example:
| Criteria Name | Description |
|---|---|
| Account Verified | Customer ID is successfully validated |
| Balance Provided | Account balance is accurately displayed |
Step 4: Configure & Run
- Set the number of simulated conversations (e.g., 10)
- Set maximum agent messages per conversation (e.g., 20)
- Add any additional context information if needed (a hypothetical configuration sketch follows this list)
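Because the evaluation tool is still under development (see the note at the end of this section), its configuration format is not final. A hypothetical Python representation of the settings from Steps 1–4 might look like:

```python
# Hypothetical simulation run configuration; field names are illustrative
# and do not reflect the evaluation tool's final interface.
simulation_config = {
    "scenario": ("Customer wants to check their account balance. "
                 "Agent should verify identity and provide accurate "
                 "balance information."),
    "persona": "Patient & Polite",
    "success_criteria": {
        "Account Verified": "Customer ID is successfully validated",
        "Balance Provided": "Account balance is accurately displayed",
    },
    "num_conversations": 10,
    "max_agent_messages": 20,
    "additional_context": None,
}
```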
Step 5: Review Results
The tool provides the following outputs; a short sketch of how they can be computed appears after the list:
- Overall Success Rate: Percentage of conversations meeting all criteria
- Per-Criteria Breakdown: Success rate for each individual criterion
- Conversation Details: Individual results with score, duration, and pass/fail per criterion
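These metrics can be derived from per-conversation pass/fail records. A minimal sketch with illustrative data:

```python
# Illustrative per-conversation results: criterion name -> pass/fail.
results = [
    {"Account Verified": True,  "Balance Provided": True},
    {"Account Verified": True,  "Balance Provided": False},
    {"Account Verified": True,  "Balance Provided": True},
]

# Overall Success Rate: conversations where every criterion passed.
overall = 100.0 * sum(all(r.values()) for r in results) / len(results)

# Per-Criteria Breakdown: pass rate of each criterion on its own.
per_criterion = {
    name: 100.0 * sum(r[name] for r in results) / len(results)
    for name in results[0]
}

print(round(overall, 1))  # 66.7 (2 of 3 conversations met all criteria)
print({k: round(v, 1) for k, v in per_criterion.items()})
# {'Account Verified': 100.0, 'Balance Provided': 66.7}
```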
The evaluation tool for simulated conversations is currently under development. Features and interface may evolve in future releases.
5.2. Approach B: Production Conversations
Success can also be measured using real conversations from the production environment after the project goes live.
Step 1: Session Selection
Randomly select sessions from the evaluation period ensuring coverage of:
- Different goal types and agent combinations
- Various complexity levels (simple, moderate, complex)
- Various time periods (weekdays, weekends, peak and off-peak hours)
- Minimum recommended: 50 sessions containing at least 100 goal attempts (see the sampling sketch below)
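A minimal sketch of reproducible random selection, assuming sessions carry metadata such as a "complexity" field (the field names are hypothetical):

```python
import random

def sample_sessions(sessions: list[dict], n: int = 50, seed: int = 42) -> list[dict]:
    """Draw a reproducible random sample and flag obvious coverage gaps.

    Real selection would also check goal types, agent combinations,
    and time periods, per the coverage requirements above.
    """
    rng = random.Random(seed)  # fixed seed keeps the draw auditable
    sample = rng.sample(sessions, min(n, len(sessions)))
    seen = {s.get("complexity") for s in sample}
    for level in ("simple", "moderate", "complex"):
        if level not in seen:
            print(f"warning: no '{level}' sessions sampled; consider redrawing")
    return sample
```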
Step 2: Goal Identification
For each session, identify:
- The number of distinct goals attempted
- The agents involved in each goal
- The expected outcome for each goal
Step 3: Success Labeling
For each identified goal:
- Review the conversation flow, agent reasoning, and actions taken
- Compare the actual outcome against the expected outcome
- Label as successful (1) or unsuccessful (0)
Step 4: Calculate Success Rate
Calculate the success rate using the formula defined in Section 2.
Example:
50 sessions analyzed
120 total goals identified across all sessions
102 goals completed successfully
Success Rate = (102 ÷ 120) × 100 = 85%
5.3. Test Guidelines
Well-Defined Agents: Each agent should have clearly defined responsibilities, expected inputs/outputs, and the scenarios it covers to ensure reliable testing and accurate KPI measurement.
Clear Instructions and Test Scenarios: Steps for testing, valid scenarios, and edge cases to be excluded should be clearly defined and approved before testing begins.
Repeatable Testing: Tests for the same goals should be repeatable on different days or by different testers to ensure consistency of results.
Refinement and Iteration: If test results do not meet the expected success rate, test scenarios or agent behaviors may be refined and the tests repeated. Unless otherwise agreed, a maximum of two refinement cycles is permitted.
Configuration Freeze: Agent configurations, prompts, and integrations must remain unchanged during the measurement period. Any modifications require re-testing of affected scenarios.
6. Exclusions from Evaluation
The following scenarios are excluded from KPI calculation:
| Exclusion Type | Description |
|---|---|
| Out-of-Scope | Goals, languages, or scenarios not defined in project scope |
| External Failures | API errors, third-party downtime, AI provider outages |
| Integration Issues | Connectivity problems, timeout errors, data quality issues |
| User Behavior | User abandonment (without agent error), adversarial inputs |
| Audio Quality for Voice Agents | Voice inputs with excessive noise or unintelligible speech |
| Force Majeure | System-wide outages or circumstances beyond control |
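If exclusions are tracked as tags on labeled goals (a hypothetical convention, not prescribed by this document), filtering before the KPI calculation might look like:

```python
# Hypothetical exclusion tags assigned during labeling; any goal
# carrying one of these is dropped before the KPI is calculated.
EXCLUDED = {
    "out_of_scope", "external_failure", "integration_issue",
    "user_behavior", "audio_quality", "force_majeure",
}

def eligible_goals(goals: list[dict]) -> list[dict]:
    """Keep only goals that count toward the Goal Success Rate."""
    return [g for g in goals if g.get("exclusion_reason") not in EXCLUDED]
```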
7. Limitations & Disclaimers
- Agentic AI systems utilize large language models (LLMs), which are inherently probabilistic in nature. 100% accuracy cannot be guaranteed for any AI-based system.
- AI model behavior may exhibit slight variations between identical inputs due to the probabilistic nature of LLMs. This is expected behavior and not considered a defect.
- The choice of AI provider and model directly impacts system performance and achievable success rates.
- If the Customer selects or mandates a specific model or provider, success rate targets are calibrated based on that model's capabilities.
- Model updates, deprecations, or behavioral changes by the AI provider may affect system behavior. Such changes are outside our control and may require recalibration.
