1. Overview
This document defines the methodology for measuring the success of an Agentic AI system. In Agentic AI, multiple agents collaborate through multi-step reasoning to achieve user goals. The primary unit of measurement is the goal—an end-to-end user objective that may require multiple agents to complete.
2. KPI – Goal Success Rate
The Goal Success Rate is the percentage of goals successfully completed by the Agentic AI system without error or human intervention.
Goal Success Rate (%) = (Successfully Completed Goals ÷ Total Goals Attempted) × 100
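For clarity, the calculation can also be expressed in code. The following is a minimal Python sketch (the function name is illustrative, not part of any shipped tooling):

```python
def goal_success_rate(successful_goals: int, total_goals: int) -> float:
    """Goal Success Rate: percentage of goals completed
    without error or human intervention."""
    if total_goals == 0:
        raise ValueError("at least one goal attempt is required")
    return 100.0 * successful_goals / total_goals

# Worked example from Section 5.2: 102 of 120 goals succeeded
print(goal_success_rate(102, 120))  # 85.0
```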
3. Understanding Goals
In Agentic AI systems, the primary unit of measurement is the goal. A goal represents an end-to-end user objective that may require multiple steps, multiple agents, and dynamic reasoning to complete.
Key characteristics of goals:
- A session may include multiple goals
- Goals encompass the complete user journey from request to resolution
- Multiple agents may collaborate to achieve a single goal
- Goals are evaluated based on outcome quality, not just individual step correctness
Example:
User request: "Transfer €500 to my savings account if my balance is above €1000"
This goal involves: Balance check agent → Condition evaluation → Transfer agent → Confirmation
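One way to picture this decomposition is as an ordered chain of agent steps. The sketch below is purely illustrative; the agent names and data model are hypothetical and not prescribed by this document:

```python
from dataclasses import dataclass, field

@dataclass
class AgentStep:
    agent: str   # which agent handles this step
    action: str  # what the agent is expected to do

@dataclass
class Goal:
    description: str
    steps: list[AgentStep] = field(default_factory=list)

# The transfer example decomposed into its agent pipeline
transfer_goal = Goal(
    description="Transfer EUR 500 to savings if balance is above EUR 1000",
    steps=[
        AgentStep("balance_check_agent", "retrieve current balance"),
        AgentStep("condition_evaluator", "check balance > EUR 1000"),
        AgentStep("transfer_agent", "execute the EUR 500 transfer"),
        AgentStep("confirmation_agent", "confirm the outcome to the user"),
    ],
)
```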
Multi-Goal Sessions:
When a single session involves multiple goals:
- Each goal is evaluated independently
- Goals may be sequential (one after another) or nested (sub-goals within a parent goal)
- Agent transitions and handoffs should be evaluated for correctness
Example Session:
A user asks: "Check my last three transactions and if there's a subscription charge, cancel it."
This contains two distinct goals:
- Goal 1: Retrieve last three transactions → Evaluated separately
- Goal 2: Identify and cancel subscription charge → Evaluated separately
Both goals contribute to the overall session success measurement.
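As an illustration of independent scoring, assume hypothetical 1/0 labels for the two goals above:

```python
# Hypothetical labels for the two goals in the example session;
# each goal is scored independently and both feed the KPI.
session_goals = {
    "retrieve_last_three_transactions": 1,  # succeeded
    "identify_and_cancel_subscription": 0,  # e.g. the cancellation failed
}

session_rate = 100.0 * sum(session_goals.values()) / len(session_goals)
print(session_rate)  # 50.0 -> one of the two goals counts as successful
```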
4. Success Criteria for Goal Evaluation
A goal is marked as "successful (1)" when all of the following conditions are met:
- The user's intended outcome is achieved correctly
- All intermediate steps and agent handoffs are executed without error
- No unnecessary human intervention is required during goal execution
- The final response or action aligns with user expectations
A goal is marked as "unsuccessful (0)" if any of the following occurs:
- The system misunderstands the user's goal
- An agent fails to complete its designated task
- Incorrect data is retrieved, processed, or presented
- The goal is abandoned or requires unplanned escalation
If a goal is partially completed (e.g., information retrieved correctly but final action failed), it should be marked as unsuccessful, but failure points may be documented for diagnostic purposes.
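These rules can be summarized in a small sketch. The parameter names are illustrative; in practice the labeling is a human or tool judgment, not a boolean checklist:

```python
def label_goal(outcome_correct: bool,
               steps_error_free: bool,
               no_unplanned_intervention: bool,
               meets_expectations: bool) -> int:
    """Return 1 only when every success condition holds, else 0.

    A partially completed goal (e.g. data retrieved correctly but the
    final action failed) is labeled 0; the failure point is documented
    separately for diagnostics and does not change the score.
    """
    return int(outcome_correct and steps_error_free
               and no_unplanned_intervention and meets_expectations)

# Partial completion: retrieval fine, but the final action failed -> 0
print(label_goal(False, True, True, False))  # 0
```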
5. How to Measure Success
Success measurement can be performed using two approaches: simulated conversations via the evaluation tool, or evaluation of real production conversations. Both approaches are evaluated against the same predefined success criteria.
5.1. Approach A: Simulated Conversations
Using the evaluation tool, simulated users interact with the Agentic AI based on predefined scenarios. The tool automatically evaluates each conversation against the configured success criteria. Automated evaluation results may be reviewed or audited by either party if required.
Step 1: Define Test Scenario
Create a scenario that describes the user goal to be tested.
Example:
"Customer wants to check their account balance.
Agent should verify identity and provide accurate balance information."
Step 2: Select Persona
Choose or create a persona that defines the simulated user's behavior and communication style (e.g., Patient & Polite, Budget-conscious, Impatient).
Step 3: Define Success Criteria
Specify the criteria that determine whether the goal is successfully completed.
Example:
| Criteria Name | Description |
|---|---|
| Account Verified | Customer ID is successfully validated |
| Balance Provided | Account balance is accurately displayed |
Step 4: Configure & Run
- Set the number of simulated conversations (e.g., 10)
- Set maximum agent messages per conversation (e.g., 20)
- Add any additional context information if needed (a hypothetical configuration sketch follows this list)
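Because the evaluation tool is still under development (see the note at the end of this section), its configuration format is not final. A hypothetical Python representation of the settings from Steps 1–4 might look like:

```python
# Hypothetical simulation run configuration; field names are illustrative
# and do not reflect the evaluation tool's final interface.
simulation_config = {
    "scenario": ("Customer wants to check their account balance. "
                 "Agent should verify identity and provide accurate "
                 "balance information."),
    "persona": "Patient & Polite",
    "success_criteria": {
        "Account Verified": "Customer ID is successfully validated",
        "Balance Provided": "Account balance is accurately displayed",
    },
    "num_conversations": 10,
    "max_agent_messages": 20,
    "additional_context": None,
}
```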
Step 5: Review Results
The tool provides the following outputs; a short sketch of how they can be computed appears after the list:
- Overall Success Rate: Percentage of conversations meeting all criteria
- Per-Criteria Breakdown: Success rate for each individual criterion
- Conversation Details: Individual results with score, duration, and pass/fail per criterion
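These metrics can be derived from per-conversation pass/fail records. A minimal sketch with illustrative data:

```python
# Illustrative per-conversation results: criterion name -> pass/fail.
results = [
    {"Account Verified": True,  "Balance Provided": True},
    {"Account Verified": True,  "Balance Provided": False},
    {"Account Verified": True,  "Balance Provided": True},
]

# Overall Success Rate: conversations where every criterion passed.
overall = 100.0 * sum(all(r.values()) for r in results) / len(results)

# Per-Criteria Breakdown: pass rate of each criterion on its own.
per_criterion = {
    name: 100.0 * sum(r[name] for r in results) / len(results)
    for name in results[0]
}

print(round(overall, 1))  # 66.7 (2 of 3 conversations met all criteria)
print({k: round(v, 1) for k, v in per_criterion.items()})
# {'Account Verified': 100.0, 'Balance Provided': 66.7}
```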
The evaluation tool for simulated conversations is currently under development. Features and interface may evolve in future releases.
5.2. Approach B: Production Conversations
Success can also be measured using real conversations from the production environment after the project goes live.
Step 1: Session Selection
Randomly select sessions from the evaluation period ensuring coverage of:
- Different goal types and agent combinations
- Various complexity levels (simple, moderate, complex)
- Various time periods (weekdays, weekends, peak and off-peak hours)
- Minimum recommended: 50 sessions containing at least 100 goal attempts (see the sampling sketch below)
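A minimal sketch of reproducible random selection, assuming sessions carry metadata such as a "complexity" field (the field names are hypothetical):

```python
import random

def sample_sessions(sessions: list[dict], n: int = 50, seed: int = 42) -> list[dict]:
    """Draw a reproducible random sample and flag obvious coverage gaps.

    Real selection would also check goal types, agent combinations,
    and time periods, per the coverage requirements above.
    """
    rng = random.Random(seed)  # fixed seed keeps the draw auditable
    sample = rng.sample(sessions, min(n, len(sessions)))
    seen = {s.get("complexity") for s in sample}
    for level in ("simple", "moderate", "complex"):
        if level not in seen:
            print(f"warning: no '{level}' sessions sampled; consider redrawing")
    return sample
```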
Step 2: Goal Identification
For each session, identify:
- The number of distinct goals attempted
- The agents involved in each goal
- The expected outcome for each goal
Step 3: Success Labeling
For each identified goal:
- Review the conversation flow, agent reasoning, and actions taken
- Compare the actual outcome against the expected outcome
- Label as successful (1) or unsuccessful (0)
Step 4: Calculate Success Rate
Calculate the success rate using the formula defined in Section 2.
Example:
50 sessions analyzed
120 total goals identified across all sessions
102 goals completed successfully
Success Rate = (102 ÷ 120) × 100 = 85%
5.3. Test Guidelines
Well-Defined Agents: Each agent should have clearly defined responsibilities, expected inputs/outputs, and the scenarios it covers to ensure reliable testing and accurate KPI measurement.
Clear Instructions and Test Scenarios: Steps for testing, valid scenarios, and edge cases to be excluded should be clearly defined and approved before testing begins.
Repeatable Testing: Tests for the same goals should be repeatable on different days or by different testers to ensure consistency of results.
Refinement and Iteration: If test results do not meet the expected success rate, test scenarios or agent behaviors may be refined and the tests repeated. Unless otherwise agreed, a maximum of two refinement cycles is permitted.
Configuration Freeze: Agent configurations, prompts, and integrations must remain unchanged during the measurement period. Any modifications require re-testing of affected scenarios.
6. Exclusions from Evaluation
The following scenarios are excluded from KPI calculation:
| Exclusion Type | Description |
|---|---|
| Out-of-Scope | Goals, languages, or scenarios not defined in project scope |
| External Failures | API errors, third-party downtime, AI provider outages |
| Integration Issues | Connectivity problems, timeout errors, data quality issues |
| User Behavior | User abandonment (without agent error), adversarial inputs |
| Audio Quality for Voice Agents | Voice inputs with excessive noise or unintelligible speech |
| Force Majeure | System-wide outages or circumstances beyond control |
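If exclusions are tracked as tags on labeled goals (a hypothetical convention, not prescribed by this document), filtering before the KPI calculation might look like:

```python
# Hypothetical exclusion tags assigned during labeling; any goal
# carrying one of these is dropped before the KPI is calculated.
EXCLUDED = {
    "out_of_scope", "external_failure", "integration_issue",
    "user_behavior", "audio_quality", "force_majeure",
}

def eligible_goals(goals: list[dict]) -> list[dict]:
    """Keep only goals that count toward the Goal Success Rate."""
    return [g for g in goals if g.get("exclusion_reason") not in EXCLUDED]
```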
7. Limitations & Disclaimers
- Agentic AI systems utilize large language models (LLMs), which are inherently probabilistic in nature. 100% accuracy cannot be guaranteed for any AI-based system.
- AI model behavior may exhibit slight variations between identical inputs due to the probabilistic nature of LLMs. This is expected behavior and not considered a defect.
- The choice of AI provider and model directly impacts system performance and achievable success rates.
- If the Customer selects or mandates a specific model or provider, success rate targets are calibrated based on that model's capabilities.
- Model updates, deprecations, or behavioral changes by the AI provider may affect system behavior. Such changes are outside our control and may require recalibration.
