| Document Number | Revision Number | Revision Date |
|---|---|---|
| KN. GU.27.EN | Rev19 | 14.04.2026 |
Understanding how well a speech recognition system performs in your environment requires more than a single accuracy number. This guide explains how SESTEK measures recognition quality, what factors influence results, and how to design a meaningful benchmark for your use case.
What is Word Error Rate?
Word Error Rate (WER) is the standard metric for evaluating speech recognition accuracy. It measures how many words in the system's output needed to be corrected to match the reference transcript.
WER is calculated as the total number of errors divided by the number of words in the reference transcript:

WER = (S + D + I) / N

where N is the number of reference words and the three error types are:
| Error Type | Definition |
|---|---|
| Insertion (I) | A word appears in the hypothesis that is not in the reference |
| Deletion (D) | A word in the reference is missing from the hypothesis |
| Substitution (S) | A word in the reference is replaced with an incorrect word |
A WER of 5% corresponds to a word accuracy of 95%.
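As an illustration, WER can be computed with a standard word-level edit-distance alignment. The sketch below uses hypothetical transcripts (not SESTEK output); real scoring tools also report the S/D/I breakdown from the alignment:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: Levenshtein distance over words, divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# Hypothetical pair: one deletion ("me") + one substitution ("billing" -> "filling")
# gives 2 errors over 5 reference words.
print(wer("please transfer me to billing", "please transfer to filling"))  # 0.4
```

Here a WER of 0.4 (40%) means two of the five reference words required correction.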
Designing a Meaningful Benchmark
A benchmark is only as useful as its test set. Results measured on audio that does not reflect your actual environment can be misleading - either overestimating or underestimating real-world performance.
Audio Quality
SR performance varies significantly based on recording conditions. When building your test set, make sure the audio reflects your deployment environment as closely as possible.
Key factors that affect recognition quality:
- Background noise - contact center audio is typically noisier than broadcast audio
- Audio codec - lower-quality codecs such as 8-bit G.711 reduce recognition accuracy
- Overlapping speech - multiple speakers in the same recording increase error rates
Avoid using open-source corpora as your primary benchmark dataset. SR models are often trained on these datasets, which can artificially inflate accuracy scores if the model has already seen similar data.
Speaker Diversity
A representative test set should reflect the range of speakers your application will encounter. Speaker-specific factors that influence accuracy include:
- Age - children, adults, and elderly speakers may have distinct speech patterns
- Accent - regional and non-native accents can affect recognition rates
- Gender - vocal characteristics differ and models should be tested across both
A benchmark built on a narrow speaker profile may not generalize to your full user base.
Dataset Size
For statistically meaningful results, use at least 10,000 words - roughly one hour of continuous speech - per language being tested. Smaller datasets increase the margin of error and make it harder to detect meaningful differences between systems.
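To see why roughly 10,000 words is a reasonable floor, the margin of error can be sketched with a normal approximation. This treats each word as an independent trial, which understates the correlation present in real speech, so the numbers below are illustrative only:

```python
import math

def wer_margin(wer: float, n_words: int, z: float = 1.96) -> float:
    """Approximate 95% confidence half-width for a measured WER,
    assuming independent per-word errors (a simplification)."""
    return z * math.sqrt(wer * (1 - wer) / n_words)

# A measured 5% WER: half-width shrinks as the test set grows.
print(round(wer_margin(0.05, 10_000), 4))  # 0.0043 -> roughly 4.6%..5.4%
print(round(wer_margin(0.05, 1_000), 4))   # 0.0135 -> roughly 3.7%..6.4%
```

With only 1,000 words, two systems whose true WERs differ by two points may be statistically indistinguishable; at 10,000 words the same difference is clearly detectable.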
Limitations of WER
WER is the most widely used metric for SR evaluation, but it has known limitations:
- All errors are treated equally. A substitution that changes the meaning of a sentence is counted the same as a minor spelling variation.
- Punctuation and capitalization are ignored. These affect transcript readability and downstream processing but are not reflected in WER.
- Speaker diarization is not captured. Who said what is outside the scope of WER.
- Reference transcripts are expensive to produce. Manual transcription services charge per hour, making large-scale benchmarking costly.
These limitations do not make WER invalid - but they are worth accounting for when interpreting results, especially across different domains or audio conditions.
Codec choice also shifts absolute numbers: in deployments where G.711 is not available and a lower-bitrate codec such as G.729 is used instead, recognition accuracy may degrade by approximately 10%.
How SESTEK Benchmarks SR
At SESTEK, we conduct periodic comparison tests to evaluate SESTEK SR against other systems and track performance over time. Each test uses audio that reflects real contact center conditions, with reference transcripts created by human annotators.
Our goal is to ensure that accuracy figures we report are representative of what customers actually experience - not idealized lab conditions.

