Accuracy Benchmark for Models


At Sestek, we understand that achieving successful outcomes with speech recognition is crucial for our users. As part of our commitment to continuous improvement and ensuring the highest levels of performance for Knovvu Speech Recognition, we conduct periodic comparison tests to evaluate and compare recognition success rates.

These tests help us enhance the overall user experience with Knovvu Speech Recognition: our users can be confident that we are continually optimizing the system's accuracy, efficiency, and adaptability, resulting in seamless integration and strong performance across a variety of applications.

Overview

To effectively gauge the precision of a Speech Recognition (SR) system such as Knovvu Speech Recognition, it is essential to use a test set of audio that mirrors the audio typically encountered in the intended application.

For every audio clip, a flawless 'reference' transcript created by a human is required; it serves as the standard against which the 'hypothesis' transcript generated by the automatic speech recognition (ASR) system is compared.

The Word Error Rate (WER) is determined by comparing the reference transcript with the hypothesis transcript and counting the minimum number of word-level edits needed to make the two transcripts identical.

Selecting the Test Set

Choosing the right test set is crucial for effective benchmarking.

Audio Quality

The performance of ASR systems can vary greatly due to several factors, such as:

  • Background noise
  • Audio codec (e.g., 8-bit audio, which is lower quality)
  • Overlapping speech from multiple speakers

It's important that the test sets closely align with the specific audio environment of your application. For example, audio from contact centers typically contains more background noise and is more challenging to transcribe than the cleaner audio used for broadcast captioning.

Where possible, avoid using an open-source corpus for benchmarking: SR models are often trained on these datasets, so results can be skewed in favor of models that have already seen the data.

Speaker Diversity

Characteristics specific to each speaker can also significantly impact SR accuracy, such as:

  • Age
  • Accent
  • Gender

Dataset Size

For benchmarking Word Error Rate (WER) across different engines with statistical relevance, the size of the dataset is paramount.

Generally, it is advisable to use at least 10,000 words (about 1 hour of continuous speech at a typical speaking rate of 150–170 words per minute) for each language you're testing.

How We Calculate WER

The Word Error Rate (WER) measures inaccuracies in the hypothesis transcript, including insertions, deletions, and substitutions. It is calculated as a percentage, normalizing the number of errors by the total word count of the reference transcript (N).

Errors are modifications required to align the hypothesis with the reference transcript:

WER = (I + D + S) / N × 100

  • Insertions (I): Words that are not in the reference are added to the hypothesis.
  • Deletions (D): Words that are in the reference are omitted from the hypothesis.
  • Substitutions (S): Words in the reference are replaced with incorrect words in the hypothesis.
  • Word Count (N): The total number of words in the reference transcript.

Thus, if an automatically generated transcript achieves a Word Error Rate (WER) of 5%, it implies an accuracy of 95%.
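To make the calculation concrete, here is a minimal Python sketch (an illustration only, not Knovvu's actual implementation). It computes the word-level edit distance with the standard Levenshtein dynamic program, then backtracks through the table to count each error type:

    def wer(reference: str, hypothesis: str) -> dict:
        """Word Error Rate with per-type error counts (illustrative sketch)."""
        ref, hyp = reference.split(), hypothesis.split()

        # dp[i][j] = minimum edits turning the first i reference words
        # into the first j hypothesis words.
        dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(1, len(ref) + 1):
            dp[i][0] = i                              # delete all i reference words
        for j in range(1, len(hyp) + 1):
            dp[0][j] = j                              # insert all j hypothesis words
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                if ref[i - 1] == hyp[j - 1]:
                    dp[i][j] = dp[i - 1][j - 1]       # exact match, no cost
                else:
                    dp[i][j] = 1 + min(dp[i - 1][j - 1],  # substitution
                                       dp[i - 1][j],      # deletion
                                       dp[i][j - 1])      # insertion

        # Backtrack to split the minimum edit count into S, D, and I.
        s = d = ins = 0
        i, j = len(ref), len(hyp)
        while i > 0 or j > 0:
            if i > 0 and j > 0 and ref[i - 1] == hyp[j - 1]:
                i, j = i - 1, j - 1                   # match
            elif i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + 1:
                s, i, j = s + 1, i - 1, j - 1         # substitution
            elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
                d, i = d + 1, i - 1                   # deletion
            else:
                ins, j = ins + 1, j - 1               # insertion

        n = len(ref)
        return {"S": s, "D": d, "I": ins, "N": n,
                "WER": 100.0 * (s + d + ins) / n}

For a perfect hypothesis the function returns a WER of 0. Note that WER can exceed 100% when the hypothesis contains many extra words, since errors are normalized by the reference length rather than the hypothesis length.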

Example

Let's work through a longer, more lighthearted example of the Word Error Rate (WER) calculation:

Reference Transcript: The quick brown fox jumps over the lazy dog who was dreaming about chasing squirrels in its sleep.
Hypothesis Transcript: A quick red fox jumped over two lazy dogs dreaming of chasing fast squirrels while sleeping.

In this example, one minimal word-level alignment of the two transcripts gives:

  • Insertions (I): 1 ("fast" is added)
  • Deletions (D): 3 ("who," "was," and "its" are omitted)
  • Substitutions (S): 8 ("The" → "A," "brown" → "red," "jumps" → "jumped," "the" → "two," "dog" → "dogs," "about" → "of," "in" → "while," and "sleep" → "sleeping")

The total number of words in the reference transcript (N) is 18.

Using the WER formula:

WER = (I + D + S) / N × 100 = (1 + 3 + 8) / 18 × 100 ≈ 66.7%

The Word Error Rate (WER) for this example is therefore about 66.7%: 12 of the 18 words in the reference transcript required an insertion, deletion, or substitution to align the hypothesis with the reference.
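Running the wer sketch from the earlier section on these two transcripts (lowercased, punctuation removed) confirms the arithmetic:

    reference = ("the quick brown fox jumps over the lazy dog "
                 "who was dreaming about chasing squirrels in its sleep")
    hypothesis = ("a quick red fox jumped over two lazy dogs "
                  "dreaming of chasing fast squirrels while sleeping")

    print(wer(reference, hypothesis))
    # 12 total errors over 18 reference words -> WER = 12/18 * 100 ≈ 66.7
    # (the exact S/D/I split can differ between equally minimal alignments)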

Challenges of Using Word Error Rate

Word Error Rate continues to be the leading metric for evaluating Automatic Speech Recognition systems, yet it comes with certain challenges:

  • Creating reference transcripts is expensive, as manual transcription services charge high fees per hour of audio.
  • WER does not differentiate between types of errors; it treats all errors the same, whether they significantly change the meaning of a sentence or are simple spelling mistakes.
  • It does not account for punctuation, capitalization, or speaker diarization, all of which are crucial for the transcript's readability and accuracy (a normalization step that addresses the first two is sketched below).
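Because of that last point, both the reference and the hypothesis are usually normalized before scoring so that formatting differences are not counted as word errors. Below is a minimal sketch of such a step, assuming a typical lowercasing-and-punctuation-stripping pass (not Knovvu's actual pipeline); it reuses the wer function from the earlier sketch:

    import re

    def normalize(text: str) -> str:
        """Lowercase and strip punctuation so that formatting differences
        (e.g., "Hello," vs "hello") are not scored as word errors."""
        text = text.lower()
        text = re.sub(r"[^\w\s']", " ", text)  # drop punctuation, keep apostrophes
        return " ".join(text.split())          # collapse repeated whitespace

    print(wer(normalize("Hello, World!"), normalize("hello world"))["WER"])  # 0.0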
Note

Additionally, for deployments that do not support the G.711 codec (e.g., those using G.729), the recognition rate can be negatively affected by around 10%.

