Text Normalization

Text Normalization (TN) is a critical component of speech recognition systems. It converts raw spoken transcriptions into a standardized written form, producing text that is more readable and semantically accurate. Named Entity Recognition (NER) then extracts entities such as numbers, dates, currencies, and phone numbers from the recognized text.

One of the key challenges in this process is handling numbers appropriately: the transcription must align with the user's intent and context.

For example:

  • "one step at a time" should remain as "one step at a time", not "1 step at a time"
  • "two wrongs don't make a right" should not be transcribed as "2 wrongs don't make a right" in conversational contexts

SESTEK SR addresses this through two approaches: pre-defined entities and custom entities, both integrated via the TextManip service.


1. TN/NER with Pre-defined Entities

Pre-defined entity support is enabled using the TextManipType header parameter in POST dictation/request. When used, a request with the pre-defined ParsingList is sent to the TextManip service for Text Normalization and Named Entity Recognition.

Key             Description
TextManipType   Standard · Aggressive · None

If not specified, Standard mode is used by default.

1.1. Pre-defined Entity Types

LabelClass     Description
Number         Detects numerical values in any format
SrDate         Detects date expressions, excluding relative terms like "today", "tomorrow", "yesterday"
Currency       Detects monetary values, including amounts paired with currency symbols or names
Time           Detects clock times in different formats (e.g. "10:30 AM", "22:15")
SerialNumber   Detects structured numerical sequences representing serial numbers

1.2. TextManipType Modes

Standard

Converts numeric expressions only when explicitly numerical. Retains non-literal or contextual phrases as-is, ensuring semantic accuracy.

Standard mode uses TextManip's text and word output parameters as the result. See Contextual Numerical Expressions Handling for more details.

Examples:

Input (spoken)                        Output (transcribed)
"one can never know the answer"       "one can never know the answer"
"two heads are better than one"       "two heads are better than one"
"i have one last question for you"    "i have one last question for you"

Aggressive

Converts all potential numbers into numerical format, prioritizing numeric representation over contextual nuance.

Aggressive mode uses TextManip's displayText and displayWord output parameters as the result. See NER Methods for more details.

Examples:

Input (spoken)                                  Output (transcribed)
"one can never know the answer"                 "1 can never know the answer"
"two heads are better than one"                 "2 heads are better than 1"
"three days ago, we had a meeting about this"   "3 days ago we had a meeting about this"
"i have one last question for you"              "i have 1 last question for you"

None

Disables text normalization entirely. All transcriptions preserve the exact spoken format without any changes to numeric expressions.

This mode is useful for scenarios where raw transcription without modification is required.

Examples:

Input (spoken)                       Output (transcribed)
"i will be there in three hours"     "i will be there in three hours"
"room number one is ready"           "room number one is ready"

2. TN/NER with Custom Entities

Custom entity support is enabled via the parameters field in the form-data body of POST dictation/request.

Key          Type   Description
audio        File   The audio file to be transcribed (e.g. payment.wav)
parameters   Text   JSON object specifying the NER ParsingList

ParsingList example:

{
  "NER": {
    "ParsingList": [
      { "LabelClass": "SrDate",       "Name": "@srdate" },
      { "LabelClass": "Time",         "Name": "@time" },
      { "LabelClass": "Iban",         "Name": "@iban" },
      { "LabelClass": "Tckn",         "Name": "@tckn" },
      { "LabelClass": "PhoneNumber",  "Name": "@phonenumber" }
    ]
  }
}
Attention
  • parameters works with TextManipType Standard and Aggressive modes only.
  • If TextManipType is set to None, the parameters field is automatically disabled and no TN/NER will be applied.
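As a minimal sketch, a client can assemble the parameters form field before attaching it to the multipart request. The header and field names below are taken from the dictation/request examples in this document; the server URL and token are placeholders, and build_parameters is an illustrative helper, not part of the API.

```python
import json

# Placeholder values; replace with your actual server URL and token.
SERVER_URL = "https://sr.example.com"

# Headers controlling TN/NER mode, as used by POST dictation/request.
headers = {
    "Content-Type": "audio/wave",
    "ModelName": "Turkish",
    "Tenant": "Default",
    "TextManipType": "Standard",   # custom entities require Standard or Aggressive
    "Authorization": "Bearer <authorization_token>",
}

def build_parameters(label_classes):
    """Build the JSON string for the `parameters` form field, following the
    @<labelclass> naming convention shown in the ParsingList example."""
    parsing_list = [
        {"LabelClass": lc, "Name": "@" + lc.lower()} for lc in label_classes
    ]
    return json.dumps({"NER": {"ParsingList": parsing_list}})

parameters = build_parameters(["PhoneNumber", "Iban"])
# The multipart request would then send:
#   --form 'audio=@"/path/to/audio/file.wav"'
#   --form 'parameters=<the JSON string above>'
print(parameters)
```

Keeping the ParsingList as a generated JSON string avoids the escaping mistakes that are easy to make when hand-writing the quoted form value in curl.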

Result Accuracy

Text Normalization and NER can be applied using either plain text or word-by-word input.

Plain text input

The text is processed as a whole, without considering the pauses between words during speech. This can lead to slightly inaccurate results depending on the entity being matched.

Word-by-word input

Each word is processed individually, and the pause time between words during speech is taken into account when detecting entities. This provides more accurate results, particularly for phone numbers and serial numbers where digit grouping matters.

How to implement word-by-word TN/NER

Set the ProduceNBestList header to true in POST dictation/request. This produces detailed per-word information in the response via the RecognizedWords parameter.

Name            Description
Word            The recognized word from the speech input
StartTimeMsec   Start time of the word in milliseconds
EndTimeMsec     End time of the word in milliseconds
Confidence      Confidence value of the recognized result (percentage)
WordType        Type of the word: Normal, Filler, Suffix, or Prefix

This enables extracting entities word-by-word using the TextManip POST denormalize/by-words method.

The denormalize/by-words method includes a PauseThreshold: 1000 parameter. This controls how consecutive words are grouped:

  • If StartTimeMsec of Word_2 − EndTimeMsec of Word_1 < 1000 ms → the two words form a single number
  • If ≥ 1000 ms → they are treated as two separate numbers
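The grouping rule above can be sketched in a few lines of Python. The service applies this logic server-side; the code below only illustrates the rule, assuming per-word entries shaped like the RecognizedWords fields (word, startTimeMsec, endTimeMsec) and using invented timings.

```python
def group_by_pause(words, pause_threshold_msec=1000):
    """Group consecutive words into one number when the pause between
    them is shorter than pause_threshold_msec."""
    groups = []
    for w in words:
        if groups and w["startTimeMsec"] - groups[-1][-1]["endTimeMsec"] < pause_threshold_msec:
            groups[-1].append(w)   # short pause: continue the current number
        else:
            groups.append([w])     # long pause: start a new number
    return [" ".join(w["word"] for w in g) for g in groups]

# Hypothetical timings: a long pause separates "beş yüz otuz" from "beş yüz".
words = [
    {"word": "beş",  "startTimeMsec": 0,    "endTimeMsec": 300},
    {"word": "yüz",  "startTimeMsec": 400,  "endTimeMsec": 700},
    {"word": "otuz", "startTimeMsec": 800,  "endTimeMsec": 1100},
    {"word": "beş",  "startTimeMsec": 2400, "endTimeMsec": 2700},  # gap ≥ 1000 ms
    {"word": "yüz",  "startTimeMsec": 2800, "endTimeMsec": 3100},
]
print(group_by_pause(words))  # → ['beş yüz otuz', 'beş yüz']
```

With the default threshold the long gap splits the words into two numbers ("530" and "500..."); raising the threshold above the gap would merge them into one, which is exactly the difference between the two phone number readings in the example below.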

Example:

Input (spoken):                                "sıfır beş yüz otuz beş yüz kırk iki otuz üç seksen yedi"
Plain text output (ProduceNBestList: False):   "0 535 142 33 87"
Word-by-word output (ProduceNBestList: True):  "0 530 542 33 87"

API Reference

1. Pre-defined Entities

TextManipType: Standard

Request:

curl \
  --location --request POST '{server-url}/dictation/request' \
  --header 'Content-Type: audio/wave' \
  --header 'ModelName: English' \
  --header 'Tenant: Default' \
  --header 'TextManipType: Standard' \
  --header 'Authorization: Bearer <authorization_token>' \
  --form 'audio=@"/path/to/audio/file.wav"'

Response:

{
  "audioLink": "https://{{Address}}/audiopath.wav",
  "confidence": 0.97,
  "detectedAudioContent": "recognizable-speech",
  "errorCode": null,
  "errorMessage": null,
  "moreInfo": null,
  "resultText": "I sent 7 emails but only received a reply to one",
  "success": true
}

TextManipType: Aggressive

Request:

curl \
  --location --request POST '{server-url}/dictation/request' \
  --header 'Content-Type: audio/wave' \
  --header 'ModelName: English' \
  --header 'Tenant: Default' \
  --header 'TextManipType: Aggressive' \
  --header 'Authorization: Bearer <authorization_token>' \
  --form 'audio=@"/path/to/audio/file.wav"'

Response:

{
  "audioLink": "https://{{Address}}/audiopath.wav",
  "confidence": 0.97,
  "detectedAudioContent": "recognizable-speech",
  "errorCode": null,
  "errorMessage": null,
  "moreInfo": null,
  "resultText": "I sent 7 emails but only received a reply to 1",
  "success": true
}

TextManipType: None

Request:

curl \
  --location --request POST '{server-url}/dictation/request' \
  --header 'Content-Type: audio/wave' \
  --header 'ModelName: English' \
  --header 'Tenant: Default' \
  --header 'TextManipType: None' \
  --header 'Authorization: Bearer <authorization_token>' \
  --form 'audio=@"/path/to/audio/file.wav"'

Response:

{
  "audioLink": "https://{{Address}}/audiopath.wav",
  "confidence": 0.97,
  "detectedAudioContent": "recognizable-speech",
  "errorCode": null,
  "errorMessage": null,
  "moreInfo": null,
  "resultText": "I sent seven emails but only received a reply to one",
  "success": true
}

2. Custom Entities

parameters + TextManipType: None

Request:

curl \
  --location --request POST '{server-url}/dictation/request' \
  --header 'Content-Type: audio/wave' \
  --header 'ModelName: Turkish' \
  --header 'Tenant: Default' \
  --header 'TextManipType: None' \
  --header 'Authorization: Bearer <authorization_token>' \
  --form 'audio=@"/path/to/audio/file.wav"' \
  --form 'parameters="{
    \"NER\": {
      \"ParsingList\": [
        { \"LabelClass\": \"PhoneNumber\", \"Name\": \"phonenumber\" }
      ]
    }
  }"'

Response:

{
  "resultText": "sıfır beş yüz otuz beş yüz kırk iki otuz üç seksen yedi",
  "confidence": 0.97,
  "success": true
}

parameters + TextManipType: Standard

Request:

curl \
  --location --request POST '{server-url}/dictation/request' \
  --header 'Content-Type: audio/wave' \
  --header 'ModelName: Turkish' \
  --header 'Tenant: Default' \
  --header 'TextManipType: Standard' \
  --header 'Authorization: Bearer <authorization_token>' \
  --form 'audio=@"/path/to/audio/file.wav"' \
  --form 'parameters="{
    \"NER\": {
      \"ParsingList\": [
        { \"LabelClass\": \"PhoneNumber\", \"Name\": \"phonenumber\" }
      ]
    }
  }"'

Response:

{
  "resultText": "sıfır 535 142 33 87",
  "confidence": 0.97,
  "success": true
}

parameters + TextManipType: Aggressive

Request:

curl \
  --location --request POST '{server-url}/dictation/request' \
  --header 'Content-Type: audio/wave' \
  --header 'ModelName: Turkish' \
  --header 'Tenant: Default' \
  --header 'TextManipType: Aggressive' \
  --header 'Authorization: Bearer <authorization_token>' \
  --form 'audio=@"/path/to/audio/file.wav"' \
  --form 'parameters="{
    \"NER\": {
      \"ParsingList\": [
        { \"LabelClass\": \"PhoneNumber\", \"Name\": \"phonenumber\" }
      ]
    }
  }"'

Response:

{
  "resultText": "0 535 142 33 87",
  "confidence": 0.97,
  "success": true
}

parameters + TextManipType: Aggressive + ProduceNBestList: True

Request:

curl \
  --location --request POST '{server-url}/dictation/request' \
  --header 'Content-Type: audio/wave' \
  --header 'ModelName: Turkish' \
  --header 'Tenant: Default' \
  --header 'TextManipType: Aggressive' \
  --header 'ProduceNBestList: true' \
  --header 'Authorization: Bearer <authorization_token>' \
  --form 'audio=@"/path/to/audio/file.wav"' \
  --form 'parameters="{
    \"NER\": {
      \"ParsingList\": [
        { \"LabelClass\": \"PhoneNumber\", \"Name\": \"phonenumber\" }
      ]
    }
  }"'

Response:

{
  "audioLink": "https://{{Address}}/audiopath.wav",
  "confidence": 0.97,
  "detectedAudioContent": "recognizable-speech",
  "errorCode": null,
  "errorMessage": null,
  "moreInfo": null,
  "nbestlist": {
    "utterances": [
      {
        "confidence": 94,
        "nlsmlResult": "",
        "recognizedWords": [
          {
            "confidence": 97,
            "endTimeMsec": 4060,
            "startTimeMsec": 0,
            "word": "0 530 542 33 87",
            "wordType": "phonenumber"
          }
        ]
      }
    ]
  },
  "resultText": "0 530 542 33 87",
  "success": true
}
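As a sketch, entities can be pulled out of such a response with a few lines of Python. The field names follow the sample response above (abbreviated here); extract_entities is an illustrative helper, not part of the API.

```python
import json

# A trimmed version of the ProduceNBestList response shown above.
response = json.loads("""
{
  "confidence": 0.97,
  "nbestlist": {
    "utterances": [
      {
        "confidence": 94,
        "recognizedWords": [
          {"confidence": 97, "startTimeMsec": 0, "endTimeMsec": 4060,
           "word": "0 530 542 33 87", "wordType": "phonenumber"}
        ]
      }
    ]
  },
  "resultText": "0 530 542 33 87",
  "success": true
}
""")

def extract_entities(resp, word_type):
    """Collect recognized words of a given wordType across all utterances
    in the n-best list."""
    return [
        w["word"]
        for u in resp.get("nbestlist", {}).get("utterances", [])
        for w in u.get("recognizedWords", [])
        if w.get("wordType") == word_type
    ]

print(extract_entities(response, "phonenumber"))  # → ['0 530 542 33 87']
```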

Summary

By supporting both pre-defined and custom entity configurations alongside multiple TextManipType modes, SESTEK SR enables precise control over how spoken expressions are interpreted, normalized, or preserved. The availability of plain text and word-by-word processing ensures that applications can achieve the right balance between semantic correctness and numeric precision.