Text Normalization

Introduction

Text Normalization (TN) is a critical component of speech recognition systems. Entities are extracted through Named Entity Recognition (NER), and raw transcriptions are transformed into more readable, semantically accurate text by converting spoken input into the standardized format of the matching entity. A key challenge in this process is handling numbers appropriately, so that the transcription aligns with the user's intent and context.

For example:

  • The phrase "one step at a time" should remain as "one step at a time", not "1 step at a time".
  • Similarly, "two wrongs don't make a right" should not be transcribed as "2 wrongs don't make a right" in conversational contexts.

To address this, our system offers two alternative approaches that implement TN/NER directly through TextManip service integration: using Pre-defined Entities, or using Custom Entities defined by the user.


1. TN/NER with Pre-defined Entities

  • Can be implemented using the "TextManipType" header parameter's Standard and Aggressive modes in POST dictation/request.
  • When used, a request with the pre-defined entities "ParsingList" is sent to the TextManip service for Text Normalization/Named Entity Recognition.
  • Must be added to the Header parameters of POST dictation/request along with the Request Header Fields as the following:

Key             Description
TextManipType   One of: Standard, Aggressive, None.

If not specified, "Standard" mode is implemented by default for the "TextManipType" parameter.
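As an illustration, these header fields can be assembled before sending the request. The sketch below is a hypothetical helper (not part of the API); the model name, tenant, and token values are placeholders mirroring the request examples in the API Documentation section:

```python
def dictation_headers(text_manip_type: str = "Standard") -> dict:
    """Build the header fields for a POST dictation/request call.

    Placeholder values ("English", "Default", the bearer token) should be
    replaced with your deployment's actual settings.
    """
    if text_manip_type not in ("Standard", "Aggressive", "None"):
        raise ValueError(f"Unsupported TextManipType: {text_manip_type}")
    return {
        "Content-Type": "audio/wave",
        "ModelName": "English",
        "Tenant": "Default",
        "TextManipType": text_manip_type,
        "Authorization": "Bearer <authorization_token>",
    }

# Omitting the argument matches the service default ("Standard").
headers = dictation_headers()
```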

1.1. Pre-defined Entities:

Entity LabelClass   Description
Number              Detects numerical values in any format.
SrDate              Detects date expressions, excluding relative terms (e.g., "today", "tomorrow", "Friday", "yesterday").
Currency            Detects monetary values, including amounts paired with currency symbols or names.
Time                Detects clock times in different formats (e.g., "10:30 AM", "22:15").
SerialNumber        Detects structured numerical sequences representing serial numbers.

1.2. "TextManipType" Modes:

- Standard Mode

  • Converts numeric expressions only when explicitly numerical.
  • Retains non-literal or contextual phrases as-is, ensuring semantic accuracy.
TextManip Output Usage

Standard mode uses TextManip's "text" and "word" output parameters as the result. Check Contextual Numerical Expressions Handling for more details.

Examples

Input (Spoken)                       Output (Transcribed)
"one can never know the answer"      "one can never know the answer"
"two heads are better than one"      "two heads are better than one"
"i have one last question for you"   "i have one last question for you"

- Aggressive Mode

  • Converts all potential numbers into numerical format, prioritizing numeric representation over contextual nuance.
TextManip Output Usage

Aggressive mode uses TextManip's "displayText" and "displayWord" output parameters as the result. Check NER Methods for more details.

Examples

Input (Spoken)                                   Output (Transcribed)
"one can never know the answer"                  "1 can never know the answer"
"two heads are better than one"                  "2 heads are better than 1"
"three days ago, we had a meeting about this"    "3 days ago we had a meeting about this"
"i have one last question for you"               "i have 1 last question for you"

- None Mode

  • Disables text normalization for number conversion entirely.
  • Ensures that all transcriptions preserve the exact spoken format without any changes to numeric expressions.

Examples

Input (Spoken)                      Output (Transcribed)
"i will be there in three hours"    "i will be there in three hours"
"room number one is ready"          "room number one is ready"
Info

This mode is useful for scenarios where raw transcription without modification is required.


2. TN/NER with Custom Entities

  • Can be implemented using the "parameters" Body parameter in form-data.
  • Must be added to the Body parameters of POST dictation/request along with the audio file as the following:

Key          Type   Description
audio        File   The audio file to be transcribed (e.g., payment.wav).
parameters   Text   JSON object specifying the NER ParsingList.

ParsingList example:

{
  "NER": {
    "ParsingList": [
      { "LabelClass": "SrDate", "Name": "@srdate" },
      { "LabelClass": "Time", "Name": "@time" },
      { "LabelClass": "Iban", "Name": "@iban" },
      { "LabelClass": "Tckn", "Name": "@tckn" },
      { "LabelClass": "PhoneNumber", "Name": "@phonenumber" }
    ]
  }
}
Attention!
  • "parameters" works with the "TextManipType" Standard and Aggressive modes.
  • If the "TextManipType" mode is "None", "parameters" is automatically disabled and no TN/NER will be applied.
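Since "parameters" is sent as a JSON string inside the form data, it can be convenient to build it programmatically. A minimal Python sketch (the `build_parameters` helper is hypothetical, not part of the API):

```python
import json

def build_parameters(parsing_list: list) -> str:
    """Serialize an NER ParsingList into the 'parameters' form field.

    Each entry pairs a LabelClass (e.g., "PhoneNumber") with a
    caller-chosen entity tag in "Name".
    """
    return json.dumps({"NER": {"ParsingList": parsing_list}})

# Example: request date and phone-number entity extraction.
parameters = build_parameters([
    {"LabelClass": "SrDate", "Name": "@srdate"},
    {"LabelClass": "PhoneNumber", "Name": "@phonenumber"},
])
```

The resulting string is what goes into the `parameters` form field of the multipart request, alongside the audio file.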

Result Accuracy

Text Normalization/NER can be implemented with plain text input or word-by-word input.

1. Plain text Input

In plain text input, the text is processed as a whole without considering the pauses between words during speech. This can lead to slightly inaccurate results depending on the matching entity.

2. Word-by-word Input

In the word-by-word input method, each word is processed individually, and the pause time between each word during speech is taken into account while implementing NER and detecting entities. This provides more accurate results.

How to Implement word-by-word TN/NER?

  • By enabling the "ProduceNBestList: True" parameter in POST dictation/request, detailed information about each recognized word is produced in the response under the "RecognizedWords" parameter, as the following:
Name            Description
Word            The recognized word from the speech input.
StartTimeMsec   The start time of the utterance in milliseconds.
EndTimeMsec     The end time of the utterance in milliseconds.
Confidence      The confidence value of the recognized result, as a percentage.
WordType        The type of the word (e.g., Normal, Filler, Suffix, Prefix).
  • This allows extracting entities from word-by-word input rather than plain text input, using the TextManip POST "denormalize/by-words" method.
  • The POST "denormalize/by-words" method includes a "PauseThreshold: 1000" parameter.
  • This threshold controls the time elapsed between consecutive words: if the pause is less than the 1000 ms threshold, the two number words are treated as parts of a single number; otherwise, they are treated as two separate numbers.
  • In other words, if StartTimeMsec of Word_2 - EndTimeMsec of Word_1 < 1000, then Word_1 + Word_2 form a numerical expression of one single number.
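The threshold rule above can be sketched as follows. This is an illustrative reimplementation of the grouping idea, not the TextManip service's actual logic; the word dictionaries mimic the "RecognizedWords" fields:

```python
def group_by_pause(words, threshold_msec=1000):
    """Group consecutive words whose inter-word pause is below threshold_msec.

    Each word is a dict with "Word", "StartTimeMsec", and "EndTimeMsec",
    mirroring the RecognizedWords response fields.
    """
    groups = []
    for word in words:
        if groups and word["StartTimeMsec"] - groups[-1][-1]["EndTimeMsec"] < threshold_msec:
            groups[-1].append(word)  # pause < threshold: same number continues
        else:
            groups.append([word])    # pause >= threshold: a new number starts
    return [[w["Word"] for w in g] for g in groups]

# Illustrative timings: the first three words flow together, then a
# 1200 ms pause precedes the final "beş".
words = [
    {"Word": "beş",  "StartTimeMsec": 0,    "EndTimeMsec": 300},
    {"Word": "yüz",  "StartTimeMsec": 500,  "EndTimeMsec": 800},
    {"Word": "otuz", "StartTimeMsec": 900,  "EndTimeMsec": 1200},
    {"Word": "beş",  "StartTimeMsec": 2400, "EndTimeMsec": 2700},
]
# "beş yüz otuz" groups into one number (530); the last "beş" starts a
# new group because its pause exceeds the 1000 ms threshold.
```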

Examples

Input (Spoken): "sıfır beş yüz otuz beş yüz kırk iki otuz üç seksen yedi"

"Plain" Text Output (ProduceNBestList: False):         "0 535 142 33 87"
"Word-by-word" Text Output (ProduceNBestList: True):   "0 530 542 33 87"

With plain text input, the parser greedily reads "beş yüz otuz beş" as 535, leaving "yüz kırk iki" as 142. With word-by-word input, the pause after "otuz" splits the sequence into "beş yüz otuz" (530) and "beş yüz kırk iki" (542), matching the spoken phone number.

API Documentation:

1. Text Normalization/NER with Pre-defined Entities

TextManipType: Standard Mode

Request Example (cURL)

curl \
--location --request POST '{server-url}/dictation/request' \
--header 'Content-Type: audio/wave' \
--header 'ModelName: English' \
--header 'Tenant: Default' \
--header 'TextManipType: Standard' \
--header 'Authorization: Bearer <authorization_token>' \
--form 'audio=@"/path/to/audio/file.wav"'

Response Example

{
  "audioLink": "https://{{Address}}/audiopath.wav",
  "confidence": 0.97,
  "detectedAudioContent": "recognizable-speech",
  "errorCode": null,
  "errorMessage": null,
  "moreInfo": null,
  "resultText": "I sent 7 emails but only received a reply to one",
  "success": true
}

TextManipType: Aggressive Mode

Request Example (cURL)

curl \
--location --request POST '{server-url}/dictation/request' \
--header 'Content-Type: audio/wave' \
--header 'ModelName: English' \
--header 'Tenant: Default' \
--header 'TextManipType: Aggressive' \
--header 'Authorization: Bearer <authorization_token>' \
--form 'audio=@"/path/to/audio/file.wav"'

Response Example

{
  "audioLink": "https://{{Address}}/audiopath.wav",
  "confidence": 0.97,
  "detectedAudioContent": "recognizable-speech",
  "errorCode": null,
  "errorMessage": null,
  "moreInfo": null,
  "resultText": "I sent 7 emails but only received a reply to 1",
  "success": true
}

TextManipType: None Mode

Request Example (cURL)

curl \
--location --request POST '{server-url}/dictation/request' \
--header 'Content-Type: audio/wave' \
--header 'ModelName: English' \
--header 'Tenant: Default' \
--header 'TextManipType: None' \
--header 'Authorization: Bearer <authorization_token>' \
--form 'audio=@"/path/to/audio/file.wav"'

Response Example

{
  "audioLink": "https://{{Address}}/audiopath.wav",
  "confidence": 0.97,
  "detectedAudioContent": "recognizable-speech",
  "errorCode": null,
  "errorMessage": null,
  "moreInfo": null,
  "resultText": "I sent seven emails but only received a reply to one",
  "success": true
}

2. Text Normalization/NER with Custom Entities

"parameters" + "TextManipType:" None

Request Example (cURL)

curl \
--location --request POST '{server-url}/dictation/request' \
--header 'Content-Type: audio/wave' \
--header 'ModelName: Turkish' \
--header 'Tenant: Default' \
--header 'TextManipType: None' \
--header 'Authorization: Bearer <authorization_token>' \
--form 'audio=@"/path/to/audio/file.wav"' \
--form 'parameters="{
    \"NER\":
      {
        \"ParsingList\":
        [
            {
                \"LabelClass\": \"PhoneNumber\",
                \"Name\": \"phonenumber\"
            }
        ]
      }
    }"'

Response Example

{
  "audioLink": "https://{{Address}}/audiopath.wav",
  "confidence": 0.97,
  "detectedAudioContent": "recognizable-speech",
  "errorCode": null,
  "errorMessage": null,
  "moreInfo": null,
  "resultText": "sıfır beş yüz otuz beş yüz kırk iki otuz üç seksen yedi",
  "success": true
}

"parameters" + "TextManipType": Standard

Request Example (cURL)

curl \
--location --request POST '{server-url}/dictation/request' \
--header 'Content-Type: audio/wave' \
--header 'ModelName: Turkish' \
--header 'Tenant: Default' \
--header 'TextManipType: Standard' \
--header 'Authorization: Bearer <authorization_token>' \
--form 'audio=@"/path/to/audio/file.wav"' \
--form 'parameters="{
    \"NER\":
      {
        \"ParsingList\":
        [
            {
                \"LabelClass\": \"PhoneNumber\",
                \"Name\": \"phonenumber\"
            }
        ]
      }
    }"'

Response Example

{
  "audioLink": "https://{{Address}}/audiopath.wav",
  "confidence": 0.97,
  "detectedAudioContent": "recognizable-speech",
  "errorCode": null,
  "errorMessage": null,
  "moreInfo": null,
  "resultText": "sıfır 535 142 33 87",
  "success": true
}

"parameters" + "TextManipType": Aggressive

Request Example (cURL)

curl \
--location --request POST '{server-url}/dictation/request' \
--header 'Content-Type: audio/wave' \
--header 'ModelName: Turkish' \
--header 'Tenant: Default' \
--header 'TextManipType: Aggressive' \
--header 'Authorization: Bearer <authorization_token>' \
--form 'audio=@"/path/to/audio/file.wav"' \
--form 'parameters="{
    \"NER\":
      {
        \"ParsingList\":
        [
            {
                \"LabelClass\": \"PhoneNumber\",
                \"Name\": \"phonenumber\"
            }
        ]
      }
    }"'

Response Example

{
  "audioLink": "https://{{Address}}/audiopath.wav",
  "confidence": 0.97,
  "detectedAudioContent": "recognizable-speech",
  "errorCode": null,
  "errorMessage": null,
  "moreInfo": null,
  "resultText": "0 535 142 33 87",
  "success": true
}

"parameters" + "TextManipType": Aggressive + "ProduceNBestList": True

Request Example (cURL)

curl \
--location --request POST '{server-url}/dictation/request' \
--header 'Content-Type: audio/wave' \
--header 'ModelName: Turkish' \
--header 'Tenant: Default' \
--header 'TextManipType: Aggressive' \
--header 'ProduceNBestList: true' \
--header 'Authorization: Bearer <authorization_token>' \
--form 'audio=@"/path/to/audio/file.wav"' \
--form 'parameters="{
    \"NER\":
      {
        \"ParsingList\":
        [
            {
                \"LabelClass\": \"PhoneNumber\",
                \"Name\": \"phonenumber\"
            }
        ]
      }
    }"'

Response Example

{
  "audioLink": "https://{{Address}}/audiopath.wav",
  "confidence": 0.97,
  "detectedAudioContent": "recognizable-speech",
  "errorCode": null,
  "errorMessage": null,
  "moreInfo": null,
  "nbestlist": {
        "utterances": [
            {
                "confidence": 94,
                "nlsmlResult": "",
                "recognizedWords": [
                    {
                        "confidence": 97,
                        "endTimeMsec": 4060,
                        "startTimeMsec": 0,
                        "word": "0 530 542 33 87",
                        "wordType": "phonenumber"
                    }
                ]
            }
        ]
    },
  "resultText": "0 530 542 33 87",
  "success": true
}

Conclusion

By supporting both pre-defined and custom entity configurations, along with multiple TextManipType modes, our speech recognition system enables precise control over how spoken expressions are interpreted, normalized, or preserved.

The availability of plain text and word-by-word processing approaches allows the system to adapt to varying accuracy requirements. This flexibility ensures that applications can achieve the desired balance between semantic correctness and numeric precision.