Text Normalization

Text Normalization (TN) is a critical component of speech recognition systems. It converts raw spoken transcriptions into a standardized written form, producing text that is more readable and semantically accurate. Named Entity Recognition (NER) then extracts entities such as numbers, dates, currencies, and phone numbers from the recognized text.

One of the key challenges in this process is handling numbers appropriately: the transcription must align with the user's intent and context.

For example:

  • "one step at a time" should remain as "one step at a time", not "1 step at a time"
  • "two wrongs don't make a right" should not be transcribed as "2 wrongs don't make a right" in conversational contexts

SESTEK SR addresses this through two approaches: pre-defined entities and custom entities, both integrated via the TextManip service.


1. TN/NER with Pre-defined Entities

Pre-defined entity support is enabled using the TextManipType header parameter in POST dictation/request. When used, a request with the pre-defined ParsingList is sent to the TextManip service for Text Normalization and Named Entity Recognition.

Key             Description
TextManipType   Standard · Aggressive · None

If not specified, Standard mode is used by default.

1.1. Pre-defined Entity Types

LabelClass     Description
Number         Detects numerical values in any format
SrDate         Detects date expressions, excluding relative terms like "today", "tomorrow", "yesterday"
Currency       Detects monetary values, including amounts paired with currency symbols or names
Time           Detects clock times in different formats (e.g. "10:30 AM", "22:15")
SerialNumber   Detects structured numerical sequences representing serial numbers

1.2. TextManipType Modes

Standard

Converts numeric expressions only when explicitly numerical. Retains non-literal or contextual phrases as-is, ensuring semantic accuracy.

Standard mode uses TextManip's text and word output parameters as the result. See Contextual Numerical Expressions Handling for more details.

Examples:

Input (spoken)                        Output (transcribed)
"one can never know the answer"       "one can never know the answer"
"two heads are better than one"       "two heads are better than one"
"i have one last question for you"    "i have one last question for you"

Aggressive

Converts all potential numbers into numerical format, prioritizing numeric representation over contextual nuance.

Aggressive mode uses TextManip's displayText and displayWord output parameters as the result. See NER Methods for more details.

Examples:

Input (spoken)                                  Output (transcribed)
"one can never know the answer"                 "1 can never know the answer"
"two heads are better than one"                 "2 heads are better than 1"
"three days ago, we had a meeting about this"   "3 days ago we had a meeting about this"
"i have one last question for you"              "i have 1 last question for you"

None

Disables text normalization entirely. All transcriptions preserve the exact spoken format without any changes to numeric expressions.

This mode is useful for scenarios where raw transcription without modification is required.

Examples:

Input (spoken)                       Output (transcribed)
"i will be there in three hours"     "i will be there in three hours"
"room number one is ready"           "room number one is ready"

2. TN/NER with Custom Entities

Custom entity support is enabled via the parameters field in the form-data body of POST dictation/request.

Key          Type   Description
audio        File   The audio file to be transcribed (e.g. payment.wav)
parameters   Text   JSON object specifying the NER ParsingList

ParsingList example:

{
  "NER": {
    "ParsingList": [
      { "LabelClass": "SrDate",       "Name": "@srdate" },
      { "LabelClass": "Time",         "Name": "@time" },
      { "LabelClass": "Iban",         "Name": "@iban" },
      { "LabelClass": "Tckn",         "Name": "@tckn" },
      { "LabelClass": "PhoneNumber",  "Name": "@phonenumber" }
    ]
  }
}
Attention
  • parameters works with TextManipType Standard and Aggressive modes only.
  • If TextManipType is set to None, the parameters field is automatically disabled and no TN/NER will be applied.
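As a minimal sketch, a client can assemble the parameters form field before attaching it to the multipart request. The header and field names below are taken from the dictation/request examples in this document; the server URL and token are placeholders, and build_parameters is an illustrative helper, not part of the API.

```python
import json

# Placeholder values; replace with your actual server URL and token.
SERVER_URL = "https://sr.example.com"

# Headers controlling TN/NER mode, as used by POST dictation/request.
headers = {
    "Content-Type": "audio/wave",
    "ModelName": "Turkish",
    "Tenant": "Default",
    "TextManipType": "Standard",   # custom entities require Standard or Aggressive
    "Authorization": "Bearer <authorization_token>",
}

def build_parameters(label_classes):
    """Build the JSON string for the `parameters` form field, following the
    @<labelclass> naming convention shown in the ParsingList example."""
    parsing_list = [
        {"LabelClass": lc, "Name": "@" + lc.lower()} for lc in label_classes
    ]
    return json.dumps({"NER": {"ParsingList": parsing_list}})

parameters = build_parameters(["PhoneNumber", "Iban"])
# The multipart request would then send:
#   --form 'audio=@"/path/to/audio/file.wav"'
#   --form 'parameters=<the JSON string above>'
print(parameters)
```

Keeping the ParsingList as a generated JSON string avoids the escaping mistakes that are easy to make when hand-writing the quoted form value in curl.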

Result Accuracy

Text Normalization and NER can be applied using either plain text or word-by-word input.

Plain text input

The text is processed as a whole, without considering the pauses between words during speech. This can lead to slightly inaccurate results depending on the entity being matched.

Word-by-word input

Each word is processed individually, and the pause time between words during speech is taken into account when detecting entities. This provides more accurate results, particularly for phone numbers and serial numbers where digit grouping matters.

How to implement word-by-word TN/NER

Set the ProduceNBestList header to true in POST dictation/request. This produces detailed per-word information in the response via the RecognizedWords parameter.

Name            Description
Word            The recognized word from the speech input
StartTimeMsec   Start time of the word in milliseconds
EndTimeMsec     End time of the word in milliseconds
Confidence      Confidence value of the recognized result (percentage)
WordType        Type of the word: Normal, Filler, Suffix, or Prefix

This enables extracting entities word-by-word using the TextManip POST denormalize/by-words method.

The denormalize/by-words method includes a PauseThreshold: 1000 parameter. This controls how consecutive words are grouped:

  • If StartTimeMsec of Word_2 − EndTimeMsec of Word_1 < 1000 ms → the two words form a single number
  • If ≥ 1000 ms → they are treated as two separate numbers
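The grouping rule above can be sketched in a few lines of Python. The service applies this logic server-side; the code below only illustrates the rule, assuming per-word entries shaped like the RecognizedWords fields (word, startTimeMsec, endTimeMsec) and using invented timings.

```python
def group_by_pause(words, pause_threshold_msec=1000):
    """Group consecutive words into one number when the pause between
    them is shorter than pause_threshold_msec."""
    groups = []
    for w in words:
        if groups and w["startTimeMsec"] - groups[-1][-1]["endTimeMsec"] < pause_threshold_msec:
            groups[-1].append(w)   # short pause: continue the current number
        else:
            groups.append([w])     # long pause: start a new number
    return [" ".join(w["word"] for w in g) for g in groups]

# Hypothetical timings: a long pause separates "beş yüz otuz" from "beş yüz".
words = [
    {"word": "beş",  "startTimeMsec": 0,    "endTimeMsec": 300},
    {"word": "yüz",  "startTimeMsec": 400,  "endTimeMsec": 700},
    {"word": "otuz", "startTimeMsec": 800,  "endTimeMsec": 1100},
    {"word": "beş",  "startTimeMsec": 2400, "endTimeMsec": 2700},  # gap ≥ 1000 ms
    {"word": "yüz",  "startTimeMsec": 2800, "endTimeMsec": 3100},
]
print(group_by_pause(words))  # → ['beş yüz otuz', 'beş yüz']
```

With the default threshold the long gap splits the words into two numbers ("530" and "500..."); raising the threshold above the gap would merge them into one, which is exactly the difference between the two phone number readings in the example below.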

Example:

Input (spoken):                                "sıfır beş yüz otuz beş yüz kırk iki otuz üç seksen yedi"
Plain text output (ProduceNBestList: False):   "0 535 142 33 87"
Word-by-word output (ProduceNBestList: True):  "0 530 542 33 87"

API Reference

1. Pre-defined Entities

TextManipType: Standard

Request:

curl \
  --location --request POST '{server-url}/dictation/request' \
  --header 'Content-Type: audio/wave' \
  --header 'ModelName: English' \
  --header 'Tenant: Default' \
  --header 'TextManipType: Standard' \
  --header 'Authorization: Bearer <authorization_token>' \
  --form 'audio=@"/path/to/audio/file.wav"'

Response:

{
  "audioLink": "https://{{Address}}/audiopath.wav",
  "confidence": 0.97,
  "detectedAudioContent": "recognizable-speech",
  "errorCode": null,
  "errorMessage": null,
  "moreInfo": null,
  "resultText": "I sent 7 emails but only received a reply to one",
  "success": true
}

TextManipType: Aggressive

Request:

curl \
  --location --request POST '{server-url}/dictation/request' \
  --header 'Content-Type: audio/wave' \
  --header 'ModelName: English' \
  --header 'Tenant: Default' \
  --header 'TextManipType: Aggressive' \
  --header 'Authorization: Bearer <authorization_token>' \
  --form 'audio=@"/path/to/audio/file.wav"'

Response:

{
  "audioLink": "https://{{Address}}/audiopath.wav",
  "confidence": 0.97,
  "detectedAudioContent": "recognizable-speech",
  "errorCode": null,
  "errorMessage": null,
  "moreInfo": null,
  "resultText": "I sent 7 emails but only received a reply to 1",
  "success": true
}

TextManipType: None

Request:

curl \
  --location --request POST '{server-url}/dictation/request' \
  --header 'Content-Type: audio/wave' \
  --header 'ModelName: English' \
  --header 'Tenant: Default' \
  --header 'TextManipType: None' \
  --header 'Authorization: Bearer <authorization_token>' \
  --form 'audio=@"/path/to/audio/file.wav"'

Response:

{
  "audioLink": "https://{{Address}}/audiopath.wav",
  "confidence": 0.97,
  "detectedAudioContent": "recognizable-speech",
  "errorCode": null,
  "errorMessage": null,
  "moreInfo": null,
  "resultText": "I sent seven emails but only received a reply to one",
  "success": true
}

2. Custom Entities

parameters + TextManipType: None

Request:

curl \
  --location --request POST '{server-url}/dictation/request' \
  --header 'Content-Type: audio/wave' \
  --header 'ModelName: Turkish' \
  --header 'Tenant: Default' \
  --header 'TextManipType: None' \
  --header 'Authorization: Bearer <authorization_token>' \
  --form 'audio=@"/path/to/audio/file.wav"' \
  --form 'parameters="{
    \"NER\": {
      \"ParsingList\": [
        { \"LabelClass\": \"PhoneNumber\", \"Name\": \"phonenumber\" }
      ]
    }
  }"'

Response:

{
  "resultText": "sıfır beş yüz otuz beş yüz kırk iki otuz üç seksen yedi",
  "confidence": 0.97,
  "success": true
}

parameters + TextManipType: Standard

Request:

curl \
  --location --request POST '{server-url}/dictation/request' \
  --header 'Content-Type: audio/wave' \
  --header 'ModelName: Turkish' \
  --header 'Tenant: Default' \
  --header 'TextManipType: Standard' \
  --header 'Authorization: Bearer <authorization_token>' \
  --form 'audio=@"/path/to/audio/file.wav"' \
  --form 'parameters="{
    \"NER\": {
      \"ParsingList\": [
        { \"LabelClass\": \"PhoneNumber\", \"Name\": \"phonenumber\" }
      ]
    }
  }"'

Response:

{
  "resultText": "sıfır 535 142 33 87",
  "confidence": 0.97,
  "success": true
}

parameters + TextManipType: Aggressive

Request:

curl \
  --location --request POST '{server-url}/dictation/request' \
  --header 'Content-Type: audio/wave' \
  --header 'ModelName: Turkish' \
  --header 'Tenant: Default' \
  --header 'TextManipType: Aggressive' \
  --header 'Authorization: Bearer <authorization_token>' \
  --form 'audio=@"/path/to/audio/file.wav"' \
  --form 'parameters="{
    \"NER\": {
      \"ParsingList\": [
        { \"LabelClass\": \"PhoneNumber\", \"Name\": \"phonenumber\" }
      ]
    }
  }"'

Response:

{
  "resultText": "0 535 142 33 87",
  "confidence": 0.97,
  "success": true
}

parameters + TextManipType: Aggressive + ProduceNBestList: True

Request:

curl \
  --location --request POST '{server-url}/dictation/request' \
  --header 'Content-Type: audio/wave' \
  --header 'ModelName: Turkish' \
  --header 'Tenant: Default' \
  --header 'TextManipType: Aggressive' \
  --header 'ProduceNBestList: true' \
  --header 'Authorization: Bearer <authorization_token>' \
  --form 'audio=@"/path/to/audio/file.wav"' \
  --form 'parameters="{
    \"NER\": {
      \"ParsingList\": [
        { \"LabelClass\": \"PhoneNumber\", \"Name\": \"phonenumber\" }
      ]
    }
  }"'

Response:

{
  "audioLink": "https://{{Address}}/audiopath.wav",
  "confidence": 0.97,
  "detectedAudioContent": "recognizable-speech",
  "errorCode": null,
  "errorMessage": null,
  "moreInfo": null,
  "nbestlist": {
    "utterances": [
      {
        "confidence": 94,
        "nlsmlResult": "",
        "recognizedWords": [
          {
            "confidence": 97,
            "endTimeMsec": 4060,
            "startTimeMsec": 0,
            "word": "0 530 542 33 87",
            "wordType": "phonenumber"
          }
        ]
      }
    ]
  },
  "resultText": "0 530 542 33 87",
  "success": true
}
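As a sketch, entities can be pulled out of such a response with a few lines of Python. The field names follow the sample response above (abbreviated here); extract_entities is an illustrative helper, not part of the API.

```python
import json

# A trimmed version of the ProduceNBestList response shown above.
response = json.loads("""
{
  "confidence": 0.97,
  "nbestlist": {
    "utterances": [
      {
        "confidence": 94,
        "recognizedWords": [
          {"confidence": 97, "startTimeMsec": 0, "endTimeMsec": 4060,
           "word": "0 530 542 33 87", "wordType": "phonenumber"}
        ]
      }
    ]
  },
  "resultText": "0 530 542 33 87",
  "success": true
}
""")

def extract_entities(resp, word_type):
    """Collect recognized words of a given wordType across all utterances
    in the n-best list."""
    return [
        w["word"]
        for u in resp.get("nbestlist", {}).get("utterances", [])
        for w in u.get("recognizedWords", [])
        if w.get("wordType") == word_type
    ]

print(extract_entities(response, "phonenumber"))  # → ['0 530 542 33 87']
```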

Summary

By supporting both pre-defined and custom entity configurations alongside multiple TextManipType modes, SESTEK SR enables precise control over how spoken expressions are interpreted, normalized, or preserved. The availability of plain text and word-by-word processing ensures that applications can achieve the right balance between semantic correctness and numeric precision.