Introduction
Text Normalization (TN) is a critical component of speech recognition systems. Entities are extracted through Named Entity Recognition (NER), and raw transcriptions are transformed into more readable and semantically accurate text by converting spoken input into the standardized format of the matching entity. A key challenge in this process is handling numbers appropriately, so that the transcription aligns with the user's intent and context.
For example:
- The phrase "one step at a time" should remain as "one step at a time", not "1 step at a time".
- Similarly, "two wrongs don't make a right" should not be transcribed as "2 wrongs don't make a right" in conversational contexts.
To address this, our system introduces two alternative approaches to implementing TN/NER through TextManip service integration: using Pre-defined Entities, or Custom Entities defined by the user.
1. TN/NER with Pre-defined Entities
- Can be implemented using the Standard and Aggressive modes of the "TextManipType" header parameter in dictation/request.
- When used, a request with the pre-defined entities "ParsingList" is sent to the TextManip service for Text Normalization/Named Entity Recognition.
- Must be added to the header parameters of POST dictation/request along with the standard request header fields, as follows:
| Key | Description |
|---|---|
| TextManipType | One of: Standard, Aggressive, None |

If not specified, the "TextManipType" parameter defaults to "Standard" mode.
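As a hedged sketch, the header set above can be assembled programmatically before sending the request. The model name, tenant, and token are placeholder values taken from the examples later in this document, and the helper name is hypothetical:

```python
# Sketch: assembling the header parameters for POST dictation/request.
# Header names follow the table above; the token is a placeholder.
def dictation_headers(text_manip_type=None):
    """Build the request headers.

    Omitting "TextManipType" leaves the service in its default
    "Standard" mode, so the helper mirrors that default here.
    """
    return {
        "Content-Type": "audio/wave",
        "ModelName": "English",
        "Tenant": "Default",
        "Authorization": "Bearer <authorization_token>",
        "TextManipType": text_manip_type or "Standard",
    }

print(dictation_headers("Aggressive")["TextManipType"])  # Aggressive
print(dictation_headers()["TextManipType"])              # Standard
```

The resulting dictionary can then be passed to any HTTP client when POSTing the audio file to `{server-url}/dictation/request`.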
1.1. Pre-defined Entities:
| Entity (LabelClass) | Description |
|---|---|
| Number | Detects numerical values in any format. |
| SrDate | Detects date expressions, excluding relative terms such as "today", "tomorrow", "yesterday", or weekday names like "Friday". |
| Currency | Detects monetary values including amounts paired with currency symbols or names. |
| Time | Detects clock times in different formats (e.g., "10:30 AM", "22:15"). |
| SerialNumber | Detects structured numerical sequences representing serial numbers. |
1.2. "TextManipType" Modes:
- Standard Mode
- Converts numeric expressions only when explicitly numerical.
- Retains non-literal or contextual phrases as-is, ensuring semantic accuracy.
Standard mode uses TextManip's "text" and "word" output parameters as its result. See Contextual Numerical Expressions Handling for more details.
Examples
| Input (Spoken) | Output (Transcribed) |
|---|---|
| "one can never know the answer" | "one can never know the answer" |
| "two heads are better than one" | "two heads are better than one" |
| "i have one last question for you" | "i have one last question for you" |
- Aggressive Mode
- Converts all potential numbers into numerical format, prioritizing numeric representation over contextual nuance.
Aggressive mode uses TextManip's "displayText" and "displayWord" output parameters as its result. See NER Methods for more details.
Examples
| Input (Spoken) | Output (Transcribed) |
|---|---|
| "one can never know the answer" | "1 can never know the answer" |
| "two heads are better than one" | "2 heads are better than 1" |
| "three days ago, we had a meeting about this" | "3 days ago we had a meeting about this" |
| "i have one last question for you" | "i have 1 last question for you" |
- None Mode
- Disables text normalization for number conversion entirely.
- Ensures that all transcriptions preserve the exact spoken format without any changes to numeric expressions.
Examples
| Input (Spoken) | Output (Transcribed) |
|---|---|
| "i will be there in three hours" | "i will be there in three hours" |
| "room number one is ready" | "room number one is ready" |
This mode is useful for scenarios where raw transcription without modification is required.
2. TN/NER with Custom Entities
- Can be implemented using the "parameters" body parameter in form-data.
- Must be added to the body parameters of POST dictation/request along with the audio file, as follows:
| Key | Type | Description |
|---|---|---|
| audio | File | The audio file to be transcribed. (e.g., payment.wav) |
| parameters | Text | JSON object specifying the NER ParsingList. |
ParsingList example:
```json
{
    "NER": {
        "ParsingList": [
            {"LabelClass": "SrDate", "Name": "@srdate"},
            {"LabelClass": "Time", "Name": "@time"},
            {"LabelClass": "Iban", "Name": "@iban"},
            {"LabelClass": "Tckn", "Name": "@tckn"},
            {"LabelClass": "PhoneNumber", "Name": "@phonenumber"}
        ]
    }
}
```
- "parameters" works with "TextManipType" Standard and Aggressive modes.
- If "TextManipType" mode is "None" the "parameters" is automatically disabled and no TN/NER will be implemented.
Result Accuracy
Text Normalization/NER can be implemented with plain text input or word-by-word input.
1. Plain text Input
In plain text input, the text is processed as a whole without considering the pauses between words during speech. This can lead to slightly inaccurate results depending on the matching entity.
2. Word-by-word Input
In the word-by-word input method, each word is processed individually, and the pause time between each word during speech is taken into account while implementing NER and detecting entities. This provides more accurate results.
How to Implement word-by-word TN/NER?
- By enabling "ProduceNBestList: True" parameter in POST dictation/request, detailed information about each recognized word are produced in the response with "RecognizedWords" parameter as the following:
| Name | Description |
|---|---|
| Word | The recognized word from the speech input. |
| StartTimeMsec | The start time of the utterance in milliseconds. |
| EndTimeMsec | The end time of the utterance in milliseconds. |
| Confidence | The confidence value of the recognized result in percentage. |
| WordType | The type of the word (e.g., Normal, Filler, Suffix, Prefix). |
- This allows extracting entities from word-by-word input rather than plain text input, using the TextManip POST "denormalize/by-words" method.
- The POST "denormalize/by-words" method includes a "PauseThreshold: 1000" parameter.
- This threshold controls the time elapsed between consecutive words: if the pause is less than the threshold value (1000 ms), the two numbers are perceived as a single number; otherwise, they are perceived as two separate numbers.
- In other words, if StartTimeMsec of Word_2 - EndTimeMsec of Word_1 < 1000, then Word_1 + Word_2 form the numerical expression of one single number.
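The threshold rule above can be sketched in a few lines. The grouping helper below is hypothetical (not part of the service API) and the word timings are illustrative; field names follow "RecognizedWords":

```python
# Sketch of the PauseThreshold rule: consecutive words are merged into one
# numeric expression when the silence between them is below the threshold.
PAUSE_THRESHOLD_MSEC = 1000

def group_by_pause(words, threshold=PAUSE_THRESHOLD_MSEC):
    """Group word dicts (word/startTimeMsec/endTimeMsec) by pause length."""
    groups = []
    for w in words:
        if groups and w["startTimeMsec"] - groups[-1][-1]["endTimeMsec"] < threshold:
            groups[-1].append(w)   # short pause: same numeric expression
        else:
            groups.append([w])     # long pause: start a new expression
    return [" ".join(w["word"] for w in g) for g in groups]

words = [
    {"word": "beş",  "startTimeMsec": 0,    "endTimeMsec": 400},
    {"word": "yüz",  "startTimeMsec": 600,  "endTimeMsec": 900},   # 200 ms pause
    {"word": "otuz", "startTimeMsec": 2100, "endTimeMsec": 2500},  # 1200 ms pause
]
print(group_by_pause(words))  # ['beş yüz', 'otuz']
```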
Examples
| Input (Spoken) | "Plain" Text Output (ProduceNBestList: False) | "word-by-word" Text Output (ProduceNBestList: True) |
|---|---|---|
| "sıfır beş yüz otuz beş yüz kırk iki otuz üç seksen yedi" | "0 535 142 33 87" | "0 530 542 33 87" |
API Documentation:
1. Text Normalization/NER with Pre-defined Entities
TextManipType: Standard Mode
Request Example (cURL)
```bash
curl \
  --location --request POST '{server-url}/dictation/request' \
  --header 'Content-Type: audio/wave' \
  --header 'ModelName: English' \
  --header 'Tenant: Default' \
  --header 'TextManipType: Standard' \
  --header 'Authorization: Bearer <authorization_token>' \
  --form 'audio=@"/path/to/audio/file.wav"'
```
Response Example
```json
{
    "audioLink": "https://{{Address}}/audiopath.wav",
    "confidence": 0.97,
    "detectedAudioContent": "recognizable-speech",
    "errorCode": null,
    "errorMessage": null,
    "moreInfo": null,
    "resultText": "I sent 7 emails but only received a reply to one",
    "success": true
}
```
TextManipType: Aggressive Mode
Request Example (cURL)
```bash
curl \
  --location --request POST '{server-url}/dictation/request' \
  --header 'Content-Type: audio/wave' \
  --header 'ModelName: English' \
  --header 'Tenant: Default' \
  --header 'TextManipType: Aggressive' \
  --header 'Authorization: Bearer <authorization_token>' \
  --form 'audio=@"/path/to/audio/file.wav"'
```
Response Example
```json
{
    "audioLink": "https://{{Address}}/audiopath.wav",
    "confidence": 0.97,
    "detectedAudioContent": "recognizable-speech",
    "errorCode": null,
    "errorMessage": null,
    "moreInfo": null,
    "resultText": "I sent 7 emails but only received a reply to 1",
    "success": true
}
```
TextManipType: None Mode
Request Example (cURL)
```bash
curl \
  --location --request POST '{server-url}/dictation/request' \
  --header 'Content-Type: audio/wave' \
  --header 'ModelName: English' \
  --header 'Tenant: Default' \
  --header 'TextManipType: None' \
  --header 'Authorization: Bearer <authorization_token>' \
  --form 'audio=@"/path/to/audio/file.wav"'
```
Response Example
```json
{
    "audioLink": "https://{{Address}}/audiopath.wav",
    "confidence": 0.97,
    "detectedAudioContent": "recognizable-speech",
    "errorCode": null,
    "errorMessage": null,
    "moreInfo": null,
    "resultText": "I sent seven emails but only received a reply to one",
    "success": true
}
```
2. Text Normalization/NER with Custom Entities
"parameters" + "TextManipType:" None
Request Example (cURL)
```bash
curl \
  --location --request POST '{server-url}/dictation/request' \
  --header 'Content-Type: audio/wave' \
  --header 'ModelName: Turkish' \
  --header 'Tenant: Default' \
  --header 'TextManipType: None' \
  --header 'Authorization: Bearer <authorization_token>' \
  --form 'audio=@"/path/to/audio/file.wav"' \
  --form 'parameters="{
    \"NER\":
    {
        \"ParsingList\":
        [
            {
                \"LabelClass\": \"PhoneNumber\",
                \"Name\": \"phonenumber\"
            }
        ]
    }
}"'
```
Response Example
```json
{
    "audioLink": "https://{{Address}}/audiopath.wav",
    "confidence": 0.97,
    "detectedAudioContent": "recognizable-speech",
    "errorCode": null,
    "errorMessage": null,
    "moreInfo": null,
    "resultText": "sıfır beş yüz otuz beş yüz kırk iki otuz üç seksen yedi",
    "success": true
}
```
"parameters" + "TextManipType": Standard
Request Example (cURL)
```bash
curl \
  --location --request POST '{server-url}/dictation/request' \
  --header 'Content-Type: audio/wave' \
  --header 'ModelName: Turkish' \
  --header 'Tenant: Default' \
  --header 'TextManipType: Standard' \
  --header 'Authorization: Bearer <authorization_token>' \
  --form 'audio=@"/path/to/audio/file.wav"' \
  --form 'parameters="{
    \"NER\":
    {
        \"ParsingList\":
        [
            {
                \"LabelClass\": \"PhoneNumber\",
                \"Name\": \"phonenumber\"
            }
        ]
    }
}"'
```
Response Example
```json
{
    "audioLink": "https://{{Address}}/audiopath.wav",
    "confidence": 0.97,
    "detectedAudioContent": "recognizable-speech",
    "errorCode": null,
    "errorMessage": null,
    "moreInfo": null,
    "resultText": "sıfır 535 142 33 87",
    "success": true
}
```
"parameters" + "TextManipType": Aggressive
Request Example (cURL)
```bash
curl \
  --location --request POST '{server-url}/dictation/request' \
  --header 'Content-Type: audio/wave' \
  --header 'ModelName: Turkish' \
  --header 'Tenant: Default' \
  --header 'TextManipType: Aggressive' \
  --header 'Authorization: Bearer <authorization_token>' \
  --form 'audio=@"/path/to/audio/file.wav"' \
  --form 'parameters="{
    \"NER\":
    {
        \"ParsingList\":
        [
            {
                \"LabelClass\": \"PhoneNumber\",
                \"Name\": \"phonenumber\"
            }
        ]
    }
}"'
```
Response Example
```json
{
    "audioLink": "https://{{Address}}/audiopath.wav",
    "confidence": 0.97,
    "detectedAudioContent": "recognizable-speech",
    "errorCode": null,
    "errorMessage": null,
    "moreInfo": null,
    "resultText": "0 535 142 33 87",
    "success": true
}
```
"parameters" + "TextManipType": Aggressive + "ProduceNBestList": True
Request Example (cURL)
```bash
curl \
  --location --request POST '{server-url}/dictation/request' \
  --header 'Content-Type: audio/wave' \
  --header 'ModelName: Turkish' \
  --header 'Tenant: Default' \
  --header 'TextManipType: Aggressive' \
  --header 'ProduceNBestList: true' \
  --header 'Authorization: Bearer <authorization_token>' \
  --form 'audio=@"/path/to/audio/file.wav"' \
  --form 'parameters="{
    \"NER\":
    {
        \"ParsingList\":
        [
            {
                \"LabelClass\": \"PhoneNumber\",
                \"Name\": \"phonenumber\"
            }
        ]
    }
}"'
```
Response Example
```json
{
    "audioLink": "https://{{Address}}/audiopath.wav",
    "confidence": 0.97,
    "detectedAudioContent": "recognizable-speech",
    "errorCode": null,
    "errorMessage": null,
    "moreInfo": null,
    "nbestlist": {
        "utterances": [
            {
                "confidence": 94,
                "nlsmlResult": "",
                "recognizedWords": [
                    {
                        "confidence": 97,
                        "endTimeMsec": 4060,
                        "startTimeMsec": 0,
                        "word": "0 530 542 33 87",
                        "wordType": "phonenumber"
                    }
                ]
            }
        ]
    },
    "resultText": "0 530 542 33 87",
    "success": true
}
```
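As an illustrative sketch, entity words can be pulled out of such a response by filtering on "wordType". The helper name is hypothetical, and the response dict below is abbreviated from the example above:

```python
# Sketch: extracting entity words from a dictation/request response
# that includes "nbestlist" (ProduceNBestList: true).
response = {
    "resultText": "0 530 542 33 87",
    "success": True,
    "nbestlist": {
        "utterances": [
            {"confidence": 94,
             "recognizedWords": [
                 {"word": "0 530 542 33 87", "wordType": "phonenumber",
                  "confidence": 97, "startTimeMsec": 0, "endTimeMsec": 4060}
             ]}
        ]
    },
}

def extract_entities(resp, word_type):
    """Collect recognized words of a given wordType across all utterances."""
    return [w["word"]
            for utt in resp.get("nbestlist", {}).get("utterances", [])
            for w in utt.get("recognizedWords", [])
            if w.get("wordType") == word_type]

print(extract_entities(response, "phonenumber"))  # ['0 530 542 33 87']
```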
Conclusion
By supporting both pre-defined and custom entity configurations, along with multiple TextManipType modes, our speech recognition system enables precise control over how spoken expressions are interpreted, normalized, or preserved.
The availability of plain text and word-by-word processing approaches allows the system to adapt to varying accuracy requirements. This flexibility ensures that applications can achieve the desired balance between semantic correctness and numeric precision.
