Text Normalization (TN) is a critical component of speech recognition systems: it converts raw spoken transcriptions into a standardized, more readable, and semantically accurate written form. Named Entity Recognition (NER) then extracts entities such as numbers, dates, currencies, and phone numbers from the recognized text.
A key challenge in this process is handling numbers appropriately, so that the transcription aligns with the user's intent and context.
For example:
- "one step at a time" should remain as "one step at a time", not "1 step at a time"
- "two wrongs don't make a right" should not be transcribed as "2 wrongs don't make a right" in conversational contexts
SESTEK SR addresses this through two approaches: pre-defined entities and custom entities, both integrated via the TextManip service.
1. TN/NER with Pre-defined Entities
Pre-defined entity support is enabled using the TextManipType header parameter in POST dictation/request. When used, a request with the pre-defined ParsingList is sent to the TextManip service for Text Normalization and Named Entity Recognition.
| Key | Description |
|---|---|
| TextManipType | Standard · Aggressive · None |

If not specified, Standard mode is used by default.
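As a minimal client-side sketch, the request headers above can be assembled in Python before sending the audio with any HTTP client. The header names (TextManipType, ModelName, Tenant) come from this document; the helper function itself and its mode validation are illustrative assumptions, not part of the SESTEK SR API.

```python
# Hypothetical helper: builds the header set for POST dictation/request.
# Header names are taken from the tables above; the allowed-mode check
# is this sketch's own assumption.
VALID_MODES = {"Standard", "Aggressive", "None"}

def dictation_headers(model: str, tenant: str, token: str,
                      text_manip: str = "Standard") -> dict:
    if text_manip not in VALID_MODES:
        raise ValueError(f"TextManipType must be one of {sorted(VALID_MODES)}")
    return {
        "Content-Type": "audio/wave",
        "ModelName": model,
        "Tenant": tenant,
        "TextManipType": text_manip,  # if omitted, the server defaults to Standard
        "Authorization": f"Bearer {token}",
    }
```

The resulting dictionary can be passed directly to an HTTP client along with the audio file as form data.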
1.1. Pre-defined Entity Types
| Entity LabelClass | Description |
|---|---|
| Number | Detects numerical values in any format |
| SrDate | Detects date expressions, excluding relative terms like "today", "tomorrow", "yesterday" |
| Currency | Detects monetary values, including amounts paired with currency symbols or names |
| Time | Detects clock times in different formats (e.g. "10:30 AM", "22:15") |
| SerialNumber | Detects structured numerical sequences representing serial numbers |
1.2. TextManipType Modes
Standard
Converts numeric expressions only when explicitly numerical. Retains non-literal or contextual phrases as-is, ensuring semantic accuracy.
Standard mode uses TextManip's text and word output parameters as the result. See Contextual Numerical Expressions Handling for more details.
Examples:
| Input (spoken) | Output (transcribed) |
|---|---|
| "one can never know the answer" | "one can never know the answer" |
| "two heads are better than one" | "two heads are better than one" |
| "i have one last question for you" | "i have one last question for you" |
Aggressive
Converts all potential numbers into numerical format, prioritizing numeric representation over contextual nuance.
Aggressive mode uses TextManip's displayText and displayWord output parameters as the result. See NER Methods for more details.
Examples:
| Input (spoken) | Output (transcribed) |
|---|---|
| "one can never know the answer" | "1 can never know the answer" |
| "two heads are better than one" | "2 heads are better than 1" |
| "three days ago, we had a meeting about this" | "3 days ago we had a meeting about this" |
| "i have one last question for you" | "i have 1 last question for you" |
None
Disables text normalization entirely. All transcriptions preserve the exact spoken format without any changes to numeric expressions.
This mode is useful for scenarios where raw transcription without modification is required.
Examples:
| Input (spoken) | Output (transcribed) |
|---|---|
| "i will be there in three hours" | "i will be there in three hours" |
| "room number one is ready" | "room number one is ready" |
2. TN/NER with Custom Entities
Custom entity support is enabled by sending a parameters field in the form-data body of POST dictation/request.
| Key | Type | Description |
|---|---|---|
| audio | File | The audio file to be transcribed (e.g. payment.wav) |
| parameters | Text | JSON object specifying the NER ParsingList |
ParsingList example:
```json
{
  "NER": {
    "ParsingList": [
      { "LabelClass": "SrDate", "Name": "@srdate" },
      { "LabelClass": "Time", "Name": "@time" },
      { "LabelClass": "Iban", "Name": "@iban" },
      { "LabelClass": "Tckn", "Name": "@tckn" },
      { "LabelClass": "PhoneNumber", "Name": "@phonenumber" }
    ]
  }
}
```
- parameters works with TextManipType Standard and Aggressive modes only.
- If TextManipType is set to None, the parameters field is automatically disabled and no TN/NER is applied.
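The parameters field is a JSON string, so it can be serialized programmatically before being attached to the form data. The sketch below assumes only what the ParsingList example above shows; the helper function name is illustrative.

```python
import json

# Sketch: serialize a custom ParsingList for the `parameters` form field.
# The {"NER": {"ParsingList": [...]}} shape comes from the example above;
# the helper itself is illustrative.
def ner_parameters(parsing_list: list) -> str:
    return json.dumps({"NER": {"ParsingList": parsing_list}})

payload = ner_parameters([
    {"LabelClass": "PhoneNumber", "Name": "@phonenumber"},
    {"LabelClass": "Iban", "Name": "@iban"},
])
```

The returned string can then be sent as the value of the parameters form field alongside the audio file.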
Result Accuracy
Text Normalization and NER can be applied using either plain text or word-by-word input.
Plain text input
The text is processed as a whole, without considering the pauses between words during speech. This can lead to slightly inaccurate results depending on the entity being matched.
Word-by-word input
Each word is processed individually, and the pause time between words during speech is taken into account when detecting entities. This provides more accurate results, particularly for phone numbers and serial numbers where digit grouping matters.
How to implement word-by-word TN/NER
Enable ProduceNBestList: True in POST dictation/request. This produces detailed per-word information in the response via the RecognizedWords parameter.
| Name | Description |
|---|---|
| Word | The recognized word from the speech input |
| StartTimeMsec | Start time of the utterance in milliseconds |
| EndTimeMsec | End time of the utterance in milliseconds |
| Confidence | Confidence value of the recognized result (percentage) |
| WordType | Type of the word: Normal, Filler, Suffix, or Prefix |
This enables extracting entities word-by-word using the TextManip POST denormalize/by-words method.
The denormalize/by-words method includes a PauseThreshold: 1000 parameter. This controls how consecutive words are grouped:
- If StartTimeMsec of Word_2 − EndTimeMsec of Word_1 < 1000 ms, the two words are joined into a single number.
- If the gap is ≥ 1000 ms, they are treated as two separate numbers.
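The grouping rule above can be sketched in a few lines of Python. The input shape mirrors the RecognizedWords fields (word, startTimeMsec, endTimeMsec); the function itself is an illustration of the rule, not the TextManip implementation.

```python
# Sketch of the PauseThreshold grouping rule: merge consecutive words
# whose inter-word pause is below the threshold into one number group.
def group_by_pause(words, pause_threshold_ms=1000):
    groups = []
    for w in words:
        if groups and w["startTimeMsec"] - groups[-1]["endTimeMsec"] < pause_threshold_ms:
            # Gap below threshold: extend the current group
            groups[-1]["words"].append(w["word"])
            groups[-1]["endTimeMsec"] = w["endTimeMsec"]
        else:
            # Gap at or above threshold (or first word): start a new group
            groups.append({"words": [w["word"]], "endTimeMsec": w["endTimeMsec"]})
    return [" ".join(g["words"]) for g in groups]
```

For example, two words separated by a 200 ms pause land in one group, while a word that starts 1300 ms after the previous one ends begins a new group.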
Example:
| Input (spoken) | Plain text output (ProduceNBestList: False) | Word-by-word output (ProduceNBestList: True) |
|---|---|---|
| "sıfır beş yüz otuz beş yüz kırk iki otuz üç seksen yedi" | "0 535 142 33 87" | "0 530 542 33 87" |
API Reference
1. Pre-defined Entities
TextManipType: Standard
Request:
```bash
curl \
  --location --request POST '{server-url}/dictation/request' \
  --header 'Content-Type: audio/wave' \
  --header 'ModelName: English' \
  --header 'Tenant: Default' \
  --header 'TextManipType: Standard' \
  --header 'Authorization: Bearer <authorization_token>' \
  --form 'audio=@"/path/to/audio/file.wav"'
```
Response:
```json
{
  "audioLink": "https://{{Address}}/audiopath.wav",
  "confidence": 0.97,
  "detectedAudioContent": "recognizable-speech",
  "errorCode": null,
  "errorMessage": null,
  "moreInfo": null,
  "resultText": "I sent 7 emails but only received a reply to one",
  "success": true
}
```
TextManipType: Aggressive
Request:
```bash
curl \
  --location --request POST '{server-url}/dictation/request' \
  --header 'Content-Type: audio/wave' \
  --header 'ModelName: English' \
  --header 'Tenant: Default' \
  --header 'TextManipType: Aggressive' \
  --header 'Authorization: Bearer <authorization_token>' \
  --form 'audio=@"/path/to/audio/file.wav"'
```
Response:
```json
{
  "audioLink": "https://{{Address}}/audiopath.wav",
  "confidence": 0.97,
  "detectedAudioContent": "recognizable-speech",
  "errorCode": null,
  "errorMessage": null,
  "moreInfo": null,
  "resultText": "I sent 7 emails but only received a reply to 1",
  "success": true
}
```
TextManipType: None
Request:
```bash
curl \
  --location --request POST '{server-url}/dictation/request' \
  --header 'Content-Type: audio/wave' \
  --header 'ModelName: English' \
  --header 'Tenant: Default' \
  --header 'TextManipType: None' \
  --header 'Authorization: Bearer <authorization_token>' \
  --form 'audio=@"/path/to/audio/file.wav"'
```
Response:
```json
{
  "audioLink": "https://{{Address}}/audiopath.wav",
  "confidence": 0.97,
  "detectedAudioContent": "recognizable-speech",
  "errorCode": null,
  "errorMessage": null,
  "moreInfo": null,
  "resultText": "I sent seven emails but only received a reply to one",
  "success": true
}
```
2. Custom Entities
parameters + TextManipType: None
Request:
```bash
curl \
  --location --request POST '{server-url}/dictation/request' \
  --header 'Content-Type: audio/wave' \
  --header 'ModelName: Turkish' \
  --header 'Tenant: Default' \
  --header 'TextManipType: None' \
  --header 'Authorization: Bearer <authorization_token>' \
  --form 'audio=@"/path/to/audio/file.wav"' \
  --form 'parameters="{
    \"NER\": {
      \"ParsingList\": [
        { \"LabelClass\": \"PhoneNumber\", \"Name\": \"phonenumber\" }
      ]
    }
  }"'
```
Response:
```json
{
  "resultText": "sıfır beş yüz otuz beş yüz kırk iki otuz üç seksen yedi",
  "confidence": 0.97,
  "success": true
}
```
parameters + TextManipType: Standard
Request:
```bash
curl \
  --location --request POST '{server-url}/dictation/request' \
  --header 'Content-Type: audio/wave' \
  --header 'ModelName: Turkish' \
  --header 'Tenant: Default' \
  --header 'TextManipType: Standard' \
  --header 'Authorization: Bearer <authorization_token>' \
  --form 'audio=@"/path/to/audio/file.wav"' \
  --form 'parameters="{
    \"NER\": {
      \"ParsingList\": [
        { \"LabelClass\": \"PhoneNumber\", \"Name\": \"phonenumber\" }
      ]
    }
  }"'
```
Response:
```json
{
  "resultText": "sıfır 535 142 33 87",
  "confidence": 0.97,
  "success": true
}
```
parameters + TextManipType: Aggressive
Request:
```bash
curl \
  --location --request POST '{server-url}/dictation/request' \
  --header 'Content-Type: audio/wave' \
  --header 'ModelName: Turkish' \
  --header 'Tenant: Default' \
  --header 'TextManipType: Aggressive' \
  --header 'Authorization: Bearer <authorization_token>' \
  --form 'audio=@"/path/to/audio/file.wav"' \
  --form 'parameters="{
    \"NER\": {
      \"ParsingList\": [
        { \"LabelClass\": \"PhoneNumber\", \"Name\": \"phonenumber\" }
      ]
    }
  }"'
```
Response:
```json
{
  "resultText": "0 535 142 33 87",
  "confidence": 0.97,
  "success": true
}
```
parameters + TextManipType: Aggressive + ProduceNBestList: True
Request:
```bash
curl \
  --location --request POST '{server-url}/dictation/request' \
  --header 'Content-Type: audio/wave' \
  --header 'ModelName: Turkish' \
  --header 'Tenant: Default' \
  --header 'TextManipType: Aggressive' \
  --header 'ProduceNBestList: true' \
  --header 'Authorization: Bearer <authorization_token>' \
  --form 'audio=@"/path/to/audio/file.wav"' \
  --form 'parameters="{
    \"NER\": {
      \"ParsingList\": [
        { \"LabelClass\": \"PhoneNumber\", \"Name\": \"phonenumber\" }
      ]
    }
  }"'
```
Response:
```json
{
  "audioLink": "https://{{Address}}/audiopath.wav",
  "confidence": 0.97,
  "detectedAudioContent": "recognizable-speech",
  "errorCode": null,
  "errorMessage": null,
  "moreInfo": null,
  "nbestlist": {
    "utterances": [
      {
        "confidence": 94,
        "nlsmlResult": "",
        "recognizedWords": [
          {
            "confidence": 97,
            "endTimeMsec": 4060,
            "startTimeMsec": 0,
            "word": "0 530 542 33 87",
            "wordType": "phonenumber"
          }
        ]
      }
    ]
  },
  "resultText": "0 530 542 33 87",
  "success": true
}
```
Summary
By supporting both pre-defined and custom entity configurations alongside multiple TextManipType modes, SESTEK SR enables precise control over how spoken expressions are interpreted, normalized, or preserved. The availability of plain text and word-by-word processing ensures that applications can achieve the right balance between semantic correctness and numeric precision.
