Pronunciation Customization

SESTEK SR allows you to customize how specific words appear in transcription output - without retraining the language model. By providing a pronunciation list in your HTTP request, you can map recognized speech to your preferred text output. This is especially useful for foreign words, brand names, and domain-specific terminology that the base model may transcribe inconsistently.

This feature also performs context biasing under the hood, increasing the probability that target words are recognized correctly in the first place - not just corrected after the fact.

For more on how context biasing works, see the Context Biasing documentation.


How It Works

[Figure: sr-pronunciation-how-it-works.png]

You include a pronunciations parameter in your HTTP request body. Each entry maps a recognized form (the spelling the recognizer produces for what it hears) to your desired output (how the word should appear in the transcript).

For example: {"capuchino": "Cappuccino"} tells the SR service to replace any occurrence of "capuchino" in the recognized text with "Cappuccino."

Multiple variants of the same word can be included in the list to cover different mispronunciations.
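
As a minimal sketch of how such a list might be assembled on the client side, the Python snippet below expands a dictionary of target words and their expected variants into the list-of-single-pair-objects shape described above. The helper name build_pronunciations and its input layout are illustrative assumptions, not part of the SESTEK SR API.

```python
import json


def build_pronunciations(variants_by_target):
    """Expand {target: [variant, ...]} into a list of {variant: target} objects.

    Hypothetical helper: the name and input layout are assumptions made
    for illustration, not part of the SESTEK SR API.
    """
    entries = []
    for target, variants in variants_by_target.items():
        for variant in variants:
            entries.append({variant: target})
    return entries


pronunciations = build_pronunciations(
    {"Cappuccino": ["capuchino", "cappachino", "capuccino"]}
)

# Serialize to the JSON text that goes into the "pronunciations" form field.
payload = json.dumps(pronunciations)
print(payload)
```

Keeping the variants grouped by target word makes it easier to see which mispronunciations are already covered when the list grows.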


Example: Single Word Correction

[Figure: sr-pronunciation-before-after.png]

Request:

curl --location '{{Address}}/v1/speech/dictation/request' \
--header 'Content-Type: audio/wave' \
--header 'ModelName: English' \
--header 'ModelVersion: 0' \
--header 'Tenant: Default' \
--header 'Authorization: Bearer <your_token_here>' \
--form 'audio=@"/path/to/your/audio/file.wav"' \
--form 'pronunciations="[
    {\"capuchino\": \"Cappuccino\"},
    {\"cappachino\": \"Cappuccino\"},
    {\"capuccino\": \"Cappuccino\"}
]"'

Before customization:

{
    "resultText": "I ordered a capuchino from the machine but found it too bitter and couldn't drink it.",
    "confidence": 0.99,
    "success": true
}

After customization:

{
    "resultText": "I ordered a Cappuccino from the machine but found it too bitter and couldn't drink it.",
    "confidence": 0.99,
    "success": true
}

Example: Multi-Word Phrase Mapping

The pronunciation list can also map longer phrases - useful for shortening multi-word technical terms into abbreviations.

Request:

curl --location '{{Address}}/v1/speech/dictation/request' \
--header 'Content-Type: audio/wave' \
--header 'ModelName: English' \
--header 'ModelVersion: 0' \
--header 'Tenant: Default' \
--header 'Authorization: Bearer <your_token_here>' \
--form 'audio=@"/path/to/your/audio/file.wav"' \
--form 'pronunciations="[
    {\"artificial intelligence system\": \"AI System\"}
]"'

Before customization:

{
    "resultText": "We need to upgrade our artificial intelligence system to handle more tasks.",
    "confidence": 0.94,
    "success": true
}

After customization:

{
    "resultText": "We need to upgrade our AI System to handle more tasks.",
    "confidence": 0.94,
    "success": true
}
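
To make the before/after behavior in the examples above concrete, the following Python sketch imitates the text-replacement step locally. It is only an approximation for illustration: the function apply_pronunciations is hypothetical, the whole-word, case-sensitive matching is an assumption, and the sketch does not model the context biasing that the service performs during recognition.

```python
import re


def apply_pronunciations(text, pronunciations):
    """Replace each recognized form in `text` with its desired output.

    Illustration only: whole-word matching is an assumption, and the real
    service also biases recognition itself, which this does not model.
    """
    # Flatten [{"form": "output"}, ...] into a single lookup table.
    mapping = {}
    for entry in pronunciations:
        mapping.update(entry)
    if not mapping:
        return text

    # Try longer forms first so multi-word phrases win over shorter entries.
    forms = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(form) for form in forms) + r")\b"
    )
    return pattern.sub(lambda match: mapping[match.group(0)], text)


before = "We need to upgrade our artificial intelligence system to handle more tasks."
after = apply_pronunciations(
    before, [{"artificial intelligence system": "AI System"}]
)
print(after)
```

A local stand-in like this can be handy for unit-testing a pronunciation list before sending real audio, as long as the results are later confirmed against the service itself.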

Use Cases

Foreign words and brand names
Ensure that product names, company names, or foreign-language terms are always transcribed in their correct form - regardless of how they are pronounced.

Rapid customization
Apply quick fixes to transcription output without retraining the model. New entries can be added per request with no downtime or redeployment.

Multi-word phrase normalization
Collapse verbose spoken phrases into preferred written forms - for example, mapping "artificial intelligence system" to "AI System" or "myocardial infarction" to "MI."

Domain-specific terminology
Standardize industry jargon, internal product codes, or technical terms that fall outside the base model's training vocabulary.


Request Parameters

Parameter       Type        Required  Description
audio           file        Yes       Audio file to be transcribed
pronunciations  JSON array  No        List of word-to-output mappings. If omitted, the SR service returns the recognized text without any customization.

Pronunciations format:

[
    {"recognized_form": "desired_output"},
    {"another_form": "desired_output"}
]
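
A client may want to check a list against this format before sending it. The Python sketch below is one way to do that, under stated assumptions: validate_pronunciations is a hypothetical helper, and the one-pair-per-object rule is inferred from the examples in this document rather than from a published validation specification.

```python
import json


def validate_pronunciations(raw):
    """Parse `raw` JSON text and check it matches the documented list shape.

    Hypothetical client-side check; the service's own validation rules
    are not described in this document.
    """
    data = json.loads(raw)
    if not isinstance(data, list):
        raise ValueError("pronunciations must be a JSON array")
    for index, entry in enumerate(data):
        # Assumption: each object carries exactly one form-to-output pair.
        if not isinstance(entry, dict) or len(entry) != 1:
            raise ValueError(f"entry {index} must be an object with one key")
        (form, output), = entry.items()
        if not isinstance(output, str) or not form or not output:
            raise ValueError(f"entry {index} must map text to non-empty text")
    return data


validated = validate_pronunciations(
    '[{"capuchino": "Cappuccino"}, {"cappachino": "Cappuccino"}]'
)
print(len(validated))
```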

Important Notes

  • Required per request - the pronunciations parameter must be included in every request where customization is needed. It is not stored between requests.
  • Language-independent - the feature works across all supported languages without modifying the underlying language model.
  • Multiple variants - include multiple entries for the same target word to cover different mispronunciations or spelling variations.
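
Because the list is not stored between requests, client code usually attaches it each time a request is built. The Python sketch below shows one way to do this with the standard library only. It builds the request without sending it. The endpoint path and header names are copied from the curl examples; the helper build_dictation_request, the placeholder address, and the placeholder audio bytes are assumptions. The sketch sets a multipart/form-data content type because it assembles the multipart body by hand, whereas the curl examples pass audio/wave alongside --form, so confirm which value your deployment expects.

```python
import json
import urllib.request
import uuid


def build_dictation_request(address, token, audio_bytes, pronunciations=None):
    """Build (but do not send) a multipart dictation request.

    Sketch only: header names and the endpoint path are copied from the curl
    examples; the multipart Content-Type and this helper are assumptions.
    """
    boundary = uuid.uuid4().hex
    parts = [("audio", "file.wav", audio_bytes)]
    if pronunciations is not None:
        # The list is not stored server-side, so it is attached to each request.
        encoded = json.dumps(pronunciations).encode("utf-8")
        parts.append(("pronunciations", None, encoded))

    body = b""
    for name, filename, content in parts:
        disposition = f'form-data; name="{name}"'
        if filename:
            disposition += f'; filename="{filename}"'
        header = f"--{boundary}\r\nContent-Disposition: {disposition}\r\n\r\n"
        body += header.encode("utf-8") + content + b"\r\n"
    body += f"--{boundary}--\r\n".encode("utf-8")

    return urllib.request.Request(
        f"{address}/v1/speech/dictation/request",
        data=body,
        method="POST",
        headers={
            "Content-Type": f"multipart/form-data; boundary={boundary}",
            "ModelName": "English",
            "ModelVersion": "0",
            "Tenant": "Default",
            "Authorization": f"Bearer {token}",
        },
    )


request = build_dictation_request(
    "https://sr.example.invalid",  # placeholder address
    "<your_token_here>",
    b"RIFF....WAVE",  # placeholder bytes standing in for real audio
    pronunciations=[{"capuchino": "Cappuccino"}],
)
print(request.get_method(), request.full_url)
```

Passing the resulting object to urllib.request.urlopen would send it; omitting the pronunciations argument yields a plain dictation request with no customization.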

MRCP Configuration

For deployments using MRCP, pronunciation customization can be configured directly in the IVR interface without additional coding.

Steps:

  1. Open the IVR parameter settings or customization screen.
  2. Add a key-value entry:
    • Key: Pronunciation
    • Value: A JSON-formatted string mapping recognized forms to desired outputs.
  3. Save and apply the configuration.

Once set, the IVR will include the pronunciation list in every MRCP request sent to the SESTEK SR service. This provides a flexible, code-free way to refine transcription accuracy for specific customer needs or terminology.
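
The JSON string for the Value field can be generated rather than typed by hand, which avoids quoting mistakes. The Python sketch below is one way to produce it. It assumes the MRCP value uses the same list-of-objects shape as the HTTP pronunciations parameter, which this document does not state explicitly, so verify the expected shape for your IVR platform before relying on it.

```python
import json

# Assumption: the MRCP value uses the same array-of-objects shape as the
# HTTP "pronunciations" parameter; verify this against your IVR platform.
pronunciations = [
    {"capuchino": "Cappuccino"},
    {"artificial intelligence system": "AI System"},
]

# Compact, single-line JSON is easier to paste into a settings field;
# ensure_ascii=False keeps non-ASCII characters readable.
mrcp_value = json.dumps(pronunciations, separators=(",", ":"), ensure_ascii=False)
print(mrcp_value)
```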