Overview
In the Knovvu Speech Recognition (SR) service, users can now customize the pronunciation of specific words in their audio recordings by providing a pronunciation list in the HTTP request body. This feature allows you to refine the text output without the need to retrain the language model, making it especially useful for handling foreign words or brand names.
How It Works
The pronunciation customization feature works by mapping the recognized speech to your desired text output. For example, if the SR service recognizes "capuchino" in the audio, you can instruct the service to output "Cappuccino" in the transcription result. This is achieved by including a pronunciation list in the HTTP request.
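The exact HTTP request format is shown in the examples below; as a quick illustration of the mapping itself, the pronunciation list is simply a set of recognized-text to desired-text pairs, sketched here as a Python structure (the entries are illustrative):
# Illustrative pronunciation list: each entry maps a possible recognition
# result (key) to the text you want to appear in the transcription (value).
pronunciations = [
    {"capuchino": "Cappuccino"},
    {"cappachino": "Cappuccino"},
]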
Enhanced Context Biasing
This feature not only corrects incorrectly recognized words but also performs context biasing, which increases the probability of accurate recognition. This way, it ensures that even challenging words or phrases that are frequently misrecognized can be transcribed more accurately.
👀 For more detailed information on how context biasing enhances speech recognition accuracy, click here to explore the complete documentation.
Example Request
Here is an example of how to send a pronunciation list in a POST request using curl:
curl --location '{{Address}}/v1/speech/dictation/request' \
--header 'Content-Type: audio/wave' \
--header 'ModelName: English' \
--header 'ModelVersion: 0' \
--header 'Tenant: Default' \
--header 'Authorization: Bearer <your_token_here>' \
--form 'audio=@"/path/to/your/audio/file.wav"' \
--form 'pronunciations="[
{
\"capuchino\": \"Cappuccino\"
},
{
\"cappachino\": \"Cappuccino\"
},
{
\"capuccino \": \"Cappuccino\"
}
]"'
Before and After Example
To demonstrate the impact of the pronunciation customization feature, let's consider the following scenario:
Before Customization
Without using the pronunciation customization feature, the Speech Recognition service might produce the following result for the audio input:
{
"audioLink": "{{Address}}/audio-logs/Default/2024/9/2/13/20240902T125734-2615_in.wav",
"confidence": 0.99,
"detectedAudioContent": "recognizable-speech",
"errorCode": null,
"errorMessage": null,
"moreInfo": null,
"resultText": "I ordered a capuchino from the machine but found it too bitter and couldn't drink it.",
"success": true
}
In this case, the word "capuchino" is transcribed as is, without being corrected to the proper spelling "Cappuccino."
After Customization
By using the pronunciation customization feature, you can correct the spelling automatically. Here's the modified request:
curl --location '{{Address}}/v1/speech/dictation/request' \
--header 'Content-Type: audio/wave' \
--header 'ModelName: English' \
--header 'ModelVersion: 0' \
--header 'Tenant: Default' \
--header 'Authorization: Bearer <your_token_here>' \
--form 'audio=@"/path/to/your/audio/file.wav"' \
--form 'pronunciations="[
{
\"capuchino\": \"Cappuccino\"
}
]"'
With this customization, the SR service will return the following corrected result:
{
"audioLink": "{{Address}}/audio-logs/Default/2024/9/2/13/20240902T125734-2615_in.wav",
"confidence": 0.99,
"detectedAudioContent": "recognizable-speech",
"errorCode": null,
"errorMessage": null,
"moreInfo": null,
"resultText": "I ordered a Cappuccino from the machine but found it too bitter and couldn't drink it.",
"success": true
}
This example illustrates how the pronunciation customization feature can enhance the accuracy of your transcription results, ensuring that words are spelled correctly even if they are mispronounced in the audio input.
Handling Multi-Word Phrases
The pronunciation customization feature can also handle longer, more complex phrases that should be merged into a single term. This is particularly useful for modern technical terms and jargon commonly used in everyday communication.
Example
Let’s take the phrase "artificial intelligence system." In speech, this might be recognized as three separate words, but in technical writing, you may prefer it to be merged into a single term like "AI system" or "AI System." Using the pronunciation customization feature, you can control how the phrase is transcribed.
Here’s how you can customize it:
curl --location '{{Address}}/v1/speech/dictation/request' \
--header 'Content-Type: audio/wave' \
--header 'ModelName: English' \
--header 'ModelVersion: 0' \
--header 'Tenant: Default' \
--header 'Authorization: Bearer <your_token_here>' \
--form 'audio=@"/path/to/your/audio/file.wav"' \
--form 'pronunciations="[
{
\"artificial intelligence system\": \"AI System\"
}
]"'
Before Customization
Without customization, the transcription result might look like this:
{
"audioLink": "{{Address}}/audio-logs/Default/2024/9/2/13/20240902T125734-2615_in.wav",
"confidence": 0.94,
"detectedAudioContent": "recognizable-speech",
"errorCode": null,
"errorMessage": null,
"moreInfo": null,
"resultText": "We need to upgrade our artificial intelligence system to handle more tasks.",
"success": true
}
In this case, "artificial intelligence system" is transcribed as three separate words, which may not be the desired format.
After Customization
With the pronunciation customization feature applied, the result becomes:
{
"audioLink": "{{Address}}/audio-logs/Default/2024/9/2/13/20240902T125734-2615_in.wav",
"confidence": 0.94,
"detectedAudioContent": "recognizable-speech",
"errorCode": null,
"errorMessage": null,
"moreInfo": null,
"resultText": "We need to upgrade our AI System to handle more tasks.",
"success": true
}
In this case, "artificial intelligence system" is merged and simplified to "AI System," providing a concise and technically correct term.
Detailed Explanation
- Audio File: The audio file is provided as usual in the request using the audio parameter.
- Pronunciations List: The pronunciations parameter allows you to specify a list of words that you want to customize in the transcription result. The list is formatted as a JSON array of objects, where each object contains a recognized word as the key and the desired output as the value.
For example, the object {"capuchino": "Cappuccino"} tells the SR service to replace any occurrence of "capuchino" in the recognized text with "Cappuccino."
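If you build the list programmatically, one possible sketch is to start from a plain dictionary of corrections (hypothetical values below) and convert it into the required array of single-key objects before serializing it to JSON:
import json

# Hypothetical corrections, keyed by the text the SR service may recognize.
corrections = {
    "capuchino": "Cappuccino",
    "cappachino": "Cappuccino",
}

# The pronunciations parameter expects a JSON array of single-key objects.
pronunciations = [{recognized: desired} for recognized, desired in corrections.items()]

print(json.dumps(pronunciations, indent=2))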
Important Notes
- Required Parameter: The pronunciations parameter must be included in every HTTP request where you need this customization. If this parameter is not provided, the SR service will output the recognized text as is, without any customizations.
- Language Independence: This feature is language-independent, allowing you to apply it across different languages without modifying the underlying language model.
Use Case
This feature is particularly useful in scenarios where:
- Foreign Words: You want to ensure that foreign words or brand names are transcribed correctly.
- Rapid Customization: You need to apply quick fixes to the transcription output without retraining the model.
MRCP Integration for Pronunciation Customization
For users utilizing the Knovvu Speech Recognition (SR) service through MRCP (Media Resource Control Protocol), pronunciation customization can also be achieved by configuring specific parameters within the IVR (Interactive Voice Response) system interface.
In this setup, the IVR should have a parameter entry screen or configuration page where key-value pairs can be set to define the desired pronunciation customizations. Here’s how it works:
MRCP IVR Configuration
- Access IVR Parameter Settings: Open the IVR interface where you can add key-value configurations. This is typically under a customization or settings section.
- Set Pronunciation Customization Parameters:
  - Key: Use Pronunciation as the key name.
  - Value: Provide the pronunciation list as a JSON-formatted string. This JSON should map words or phrases that may be misrecognized to their desired transcription output.
- Submit and Save: Ensure these parameters are saved and applied. Once configured, the IVR will include this pronunciation customization in the requests sent to the Knovvu SR service via MRCP.
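As a sketch of what such a key-value pair might contain (the exact parameter screen and the expected JSON shape depend on your IVR and deployment, so treat the array-of-objects form below as an assumption carried over from the HTTP examples), the value string could be generated like this:
import json

# Key name used in the IVR parameter screen, as described above.
key = "Pronunciation"

# Hypothetical mappings; the array-of-objects shape mirrors the HTTP examples.
value = json.dumps([
    {"capuchino": "Cappuccino"},
    {"artificial intelligence system": "AI System"},
])

print(f"{key} = {value}")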
By setting up these pronunciation parameters directly in the IVR, you can tailor the transcription results without additional coding, making it easy to adapt to specific customer needs or technical terminology directly through the MRCP interface. This method offers a flexible, language-independent solution for refining transcription accuracy across various IVR use cases.