Overview
In the Knovvu Speech Recognition (SR) service, users can now customize the pronunciation of specific words in their audio recordings by providing a pronunciation list in the HTTP request body. This feature allows you to refine the text output without the need to retrain the language model, making it especially useful for handling foreign words or brand names.
How It Works
The pronunciation customization feature works by mapping the recognized speech to your desired text output. For example, if the SR service recognizes "capuchino" in the audio, you can instruct the service to output "Cappuccino" in the transcription result. This is achieved by including a pronunciation list in the HTTP request.
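The exact HTTP request format is shown in the examples below; as a quick illustration of the mapping itself, the pronunciation list is simply a set of recognized-text to desired-text pairs, sketched here as a Python structure (the entries are illustrative):
# Illustrative pronunciation list: each entry maps a possible recognition
# result (key) to the text you want to appear in the transcription (value).
pronunciations = [
    {"capuchino": "Cappuccino"},
    {"cappachino": "Cappuccino"},
]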
Enhanced Context Biasing
This feature not only corrects incorrectly recognized words but also performs context biasing, which increases the probability of accurate recognition. This way, it ensures that even challenging words or phrases that are frequently misrecognized can be transcribed more accurately.
👀 For more detailed information on how context biasing enhances speech recognition accuracy, click here to explore the complete documentation.
Example Request
Here is an example of how to send a pronunciation list in a POST request using curl:
curl --location '{{Address}}/v1/speech/dictation/request' \
--header 'Content-Type: audio/wave' \
--header 'ModelName: English' \
--header 'ModelVersion: 0' \
--header 'Tenant: Default' \
--header 'Authorization: Bearer <your_token_here>' \
--form 'audio=@"/path/to/your/audio/file.wav"' \
--form 'pronunciations="[
{
\"capuchino\": \"Cappuccino\"
},
{
\"cappachino\": \"Cappuccino\"
},
{
\"capuccino \": \"Cappuccino\"
}
]"'
Before and After Example
To demonstrate the impact of the pronunciation customization feature, let's consider the following scenario:
Before Customization
Without using the pronunciation customization feature, the Speech Recognition service might produce the following result for the audio input:
{
"audioLink": "{{Address}}/audio-logs/Default/2024/9/2/13/20240902T125734-2615_in.wav",
"confidence": 0.99,
"detectedAudioContent": "recognizable-speech",
"errorCode": null,
"errorMessage": null,
"moreInfo": null,
"resultText": "I ordered a capuchino from the machine but found it too bitter and couldn't drink it.",
"success": true
}
In this case, the word "capuchino" is transcribed as is, without being corrected to the proper spelling "Cappuccino."
After Customization
By using the pronunciation customization feature, you can correct the spelling automatically. Here's the modified request:
curl --location '{{Address}}/v1/speech/dictation/request' \
--header 'Content-Type: audio/wave' \
--header 'ModelName: English' \
--header 'ModelVersion: 0' \
--header 'Tenant: Default' \
--header 'Authorization: Bearer <your_token_here>' \
--form 'audio=@"/path/to/your/audio/file.wav"' \
--form 'pronunciations="[
{
\"capuchino\": \"Cappuccino\"
}
]"'
With this customization, the SR service will return the following corrected result:
{
"audioLink": "{{Address}}/audio-logs/Default/2024/9/2/13/20240902T125734-2615_in.wav",
"confidence": 0.99,
"detectedAudioContent": "recognizable-speech",
"errorCode": null,
"errorMessage": null,
"moreInfo": null,
"resultText": "I ordered a Cappuccino from the machine but found it too bitter and couldn't drink it.",
"success": true
}
This example illustrates how the pronunciation customization feature can enhance the accuracy of your transcription results, ensuring that words are spelled correctly even if they are mispronounced in the audio input.
Handling Multi-Word Phrases
The pronunciation customization feature can also handle longer, more complex phrases that should be merged into a single term. This is particularly useful for modern technical terms and jargon commonly used in everyday communication.
Example
Let’s take the phrase "artificial intelligence system." In speech, this might be recognized as three separate words, but in technical writing, you may prefer it to be merged into a single term like "AI system" or "AI System." Using the pronunciation customization feature, you can control how the phrase is transcribed.
Here’s how you can customize it:
curl --location '{{Address}}/v1/speech/dictation/request' \
--header 'Content-Type: audio/wave' \
--header 'ModelName: English' \
--header 'ModelVersion: 0' \
--header 'Tenant: Default' \
--header 'Authorization: Bearer <your_token_here>' \
--form 'audio=@"/path/to/your/audio/file.wav"' \
--form 'pronunciations="[
{
\"artificial intelligence system\": \"AI System\"
}
]"'
Before Customization
Without customization, the transcription result might look like this:
{
"audioLink": "{{Address}}/audio-logs/Default/2024/9/2/13/20240902T125734-2615_in.wav",
"confidence": 0.94,
"detectedAudioContent": "recognizable-speech",
"errorCode": null,
"errorMessage": null,
"moreInfo": null,
"resultText": "We need to upgrade our artificial intelligence system to handle more tasks.",
"success": true
}
In this case, "artificial intelligence system" is transcribed as three separate words, which may not be the desired format.
After Customization
With the pronunciation customization feature applied, the result becomes:
{
"audioLink": "{{Address}}/audio-logs/Default/2024/9/2/13/20240902T125734-2615_in.wav",
"confidence": 0.94,
"detectedAudioContent": "recognizable-speech",
"errorCode": null,
"errorMessage": null,
"moreInfo": null,
"resultText": "We need to upgrade our AI System to handle more tasks.",
"success": true
}
In this case, "artificial intelligence system" is merged and simplified to "AI System," providing a concise and technically correct term.
Detailed Explanation
- Audio File: The audio file is provided as usual in the request using the audio parameter.
- Pronunciations List: The pronunciations parameter allows you to specify a list of words that you want to customize in the transcription result. The list is formatted as a JSON array of objects, where each object contains a recognized word as the key and the desired output as the value.
For example, the object {"capuchino": "Cappuccino"} tells the SR service to replace any occurrence of "capuchino" in the recognized text with "Cappuccino."
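If you build the list programmatically, one possible sketch is to start from a plain dictionary of corrections (hypothetical values below) and convert it into the required array of single-key objects before serializing it to JSON:
import json

# Hypothetical corrections, keyed by the text the SR service may recognize.
corrections = {
    "capuchino": "Cappuccino",
    "cappachino": "Cappuccino",
}

# The pronunciations parameter expects a JSON array of single-key objects.
pronunciations = [{recognized: desired} for recognized, desired in corrections.items()]

print(json.dumps(pronunciations, indent=2))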
Important Notes
- Required Parameter: The pronunciations parameter must be included in every HTTP request where you need this customization. If this parameter is not provided, the SR service will output the recognized text as is, without any customizations.
- Language Independence: This feature is language-independent, allowing you to apply it across different languages without modifying the underlying language model.
Use Case
This feature is particularly useful in scenarios where:
- Foreign Words: You want to ensure that foreign words or brand names are transcribed correctly.
- Rapid Customization: You need to apply quick fixes to the transcription output without retraining the model.
MRCP Integration for Pronunciation Customization
For users utilizing the Knovvu Speech Recognition (SR) service through MRCP (Media Resource Control Protocol), pronunciation customization can also be achieved by configuring specific parameters within the IVR (Interactive Voice Response) system interface.
In this setup, the IVR should have a parameter entry screen or configuration page where key-value pairs can be set to define the desired pronunciation customizations. Here’s how it works:
MRCP IVR Configuration
- Access IVR Parameter Settings: Open the IVR interface where you can add key-value configurations. This is typically under a customization or settings section.
- Set Pronunciation Customization Parameters:
  - Key: Use Pronunciation as the key name.
  - Value: Provide the pronunciation list as a JSON-formatted string. This JSON should map words or phrases that may be misrecognized to their desired transcription output.
- Submit and Save: Ensure these parameters are saved and applied. Once configured, the IVR will include this pronunciation customization in the requests sent to the Knovvu SR service via MRCP.
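As a sketch of what such a key-value pair might contain (the exact parameter screen and the expected JSON shape depend on your IVR and deployment, so treat the array-of-objects form below as an assumption carried over from the HTTP examples), the value string could be generated like this:
import json

# Key name used in the IVR parameter screen, as described above.
key = "Pronunciation"

# Hypothetical mappings; the array-of-objects shape mirrors the HTTP examples.
value = json.dumps([
    {"capuchino": "Cappuccino"},
    {"artificial intelligence system": "AI System"},
])

print(f"{key} = {value}")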
By setting up these pronunciation parameters directly in the IVR, you can tailor the transcription results without additional coding, making it easy to adapt to specific customer needs or technical terminology directly through the MRCP interface. This method offers a flexible, language-independent solution for refining transcription accuracy across various IVR use cases.