The Speech Recognition WebSocket API provides real-time speech-to-text transcription by allowing clients to stream audio continuously and receive recognition results during the session.
This API is typically used for real-time voice scenarios where audio is sent in small chunks over a WebSocket connection.
1. WebSocket Endpoint
The Speech Recognition WebSocket endpoint may differ depending on the customer environment, region, or deployment type.
Use the WebSocket host provided for your environment.
Endpoint Format
wss://<sr-websocket-host>/recognizer?ModelName=<ModelName>
Example
wss://srapi.knovvu.com/recognizer?ModelName=EnglishStream
In this example, srapi.knovvu.com represents the Speech Recognition WebSocket host for a specific environment.
The actual host may be different for other regions, private cloud environments, or on-premises deployments.
The ModelName query parameter is required. It is used during WebSocket routing to direct the request to the correct speech recognition model.
The value must match the model that will be used in the recognition request.
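One way to keep the URL and the payload in sync is to derive both from a single model name. The sketch below is illustrative only: the function name `connection_settings` is hypothetical, and it assumes the endpoint path and query parameter are exactly as shown in the Endpoint Format above.

```python
from urllib.parse import urlencode


def connection_settings(host: str, model_name: str) -> tuple:
    """Return (url, payload_fields) built from one model name.

    Hypothetical helper: deriving both values from the same argument
    keeps the URL ModelName and the payload model-name consistent.
    """
    if not model_name:
        raise ValueError("ModelName is required for WebSocket routing")
    query = urlencode({"ModelName": model_name})
    url = f"wss://{host}/recognizer?{query}"
    payload_fields = {"model-name": model_name}
    return url, payload_fields
```

The returned `payload_fields` can be merged into the recognize payload before it is sent.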
For example, if the client wants to use the EnglishStream model, the WebSocket connection URL should be:
wss://<sr-websocket-host>/recognizer?ModelName=EnglishStream
And the recognition payload should also include:
"model-name": "EnglishStream"
2. Authentication
To use the service, the client must authenticate with a valid LDM token.
The token should be provided according to the agreed authentication method for the project or tenant.
Example Placeholder
"Authorization": "<token>"
Replace <token> with the actual authorization token value.
3. Recognition Flow
A typical WebSocket recognition flow is:
- Obtain a valid authorization token.
- Connect to the WebSocket endpoint with the correct host and the required ModelName query parameter.
- Send a recognize message to start recognition.
- Stream raw audio chunks to the server.
- Receive partial, milestone, or final recognition results.
- Send finalize-recognition or stop-recognition when needed.
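The outbound half of this flow can be sketched as a generator that yields frames in the order a client would send them. This is a minimal sketch, not a reference client: `recognition_frames` is a hypothetical name, obtaining the token and opening the connection are assumed to happen elsewhere, and text frames are represented as `str` while audio frames are `bytes`. It ends with finalize-recognition, which is the graceful path.

```python
import json
from typing import Iterable, Iterator, Union

Frame = Union[str, bytes]  # str = JSON text frame, bytes = binary audio frame


def recognition_frames(token: str, model_name: str, sample_rate: int,
                       audio_chunks: Iterable[bytes]) -> Iterator[Frame]:
    """Yield the frames for one graceful session, in sending order."""
    # Start the session with a recognize message.
    yield json.dumps({
        "message-name": "recognize",
        "audio-format": "pcm",
        "sample-rate": sample_rate,
        "model-name": model_name,
        "Authorization": token,
    })
    # Stream raw audio as binary chunks.
    for chunk in audio_chunks:
        yield chunk
    # Ask the server to finalize and return the final result.
    yield json.dumps({"message-name": "finalize-recognition"})
```

A caller would pass each `str` to its WebSocket library's text-send call and each `bytes` value to the binary-send call, while reading server messages concurrently.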
4. Client-to-Server Messages
Client-to-server messages are sent from the client application to the Speech Recognition WebSocket API.
4.1 recognize
The recognize message starts a speech recognition session.
Payload Fields
| Field | Type | Required | Description |
|---|---|---|---|
| message-name | string | Yes | Must be recognize. |
| audio-format | string | Yes | Audio format. Supported value: pcm. |
| sample-rate | integer | Yes | Sample rate of the audio, for example 8000 or 16000. |
| model-name | string | Yes | Name of the speech recognition model. This should match the ModelName value in the WebSocket URL. |
| model-tenant | string | No | Tenant name of the selected model. If omitted, the default tenant may be used. |
| model-version | string | No | Version of the selected model. If omitted, the default version may be used. |
| audio-splitter | string | No | Audio splitting strategy. Common value: realtime-vad. |
| vad-sensitivity | integer | No | Voice Activity Detection sensitivity. Range: 1-10. Default: 6. |
| vad-pre-speech-buffer-msec | integer | No | Amount of audio kept before detected speech starts, in milliseconds. Default: 300. |
| vad-post-speech-buffer-msec | integer | No | Amount of audio kept after detected speech ends, in milliseconds. Default: 400. |
| vad-max-speech-duration-msec | integer | No | Maximum speech duration in milliseconds. -1 means no limit. Default: -1. |
| vad-silence-trigger-msec | integer | No | Silence duration used to trigger speech end detection, in milliseconds. Default: 400. |
| vad-graceful-silence-threshold-msec | integer | No | Graceful silence threshold in milliseconds. Default: 10000. |
Example: recognize Request
WebSocket URL:
wss://<sr-websocket-host>/recognizer?ModelName=EnglishStream
Payload:
{
"message-name": "recognize",
"audio-format": "pcm",
"sample-rate": 16000,
"model-name": "EnglishStream",
"model-version": "1",
"audio-splitter": "realtime-vad",
"Authorization": "<token>"
}
Example: recognize Request with VAD Parameters
{
"message-name": "recognize",
"audio-format": "pcm",
"sample-rate": 16000,
"model-name": "EnglishStream",
"model-tenant": "Default",
"model-version": "1",
"audio-splitter": "realtime-vad",
"vad-sensitivity": 6,
"vad-pre-speech-buffer-msec": 300,
"vad-post-speech-buffer-msec": 400,
"vad-max-speech-duration-msec": -1,
"vad-silence-trigger-msec": 400,
"vad-graceful-silence-threshold-msec": 10000,
"Authorization": "<token>"
}
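A client may prefer to build this payload programmatically and reject obviously invalid VAD settings before connecting. The following is a sketch under assumptions: `build_recognize_payload` and `VAD_FIELDS` are hypothetical names, only the documented 1-10 range for vad-sensitivity is checked, and VAD fields that are not overridden are simply omitted so that the documented defaults apply.

```python
from typing import Optional

# VAD field names taken from the Payload Fields table.
VAD_FIELDS = {
    "vad-sensitivity",
    "vad-pre-speech-buffer-msec",
    "vad-post-speech-buffer-msec",
    "vad-max-speech-duration-msec",
    "vad-silence-trigger-msec",
    "vad-graceful-silence-threshold-msec",
}


def build_recognize_payload(token: str, model_name: str, sample_rate: int,
                            vad_overrides: Optional[dict] = None) -> dict:
    """Build a recognize payload, validating optional VAD overrides."""
    overrides = dict(vad_overrides or {})
    unknown = set(overrides) - VAD_FIELDS
    if unknown:
        raise ValueError(f"Unknown VAD fields: {sorted(unknown)}")
    if "vad-sensitivity" in overrides and not 1 <= overrides["vad-sensitivity"] <= 10:
        raise ValueError("vad-sensitivity must be in the range 1-10")
    payload = {
        "message-name": "recognize",
        "audio-format": "pcm",
        "sample-rate": sample_rate,
        "model-name": model_name,
        "audio-splitter": "realtime-vad",
        "Authorization": token,
    }
    # Only explicitly overridden VAD fields are sent.
    payload.update(overrides)
    return payload
```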
Important Note About Model Routing
The model name must be provided in the WebSocket URL as a query parameter.
Correct:
wss://<sr-websocket-host>/recognizer?ModelName=EnglishStream
Incorrect:
wss://<sr-websocket-host>/recognizer
If the ModelName query parameter is missing or does not match an available model, the server may not route the WebSocket session to the correct model and may return an error such as:
{
"operation-result": "Cannot find model Default_EnglishStream_1",
"session-id": "93af793b5e564e4e"
}
To avoid this issue:
- Use the correct WebSocket host for your environment.
- Always include ModelName=<ModelName> in the WebSocket URL.
- Make sure the ModelName in the URL matches the model-name in the recognize payload.
- Make sure the selected model, tenant, and version are available for the customer environment.
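The URL and payload checks in this list can be automated before the connection is opened. The sketch below uses a hypothetical function, `routing_problems`, and covers only what the client can verify locally. Whether the model, tenant, and version exist in the environment still has to be confirmed on the server side.

```python
from urllib.parse import parse_qs, urlparse


def routing_problems(url: str, payload: dict) -> list:
    """Return a list of human-readable routing problems (empty if none)."""
    problems = []
    query = parse_qs(urlparse(url).query)
    url_models = query.get("ModelName", [])
    payload_model = payload.get("model-name")
    if not url_models:
        problems.append("URL is missing the ModelName query parameter")
    elif url_models[0] != payload_model:
        problems.append(
            f"URL ModelName {url_models[0]!r} does not match "
            f"payload model-name {payload_model!r}"
        )
    return problems
```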
4.2 stop-recognition
The stop-recognition message stops an ongoing recognition session.
This message may discard unprocessed audio data or recognition events that have not yet been received by the client. Some events that were already generated by the server before the stop request may still be delivered.
Payload
{
"message-name": "stop-recognition"
}
4.3 finalize-recognition
The finalize-recognition message asks the server to finalize the current recognition session and return the final result.
This is typically used when the client has finished sending audio and wants to complete the recognition process gracefully.
Payload
{
"message-name": "finalize-recognition"
}
5. Audio Streaming
After sending the recognize message, the client can start sending audio chunks over the WebSocket connection.
The audio must match the configuration provided in the recognize payload.
For example, if the payload contains:
{
"audio-format": "pcm",
"sample-rate": 16000
}
Then the streamed audio should be:
- PCM audio
- 16-bit signed samples
- 16 kHz sample rate
- Sent in binary audio chunks
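As a rough illustration of chunking, the sketch below splits a buffer of raw PCM into fixed-duration pieces. It assumes mono audio and a 100 ms chunk duration. Neither value is specified in this document, so treat both as placeholders, and `pcm_chunks` as a hypothetical helper name.

```python
from typing import Iterator

BYTES_PER_SAMPLE = 2  # 16-bit signed samples


def pcm_chunks(pcm: bytes, sample_rate: int, chunk_msec: int = 100) -> Iterator[bytes]:
    """Yield consecutive chunks of raw PCM, each about chunk_msec long.

    Assumes mono audio; the chunk duration is an illustrative choice.
    """
    samples_per_chunk = sample_rate * chunk_msec // 1000
    if samples_per_chunk <= 0:
        raise ValueError("chunk_msec is too small for this sample rate")
    # Whole samples per chunk keep every chunk boundary sample-aligned.
    chunk_size = samples_per_chunk * BYTES_PER_SAMPLE
    for start in range(0, len(pcm), chunk_size):
        yield pcm[start:start + chunk_size]
```

At 16 kHz, a 100 ms chunk is 1600 samples, or 3200 bytes. The last chunk may be shorter than the others.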
6. Server-to-Client Messages
Server-to-client messages are sent by the Speech Recognition WebSocket API to the client application.
6.1 recognize-response
Indicates whether the recognition session has started successfully.
Example
{
"message-name": "recognize-response",
"operation-result": "Success",
"recognition-id": "12345"
}
6.2 partial-result
Provides an interim recognition result.
Partial results are not final and may change as more audio is processed.
Example
{
"message-name": "partial-result",
"recognition-id": "12345",
"text": "This is a partial result."
}
6.3 milestone-result
Provides a stable recognition result for a completed speech segment.
Each milestone result covers one completed speech segment, so the full transcript builds up across successive milestones. To obtain the full recognized text, concatenate milestone results in the order they are received.
Example
{
"message-name": "milestone-result",
"recognition-id": "12345",
"text": "This is a milestone result."
}
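Concatenating milestone results in arrival order might look like the following sketch. `TranscriptAssembler` is a hypothetical class. Joining segments with a single space is an assumption, and the latest partial result is kept separately because this document does not state how partial text relates to earlier milestones.

```python
import json


class TranscriptAssembler:
    """Collect milestone results in the order they arrive."""

    def __init__(self) -> None:
        self.milestones = []
        self.latest_partial = ""

    def feed(self, raw: str) -> None:
        """Consume one JSON text message from the server."""
        message = json.loads(raw)
        name = message.get("message-name")
        if name == "partial-result":
            # Partial text is provisional and may change.
            self.latest_partial = message.get("text", "")
        elif name == "milestone-result":
            self.milestones.append(message.get("text", ""))
            self.latest_partial = ""

    def stable_text(self) -> str:
        """Return all milestone texts joined in arrival order."""
        # Assumption: segments are separated by a single space.
        return " ".join(text for text in self.milestones if text)
```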
6.4 final-result
Provides the final recognition result.
Example
{
"message-name": "final-result",
"recognition-id": "12345",
"operation-result": "Success",
"text": "This is the final result.",
"confidence": "0.95"
}
6.5 stop-recognition-response
Indicates whether the recognition session was stopped successfully.
Example
{
"message-name": "stop-recognition-response",
"operation-result": "Success",
"recognition-id": "12345"
}
6.6 finalize-recognition-response
Indicates whether finalization of the recognition session has started successfully.
Example
{
"message-name": "finalize-recognition-response",
"operation-result": "Success",
"recognition-id": "12345"
}
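A client typically needs to route each incoming message by its message-name. The sketch below groups the message types listed in this section into coarse categories. `classify_server_message` is a hypothetical helper, and it assumes that any operation-result other than "Success" indicates a failure, which matches the examples here but is not stated as a rule.

```python
import json

RESULT_MESSAGES = {"partial-result", "milestone-result", "final-result"}
RESPONSE_MESSAGES = {
    "recognize-response",
    "stop-recognition-response",
    "finalize-recognition-response",
}


def classify_server_message(raw: str) -> tuple:
    """Return (kind, message); kind is 'error', 'result', 'response', or 'unknown'."""
    message = json.loads(raw)
    outcome = message.get("operation-result")
    # Assumption: a present operation-result that is not "Success" is an error.
    if outcome is not None and outcome != "Success":
        return "error", message
    name = message.get("message-name")
    if name in RESULT_MESSAGES:
        return "result", message
    if name in RESPONSE_MESSAGES:
        return "response", message
    return "unknown", message
```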
7. Troubleshooting
7.1 Error: Cannot find model
Example Error
{
"operation-result": "Cannot find model Default_EnglishStream_1",
"session-id": "93af793b5e564e4e"
}
Possible Causes
- The WebSocket URL does not include the required ModelName query parameter.
- The ModelName value in the URL does not match the model-name value in the payload.
- The selected WebSocket host does not belong to the customer’s assigned environment.
- The selected model is not available for the tenant.
- The selected model version is not available.
- The tenant value is missing or incorrect.
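To narrow down which cause applies, it can help to see which tenant, model, and version the server actually looked up. The sketch below rests on an inference from the single example above: that the identifier in the error has the form `<tenant>_<model>_<version>`. This format is not documented here, so treat `parse_missing_model` as a hypothetical diagnostic aid only.

```python
from typing import Optional

PREFIX = "Cannot find model "


def parse_missing_model(operation_result: str) -> Optional[dict]:
    """Split the identifier in a 'Cannot find model' error into its parts.

    Assumes the form <tenant>_<model>_<version>, inferred from one example;
    tenant and version are assumed to contain no underscores.
    """
    if not operation_result.startswith(PREFIX):
        return None
    parts = operation_result[len(PREFIX):].split("_")
    if len(parts) < 3:
        return None
    return {
        "tenant": parts[0],
        "model": "_".join(parts[1:-1]),
        "version": parts[-1],
    }
```

Comparing the parsed values with the intended model-tenant, model-name, and model-version may show, for example, that a default tenant or version was applied because the field was omitted.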
Recommended Check
WebSocket URL:
wss://<sr-websocket-host>/recognizer?ModelName=EnglishStream
Payload:
{
"message-name": "recognize",
"audio-format": "pcm",
"sample-rate": 16000,
"model-name": "EnglishStream",
"model-version": "1",
"audio-splitter": "realtime-vad",
"Authorization": "<token>"
}
8. Best Practices
- Use the correct WebSocket host for your environment.
- Always include ModelName in the WebSocket URL.
- Keep the URL ModelName and payload model-name consistent.
- Use a valid authorization token.
- Make sure the audio sample rate matches the sample-rate value in the payload.
- Send audio in the expected format, such as PCM 16-bit.
- Use finalize-recognition when all audio has been sent and a final result is expected.
- Use stop-recognition only when the recognition session should be interrupted or discarded.
