Transcribe in Real-time


The Speech Recognition WebSocket API provides real-time speech-to-text transcription by allowing clients to stream audio continuously and receive recognition results during the session.

This API is typically used for real-time voice scenarios where audio is sent in small chunks over a WebSocket connection.


1. WebSocket Endpoint

The Speech Recognition WebSocket endpoint may differ depending on the customer environment, region, or deployment type.

Use the WebSocket host provided for your environment.

Endpoint Format

wss://<sr-websocket-host>/recognizer?ModelName=<ModelName>

Example

wss://srapi.knovvu.com/recognizer?ModelName=EnglishStream

In this example, srapi.knovvu.com represents the Speech Recognition WebSocket host for a specific environment.

The actual host may be different for other regions, private cloud environments, or on-premises deployments.

The ModelName query parameter is required. It is used during WebSocket routing to direct the request to the correct speech recognition model.

The value must match the model that will be used in the recognition request.

For example, if the client wants to use the EnglishStream model, the WebSocket connection URL should be:

wss://<sr-websocket-host>/recognizer?ModelName=EnglishStream

And the recognition payload should also include:

"model-name": "EnglishStream"
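
One way to keep the two values in sync is to build the URL and the payload from a single variable. The following is a minimal sketch, not part of the API: the helper name build_recognizer_url is hypothetical, and the host shown is the example host from this section, which may differ in your environment.

```python
from urllib.parse import urlencode


def build_recognizer_url(host: str, model_name: str) -> str:
    """Return the WebSocket URL with the required ModelName query parameter."""
    if not model_name:
        raise ValueError("ModelName is required for WebSocket routing")
    # urlencode escapes any characters that are not safe in a query string.
    return f"wss://{host}/recognizer?{urlencode({'ModelName': model_name})}"


model_name = "EnglishStream"
url = build_recognizer_url("srapi.knovvu.com", model_name)
# Reuse the same variable so the URL and the payload cannot drift apart.
payload_fragment = {"model-name": model_name}
print(url)
```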

2. Authentication

To use the service, the client must authenticate with a valid LDM token.

The token should be provided according to the agreed authentication method for the project or tenant.

Example Placeholder

"Authorization": "<token>"

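Replace <token> with the actual authorization token value. In the examples in this document, the token is sent as the Authorization field of the recognize payload.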


3. Recognition Flow

A typical WebSocket recognition flow is:

  1. Obtain a valid authorization token.
  2. Connect to the WebSocket endpoint with the correct host and the required ModelName query parameter.
  3. Send a recognize message to start recognition.
  4. Stream raw audio chunks to the server.
  5. Receive partial, milestone, or final recognition results.
  6. Send finalize-recognition to complete the session gracefully, or stop-recognition to interrupt it.
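
The flow above can be sketched as a single function. This is a simplified, sequential illustration under stated assumptions: conn stands for any connection object with send(data) and recv() methods (a hypothetical interface, adapt it to your WebSocket library), JSON messages are assumed to travel as text and audio as binary, and the token is assumed to be already placed in the recognize payload. A real-time client would normally read results concurrently while streaming rather than after all audio has been sent.

```python
import json


def run_session(conn, recognize_payload, audio_chunks):
    """Run one recognition session and return the server messages received.

    `conn` is any object with send(data) and recv() methods (an assumption;
    adapt it to the WebSocket library you use).
    """
    received = []

    # Step 3: start recognition and wait for the server's answer.
    conn.send(json.dumps(recognize_payload))
    response = json.loads(conn.recv())
    received.append(response)
    if response.get("operation-result") != "Success":
        return received  # e.g. a "Cannot find model" error

    # Step 4: stream raw audio as binary chunks.
    for chunk in audio_chunks:
        conn.send(chunk)

    # Step 6: ask for a graceful finish, then (step 5) read until the final result.
    conn.send(json.dumps({"message-name": "finalize-recognition"}))
    while True:
        message = json.loads(conn.recv())
        received.append(message)
        if message.get("message-name") == "final-result":
            return received
```

The loop relies on recv() raising an error if the connection closes before a final result arrives; a production client would likely add a timeout as well.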

4. Client-to-Server Messages

Client-to-server messages are sent from the client application to the Speech Recognition WebSocket API.


4.1 recognize

The recognize message starts a speech recognition session.

Payload Fields

Field | Type | Required | Description
----- | ---- | -------- | -----------
message-name | string | Yes | Must be recognize.
audio-format | string | Yes | Audio format. Supported value: pcm.
sample-rate | integer | Yes | Sample rate of the audio in Hz, for example 8000 or 16000.
model-name | string | Yes | Name of the speech recognition model. This should match the ModelName value in the WebSocket URL.
model-tenant | string | No | Tenant name of the selected model. If omitted, the default tenant may be used.
model-version | string | No | Version of the selected model. If omitted, the default version may be used.
audio-splitter | string | No | Audio splitting strategy. Common value: realtime-vad.
vad-sensitivity | integer | No | Voice Activity Detection sensitivity. Range: 1-10. Default: 6.
vad-pre-speech-buffer-msec | integer | No | Amount of audio kept before detected speech starts, in milliseconds. Default: 300.
vad-post-speech-buffer-msec | integer | No | Amount of audio kept after detected speech ends, in milliseconds. Default: 400.
vad-max-speech-duration-msec | integer | No | Maximum speech duration in milliseconds. -1 means no limit. Default: -1.
vad-silence-trigger-msec | integer | No | Silence duration, in milliseconds, used to trigger speech end detection. Default: 400.
vad-graceful-silence-threshold-msec | integer | No | Graceful silence threshold in milliseconds. Default: 10000.

Example: recognize Request

WebSocket URL:

wss://<sr-websocket-host>/recognizer?ModelName=EnglishStream

Payload:

{
  "message-name": "recognize",
  "audio-format": "pcm",
  "sample-rate": 16000,
  "model-name": "EnglishStream",
  "model-version": "1",
  "audio-splitter": "realtime-vad",
  "Authorization": "<token>"
}

Example: recognize Request with VAD Parameters

{
  "message-name": "recognize",
  "audio-format": "pcm",
  "sample-rate": 16000,
  "model-name": "EnglishStream",
  "model-tenant": "Default",
  "model-version": "1",
  "audio-splitter": "realtime-vad",
  "vad-sensitivity": 6,
  "vad-pre-speech-buffer-msec": 300,
  "vad-post-speech-buffer-msec": 400,
  "vad-max-speech-duration-msec": -1,
  "vad-silence-trigger-msec": 400,
  "vad-graceful-silence-threshold-msec": 10000,
  "Authorization": "<token>"
}
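
Payloads like the two above could be assembled by a small helper that fills in the required fields and checks the documented vad-sensitivity range before anything is sent. This is only a sketch: the function name build_recognize_payload and the convention of writing options with underscores are hypothetical, and the Authorization field follows the examples in this document.

```python
def build_recognize_payload(model_name, sample_rate, token, **options):
    """Build a recognize payload; keyword options use underscores for hyphens."""
    payload = {
        "message-name": "recognize",
        "audio-format": "pcm",  # the only supported value listed in the field table
        "sample-rate": sample_rate,
        "model-name": model_name,
    }
    # Optional fields, e.g. vad_sensitivity=6 becomes "vad-sensitivity": 6.
    for key, value in options.items():
        payload[key.replace("_", "-")] = value

    sensitivity = payload.get("vad-sensitivity")
    if sensitivity is not None and not 1 <= sensitivity <= 10:
        raise ValueError("vad-sensitivity must be in the range 1-10")

    payload["Authorization"] = token
    return payload


payload = build_recognize_payload(
    "EnglishStream", 16000, "<token>",
    model_version="1", audio_splitter="realtime-vad", vad_sensitivity=6,
)
```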

Important Note About Model Routing

The model name must be provided in the WebSocket URL as a query parameter.

Correct:

wss://<sr-websocket-host>/recognizer?ModelName=EnglishStream

Incorrect:

wss://<sr-websocket-host>/recognizer

If the ModelName query parameter is missing or does not match an available model, the server may fail to route the WebSocket session to the correct model and return an error such as:

{
  "operation-result": "Cannot find model Default_EnglishStream_1",
  "session-id": "93af793b5e564e4e"
}

To avoid this issue:

  • Use the correct WebSocket host for your environment.
  • Always include ModelName=<ModelName> in the WebSocket URL.
  • Make sure the ModelName in the URL matches the model-name in the recognize payload.
  • Make sure the selected model, tenant, and version are available for the customer environment.
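
The first three checks in this list can be automated on the client before connecting. The sketch below is illustrative only: check_model_routing is a hypothetical helper, and it cannot verify the last point, since model, tenant, and version availability is known only to the server.

```python
from urllib.parse import parse_qs, urlsplit


def check_model_routing(url, payload):
    """Return a list of problems found in the URL/payload pair (empty if none)."""
    problems = []
    query = parse_qs(urlsplit(url).query)
    url_models = query.get("ModelName", [])
    if not url_models:
        problems.append("ModelName query parameter is missing from the URL")
    elif url_models[0] != payload.get("model-name"):
        problems.append("URL ModelName does not match payload model-name")
    return problems
```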

4.2 stop-recognition

The stop-recognition message stops an ongoing recognition session.

Stopping may discard audio that has not yet been processed and recognition events that the client has not yet received. Events that the server generated before the stop request may still be delivered.

Payload

{
  "message-name": "stop-recognition"
}

4.3 finalize-recognition

The finalize-recognition message asks the server to finalize the current recognition session and return the final result.

This is typically used when the client has finished sending audio and wants to complete the recognition process gracefully.

Payload

{
  "message-name": "finalize-recognition"
}
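
The choice between the two closing messages comes down to whether the client still wants a result. A minimal sketch of that decision, with a hypothetical helper name and flag:

```python
import json


def closing_message(keep_result: bool) -> str:
    """Return the JSON text of the message that ends the session."""
    # finalize-recognition: finish gracefully and wait for the final result.
    # stop-recognition: interrupt the session; pending results may be discarded.
    name = "finalize-recognition" if keep_result else "stop-recognition"
    return json.dumps({"message-name": name})
```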

5. Audio Streaming

After sending the recognize message, the client can start sending audio chunks over the WebSocket connection.

The audio must match the configuration provided in the recognize payload.

For example, if the payload contains:

{
  "audio-format": "pcm",
  "sample-rate": 16000
}

Then the streamed audio should be:

  • PCM audio
  • 16-bit signed samples
  • 16 kHz sample rate
  • Sent in binary audio chunks
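
As an illustration of chunked sending, the sketch below splits a PCM buffer into fixed-duration pieces. The 100 ms chunk length and the single-channel (mono) layout are assumptions made for this example; this document does not specify a chunk size or channel count.

```python
def iter_pcm_chunks(pcm: bytes, sample_rate: int, chunk_msec: int = 100):
    """Yield consecutive chunks of 16-bit mono PCM, each about chunk_msec long."""
    bytes_per_sample = 2  # 16-bit signed samples, one channel (assumed)
    chunk_size = sample_rate * chunk_msec // 1000 * bytes_per_sample
    if chunk_size <= 0:
        raise ValueError("chunk_msec is too small for this sample rate")
    for start in range(0, len(pcm), chunk_size):
        yield pcm[start:start + chunk_size]


one_second = bytes(16000 * 2)  # one second of silence at 16 kHz
chunks = list(iter_pcm_chunks(one_second, sample_rate=16000))
print(len(chunks), len(chunks[0]))  # prints "10 3200"
```

Each chunk would then be sent as one binary WebSocket message.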

6. Server-to-Client Messages

Server-to-client messages are sent by the Speech Recognition WebSocket API to the client application.


6.1 recognize-response

Indicates whether the recognition session has started successfully.

Example

{
  "message-name": "recognize-response",
  "operation-result": "Success",
  "recognition-id": "12345"
}

6.2 partial-result

Provides an interim recognition result.

Partial results are not final and may change as more audio is processed.

Example

{
  "message-name": "partial-result",
  "recognition-id": "12345",
  "text": "This is a partial result."
}

6.3 milestone-result

Provides a stable recognition result for a completed speech segment.

Milestone results accumulate over the session: each one covers a completed speech segment. To obtain the full recognized text, concatenate milestone results in the order they are received.

Example

{
  "message-name": "milestone-result",
  "recognition-id": "12345",
  "text": "This is a milestone result."
}
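
A client can combine partial and milestone results into a running transcript. The following is a minimal sketch, assuming that a milestone supersedes the partial text shown before it and that segments are joined with a single space; neither detail is specified in this document, and the Transcript class is hypothetical.

```python
class Transcript:
    """Track stable milestone text and the latest interim partial text."""

    def __init__(self):
        self.milestones = []
        self.partial = ""

    def handle(self, message: dict) -> None:
        name = message.get("message-name")
        if name == "partial-result":
            # Partial results replace each other; they are not final.
            self.partial = message.get("text", "")
        elif name == "milestone-result":
            # A milestone closes a segment, so the pending partial is dropped.
            self.milestones.append(message.get("text", ""))
            self.partial = ""

    def text(self) -> str:
        # Stable segments in arrival order, followed by any pending partial.
        parts = self.milestones + ([self.partial] if self.partial else [])
        return " ".join(parts)
```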

6.4 final-result

Provides the final recognition result.

Example

{
  "message-name": "final-result",
  "recognition-id": "12345",
  "operation-result": "Success",
  "text": "This is the final result.",
  "confidence": "0.95"
}

6.5 stop-recognition-response

Indicates whether the recognition session was stopped successfully.

Example

{
  "message-name": "stop-recognition-response",
  "operation-result": "Success",
  "recognition-id": "12345"
}

6.6 finalize-recognition-response

Indicates whether finalization of the recognition session has started successfully.

Example

{
  "message-name": "finalize-recognition-response",
  "operation-result": "Success",
  "recognition-id": "12345"
}
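
Incoming messages can be normalized into one structure before the client acts on them. This sketch assumes the field names shown in the examples in this section and treats any operation-result other than "Success" as a failure; ServerMessage and parse_server_message are hypothetical names.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServerMessage:
    name: Optional[str]
    result: Optional[str]
    text: Optional[str]
    confidence: Optional[float]

    @property
    def failed(self) -> bool:
        # In the examples, partial and milestone results carry no operation-result.
        return self.result is not None and self.result != "Success"


def parse_server_message(raw: str) -> ServerMessage:
    data = json.loads(raw)
    confidence = data.get("confidence")
    return ServerMessage(
        name=data.get("message-name"),
        result=data.get("operation-result"),
        text=data.get("text"),
        # The final-result example carries confidence as a string, so convert it.
        confidence=float(confidence) if confidence is not None else None,
    )
```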

7. Troubleshooting

7.1 Error: Cannot find model

Example Error

{
  "operation-result": "Cannot find model Default_EnglishStream_1",
  "session-id": "93af793b5e564e4e"
}

Possible Causes

  1. The WebSocket URL does not include the required ModelName query parameter.
  2. The ModelName value in the URL does not match the model-name value in the payload.
  3. The selected WebSocket host does not belong to the customer’s assigned environment.
  4. The selected model is not available for the tenant.
  5. The selected model version is not available.
  6. The tenant value is missing or incorrect.

Recommended Check

WebSocket URL:

wss://<sr-websocket-host>/recognizer?ModelName=EnglishStream

Payload:

{
  "message-name": "recognize",
  "audio-format": "pcm",
  "sample-rate": 16000,
  "model-name": "EnglishStream",
  "model-version": "1",
  "audio-splitter": "realtime-vad",
  "Authorization": "<token>"
}
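
When this error appears, it can help to log which tenant, model, and version the server looked up. The sketch below splits the identifier in the error text. Note that the tenant_model_version layout is only inferred from the single example error above and is an assumption, not a documented format; parse_missing_model is a hypothetical helper.

```python
def parse_missing_model(operation_result: str):
    """Split 'Cannot find model <tenant>_<model>_<version>' into its parts.

    The tenant_model_version layout is inferred from the single example
    error and is an assumption, not a documented format.
    """
    prefix = "Cannot find model "
    if not operation_result.startswith(prefix):
        return None
    identifier = operation_result[len(prefix):]
    # First underscore ends the tenant, last underscore starts the version.
    tenant, _, rest = identifier.partition("_")
    model, _, version = rest.rpartition("_")
    if not (tenant and model and version):
        return None
    return {"tenant": tenant, "model": model, "version": version}
```

Comparing the parsed values with the model-tenant, model-name, and model-version fields that were sent may show which of the possible causes applies.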

8. Best Practices

  • Use the correct WebSocket host for your environment.
  • Always include ModelName in the WebSocket URL.
  • Keep the URL ModelName and payload model-name consistent.
  • Use a valid authorization token.
  • Make sure the audio sample rate matches the sample-rate value in the payload.
  • Send audio in the expected format, such as PCM 16-bit.
  • Use finalize-recognition when all audio has been sent and a final result is expected.
  • Use stop-recognition only when the recognition session should be interrupted or discarded.