Transcribe a File
  • 31 Jan 2025

Article summary

This document outlines the usage of the Knovvu SR REST API, which provides two types of speech recognition services:

  • Speech Recognition with Grammar: Performs speech recognition using a grammar file.
  • Speech Dictation with Language Model: Uses a language model for speech dictation.

For more details on these services, please refer to the relevant documentation: Recognition Methods


Authentication

The Knovvu SR service requires a bearer token when sending recognition requests and when using license-required methods, so a token must be obtained before calling these methods.

The required information for token generation includes:

  • API Client ID
  • API Client Secret

These credentials are provided after the subscription is created.

Steps for Token Generation

Send a POST request to the Get Integration Token LDM endpoint to create a bearer token.

Request Example (cURL)

curl --location --request POST 'https://identity.ldm.knovvu.com/connect/token' \
--header 'Content-Type: application/x-www-form-urlencoded' \
--data-urlencode 'client_id=[client_id]' \
--data-urlencode 'client_secret=[client_secret]' \
--data-urlencode 'grant_type=client_credentials' \
--data-urlencode 'scope=Ldm_Integration'

Response Example

{
  "access_token": "[token]",
  "expires_in": 31536000,
  "token_type": "Bearer",
  "scope": "Ldm_Integration"
}
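In practice the token call can be scripted. The sketch below (Python, standard library only; endpoint, field names, and response shape taken from the examples above) builds the form-encoded request body and extracts the bearer token from the JSON response:

```python
import json
from urllib.parse import urlencode

# Token endpoint from the cURL example above.
TOKEN_URL = "https://identity.ldm.knovvu.com/connect/token"

def build_token_request_body(client_id: str, client_secret: str) -> bytes:
    """Form-encoded body for the token request (grant_type and scope are fixed)."""
    return urlencode({
        "client_id": client_id,
        "client_secret": client_secret,
        "grant_type": "client_credentials",
        "scope": "Ldm_Integration",
    }).encode("utf-8")

def extract_bearer_token(response_text: str) -> str:
    """Build the Authorization header value from the JSON token response."""
    payload = json.loads(response_text)
    return f'{payload["token_type"]} {payload["access_token"]}'
```

The body is then POSTed with Content-Type: application/x-www-form-urlencoded, and the returned value goes into the Authorization header of subsequent requests.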

Using the API After Authentication

Once authentication is completed and a valid bearer token is obtained, you can start using the speech recognition functionalities of the API. There are two primary methods for recognizing speech, depending on your use case:

  • Speech Recognition with Grammar: Performs speech recognition using a predefined grammar file. This method is ideal for structured scenarios where the recognized speech needs to match a set of predefined words or phrases.
  • Speech Dictation with Language Model: Uses a language model to recognize free-form speech. This method is suitable for scenarios where users are expected to speak naturally without predefined constraints.

Speech Recognition with Grammar

The API enables speech recognition using a grammar file. The process involves the following steps:

1. Selecting a Grammar File

  • Multiple grammar files can be loaded onto the server.
  • For each recognition request, specify the grammar file to be used.

2. Sending the Audio File

  • Submit an audio file along with the selected grammar file to the service.

3. Receiving the Recognized Text

  • The service returns the recognized text based on the specified grammar file.

Additionally, the service includes a speech validation feature. This feature is used for short audio files to verify whether they contain the expected text. In this case, a grammar file does not need to be supplied, as the service automatically generates one for validation purposes.

Grammar Files

Grammars are used by speech recognizers and other grammar processors to define the words and patterns to be recognized. Developers can specify words and structures for speech recognition.

There are two types of grammars:

  • List Grammar: A simple grammar format developed by Sestek, similar to a word list.

For example, a file containing the following lines:

Apple
Banana

Note:
A valid grammar file in this format contains only the words or phrases to be recognized, one per line (the example above has two entries). This format is supported only for the Turkish language and its use is discouraged, although support continues for backward compatibility. For new projects, SRGS-formatted grammar files are recommended.


SRGS XML Grammar

SRGS XML is a W3C specification; the service supports the SRGS XML format together with SISR (Semantic Interpretation for Speech Recognition) and NLSML (Natural Language Semantics Markup Language).

A basic SRGS grammar, which defines a list of words to be recognized, looks like this:

<?xml version="1.0" encoding="UTF-8" ?>
<grammar mode="voice" tag-format="semantics/1.0" xml:lang="en-US" version="1.0" root="main">
    <rule id="main">
        <one-of>
            <item>Apple</item>
            <item>Banana</item>
        </one-of>
    </rule>
</grammar>
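A word-list grammar of this shape can also be generated programmatically. The sketch below (Python standard library; it mirrors the structure of the example above and is not an API helper) builds the same XML from a list of words:

```python
import xml.etree.ElementTree as ET

def build_srgs_grammar(words, lang="en-US"):
    """Build a minimal SRGS XML grammar: one <one-of> list of <item> words."""
    grammar = ET.Element("grammar", {
        "mode": "voice",
        "tag-format": "semantics/1.0",
        "xml:lang": lang,
        "version": "1.0",
        "root": "main",
    })
    rule = ET.SubElement(grammar, "rule", {"id": "main"})
    one_of = ET.SubElement(rule, "one-of")
    for word in words:
        ET.SubElement(one_of, "item").text = word
    return ET.tostring(grammar, encoding="unicode")
```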

Typical Usage Scenarios

The typical usage scenarios for the SR REST API are:

  • List Available Grammars (GET)
  • Create New Grammar (POST)
  • Make Speech Recognition (POST)

List Available Grammars

Actor: REST Service Client
Goal: Retrieve available grammars from the speech recognition service.

Step | REST Service Client Actions | REST Service Actions
1 | Send a GET request to the Grammars endpoint. | Returns information about all available grammars in JSON format.

Post Conditions

  • None

Business Rules

  • No license is required to call this service.

Create New Grammar

Actor: REST Service Client
Goal: Define a new grammar in the speech recognition service to be used in future recognition requests.

Pre-Conditions

  • The user must create a new grammar file using a text editor.
  • The grammar file must be valid.

Steps for Creating a New Grammar

Step | REST Service Client Actions | REST Service Actions
1 | POST a new grammar file to the Grammars endpoint. | Saves the grammar file content on the server.

Post Conditions

  • The newly uploaded grammar file can be used in speech recognition requests by specifying its name.

Business Rules

  • If a POST request uploads a grammar file with an existing name, it will overwrite the current grammar file.
  • No license is required to use this service.

Other Notes (Assumptions, Issues, Special Requirements)

  • Grammar file names should be in ANSI format (as they are used in HTTP header fields).
  • Grammar files should be encoded in UTF-8.

Make Speech Recognition

Actor: REST Service Client
Goal: Perform speech recognition using an audio file and a predefined grammar.

Pre-Conditions

  • Valid grammar name
    • The grammar must be selected from the available grammars.
  • Valid audio file
    • The audio file must be prerecorded by the user.

Steps for Speech Recognition

Step | REST Service Client Actions | REST Service Actions
1 | Send a POST request to the Request endpoint with the audio file and the grammar name to be used in speech recognition. | Returns the recognition result in JSON format.

Post Conditions

  • None

Business Rules

  • Common audio formats are mostly supported (e.g., .wav, .opus, .mp3).
  • If the audio file is uncompressed, it is recommended to compress it before sending to improve network usage.
  • Opus format is recommended for compression, as it has the least impact on recognition accuracy.

GET Grammars

  • URL: v1/speech/recognition/grammars
  • Method: GET

Summary
Retrieves information about available grammar files.

Description
The Grammars endpoint returns a list of available grammars that can be used in speech recognition requests.

Note: No license is required to call this service.


Response Fields

The response contains essential information about each grammar file.

Name | Description
Grammars | An array of objects containing grammar details such as name, ID, and content type.
Success | True: the request was successful. False: the request failed.
ErrorMessage | Error message in case of a failed request.
ErrorCode | Error code when the request fails (e.g., Internal Server Error).
MoreInfo | Additional information about the response.

Success Response Example

{
  "grammars": [
    {
      "id": 125,
      "name": "my42",
      "tenant": "default",
      "type": "application/srgs+xml"
    }
  ],
  "success": true,
  "errorMessage": null,
  "errorCode": null,
  "moreInfo": null
}

Error Response Example

{
  "success": false,
  "errorMessage": "Unexpected Error",
  "errorCode": "internal-service-error",
  "moreInfo": null
}
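A client typically uses this listing to look up a grammar's server-side id, which the GET Specific Grammar and DELETE Grammars endpoints require. A Python sketch, assuming the response shape shown above:

```python
import json

def find_grammar_id(list_response_text: str, name: str):
    """Return the id of the named grammar from a GET Grammars response, or None."""
    payload = json.loads(list_response_text)
    if not payload.get("success"):
        raise RuntimeError(payload.get("errorMessage") or "grammar listing failed")
    for grammar in payload.get("grammars") or []:
        if grammar["name"] == name:
            return grammar["id"]
    return None
```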

GET Specific Grammar

  • URL: v1/speech/recognition/grammars/{id}
  • Method: GET
  • Summary: Retrieves the content of a specific grammar.
  • Description: Allows users to download a grammar file from the SR server.

Note: No license is required to call this service.


DELETE Grammars

  • URL: v1/speech/recognition/grammars/{id}
  • Method: DELETE
  • Summary: Deletes a grammar file.
  • Description: Allows users to remove a grammar file from the SR server.

Note: No license is required to call this service.

POST Grammars

  • URL: v1/speech/recognition/grammars
  • Method: POST
  • Summary: Defines a new grammar.
  • Description: Allows users to send a new grammar definition to the Grammars endpoint.
    Once added, this grammar can be used in recognition requests.

Details

To send a new grammar definition:

  1. Set the Grammar Name in the request header.

    • Example: GrammarName: NewGrammarName
  2. Set the Tenant of the grammar (optional).

    • If the tenant parameter is not set, the default value will be used.
    • Example: Tenant: NewGrammarTenant
  3. Add the grammar file content as binary data to the request body.

  4. Specify the correct Content-Type in the request header when adding the grammar content.


Supported Content-Types for Grammars

  • SRGS XML:

    application/srgs+xml
    
  • Sestek Custom SRGS List:

    application/x-gslist
    

Test with cURL

To test the POST Grammars request using cURL, use the following command:

curl \
--header "GrammarName: NewSrgsXmlGrammar" \
--header "Tenant: NewGrammarTenant" \
--data-binary "@NewGrammar.grxml" \
-H "Content-Type: application/srgs+xml" \
-X POST "https://{sr-server}/v1/speech/recognition/grammars"

Note:

  • Replace {sr-server} with the actual server URL.
  • Ensure that NewGrammar.grxml is the correct path to your grammar file.

Example Request (SrgsXml Grammar)

Below is an example POST request to send an SrgsXml Grammar to the speech recognition service.

POST http://acme-pc:5000/v1/speech/recognition/grammars HTTP/1.1
Content-Disposition: File; fileName="NewSrgsXmlGrammar"; fileExtension=".grxml"
GrammarName: NewSrgsXmlGrammar
Accept: application/json, application/xml, text/json, text/x-json, text/javascript, text/xml
Accept-charset: utf-8
User-Agent: sestek-speech-recognition-rest--client
Content-Type: application/srgs+xml
Host: acme-pc:5000
Content-Length: 515
Accept-Encoding: gzip, deflate

Example SrgsXml Grammar File

<?xml version="1.0" encoding="UTF-8" ?>
<grammar mode="voice" tag-format="semantics/1.0" xml:lang="en-US" version="1.0" root="main">
    <rule id="main">
        <item>
            <one-of>
                <item>number one<tag> out = "number 1"; </tag></item>
                <item>number two<tag> out = "number 2"; </tag></item>
            </one-of>
        </item>
    </rule>
</grammar>

Note: No license is required to call this service.

Example Request (Sestek Custom Srgs List Grammar)

Below is an example POST request to send a Sestek Custom Srgs List Grammar to the speech recognition service.

POST http://acme-pc:5000/v1/speech/recognition/grammars HTTP/1.1
Content-Disposition: File; fileName="NewlistGrammar"; fileExtension=".txt"
GrammarName: NewlistGrammar
Accept: application/json, application/xml, text/json, text/x-json, text/javascript, text/xml
Accept-charset: utf-8
User-Agent: sestek-speech-recognition-rest--client
Content-Type: application/x-gslist
Host: acme-pc:5000
Content-Length: 32
Accept-Encoding: gzip, deflate

Example Sestek Custom Srgs List Grammar File

Mersin
Adana
Corum
Kastamonu

Request Fields

Name | Description
GrammarName | Specify a custom name for the provided grammar.
Tenant | Specify a tenant for the provided grammar.
Content-Type | Can take two valid values: SrgsXml = "application/srgs+xml"; SestekCustomSrgsList = "application/x-gslist".

Success Response Example

{
  "success": true,
  "id": 126,
  "errorMessage": null,
  "errorCode": null,
  "moreInfo": null
}

Error Response Example

{
  "success": false,
  "errorMessage": "Recognition-Parameters Are Not Defined At Header",
  "errorCode": "missing-parameter",
  "moreInfo": null
}

Response Fields

The response is returned in JSON format.

Name | Description
Success | True: the request succeeded. False: the request failed.
ErrorMessage | If the request fails, this field contains the failure message.
ErrorCode | If the request fails, this field contains the failure error code (e.g., Internal Service Error).
MoreInfo | Any additional information about the response.
id | A server-generated unique ID for the grammar. This ID is required when downloading or deleting the grammar from the server.

Note:

  • No license is required to call this service.
  • If a grammar file with the same name already exists, the new upload will override the existing grammar content.

POST Generate Grammar

  • URL: v1/speech/recognition/grammars/generator
  • Method: POST
  • Summary: Generates a new grammar.
  • Description: Converts a word list into an SRGS-XML grammar.

Note: No license is required to call this service.


Details

In the request header:

  • Content-Type must be text/plain.
  • You can specify the language of the grammar.
    • If the language parameter is not set, it defaults to "tr-TR".
    • Example: language: "en-US"
  • The request body must contain a word list, where each word is on a new line.

Request Example with cURL

curl -X POST "http://{sr-server}/v1/speech/recognition/grammars/generator" \
-H "Content-Type: text/plain" \
-H "language: tr-TR" \
-d "merhaba
nasılsın"

Request Fields

Name | Description
Language | Specify a language for the generated grammar.
Content-Type | Must be text/plain.

Success Response Example

<?xml version="1.0" encoding="UTF-8" ?>
<grammar mode="voice" tag-format="semantics/1.0" xml:lang="tr-TR" version="1.0" root="main">
    <rule id="main">
        <one-of>
            <item>merhaba<tag>out = "merhaba";</tag></item>
            <item>nasılsın<tag>out = "nasılsın";</tag></item>
        </one-of>
    </rule>
</grammar>

Error Response Example

{
  "errorCode": "cannot-generate-grammar",
  "errorMessage": "failed to generate grammar",
  "moreInfo": "Content-Type has not been specified",
  "success": false
}

POST Recognition Request

  • URL: v1/speech/recognition/request
  • Method: POST
  • Summary: Performs speech recognition.
  • Description:
    Allows users to send a speech recognition request with an audio file and grammar name.
    The service will return the recognized text from the speech.

How to Send a Request:

To perform a speech recognition request:

  1. Include the following mandatory HTTP header fields:

    • Frequency (e.g., 8000)
    • GrammarName (e.g., "SimpleTestGrammar")
  2. The following fields are optional:

    • Tenant (Defaults to "Default" if not set)
    • SendAudioDownloadLink (e.g., true)

Example parameters:

- Frequency: 8000
- GrammarName: SimpleTestGrammar
- SendAudioDownloadLink: true

If the Speech Recognition Service requires a license (e.g., when used in the cloud),
you must send a bearer token as Authorization in the request header.

Once the request headers are set, the audio file should be sent as binary data in the request body.
The Content-Type of the audio file must also be specified.


Supported Audio Mime Types:

  • audio/opus
  • audio/wav
  • audio/wave

Request Example with cURL:

curl --location --request POST 'https://sr.knovuapi.com/v1/speech/recognition/request' \
--header 'Content-Type: audio/wave' \
--header 'GrammarName: SampleGrammar' \
--header 'Tenant: Default' \
--header 'Authorization: Bearer [token]' \
--data-binary '@/C:/Program Files/Sestek/SR/data/hello-world.wav'

Example Request (For WAV File):

Below is an example POST request to send a WAV audio file for speech recognition.

POST https://acme-pc:5000/v1/speech/recognition/request HTTP/1.1
Content-Disposition: file; fileName="08-tr-recognition-audio"; fileExtension=".wav"

Frequency: 8000
GrammarName: NewGrammar
SendAudioDownloadLink: True
Authorization: Bearer [token]

Accept: application/json, application/xml, text/json, text/x-json, text/javascript, text/xml
Accept-charset: utf-8
User-Agent: sestek-speech-recognition-rest--client
Content-Type: audio/wav
Host: acme-pc:5000
Content-Length: 18046
Accept-Encoding: gzip, deflate

RIFF WAVEfmt ...

Example Request (For Opus File):

Below is an example POST request to send an Opus audio file for speech recognition.

POST https://acme-pc:5000/v1/speech/recognition/request HTTP/1.1
Content-Disposition: file; fileName="09-tr-recognition-audio"; fileExtension=".opus"

Frequency: 8000
GrammarName: NewGrammar
SendAudioDownloadLink: True
Authorization: Bearer [token]

Accept: application/json, application/xml, text/json, text/x-json, text/javascript, text/xml
Accept-charset: utf-8
User-Agent: sestek-speech-recognition-rest--client
Content-Type: audio/opus
Host: acme-pc:5000
Content-Length: 9918
Accept-Encoding: gzip, deflate

OggS ..

Request Fields:

Name | Description
Frequency | The frequency at which the operation will be performed. If not specified, the default value is 8000. This is not necessarily the frequency of the audio file sent; the sampling rate is determined from the audio format.
GrammarName | The name of the grammar used for this recognition.
Tenant | The tenant of the grammar used for this recognition.
SendAudioDownloadLink | If set to True, the server will host the audio file and provide a download link in the response.
Authorization: Bearer [token] | Required for cloud usage. If using this REST service via cloud, a token must be included in the request. API Client ID and API Client Secret parameters are needed to generate a token.
Content-Type | Specifies the type of audio file being sent (e.g., audio/[type]).

Success Response Example:

{
  "confidence": 0.99,
  "recognizedText": "Mersin",
  "semanticResult": "Mersin",
  "speechStartTimeMsec": 0,
  "speechEndTimeMsec": 1125,
  "audioLink": "http://acme-pc/...",
  "success": true,
  "errorMessage": null,
  "errorCode": null,
  "moreInfo": null
}

Error Response Example:

{
  "success": false,
  "errorMessage": "GrammarName has not been specified",
  "errorCode": "missing-parameter",
  "moreInfo": null
}

Response Fields:

Name | Description
RecognizedText | The plain-text recognition result.
SemanticResult | A machine-processable representation of the recognized text, which can be more detailed than a standard transcript.
SpeechStartTimeMsec | The time (in milliseconds) when speech started in the provided audio file.
SpeechEndTimeMsec | The time (in milliseconds) when speech ended in the provided audio file.
Confidence | The confidence score of the recognition result (range [0, 1]). A value of 1 indicates absolute confidence.
AudioLink | A link to the hosted audio file if SendAudioDownloadLink = true.
Success | True: the request succeeded. False: the request failed.
ErrorMessage | If the request fails, this field contains the failure message.
ErrorCode | If the request fails, this field contains the failure error code (e.g., Internal Service Error).
MoreInfo | Any additional information about the response.
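Handling the response then reduces to checking Success and reading the fields above. A small Python sketch against the sample responses shown earlier:

```python
import json

def summarize_recognition(response_text: str):
    """Return (recognizedText, confidence, speech duration in msec) or raise on failure."""
    payload = json.loads(response_text)
    if not payload.get("success"):
        raise RuntimeError(f'{payload.get("errorCode")}: {payload.get("errorMessage")}')
    duration_msec = payload["speechEndTimeMsec"] - payload["speechStartTimeMsec"]
    return payload["recognizedText"], payload["confidence"], duration_msec
```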

Note:
Based on service configuration, a license may or may not be required to use this service.

POST Validation

  • URL: v1/speech/recognition/validation
  • Method: POST
  • Summary: Validates whether the provided audio contains a specific text.
  • Description:
    This endpoint is useful for cases where you have a relatively short audio file and need to determine if its content matches a given short text.

The Validation endpoint returns information about whether the provided audio matches the given text.


How to Send Request:

  • The validation request parameters should be sent as multipart/form-data.

End-Point Test with cURL:

Below is an example cURL request for testing the POST Validation endpoint.

curl \
--form validation-parameters='{
  "validationText": "Alaska",
  "language": "en-US",
  "sendDownloadLink": true,
  "Authorization": "Bearer [token]"
};type=application/json' \
--form 'upload=@Alaska.wav;type=audio/wav' \
-X POST "http://[server-url]/v1/speech/recognition/validation"

Notes:

  • Replace [server-url] with the real server URL.
  • Replace [token] with a valid authorization token.
  • The audio file (Alaska.wav) should be correctly formatted as audio/wav.

Example Request:

Below is an example POST request for speech validation.

POST http://acme-pc:11000/v1/speech/recognition/validation HTTP/1.1
Host: acme-pc:11000
User-Agent: curl/7.48.0
Accept: */*
Content-Length: 44567
Expect: 100-continue
Content-Type: multipart/form-data; boundary=------------------------decd0b63ed9a4bf6

------------------------decd0b63ed9a4bf6
Content-Disposition: form-data; name="validation-parameters"
Content-Type: application/json

{
  "validationText": "Alaska",
  "language": "en-US",
  "frequency": 8000,
  "mediaType": "Wav",
  "sendDownloadLink": true,
  "Authorization": "Bearer [token]"
}

------------------------decd0b63ed9a4bf6
Content-Disposition: form-data; name="upload"; filename="Alaska.wav"
Content-Type: audio/wav

RIFF WAVEfmt ...

Explanation:

  • Validation parameters are sent as JSON inside a multipart/form-data request.
  • Audio file (Alaska.wav) is attached as binary data.
  • Authorization token (Bearer [token]) is required for authentication.
  • The Content-Type specifies the file format (audio/wav).

Note: Replace [token] with a valid authorization token.
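Outside of cURL, the multipart body can be assembled by hand with the standard library alone (most HTTP client libraries can also do this for you). A sketch assuming the part names shown above ("validation-parameters" and "upload"):

```python
import json
import uuid

def build_validation_body(validation_params: dict, audio_bytes: bytes,
                          audio_filename: str = "audio.wav",
                          audio_type: str = "audio/wav"):
    """Assemble the multipart/form-data body for POST Validation.

    Returns (body_bytes, content_type_header_value); the boundary is random.
    """
    boundary = uuid.uuid4().hex
    prefix = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="validation-parameters"\r\n'
        "Content-Type: application/json\r\n\r\n"
        f"{json.dumps(validation_params)}\r\n"
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="upload"; filename="{audio_filename}"\r\n'
        f"Content-Type: {audio_type}\r\n\r\n"
    )
    body = prefix.encode("utf-8") + audio_bytes + f"\r\n--{boundary}--\r\n".encode()
    return body, f"multipart/form-data; boundary={boundary}"
```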

Request Fields:

Name | Description
validation-parameters | Form part name for the JSON validation parameters.
validationText | The text that will be checked to determine if the audio content matches.
language | The language of the text, such as tr-TR or en-US.
frequency | The frequency of the audio file.
sendDownloadLink | If set to true, a download link for the uploaded audio file will be provided. The default value is false.
Authorization | If a license is not required, this should be null. If required, provide the bearer token.
Audio binary data | The audio file to be validated, in Opus or WAV format (form part name "upload").

Example Response:

{
  "answer": "valid",
  "moreInfo": "RecognizedText : Alaska",
  "audioLink": "Not Available"
}

Response Fields:

Name | Description
Answer | The answer to the validation request: Valid or NotValid.
Valid | The speech in the audio matches the given control text.
NotValid | The speech in the audio does not match the given control text.
MoreInfo | Additional information about the response, such as error details.
AudioLink | A link to download the input audio file in WAV format.

Speech Dictation with Language Model

This service allows you to perform Speech Dictation using a Language Model.

A language model is a file used by a Speech Dictation Engine to recognize speech.
It contains a large set of words along with their probability of occurrence, making it suitable for dictation applications.

How Language Models Work

  • Language models constrain the search in the decoder by limiting the number of possible words that can be considered at any given time.
  • This results in faster execution and higher accuracy.

How to Use Speech Dictation

  1. Choose one of the available language models.
  2. Send your audio file to the service, specifying the selected language model.

Important Notes

  • You cannot define new language models with this service.
  • You can use predefined language models.

Typical Usage Scenario

The typical usage scenario follows the two steps below: list the available language models, then make a speech dictation request.

List Available Language Models

Actor: REST Service Client
Goal: Retrieve the list of available language models.

Pre-Conditions

  • None

Steps

Step | REST Service Client Actions | REST Service Actions
1 | Send a GET request to the Models endpoint. | Returns a list of available language models in JSON format.

Post Conditions

  • None

Business Rules

  • None

Other Notes

  • No license is required to call this service.

Make Speech Dictation

Actor: REST Service Client
Goal: Perform speech dictation (recognition).

Pre-Conditions

  • Valid model name
    • Must be selected from the Available Models List.
  • Valid audio file
    • Supported formats include wav, opus, mp3, etc.

Steps

Step | REST Service Client Actions | REST Service Actions
1 | Send a POST request to the Request endpoint with the audio file and the selected language model name. | Returns the dictation (recognition) result in JSON format.

Post Conditions

  • None

Business Rules

  • Common audio formats are supported (wav, opus, mp3, etc.).
  • If using uncompressed data, it is recommended to compress it before sending.
  • Opus format is preferred for compression, as it has the least impact on recognition accuracy.

Other Notes

  • None

GET Models

  • URL: v1/speech/dictation/models
  • Method: GET
  • Summary: Retrieves available dictation models information.

Description
The Models endpoint returns information about the available dictation models.
The response includes:

  • Model name
  • Model frequency
  • Total number of models

How to Send Request
Simply send a GET HTTP request to the service endpoint.


End-Point Test with cURL

curl -X GET "https://{sr-server}/v1/speech/dictation/models"

Note:

  • Replace {sr-server} with the real server URL.

Example Response

{
  "models": [
    {
      "name": "TURKISH_GENERAL",
      "tenant": "Default",
      "is_persistent": "true",
      "frequency": 8000,
      "version": 1
    },
    {
      "name": "ENGLISH_GENERAL",
      "tenant": "Default",
      "is_persistent": "true",
      "frequency": 8000,
      "version": 1
    },
    {
      "name": "BankingTurkish",
      "tenant": "Default",
      "is_persistent": "false",
      "frequency": 8000,
      "version": 1
    }
  ],
  "modelsCount": 3,
  "success": true,
  "errorMessage": null,
  "errorCode": null,
  "moreInfo": null
}

Explanation:

  • The response contains a list of available dictation models.
  • Each model includes:
    • Name (e.g., "TURKISH_GENERAL", "ENGLISH_GENERAL")
    • Tenant (Default tenant)
    • Persistence (Indicates if the model is persistent)
    • Frequency (e.g., 8000 Hz)
    • Version (Version number of the model)
  • The total number of models is provided as "modelsCount": 3.
  • Success is true if the request was successful.
  • Error fields (errorMessage, errorCode, moreInfo) are null if no errors occurred.
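A client typically parses this listing to select a model before making a dictation request. A Python sketch against the response shape above:

```python
import json

def pick_model(models_response_text: str, name: str):
    """Look up a dictation model by name; return (version, frequency) or raise KeyError."""
    payload = json.loads(models_response_text)
    if not payload.get("success"):
        raise RuntimeError(payload.get("errorMessage") or "model listing failed")
    for model in payload.get("models") or []:
        if model["name"] == name:
            return model["version"], model["frequency"]
    raise KeyError(name)
```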

Response Fields:

Name | Description
Models | An array of model information detailing the available dictation models.
Name | The name of the model.
Tenant | The tenant associated with the model.
Is Persistent | If true, the model cannot be deleted by an LMS update.
Frequency | The frequency of the model.
Version | The version of the model.
ModelsCount | The total number of models available in the service.
Success | True: the request succeeded. False: the request failed.

Error Fields:

Name | Description
ErrorMessage | If the request fails, this field contains the failure message.
ErrorCode | If the request fails, this field contains the failure error code (e.g., Internal Service Error).
MoreInfo | Any additional details about the response.

Note:

  • No license is required to call this service.

Get Specific Model

  • URL:
    v1/speech/dictation/models?ModelName={ModelName}&ModelVersion={ModelVersion}&Tenant={ModelTenant}
    
  • Method: GET
  • Summary: Retrieves the specified dictation model as a zip file.
  • Description:
    Allows users to download a specific model from the SR server.

End-Point Test with cURL:

curl -X GET "https://{sr-server}/v1/speech/dictation/models?ModelName={ModelName}&ModelVersion={ModelVersion}&Tenant={ModelTenant}"

Notes:

  • Replace {sr-server} with the real server URL.
  • Replace {ModelName}, {ModelVersion}, and {ModelTenant} with the appropriate model details.

Add Model

  • URL: v1/speech/dictation/models
  • Method: POST
  • Summary: Adds a model to the model list.
  • Description:
    Allows users to upload a new model to the SR server.

How to Send a Request:

To add a model, you must send the following parameters:

  • Mandatory Parameters:

    • ModelName
    • ModelVersion
  • Optional Parameters (default values are used if not provided):

    • Tenant (Default)
    • IsPersistent (false)

Additionally, the model content must be included as a zipped file in the request body.


End-Point Test with cURL:

curl \
--header "ModelName: TURKISH_GENERAL" \
--header "ModelVersion: 1" \
--header "IsPersistent: true" \
--data-binary "@TurkishGeneral.zip" \
-H "Content-Type: application/zip" \
-X POST "https://{server-url}/v1/speech/dictation/models"

Notes:

  • Replace {server-url} with the real server URL.
  • Ensure the model file (TurkishGeneral.zip) is in the correct format.
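The zip packaging and the header fields above can be combined in code. The Python sketch below zips model files in memory and assembles (without sending) the POST; the server name and file names are placeholders:

```python
import io
import zipfile
from urllib.request import Request

def build_model_upload(server: str, model_name: str, model_version: int,
                       files: dict, tenant: str = "Default",
                       persistent: bool = False) -> Request:
    """Zip the given {name: bytes} files in memory and build the Add Model POST."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for arcname, content in files.items():
            zf.writestr(arcname, content)
    req = Request(f"https://{server}/v1/speech/dictation/models",
                  data=buf.getvalue(), method="POST")
    req.add_header("ModelName", model_name)
    req.add_header("ModelVersion", str(model_version))
    req.add_header("Tenant", tenant)
    req.add_header("IsPersistent", "true" if persistent else "false")
    req.add_header("Content-Type", "application/zip")
    return req
```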

Delete Model

  • URL: v1/speech/dictation/models
  • Method: DELETE
  • Summary: Deletes a model from the model list.
  • Description:
    Allows users to remove a model from the SR server.

How to Send a Request:

To delete a model, send the following parameters:

  • Mandatory Parameters:

    • ModelName
    • ModelVersion
  • Optional Parameter (default value is used if not provided):

    • Tenant (Default)

End-Point Test with cURL:

curl \
--header "ModelName: TURKISH_GENERAL" \
--header "Tenant: ModelTenant" \
--header "ModelVersion: 1" \
-X DELETE "https://{server-url}/v1/speech/dictation/models"

Set Model Default Version

  • URL: /models/defaultversion
  • Method: POST
  • Summary: Sets the default version of a model.
  • Description:
    Allows users to change the default version of a model in the SR server.

How to Send a Request:

To set the default version of a model, send the following parameters:

  • Mandatory Parameters:

    • ModelName
    • ModelVersion
  • Optional Parameter (default value is used if not provided):

    • Tenant (Default)

End-Point Test with cURL:

Below is an example cURL request to set the default version of a model.

curl \
--header "ModelName: TURKISH_GENERAL" \
--header "Tenant: ModelTenant" \
--header "ModelVersion: 1" \
-X POST "https://{server-url}/models/defaultversion"

Notes:

  • Replace {server-url} with the real server URL.
  • Replace {ModelName}, {ModelVersion}, and {ModelTenant} with the appropriate model details.

POST Dictation Request

  • URL: v1/speech/dictation/request
  • Method: POST
  • Summary: Performs speech dictation.
  • Description:
    Allows users to send a speech dictation request to the Request endpoint.
    The service will return the dictation (recognition) result for the provided audio file.

How to Send Request:

To send a new dictation request, include the following recognition parameters:

  • Mandatory Parameter:

    • ModelName
  • Optional Parameters (default values used if not provided):

    • Tenant
    • ModelVersion
    • ProduceNBestList
    • NBestListLength
    • SendAudioDownloadLink

Example Parameters:

- ModelName: TURKISH_GENERAL_TEST
- Tenant: ModelTenant
- ModelVersion: 1
- ProduceNBestList: True
- NBestListLength: 5
- SendAudioDownloadLink: True

Authorization:

  • If the Speech Recognition Service requires a license (e.g., when used in cloud environments),
    the Bearer token must be included in the Authorization header.

Audio File Upload:

  • The audio file should be sent as binary data in the request body.
  • Content-Type must be set accordingly.
    Supported formats:
    • audio/opus
    • audio/wav
    • audio/wave

End-Point Test with cURL:

Below is an example cURL request to send a speech dictation request.

curl \
--header "ModelName: TURKISH_GENERAL" \
--header "Tenant: ModelTenant" \
--header "ModelVersion: 1" \
--header "Authorization: Bearer [token]" \
--data-binary "@AudioToDictate.wav" \
-H "Content-Type: audio/wav" \
-X POST "https://{server-url}/v1/speech/dictation/request"

Notes:

  • Replace {server-url} with the real server URL.
  • Replace [token] with a valid authorization token if required.
  • Ensure the audio file (AudioToDictate.wav) is in a supported format.

Request Header Fields:

Name | Description
ModelName | Name of the language model you want to use in your speech dictation (recognition).
Tenant | Tenant of the language model.
ModelVersion | Version of the language model you want to use.
ProduceNBestList | If true, produces multiple hypotheses as the dictation result.
NBestListLength | Maximum number of recognition hypotheses.
SendAudioDownloadLink | If true, provides a downloadable link for the audio file you send.
Content-Type | The MIME type of the audio file.

The response additionally contains the following fields:

Name | Description
Success | True: the request succeeded. False: the request failed.
ErrorMessage | If the request fails, this field contains the failure message.
ErrorCode | If the request fails, this field contains the failure error code (e.g., Internal Service Error).
MoreInfo | Any extra information about the response.

Example Response:

{
  "resultText": "sayın meslektaşım ",
  "confidence": 1,
  "speechStartTimeMsec": 0,
  "speechEndTimeMsec": 2437,
  "nbestlist": {
    "utterances": [
      {
        "nlsmlResult": "",
        "confidence": 99,
        "recognizedWords": [
          {
            "word": "sayın",
            "startTimeMsec": 410,
            "endTimeMsec": 1110,
            "confidence": 99,
            "wordType": 1,
            "speakerId": null
          },
          {
            "word": "meslektaşım",
            "startTimeMsec": 1140,
            "endTimeMsec": 1960,
            "confidence": 99,
            "wordType": 1,
            "speakerId": null
          }
        ]
      }
    ]
  },
  "audioLink": "http://acme-pc/...",
  "success": true,
  "errorMessage": null,
  "errorCode": null,
  "moreInfo": null
}

Explanation:

  • resultText: The final recognized text from the speech.
  • confidence: The overall confidence score of the recognition.
  • speechStartTimeMsec: The starting timestamp of the speech in milliseconds.
  • speechEndTimeMsec: The ending timestamp of the speech in milliseconds.
  • nbestlist: Contains multiple utterances (if ProduceNBestList is enabled).
    • Each utterance has:
      • nlsmlResult: (if applicable).
      • confidence: Confidence score for the utterance.
      • recognizedWords: Detailed information about each recognized word.
        • word: The recognized word.
        • startTimeMsec: The start time of the word in milliseconds.
        • endTimeMsec: The end time of the word in milliseconds.
        • confidence: Confidence score for the word.
        • wordType: Type of the word (1 indicates a recognized word).
        • speakerId: If speaker identification is enabled, this field will contain the speaker ID.
  • audioLink: A link to the recorded audio file.
  • success: true if the request was successful.
  • errorMessage: If an error occurs, this field contains the error message.
  • errorCode: If an error occurs, this field contains the error code.
  • moreInfo: Additional information about the response.
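
To illustrate how these fields fit together, here is a small Python sketch that walks the n-best list and collects the transcript and per-word timings. The helper name `extract_words` is our own; only the field names come from the response shown above.

```python
# Sketch: pulling the transcript and word timings out of a dictation response.
import json

def extract_words(response):
    """Return (result_text, [(word, start_ms, end_ms, confidence), ...])."""
    words = []
    nbest = response.get("nbestlist") or {}
    for utterance in nbest.get("utterances", []):
        for w in utterance.get("recognizedWords", []):
            words.append((w["word"], w["startTimeMsec"],
                          w["endTimeMsec"], w["confidence"]))
    return response["resultText"].strip(), words

# Abridged copy of the example response above.
example = json.loads("""
{"resultText": "sayın meslektaşım ",
 "confidence": 1,
 "nbestlist": {"utterances": [
   {"nlsmlResult": "", "confidence": 99, "recognizedWords": [
     {"word": "sayın", "startTimeMsec": 410, "endTimeMsec": 1110,
      "confidence": 99, "wordType": 1, "speakerId": null},
     {"word": "meslektaşım", "startTimeMsec": 1140, "endTimeMsec": 1960,
      "confidence": 99, "wordType": 1, "speakerId": null}]}]},
 "success": true}
""")

text, words = extract_words(example)
```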

Response Fields:

  • ResultText: The dictated (recognized) text.
  • Confidence: The confidence score of recognition (range: [0, 1]).
  • SpeechStartTimeMsec: The time (in milliseconds) where speech started in the audio file.
  • SpeechEndTimeMsec: The time (in milliseconds) where speech ended in the audio file.
  • Nbestlist: The result of multiple hypotheses as the dictation result.

Additional Response Fields:

  • AudioLink: A link to download the input audio in WAV format.
  • Success: True if the request succeeded; False if it failed.
  • ErrorMessage: If the request fails, this field contains the failure message.
  • ErrorCode: If the request fails, this field contains the failure error code (e.g., Internal Service Error).
  • MoreInfo: Additional details about the response.

Nbestlist Fields
Contains multiple recognition hypotheses.

  • NlsmlResult: The recognition result in NLSML format.
  • Utterances: The recognized utterances.
  • Confidence: The confidence score for the recognition hypothesis.
  • RecognizedWords: The list of recognized words.

RecognizedWords Fields
Contains detailed information about each recognized word.

  • Word: The recognized word from the speech input.
  • StartTimeMsec: The start time of the word in milliseconds.
  • EndTimeMsec: The end time of the word in milliseconds.
  • Confidence: The confidence value of the recognized word, as a percentage.
  • WordType: The type of the word (e.g., Normal, Filler, Suffix, Prefix).

Note:

  • Based on service configuration, a license may or may not be required for this service.

POST Dictation Request with Custom Words

  • URL: v1/speech/dictation/request
  • Method: POST
  • Summary: Make Speech Dictation with Custom Words.
  • Description:
    This endpoint allows users to send a speech dictation request with custom words added to the language model.
    The service will return the dictation (recognition) result for the provided speech (audio file).

How to Send Request:

To send a dictation request with custom words, the following recognition parameters must be included:

  • Mandatory Parameter:

    • ModelName
  • Optional Parameters (default values used if not provided):

    • Tenant
    • ModelVersion
    • ProduceNBestList
    • NBestListLength
    • SendAudioDownloadLink

Example Parameters

- ModelName: TURKISH_GENERAL_TEST
- Tenant: ModelTenant
- ModelVersion: 1
- ProduceNBestList: True
- NBestListLength: 5
- SendAudioDownloadLink: True

Authorization:

  • If the Speech Recognition Service requires a license (e.g., when used in cloud environments), the Bearer token must be included in the Authorization header.

Audio File & Custom Words Upload:

  • The audio file should be sent as binary data in the request body.
  • Custom words should also be sent as multipart form-data along with the audio file.

End-Point Test with cURL

Below is an example cURL request to send a speech dictation request with custom words.

curl \
--header "ModelName: TURKISH_GENERAL" \
--header "Tenant: ModelTenant" \
--header "ModelVersion: 1" \
--header "Authorization: Bearer [token]" \
--header "Content-Type: multipart/form-data" \
--form custom-list="tencere\nkapak" \
--form audio="@AudioToDictate.wav" \
-X POST "https://{server-url}/v1/speech/dictation/request"

Request Header Fields

  • ModelName: Name of the language model you want to use in your speech dictation (recognition).
  • Tenant: Tenant of the language model.
  • ModelVersion: Version of the language model you want to use in your speech dictation (recognition).
  • ProduceNBestList: If true, produces multiple hypotheses as the dictation result.
  • NBestListLength: Maximum number of recognition hypotheses.
  • SendAudioDownloadLink: If true, provides a downloadable link for the audio file you send.

Additional Fields

  • Token (license.*): Required if using cloud services. The API Client ID and API Client Secret provided upon service purchase are used to generate a token.
  • Content-Type: The MIME type of the request body.

Notes

  • Replace {server-url} with the real server URL.
  • Replace [token] with a valid authorization token if required.
  • Ensure the audio file (AudioToDictate.wav) is in a supported format.
  • The custom word list should be sent as multipart form-data.
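
As a rough illustration of what cURL's `--form` flags produce, the sketch below assembles an equivalent multipart/form-data body by hand using only Python's standard library. The field names `custom-list` and `audio` mirror the cURL example; the boundary string and the helper name are arbitrary choices, and the exact encoding the service expects should be verified against the cURL example above.

```python
# Sketch: building a multipart/form-data body for the custom-words request.
# The boundary value is arbitrary; the field names follow the cURL example.
import io

def build_multipart(custom_words, audio_bytes,
                    audio_name="AudioToDictate.wav",
                    boundary="knovvu-boundary"):
    """Return (body, content_type) for the custom-words dictation request."""
    buf = io.BytesIO()
    # Part 1: the custom word list, one word per line.
    buf.write((f"--{boundary}\r\n"
               'Content-Disposition: form-data; name="custom-list"\r\n'
               "\r\n").encode("utf-8"))
    buf.write("\n".join(custom_words).encode("utf-8") + b"\r\n")
    # Part 2: the audio file as a named file field.
    buf.write((f"--{boundary}\r\n"
               'Content-Disposition: form-data; name="audio"; '
               f'filename="{audio_name}"\r\n'
               "Content-Type: audio/wav\r\n"
               "\r\n").encode("utf-8"))
    buf.write(audio_bytes + b"\r\n")
    buf.write(f"--{boundary}--\r\n".encode("utf-8"))
    return buf.getvalue(), f"multipart/form-data; boundary={boundary}"

# Placeholder audio bytes; a real call would read the WAV file from disk.
body, content_type = build_multipart(["tencere", "kapak"], b"RIFF")
```

The returned `content_type` value (including the boundary parameter) must be sent as the request's Content-Type header so the server can split the parts.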

Example Response

{
  "resultText": "sayın meslektaşım ",
  "confidence": 1,
  "speechStartTimeMsec": 0,
  "speechEndTimeMsec": 2437,
  "nbestlist": {
    "utterances": [
      {
        "nlsmlResult": "",
        "confidence": 99,
        "recognizedWords": [
          {
            "word": "sayın",
            "startTimeMsec": 410,
            "endTimeMsec": 1110,
            "confidence": 99,
            "wordType": 1,
            "speakerId": null
          },
          {
            "word": "meslektaşım",
            "startTimeMsec": 1140,
            "endTimeMsec": 1960,
            "confidence": 99,
            "wordType": 1,
            "speakerId": null
          }
        ]
      }
    ]
  },
  "audioLink": "http://acme-pc/...",
  "success": true,
  "errorMessage": null,
  "errorCode": null,
  "moreInfo": null
}

Explanation

  • resultText: The final recognized text from the speech.
  • confidence: The overall confidence score of the recognition.
  • speechStartTimeMsec: The starting timestamp of the speech in milliseconds.
  • speechEndTimeMsec: The ending timestamp of the speech in milliseconds.
  • nbestlist: Contains multiple utterances (if ProduceNBestList is enabled).
    • Each utterance has:
      • nlsmlResult: (if applicable).
      • confidence: Confidence score for the utterance.
      • recognizedWords: Detailed information about each recognized word.
        • word: The recognized word.
        • startTimeMsec: The start time of the word in milliseconds.
        • endTimeMsec: The end time of the word in milliseconds.
        • confidence: Confidence score for the word.
        • wordType: Type of the word (1 indicates a recognized word).
        • speakerId: If speaker identification is enabled, this field will contain the speaker ID.
  • audioLink: A link to the recorded audio file.
  • success: true if the request was successful.
  • errorMessage: If an error occurs, this field contains the error message.
  • errorCode: If an error occurs, this field contains the error code.
  • moreInfo: Additional information about the response.

Response Fields

  • ResultText: The dictated (recognized) text.
  • Confidence: Confidence of recognition (range: [0, 1]).
  • SpeechStartTimeMsec: The starting time of the recognized speech in milliseconds.
  • SpeechEndTimeMsec: The ending time of the recognized speech in milliseconds.
  • Nbestlist: Contains multiple hypotheses for the recognition result.
  • AudioLink: A link where the input audio file can be downloaded in WAV format.
  • Success: True if the request succeeded; False if it failed.
  • ErrorMessage: A failure message when the request fails.
  • ErrorCode: Error code if the request fails (e.g., Internal Service Error).
  • MoreInfo: Any extra details about the response.

Nbestlist Fields

  • Utterances: List of recognized utterances.
  • NlsmlResult: The recognition result in NLSML format.
  • Confidence: Confidence level of the recognition hypothesis.
  • RecognizedWords: List of words recognized in the speech.

RecognizedWords Fields

  • Word: Recognized word from the speech input.
  • StartTimeMsec: The start time of the recognized word in milliseconds.
  • EndTimeMsec: The end time of the recognized word in milliseconds.
  • Confidence: The confidence level of the recognized word (as a percentage).
  • WordType: Classification of the word (e.g., Normal, Filler, Suffix, Prefix).
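
A common post-processing step is to drop low-confidence or non-normal words before using the transcript. The sketch below is a hypothetical filter, assuming (per the explanation earlier in this article) that wordType 1 marks a normally recognized word and that word confidence is a percentage; the 80% threshold and the sample filler entry are invented for illustration.

```python
# Sketch: filtering a RecognizedWords list by word-level confidence.
# Assumes wordType 1 = normally recognized word; confidence is 0-100.
def confident_words(recognized_words, min_confidence=80):
    """Keep words marked wordType 1 whose confidence meets the threshold."""
    return [w["word"] for w in recognized_words
            if w.get("wordType") == 1
            and w.get("confidence", 0) >= min_confidence]

# Hypothetical input: the first entry mirrors the example response; the
# second is an invented low-confidence entry with a non-normal wordType.
sample = [
    {"word": "sayın", "confidence": 99, "wordType": 1},
    {"word": "eee", "confidence": 40, "wordType": 2},
]
kept = confident_words(sample)
```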

Note:
Depending on the service configuration, a license may or may not be required to call this service.

TEST TOOLS

There are several tools that you can use to test the REST API.


Curl

Curl is a command-line tool for transferring data using various protocols. It is available on multiple platforms.

For more information, visit:
🔗 https://curl.haxx.se/

It can be used to interact with the Sestek Speech Recognition REST API.

Curl is available on many platforms, including Windows, Linux, and MacOS:
🔗 https://curl.haxx.se/download.html

For Windows installation, refer to:
🔗 https://support.zendesk.com/hc/en-us/articles/203691436-Installing-and-using-cURL#curl_win


Fiddler

Fiddler is a free web debugging proxy that works across multiple browsers, systems, and platforms.

🔗 http://www.telerik.com/fiddler


HTTPie

HTTPie is a cURL alternative that is particularly suited for JSON-based REST APIs.

🔗 https://github.com/jkbrzt/httpie

For installation (Windows, Mac OS X, Linux), you can use pip:
🔗 https://pip.pypa.io/en/latest/


Postman

Postman is a popular API testing tool available as both a Google Chrome Packaged App and a Google Chrome in-browser app.

🔗 https://www.getpostman.com/


Paw

Paw is a Mac application that simplifies interaction with REST services.

🔗 https://luckymarmot.com/paw


I'm Only Resting

"I'm Only Resting" is a feature-rich WinForms-based HTTP client.

🔗 http://www.swensensoftware.com/im-only-resting


APPENDICES

This section provides an overview of useful tools and references to help you interact with REST APIs effectively. For brief definitions of REST API concepts, see the following resources:

