---
title: "Text Normalization"
slug: "sr-text-normalization"
description: "Learn how Knovvu SR uses text normalization to convert spoken input into accurate, readable transcripts with flexible number handling modes."
updated: 2026-04-15T21:08:08Z
published: 2026-04-15T21:08:08Z
canonical: "docs.knovvu.com/sr-text-normalization"
---

# Text Normalization

Text Normalization (TN) is a critical component of speech recognition systems. It transforms raw transcriptions into more readable and semantically accurate text by converting spoken input into a standardized format. Named Entity Recognition (NER) extracts entities such as numbers, dates, currencies, and phone numbers from the recognized text.

One of the key challenges in this process is handling numbers appropriately, so that the transcription aligns with the user's intent and context.

For example:

- *"one step at a time"* should remain as *"one step at a time"*, not *"1 step at a time"*
- *"two wrongs don't make a right"* should not be transcribed as *"2 wrongs don't make a right"* in conversational contexts

SESTEK SR addresses this through two approaches: **pre-defined entities** and **custom entities**, both integrated via the TextManip service.

---

## 1. TN/NER with Pre-defined Entities

Pre-defined entity support is controlled by the `TextManipType` header parameter of `POST dictation/request`. When it is active, a request carrying the pre-defined `ParsingList` is sent to the TextManip service for Text Normalization and Named Entity Recognition.

| Key | Description |
| --- | --- |
| `TextManipType` | `Standard` · `Aggressive` · `None` |

If not specified, `Standard` mode is used by default.
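
As a minimal sketch of the header handling described above, the following helper builds the request headers and leaves `TextManipType` out when no mode is given, so the server default applies. The function name `build_headers` and the constant `ALLOWED_MODES` are hypothetical; the header names are taken from the request samples on this page, and whether the service itself rejects unknown values is an assumption.

```python
from typing import Dict, Optional

# The three values listed in the table above.
ALLOWED_MODES = ("Standard", "Aggressive", "None")


def build_headers(token: str, model_name: str, mode: Optional[str] = None) -> Dict[str, str]:
    """Build headers for POST dictation/request (hypothetical helper).

    When mode is None the TextManipType header is omitted, which
    leaves the choice to the server default (Standard).
    """
    headers = {
        "ModelName": model_name,
        "Tenant": "Default",
        "Authorization": f"Bearer {token}",
    }
    if mode is not None:
        # Client-side guard only; this check is an assumption, not documented server behavior.
        if mode not in ALLOWED_MODES:
            raise ValueError(f"TextManipType must be one of {ALLOWED_MODES}, got {mode!r}")
        headers["TextManipType"] = mode
    return headers
```

Validating the mode on the client keeps a typo such as `Agressive` from silently reaching the service.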

### 1.1. Pre-defined Entity Types

| Entity LabelClass | Description |
| --- | --- |
| `Number` | Detects numerical values in any format |
| `SrDate` | Detects date expressions, excluding relative terms like "today", "tomorrow", "yesterday" |
| `Currency` | Detects monetary values including amounts paired with currency symbols or names |
| `Time` | Detects clock times in different formats (e.g. "10:30 AM", "22:15") |
| `SerialNumber` | Detects structured numerical sequences representing serial numbers |

### 1.2. TextManipType Modes

#### Standard

Converts number words to digits only when they are used as literal numeric values. Idiomatic or contextual phrases are kept as-is, which preserves their meaning.

Standard mode uses TextManip's `text` and `word` output parameters as the result. See [Contextual Numerical Expressions Handling](https://docs.knovvu.com/docs/ner-methods#contextual-numerical-expressions-handling) for more details.

**Examples:**

| Input (spoken) | Output (transcribed) |
| --- | --- |
| "one can never know the answer" | "one can never know the answer" |
| "two heads are better than one" | "two heads are better than one" |
| "i have one last question for you" | "i have one last question for you" |

#### Aggressive

Converts all potential numbers into numerical format, prioritizing numeric representation over contextual nuance.

Aggressive mode uses TextManip's `displayText` and `displayWord` output parameters as the result. See [NER Methods](https://docs.knovvu.com/docs/ner-methods) for more details.

**Examples:**

| Input (spoken) | Output (transcribed) |
| --- | --- |
| "one can never know the answer" | "1 can never know the answer" |
| "two heads are better than one" | "2 heads are better than 1" |
| "three days ago, we had a meeting about this" | "3 days ago we had a meeting about this" |
| "i have one last question for you" | "i have 1 last question for you" |
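
To make the "convert everything" behavior concrete, here is a toy illustration that replaces every standalone number word with a digit, regardless of context. It is not TextManip's implementation: the lexicon `NUMBER_WORDS` and the function `aggressive_toy` are hypothetical, and only single-digit English words in lowercase are covered.

```python
import re

# Tiny illustrative lexicon; a real normalizer covers far more than this.
NUMBER_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}

# \b keeps words such as "someone" from matching on their "one" suffix.
_PATTERN = re.compile(r"\b(" + "|".join(NUMBER_WORDS) + r")\b")


def aggressive_toy(text: str) -> str:
    """Replace each standalone number word with its digit, ignoring context."""
    return _PATTERN.sub(lambda match: NUMBER_WORDS[match.group(1)], text)


print(aggressive_toy("two heads are better than one"))
# prints: 2 heads are better than 1
```

Because the substitution never looks at the surrounding words, idioms are converted as well, which is exactly the trade-off that distinguishes this mode from Standard.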

#### None

Disables text normalization entirely. All transcriptions preserve the exact spoken format without any changes to numeric expressions.

This mode is useful for scenarios where raw transcription without modification is required.

**Examples:**

| Input (spoken) | Output (transcribed) |
| --- | --- |
| "i will be there in three hours" | "i will be there in three hours" |
| "room number one is ready" | "room number one is ready" |

---

## 2. TN/NER with Custom Entities

Custom entity support is enabled using the `parameters` body parameter in `form-data` on `POST dictation/request`.

| Key | Type | Description |
| --- | --- | --- |
| `audio` | `File` | The audio file to be transcribed (e.g. `payment.wav`) |
| `parameters` | `Text` | JSON object specifying the NER ParsingList |

**ParsingList example:**

```
{
  "NER": {
    "ParsingList": [
      { "LabelClass": "SrDate",       "Name": "@srdate" },
      { "LabelClass": "Time",         "Name": "@time" },
      { "LabelClass": "Iban",         "Name": "@iban" },
      { "LabelClass": "Tckn",         "Name": "@tckn" },
      { "LabelClass": "PhoneNumber",  "Name": "@phonenumber" }
    ]
  }
}
```

**Attention:**

- `parameters` works with `TextManipType` **Standard** and **Aggressive** modes only.
- If `TextManipType` is set to **None**, the `parameters` field is automatically disabled and no TN/NER will be applied.
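
The `parameters` value can be assembled programmatically. The sketch below serializes a `ParsingList` in the shape shown above and returns nothing when the mode is `None`, mirroring the rule that `parameters` is ignored in that mode. The helper name `build_parameters` is hypothetical, and deriving `Name` as `@` plus the lowercased `LabelClass` is an assumption inferred from the example, not a documented requirement.

```python
import json
from typing import Iterable, Optional


def build_parameters(label_classes: Iterable[str], mode: str = "Standard") -> Optional[str]:
    """Return the JSON text for the `parameters` form field (hypothetical helper).

    Returns None when mode is "None", since custom entities have no
    effect when text normalization is disabled.
    """
    if mode == "None":
        return None
    parsing_list = [
        # Naming convention assumed from the ParsingList example above.
        {"LabelClass": label, "Name": "@" + label.lower()}
        for label in label_classes
    ]
    return json.dumps({"NER": {"ParsingList": parsing_list}})


print(build_parameters(["Iban", "PhoneNumber"]))
```

Building the payload with `json.dumps` avoids the hand-escaped quotes that the form field otherwise needs on the command line.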

---

## Result Accuracy

Text Normalization and NER can be applied using either **plain text** or **word-by-word** input.

### Plain text input

The text is processed as a whole, without considering the pauses between words during speech. This can lead to slightly inaccurate results depending on the entity being matched.

### Word-by-word input

Each word is processed individually, and the pause time between words during speech is taken into account when detecting entities. This provides more accurate results, particularly for phone numbers and serial numbers where digit grouping matters.

### How to implement word-by-word TN/NER

Set `ProduceNBestList: True` in the `POST dictation/request` headers. The response then carries detailed per-word information in the `RecognizedWords` parameter.

| Name | Description |
| --- | --- |
| `Word` | The recognized word from the speech input |
| `StartTimeMsec` | Start time of the utterance in milliseconds |
| `EndTimeMsec` | End time of the utterance in milliseconds |
| `Confidence` | Confidence value of the recognized result (percentage) |
| `WordType` | Type of the word: Normal, Filler, Suffix, or Prefix |

This enables extracting entities word-by-word using the TextManip `POST denormalize/by-words` method.

The `denormalize/by-words` method includes a `PauseThreshold: 1000` parameter. This controls how consecutive words are grouped:

- If `StartTimeMsec` of Word_2 − `EndTimeMsec` of Word_1 < 1000 ms → the two words form a single number
- If ≥ 1000 ms → they are treated as two separate numbers
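
The grouping rule above can be sketched as follows. This is only an illustration of the pause comparison, not the TextManip implementation: the function `group_by_pause` and the sample timings are invented for the example, and the dictionary keys follow the `RecognizedWords` table (the JSON responses on this page use lower camel case instead).

```python
from typing import Dict, List


def group_by_pause(words: List[Dict], pause_threshold_ms: int = 1000) -> List[List[str]]:
    """Group consecutive words whose gap is below the pause threshold."""
    groups: List[List[str]] = []
    previous_end = None
    for word in words:
        # Gap = start of this word minus end of the previous word.
        gap = None if previous_end is None else word["StartTimeMsec"] - previous_end
        if gap is not None and gap < pause_threshold_ms:
            groups[-1].append(word["Word"])  # short pause: same number
        else:
            groups.append([word["Word"]])  # first word or long pause: new number
        previous_end = word["EndTimeMsec"]
    return groups


# Invented timings: a 1200 ms pause separates "three" and "forty".
sample = [
    {"Word": "twenty", "StartTimeMsec": 0, "EndTimeMsec": 400},
    {"Word": "three", "StartTimeMsec": 450, "EndTimeMsec": 800},
    {"Word": "forty", "StartTimeMsec": 2000, "EndTimeMsec": 2400},
    {"Word": "five", "StartTimeMsec": 2450, "EndTimeMsec": 2800},
]
print(group_by_pause(sample))
# prints: [['twenty', 'three'], ['forty', 'five']]
```

With these timings the words resolve to two numbers (23 and 45) rather than a single run, which is the information plain text input cannot provide.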

**Example:**

| Input (spoken) | Plain text output (`ProduceNBestList: False`) | Word-by-word output (`ProduceNBestList: True`) |
| --- | --- | --- |
| "sıfır beş yüz otuz beş yüz kırk iki otuz üç seksen yedi" | "0 535 142 33 87" | "0 530 542 33 87" |

---

## API Reference

### 1. Pre-defined Entities

#### `TextManipType: Standard`

**Request:**

```
curl \
  --location --request POST '{server-url}/dictation/request' \
  --header 'Content-Type: audio/wave' \
  --header 'ModelName: English' \
  --header 'Tenant: Default' \
  --header 'TextManipType: Standard' \
  --header 'Authorization: Bearer <authorization_token>' \
  --form 'audio=@"/path/to/audio/file.wav"'
```

**Response:**

```
{
  "audioLink": "https://{{Address}}/audiopath.wav",
  "confidence": 0.97,
  "detectedAudioContent": "recognizable-speech",
  "errorCode": null,
  "errorMessage": null,
  "moreInfo": null,
  "resultText": "I sent 7 emails but only received a reply to one",
  "success": true
}
```

#### `TextManipType: Aggressive`

**Request:**

```
curl \
  --location --request POST '{server-url}/dictation/request' \
  --header 'Content-Type: audio/wave' \
  --header 'ModelName: English' \
  --header 'Tenant: Default' \
  --header 'TextManipType: Aggressive' \
  --header 'Authorization: Bearer <authorization_token>' \
  --form 'audio=@"/path/to/audio/file.wav"'
```

**Response:**

```
{
  "audioLink": "https://{{Address}}/audiopath.wav",
  "confidence": 0.97,
  "detectedAudioContent": "recognizable-speech",
  "errorCode": null,
  "errorMessage": null,
  "moreInfo": null,
  "resultText": "I sent 7 emails but only received a reply to 1",
  "success": true
}
```

#### `TextManipType: None`

**Request:**

```
curl \
  --location --request POST '{server-url}/dictation/request' \
  --header 'Content-Type: audio/wave' \
  --header 'ModelName: English' \
  --header 'Tenant: Default' \
  --header 'TextManipType: None' \
  --header 'Authorization: Bearer <authorization_token>' \
  --form 'audio=@"/path/to/audio/file.wav"'
```

**Response:**

```
{
  "audioLink": "https://{{Address}}/audiopath.wav",
  "confidence": 0.97,
  "detectedAudioContent": "recognizable-speech",
  "errorCode": null,
  "errorMessage": null,
  "moreInfo": null,
  "resultText": "I sent seven emails but only received a reply to one",
  "success": true
}
```
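
A client would typically check `success` before reading `resultText`. The following is a minimal sketch of that check using only the fields shown in the responses above; the names `DictationError` and `result_text` are hypothetical, and the exact content of `errorCode` and `errorMessage` on failure is not documented here, so the error text format is an assumption.

```python
import json


class DictationError(RuntimeError):
    """Raised when the response reports success = false (hypothetical type)."""


def result_text(response_body: str) -> str:
    """Return resultText from a dictation response, or raise on failure."""
    data = json.loads(response_body)
    if not data.get("success"):
        # Surface whatever the service reported; the format is an assumption.
        raise DictationError(f"{data.get('errorCode')}: {data.get('errorMessage')}")
    return data["resultText"]


body = """{
  "confidence": 0.97,
  "errorCode": null,
  "errorMessage": null,
  "resultText": "I sent seven emails but only received a reply to one",
  "success": true
}"""
print(result_text(body))
# prints: I sent seven emails but only received a reply to one
```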

---

### 2. Custom Entities

#### `parameters` + `TextManipType: None`

**Request:**

```
curl \
  --location --request POST '{server-url}/dictation/request' \
  --header 'Content-Type: audio/wave' \
  --header 'ModelName: Turkish' \
  --header 'Tenant: Default' \
  --header 'TextManipType: None' \
  --header 'Authorization: Bearer <authorization_token>' \
  --form 'audio=@"/path/to/audio/file.wav"' \
  --form 'parameters="{
    \"NER\": {
      \"ParsingList\": [
        { \"LabelClass\": \"PhoneNumber\", \"Name\": \"phonenumber\" }
      ]
    }
  }"'
```

**Response:**

```
{
  "resultText": "sıfır beş yüz otuz beş yüz kırk iki otuz üç seksen yedi",
  "confidence": 0.97,
  "success": true
}
```

#### `parameters` + `TextManipType: Standard`

**Request:**

```
curl \
  --location --request POST '{server-url}/dictation/request' \
  --header 'Content-Type: audio/wave' \
  --header 'ModelName: Turkish' \
  --header 'Tenant: Default' \
  --header 'TextManipType: Standard' \
  --header 'Authorization: Bearer <authorization_token>' \
  --form 'audio=@"/path/to/audio/file.wav"' \
  --form 'parameters="{
    \"NER\": {
      \"ParsingList\": [
        { \"LabelClass\": \"PhoneNumber\", \"Name\": \"phonenumber\" }
      ]
    }
  }"'
```

**Response:**

```
{
  "resultText": "sıfır 535 142 33 87",
  "confidence": 0.97,
  "success": true
}
```

#### `parameters` + `TextManipType: Aggressive`

**Request:**

```
curl \
  --location --request POST '{server-url}/dictation/request' \
  --header 'Content-Type: audio/wave' \
  --header 'ModelName: Turkish' \
  --header 'Tenant: Default' \
  --header 'TextManipType: Aggressive' \
  --header 'Authorization: Bearer <authorization_token>' \
  --form 'audio=@"/path/to/audio/file.wav"' \
  --form 'parameters="{
    \"NER\": {
      \"ParsingList\": [
        { \"LabelClass\": \"PhoneNumber\", \"Name\": \"phonenumber\" }
      ]
    }
  }"'
```

**Response:**

```
{
  "resultText": "0 535 142 33 87",
  "confidence": 0.97,
  "success": true
}
```

#### `parameters` + `TextManipType: Aggressive` + `ProduceNBestList: True`

**Request:**

```
curl \
  --location --request POST '{server-url}/dictation/request' \
  --header 'Content-Type: audio/wave' \
  --header 'ModelName: Turkish' \
  --header 'Tenant: Default' \
  --header 'TextManipType: Aggressive' \
  --header 'ProduceNBestList: true' \
  --header 'Authorization: Bearer <authorization_token>' \
  --form 'audio=@"/path/to/audio/file.wav"' \
  --form 'parameters="{
    \"NER\": {
      \"ParsingList\": [
        { \"LabelClass\": \"PhoneNumber\", \"Name\": \"phonenumber\" }
      ]
    }
  }"'
```

**Response:**

```
{
  "audioLink": "https://{{Address}}/audiopath.wav",
  "confidence": 0.97,
  "detectedAudioContent": "recognizable-speech",
  "errorCode": null,
  "errorMessage": null,
  "moreInfo": null,
  "nbestlist": {
    "utterances": [
      {
        "confidence": 94,
        "nlsmlResult": "",
        "recognizedWords": [
          {
            "confidence": 97,
            "endTimeMsec": 4060,
            "startTimeMsec": 0,
            "word": "0 530 542 33 87",
            "wordType": "phonenumber"
          }
        ]
      }
    ]
  },
  "resultText": "0 530 542 33 87",
  "success": true
}
```
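
To pull the detected entities out of a response like the one above, a client can walk `nbestlist.utterances[].recognizedWords[]`. The sketch below assumes exactly that structure and those key names; the function `extract_recognized_words` is hypothetical, and responses without an `nbestlist` are treated as having no entries.

```python
from typing import Dict, List, Tuple


def extract_recognized_words(response: Dict) -> List[Tuple[str, str]]:
    """Collect (word, wordType) pairs from every utterance in the n-best list."""
    pairs: List[Tuple[str, str]] = []
    nbest = response.get("nbestlist") or {}  # absent when ProduceNBestList is off
    for utterance in nbest.get("utterances", []):
        for item in utterance.get("recognizedWords", []):
            pairs.append((item["word"], item["wordType"]))
    return pairs


# Trimmed copy of the response shown above.
response = {
    "nbestlist": {
        "utterances": [
            {
                "confidence": 94,
                "recognizedWords": [
                    {
                        "confidence": 97,
                        "endTimeMsec": 4060,
                        "startTimeMsec": 0,
                        "word": "0 530 542 33 87",
                        "wordType": "phonenumber",
                    }
                ],
            }
        ]
    },
    "resultText": "0 530 542 33 87",
    "success": True,
}
print(extract_recognized_words(response))
# prints: [('0 530 542 33 87', 'phonenumber')]
```

In this sample the `wordType` of the matched span equals the `Name` given in the `ParsingList`, so filtering on it is one way to isolate a specific entity.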

---

## Summary

By supporting both pre-defined and custom entity configurations alongside multiple `TextManipType` modes, SESTEK SR enables precise control over how spoken expressions are interpreted, normalized, or preserved. The availability of plain text and word-by-word processing ensures that applications can achieve the right balance between semantic correctness and numeric precision.
