TTS Best Practices

This document provides practical guidance for using SESTEK Text-to-Speech effectively. It explains how general text formatting, punctuation, spacing, special characters, and output configuration influence TTS behavior, and it offers best practices for improving pronunciation, pacing, prosody, and synthesis stability across common use cases.

By following these guidelines, you can produce more natural-sounding output and better understand expected system behavior when diagnosing undesired synthesis results.

1. Purpose and Scope

Objectives

Enable higher-quality TTS outputs
Explain how the TTS system interprets general written text
Reduce ambiguity when diagnosing synthesis issues
Provide clear guidance for text preparation and request formatting
Improve output naturalness through better punctuation, spacing, and prosody control

Scope

General Text-to-Speech (TTS) best practices
Text preparation before synthesis
Special character handling
Request formatting and JSON escaping
Output-oriented troubleshooting

This document describes general system behavior and recommended usage patterns for TTS input preparation. It is not language-specific guidance and should not be mixed with language-specific documents such as Arabic Voices Best Practices.

2. TTS Processing Overview

SESTEK TTS follows this synthesis pipeline:

Input Text
    ↓
Text Normalization
    ↓
Prosody Assignment
    ↓
Phoneme Generation
    ↓
Waveform Synthesis

Most unexpected synthesis outcomes originate from input ambiguity, formatting, or normalization behavior, not from the voice engine itself.

What This Means in Practice

Even when the words themselves are correct, the following may still affect the final result:

missing punctuation
excessive punctuation
incorrect spacing
unresolved placeholders
raw special characters
malformed JSON payloads
ambiguous numeric or symbolic expressions
unsuitable sentence length or structure

3. Prosody and Why It Matters

In TTS, prosody refers to how speech is delivered, including:

Pause placement
Pause length
Rhythm
Phrase grouping
Emphasis
Sentence melody / intonation

Punctuation and special characters influence prosody even when they are not spoken literally.

Examples:

A comma usually creates a short pause
A period usually creates a sentence-ending pause
A question mark usually creates question-like intonation
An ellipsis may create a hesitation or trailing pause
Repeated symbols may create unnatural rhythm or exaggerated emphasis

A character does not need to be spoken aloud to affect synthesis. Many characters influence prosody indirectly by changing how the text is segmented and interpreted.

4. Special Characters Reference Tables

Special characters may affect synthesis in different ways. Some characters are explicitly handled by the SESTEK TTS engine as pause delimiters with configured pause durations.

Other characters are not defined as engine-level pause delimiters, but may still affect pronunciation, rhythm, readability, or output naturalness depending on the text and context.

The pause durations in the first table are configured at engine level and are expressed in milliseconds. Actual perceived pause length may still vary depending on the voice, model behavior, and surrounding text.

4.1. SESTEK TTS Pause Delimiter Characters

The table below lists the characters currently supported by the SESTEK TTS engine as pause delimiters.

Character	Name	Configured Pause Duration	Example	Typical Effect on Output	Recommendation
`.`	Full stop / period	400 ms	`Your order is ready.`	Creates a sentence-ending pause	Use to end complete thoughts
`,`	Comma	150 ms	`Hello, John.`	Creates a short phrase break	Use for natural short pauses
`;`	Semicolon	300 ms	`Approved; next step follows.`	Creates a structured pause	Prefer comma or period unless a semicolon is clearly needed
`:`	Colon	300 ms	`Options: sales, support.`	Introduces what follows with a medium pause	Use only when the sentence structure supports it
`!`	Exclamation mark	350 ms	`Thank you!`	Adds emphasis with a stronger ending pause	Use sparingly
`?`	Question mark	400 ms	`Are you available?`	Creates question-like intonation with a stronger ending pause	Use only for actual questions
`"`	Double quotation mark	150 ms	`He said "Wait".`	May create a short pause and slight quoted emphasis	Use only for actual quoted content; escape as `\"` in JSON when used
`|`	Vertical bar / pipe	1 ms	`Option A | Option B`	Creates an almost negligible pause	Do not rely on it for natural phrasing; prefer standard punctuation or SSML when supported
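The configured durations in the table above can be used for a rough pause budget when reviewing a prompt. The sketch below is only an estimate under a stated assumption: it simply adds up the configured value of every delimiter character, which is not necessarily how the engine combines adjacent delimiters. The function name is illustrative.

```python
# Configured pause durations in milliseconds, copied from the table above.
PAUSE_MS = {".": 400, ",": 150, ";": 300, ":": 300,
            "!": 350, "?": 400, '"': 150, "|": 1}


def configured_pause_total(text: str) -> int:
    """Add up the configured pause of every delimiter character in text.

    Assumption: pauses are summed independently. Perceived pause length
    may differ depending on the voice, model, and surrounding text.
    """
    return sum(PAUSE_MS.get(ch, 0) for ch in text)


print(configured_pause_total("Hello, John."))  # comma (150) + period (400)
```

A prompt whose estimate is much higher than that of similar prompts is a reasonable candidate for a punctuation review.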

4.2. Other Common Characters and Their Possible Effects

The table below lists other common characters that are not defined as engine-level pause delimiters, but may still affect synthesis depending on context.

The “Relative Audio Representation” column is illustrative only. Actual output may vary depending on voice, language, model behavior, normalization rules, and surrounding text.

Character	Name	Example	Relative Audio Representation	Typical Effect on Output	Common Risk	Recommendation
`'`	Apostrophe / single quotation mark	`I'm ready.`	`I'm ready` `[no pause]`	Usually merges naturally into the word	Wrong spacing may break word flow	Use only when it is part of normal writing
`\`	Backslash	`C:\\folder\\file`	`C backslash folder backslash file` or malformed handling	Often sounds technical or is interpreted before synthesis	Escaping issues or broken payload	Avoid in spoken text; escape as `\\` in JSON when used
`/`	Slash	`yes/no`	`yes slash no`	May be spoken literally as “slash” or create awkward phrasing	Unnatural speech in dates, options, or codes	Rewrite as words when natural speech is needed
`-`	Hyphen	`six-digit`	`six-digit` `[very short link]`	Often links words with little or no pause	Repeated use may create broken pacing	Use only where grammatically needed
`–`	Dash	`Thank you – goodbye.`	`Thank you` `[medium pause]` `goodbye`	Creates stronger separation than a comma	Can sound abrupt if overused	Use sparingly
`...`	Ellipsis	`Let me check...`	`Let me check` `[hesitation pause]`	Often creates trailing or delayed cadence	Overuse breaks rhythm	Use rarely
`(` `)`	Parentheses	`Monday (if available)`	`Monday` `[side-note pause]` `if available`	Makes the parenthetical content sound secondary	Awkward spoken flow	Rewrite inline if the content must be clearly spoken
`[` `]`	Square brackets	`[CustomerName]`	`open bracket customer name close bracket` or unresolved placeholder behavior	Often sounds like raw system text	Placeholder leakage	Avoid in final TTS input
`{` `}`	Curly braces	`{Amount}`	`open brace amount close brace` or unresolved variable behavior	Usually sounds like template text	Unresolved placeholder exposure	Never leave in final spoken text
`<` `>`	Angle brackets	`<tag>`	`less than tag greater than` or malformed markup-like behavior	May expose markup or technical text	HTML/XML leakage	Remove tags and markup before synthesis
`&`	Ampersand	`Research & Development`	`Research and Development` or `Research ampersand Development`	May be normalized or read literally	Inconsistent reading	Prefer writing `and`
`@`	At sign	`support@example.com`	`support at example dot com`	Usually spoken literally as “at”	Can sound too technical	Use only when literal reading is intended
`#`	Hash / number sign	`Order #4521`	`Order hash four five two one` or `Order number 4521`	May be spoken as “hash,” “hashtag,” or “number”	Inconsistent reading	Prefer writing `number`
`%`	Percent sign	`20% discount`	`twenty percent discount`	Often understood correctly	Can sound abrupt in dense numeric text	Prefer writing `percent` in customer-facing prompts
`+`	Plus sign	`A+B plan`	`A plus B plan`	Usually spoken literally	May sound unnatural in product/package names	Rewrite if mathematical meaning is not intended
`=`	Equals sign	`value = 5`	`value equals five`	Usually spoken literally	Sounds technical	Rewrite in plain language when possible
`_`	Underscore	`user_name`	`user underscore name`	Often read literally in technical contexts	Sounds like code	Avoid in spoken content
`*`	Asterisk	`promo*gold`	`promo asterisk gold` or unpredictable pause	May be spoken literally or ignored	Sounds technical or decorative	Avoid unless specifically required

4.3. Prosody-Sensitive Character Combinations

Some character combinations may produce different prosodic effects than the individual characters alone. This is especially important when punctuation or delimiter characters are preserved from display-oriented text and passed directly to TTS.

Adjacent delimiter characters may create stronger, less natural, or unexpected prosodic boundaries depending on the configured pause durations and how the text is interpreted during processing.

This is particularly relevant when sentence punctuation or list separators are combined with delimiter characters such as quotation marks or vertical bars.

Meaningless Prosodic Grouping

When a quotation mark (`"`) or a vertical bar (`|`) is combined with a comma (`,`), the result may be meaningless prosodic grouping, broken rhythm, or unnatural pause boundaries.

This does not mean the text is invalid. It means the text structure may be less suitable for natural speech output.

Examples:

Combination	Poor	Recommended
`",`	`The categories are "Standard", "Premium", and "Pro"`	Prefer plain text or colon-based grouping such as: `The categories are Standard, Premium and Pro` or `The categories are: Standard, Premium and Pro`
`|,`	`The categories are |Standard|, |Premium| and |Pro|`	Use vertical-bar-based emphasis only after validation and avoid combining it with commas where possible. Prefer structures such as: `|Standard |Premium`

Key Rules

Validate spoken naturalness as well as text correctness.
Remove unnecessary quotation marks from enumerations or category lists where possible.
Avoid combining delimiter characters with commas unless the result has been validated in synthesis output.
Prefer punctuation that supports natural prosody over display-style quoting.
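One way to apply these rules before synthesis is to flag runs of adjacent delimiter characters for manual review. This is a minimal sketch; the function name is hypothetical, and the check also flags combinations such as `".` or `...`, so treat the result as a review list rather than an error list.

```python
import re

# Pause delimiter characters listed in Section 4.1.
DELIMITERS = '.,;:!?"|'

# Two or more delimiter characters in a row, for example `",` or `|,`.
ADJACENT = re.compile("[" + re.escape(DELIMITERS) + "]{2,}")


def find_adjacent_delimiters(text: str) -> list[str]:
    """Return every run of two or more adjacent delimiter characters."""
    return ADJACENT.findall(text)


print(find_adjacent_delimiters('The categories are "Standard", "Premium", and "Pro"'))
```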

5. Text Preparation and Input Cleaning

High-quality TTS output starts with clean and well-prepared input text.

Avoid

repeated punctuation
inconsistent spacing
raw variable placeholders
markup tags
decorative symbols
code-like strings in customer-facing prompts

Examples:

Poor	Recommended
`Hello Mr.Doe---your order #4521 is ready!!!`	`Hello, Mr. Doe. Your order number 4 5 2 1 is ready.`
`Dear [John], your balance is {500 US Dollars}.`	`Dear John, your balance is 500 US Dollars.`
`<p>Your order is confirmed</p>`	`Your order is confirmed.`

Raw technical content may be spoken literally, interpreted unexpectedly, or reduce overall naturalness.
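The cleaning steps above can be sketched as a small pre-processing pass on the client side. This is an illustration under assumptions, not part of the SESTEK product: the function names are hypothetical, and collapsing repeated punctuation also turns an ellipsis into a single period, which may or may not be what you want.

```python
import re


def clean_for_tts(text: str) -> str:
    """Remove markup tags, collapse repeated punctuation, normalize spaces."""
    text = re.sub(r"<[^>]+>", " ", text)            # drop tags such as <p>...</p>
    text = re.sub(r"([!?.,;:])\1+", r"\1", text)    # "!!!" -> "!"
    return re.sub(r"\s+", " ", text).strip()        # single spaces only


def unresolved_placeholders(text: str) -> list[str]:
    """Return bracketed or braced segments that look like template variables."""
    return re.findall(r"\[[^\]]*\]|\{[^}]*\}", text)


print(clean_for_tts("<p>Your order is confirmed</p>"))
print(unresolved_placeholders("Dear [John], your balance is {500 US Dollars}."))
```

Reporting placeholders instead of stripping them silently keeps the decision about the correct replacement text with the prompt author.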

6. Spacing and Sentence Structure

6.1. Word Spacing

Spacing errors frequently lead to poor synthesis quality.

Tips:

Use a single space between words
Keep punctuation attached correctly to the preceding word
Avoid overloading one sentence with too much information

Examples:

Poor	Recommended
`Your order is confirmed`	`Your order is confirmed.`
`Mr.Doe`	`Mr. Doe`
`I 'm here to help`	`I'm here to help.`

6.2. Sentence Length

Very long sentences may reduce naturalness and make phrasing unstable.

Tips:

Split long paragraphs into shorter sentences

Examples:

Poor	Recommended
`Your application has been approved and the contract will be sent by email today and you should review it and reply by Friday so we can continue the process.`	`Your application has been approved. The contract will be sent by email today. Please review it and reply by Friday.`
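A simple length check can help find sentences worth splitting. The sketch below assumes a threshold of 20 words, which is an arbitrary illustrative value and not a SESTEK limit. The sentence splitter is naive and will also split after abbreviations such as `Mr.`.

```python
import re


def long_sentences(text: str, max_words: int = 20) -> list[str]:
    """Return sentences that contain more than max_words words."""
    # Naive split: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if len(s.split()) > max_words]


poor = ("Your application has been approved and the contract will be sent by "
        "email today and you should review it and reply by Friday so we can "
        "continue the process.")
print(long_sentences(poor))
```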

7. Numbers, Symbols, and Spoken-Friendly Rewriting

Symbols and compressed written forms do not always produce the most natural spoken output.

Tips:

Prefer the spoken form whenever clarity matters more than visual compactness.

Examples:

Input	Recommended
`Order #4521`	`Order number 4 5 2 1` or rewrite fully `Order number four five two one`
`20% discount`	`20 percent discount`
`yes/no`	`yes or no`
`A+B package`	`A plus B package`
`10/04/2026`	`10 April 2026`

A format that looks correct on screen is not always the format that sounds best in speech.
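Some of the rewrites in the table above are mechanical enough to automate. The following sketch covers only two of them (`#` before an identifier and `%` after a number); the function names are hypothetical, and context-dependent cases such as `/` in dates or options are deliberately left to manual rewriting.

```python
import re


def spell_digits(number: str) -> str:
    """Separate digits with spaces so they are read one by one."""
    return " ".join(number)


def spoken_friendly(text: str) -> str:
    """Rewrite '#<digits>' and '<digit>%' into spoken-friendly forms."""
    # "#4521" -> "number 4 5 2 1" (treated as an identifier, read digit by digit)
    text = re.sub(r"#(\d+)", lambda m: "number " + spell_digits(m.group(1)), text)
    # "20%" -> "20 percent"
    return re.sub(r"(\d)\s*%", r"\1 percent", text)


print(spoken_friendly("Order #4521"))
print(spoken_friendly("20% discount"))
```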

8. URLs, Emails, and Technical Strings

8.1. URLs and Technical Strings

URLs, file paths, usernames, and technical identifiers often sound unnatural when synthesized directly.

Tips:

Only keep such content in TTS input if the system is expected to read it literally.

Examples:

Input	Possible Output Behavior	Recommended
`example.com/order/4521`	May be read as a long technical string	`example dot com slash order slash 4 5 2 1`
`C:\newfolder\test`	May sound highly technical or break payload formatting	Remove or rewrite for spoken output
`user_name`	May be read as `user underscore name`	`user name` if the intention is the user name specifically
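When a URL really must be read literally, the spoken form shown in the table can be generated instead of typed by hand. This is a minimal sketch with a hypothetical function name; dropping the `http://` or `https://` prefix and reading digits one by one are design choices made here, not documented engine behavior.

```python
import re


def spoken_url(url: str) -> str:
    """Turn a simple URL into a spoken form using 'dot' and 'slash'."""
    url = re.sub(r"^https?://", "", url)                        # drop the scheme
    url = re.sub(r"\d+", lambda m: " ".join(m.group(0)), url)   # "4521" -> "4 5 2 1"
    url = url.replace(".", " dot ").replace("/", " slash ")
    return re.sub(r"\s+", " ", url).strip()


print(spoken_url("example.com/order/4521"))
```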

8.2. Emails

Emails may be normalized to spoken forms automatically, but exact pronunciation of handles, punctuation, and domains may vary.

Tips:

Keep the email address only if it must be spoken literally.
Otherwise, rewrite it as a spoken instruction.
Define abbreviations for domains.
Separate trailing punctuation marks (dots, hyphens, underscores, and so on) from the end of the email address so that they are not read as part of it.

Examples:

Input	Possible pronunciation	Recommended
`support@example.com`	`support at example dot com`	use only if literal reading is required
`info@company.ai`	may vary depending on domain reading per language	rewrite if exact pronunciation matters

Punctuation and Separator Characters with Emails:

If an email address is followed immediately by sentence punctuation or separator characters, some outputs may interpret those characters as part of the email address. This may result in undesired synthesis. For such cases follow the tips below.

Tips:

Rewrite the sentence.
Insert a pause after the address using a space character, or SSML if supported.
Express the email address in spoken form.

Examples (English):

Input	Possible pronunciation	Recommended input	Recommended input with SSML
`my email address is support@sestek.com. I live in Ankara`	`my email address is support at sestek dot com dot I live in Ankara`	`my email address is support at sestek dot com. I live in Ankara` or, if the address cannot be rewritten: `my email address is support@sestek.com I live in Ankara`	`<speak>My email address is support@sestek.com<break time="300ms"/> I live in Ankara.</speak>`

Examples (Turkish):

Input	Possible pronunciation	Recommended input	Recommended input with SSML
`eposta adresim support@sestek.com. Ankara'da ikamet ediyorum`	`eposta adresim support et sestek nokta kom nokta Ankara'da ikamet ediyorum`	`eposta adresim support et sestek nokta kom. Ankara'da ikamet ediyorum` or, if the address cannot be rewritten: `eposta adresim support@sestek.com Ankara'da ikamet ediyorum`	`<speak>eposta adresim support@sestek.com<break time="300ms"/> Ankara'da ikamet ediyorum.</speak>`
`eposta adresim support@sestek.com'dir. Ankara'da ikamet ediyorum`	`eposta adresim support et sestek nokta kom dir nokta Ankara'da ikamet ediyorum`	`eposta adresim support et sestek nokta kom'dir. Ankara'da ikamet ediyorum` or, if the address cannot be rewritten: `eposta adresim support@sestek.com'dir Ankara'da ikamet ediyorum`	`<speak>eposta adresim support@sestek.com'dir<break time="300ms"/> Ankara'da ikamet ediyorum.</speak>`
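The "express the email address in spoken form" option can be applied automatically before synthesis. The sketch below is an assumption-labeled illustration: the function names are hypothetical, the address pattern is deliberately simple, and the `at` and `dot` words are parameters so that another language's words can be supplied. It does not transliterate the domain, so it will not produce forms such as `kom` on its own.

```python
import re

# Simple address pattern: the match stops before a trailing sentence period,
# because a dot must be followed by at least one more domain character.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def spoken_email(address: str, at: str = "at", dot: str = "dot") -> str:
    """Replace '@' and '.' in one address with spoken words."""
    return address.replace("@", f" {at} ").replace(".", f" {dot} ")


def rewrite_emails(text: str, at: str = "at", dot: str = "dot") -> str:
    """Rewrite every email address found in text into its spoken form."""
    return EMAIL.sub(lambda m: spoken_email(m.group(0), at, dot), text)


print(rewrite_emails("my email address is support@sestek.com. I live in Ankara"))
```

Because the sentence-ending period stays outside the rewritten address, it remains available to the engine as ordinary punctuation.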

9. Pronunciation and Language Customization

9.1. Abbreviations

Abbreviation management allows custom pronunciations for abbreviations and acronyms. Default system behavior applies unless explicitly customized.

For advanced control, refer to: Customization for Synthesis Accuracy

Undefined abbreviations are a common source of unexpected pronunciation.

Tips:

Define abbreviations or custom pronunciations for foreign-origin words and brand names that do not have clear equivalents in the target language or script.
Decide whether the abbreviation should be spoken:
- as a full word
- letter-by-letter
- or as its expanded form
Prefer the representation that matches the intended spoken output.

Examples:

Input	Recommended
`Dr. Smith`	`Doctor Smith`
`Mr. Doe`	`Mister Doe`
`ATM`	`A T M` if the letters should be spoken individually
`ETA`	`E T A` or `estimated time of arrival`, depending on the intended output
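Where the built-in abbreviation management is not available or not yet configured, the same decisions can be applied as a client-side text substitution. This sketch is only a stand-in and does not represent the SESTEK abbreviation feature or its API; the dictionary contents and function name are illustrative.

```python
import re

# Illustrative spoken forms taken from the examples above.
SPOKEN_FORMS = {"Dr.": "Doctor", "Mr.": "Mister", "ATM": "A T M", "ETA": "E T A"}


def expand_abbreviations(text: str, spoken_forms: dict[str, str] = SPOKEN_FORMS) -> str:
    """Replace whole-token abbreviations with their chosen spoken form."""
    # Longest keys first so that longer abbreviations win over shorter ones.
    keys = sorted(spoken_forms, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(k) for k in keys) + r")(?!\w)"
    )
    return pattern.sub(lambda m: spoken_forms[m.group(1)], text)


print(expand_abbreviations("Dr. Smith will use the ATM."))
```

The word-boundary checks keep the substitution from firing inside longer tokens such as `ATMs`.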

9.2. Names, Brand Names, Foreign-Origin Words, and Mixed-Language Text

Proper nouns, brand names, product names, and foreign-origin words are a common source of pronunciation variation.

Mixed-language text may also reduce naturalness, especially when scripts or pronunciation rules change within the same sentence.

Tips:

Define abbreviations for foreign-origin words and brand names that have no equivalent in the target language.
Prefer the form that is closest to the intended spoken output.
Expand abbreviations letter-by-letter or segment-by-segment when needed in context.

Examples (Turkish):

Input	Recommended
`"Mike"`	`"Mayk"`
`"Google"`	`"gugıl"`
`"Peugeot"`	`"pejo"`
`"ATM"`	`"ateme"` (letter-by-letter)
`"ChatGPT"`	`"çet cii pii tii"` (segment-by-segment)

Examples (English):

Input	Recommended
`"Husain"`	`"who-sayin"`
`"Dr"`	`"doctor"`
`"VIP"`	`"V I P"`

If a name, brand, or foreign-origin word appears frequently in your prompts, it is better to define it once through abbreviations than to correct it repeatedly in input text.

9.3. Normalization

Normalization converts written text into a spoken-friendly representation before synthesis. This helps improve pronunciation of numbers, dates, and other complex text elements.

Normalization behavior is automatic and not user-configurable. Spoken output depends on detected format and context.

9.3.1. Numbers

Numeric values are automatically converted to spoken forms. Context determines whether numbers are read digit-by-digit or as whole values.

Long numeric sequences can cause numbers to sound too fast or dense.

Tips:

Write each number in the form that matches its intended reading.
Rewrite the number in words when a full-value reading is required.
Separate digits when digit-by-digit reading is intended.
Insert pauses for long identifiers or codes.
For additional pause guidance, see Section 10.

Input	Recommended
`123`	`one hundred twenty-three` if a whole-number reading is intended
`1 2 3 4`	`one two three four` if digit-by-digit reading is intended
`Order 4521`	`Order 4 5 2 1` or `Order four five two one` if it is an identifier

9.3.2. Dates

Numeric and alphanumeric date forms are automatically converted to spoken forms. Month phrases are always spoken even when input is in numeric format.

Spoken output depends on detected format and context. Ambiguous date formats can cause incorrect synthesis.

Tips:

Prefer unambiguous date formats.
Use month names when clarity matters.
Keep separators and formatting consistent.

Input	Recommended
`10/04/2026`	`10 April 2026`
`04/10/2026`	rewrite using month name if ambiguity is possible
`10 April 2026`	spoken naturally with less ambiguity
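If your application already knows whether a numeric date is day-first or month-first, it can remove the ambiguity before synthesis by writing the month name. The sketch below uses only the Python standard library; the function name and the `day_first` parameter are illustrative, and English month names are hard-coded to avoid locale-dependent output.

```python
from datetime import datetime

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]


def spoken_date(value: str, day_first: bool = True) -> str:
    """Convert 'NN/NN/YYYY' into 'D Month YYYY' using an explicit field order."""
    fmt = "%d/%m/%Y" if day_first else "%m/%d/%Y"
    parsed = datetime.strptime(value, fmt)  # raises ValueError on invalid dates
    return f"{parsed.day} {MONTHS[parsed.month - 1]} {parsed.year}"


print(spoken_date("10/04/2026"))                   # day-first reading
print(spoken_date("10/04/2026", day_first=False))  # month-first reading
```

The same input yields two different dates depending on the flag, which is exactly the ambiguity the tips above aim to remove.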

9.3.3. Times

Numeric and alphanumeric time forms are automatically converted to spoken forms. Spoken output depends on detected format and surrounding context.

Incorrect Synthesis

Ambiguous time formats can cause incorrect synthesis.

Tips:

Prefer clear and standard time formats.
Rewrite unusual or compressed forms when clarity matters.

Examples

Input	Recommended
`9:45`	`9:45` or `nine forty-five`
`09.45`	use a standard time format consistently
`945`	rewrite if the intended time reading is important

9.3.4. Currencies

Currency names and symbols are normalized to spoken forms automatically.

Currency codes (e.g. USD, EUR or TRY) are not supported and must be defined as abbreviations.

Tips:

Prefer full currency names for correct synthesis.

Examples:

Input	Recommended
`$20`	`20 dollars`
`20 USD`	`20 US dollars`
`500 TRY`	`500 Turkish lira`

Define abbreviations for currency codes if their spoken form must be controlled.

Examples (English):

Input	Recommended
`$`	`dollars`
`USD`	`US dollars`
`TRY`	`Turkish lira`

Examples (Turkish):

Input	Recommended
`$`	`dolar`
`USD`	`dolar` or `amerikan doları`
`TL`	`T L` or `te le`
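The English rewrites above can also be applied in a pre-processing step when abbreviations have not been defined. This is a minimal sketch with a hypothetical function name; the mapping contains only the codes shown in the examples, and it does not handle singular forms such as `1 dollar`.

```python
import re

# Spoken names for the currency codes used in the examples above.
CURRENCY_NAMES = {"USD": "US dollars", "TRY": "Turkish lira"}


def spoken_amounts(text: str, names: dict[str, str] = CURRENCY_NAMES) -> str:
    """Rewrite '$20' and '20 USD' style amounts into spoken-friendly forms."""
    # "$20" -> "20 dollars"
    text = re.sub(r"\$(\d+(?:\.\d+)?)", r"\1 dollars", text)
    # "20 USD" -> "20 US dollars" (only codes present in the mapping)
    codes = "|".join(map(re.escape, names))
    return re.sub(
        rf"(\d)\s*({codes})\b",
        lambda m: f"{m.group(1)} {names[m.group(2)]}",
        text,
    )


print(spoken_amounts("Your balance is 500 TRY."))
```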

9.3.5. Addresses

There is no address-specific normalization. Numeric parts in addresses follow standard number normalization rules.

Province, street, and building names, flat numbers, and dense address strings may be interpreted unexpectedly. Long numeric sequences may sound dense without pauses.

Tips:

Add commas at natural phrase boundaries.
Expand abbreviations where needed.
Separate long numeric segments clearly.

Examples:

Input	Recommended
`221B Baker St.`	`221 B, Baker Street` if that is the intended spoken reading
`742 Evergreen Terrace Apt 5`	`742 Evergreen Terrace, Apartment number 5`

9.3.6. Symbols

The pronunciation of symbols may be interpreted differently depending on language and surrounding context.

Tips:

Define abbreviations for symbols for customized pronunciation.

Examples (Turkish):

Symbol: "&" → Default pronunciation: "ve".

Input	Recommended
`D&R`	`di en ar`
`miles&smiles`	`mayls en smayls`

10. Pauses and Flow Control

10.1. Punctuation

Punctuation influences pauses automatically, but manual pause control may still be useful in specific cases.

Tips:

Use punctuation first for natural phrasing
Use manual pause control only when necessary
Place pauses at semantic boundaries
Avoid excessive pause insertion in short text

Examples:

Without structure:
Your verification code is 1 2 5 0 7 8 and your order number is 4 5 2 1 please keep them for reference.
Better structured:
Your verification code is 1 2 5 0 7 8. Your order number is 4 5 2 1. Please keep them for reference.

Lack of pauses in long numeric sequences may cause incorrect synthesis.

10.2. SSML Tags

SSML tags allow customizing pronunciation, intonation, and emphasis. SESTEK supports the most commonly used SSML tags including <break>, <say-as>, <voice>, and audio insertion.
For full details, refer to: SSML Tag Support.

Tips:

Use pauses sparingly - avoid excessive breaks in short sentences
Place pauses at semantic boundaries
For long numeric sequences, group digits (3–3–4 or 2–2–2) and add short breaks

Examples:

Without SSML pauses:
Your verification code is 1 2 5 0 7 8 and your order number is 4 5 2 1 please keep them for reference.
With SSML pauses:

<speak>Your verification code is<break time='300ms'/> 1 2 5<break time='250ms'/>0 7 8<break time='250ms'/> and your order number is<break time='300ms'/>4 5 2 1<break time='300ms'/>please keep them for reference</speak>

Excessive manual pause control may make output sound artificial. Use it only when punctuation and sentence structure do not provide enough clarity.
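The digit-grouping tip can be sketched as a helper that inserts a `<break>` tag between groups. The helper name, the default 250 ms break, and the 3-3 grouping are illustrative choices based on the example above, not engine requirements.

```python
def grouped_digits_ssml(digits: str, groups: tuple[int, ...], break_ms: int = 250) -> str:
    """Split digits into groups and join the groups with SSML break tags."""
    if sum(groups) != len(digits):
        raise ValueError("group sizes must add up to the number of digits")
    parts, pos = [], 0
    for size in groups:
        # Digits inside a group are separated by spaces for digit-by-digit reading.
        parts.append(" ".join(digits[pos:pos + size]))
        pos += size
    return f'<break time="{break_ms}ms"/>'.join(parts)


code = grouped_digits_ssml("125078", (3, 3))
print(f"<speak>Your verification code is {code}.</speak>")
```

Generating the tags from one helper keeps break durations consistent across prompts, which makes later tuning easier.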

11. Why Output May Sound Unexpected

Common Causes

Missing punctuation
Excessive punctuation
Unresolved placeholders
Raw special characters
Dense numeric content
Long or poorly segmented sentences
Raw URLs or technical strings
Channel/output mismatch
Malformed payloads caused by incorrect JSON escaping

Examples:

Poor	Recommended
`Hello "Mr.Doe", your order #4521 has been shipped!!! Track here: <https://example.com/order/4521>`	`Hello, Mr. Doe. Your order number 4521 has been shipped. You can track your order on our website.`

12. Voice Selection, Rate, Volume, and Supported Controls

12.1. Voices

Different voices may vary in clarity, pacing, and handling of numbers, acronyms, or foreign-origin words.
For the full list of supported voices, refer to: Supported Languages and Voices

Premium Voices Note:

Premium voices are LLM-based voices optimized for more natural-sounding and higher-quality neural speech output.
However, their synthesis behavior may differ from standard voices in unexpected ways, especially in speech flow, prosody, and handling of symbols or mixed-language text.

Tip:

Always test and validate premium voices with representative prompts in the target environment before production use.

12.2. Adjustable Parameters

Depending on the deployment and selected voice, the following parameters may be available:

Rate - controls the speaking tempo of the voice.
Volume - controls the base loudness level of the voice.

Recommended Approach
Start with default values, then adjust gradually and validate with representative prompts.

Parameter	Recommended
Rate	Start from the default value, adjust in small steps (±0.05), and validate after each change until the voice sounds natural and intelligible
Volume	Keep default unless the playback scenario requires a change

Parameter Values

Optimal voice, rate, and volume values depend on the listener, the target language, and the playback scenario.

12.3. Emotion, Tone, and Phoneme Tags

Explicit emotion or tone controls and phoneme tags are not supported.
Emotional delivery cannot be directly controlled via markup. Indirect influence is possible through:

Sentence structure
Punctuation
Wording
Strategic pause placement

For implementation-specific markup support, refer to: SSML Tag Support

13. Audio Output Format and Playback Compatibility Guide

Match synthesis output settings to the target playback channel
Avoid unnecessary transcoding where possible
Use telephony-appropriate settings for IVR/call-center playback
Validate synthesized audio in the actual downstream environment
Be aware that playback issues may originate from channel limitations rather than the synthesis itself

14. Troubleshooting Guide

A word, abbreviation, or brand name is pronounced incorrectly

Possible causes: Undefined abbreviation, acronym ambiguity, foreign-origin word, mixed-language text, or unexpected default normalization.
Actions: Define a custom pronunciation if supported, rewrite the term in a spoken-friendly form, expand the abbreviation, or keep the sentence in one language/script where possible.

Numbers, dates, or times sound unexpected

Possible causes: Ambiguous format, dense numeric content, insufficient pauses, or a whole-number vs. digit-by-digit mismatch.
Actions: Rewrite the value in the intended spoken form, separate digits when needed, use unambiguous date/time formats, and add pauses where clarity is critical.

Output does not sound as intended in mixed-language content

Possible causes: Frequent language-switching, foreign-origin words, or inconsistent script usage within the same sentence.
Actions: Keep the sentence in one language where possible, use localized spellings/transliterations in language-specific deployments, or define recurring foreign terms through abbreviations.

Output sounds too fast or dense

Possible causes: Long sentences, dense numeric blocks, insufficient punctuation.
Actions: Split the text into shorter sentences, add punctuation, group content more clearly.

Output sounds robotic or overly segmented

Possible causes: Too much punctuation, excessive commas, too many forced pauses.
Actions: Simplify punctuation and remove unnecessary separators.

Symbols are spoken literally

Possible causes: Raw symbols were sent directly in the input.
Actions: Rewrite symbols in spoken form where needed.

Placeholders or markup are spoken

Possible causes: Template variables or tags were not resolved before synthesis.
Actions: Clean and normalize the text before sending it to TTS.

Request fails before audio is generated

Possible causes: Invalid JSON, especially unescaped " or \.
Actions: Validate and escape the payload correctly.
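Escaping problems are easiest to avoid by letting a JSON library build the payload instead of concatenating strings. In the sketch below, the `text` field name is a placeholder and not the actual SESTEK request schema; use the field names from the API documentation for your deployment.

```python
import json


def build_payload(text: str) -> str:
    """Serialize the input text so that quotes and backslashes are escaped."""
    # "text" is a hypothetical field name used only for illustration.
    return json.dumps({"text": text}, ensure_ascii=False)


raw = 'He said "Wait". The file is in C:\\folder'
payload = build_payload(raw)
print(payload)

# Round-trip check: the decoded value equals the original text.
assert json.loads(payload)["text"] == raw
```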

15. Summary

Recommended:

Use clean, spoken-friendly text
Apply punctuation intentionally
Keep spacing consistent
Rewrite symbols and compact forms when clarity matters
Remove placeholders, markup, and technical artifacts
Escape JSON payloads correctly
Use unambiguous number, date, time, and currency formats
Define abbreviations or custom pronunciations for recurring acronyms, names, and foreign-origin terms
Keep one language/script per sentence where possible
Match output settings to the target playback environment
Use sentence structure to support natural prosody
Validate voice, rate, and volume settings with representative prompts

Avoid:

Repeated decorative punctuation
Meaningless prosodic grouping caused by combined delimiter characters
Raw URLs and code-like text in customer-facing prompts
Leaving unresolved placeholders in synthesis input
Assuming every symbol will be read naturally
Diagnosing voice quality before validating input formatting
Treating screen-friendly formatting as automatically speech-friendly
Relying on ambiguous abbreviations without customization
Mixing languages/scripts within the same sentence unnecessarily
Expecting emotion control tags
Passing raw numeric-heavy text without preparation

16. Final Note

Well-formed TTS output starts with well-prepared input. In many cases, improving punctuation, spacing, symbol usage, and request formatting is enough to significantly improve synthesis clarity, prosody, and stability without changing the voice itself.
