TTS Best Practices

This document provides practical guidance for using SESTEK Text-to-Speech effectively. It explains how general text formatting, punctuation, spacing, special characters, and output configuration influence TTS behavior, and it offers best practices for improving pronunciation, pacing, prosody, and synthesis stability across common use cases.

By following these guidelines, you can produce more natural-sounding output and better understand expected system behavior when diagnosing undesired synthesis results.


1. Purpose and Scope

Objectives

  • Enable higher-quality TTS outputs
  • Explain how the TTS system interprets general written text
  • Reduce ambiguity when diagnosing synthesis issues
  • Provide clear guidance for text preparation and request formatting
  • Improve output naturalness through better punctuation, spacing, and prosody control

Scope

  • General Text-to-Speech (TTS) best practices
  • Text preparation before synthesis
  • Special character handling
  • Request formatting and JSON escaping
  • Output-oriented troubleshooting

This document describes general system behavior and recommended usage patterns for TTS input preparation. It is not language-specific guidance and should not be mixed with language-specific pronunciation documents such as Arabic TTS guidance.


2. TTS Processing Overview

SESTEK TTS follows this synthesis pipeline:

Input Text
    ↓
Text Normalization
    ↓
Prosody Assignment
    ↓
Phoneme Generation
    ↓
Waveform Synthesis

Most unexpected synthesis outcomes originate from input ambiguity, formatting, or normalization behavior, not from the voice engine itself.

What This Means in Practice

Even when the words themselves are correct, the following may still affect the final result:

  • missing punctuation
  • excessive punctuation
  • incorrect spacing
  • unresolved placeholders
  • raw special characters
  • malformed JSON payloads
  • ambiguous numeric or symbolic expressions
  • unsuitable sentence length or structure

3. Prosody and Why It Matters

In TTS, prosody refers to how speech is delivered, including:

  • Pause placement
  • Pause length
  • Rhythm
  • Phrase grouping
  • Emphasis
  • Sentence melody / intonation

Punctuation and special characters influence prosody even when they are not spoken literally.

Examples:

  • A comma usually creates a short pause
  • A period usually creates a sentence-ending pause
  • A question mark usually creates question-like intonation
  • An ellipsis may create a hesitation or trailing pause
  • Repeated symbols may create unnatural rhythm or exaggerated emphasis

A character does not need to be spoken aloud to affect synthesis. Many characters influence prosody indirectly by changing how the text is segmented and interpreted.


4. Special Characters Reference Tables

Special characters may affect synthesis in different ways. Some characters are explicitly handled by the SESTEK TTS engine as pause delimiters with configured pause durations.

Other characters are not defined as engine-level pause delimiters, but may still affect pronunciation, rhythm, readability, or output naturalness depending on the text and context.

The pause durations in the first table are configured at engine level and are expressed in milliseconds. Actual perceived pause length may still vary depending on the voice, model behavior, and surrounding text.

4.1. SESTEK TTS Pause Delimiter Characters

The table below lists the characters currently supported by the SESTEK TTS engine as pause delimiters.

| Character | Name | Configured Pause Duration | Example | Typical Effect on Output | Recommendation |
| --- | --- | --- | --- | --- | --- |
| . | Full stop / period | 400 ms | Your order is ready. | Creates a sentence-ending pause | Use to end complete thoughts |
| , | Comma | 150 ms | Hello, John. | Creates a short phrase break | Use for natural short pauses |
| ; | Semicolon | 300 ms | Approved; next step follows. | Creates a structured pause | Prefer comma or period unless a semicolon is clearly needed |
| : | Colon | 300 ms | Options: sales, support. | Introduces what follows with a medium pause | Use only when the sentence structure supports it |
| ! | Exclamation mark | 350 ms | Thank you! | Adds emphasis with a stronger ending pause | Use sparingly |
| ? | Question mark | 400 ms | Are you available? | Creates question-like intonation with a stronger ending pause | Use only for actual questions |
| " | Double quotation mark | 150 ms | He said "Wait". | May create a short pause and slight quoted emphasis | Use only for actual quoted content; escape as \" in JSON when used |
| \| | Vertical bar / pipe | 1 ms | Option A \| Option B | Creates an almost negligible pause | Do not rely on it for natural phrasing; prefer standard punctuation or SSML when supported |
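
As a rough illustration of how the configured delimiter pauses add up, the sketch below sums the table values for every delimiter character found in a text. The function name is hypothetical, and the idea that pauses simply add together is an assumption made for illustration; the engine may combine adjacent delimiters differently, and perceived pause length varies by voice.

```python
# Configured pause durations in milliseconds, copied from the table above.
PAUSE_MS = {
    ".": 400, ",": 150, ";": 300, ":": 300,
    "!": 350, "?": 400, '"': 150, "|": 1,
}


def estimate_pause_ms(text: str) -> int:
    """Naively sum the configured pause of every delimiter character.

    Assumption: pauses are additive. Dots inside URLs, emails, or decimals
    are counted too, so treat the result as an upper-bound hint only.
    """
    return sum(PAUSE_MS.get(ch, 0) for ch in text)


if __name__ == "__main__":
    for sample in ("Hello, John.", 'He said "Wait".', "Option A | Option B"):
        print(f"{sample!r}: ~{estimate_pause_ms(sample)} ms of configured pause")
```

A figure like this is only useful for comparing two phrasings of the same prompt; it does not predict the actual audio duration.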

4.2. Other Common Characters and Their Possible Effects

The table below lists other common characters that are not defined as engine-level pause delimiters, but may still affect synthesis depending on context.

The “Relative Audio Representation” column is illustrative only. Actual output may vary depending on voice, language, model behavior, normalization rules, and surrounding text.

| Character | Name | Example | Relative Audio Representation | Typical Effect on Output | Common Risk | Recommendation |
| --- | --- | --- | --- | --- | --- | --- |
| ' | Apostrophe / single quotation mark | I'm ready. | I'm ready [no pause] | Usually merges naturally into the word | Wrong spacing may break word flow | Use only when it is part of normal writing |
| \ | Backslash | C:\\folder\\file | C backslash folder backslash file or malformed handling | Often sounds technical or is interpreted before synthesis | Escaping issues or broken payload | Avoid in spoken text; escape as \\ in JSON when used |
| / | Slash | yes/no | yes slash no | May be spoken literally as “slash” or create awkward phrasing | Unnatural speech in dates, options, or codes | Rewrite as words when natural speech is needed |
| - | Hyphen | six-digit | six-digit [very short link] | Often links words with little or no pause | Repeated use may create broken pacing | Use only where grammatically needed |
| – | Dash | Thank you – goodbye. | Thank you [medium pause] goodbye | Creates stronger separation than a comma | Can sound abrupt if overused | Use sparingly |
| ... | Ellipsis | Let me check... | Let me check [hesitation pause] | Often creates trailing or delayed cadence | Overuse breaks rhythm | Use rarely |
| ( ) | Parentheses | Monday (if available) | Monday [side-note pause] if available | Makes the parenthetical content sound secondary | Awkward spoken flow | Rewrite inline if the content must be clearly spoken |
| [ ] | Square brackets | [CustomerName] | open bracket customer name close bracket or unresolved placeholder behavior | Often sounds like raw system text | Placeholder leakage | Avoid in final TTS input |
| { } | Curly braces | {Amount} | open brace amount close brace or unresolved variable behavior | Usually sounds like template text | Unresolved placeholder exposure | Never leave in final spoken text |
| < > | Angle brackets | <tag> | less than tag greater than or malformed markup-like behavior | May expose markup or technical text | HTML/XML leakage | Remove tags and markup before synthesis |
| & | Ampersand | Research & Development | Research and Development or Research ampersand Development | May be normalized or read literally | Inconsistent reading | Prefer writing “and” |
| @ | At sign | support@example.com | support at example dot com | Usually spoken literally as “at” | Can sound too technical | Use only when literal reading is intended |
| # | Hash / number sign | Order #4521 | Order hash four five two one or Order number 4521 | May be spoken as “hash,” “hashtag,” or “number” | Inconsistent reading | Prefer writing “number” |
| % | Percent sign | 20% discount | twenty percent discount | Often understood correctly | Can sound abrupt in dense numeric text | Prefer writing “percent” in customer-facing prompts |
| + | Plus sign | A+B plan | A plus B plan | Usually spoken literally | May sound unnatural in product/package names | Rewrite if mathematical meaning is not intended |
| = | Equals sign | value = 5 | value equals five | Usually spoken literally | Sounds technical | Rewrite in plain language when possible |
| _ | Underscore | user_name | user underscore name | Often read literally in technical contexts | Sounds like code | Avoid in spoken content |
| * | Asterisk | promo*gold | promo asterisk gold or unpredictable pause | May be spoken literally or ignored | Sounds technical or decorative | Avoid unless specifically required |

4.3. Prosody-Sensitive Character Combinations

Some character combinations may produce different prosodic effects than the individual characters alone. This is especially important when punctuation or delimiter characters are preserved from display-oriented text and passed directly to TTS.

Adjacent delimiter characters may create stronger, less natural, or unexpected prosodic boundaries depending on the configured pause durations and how the text is interpreted during processing.

This is particularly relevant when sentence punctuation or list separators are combined with delimiter characters such as quotation marks or vertical bars.

Meaningless Prosodic Grouping

When quotation marks (") or vertical bars (|) are combined with commas (,), they may create meaningless prosodic grouping, broken rhythm, or unnatural pause boundaries.

This does not mean the text is invalid. It means the text structure may be less suitable for natural speech output.

Examples:

Combination: ",

  • Poor: The categories are "Standard", "Premium", and "Pro"
  • Recommended: Prefer plain text or colon-based grouping, such as: The categories are Standard, Premium and Pro or The categories are: Standard, Premium and Pro

Combination: |,

  • Poor: The categories are |Standard|, |Premium| and |Pro|
  • Recommended: Use vertical-bar-based emphasis only after validation and avoid combining it with commas where possible. Prefer structures such as: |Standard |Premium

Key Rules

  • Validate spoken naturalness as well as text correctness.
  • Remove unnecessary quotation marks from enumerations or category lists where possible.
  • Avoid combining delimiter characters with commas unless the result has been validated in synthesis output.
  • Prefer punctuation that supports natural prosody over display-style quoting.
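
The combinations described in this section can be flagged before synthesis with a small check. The following is a minimal sketch, assuming that "combined" means a quotation mark or vertical bar directly next to a comma, optionally separated by spaces; the function name is hypothetical and the check is not part of the SESTEK API.

```python
import re

# A quotation mark or vertical bar next to a comma, in either order,
# optionally separated by whitespace (for example `",` or `|,`).
_COMBINATION = re.compile(r'["|]\s*,|,\s*["|]')


def find_delimiter_combinations(text: str) -> list[str]:
    """Return every delimiter-plus-comma combination found in the text."""
    return [match.group(0) for match in _COMBINATION.finditer(text)]


if __name__ == "__main__":
    poor = 'The categories are "Standard", "Premium", and "Pro"'
    better = "The categories are Standard, Premium and Pro"
    print(find_delimiter_combinations(poor))
    print(find_delimiter_combinations(better))
```

A non-empty result does not mean the text is invalid; it only marks places where the synthesized output is worth listening to before release.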

5. Text Preparation and Input Cleaning

High-quality TTS output starts with clean and well-prepared input text.

Recommended

  • Use complete and readable sentences
  • Keep punctuation intentional and minimal
  • Use single spacing between words
  • Resolve placeholders before synthesis
  • Remove raw HTML, XML, JSON fragments, and template markers
  • Rewrite technical strings into spoken-friendly text when possible

Avoid

  • repeated punctuation
  • inconsistent spacing
  • raw variable placeholders
  • markup tags
  • decorative symbols
  • code-like strings in customer-facing prompts

Examples:

| Poor | Recommended |
| --- | --- |
| Hello Mr.Doe---your order #4521 is ready!!! | Hello, Mr. Doe. Your order number 4 5 2 1 is ready. |
| Dear [John], your balance is {500 US Dollars}. | Dear John, your balance is 500 US Dollars. |
| <p>Your order is confirmed</p> | Your order is confirmed. |

Raw technical content may be spoken literally, interpreted unexpectedly, or reduce overall naturalness.
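
The cleaning steps in this section can be sketched as a small pre-processing helper. This is a minimal sketch under stated assumptions, not a SESTEK utility: the function names are hypothetical, markup is removed with a simple regular expression rather than a full HTML parser, and placeholders are assumed to look like `[Name]` or `{Amount}`.

```python
import html
import re

_TAG = re.compile(r"<[^>]+>")                        # HTML/XML-style tags
_PLACEHOLDER = re.compile(r"\[[^\]]*\]|\{[^}]*\}")   # [Name] or {Amount}
_REPEATED_PUNCT = re.compile(r"([!?.,;:])\1+")       # "!!!" -> "!"
_WHITESPACE = re.compile(r"\s+")


def clean_tts_input(text: str) -> str:
    """Strip tags, decode entities, collapse repeated punctuation and spaces."""
    text = _TAG.sub(" ", text)
    text = html.unescape(text)
    # Note: this also collapses an ellipsis ("...") into a single period.
    text = _REPEATED_PUNCT.sub(r"\1", text)
    return _WHITESPACE.sub(" ", text).strip()


def unresolved_placeholders(text: str) -> list[str]:
    """Return template markers that should be resolved before synthesis."""
    return _PLACEHOLDER.findall(text)


if __name__ == "__main__":
    print(clean_tts_input("<p>Your order is confirmed</p>"))
    print(unresolved_placeholders("Dear [John], your balance is {500 US Dollars}."))
```

Placeholders are reported rather than removed, because dropping them silently would hide a templating problem instead of fixing it.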


6. Spacing and Sentence Structure

6.1. Word Spacing

Spacing errors frequently lead to poor synthesis quality.

Tips:

  • Use a single space between words
  • Keep punctuation attached correctly to the preceding word
  • Avoid overloading one sentence with too much information

Examples:

| Poor | Recommended |
| --- | --- |
| Your order is confirmed | Your order is confirmed. |
| Mr.Doe | Mr. Doe |
| I 'm here to help | I'm here to help. |
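
The spacing fixes in the examples above can be approximated with a few regular expressions. This is a minimal sketch with a hypothetical function name; the contraction list and the rule that a lowercase letter, a period, and an uppercase letter mark a missing space are assumptions, and they can misfire on file names or mixed-case identifiers.

```python
import re


def normalize_spacing(text: str) -> str:
    """Apply simple spacing fixes before sending text to TTS."""
    # Collapse runs of whitespace into single spaces.
    text = re.sub(r"\s+", " ", text).strip()
    # Attach punctuation to the preceding word: "confirmed ." -> "confirmed."
    text = re.sub(r"\s+([.,;:!?])", r"\1", text)
    # Rejoin common English contractions: "I 'm" -> "I'm"
    text = re.sub(r"(\w) '(m|s|re|ve|ll|d|t)\b", r"\1'\2", text)
    # Insert the missing space after a period: "Mr.Doe" -> "Mr. Doe"
    text = re.sub(r"([a-z])\.([A-Z])", r"\1. \2", text)
    return text


if __name__ == "__main__":
    for sample in ("Mr.Doe", "I 'm here to help", "Your  order   is confirmed ."):
        print(repr(sample), "->", repr(normalize_spacing(sample)))
```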

6.2. Sentence Length

Very long sentences may reduce naturalness and make phrasing unstable.

Tips:

  • Split long paragraphs into shorter sentences

Examples:

| Poor | Recommended |
| --- | --- |
| Your application has been approved and the contract will be sent by email today and you should review it and reply by Friday so we can continue the process. | Your application has been approved. The contract will be sent by email today. Please review it and reply by Friday. |
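
Overlong sentences can be flagged automatically before synthesis. The sketch below is illustrative only: the function name is hypothetical, the 20-word threshold is an assumed value rather than a documented limit, and sentences are split naively on terminal punctuation, so abbreviations such as "Mr." also count as sentence ends.

```python
import re


def long_sentences(text: str, max_words: int = 20) -> list[str]:
    """Return the sentences whose word count exceeds max_words."""
    # Split after ".", "!" or "?" when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if len(s.split()) > max_words]


if __name__ == "__main__":
    poor = (
        "Your application has been approved and the contract will be sent by "
        "email today and you should review it and reply by Friday so we can "
        "continue the process."
    )
    better = (
        "Your application has been approved. The contract will be sent by "
        "email today. Please review it and reply by Friday."
    )
    print(len(long_sentences(poor)), len(long_sentences(better)))
```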

7. Numbers, Symbols, and Spoken-Friendly Rewriting

Symbols and compressed written forms do not always produce the most natural spoken output.

Tips:

  • Prefer the spoken form whenever clarity matters more than visual compactness.

Examples:

| Input | Recommended |
| --- | --- |
| Order #4521 | Order number 4 5 2 1, or fully rewritten: Order number four five two one |
| 20% discount | 20 percent discount |
| yes/no | yes or no |
| A+B package | A plus B package |
| 10/04/2026 | 10 April 2026 |

A format that looks correct on screen is not always the format that sounds best in speech.
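
Several of the rewrites in the table above are mechanical enough to automate for plain prose. The following is a minimal sketch with hypothetical function names; it assumes that `#` before digits introduces an identifier to be read digit by digit and that a slash between two words means "or". It should not be applied to URLs, dates, or file paths, where those assumptions do not hold.

```python
import re


def spell_digits(number: str) -> str:
    """Separate digits with spaces so they are read one by one."""
    return " ".join(number)


def rewrite_symbols(text: str) -> str:
    """Rewrite a few compact written forms into spoken-friendly text."""
    # "#4521" -> "number 4 5 2 1"
    text = re.sub(r"#(\d+)", lambda m: "number " + spell_digits(m.group(1)), text)
    # "20%" -> "20 percent"
    text = re.sub(r"(\d+)\s*%", r"\1 percent", text)
    # "yes/no" -> "yes or no"
    text = re.sub(r"\b([A-Za-z]+)/([A-Za-z]+)\b", r"\1 or \2", text)
    # "A+B" -> "A plus B" (single letters only)
    text = re.sub(r"\b([A-Za-z])\+([A-Za-z])\b", r"\1 plus \2", text)
    return text


if __name__ == "__main__":
    for sample in ("Order #4521", "20% discount", "yes/no", "A+B package"):
        print(sample, "->", rewrite_symbols(sample))
```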


8. URLs, Emails, and Technical Strings

8.1. URLs and Technical Strings

URLs, file paths, usernames, and technical identifiers often sound unnatural when synthesized directly.

Tips:

  • Only keep such content in TTS input if the system is expected to read it literally.

Examples:

| Input | Possible Output Behavior | Recommended |
| --- | --- | --- |
| example.com/order/4521 | May be read as a long technical string | example dot com slash order slash 4 5 2 1 |
| C:\newfolder\test | May sound highly technical or break payload formatting | Remove or rewrite for spoken output |
| user_name | May be read as user underscore name | user name if the intention is the user name specifically |
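
When a technical string does need to be read literally, writing out its separators gives a predictable result. This is a minimal sketch with a hypothetical function name and a deliberately small symbol map limited to the separators shown in this document; a URL scheme such as `https://` is assumed to have been removed beforehand.

```python
# Spoken words for separators, following the examples in this document.
SPOKEN = {".": "dot", "/": "slash", "_": "underscore"}


def spoken_technical_string(value: str) -> str:
    """Spell out separators and read digits one by one."""
    parts = []
    for ch in value:
        if ch in SPOKEN:
            parts.append(f" {SPOKEN[ch]} ")
        elif ch.isdigit():
            parts.append(f" {ch} ")
        else:
            parts.append(ch)
    # Collapse the extra spaces introduced above.
    return " ".join("".join(parts).split())


if __name__ == "__main__":
    print(spoken_technical_string("example.com/order/4521"))
    print(spoken_technical_string("user_name"))
```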

8.2. Emails

Emails may be normalized to spoken forms automatically, but exact pronunciation of handles, punctuation, and domains may vary.

Tips:

  • Keep the email address only if it must be spoken literally.
  • Otherwise, rewrite it as a spoken instruction.
  • Define abbreviations for domains.
  • Separate confusing punctuation marks (dots, hyphens, underscores, etc.) from the end of the email address.

Examples:

| Input | Possible pronunciation | Recommended |
| --- | --- | --- |
| support@example.com | support at example dot com | use only if literal reading is required |
| info@company.ai | may vary depending on domain reading per language | rewrite if exact pronunciation matters |

Punctuation and Separator Characters with Emails:

If an email address is immediately followed by sentence punctuation or separator characters, some outputs may interpret those characters as part of the email address, which can lead to undesired synthesis. In such cases, follow the tips below.

Tips:

  • Rewrite the sentence.
  • Insert a pause using a space character, or SSML if supported.
  • Express the email address in spoken form.

Examples (English):

  • Input: my email address is support@sestek.com. I live in Ankara
  • Possible pronunciation: my email address is support at sestek dot com dot I live in Ankara
  • Recommended input: my email address is support at sestek dot com. I live in Ankara
  • Alternative, if the spoken form cannot be written: my email address is support@sestek.com I live in Ankara
  • Recommended input with SSML: <speak>My email address is support@sestek.com<break time="300ms"/> I live in Ankara.</speak>

Examples (Turkish):

  • Input: eposta adresim support@sestek.com. Ankara'da ikamet ediyorum
  • Possible pronunciation: eposta adresim support et sestek nokta kom nokta Ankara'da ikamet ediyorum
  • Recommended input: eposta adresim support et sestek nokta kom. Ankara'da ikamet ediyorum
  • Alternative, if the spoken form cannot be written: eposta adresim support@sestek.com Ankara'da ikamet ediyorum
  • Recommended input with SSML: <speak>eposta adresim support@sestek.com<break time="300ms"/> Ankara'da ikamet ediyorum.</speak>

  • Input: eposta adresim support@sestek.com'dir. Ankara'da ikamet ediyorum
  • Possible pronunciation: eposta adresim support et sestek nokta kom dir nokta Ankara'da ikamet ediyorum
  • Recommended input: eposta adresim support et sestek nokta kom'dir. Ankara'da ikamet ediyorum
  • Alternative, if the spoken form cannot be written: eposta adresim support@sestek.com'dir Ankara'da ikamet ediyorum
  • Recommended input with SSML: <speak>eposta adresim support@sestek.com'dir<break time="300ms"/> Ankara'da ikamet ediyorum.</speak>
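
Expressing the address in spoken form, as recommended above, can be done before the text reaches TTS. The following is a minimal sketch: the function names and the simplified email pattern are assumptions, and the spoken words for "@" and "." are parameters because they differ per language (for example "et" and "nokta" in the Turkish examples).

```python
import re

# Simplified email pattern for illustration; it does not cover every valid address.
_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def spoken_email(address: str, at_word: str = "at", dot_word: str = "dot") -> str:
    """Replace "@" and "." inside an address with spoken words."""
    return address.replace("@", f" {at_word} ").replace(".", f" {dot_word} ")


def speak_emails(text: str, at_word: str = "at", dot_word: str = "dot") -> str:
    """Rewrite every email address in the text into its spoken form.

    A period that ends the sentence is not part of the match, so it stays
    a normal sentence delimiter.
    """
    return _EMAIL.sub(lambda m: spoken_email(m.group(0), at_word, dot_word), text)


if __name__ == "__main__":
    print(speak_emails("my email address is support@sestek.com. I live in Ankara"))
```

The helper only substitutes the separator words; transliterations such as "kom" in the Turkish examples still have to be written by hand or defined as abbreviations.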

9. Pronunciation and Language Customization

9.1. Abbreviations

Abbreviation management allows custom pronunciations for abbreviations and acronyms. Default system behavior applies unless explicitly customized.

For advanced control, refer to: Customization for Synthesis Accuracy

Undefined abbreviations are a common source of unexpected pronunciation.

Tips:

  • Define abbreviations or custom pronunciations for foreign-origin words and brand names that do not have clear equivalents in the target language or script.
  • Decide whether the abbreviation should be spoken:
    • as a full word
    • letter-by-letter
    • or as its expanded form
  • Prefer the representation that matches the intended spoken output.

Examples:

| Input | Recommended |
| --- | --- |
| Dr. Smith | Doctor Smith |
| Mr. Doe | Mister Doe |
| ATM | A T M if the letters should be spoken individually |
| ETA | E T A or estimated time of arrival, depending on the intended output |
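
Where abbreviation management is not available or a quick client-side fix is needed, the same decisions can be applied to the input text itself. This is a hypothetical pre-processing sketch, not the product's abbreviation feature; the two dictionaries are illustrative and would need to be maintained per language and per deployment.

```python
import re

# Abbreviations replaced by their expanded form (illustrative entries).
EXPANSIONS = {"Dr.": "Doctor", "Mr.": "Mister"}
# Acronyms that should be spoken letter by letter (illustrative entries).
SPELL_OUT = {"ATM", "VIP"}


def _spell(match: re.Match) -> str:
    word = match.group(0)
    return " ".join(word) if word in SPELL_OUT else word


def expand_abbreviations(text: str) -> str:
    """Expand known abbreviations and spell out selected acronyms."""
    for abbreviation, expanded in EXPANSIONS.items():
        text = text.replace(abbreviation, expanded)
    # Only all-caps words listed in SPELL_OUT are changed.
    return re.sub(r"\b[A-Z]{2,}\b", _spell, text)


if __name__ == "__main__":
    print(expand_abbreviations("Dr. Smith is waiting at the ATM."))
```

Acronyms that are not listed, such as ETA, are left untouched so that the choice between letter-by-letter and expanded reading stays explicit.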

9.2. Names, Brand Names, Foreign-Origin Words, and Mixed-Language Text

Proper nouns, brand names, product names, and foreign-origin words are a common source of pronunciation variation.

Mixed-language text may also reduce naturalness, especially when scripts or pronunciation rules change within the same sentence.

Tips:

  • Define abbreviations for foreign-origin words and brand names that have no equivalent in the target language.
  • Prefer the form that is closest to the intended spoken output.
  • Expand abbreviations letter-by-letter or segment-by-segment when needed in context.

Examples (Turkish):

| Input | Recommended |
| --- | --- |
| "Mike" | "Mayk" |
| "Google" | "gugıl" |
| "Peugeot" | "pejo" |
| "ATM" | "ateme" (letter-by-letter) |
| "ChatGPT" | "çet cii pii tii" (segment-by-segment) |

Examples (English):

| Input | Recommended |
| --- | --- |
| "Husain" | "who-sayin" |
| "Dr" | "doctor" |
| "VIP" | "V I P" |

If a name, brand, or foreign-origin word appears frequently in your prompts, it is better to define it once through abbreviations than to correct it repeatedly in input text.

9.3. Normalization

Normalization converts written text into a spoken-friendly representation before synthesis. This helps improve pronunciation of numbers, dates, and other complex text elements.

Normalization behavior is automatic and not user-configurable. Spoken output depends on detected format and context.

9.3.1. Numbers

Numeric values are automatically converted to spoken forms. Context determines whether numbers are read digit-by-digit or as whole values.

Long numeric sequences can cause numbers to sound too fast or dense.

Tips:

  • Write numbers in the form that matches how they should be read.
  • Rewrite the number in words when a full-value reading is required.
  • Separate digits when digit-by-digit reading is intended.
  • Insert pauses for long identifiers or codes.
  • For additional pause guidance, see Section 10.

Examples:

| Input | Recommended |
| --- | --- |
| 123 | one hundred twenty-three if a whole-number reading is intended |
| 1 2 3 4 | one two three four if digit-by-digit reading is intended |
| Order 4521 | Order 4 5 2 1 or Order four five two one if it is an identifier |

9.3.2. Dates

Numeric and alphanumeric date forms are automatically converted to spoken forms. Month names are always spoken, even when the input is in numeric format.

Spoken output depends on detected format and context. Ambiguous date formats may cause incorrect synthesis.

Tips:

  • Prefer unambiguous date formats.
  • Use month names when clarity matters.
  • Keep separators and formatting consistent.

Examples:

| Input | Recommended |
| --- | --- |
| 10/04/2026 | 10 April 2026 |
| 04/10/2026 | rewrite using month name if ambiguity is possible |
| 10 April 2026 | spoken naturally with less ambiguity |
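
Numeric dates can be made unambiguous before synthesis when the calling application knows which format it produces. The sketch below uses a hypothetical function name and English month names; the key point is that the input format is passed explicitly instead of being guessed.

```python
from datetime import datetime

MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]


def spoken_date(value: str, fmt: str) -> str:
    """Convert a numeric date into 'day Month year' using a known format.

    fmt is a datetime.strptime pattern such as "%d/%m/%Y" or "%m/%d/%Y".
    """
    parsed = datetime.strptime(value, fmt)
    # Month names come from a fixed list so the result does not depend on locale.
    return f"{parsed.day} {MONTHS[parsed.month - 1]} {parsed.year}"


if __name__ == "__main__":
    print(spoken_date("10/04/2026", "%d/%m/%Y"))
    print(spoken_date("04/10/2026", "%m/%d/%Y"))
```

The same string, 04/10/2026, produces a different date under each format, which is exactly the ambiguity the month name removes.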

9.3.3. Times

Numeric and alphanumeric time forms are automatically converted to spoken forms. Spoken output depends on detected format and surrounding context.

Incorrect Synthesis

Ambiguous time formats may cause incorrect synthesis.

Tips:

  • Prefer clear and standard time formats.
  • Rewrite unusual or compressed forms when clarity matters.

Examples

| Input | Recommended |
| --- | --- |
| 9:45 | 9:45 or nine forty-five |
| 09.45 | use a standard time format consistently |
| 945 | rewrite if the intended time reading is important |

9.3.4. Currencies

Currency names and symbols are normalized to spoken forms automatically.

Currency codes (e.g. USD, EUR or TRY) are not supported and must be defined as abbreviations.

Tips:

  • Prefer full currency names for correct synthesis.

Examples:

| Input | Recommended |
| --- | --- |
| $20 | 20 dollars |
| 20 USD | 20 US dollars |
| 500 TRY | 500 Turkish lira |

  • Define abbreviations for currency codes if their spoken form must be controlled.

Examples (English):

| Input | Recommended |
| --- | --- |
| $ | dollars |
| USD | US dollars |
| TRY | Turkish lira |

Examples (Turkish):

| Input | Recommended |
| --- | --- |
| $ | dolar |
| USD | dolar or amerikan doları |
| TL | T L or te le |
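
If defining abbreviations is not an option, currency codes can also be expanded in the input text. The following is a minimal sketch for English using the names from the examples above; the function name is hypothetical, only whole-number amounts are handled, and singular forms (for example "1 dollar") are not.

```python
import re

# Spoken names for currency codes, taken from the English examples above.
CURRENCY_NAMES = {"USD": "US dollars", "TRY": "Turkish lira"}

_CODE = re.compile(r"(\d+)\s*(" + "|".join(CURRENCY_NAMES) + r")\b")


def expand_currency(text: str) -> str:
    """Rewrite "$20" and "20 USD" style amounts into spoken-friendly text."""
    # "$20" -> "20 dollars"
    text = re.sub(r"\$(\d+)", r"\1 dollars", text)
    # "20 USD" -> "20 US dollars" (known codes only)
    return _CODE.sub(lambda m: f"{m.group(1)} {CURRENCY_NAMES[m.group(2)]}", text)


if __name__ == "__main__":
    for sample in ("$20", "20 USD", "500 TRY"):
        print(sample, "->", expand_currency(sample))
```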

9.3.5. Addresses

There is no address-specific normalization. Numeric parts in addresses follow standard number normalization rules.

Province, street, and building names, flat numbers, and dense address strings may be interpreted unexpectedly. Long numeric sequences may sound dense without pauses.

Tips:

  • Add commas at natural phrase boundaries.
  • Expand abbreviations where needed.
  • Separate long numeric segments clearly.

Examples:

| Input | Recommended |
| --- | --- |
| 221B Baker St. | 221 B, Baker Street if that is the intended spoken reading |
| 742 Evergreen Terrace Apt 5 | 742 Evergreen Terrace, Apartment number 5 |

9.3.6. Symbols

Symbols may be pronounced differently depending on the language and surrounding context.

Tips:

  • Define abbreviations for symbols for customized pronunciation.

Examples (Turkish):

  • Symbol: "&" → Default pronunciation: "ve".

| Input | Recommended |
| --- | --- |
| D&R | di en ar |
| miles&smiles | mayls en smayls |

10. Pauses and Flow Control

10.1. Punctuation

Punctuation influences pauses automatically, but manual pause control may still be useful in specific cases.

Tips:

  • Use punctuation first for natural phrasing
  • Use manual pause control only when necessary
  • Place pauses at semantic boundaries
  • Avoid excessive pause insertion in short text

Examples:

  • Without structure:
    Your verification code is 1 2 5 0 7 8 and your order number is 4 5 2 1 please keep them for reference.
  • Better structured:
    Your verification code is 1 2 5 0 7 8. Your order number is 4 5 2 1. Please keep them for reference.

Lack of pauses in long numeric sequences may cause incorrect synthesis.

10.2. SSML Tags

SSML tags allow customizing pronunciation, intonation, and emphasis. SESTEK supports the most commonly used SSML tags including <break>, <say-as>, <voice>, and audio insertion.
For full details, refer to: SSML Tag Support.

Tips:

  • Use pauses sparingly - avoid excessive breaks in short sentences
  • Place pauses at semantic boundaries
  • For long numeric sequences, group digits (3–3–4 or 2–2–2) and add short breaks

Examples:

  • Without SSML pauses:
    Your verification code is 1 2 5 0 7 8 and your order number is 4 5 2 1 please keep them for reference.

  • With SSML pauses:

<speak>Your verification code is<break time='300ms'/> 1 2 5<break time='250ms'/>0 7 8<break time='250ms'/> and your order number is<break time='300ms'/>4 5 2 1<break time='300ms'/>please keep them for reference</speak>

Excessive manual pause control may make output sound artificial. Use it only when punctuation and sentence structure do not provide enough clarity.
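
Grouping digits and inserting short breaks, as suggested in the tips above, is easy to generate programmatically. This is a minimal sketch with hypothetical function names; the group size and the 250 ms pause are example values taken from this section, and only the `<speak>` and `<break>` elements shown above are used.

```python
from xml.sax.saxutils import escape


def grouped_digits_ssml(digits: str, group: int = 3, pause_ms: int = 250) -> str:
    """Split a digit string into groups separated by SSML breaks."""
    if not digits.isdigit():
        raise ValueError("digits must contain only 0-9")
    groups = [digits[i:i + group] for i in range(0, len(digits), group)]
    # Digits inside a group are space-separated so they are read one by one.
    spoken = [" ".join(g) for g in groups]
    return f'<break time="{pause_ms}ms"/>'.join(spoken)


def code_prompt_ssml(prefix: str, digits: str) -> str:
    """Build a complete SSML prompt such as a verification-code message."""
    # escape() protects characters like "&" and "<" in the free-text prefix.
    return f"<speak>{escape(prefix)} {grouped_digits_ssml(digits)}.</speak>"


if __name__ == "__main__":
    print(code_prompt_ssml("Your verification code is", "125078"))
```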


11. Why Output May Sound Unexpected

Common Causes

  • Missing punctuation
  • Excessive punctuation
  • Unresolved placeholders
  • Raw special characters
  • Dense numeric content
  • Long or poorly segmented sentences
  • Raw URLs or technical strings
  • Channel/output mismatch
  • Malformed payloads caused by incorrect JSON escaping

Examples:

| Poor | Recommended |
| --- | --- |
| Hello "Mr.Doe", your order #4521 has been shipped!!! Track here: <https://example.com/order/4521> | Hello, Mr. Doe. Your order number 4521 has been shipped. You can track your order on our website. |

12. Voice Selection, Rate, Volume, and Supported Controls

12.1. Voices

Different voices may vary in clarity, pacing, and handling of numbers, acronyms, or foreign-origin words.
For the full list of supported voices, refer to: Supported Languages and Voices

Premium Voices Note:
  • Premium voices are LLM-based voices optimized for more natural-sounding and higher-quality neural speech output.
  • However, their synthesis behavior may differ from standard voices in unexpected ways, especially in speech flow, prosody, and handling of symbols or mixed-language text.

Tip:

  • Always test and validate premium voices with representative prompts in the target environment before production use.

12.2. Adjustable Parameters

Depending on the deployment and selected voice, the following parameters may be available:

  • Rate - controls the speaking tempo of the voice.
  • Volume - controls the base loudness level of the voice.

Recommended Approach
Start with default values, then adjust gradually and validate with representative prompts.

| Parameter | Recommended |
| --- | --- |
| Rate | Start from the default value, adjust in small steps (±0.05), and validate after each change until the voice sounds natural and intelligible |
| Volume | Keep the default unless the playback scenario requires a change |

Parameter Values

Optimal voice, rate, and volume values are subjective and depend on the listener, target language, and playback scenario.

12.3. Emotion, Tone, and Phoneme Tags

Explicit emotion or tone controls and phoneme tags are not supported.
Emotional delivery cannot be directly controlled via markup. Indirect influence is possible through:

  • Sentence structure
  • Punctuation
  • Wording
  • Strategic pause placement

For implementation-specific markup support, refer to: SSML Tag Support


13. Audio Output Format and Playback Compatibility Guide

  • Match synthesis output settings to the target playback channel
  • Avoid unnecessary transcoding where possible
  • Use telephony-appropriate settings for IVR/call-center playback
  • Validate synthesized audio in the actual downstream environment
  • Be aware that playback issues may originate from channel limitations rather than the synthesis itself

14. Troubleshooting Guide

A word, abbreviation, or brand name is pronounced incorrectly

Possible causes: Undefined abbreviation, acronym ambiguity, foreign-origin word, mixed-language text, or unexpected default normalization.
Actions: Define a custom pronunciation if supported, rewrite the term in a spoken-friendly form, expand the abbreviation, or keep the sentence in one language/script where possible.

Numbers, dates, or times sound unexpected

Possible causes: Ambiguous format, dense numeric content, insufficient pauses, or a whole-number vs. digit-by-digit mismatch.
Actions: Rewrite the value in the intended spoken form, separate digits when needed, use unambiguous date/time formats, and add pauses where clarity is critical.

Output does not sound as intended in mixed-language content

Possible causes: Frequent language-switching, foreign-origin words, or inconsistent script usage within the same sentence.
Actions: Keep the sentence in one language where possible, use localized spellings/transliterations in language-specific deployments, or define recurring foreign terms through abbreviation.

Output sounds too fast or dense

Possible causes: Long sentences, dense numeric blocks, insufficient punctuation.
Actions: Split the text into shorter sentences, add punctuation, group content more clearly.

Output sounds robotic or overly segmented

Possible causes: Too much punctuation, excessive commas, too many forced pauses.
Actions: Simplify punctuation and remove unnecessary separators.

Symbols are spoken literally

Possible causes: Raw symbols were sent directly in the input.
Actions: Rewrite symbols in spoken form where needed.

Placeholders or markup are spoken

Possible causes: Template variables or tags were not resolved before synthesis.
Actions: Clean and normalize the text before sending it to TTS.

Request fails before audio is generated

Possible causes: Invalid JSON, especially unescaped " or \.
Actions: Validate and escape the payload correctly.
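
Escaping problems are easiest to avoid by letting a JSON library build the payload instead of concatenating strings. The sketch below uses Python's standard `json` module; the `text` field name is a placeholder assumption and not the documented SESTEK request schema.

```python
import json


def build_request_body(text: str) -> str:
    """Serialize the synthesis text into a JSON body with correct escaping.

    The "text" key is hypothetical; use the field names from the API reference.
    """
    return json.dumps({"text": text}, ensure_ascii=False)


if __name__ == "__main__":
    original = 'He said "Wait". Path: C:\\newfolder\\test'
    body = build_request_body(original)
    print(body)
    # Round trip: the decoded text is identical to the original input.
    print(json.loads(body)["text"] == original)
```

Quotation marks and backslashes in the input are escaped by the serializer, so the text itself never needs manual `\"` or `\\` editing.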


15. Summary

Recommended:

  • Use clean, spoken-friendly text
  • Apply punctuation intentionally
  • Keep spacing consistent
  • Rewrite symbols and compact forms when clarity matters
  • Remove placeholders, markup, and technical artifacts
  • Escape JSON payloads correctly
  • Use unambiguous number, date, time, and currency formats
  • Define abbreviations or custom pronunciations for recurring acronyms, names, and foreign-origin terms
  • Keep one language/script per sentence where possible
  • Match output settings to the target playback environment
  • Use sentence structure to support natural prosody
  • Validate voice, rate, and volume settings with representative prompts

Avoid:

  • Repeated decorative punctuation
  • Meaningless prosodic grouping (see Section 4.3)
  • Raw URLs and code-like text in customer-facing prompts
  • Leaving unresolved placeholders in synthesis input
  • Assuming every symbol will be read naturally
  • Diagnosing voice quality before validating input formatting
  • Treating screen-friendly formatting as automatically speech-friendly
  • Relying on ambiguous abbreviations without customization
  • Mixing languages/scripts within the same sentence unnecessarily
  • Expecting emotion control tags
  • Passing raw numeric-heavy text without preparation

16. Final Note

Well-formed TTS output starts with well-prepared input. In many cases, improving punctuation, spacing, symbol usage, and request formatting is enough to significantly improve synthesis clarity, prosody, and stability without changing the voice itself.