This document provides practical guidance for using SESTEK Text-to-Speech effectively. It explains how general text formatting, punctuation, spacing, special characters, and output configuration influence TTS behavior, and it offers best practices for improving pronunciation, pacing, prosody, and synthesis stability across common use cases.
By following these guidelines, you can produce more natural-sounding output and better understand expected system behavior when diagnosing undesired synthesis results.
1. Purpose and Scope
Objectives
- Enable higher-quality TTS outputs
- Explain how the TTS system interprets general written text
- Reduce ambiguity when diagnosing synthesis issues
- Provide clear guidance for text preparation and request formatting
- Improve output naturalness through better punctuation, spacing, and prosody control
Scope
- General Text-to-Speech (TTS) best practices
- Text preparation before synthesis
- Special character handling
- Request formatting and JSON escaping
- Output-oriented troubleshooting
This document describes general system behavior and recommended usage patterns for TTS input preparation. It is not language-specific guidance and should not be mixed with language-specific pronunciation documents such as Arabic TTS guidance.
2. TTS Processing Overview
SESTEK TTS follows this synthesis pipeline:
Input Text
↓
Text Normalization
↓
Prosody Assignment
↓
Phoneme Generation
↓
Waveform Synthesis
Most unexpected synthesis outcomes originate from input ambiguity, formatting, or normalization behavior, not from the voice engine itself.
What This Means in Practice
Even when the words themselves are correct, the following may still affect the final result:
- missing punctuation
- excessive punctuation
- incorrect spacing
- unresolved placeholders
- raw special characters
- malformed JSON payloads
- ambiguous numeric or symbolic expressions
- unsuitable sentence length or structure
3. Prosody and Why It Matters
In TTS, prosody refers to how speech is delivered, including:
- Pause placement
- Pause length
- Rhythm
- Phrase grouping
- Emphasis
- Sentence melody / intonation
Punctuation and special characters influence prosody even when they are not spoken literally.
Examples:
- A comma usually creates a short pause
- A period usually creates a sentence-ending pause
- A question mark usually creates question-like intonation
- An ellipsis may create a hesitation or trailing pause
- Repeated symbols may create unnatural rhythm or exaggerated emphasis
A character does not need to be spoken aloud to affect synthesis. Many characters influence prosody indirectly by changing how the text is segmented and interpreted.
4. Special Characters Reference Tables
Special characters may affect synthesis in different ways. Some characters are explicitly handled by the SESTEK TTS engine as pause delimiters with configured pause durations.
Other characters are not defined as engine-level pause delimiters, but may still affect pronunciation, rhythm, readability, or output naturalness depending on the text and context.
The pause durations in the first table are configured at engine level and are expressed in milliseconds. Actual perceived pause length may still vary depending on the voice, model behavior, and surrounding text.
4.1. SESTEK TTS Pause Delimiter Characters
The table below lists the characters currently supported by the SESTEK TTS engine as pause delimiters.
| Character | Name | Configured Pause Duration | Example | Typical Effect on Output | Recommendation |
|---|---|---|---|---|---|
| . | Full stop / period | 400 ms | Your order is ready. | Creates a sentence-ending pause | Use to end complete thoughts |
| , | Comma | 150 ms | Hello, John. | Creates a short phrase break | Use for natural short pauses |
| ; | Semicolon | 300 ms | Approved; next step follows. | Creates a structured pause | Prefer comma or period unless a semicolon is clearly needed |
| : | Colon | 300 ms | Options: sales, support. | Introduces what follows with a medium pause | Use only when the sentence structure supports it |
| ! | Exclamation mark | 350 ms | Thank you! | Adds emphasis with a stronger ending pause | Use sparingly |
| ? | Question mark | 400 ms | Are you available? | Creates question-like intonation with a stronger ending pause | Use only for actual questions |
| " | Double quotation mark | 150 ms | He said "Wait". | May create a short pause and slight quoted emphasis | Use only for actual quoted content; escape as \" in JSON when used |
| \| | Vertical bar / pipe | 1 ms | Option A \| Option B | Creates an almost negligible pause | Do not rely on it for natural phrasing; prefer standard punctuation or SSML when supported |
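As a rough illustration of how the configured durations accumulate, the sketch below sums the pause value of every delimiter character in a string. The durations are copied from the table above. The helper name `configured_pause_ms` is hypothetical, and simple addition is an assumption: the engine may combine adjacent pauses differently, and perceived pause length varies by voice.

```python
# Configured pause durations in milliseconds, copied from the delimiter table.
PAUSE_MS = {".": 400, ",": 150, ";": 300, ":": 300, "!": 350, "?": 400, '"': 150, "|": 1}


def configured_pause_ms(text: str) -> int:
    """Sum the configured pause duration of every delimiter character in text.

    Assumption: durations simply add up. This is a budgeting aid for
    comparing phrasings, not a model of the engine's actual timing.
    """
    return sum(PAUSE_MS.get(ch, 0) for ch in text)


print(configured_pause_ms("Hello, John."))  # 150 + 400 = 550
print(configured_pause_ms('The categories are "Standard", "Premium", and "Pro"'))  # 1200
```

Comparing two candidate phrasings this way can show when display-style punctuation, such as quoted list items, adds far more delimiter pauses than the plain wording.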
4.2. Other Common Characters and Their Possible Effects
The table below lists other common characters that are not defined as engine-level pause delimiters, but may still affect synthesis depending on context.
The “Relative Audio Representation” column is illustrative only. Actual output may vary depending on voice, language, model behavior, normalization rules, and surrounding text.
| Character | Name | Example | Relative Audio Representation | Typical Effect on Output | Common Risk | Recommendation |
|---|---|---|---|---|---|---|
| ' | Apostrophe / single quotation mark | I'm ready. | I'm ready [no pause] | Usually merges naturally into the word | Wrong spacing may break word flow | Use only when it is part of normal writing |
| \ | Backslash | C:\\folder\\file | C backslash folder backslash file or malformed handling | Often sounds technical or is interpreted before synthesis | Escaping issues or broken payload | Avoid in spoken text; escape as \\ in JSON when used |
| / | Slash | yes/no | yes slash no | May be spoken literally as “slash” or create awkward phrasing | Unnatural speech in dates, options, or codes | Rewrite as words when natural speech is needed |
| - | Hyphen | six-digit | six-digit [very short link] | Often links words with little or no pause | Repeated use may create broken pacing | Use only where grammatically needed |
| – | Dash | Thank you – goodbye. | Thank you [medium pause] goodbye | Creates stronger separation than a comma | Can sound abrupt if overused | Use sparingly |
| ... | Ellipsis | Let me check... | Let me check [hesitation pause] | Often creates trailing or delayed cadence | Overuse breaks rhythm | Use rarely |
| ( ) | Parentheses | Monday (if available) | Monday [side-note pause] if available | Makes the parenthetical content sound secondary | Awkward spoken flow | Rewrite inline if the content must be clearly spoken |
| [ ] | Square brackets | [CustomerName] | open bracket customer name close bracket or unresolved placeholder behavior | Often sounds like raw system text | Placeholder leakage | Avoid in final TTS input |
| { } | Curly braces | {Amount} | open brace amount close brace or unresolved variable behavior | Usually sounds like template text | Unresolved placeholder exposure | Never leave in final spoken text |
| < > | Angle brackets | <tag> | less than tag greater than or malformed markup-like behavior | May expose markup or technical text | HTML/XML leakage | Remove tags and markup before synthesis |
| & | Ampersand | Research & Development | Research and Development or Research ampersand Development | May be normalized or read literally | Inconsistent reading | Prefer writing and |
| @ | At sign | support@example.com | support at example dot com | Usually spoken literally as “at” | Can sound too technical | Use only when literal reading is intended |
| # | Hash / number sign | Order #4521 | Order hash four five two one or Order number 4521 | May be spoken as “hash,” “hashtag,” or “number” | Inconsistent reading | Prefer writing number |
| % | Percent sign | 20% discount | twenty percent discount | Often understood correctly | Can sound abrupt in dense numeric text | Prefer writing percent in customer-facing prompts |
| + | Plus sign | A+B plan | A plus B plan | Usually spoken literally | May sound unnatural in product/package names | Rewrite if mathematical meaning is not intended |
| = | Equals sign | value = 5 | value equals five | Usually spoken literally | Sounds technical | Rewrite in plain language when possible |
| _ | Underscore | user_name | user underscore name | Often read literally in technical contexts | Sounds like code | Avoid in spoken content |
| * | Asterisk | promo*gold | promo asterisk gold or unpredictable pause | May be spoken literally or ignored | Sounds technical or decorative | Avoid unless specifically required |
4.3. Prosody-Sensitive Character Combinations
Some character combinations may produce different prosodic effects than the individual characters alone. This is especially important when punctuation or delimiter characters are preserved from display-oriented text and passed directly to TTS.
Adjacent delimiter characters may create stronger, less natural, or unexpected prosodic boundaries depending on the configured pause durations and how the text is interpreted during processing.
This is particularly relevant when sentence punctuation or list separators are combined with delimiter characters such as quotation marks or vertical bars.
When a quotation mark (") or a vertical bar (|) is placed directly next to a comma (,), the combination can produce meaningless prosodic grouping, broken rhythm, or unnatural pause boundaries.
This does not mean the text is invalid. It means the text structure may be less suitable for natural speech output.
Examples:
| Combination | Poor | Recommended |
|---|---|---|
| ", | The categories are "Standard", "Premium", and "Pro" | Prefer plain text or colon-based grouping such as: The categories are Standard, Premium and Pro or The categories are: Standard, Premium and Pro |
| \|, | The categories are \|Standard\|, \|Premium\| and \|Pro\| | Use vertical-bar-based emphasis only after validation and avoid combining it with commas where possible. Prefer structures such as: \|Standard \|Premium |
Key Rules
- Validate spoken naturalness as well as text correctness.
- Remove unnecessary quotation marks from enumerations or category lists where possible.
- Avoid combining delimiter characters with commas unless the result has been validated in synthesis output.
- Prefer punctuation that supports natural prosody over display-style quoting.
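The rules above can be checked mechanically before synthesis. The following is a minimal sketch, assuming that a quotation mark or vertical bar immediately followed by a comma is the pattern of interest. The function names are hypothetical, and `unquote_enumeration` removes every double quotation mark, so it is only suitable for enumerations and not for text with genuine quoted speech.

```python
import re

# A quotation mark or vertical bar directly followed by a comma.
RISKY_COMBINATION = re.compile(r'["|],')


def find_risky_combinations(text: str) -> list[str]:
    """Return each quote-comma or bar-comma pair found in text, in order."""
    return RISKY_COMBINATION.findall(text)


def unquote_enumeration(text: str) -> str:
    """Drop double quotation marks so list items are separated by commas only."""
    return text.replace('"', "")


prompt = 'The categories are "Standard", "Premium", and "Pro"'
if find_risky_combinations(prompt):
    prompt = unquote_enumeration(prompt)
print(prompt)  # The categories are Standard, Premium, and Pro
```

A detector of this kind only flags candidates; the result still needs to be validated in actual synthesis output.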
5. Text Preparation and Input Cleaning
High-quality TTS output starts with clean and well-prepared input text.
Recommended
- Use complete and readable sentences
- Keep punctuation intentional and minimal
- Use single spacing between words
- Resolve placeholders before synthesis
- Remove raw HTML, XML, JSON fragments, and template markers
- Rewrite technical strings into spoken-friendly text when possible
Avoid
- repeated punctuation
- inconsistent spacing
- raw variable placeholders
- markup tags
- decorative symbols
- code-like strings in customer-facing prompts
Examples:
| Poor | Recommended |
|---|---|
| Hello Mr.Doe---your order #4521 is ready!!! | Hello, Mr. Doe. Your order number 4 5 2 1 is ready. |
| Dear [John], your balance is {500 US Dollars}. | Dear John, your balance is 500 US Dollars. |
| <p>Your order is confirmed</p> | Your order is confirmed. |
Raw technical content may be spoken literally, interpreted unexpectedly, or reduce overall naturalness.
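A cleaning step of this kind is often applied in the calling application before the request is sent. The sketch below is one possible version, not part of the SESTEK product: `clean_for_tts` is a hypothetical helper that strips tags, drops placeholder brackets while keeping their content, collapses repeated punctuation, and normalizes spacing. It does not resolve placeholders, so an unresolved `[CustomerName]` would still need to be filled in upstream, and it also collapses an ellipsis to a single period.

```python
import re


def clean_for_tts(text: str) -> str:
    """Strip markup, placeholder brackets, and repeated punctuation from text."""
    text = re.sub(r"<[^>]+>", "", text)         # remove HTML/XML-style tags
    text = re.sub(r"[\[\]{}]", "", text)        # drop brackets, keep what they enclose
    text = re.sub(r"([!?.,])\1+", r"\1", text)  # collapse runs such as "!!!" to one mark
    text = re.sub(r"\s+", " ", text)            # single spacing between words
    return text.strip()


print(clean_for_tts("<p>Your order is confirmed</p>"))
# Your order is confirmed
print(clean_for_tts("Dear [John], your balance is {500 US Dollars}."))
# Dear John, your balance is 500 US Dollars.
```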
6. Spacing and Sentence Structure
6.1. Word Spacing
Spacing errors frequently lead to poor synthesis quality.
Tips:
- Use a single space between words
- Keep punctuation attached correctly to the preceding word
- Avoid overloading one sentence with too much information
Examples:
| Poor | Recommended |
|---|---|
| Your order is confirmed | Your order is confirmed. |
| Mr.Doe | Mr. Doe |
| I 'm here to help | I'm here to help. |
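The spacing repairs in the examples above can be approximated with a few regular expressions. This is a heuristic sketch under stated assumptions: `fix_spacing` is a hypothetical helper, the list of titles (Mr, Mrs, Ms, Dr) and contraction endings is illustrative and English-only, and it does not add missing sentence-final punctuation.

```python
import re


def fix_spacing(text: str) -> str:
    """Apply simple spacing repairs; heuristics only, English-oriented."""
    text = re.sub(r"\s+", " ", text).strip()                       # single spaces
    text = re.sub(r" ([,.;:!?])", r"\1", text)                     # attach punctuation to the preceding word
    text = re.sub(r"(\w) '(m|s|re|ve|ll|d|t)\b", r"\1'\2", text)   # "I 'm" becomes "I'm"
    text = re.sub(r"\b(Mr|Mrs|Ms|Dr)\.(?=\w)", r"\1. ", text)      # "Mr.Doe" becomes "Mr. Doe"
    return text


print(fix_spacing("I 'm here to help ."))  # I'm here to help.
print(fix_spacing("Mr.Doe"))               # Mr. Doe
```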
6.2. Sentence Length
Very long sentences may reduce naturalness and make phrasing unstable.
Tips:
- Split long paragraphs into shorter sentences
Examples:
| Poor | Recommended |
|---|---|
| Your application has been approved and the contract will be sent by email today and you should review it and reply by Friday so we can continue the process. | Your application has been approved. The contract will be sent by email today. Please review it and reply by Friday. |
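Overlong sentences can be flagged automatically before a prompt is reviewed. The sketch below splits on sentence-ending punctuation and reports sentences above a word limit. The name `long_sentences` and the default limit of 20 words are assumptions chosen for illustration; the source does not specify a threshold.

```python
import re


def long_sentences(text: str, max_words: int = 20) -> list[str]:
    """Return the sentences in text that contain more than max_words words."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if len(s.split()) > max_words]


poor = (
    "Your application has been approved and the contract will be sent by email "
    "today and you should review it and reply by Friday so we can continue the process."
)
better = (
    "Your application has been approved. The contract will be sent by email today. "
    "Please review it and reply by Friday."
)
print(len(long_sentences(poor)))    # 1
print(len(long_sentences(better)))  # 0
```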
7. Numbers, Symbols, and Spoken-Friendly Rewriting
Symbols and compressed written forms do not always produce the most natural spoken output.
Tips:
- Prefer the spoken form whenever clarity matters more than visual compactness.
Examples:
| Input | Recommended |
|---|---|
| Order #4521 | Order number 4 5 2 1 or rewrite fully Order number four five two one |
| 20% discount | 20 percent discount |
| yes/no | yes or no |
| A+B package | A plus B package |
| 10/04/2026 | 10 April 2026 |
A format that looks correct on screen is not always the format that sounds best in speech.
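Some of the rewrites in the table above are regular enough to script. The following sketch covers four of them; `spoken_friendly` is a hypothetical helper, the replacements are English-only, and the slash rule applies only between letters so that numeric dates are left alone. It is not intended for URLs, where a slash does not mean "or".

```python
import re


def spoken_friendly(text: str) -> str:
    """Rewrite a few compact written forms as words (English, illustrative)."""
    # "#4521" becomes "number 4 5 2 1" (digit-by-digit, as for an identifier)
    text = re.sub(r"#\s*(\d+)", lambda m: "number " + " ".join(m.group(1)), text)
    # "20%" becomes "20 percent"
    text = re.sub(r"(\d)\s*%", r"\1 percent", text)
    # "yes/no" becomes "yes or no"; digits are excluded to avoid touching dates
    text = re.sub(r"([A-Za-z])/([A-Za-z])", r"\1 or \2", text)
    # "A+B" becomes "A plus B"
    text = re.sub(r"(\w)\+(\w)", r"\1 plus \2", text)
    return text


print(spoken_friendly("Order #4521"))   # Order number 4 5 2 1
print(spoken_friendly("20% discount"))  # 20 percent discount
print(spoken_friendly("yes/no"))        # yes or no
```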
8. URLs, Emails, and Technical Strings
8.1. URLs and Technical Strings
URLs, file paths, usernames, and technical identifiers often sound unnatural when synthesized directly.
Tips:
- Only keep such content in TTS input if the system is expected to read it literally.
Examples:
| Input | Possible Output Behavior | Recommended |
|---|---|---|
| example.com/order/4521 | May be read as a long technical string | example dot com slash order slash 4 5 2 1 |
| C:\newfolder\test | May sound highly technical or break payload formatting | Remove or rewrite for spoken output |
| user_name | May be read as user underscore name | user name if the intention is the user name specifically |
8.2. Emails
Emails may be normalized to spoken forms automatically, but exact pronunciation of handles, punctuation, and domains may vary.
Tips:
- Keep the email address only if it must be spoken literally.
- Otherwise, rewrite it as a spoken instruction.
- Define abbreviations for domains.
- Separate confusing punctuation marks (dots, hyphens, underscores, etc.) from the end of the email address.
Examples:
| Input | Possible pronunciation | Recommended |
|---|---|---|
| support@example.com | support at example dot com | use only if literal reading is required |
| info@company.ai | may vary depending on domain reading per language | rewrite if exact pronunciation matters |
If an email address is immediately followed by sentence punctuation or separator characters, those characters may be interpreted as part of the email address, which can result in undesired synthesis. In such cases, follow the tips below.
Tips:
- Rewrite the sentence
- Insert a pause using a space character, or SSML if supported
- Express the email address in spoken form
Examples (English):
| Input | Possible pronunciation | Recommended input | Recommended input with SSML |
|---|---|---|---|
| my email address is support@sestek.com. I live in Ankara | my email address is support at sestek dot com dot I live in Ankara | my email address is support at sestek dot com. I live in Ankara or, if the spoken form cannot be written: my email address is support@sestek.com I live in Ankara | <speak>My email address is support@sestek.com<break time="300ms"/> I live in Ankara.</speak> |
Examples (Turkish):
| Input | Possible pronunciation | Recommended input | Recommended input with SSML |
|---|---|---|---|
| eposta adresim support@sestek.com. Ankara'da ikamet ediyorum | eposta adresim support et sestek nokta kom nokta Ankara'da ikamet ediyorum | eposta adresim support et sestek nokta kom. Ankara'da ikamet ediyorum or, if the spoken form cannot be written: eposta adresim support@sestek.com Ankara'da ikamet ediyorum | <speak>eposta adresim support@sestek.com<break time="300ms"/> Ankara'da ikamet ediyorum.</speak> |
| eposta adresim support@sestek.com'dir. Ankara'da ikamet ediyorum | eposta adresim support et sestek nokta kom dir nokta Ankara'da ikamet ediyorum | eposta adresim support et sestek nokta kom'dir. Ankara'da ikamet ediyorum or, if the spoken form cannot be written: eposta adresim support@sestek.com'dir Ankara'da ikamet ediyorum | <speak>eposta adresim support@sestek.com'dir<break time="300ms"/> Ankara'da ikamet ediyorum.</speak> |
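Expressing the address in spoken form, as in the "Recommended input" column of the English example, can be automated in the calling application. The sketch below is illustrative only: `speak_email` and `rewrite_emails` are hypothetical helpers, the regular expression is a simplified email pattern rather than a full validator, and the words "at" and "dot" are English (the Turkish examples above use "et" and "nokta").

```python
import re

# Simplified email pattern; the domain part stops before a trailing sentence period.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def speak_email(address: str) -> str:
    """Spell an address the way the English example above reads it."""
    return address.replace("@", " at ").replace(".", " dot ")


def rewrite_emails(text: str) -> str:
    """Replace every email address in text with its spoken form."""
    return EMAIL.sub(lambda m: speak_email(m.group(0)), text)


print(rewrite_emails("my email address is support@sestek.com. I live in Ankara"))
# my email address is support at sestek dot com. I live in Ankara
```

Because the sentence-ending period is left outside the match, it stays a normal delimiter and is no longer at risk of being read as part of the address.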
9. Pronunciation and Language Customization
9.1. Abbreviations
Abbreviation management allows custom pronunciations for abbreviations and acronyms. Default system behavior applies unless explicitly customized.
For advanced control, refer to: Customization for Synthesis Accuracy
Undefined abbreviations are a common source of unexpected pronunciation.
Tips:
- Define abbreviations or custom pronunciations for foreign-origin words and brand names that do not have clear equivalents in the target language or script.
- Decide whether the abbreviation should be spoken:
- as a full word
- letter-by-letter
- or as its expanded form
- Prefer the representation that matches the intended spoken output.
Examples:
| Input | Recommended |
|---|---|
| Dr. Smith | Doctor Smith |
| Mr. Doe | Mister Doe |
| ATM | A T M if the letters should be spoken individually |
| ETA | E T A or estimated time of arrival, depending on the intended output |
9.2. Names, Brand Names, Foreign-Origin Words, and Mixed-Language Text
Proper nouns, brand names, product names, and foreign-origin words are a common source of pronunciation variation.
Mixed-language text may also reduce naturalness, especially when scripts or pronunciation rules change within the same sentence.
Tips:
- Define abbreviations for foreign-origin words and brand names that have no equivalent in the target language.
- Prefer the form that is closest to the intended spoken output.
- Expand abbreviations letter-by-letter or segment-by-segment when needed in context.
Examples (Turkish):
| Input | Recommended |
|---|---|
| "Mike" | "Mayk" |
| "Google" | "gugıl" |
| "Peugeot" | "pejo" |
| "ATM" | "ateme" (letter-by-letter) |
| "ChatGPT" | "çet cii pii tii" (segment-by-segment) |
Examples (English):
| Input | Recommended |
|---|---|
| "Husain" | "who-sayin" |
| "Dr" | "doctor" |
| "VIP" | "V I P" |
If a name, brand, or foreign-origin word appears frequently in your prompts, it is better to define it once through abbreviations than to correct it repeatedly in input text.
9.3. Normalization
Normalization converts written text into a spoken-friendly representation before synthesis. This helps improve pronunciation of numbers, dates, and other complex text elements.
Normalization behavior is automatic and not user-configurable. Spoken output depends on detected format and context.
9.3.1. Numbers
Numeric values are automatically converted to spoken forms. Context determines whether numbers are read digit-by-digit or as whole values.
Long numeric sequences can cause numbers to sound too fast or dense.
Tips:
- Write each number in the exact form in which it should be read.
- Rewrite the number in words when a full-value reading is required.
- Separate digits when digit-by-digit reading is intended.
- Insert pauses for long identifiers or codes.
- For additional pause guidance, see Section 10.
| Input | Recommended |
|---|---|
| 123 | one hundred twenty-three if a whole-number reading is intended |
| 1 2 3 4 | one two three four if digit-by-digit reading is intended |
| Order 4521 | Order 4 5 2 1 or Order four five two one if it is an identifier |
9.3.2. Dates
Numeric and alphanumeric date forms are automatically converted to spoken forms. Month phrases are always spoken even when input is in numeric format.
Spoken output depends on detected format and context. Ambiguous date formats can cause incorrect synthesis.
Tips:
- Prefer unambiguous date formats.
- Use month names when clarity matters.
- Keep separators and formatting consistent.
| Input | Recommended |
|---|---|
| 10/04/2026 | 10 April 2026 |
| 04/10/2026 | rewrite using month name if ambiguity is possible |
| 10 April 2026 | spoken naturally with less ambiguity |
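When the day/month order of a numeric date is known to the calling application, the month name can be written out before synthesis. The sketch below assumes slash-separated dates with a four-digit year. `spell_out_date` is a hypothetical helper, and the `day_first` flag must be supplied by the caller because the order cannot be inferred from a value such as 04/10/2026.

```python
from datetime import datetime

# English month names are listed explicitly so the result does not depend on locale.
MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]


def spell_out_date(value: str, day_first: bool = True) -> str:
    """Rewrite a numeric date such as 10/04/2026 with the month name."""
    fmt = "%d/%m/%Y" if day_first else "%m/%d/%Y"
    parsed = datetime.strptime(value, fmt)  # raises ValueError on an invalid date
    return f"{parsed.day} {MONTHS[parsed.month - 1]} {parsed.year}"


print(spell_out_date("10/04/2026"))                   # 10 April 2026
print(spell_out_date("04/10/2026", day_first=False))  # 10 April 2026
print(spell_out_date("04/10/2026"))                   # 4 October 2026
```

The last two calls show why the order has to be stated explicitly: the same string yields two different dates.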
9.3.3. Times
Numeric and alphanumeric time forms are automatically converted to spoken forms. Spoken output depends on detected format and surrounding context.
Ambiguous time formats can cause incorrect synthesis.
Tips:
- Prefer clear and standard time formats.
- Rewrite unusual or compressed forms when clarity matters.
Examples:
| Input | Recommended |
|---|---|
| 9:45 | 9:45 or nine forty-five |
| 09.45 | use a standard time format consistently |
| 945 | rewrite if the intended time reading is important |
9.3.4. Currencies
Currency names and symbols are normalized to spoken forms automatically.
Currency codes (e.g. USD, EUR or TRY) are not supported and must be defined as abbreviations.
Tips:
- Prefer full currency names for correct synthesis.
Examples:
| Input | Recommended |
|---|---|
| $20 | 20 dollars |
| 20 USD | 20 US dollars |
| 500 TRY | 500 Turkish lira |
- Define abbreviations for currency codes if their spoken form must be controlled.
Examples (English):
| Input | Recommended |
|---|---|
| $ | dollars |
| USD | US dollars |
| TRY | Turkish lira |
Examples (Turkish):
| Input | Recommended |
|---|---|
| $ | dolar |
| USD | dolar or amerikan doları |
| TL | T L or te le |
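Where currency codes cannot be defined as abbreviations, they can be expanded in the input text instead. The sketch below uses the English spoken forms from the tables above. `expand_currency` and the `CURRENCY_NAMES` mapping are hypothetical, only the two codes shown in the examples are included, and singular forms such as "1 dollar" are not handled.

```python
import re

# Spoken forms taken from the English examples; extend as needed.
CURRENCY_NAMES = {"USD": "US dollars", "TRY": "Turkish lira"}
CODE_PATTERN = re.compile(r"\b(" + "|".join(CURRENCY_NAMES) + r")\b")


def expand_currency(text: str) -> str:
    """Rewrite "$20" as "20 dollars" and expand known currency codes."""
    text = re.sub(r"\$\s*(\d+(?:\.\d+)?)", r"\1 dollars", text)
    return CODE_PATTERN.sub(lambda m: CURRENCY_NAMES[m.group(1)], text)


print(expand_currency("$20"))      # 20 dollars
print(expand_currency("20 USD"))   # 20 US dollars
print(expand_currency("500 TRY"))  # 500 Turkish lira
```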
9.3.5. Addresses
There is no address-specific normalization. Numeric parts in addresses follow standard number normalization rules.
Province, street, and building names, flat numbers, and dense address strings may be interpreted unexpectedly. Long numeric sequences may sound dense without pauses.
Tips:
- Add commas at natural phrase boundaries.
- Expand abbreviations where needed.
- Separate long numeric segments clearly.
Examples:
| Input | Recommended |
|---|---|
| 221B Baker St. | 221 B, Baker Street if that is the intended spoken reading |
| 742 Evergreen Terrace Apt 5 | 742 Evergreen Terrace, Apartment number 5 |
9.3.6. Symbols
The pronunciation of symbols may be interpreted differently depending on language and surrounding context.
Tips:
- Define abbreviations for symbols for customized pronunciation.
Examples (Turkish):
- Symbol: "&" → Default pronunciation: "ve".
| Input | Recommended |
|---|---|
| D&R | di en ar |
| miles&smiles | mayls en smayls |
10. Pauses and Flow Control
10.1. Punctuation
Punctuation influences pauses automatically, but manual pause control may still be useful in specific cases.
Tips:
- Use punctuation first for natural phrasing
- Use manual pause control only when necessary
- Place pauses at semantic boundaries
- Avoid excessive pause insertion in short text
Examples:
- Without structure:
Your verification code is 1 2 5 0 7 8 and your order number is 4 5 2 1 please keep them for reference.
- Better structured:
Your verification code is 1 2 5 0 7 8. Your order number is 4 5 2 1. Please keep them for reference.
Lack of pauses in long numeric sequences may cause incorrect synthesis.
10.2. SSML Tags
SSML tags allow customizing pronunciation, intonation, and emphasis. SESTEK supports the most commonly used SSML tags including <break>, <say-as>, <voice>, and audio insertion.
For full details, refer to: SSML Tag Support.
Tips:
- Use pauses sparingly - avoid excessive breaks in short sentences
- Place pauses at semantic boundaries
- For long numeric sequences, group digits (3–3–4 or 2–2–2) and add short breaks
Examples:
- Without SSML pauses:
Your verification code is 1 2 5 0 7 8 and your order number is 4 5 2 1 please keep them for reference.
- With SSML pauses:
<speak>Your verification code is<break time='300ms'/> 1 2 5<break time='250ms'/>0 7 8<break time='250ms'/> and your order number is<break time='300ms'/>4 5 2 1<break time='300ms'/>please keep them for reference</speak>
Excessive manual pause control may make output sound artificial. Use it only when punctuation and sentence structure do not provide enough clarity.
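The digit-grouping tip can be applied programmatically when codes are generated at runtime. The sketch below builds markup in the same style as the SSML example above. The helper names, the default group size of 3, and the 250 ms and 300 ms values are illustrative assumptions, and the input is assumed to contain digits only, so no XML escaping is performed.

```python
import xml.etree.ElementTree as ET


def grouped_digits_ssml(digits: str, group_size: int = 3, pause_ms: int = 250) -> str:
    """Space out digits and place a <break> between groups of group_size digits."""
    groups = [" ".join(digits[i:i + group_size]) for i in range(0, len(digits), group_size)]
    return f"<break time='{pause_ms}ms'/>".join(groups)


def verification_ssml(code: str) -> str:
    """Wrap a digits-only code in a <speak> prompt and check it is well-formed XML."""
    ssml = (
        "<speak>Your verification code is<break time='300ms'/> "
        f"{grouped_digits_ssml(code)}.</speak>"
    )
    ET.fromstring(ssml)  # raises ParseError if the markup is not well-formed
    return ssml


print(verification_ssml("125078"))
# <speak>Your verification code is<break time='300ms'/> 1 2 5<break time='250ms'/>0 7 8.</speak>
```

Parsing the result locally catches unbalanced tags before the request is sent, but it does not confirm that a given tag or attribute is supported by the engine.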
11. Why Output May Sound Unexpected
Common Causes
- Missing punctuation
- Excessive punctuation
- Unresolved placeholders
- Raw special characters
- Dense numeric content
- Long or poorly segmented sentences
- Raw URLs or technical strings
- Channel/output mismatch
- Malformed payloads caused by incorrect JSON escaping
Examples:
| Poor | Recommended |
|---|---|
| Hello "Mr.Doe", your order #4521 has been shipped!!! Track here: <https://example.com/order/4521> | Hello, Mr. Doe. Your order number 4521 has been shipped. You can track your order on our website. |
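Several of the common causes listed above can be detected before synthesis with a simple pre-flight check. The sketch below reports which categories a text triggers without modifying it. The check names, the patterns, and the function `preflight_issues` are assumptions made for illustration and do not cover every cause, for example channel mismatch or sentence length.

```python
import re

# Each entry maps an issue label to a pattern that suggests the issue is present.
CHECKS = {
    "repeated punctuation": r"([!?.,;:])\1",
    "unresolved placeholder": r"[\[{][^\]}]*[\]}]",
    "markup tag": r"<[^>]+>",
    "raw URL": r"https?://\S+",
    "symbol likely to be read literally": r"[#_*=]",
}


def preflight_issues(text: str) -> list[str]:
    """Return the labels of all checks that match somewhere in text."""
    return [name for name, pattern in CHECKS.items() if re.search(pattern, text)]


poor = 'Hello "Mr.Doe", your order #4521 has been shipped!!! Track here: <https://example.com/order/4521>'
for issue in preflight_issues(poor):
    print(issue)
```

For the poor example this reports repeated punctuation, a markup-like tag, a raw URL, and a literal symbol, while the recommended rewrite produces an empty list.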
12. Voice Selection, Rate, Volume, and Supported Controls
12.1. Voices
Different voices may vary in clarity, pacing, and handling of numbers, acronyms, or foreign-origin words.
For the full list of supported voices, refer to: Supported Languages and Voices
- Premium voices are LLM-based voices optimized for more natural-sounding and higher-quality neural speech output.
- However, their synthesis behavior may differ from standard voices in unexpected ways, especially in speech flow, prosody, and handling of symbols or mixed-language text.
Tip:
- Always test and validate premium voices with representative prompts in the target environment before production use.
12.2. Adjustable Parameters
Depending on the deployment and selected voice, the following parameters may be available:
- Rate - controls the speaking tempo of the voice.
- Volume - controls the base loudness level of the voice.
Recommended Approach
Start with default values, then adjust gradually and validate with representative prompts.
| Parameter | Recommended |
|---|---|
| Rate | Start from the default value, adjust in small steps (±0.05), and validate after each change until the voice sounds natural and intelligible |
| Volume | Keep default unless the playback scenario requires a change |
Optimal voice, rate, and volume values are subjective and depend on the listener, target language, and playback scenario.
12.3. Emotion, Tone, and Phoneme Tags
Explicit emotion or tone controls and phoneme tags are not supported.
Emotional delivery cannot be directly controlled via markup. Indirect influence is possible through:
- Sentence structure
- Punctuation
- Wording
- Strategic pause placement
For implementation-specific markup support, refer to: SSML Tag Support
13. Audio Output Format and Playback Compatibility Guide
- Match synthesis output settings to the target playback channel
- Avoid unnecessary transcoding where possible
- Use telephony-appropriate settings for IVR/call-center playback
- Validate synthesized audio in the actual downstream environment
- Be aware that playback issues may originate from channel limitations rather than the synthesis itself
14. Troubleshooting Guide
A word, abbreviation, or brand name is pronounced incorrectly
Possible causes: Undefined abbreviation, acronym ambiguity, foreign-origin word, mixed-language text, or unexpected default normalization.
Actions: Define a custom pronunciation if supported, rewrite the term in a spoken-friendly form, expand the abbreviation, or keep the sentence in one language/script where possible.
Numbers, dates, or times sound unexpected
Possible causes: Ambiguous format, dense numeric content, insufficient pauses, or a whole-number vs. digit-by-digit mismatch.
Actions: Rewrite the value in the intended spoken form, separate digits when needed, use unambiguous date/time formats, and add pauses where clarity is critical.
Output does not sound as intended in mixed-language content
Possible causes: Frequent language-switching, foreign-origin words, or inconsistent script usage within the same sentence.
Actions: Keep the sentence in one language where possible, use localized spellings/transliterations in language-specific deployments, or define recurring foreign terms through abbreviation.
Output sounds too fast or dense
Possible causes: Long sentences, dense numeric blocks, insufficient punctuation.
Actions: Split the text into shorter sentences, add punctuation, group content more clearly.
Output sounds robotic or overly segmented
Possible causes: Too much punctuation, excessive commas, too many forced pauses.
Actions: Simplify punctuation and remove unnecessary separators.
Symbols are spoken literally
Possible causes: Raw symbols were sent directly in the input.
Actions: Rewrite symbols in spoken form where needed.
Placeholders or markup are spoken
Possible causes: Template variables or tags were not resolved before synthesis.
Actions: Clean and normalize the text before sending it to TTS.
Request fails before audio is generated
Possible causes: Invalid JSON, especially unescaped " or \.
Actions: Validate and escape the payload correctly.
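The safest way to escape " and \ correctly is to let a JSON library serialize the request body rather than concatenating strings by hand. The sketch below uses Python's standard `json` module. The field name `text` and the helper `build_payload` are hypothetical and do not describe the actual SESTEK request schema.

```python
import json


def build_payload(text: str) -> str:
    """Serialize text into a JSON body; "text" is a hypothetical field name."""
    return json.dumps({"text": text}, ensure_ascii=False)


# The Python literal below contains single backslashes and plain double quotes.
raw = 'He said "Wait". The file is in C:\\newfolder\\test'
payload = build_payload(raw)
print(payload)
# {"text": "He said \"Wait\". The file is in C:\\newfolder\\test"}

# Round-trip check: decoding the payload returns the original text unchanged.
assert json.loads(payload)["text"] == raw
```

Without escaping, the `\n` and `\t` sequences inside `C:\newfolder\test` would be decoded as a newline and a tab, and the unescaped quotation marks would end the JSON string early.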
15. Summary
Recommended:
- Use clean, spoken-friendly text
- Apply punctuation intentionally
- Keep spacing consistent
- Rewrite symbols and compact forms when clarity matters
- Remove placeholders, markup, and technical artifacts
- Escape JSON payloads correctly
- Use unambiguous number, date, time, and currency formats
- Define abbreviations or custom pronunciations for recurring acronyms, names, and foreign-origin terms
- Keep one language/script per sentence where possible
- Match output settings to the target playback environment
- Use sentence structure to support natural prosody
- Validate voice, rate, and volume settings with representative prompts
Avoid:
- Repeated decorative punctuation
- Meaningless prosodic grouping caused by combined delimiter characters
- Raw URLs and code-like text in customer-facing prompts
- Leaving unresolved placeholders in synthesis input
- Assuming every symbol will be read naturally
- Diagnosing voice quality before validating input formatting
- Treating screen-friendly formatting as automatically speech-friendly
- Relying on ambiguous abbreviations without customization
- Mixing languages/scripts within the same sentence unnecessarily
- Expecting emotion control tags
- Passing raw numeric-heavy text without preparation
16. Final Note
Well-formed TTS output starts with well-prepared input. In many cases, improving punctuation, spacing, symbol usage, and request formatting is enough to significantly improve synthesis clarity, prosody, and stability without changing the voice itself.