This document provides practical guidance for using SESTEK Arabic Text-to-Speech effectively. It explains how Arabic text is interpreted by the TTS system and offers best practices for optimizing pronunciation, pacing, and speech clarity across different Arabic-language scenarios.
By following these guidelines, you can produce more natural-sounding output and better understand expected system behavior when diagnosing pronunciation issues.
1. Purpose and Scope
Objectives
- Enable high-quality Arabic TTS outputs
- Explain how the TTS system processes Arabic text
- Reduce ambiguity when diagnosing pronunciation issues
- Provide clear explanations for expected system behavior
Scope
- Arabic Text-to-Speech (TTS)
- Modern Standard Arabic (MSA) with dialect-sensitive behavior
- General use and troubleshooting
This document describes system behavior and recommended usage patterns, not how Arabic should be spoken linguistically.
2. Arabic TTS Processing Overview
SESTEK Arabic TTS follows this synthesis pipeline:
Input Text
↓
Text Normalization
↓
Phoneme Generation
↓
Waveform Synthesis
Most unexpected pronunciation outcomes originate from text ambiguity, not from the voice engine itself.
3. Key Arabic-Specific Challenges
Arabic introduces additional complexity due to:
- Ambiguity caused by absence of written diacritics (tashkeel)
- Multiple valid pronunciations for the same spelling
- Dialect vs. Modern Standard Arabic (MSA) differences
- Mixed usage of MSA and dialectal forms
- Rare or region-specific Arabic names
- Numeric and mixed-language expressions
- Complex number, date, and currency expressions
- Proper nouns, brand, province, street and building names without Arabic equivalents
- Undefined abbreviations
- Missing SSML pauses (no clear separation between phrases and/or numbers)
These factors may lead to pronunciation that feels unexpected without prior normalization or guidance.
4. Pronunciation and Language Customization
4.1. Abbreviations
Abbreviation management allows custom pronunciations for abbreviations. Default system behavior applies unless explicitly customized.
For advanced control, refer to: Customization for Synthesis Accuracy
Undefined abbreviations are a common source of unexpected pronunciation.
Tips: Define uncommon or rare abbreviations for precise pronunciations, or replace with synonyms if available.
Examples:
| Input | Recommended |
|---|---|
| "دكتور أحمد" | "د. أحمد" |
| "مهندس أحمد" | "م. أحمد" |
4.2. Names and Brand Names
Proper nouns and brand names without Arabic equivalents are a common source of pronunciation variation.
Tip 1: Prefer Arabic-script equivalents for proper nouns and brand names.
| Input | Recommended |
|---|---|
| "Mike" | "مايك" |
| "Samsung" | "سَامسوْنْغْ" or "سَامسوْنْقْ" |
| "Google" | "غوغل" or "جوجل" |
Tip 2: Define abbreviations for brand names without Arabic equivalents.
| Input | Recommended |
|---|---|
| "Bio" | "بَاْيُوْ" |
| "Peugeot" | "بّيْجوْ" |
Tip 3: Expand abbreviations letter-by-letter when needed in context.
| Input | Recommended |
|---|---|
| "ADX" | "إِي دِي إكْس" (letter-by-letter) |
| "ATM" | "إيه تي إم" (letter-by-letter) |
Tip 4: Expand abbreviations segment-by-segment for dialect-specific pronunciations.
| Input | Recommended |
|---|---|
| "Google" | "قُ وْقِلْ" (segment-by-segment) |
4.3. Normalization
Normalization converts written text into a spoken-friendly representation before synthesis. This helps improve pronunciation of numbers, dates, and other complex text elements.
Normalization behavior is automatic and not user-configurable. Spoken output depends on detected format and context.
4.3.1. Numbers
Numeric values are automatically converted to spoken forms. Context determines whether numbers are read digit-by-digit or as whole values.
Long numeric sequences can cause numbers to sound too fast or dense.
Tips: Insert SSML pauses, or rewrite numbers in words, for more natural synthesis. Write each number in the form that matches its intended reading (as a whole value or digit by digit).
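Rewriting a number in words for a digit-by-digit reading can be done before the text is sent for synthesis. The following is a minimal Python sketch of such client-side preprocessing; `spell_digits` and `DIGIT_WORDS` are hypothetical names, not part of any SESTEK API, and the digit names are the standard MSA forms.

```python
# Hypothetical client-side helper: spell a digit string out digit by digit.
# MSA names for the digits 0-9.
DIGIT_WORDS = {
    "0": "صفر", "1": "واحد", "2": "اثنان", "3": "ثلاثة", "4": "أربعة",
    "5": "خمسة", "6": "ستة", "7": "سبعة", "8": "ثمانية", "9": "تسعة",
}

# Accept Arabic-Indic digits (٠-٩) as well by mapping them to the same words.
ARABIC_INDIC_DIGITS = "٠١٢٣٤٥٦٧٨٩"
DIGIT_WORDS.update(
    {ARABIC_INDIC_DIGITS[i]: DIGIT_WORDS[str(i)] for i in range(10)}
)


def spell_digits(number_text: str) -> str:
    """Return the digits of number_text as space-separated Arabic words.

    Spaces and any other non-digit characters are skipped.
    """
    return " ".join(DIGIT_WORDS[ch] for ch in number_text if ch in DIGIT_WORDS)


# spell_digits("1 2 3 4") returns "واحد اثنان ثلاثة أربعة"
spoken = spell_digits("1 2 3 4")
```

This sketch only covers digit-by-digit readings. Whole-value readings such as "مائة وثلاثة وعشرون" need number-to-words logic that is out of scope here.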
| Input | Recommended |
|---|---|
| "123" (no pauses) | "مائة وثلاثة وعشرون" |
| "1 2 3 4" (with pauses) | "واحد اثنان ثلاثة أربعة" |
4.3.2. Dates
Numeric and alphanumeric date forms are automatically converted to spoken forms. The month is always spoken by name, even when the input is numeric.
Unsupported date formats will cause incorrect synthesis.
Tips: Write dates in supported formats for correct synthesis.
| Format | Input | Spoken as |
|---|---|---|
| Numeric | "١/٥/٢٠٢٥" | واحد مايو الفين وخمسة وعشرين |
| Numeric | "1/5/2025" | واحد مايو الفين وخمسة وعشرين |
| Written | "١ مايو ٢٠٢٥" | واحد مايو الفين وخمسة وعشرين |
| Written | "1 مايو 2025" | واحد مايو الفين وخمسة وعشرين |
| Written text | (spoken as-is) | واحد مايو الفين وخمسة وعشرين |
4.3.3. Times
Numeric and alphanumeric time forms are automatically converted to spoken forms. Spoken output depends on detected format and surrounding context.
Unsupported time formats will cause incorrect synthesis.
Tips: Write time in supported formats for correct synthesis.
| Input | Spoken as |
|---|---|
| "٩:٤٥" | تسعة وخمسة واربعين دقيقة |
| "9:45" | تسعة وخمسة واربعين دقيقة |
| Written text | تسعة وخمسة واربعين دقيقة (spoken as-is) |
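When dates and times come from application data, formatting them consistently helps keep the input within the forms shown in the date and time tables. The Python sketch below assumes the supported numeric forms are day/month/year and hour:minute without zero-padded day, month, or hour, because those are the only forms the tables show. Whether padded or other variants are accepted is not stated in this guide. The function names are hypothetical.

```python
from datetime import date, time


def numeric_date(d: date) -> str:
    """Format a date as day/month/year without zero padding, e.g. 1/5/2025."""
    return f"{d.day}/{d.month}/{d.year}"


def numeric_time(t: time) -> str:
    """Format a time as hour:minute, e.g. 9:45 (minutes keep two digits)."""
    return f"{t.hour}:{t.minute:02d}"


# 1 May 2025 at 09:45, matching the table examples.
date_text = numeric_date(date(2025, 5, 1))  # "1/5/2025"
time_text = numeric_time(time(9, 45))       # "9:45"
```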
4.3.4. Currencies
Currency names and symbols are normalized to spoken forms automatically.
Currency codes (e.g. USD, EUR) are not supported and must be defined as abbreviations.
Tips: Write currencies in supported formats for correct synthesis. Define abbreviations for currency codes.
| Input | Spoken as |
|---|---|
| "دولار" | دولار |
| "$" | دولار |
| "USD" (via abbreviation) | دولار |
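If defining abbreviations in the product is not practical, currency codes can also be replaced in the text before it is sent for synthesis. The following Python sketch shows that alternative. It is not the SESTEK abbreviation feature; `CURRENCY_NAMES` and `expand_currency_codes` are hypothetical names, and the map contents are illustrative (the USD entry mirrors the table above).

```python
import re

# Hypothetical client-side map from currency codes to Arabic currency names.
CURRENCY_NAMES = {"USD": "دولار", "EUR": "يورو"}

# Match only whole codes, so "USD" matches but "USDT" does not.
_CODE_RE = re.compile(
    r"\b(" + "|".join(re.escape(code) for code in CURRENCY_NAMES) + r")\b"
)


def expand_currency_codes(text: str) -> str:
    """Replace each known currency code in text with its Arabic name."""
    return _CODE_RE.sub(lambda m: CURRENCY_NAMES[m.group(1)], text)


# expand_currency_codes("السعر 100 USD") returns "السعر 100 دولار"
prepared = expand_currency_codes("السعر 100 USD")
```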
4.3.5. Addresses
There is no address-specific normalization. Numeric parts in addresses follow standard number normalization rules.
Province, street, and building names without Arabic equivalents may be interpreted unexpectedly. Long numeric sequences may sound dense without pauses.
Tips: Define abbreviations for street or location names. Add pauses for clarity.
| Input | Recommended |
|---|---|
| "شارع الملك فهد رقم 1234" | "شارع الملك فهد, رقم 1234" (with comma for clarity) |
4.3.6. Symbols
The default pronunciation of symbols is Fusha. Symbols may be interpreted differently depending on dialect and surrounding context.
Example: Input "%" → Default pronunciation: "بِالمِئَةْ"
Tips:
- Define abbreviations for symbols: add "%"→"بِالمِيِّةْ"
- Use diacritics for customized pronunciation: "١٠٠%"→"مِيِّةْ بِالمِيِّةْ"
4.3.7. Emails
Emails are automatically normalized to spoken forms, but may be mispronounced. Domain pronunciations are mostly correct.
Symbols, punctuation marks, and Latin characters in email addresses may be interpreted unexpectedly.
Examples:
| Input | Possible pronunciation |
|---|---|
| "info@sestek.com" | "إنفو آتِ سِستيكِ نُقْطَةْ كُوْمْ" |
| "help@outlook.com" | "هِلبْ آتْ آوتلُوك نُقْطَةْ كَمْ" |
Tip: Write non-Arabic handles and symbols in their exact desired spoken form, customized with diacritics based on dialect.
| Input | Recommended |
|---|---|
| "info@sestek.com" | "إنفو آتِ سِستيكِ نُقْطَةْ كُوْمْ" |
| "help@outlook.com" | "هِلْبْ آتْ آوْتْلُوْكْ دوت كُمْ" |
5. Diacritics (Tashkeel) and Dialect Sensitivity
Arabic text is typically written without diacritics, which can cause ambiguity in pronunciation.
5.1. Automatic Diacritics
SESTEK TTS applies automatic tashkeel by default. Diacritics are required for intelligible Arabic speech. Missing or incorrect tashkeel may result in valid but unintended pronunciations.
Example:
Input: "دخل أحمد البنك"
Possible pronunciations:
- "دَخَلَ أَحْمَدُ الْبَنْكَ"
- "دَخَلَ أَحْمَدُ الْبَنْكْ"
- "دَخَلْ أَحْمَدْ الْبَنْكْ"
Tip: Define diacritics explicitly to guide pronunciation as desired.
Desired input: "دَخَلَ أَحْمَدُ الْبَنْكَ"
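To decide where explicit diacritics are worth adding, it can help to find words that reach the TTS with no tashkeel at all. The Python sketch below is one possible client-side check. The function names and the watch list are hypothetical; the watch list is meant to hold undiacritized spellings known to be ambiguous, such as those listed in section 9.1. The character range covers the common tashkeel marks (U+064B to U+0652) plus the superscript alef (U+0670).

```python
import re

# Common Arabic tashkeel: fathatan..sukun (U+064B-U+0652) and superscript alef.
TASHKEEL_RE = re.compile("[\u064B-\u0652\u0670]")


def strip_tashkeel(text: str) -> str:
    """Remove tashkeel marks, leaving the bare letters."""
    return TASHKEEL_RE.sub("", text)


def undiacritized_words(text: str, watch_list: set[str]) -> list[str]:
    """Return words in text that match a watch-list spelling and carry no tashkeel.

    A word with the definite article (ال) is also checked without it.
    """
    flagged = []
    for word in text.split():
        bare = word[2:] if word.startswith("ال") else word
        if word in watch_list or bare in watch_list:
            flagged.append(word)
    return flagged


# The undiacritized sentence is flagged; a fully diacritized one is not,
# because a word carrying tashkeel no longer equals its bare spelling.
flagged = undiacritized_words("ضع الطعام في القدر", {"قدر", "علم"})
```

The check is purely orthographic: it does not strip other prefixes or suffixes and it does not judge whether the context already makes the reading clear.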
5.2. Dialect vs. MSA Expressions
Dialectal expectations may differ from MSA. The same concept expressed differently by dialect:
| Dialect | Expression for "now" |
|---|---|
| MSA | "الآن" |
| Egyptian | "دلوقتي" |
| Palestinian / Jordanian | "هسّا" |
| Najdi / Kuwaiti / Emirati / Saudi | "هالحين" / "دحين" / "الحين" |
| Kuwaiti | "الحزَّة" |
| Syrian | "هلأ" / "هلّق" |
Tip: Use dialect-relevant expressions or MSA consistently.
Example for Jassim Voice (ar-KW): "الحزَّة"
6. Pauses and Flow Control (SSML)
SSML tags allow customizing pronunciation, intonation, and emphasis. SESTEK supports the most commonly used SSML tags including <break>, <say-as>, <voice>, and audio insertion.
For full details, refer to: SSML Tag Support
Lack of pauses in long numeric sequences may cause incorrect synthesis.
Tips:
- Use pauses sparingly - avoid excessive breaks in short sentences
- Place pauses at semantic boundaries
- For long numeric sequences, group digits (3–3–4 or 2–2–2) and add short breaks
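The digit-grouping tip can be automated when numbers come from application data. The Python sketch below splits a digit string into groups and places a `<break>` before each group. `group_digits` and `ssml_with_breaks` are hypothetical helper names, and the 200 ms default simply mirrors the phone-number example in this section.

```python
from xml.sax.saxutils import escape


def group_digits(digits: str, sizes: tuple[int, ...]) -> list[str]:
    """Split a digit string into consecutive groups, e.g. sizes (3, 3, 4)."""
    if sum(sizes) != len(digits):
        raise ValueError("group sizes must add up to the number of digits")
    groups, pos = [], 0
    for size in sizes:
        groups.append(digits[pos:pos + size])
        pos += size
    return groups


def ssml_with_breaks(
    prefix: str, digits: str, sizes: tuple[int, ...], pause_ms: int = 200
) -> str:
    """Build <speak> markup with a short break before each digit group."""
    brk = f"<break time='{pause_ms}ms'/>"
    body = " ".join(f"{brk} {group}" for group in group_digits(digits, sizes))
    # Escape the free text so characters such as & or < cannot break the markup.
    return f"<speak>{escape(prefix)} {body}</speak>"


ssml = ssml_with_breaks("للتواصل اتصل على", "0501234567", (3, 3, 4))
```

With these arguments the function produces the same markup as the phone-number example in this section.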
Examples:
Without pauses:
"رقم طلبك هو 123456789"
With SSML pauses:
<speak>رقم طلبك هو <break time='300ms'/> 123 <break time='250ms'/> 456 <break time='250ms'/> 789</speak>
Without pauses:
"للتواصل اتصل على 0501234567"
With SSML pauses:
<speak>للتواصل اتصل على <break time='200ms'/> 050 <break time='200ms'/> 123 <break time='200ms'/> 4567</speak>
7. Speech Rate, Volume and Voice Selection
7.1. Adjustable Parameters
- Voice selection - different Arabic voices may vary in clarity. See Supported Languages and Voices for the full list.
- Rate - controls the speaking tempo of the voice
- Volume - controls the base loudness level of the voice
7.2. Recommended Values for Arabic Voices
Start with the values below, then adjust in small steps (±0.05) and validate until the voice sounds natural and intelligible.
| Parameter | Recommended value |
|---|---|
| Rate | 1.1 – 1.3 |
| Volume | 1.0 (default) |
Optimal rate and volume are subjective and depend on the listener and the scenario.
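For a systematic listening test, the candidate rate values can be enumerated across the recommended range in the suggested step size. This is a small Python sketch with a hypothetical function name; how each value is applied to a synthesis request depends on the integration and is not shown.

```python
def rate_candidates(
    start: float = 1.1, stop: float = 1.3, step: float = 0.05
) -> list[float]:
    """List rate values from start to stop (inclusive) in fixed steps."""
    count = round((stop - start) / step)
    # Round each value to two decimals to hide floating-point noise.
    return [round(start + i * step, 2) for i in range(count + 1)]


print(rate_candidates())  # [1.1, 1.15, 1.2, 1.25, 1.3]
```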
8. Emotion, Tone and Phoneme Tags
Explicit emotion or tone controls and phoneme tags are not supported.
Emotional delivery cannot be directly controlled via markup. Indirect influence is possible through:
- Sentence structure
- Punctuation
- Strategic pause placement
9. Why Output May Sound Unexpected
9.1. Common Ambiguous Arabic Words
Ambiguous words may receive a valid but unintended reading. Manual diacritics may still be overridden in ambiguous contexts.
| Input | Possible readings |
|---|---|
| "ملك" | "مَلِك" (king) / "مَلَك" (angel) / "مَلَكَ" (owned) |
| "علم" | "عِلْم" (knowledge) / "عَلَم" (flag) / "عَلَّمَ" (taught) |
| "قدر" | "قَدَر" (fate) / "قِدْر" (cooking pot) |
| "جمل" | "جَمَل" (camel) / "جُمَل" (sentences) |
| "سلم" | "سِلْم" (peace) / "سُلَّم" (stairs) / "سَلَّمَ" (handed over) |
| "كتب" | "كُتُب" (books) / "كَتَبَ" (he wrote) |
| "عين" | "عَيْن" (eye / spring) / "عَيَّنَ" (appointed) |
Tip: Disambiguate with tashkeel or rephrasing.
| Ambiguous | Clear |
|---|---|
| "ضع الطعام في القدر" | "ضَعِ الطَّعَامَ فِي القِدْرِ" (cooking pot) |
| "هذا قدر الإنسان" | "هَذَا قَدَرُ الإِنْسَانِ" (fate) |
| "رفع علم الدولة" | "رَفَعَ عَلَمَ الدَّوْلَةِ" (flag) |
9.2. Text Preparation and Cleaning
High-quality Arabic TTS output starts with clean and well-prepared input text.
Tips:
- Split long paragraphs into shorter sentences
- Avoid dense numeric blocks in a single sentence
- Ensure proper spacing between Arabic words
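The first two tips lend themselves to a simple automated pass. The Python sketch below splits a paragraph at sentence-ending punctuation and flags long digit runs. Both function names are hypothetical, and the six-digit threshold is an arbitrary assumption to tune. Missing spaces between Arabic words are not handled here, since fixing them requires word segmentation.

```python
import re


def split_sentences(paragraph: str) -> list[str]:
    """Collapse whitespace, then split after sentence-ending punctuation.

    Splits after . ! ? and the Arabic question mark when followed by a space,
    so decimals such as 3.5 are left intact.
    """
    text = re.sub(r"\s+", " ", paragraph).strip()
    parts = re.split(r"(?<=[.!?؟])\s+", text)
    return [part for part in parts if part]


def dense_numeric_runs(sentence: str, max_digits: int = 6) -> list[str]:
    """Return digit runs longer than max_digits; these may need grouping or pauses."""
    return [run for run in re.findall(r"\d+", sentence) if len(run) > max_digits]


sentences = split_sentences("مرحبا.  كيف حالك؟ شكرا")
long_runs = dense_numeric_runs("اتصل على 0501234567")
```

Because the splitter treats every period followed by a space as a sentence end, abbreviations such as "د. أحمد" would also be split and need separate handling.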
| Poor | Better |
|---|---|
| "يرجىالاتصال علرقم 0501234567 فيحال وجودأيستفسار" | "يرجى الاتصال على الرقم 0501234567 في حال وجود أي استفسار" |
9.3. Mixed-Language Text (Arabic + English)
Excessive language-switching within a single sentence may reduce naturalness.
Tips:
- Avoid keeping English words in Latin characters unless defined in Abbreviations
- Use Arabic-script equivalents (Arabic transliteration) to guide pronunciation
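Both tips can be supported by a preprocessing step that swaps known Latin-script words for Arabic-script equivalents and reports the ones it could not replace. The Python sketch below illustrates the idea. `TRANSLITERATIONS` and `replace_latin_words` are hypothetical names, and the map entries mirror examples from this guide.

```python
import re

# Hypothetical map from lower-cased Latin words to Arabic-script equivalents.
TRANSLITERATIONS = {"whatsapp": "واتساب", "mike": "مايك"}

LATIN_WORD_RE = re.compile(r"[A-Za-z]+")


def replace_latin_words(
    text: str, mapping: dict[str, str]
) -> tuple[str, list[str]]:
    """Replace known Latin-script words; also return the unknown ones left in place."""
    unknown: list[str] = []

    def _swap(match: re.Match) -> str:
        word = match.group(0)
        replacement = mapping.get(word.lower())
        if replacement is None:
            unknown.append(word)  # candidate for an abbreviation definition
            return word
        return replacement

    return LATIN_WORD_RE.sub(_swap, text), unknown


prepared, leftovers = replace_latin_words(
    "قم بتحديث تطبيق WhatsApp إلى أحدث إصدار الآن", TRANSLITERATIONS
)
```

Words returned in the second value remain in Latin characters and are candidates for an abbreviation definition or a manual transliteration.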
| Poor | Better |
|---|---|
| "قم بتحديث تطبيق WhatsApp إلى أحدث إصدار الآن" | "قم بتحديث تطبيق واتساب إلى أحدث إصدار الآن" |
10. Troubleshooting Guide
Word is pronounced incorrectly
Possible causes: Missing diacritics, ambiguous spelling, foreign-origin word.
Actions: Add tashkeel, split the word into segments or adjust its spelling, or use phonetic alternatives.
Numbers sound too fast or unclear
Possible causes: Long numeric sequences, missing or insufficient pauses.
Actions: Insert SSML pauses, rewrite numbers in words.
Output does not sound dialectal
Explanation: Default behavior prioritizes MSA. Dialectal pronunciation is not explicitly selectable - use dialect-specific vocabulary and expressions to guide the output.
11. Summary
Recommended:
- Use clear, unambiguous Arabic text in one consistent variety (MSA or a single dialect)
- Apply pauses where clarity is critical
- Keep parameters consistent across turns
- Define abbreviations for foreign words, symbols, and currency codes
- Use supported date and time formats
Avoid:
- Mixing dialects within a sentence
- Overusing diacritics
- Expecting emotion control tags
- Passing raw numeric-heavy text without preparation
