Arabic Voices Best Practices

This document provides practical guidance for using SESTEK Arabic Text-to-Speech effectively. It explains how Arabic text is interpreted by the TTS system and offers best practices for optimizing pronunciation, pacing, and speech clarity across different Arabic-language scenarios.

By following these guidelines, you can produce more natural-sounding output and better understand expected system behavior when diagnosing pronunciation issues.

1. Purpose and Scope

Objectives

Enable high-quality Arabic TTS outputs
Explain how the TTS system processes Arabic text
Reduce ambiguity when diagnosing pronunciation issues
Provide clear explanations for expected system behavior

Scope

Arabic Text-to-Speech (TTS)
Modern Standard Arabic (MSA) with dialect-sensitive behavior
General use and troubleshooting

This document describes system behavior and recommended usage patterns, not how Arabic should be spoken linguistically.

2. Arabic TTS Processing Overview

SESTEK Arabic TTS follows this synthesis pipeline:

Input Text
    ↓
Text Normalization
    ↓
Phoneme Generation
    ↓
Waveform Synthesis

Most unexpected pronunciation outcomes originate from text ambiguity, not from the voice engine itself.

3. Key Arabic-Specific Challenges

Arabic introduces additional complexity due to:

Ambiguity caused by absence of written diacritics (tashkeel)
Multiple valid pronunciations for the same spelling
Dialect vs. Modern Standard Arabic (MSA) differences
Mixed usage of MSA and dialectal forms
Rare or region-specific Arabic names
Numeric and mixed-language expressions
Complex number, date, and currency expressions
Proper nouns and brand, province, street, and building names without Arabic equivalents
Undefined abbreviations
Missing SSML pauses (no clear separation between phrases and/or numbers)

These factors may lead to pronunciation that feels unexpected without prior normalization or guidance.

4. Pronunciation and Language Customization

4.1. Abbreviations

Abbreviation management allows custom pronunciations for abbreviations. Default system behavior applies unless explicitly customized.

For advanced control, refer to: Customization for Synthesis Accuracy

Undefined abbreviations are a common source of unexpected pronunciation.

Tip: Define uncommon or rare abbreviations so that they are pronounced as intended, or replace them with a synonym or full form where one exists.

Examples:

Input	Recommended
`"د. أحمد"`	`"دكتور أحمد"`
`"م. أحمد"`	`"مهندس أحمد"`
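
As a rough illustration of replacing abbreviations with their full forms before text reaches the TTS engine, the sketch below expands a small client-side table. The table, the function name, and the matching rule are assumptions made for this example; they are not part of the SESTEK abbreviation feature, which remains the recommended place to define pronunciations.

```python
import re

# Hypothetical client-side table; entries mirror the examples in this section.
ABBREVIATIONS = {
    "د.": "دكتور",
    "م.": "مهندس",
}

def expand_abbreviations(text: str, table: dict = ABBREVIATIONS) -> str:
    """Replace each known abbreviation with its full spoken form."""
    # Longest keys first so a longer abbreviation wins over a shorter prefix.
    keys = sorted(table, key=len, reverse=True)
    # Only match at the start of the text or right after whitespace,
    # so the end of an ordinary word followed by a period is left alone.
    pattern = re.compile(r"(?<!\S)(" + "|".join(re.escape(k) for k in keys) + r")")
    return pattern.sub(lambda m: table[m.group(1)], text)

print(expand_abbreviations("د. أحمد"))
```

The whitespace check is deliberately conservative: it may miss abbreviations attached to an opening bracket or quote, which a production table would need to handle.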

4.2. Names and Brand Names

Proper nouns and brand names without Arabic equivalents are a common source of pronunciation variation.

Tip 1: Prefer Arabic-script equivalents for proper nouns and brand names.

Input	Recommended
`"Mike"`	`"مايك"`
`"Samsung"`	`"سَامسوْنْغْ"` or `"سَامسوْنْقْ"`
`"Google"`	`"غوغل"` or `"جوجل"`

Tip 2: Define abbreviations for brand names without Arabic equivalents.

Input	Recommended
`"Bio"`	`"بَاْيُوْ"`
`"Peugeot"`	`"بّيْجوْ"`

Tip 3: Expand abbreviations letter-by-letter when needed in context.

Input	Recommended
`"ADX"`	`"إِي دِي إكْس"` (letter-by-letter)
`"ATM"`	`"إيه تي إم"` (letter-by-letter)

Tip 4: Spell words segment-by-segment to obtain dialect-specific pronunciations.

Input	Recommended
`"Google"`	`"قُ وْقِلْ"` (segment-by-segment)
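
The tips above can be applied in a pre-processing step. The following sketch assumes a hypothetical lookup table (`ARABIC_FORMS`) maintained by the integrator; it swaps known Latin-script words for their Arabic-script forms and reports the ones it does not know, so that they can be reviewed or defined as abbreviations.

```python
import re

# Hypothetical lookup table; the Arabic spellings come from the examples above.
ARABIC_FORMS = {
    "mike": "مايك",
    "google": "جوجل",
    "atm": "إيه تي إم",
}

LATIN_WORD = re.compile(r"[A-Za-z]+")

def arabize_names(text: str, table: dict = ARABIC_FORMS):
    """Replace known Latin-script words; return the new text and the unknown words."""
    unknown = []

    def replace(match):
        word = match.group(0)
        arabic = table.get(word.lower())
        if arabic is None:
            unknown.append(word)  # left unchanged so a human can review it
            return word
        return arabic

    return LATIN_WORD.sub(replace, text), unknown

text, unknown = arabize_names("Mike Samsung")
print(text, unknown)
```

Returning the unknown words separately makes it easy to log gaps in the table instead of silently sending Latin text to the engine.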

4.3. Normalization

Normalization converts written text into a spoken-friendly representation before synthesis. This helps improve pronunciation of numbers, dates, and other complex text elements.

Normalization behavior is automatic and not user-configurable. Spoken output depends on detected format and context.

4.3.1. Numbers

Numeric values are automatically converted to spoken forms. Context determines whether numbers are read digit-by-digit or as whole values.

Long numeric sequences can cause numbers to sound too fast or dense.

Tips: Insert SSML pauses or rewrite numbers in words for more natural synthesis. Write each number exactly as it should be read: as a single value (for example, 123) or as separate digits (for example, 1 2 3 4).

Input	Recommended
`"123"` (no pauses)	`"مائة وثلاثة وعشرون"`
`"1 2 3 4"` (with pauses)	`"واحد اثنان ثلاثة أربعة"`
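
When a sequence must be read digit by digit (for example a code or PIN), one option is to spell the digits in words before synthesis. The sketch below is a minimal helper under that assumption; it uses MSA digit names, and the function name is illustrative only. Dialect-specific digit names would need a different table.

```python
# MSA names for the digits 0-9.
DIGIT_WORDS = ["صفر", "واحد", "اثنان", "ثلاثة", "أربعة",
               "خمسة", "ستة", "سبعة", "ثمانية", "تسعة"]

def spell_digits(number: str) -> str:
    """Spell a digit string one digit at a time, ignoring spaces."""
    words = []
    for ch in number:
        if ch.isdecimal():
            # int() accepts Western (0-9) and Arabic-Indic (٠-٩) digits alike.
            words.append(DIGIT_WORDS[int(ch)])
        elif not ch.isspace():
            raise ValueError(f"unexpected character: {ch!r}")
    return " ".join(words)

print(spell_digits("1 2 3 4"))
```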

4.3.2. Dates

Numeric and alphanumeric date forms are automatically converted to spoken forms. The month is always spoken by name, even when the input is in numeric format.

Unsupported date formats will cause incorrect synthesis.

Tips: Write dates in supported formats for correct synthesis.

Format	Input	Spoken as
Numeric	`"١/٥/٢٠٢٥"`	واحد مايو الفين وخمسة وعشرين
Numeric	`"1/5/2025"`	واحد مايو الفين وخمسة وعشرين
Written	`"١ مايو ٢٠٢٥"`	واحد مايو الفين وخمسة وعشرين
Written	`"1 مايو 2025"`	واحد مايو الفين وخمسة وعشرين
Written text	`"واحد مايو الفين وخمسة وعشرين"`	واحد مايو الفين وخمسة وعشرين (spoken as-is)
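
To keep application-generated dates within the numeric format shown in the table, a small formatter can be used. This is a sketch under the assumption that day/month/year without leading zeros is the safest form, since that is the only numeric form the table documents; both helper names are hypothetical.

```python
from datetime import date

# Map Western digits to Arabic-Indic digits (U+0660 through U+0669).
TO_ARABIC_INDIC = str.maketrans(
    "0123456789", "".join(chr(0x0660 + i) for i in range(10))
)

def to_numeric_date(d: date) -> str:
    """Format a date as day/month/year with no leading zeros, e.g. 1/5/2025."""
    return f"{d.day}/{d.month}/{d.year}"

def to_arabic_indic(text: str) -> str:
    """Convert Western digits in the text to Arabic-Indic digits."""
    return text.translate(TO_ARABIC_INDIC)

numeric = to_numeric_date(date(2025, 5, 1))
print(numeric, to_arabic_indic(numeric))
```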

4.3.3. Times

Numeric and alphanumeric time forms are automatically converted to spoken forms. Spoken output depends on detected format and surrounding context.

Unsupported time formats will cause incorrect synthesis.

Tips: Write time in supported formats for correct synthesis.

Input	Spoken as
`"٩:٤٥"`	تسعة وخمسة واربعين دقيقة
`"9:45"`	تسعة وخمسة واربعين دقيقة
Written text	تسعة وخمسة واربعين دقيقة (spoken as-is)

4.3.4. Currencies

Currency names and symbols are normalized to spoken forms automatically.

Currency codes (e.g. USD, EUR) are not supported and must be defined as abbreviations.

Tips: Write currencies in supported formats for correct synthesis. Define abbreviations for currency codes.

Input	Spoken as
`"دولار"`	دولار
`"$"`	دولار
`"USD"` (via abbreviation)	دولار

4.3.5. Addresses

There is no address-specific normalization. Numeric parts in addresses follow standard number normalization rules.

Province, street, and building names without Arabic equivalents may be interpreted unexpectedly. Long numeric sequences may sound dense without pauses.

Tips: Define abbreviations for street or location names. Add pauses for clarity.

Input	Recommended
`"شارع الملك فهد رقم 1234"`	`"شارع الملك فهد, رقم 1234"` (with comma for clarity)

4.3.6. Symbols

By default, symbols are pronounced in Fusha (MSA). Symbols may be interpreted differently depending on dialect and surrounding context.

Example: Input "%" → Default pronunciation: "بِالمِئَةْ"

Tips:

Define abbreviations for symbols: add "%" → "بِالمِيِّةْ"
Use diacritics for customized pronunciation: "١٠٠%" → "مِيِّةْ بِالمِيِّةْ"

4.3.7. Emails

Emails are automatically normalized to spoken forms, but may be mispronounced. Domain pronunciations are mostly correct.

Symbols, punctuation marks, and Latin characters in email addresses may be interpreted unexpectedly.

Examples:

Input	Possible pronunciation
`"info@sestek.com"`	`"إنفو آتِ سِستيكِ نُقْطَةْ كُوْمْ"`
`"help@outlook.com"`	`"هِلبْ آتْ آوتلُوك نُقْطَةْ كَمْ"`

Tip: Write non-Arabic handles and symbols in their exact desired spoken form, customized with diacritics based on dialect.

Input	Recommended
`"info@sestek.com"`	`"إنفو آتِ سِستيكِ نُقْطَةْ كُوْمْ"`
`"help@outlook.com"`	`"هِلْبْ آتْ آوْتْلُوْكْ دوت كُمْ"`
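
The rewrite shown in the table can be automated when the set of addresses is known in advance. The sketch below assumes the caller supplies the desired Arabic-script form for every Latin segment; the separator spellings follow the second example above, and the function name is hypothetical.

```python
import re

# Spoken forms for the separators, following the recommended example above.
SEPARATORS = {"@": "آتْ", ".": "دوت"}

def spell_email(address: str, spoken_parts: dict) -> str:
    """Build a spoken-form string for an email address.

    spoken_parts maps each lowercase Latin segment to the Arabic-script
    form the caller wants to hear; an unknown segment raises KeyError.
    """
    # Split on "@" and "." while keeping the separators as tokens.
    tokens = [t for t in re.split(r"([@.])", address.lower()) if t]
    words = [SEPARATORS[t] if t in SEPARATORS else spoken_parts[t] for t in tokens]
    return " ".join(words)

parts = {"help": "هِلْبْ", "outlook": "آوْتْلُوْكْ", "com": "كُمْ"}
print(spell_email("help@outlook.com", parts))
```

Raising on unknown segments is a deliberate choice here: it surfaces handles that nobody has reviewed rather than letting the engine guess.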

5. Diacritics (Tashkeel) and Dialect Sensitivity

Arabic text is typically written without diacritics, which can cause ambiguity in pronunciation.

5.1. Automatic Diacritics

SESTEK TTS applies automatic tashkeel by default. Diacritics are required for intelligible Arabic speech. Missing or incorrect tashkeel may result in valid but unintended pronunciations.

Example:

Input: "دخل أحمد البنك"

Possible pronunciations:

"دَخَلَ أَحْمَدُ الْبَنْكَ"
"دَخَلَ أَحْمَدُ الْبَنْكْ"
"دَخَلْ أَحْمَدْ الْبَنْكْ"

Tip: Define diacritics explicitly to guide pronunciation as desired.

Desired input: "دَخَلَ أَحْمَدُ الْبَنْكَ"
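
When diagnosing a pronunciation issue, it can help to see which words in the input carry no diacritics and are therefore left to automatic tashkeel. The following sketch checks for the common tashkeel code points (U+064B through U+0652, plus U+0670); the helper names are illustrative, and the character range is an assumption about what counts as tashkeel for this purpose.

```python
import re

# Fathatan (U+064B) through sukun (U+0652), plus superscript alef (U+0670).
TASHKEEL = re.compile("[\u064B-\u0652\u0670]")
ARABIC_LETTER = re.compile("[\u0621-\u064A]")

def strip_tashkeel(text: str) -> str:
    """Remove all tashkeel marks, leaving the bare letters."""
    return TASHKEEL.sub("", text)

def undiacritized_words(text: str) -> list:
    """Return the words that contain Arabic letters but no tashkeel."""
    return [w for w in text.split()
            if ARABIC_LETTER.search(w) and not TASHKEEL.search(w)]

print(undiacritized_words("دَخَلَ أحمد البنك"))
```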

5.2. Dialect vs. MSA Expressions

Dialectal expectations may differ from MSA. The same concept is often expressed differently across dialects:

Dialect	Expression for "now"
MSA	`"الآن"`
Egyptian	`"دلوقتي"`
Palestinian / Jordanian	`"هسّا"`
Najdi / Kuwaiti / Emirati / Saudi	`"هالحين"` / `"دحين"` / `"الحين"`
Kuwaiti	`"الحزَّة"`
Syrian	`"هلأ"` / `"هلّق"`

Tip: Use dialect-relevant expressions or MSA consistently.

Example for Jassim Voice (ar-KW): "الحزَّة"

6. Pauses and Flow Control (SSML)

SSML tags allow customizing pronunciation, intonation, and emphasis. SESTEK supports the most commonly used SSML tags including <break>, <say-as>, <voice>, and audio insertion.

For full details, refer to: SSML Tag Support

Lack of pauses in long numeric sequences may cause incorrect synthesis.

Tips:

Use pauses sparingly - avoid excessive breaks in short sentences
Place pauses at semantic boundaries
For long numeric sequences, group digits (3–3–4 or 2–2–2) and add short breaks

Examples:

Without pauses:

"رقم طلبك هو 123456789"

With SSML pauses:

<speak>رقم طلبك هو <break time='300ms'/> 123 <break time='250ms'/> 456 <break time='250ms'/> 789</speak>

Without pauses:

"للتواصل اتصل على 0501234567"

With SSML pauses:

<speak>للتواصل اتصل على <break time='200ms'/> 050 <break time='200ms'/> 123 <break time='200ms'/> 4567</speak>
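
The grouping in the examples above can be generated programmatically. This is a minimal sketch: the function name, default group sizes, and default pause length are assumptions chosen to reproduce the phone number example, and only the `<break>` tag shown in this section is used.

```python
def digits_to_ssml(digits: str, groups=(3, 3, 4), pause_ms: int = 200) -> str:
    """Split a digit string into groups and join them with SSML breaks."""
    if sum(groups) != len(digits):
        raise ValueError("group sizes must add up to the number of digits")
    chunks, start = [], 0
    for size in groups:
        chunks.append(digits[start:start + size])
        start += size
    pause = f" <break time='{pause_ms}ms'/> "
    return pause.join(chunks)

ssml = ("<speak>للتواصل اتصل على <break time='200ms'/> "
        + digits_to_ssml("0501234567") + "</speak>")
print(ssml)
```

Passing `groups=(3, 3, 3)` and `pause_ms=250` yields the grouping used in the order number example.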

7. Speech Rate, Volume and Voice Selection

7.1. Adjustable Parameters

Voice selection - different Arabic voices may vary in clarity. See Supported Languages and Voices for the full list.
Rate - controls the speaking tempo of the voice
Volume - controls the base loudness level of the voice

7.2. Recommended Values for Arabic Voices

Start with the values below, then adjust in small steps (±0.05) and validate until the voice sounds natural and intelligible.

Parameter	Recommended value
Rate	1.1 – 1.3
Volume	1.0 (default)

Optimal rate and volume depend on the listener and the scenario.

8. Emotion, Tone and Phoneme Tags

Explicit emotion or tone controls and phoneme tags are not supported.

Emotional delivery cannot be directly controlled via markup. Indirect influence is possible through:

Sentence structure
Punctuation
Strategic pause placement

9. Why Output May Sound Unexpected

9.1. Common Ambiguous Arabic Words

Ambiguous words may receive a valid but unintended reading. Manual diacritics may still be overridden in ambiguous contexts.

Input	Possible readings
`"ملك"`	`"مَلِك"` (king) / `"مَلَك"` (angel) / `"مَلَكَ"` (owned)
`"علم"`	`"عِلْم"` (knowledge) / `"عَلَم"` (flag) / `"عَلَّمَ"` (taught)
`"قدر"`	`"قَدَر"` (fate) / `"قِدْر"` (cooking pot)
`"جمل"`	`"جَمَل"` (camel) / `"جُمَل"` (sentences)
`"سلم"`	`"سِلْم"` (peace) / `"سُلَّم"` (stairs) / `"سَلَّمَ"` (handed over)
`"كتب"`	`"كُتُب"` (books) / `"كَتَبَ"` (he wrote)
`"عين"`	`"عَيْن"` (eye / spring) / `"عَيَّنَ"` (appointed)

Tip: Disambiguate with tashkeel or rephrasing.

Ambiguous	Clear
`"ضع الطعام في القدر"`	`"ضَعِ الطَّعَامَ فِي القِدْرِ"` (cooking pot)
`"هذا قدر الإنسان"`	`"هَذَا قَدَرُ الإِنْسَانِ"` (fate)
`"رفع علم الدولة"`	`"رَفَعَ عَلَمَ الدَّوْلَةِ"` (flag)
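
A simple review aid is to flag words that match a known ambiguous spelling and carry no tashkeel. The sketch below uses only the spellings listed in the table above; the function name and the definite-article heuristic are assumptions, and words with attached punctuation or other prefixes are not handled.

```python
import re

TASHKEEL = re.compile("[\u064B-\u0652\u0670]")

# Undiacritized spellings listed in the table above.
AMBIGUOUS = {"ملك", "علم", "قدر", "جمل", "سلم", "كتب", "عين"}

def flag_ambiguous(text: str) -> list:
    """Return words written without tashkeel that match a known ambiguous spelling."""
    flagged = []
    for word in text.split():
        if TASHKEEL.search(word):
            continue  # the author already disambiguated this word
        # Naive heuristic: drop a leading definite article before the lookup.
        stem = word[2:] if word.startswith("ال") else word
        if stem in AMBIGUOUS:
            flagged.append(word)
    return flagged

print(flag_ambiguous("ضع الطعام في القدر"))
```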

9.2. Text Preparation and Cleaning

High-quality Arabic TTS output starts with clean and well-prepared input text.

Tips:

Split long paragraphs into shorter sentences
Avoid dense numeric blocks in a single sentence
Ensure proper spacing between Arabic words

Poor	Better
`"يرجىالاتصال علرقم 0501234567 فيحال وجودأيستفسار"`	`"يرجى الاتصال على الرقم 0501234567 في حال وجود أي استفسار"`
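
Splitting long paragraphs can be done before synthesis with a few lines of pre-processing. The sketch below collapses repeated whitespace and splits after sentence-ending punctuation, including the Arabic question mark; the function name and the choice of delimiters are assumptions, and it cannot repair missing spaces between words such as those in the example above.

```python
import re

# Split at whitespace that follows ".", "!", "?" or the Arabic question mark (U+061F).
SENTENCE_END = re.compile("(?<=[.!?\u061F])\\s+")

def split_sentences(paragraph: str) -> list:
    """Collapse repeated whitespace, then split after sentence-ending punctuation."""
    cleaned = re.sub(r"\s+", " ", paragraph).strip()
    return [s for s in SENTENCE_END.split(cleaned) if s]

for sentence in split_sentences("مرحبا بك.  كيف يمكنني مساعدتك؟\nشكرا لاتصالك."):
    print(sentence)
```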

9.3. Mixed-Language Text (Arabic + English)

Excessive language-switching within a single sentence may reduce naturalness.

Tips:

Avoid keeping English words in Latin characters unless defined in Abbreviations
Use Arabic-script equivalents (Arabic transliteration) to guide pronunciation

Poor	Better
`"قم بتحديث تطبيق WhatsApp إلى أحدث إصدار الآن"`	`"قم بتحديث تطبيق واتساب إلى أحدث إصدار الآن"`
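
To spot sentences with heavy language-switching before synthesis, one rough measure is the number of times consecutive words change script. This sketch is illustrative only: the function names are hypothetical, and any threshold for "excessive" switching is left to the integrator.

```python
import re
from typing import Optional

ARABIC = re.compile("[\u0621-\u064A]")
LATIN = re.compile("[A-Za-z]")

def script_of(token: str) -> Optional[str]:
    """Classify a token as Arabic or Latin script; None for digits and punctuation."""
    if ARABIC.search(token):
        return "arabic"
    if LATIN.search(token):
        return "latin"
    return None

def count_script_switches(sentence: str) -> int:
    """Count how often consecutive words change between Arabic and Latin script."""
    scripts = [s for s in map(script_of, sentence.split()) if s is not None]
    return sum(1 for prev, cur in zip(scripts, scripts[1:]) if prev != cur)

print(count_script_switches("قم بتحديث تطبيق WhatsApp إلى أحدث إصدار الآن"))
```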

10. Troubleshooting Guide

Word is pronounced incorrectly

Possible causes: Missing diacritics, ambiguous spelling, foreign-origin word.

Actions: Add tashkeel, split the word into segments to adjust its spelling, or use a phonetic respelling.

Numbers sound too fast or unclear

Possible causes: Long numeric sequences, missing or insufficient pauses.

Actions: Insert SSML pauses, rewrite numbers in words.

Output does not sound dialectal

Explanation: Default behavior prioritizes MSA. Dialectal pronunciation is not explicitly selectable - use dialect-specific vocabulary and expressions to guide the output.

11. Summary

Recommended:

Use clear, unambiguous Arabic text in one consistent variety (MSA or a single dialect)
Apply pauses where clarity is critical
Keep parameters consistent across turns
Define abbreviations for foreign words, symbols, and currency codes
Use supported date and time formats

Avoid:

Mixing dialects within a sentence
Overusing diacritics
Expecting emotion control tags
Passing raw numeric-heavy text without preparation
