Arabic Voices Best Practices

Best Practices

This document provides practical guidance for using Sestek Arabic Text-to-Speech (TTS) effectively.

It explains how Arabic text is interpreted by the TTS system and offers best practices for optimizing pronunciation, pacing, and overall speech clarity across different Arabic-language scenarios.

By following these guidelines, you can create more natural-sounding speech outputs and better understand how to optimize input text to achieve your desired results.


1. Purpose and Scope

Objectives

  • Enable high-quality Arabic TTS outputs
  • Explain how the TTS system processes Arabic text
  • Reduce ambiguity when diagnosing pronunciation issues
  • Provide clear explanations for expected system behavior

Scope

  • Arabic Text-to-Speech (TTS)
  • Modern Standard Arabic (MSA) with dialect-sensitive behavior
  • General use and troubleshooting
Important

This document describes system behavior and recommended usage patterns, not how Arabic should be spoken linguistically.


2. Arabic TTS Processing Overview

Sestek Arabic TTS follows this synthesis pipeline:

Input Text → Text Normalization → Phoneme Generation → Waveform Synthesis

Important:

Most unexpected pronunciation outcomes originate from text ambiguity, rather than from the voice engine itself.

3. Key Arabic-Specific Challenges

Arabic introduces additional complexity due to:

  • Ambiguity caused by absence of written diacritics (tashkeel)
  • Multiple valid pronunciations for the same spelling
  • Dialect vs. Modern Standard Arabic (MSA) differences
  • Mixed usage of MSA and dialectal forms
  • Rare or region-specific Arabic names
  • Numeric and mixed-language expressions
  • Complex number, date, and currency expressions
  • Proper nouns and brand, province, street, and building names without Arabic equivalents
  • Undefined abbreviations
  • Missing SSML pauses (no clear separation between phrases and/or numbers)

These factors may lead to pronunciation that feels unexpected without prior normalization or guidance.

4. Pronunciation and Language Customization

4.1. Abbreviations

Abbreviation management allows custom pronunciations for abbreviations.
Default system behavior applies unless explicitly customized.

Abbreviation Customization

For advanced control, refer to: Customization for Synthesis Accuracy

  • Undefined abbreviations are a common source of unexpected pronunciation.
    Tips:
  • Define uncommon or rare abbreviations for precise pronunciations or replace with synonyms if available.
    Examples

    "د. أحمد"     →     "دكتور أحمد"
    "م. أحمد"     →     "مهندس أحمد"

4.2 Names and Brand Names

  • Proper nouns and brand names without Arabic equivalents are a common source of pronunciation variation.
    Tips:

  • Prefer Arabic-script equivalents for proper nouns and brand names.
    Examples

    "Mike"           →       "مايك"
    "Samsung"     →       "سَامسوْنْغْ" or "سَامسوْنْقْ"
    "Google"       →       "غوغل" or "جوجل"

  • Define abbreviations of brand names without Arabic equivalents.
    Examples

    "Bio"             →     "بَاْيُوْ"
    "Peugeot"     →       "بّيْجوْ"

  • Spell abbreviations out letter by letter where the context calls for it.
    Examples

    "ADX"             →     "إِي دِي إكْس" - (letter-by-letter)
    "ATM"             →     "إيه تي إم" - (letter-by-letter)

  • Spell a name out segment by segment to force one of several valid pronunciations, for example a dialect-specific one.
    Examples

    "Google"       →       "قُ وْقِلْ" - (segment-by-segment)



4.3 Normalization

Normalization converts written text into a spoken-friendly representation before synthesis to help improve pronunciation of numbers, dates, and other complex text elements.

Normalization behavior is automatic and not user-configurable.
  • Spoken output depends on detected format and context

4.3.1. Numbers

Numeric values are automatically converted to spoken forms
  • Context determines whether numbers are read digit-by-digit or as whole values
  • Long numeric sequences may sound too fast or dense.

Tips:

  • Insert SSML pauses and rewrite numbers in words for natural synthesis of numbers.
  • Write each number in exactly the form you want spoken: as a single value to have it read as a whole number, or as space-separated digits to have it read digit by digit.
    Examples

    "123"           (no pauses)               →       "مائة وثلاثة وعشرون"
    "1 2 3 4"   (with pauses)            →       "واحد اثنان ثلاثة أربعة"

4.3.2. Dates

Numeric and alphanumeric date forms are automatically converted to spoken forms
  • Month names are always spoken, even when the input is in numeric format.
  • Spoken output depends on detected format and surrounding context.
  • Input in an unsupported date format causes incorrect synthesis of date values.

Tips:

  • Write dates in the supported formats for correct synthesis of date values.
    Examples

    Numeric date forms:            "١/٥/٢٠٢٥"          →     "واحد مايو الفين وخمسة وعشرين"
    Numeric date forms:            "1/5/2025"          →     "واحد مايو الفين وخمسة وعشرين"
    Alphanumeric date forms:   "١ مايو ٢٠٢٥"     →     "واحد مايو الفين وخمسة وعشرين"
    Alphanumeric date forms:   "1 مايو 2025"     →     "واحد مايو الفين وخمسة وعشرين"
    Written date forms:             "واحد مايو الفين وخمسة وعشرين"     →     spoken as-is
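
A date held in code can be rendered directly in one of the alphanumeric forms shown above, which avoids any ambiguity about day and month order. The sketch below is illustrative only: `MONTHS` and `format_date` are hypothetical names, and the month table assumes the transliterated Gregorian month names used in this guide's example (مايو). Regions that use other month names would need a different table.

```python
from datetime import date

# Assumed month names, matching the "مايو" style used in the examples above.
# Other regional conventions (e.g. Levantine month names) need their own table.
MONTHS = ["يناير", "فبراير", "مارس", "أبريل", "مايو", "يونيو",
          "يوليو", "أغسطس", "سبتمبر", "أكتوبر", "نوفمبر", "ديسمبر"]

def format_date(d: date) -> str:
    """Render a date as "<day> <month name> <year>" for use as TTS input."""
    return f"{d.day} {MONTHS[d.month - 1]} {d.year}"
```

For example, `format_date(date(2025, 5, 1))` yields the string "1 مايو 2025".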

4.3.3. Times

Numeric and alphanumeric time forms are automatically converted to spoken forms

  • Spoken output depends on detected format and surrounding context.
  • Input in an unsupported time format causes incorrect synthesis of time values.

Tips:

  • Write time in the supported formats for correct synthesis.
    Examples

    Numeric time forms:             "٩:٤٥"     →     "تسعة وخمسة واربعين دقيقة"
    Numeric time forms:             "9:45"     →     "تسعة وخمسة واربعين دقيقة"
    Written time forms:             "تسعة وخمسة واربعين دقيقة"     →     spoken as-is

4.3.4. Currencies

Currency names and symbols are normalized to spoken forms automatically

⚠️ Currency codes are not supported.

  • Input in an unsupported currency format causes incorrect synthesis of currency values.

Tips:

  • Write currencies in the supported formats for correct synthesis.
    Examples

    Currency names: "دولار"      →     "دولار"
    Currency symbols: "$"         →      "دولار"

  • Define abbreviations of currency codes.
    Examples

    Currency codes: "USD"      →      "دولار"
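
Where defining abbreviations is not practical, currency codes can also be replaced with their names before the text is sent for synthesis. This is a minimal client-side sketch, not a Sestek feature: `CURRENCY_NAMES` and `expand_currency_codes` are hypothetical, and only the USD entry comes from this guide.

```python
import re

# Hypothetical code-to-name table; only the USD entry is taken from this guide.
CURRENCY_NAMES = {
    "USD": "دولار",
    "EUR": "يورو",
}

def expand_currency_codes(text: str, names: dict[str, str] = CURRENCY_NAMES) -> str:
    """Replace known three-letter currency codes with their Arabic names."""
    def replace(match: re.Match) -> str:
        code = match.group(0)
        return names.get(code, code)  # unknown codes are left unchanged
    # Only standalone runs of exactly three capital Latin letters are considered.
    return re.sub(r"\b[A-Z]{3}\b", replace, text)
```

Codes missing from the table pass through untouched, so they can be spotted and added later.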

4.3.5. Addresses

There is no address-specific normalization

⚠️ The numeric parts in addresses conform to normalization rules of Numbers.

  • Province, street and building names without Arabic equivalents may be interpreted differently than expected.
  • Long numeric sequences may sound dense without pauses.

Tips:

  • Define abbreviations for street or location names.
  • Consider adding pauses for clarity.
    Examples

    "شارع الملك فهد رقم 1234"      →      "شارع الملك فهد رقم ألف وميتين وأربعة وثلاثين"
    "شارع الملك فهد, رقم 1234"    →       (same reading, with more clarity)

4.3.6. Symbols

Limited symbol normalization

⚠️ Symbols are pronounced in Fusha (MSA) by default.

  • Symbols may be interpreted differently than expected, depending on dialect and surrounding context.
    Example

    Input: "%"
    Default pronunciation: "بِالمِئَةْ"

Tips:

  • Define abbreviations for symbols.

    Add to abbreviations:      "%"     →     "بِالمِيِّةْ"

  • Customize the pronunciation with diacritics.

    Alphanumeric symbol forms:      "١٠٠%"     →     "مِيِّةْ بِالمِيِّةْ"

4.3.7. Emails

  • Emails are automatically normalized to spoken forms, but may be mispronounced.
  • Domain pronunciations are mostly correct.
  • Symbols, punctuation marks, and Latin characters in the standard email structure may be interpreted differently than expected.
    Examples

    Input:                                    "info@sestek.com"
    Possible pronunciation:      "إنفو آتِ سِستيكِ نُقْطَةْ كُوْمْ"

    Input:                                    "help@outlook.com"
    Possible pronunciation:      "هِلبْ آتْ آوتلُوك نُقْطَةْ كَمْ"

Tips:

  • Write non-Arabic symbols and handles in the exact desired spoken form and customize with diacritics based on dialect for correct synthesis.
    Examples

    "info@sestek.com"       →         "إنفو آتِ سِستيكِ نُقْطَةْ كُوْمْ"
    "help@outlook.com"     →         "هِلْبْ آتْ آوْتْلُوْكْ دوت كُمْ"


5. Diacritics (Tashkeel) and Dialect Sensitivity

Arabic text is typically written without diacritics, which can cause ambiguity.

5.1 Automatic Diacritics

Sestek TTS applies automatic tashkeel by default.

  • Diacritics are required for intelligible Arabic speech.
  • Missing or incorrect tashkeel may result in valid but unintended pronunciations.
    Example:

    Input:                                     "دخل أحمد البنك"
    Possible pronunciations:
                                                   "دَخَلَ أَحْمَدُ الْبَنْكَ"
                                                   "دَخَلَ أَحْمَدُ الْبَنْكْ"
                                                   "دَخَلْ أَحْمَدْ الْبَنْكْ"

Tips:

  • Add diacritics to the input to steer pronunciation toward the desired reading.

    Input for the desired pronunciation:      "دَخَلَ أَحْمَدُ الْبَنْكَ"

5.2. Dialect vs. MSA Expressions

  • Dialectal expectations may differ from MSA.
    Examples:

    MSA:                                                    "الآن"
    Dialect variations:

    • (Egyptian)                                        "دلوقتي"
    • (Palestinian/Jordanian)                   "هسّا"
    • (Najdi/Kuwaiti/Emirati/Saudi)          "هالحين" / "دحين" / "الحين"
    • (Kuwaiti)                                           "الحزَّة"
    • (Syrian)                                             "هلأ" / "هلّق"

Tips:

  • Use expressions and phrases that match the voice's dialect, or fall back to MSA.

    Ideal expression for Jassim Voice (ar-KW):      "الحزَّة"


6. Pauses and Flow Control (SSML)

SSML tags let you customize the pronunciation, intonation, and emphasis of the synthesized audio, creating a more engaging and personalized experience for the listener.
The most commonly used SSML tags are supported, including break, say-as, and inserting audio files. For further details, refer to: SSML Tag Support.

⚠️ Lack of pauses in long numeric sequences may cause a number to be synthesized incorrectly.

Tips:

  • Use pauses sparingly.
  • Avoid excessive breaks in short sentences.
  • Place pauses at semantic boundaries.
  • For long numeric sequences, group digits (3–3–4 or 2–2–2…) and add short breaks.
    Examples

    Without pauses:            "رقم طلبك هو 123456789" (may sound dense)
    With pauses (SSML):    "<speak> رقم طلبك هو <break time='300ms'/> 123 <break time='250ms'/> 456 <break time='250ms'/> 789</speak>"
    Without pauses:            "للتواصل اتصل على 0501234567"
    With pauses (SSML):    "<speak>للتواصل اتصل على <break time='200ms'/> 050 <break time='200ms'/> 123 <break time='200ms'/> 4567</speak>"
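
The digit-grouping tip above can be automated before text is sent for synthesis. The following is a minimal sketch and not part of any Sestek API: `group_digits` and `number_with_breaks` are hypothetical helpers, and the 200 ms default simply mirrors the phone-number example.

```python
from xml.sax.saxutils import escape

def group_digits(digits: str, sizes: list[int]) -> list[str]:
    """Cut a digit string into consecutive groups of the given sizes."""
    groups, start = [], 0
    for size in sizes:
        groups.append(digits[start:start + size])
        start += size
    if start < len(digits):
        groups.append(digits[start:])  # keep leftover digits as a final group
    return [g for g in groups if g]

def number_with_breaks(prefix: str, digits: str, sizes: list[int],
                       pause_ms: int = 200) -> str:
    """Wrap a prefix sentence and grouped digits in SSML with short breaks."""
    brk = f" <break time='{pause_ms}ms'/> "
    body = brk.join(group_digits(digits, sizes))
    # Escape the prefix so characters such as "&" or "<" cannot break the SSML.
    return f"<speak>{escape(prefix)}{brk}{body}</speak>"
```

Calling `number_with_breaks("للتواصل اتصل على", "0501234567", [3, 3, 4])` reproduces the phone-number SSML shown above.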


7. Speech Rate, Volume & Voice Selection

7.1. Adjustable Parameters

  • Voice selection: different Arabic voices may vary in clarity. Check Supported Languages and Voices for detailed information.
  • Rate: controls speaking tempo of the voice.
  • Volume: controls base volume (loudness) level of the voice.

7.2. Recommended Values for Arabic Voices

Start with the recommended values below, then adjust in small steps (±0.05) and validate until the voice sounds natural and remains clearly intelligible.

  • Rate: a value between 1.1 and 1.3.
  • Volume: default value 1.0

Important: Optimal rate and volume are subjective to the listener and scenario.
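
One way to apply the small-step adjustment is to list the candidate rates up front and audition the same sentence at each one. This is only a sketch: `candidate_rates` is a hypothetical helper, and how a rate value is passed to the synthesis request depends on your integration.

```python
def candidate_rates(low: float = 1.1, high: float = 1.3,
                    step: float = 0.05) -> list[float]:
    """List the rate values to audition, from low to high inclusive."""
    count = round((high - low) / step)
    # Rounding removes floating-point noise such as 1.1500000000000001.
    return [round(low + i * step, 2) for i in range(count + 1)]
```

With the defaults this gives five values covering the recommended range in 0.05 steps.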


8. Emotion & Tone / Phoneme Tags

⚠️ Explicit emotion or tone controls and phoneme tags are not supported.

  • Emotional delivery cannot be directly controlled via markup, but indirect influence is possible through:
    • Sentence structure
    • Punctuation
    • Strategic pause placement

9. Why Output May Sound Unexpected

9.1. Common Ambiguous Arabic Words

  • Ambiguous words may receive a valid but unintended reading.
  • Manual diacritics may still be overridden in ambiguous contexts.
    Examples

    "ملك"   →   "مَلِك" (king)                 /   "مَلَك" (angel)              /    "مَلَكَ" (owned)
    "علم"   →   "عِلْم" (knowledge)     /    "عَلَم" (flag)                 /    "عَلَّمَ" (taught)
    "قدر"   →   "قَدَر" (fate)                 /    "قِدْر" (cooking pot)
    "جمل"   →   "جَمَل" (camel)             /    "جُمَل" (sentences)
    "سلم"   →   "سِلْم" (peace)             /    "سُلَّم" (stairs/ladder)   /   "سَلَّمَ" (handed over)
    "كتب"   →   "كُتُب" (books)             /    "كَتَبَ" (he wrote)
    "عين"   →   "عَيْن" (eye / spring)    /   "عَيَّنَ" (appointed) (context-dependent)

Tips:

  • Disambiguate with tashkeel or rephrasing.
    Examples

    Ambiguous:       "ضع الطعام في القدر"
    Clear:                  "ضَعِ الطَّعَامَ فِي القِدْرِ" (cooking pot)

    Ambiguous:       "هذا قدر الإنسان" (may be spoken with unintended vowels)
    Clear:                  "هَذَا قَدَرُ الإِنْسَانِ" (fate)

    Ambiguous:       "رفع علم الدولة" (may be spoken with unintended vowels)
    Clear:                  "رَفَعَ عَلَمَ الدَّوْلَةِ" (flag)
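
A simple pre-check can flag known ambiguous words that reach the TTS input without any tashkeel. The sketch below is an assumption-laden illustration: `find_undiacritized` and the `AMBIGUOUS` watchlist are hypothetical, the watchlist holds only the words listed in this section, and punctuation attached to a word is not handled.

```python
import re

# Arabic diacritic marks (tashkeel): fathatan through sukun, U+064B to U+0652.
TASHKEEL = re.compile(r"[\u064B-\u0652]")

# Watchlist built from the ambiguous words listed in this section.
AMBIGUOUS = {"ملك", "علم", "قدر", "جمل", "سلم", "كتب", "عين"}

def find_undiacritized(text: str, watchlist: set[str] = AMBIGUOUS) -> list[str]:
    """Return watchlist words that appear in the text without any tashkeel."""
    flagged = []
    for word in text.split():
        bare = word.removeprefix("ال")  # ignore the definite article
        if not TASHKEEL.search(word) and bare in watchlist:
            flagged.append(word)
    return flagged
```

Flagged words are candidates for manual tashkeel or rephrasing; the check does not decide which reading is intended.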


9.2. Text Preparation & Cleaning (Critical)

High-quality Arabic TTS output starts with clean and well-prepared input text.

Tips:

  • Split long paragraphs into shorter sentences
  • Avoid dense numeric blocks in a single sentence
  • Ensure proper spacing between Arabic words
    Examples

    Poor:       "يرجىالاتصال علرقم 0501234567 فيحال وجودأيستفسار"
    Better:     "يرجى الاتصال على الرقم 0501234567 في حال وجود أي استفسار"
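
Part of this preparation can be scripted. The sketch below collapses stray whitespace and splits a paragraph into sentences at sentence-ending punctuation; `split_sentences` is a hypothetical helper. It cannot restore spaces that are missing between words, as in the "Poor" example above, which still needs manual review.

```python
import re

def split_sentences(text: str) -> list[str]:
    """Collapse whitespace and split text after sentence-ending punctuation."""
    cleaned = re.sub(r"\s+", " ", text).strip()
    # Split at a space that follows ".", "!", "?" or the Arabic question mark.
    parts = re.split(r"(?<=[.!?؟])\s", cleaned)
    return [p for p in parts if p]
```

Each returned sentence can then be synthesized on its own or checked for dense numeric blocks.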


9.3. Mixed-Language Text (Arabic + English)

Excessive language-switching within a single sentence may reduce naturalness.

Tips:

  • Don't keep English words in Latin characters unless they are defined in Abbreviations.

  • Use Arabic-script equivalents (Arabic transliteration) for English words to guide pronunciation:
    Examples

    Poor:      "إلى أحدث إصدار الآن WhatsApp قم بتحديث تطبيق"
    Better:    "قم بتحديث تطبيق واتساب إلى أحدث إصدار الآن"
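
Latin-script words that remain in Arabic input can be detected, and optionally replaced, before synthesis. This is a minimal sketch with hypothetical helpers (`find_latin_words`, `transliterate`) and a caller-supplied transliteration table. Run it on plain text before any SSML tags are added, since tag names are also Latin-script.

```python
import re

# A Latin-script word: starts with a letter, may contain apostrophes or hyphens.
LATIN_WORD = re.compile(r"[A-Za-z][A-Za-z'-]*")

def find_latin_words(text: str) -> list[str]:
    """List the Latin-script words found in the text, in order."""
    return LATIN_WORD.findall(text)

def transliterate(text: str, table: dict[str, str]) -> str:
    """Replace Latin-script words that have an entry in the table (keys in lower case)."""
    return LATIN_WORD.sub(
        lambda m: table.get(m.group(0).lower(), m.group(0)), text
    )
```

Words without a table entry are left as they are, so `find_latin_words` can be run again afterwards to list what still needs an Arabic-script form.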


10. Troubleshooting Guide

Issue 1: Word is pronounced incorrectly

Possible Causes

  • Missing diacritics
  • Ambiguous spelling
  • Foreign-origin word

Suggested Actions

  • Add tashkeel
  • Split the word to modify spelling
  • Use phonetic alternatives

Issue 2: Numbers sound too fast or unclear

Possible Causes

  • Long numeric sequences
  • Missing or insufficient pauses

Suggested Actions

  • Insert SSML pauses
  • Rewrite numbers in words
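
Rewriting a digit sequence in words can be scripted when the sequence should be read digit by digit, for example an order or phone number. This is a minimal sketch: `spell_digits` is a hypothetical helper, and the digit names are MSA forms, so a dialect voice may call for different words.

```python
# Spoken MSA names for the digits 0-9 (assumed forms; adjust per dialect).
DIGIT_WORDS = ["صفر", "واحد", "اثنان", "ثلاثة", "أربعة",
               "خمسة", "ستة", "سبعة", "ثمانية", "تسعة"]

def spell_digits(number: str) -> str:
    """Spell a digit string one digit at a time, skipping other characters."""
    # int() accepts both Western (0-9) and Arabic-Indic (٠-٩) decimal digits.
    return " ".join(DIGIT_WORDS[int(ch)] for ch in number if ch.isdecimal())
```

For example, `spell_digits("1234")` returns "واحد اثنان ثلاثة أربعة".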

Issue 3: Output does not sound dialectal

Explanation

  • Default behavior prioritizes MSA
  • Dialectal pronunciation is not explicitly selectable

11. Summary

✅ Do

  • Prefer clear, unambiguous, dialect-specific Arabic text
  • Apply pauses where clarity is critical
  • Keep parameters consistent across turns

❌ Don’t

  • Mix dialects within a sentence
  • Overuse diacritics
  • Expect emotion control tags
  • Pass raw numeric-heavy text without preparation