International Journal of Speech Technology, Volume 21, Issue 4, pp 895–906

Evaluation of speech unit modelling for HMM-based speech synthesis for Arabic

  • Amal Houidhek
  • Vincent Colotte
  • Zied Mnasri
  • Denis Jouvet


This paper investigates the use of hidden Markov models (HMM) for Modern Standard Arabic speech synthesis. HMM-based speech synthesis systems require a description of each speech unit through a set of contextual features covering phonetic, phonological and linguistic aspects. To apply this method to the Arabic language, a study of its particularities was conducted to extract suitable contextual features. Two phenomena are highlighted: vowel quantity and gemination. This work focuses on how to model geminated consonants (resp. long vowels): either as fully-fledged phonemes, or as the same phonemes as their simple (resp. short) counterparts but with a different duration. Four modelling approaches are proposed for this purpose. Subjective and objective evaluations show no significant difference between differentiating the modelling units associated with geminated consonants (resp. long vowels) from the units associated with simple consonants (resp. short vowels) and merging them, as long as gemination and vowel-quantity information is included in the feature set.
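The two unit-definition strategies compared in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names, the ASCII phone notation (a trailing ":" marking a geminated consonant or long vowel), and the feature key are all hypothetical conventions chosen for illustration.

```python
# Sketch of two ways to derive HMM modelling units from an Arabic phone
# sequence. Notation is hypothetical: "d:" = geminated /d/, "a:" = long /a/.

def units_differentiated(phones):
    """Treat geminated consonants and long vowels as fully-fledged
    phonemes: each variant becomes its own modelling unit."""
    return [(p, {}) for p in phones]

def units_merged(phones):
    """Map geminated/long phones onto their simple/short counterparts,
    keeping gemination and vowel-quantity information as a contextual
    feature attached to the unit (hypothetical feature name)."""
    units = []
    for p in phones:
        if p.endswith(":"):  # geminated or long variant
            units.append((p[:-1], {"long_or_geminated": True}))
        else:
            units.append((p, {"long_or_geminated": False}))
    return units

# Toy example: /m a d: a:/
phones = ["m", "a", "d:", "a:"]
print(units_differentiated(phones))
print(units_merged(phones))
```

Under the merged strategy the unit inventory is smaller (more training data per unit), while the differentiated strategy keeps the distinction in the unit identity itself; the paper's finding is that either works provided the gemination/quantity feature is present in the contextual description.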


Keywords: Parametric speech synthesis · Statistical modelling · Arabic language · Speech unit modelling · Vowel quantity · Gemination



This research work was conducted under the PHC-Utique Program in the framework of the CMCU (Comité Mixte de Coopération Universitaire), Grant No. 15G1405.



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  • Amal Houidhek (1, 2)
  • Vincent Colotte (2)
  • Zied Mnasri (1)
  • Denis Jouvet (2)

  1. Electrical Engineering Department, Ecole Nationale d'Ingénieurs de Tunis, University Tunis El Manar, Tunis, Tunisia
  2. Université de Lorraine, CNRS, Inria, LORIA, Nancy, France
