International Journal of Speech Technology, Volume 19, Issue 2, pp 269–280

Impact of morphological analysis and a large training corpus on the performances of Arabic diacritization

  • Amine Chennoufi
  • Azzeddine Mazroui
Special Issue Article

Abstract

The absence of short vowels in written Arabic causes difficulties for many automatic Arabic language processing systems. This paper presents and evaluates several hybrid systems for the automatic diacritization of Arabic texts. All of them follow three phases: a morphological analysis step, followed by two statistical steps based on hidden Markov models, one at the word level and one at the character level. Both versions of the Alkhalil morpho-syntactic analyzer were used and tested; the output of this stage is the set of possible diacritizations of each word. A lexical database containing the most frequent words of the Arabic language was incorporated into some of the systems to make them faster. Training was performed on a large Arabic corpus, and the impact of the size of this training corpus on system performance was studied. The systems use smoothing techniques to handle word transitions that are missing from the training data, and the Viterbi algorithm to select the optimal solution. The proposed system, which benefits from both the richness of morphological analysis and a large diacritized corpus, gives interesting experimental results in comparison with other known automatic diacritization systems.
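
To make the word-level statistical phase concrete, the following is a minimal sketch and not the authors' implementation. It assumes that a morphological analyzer has already produced a list of candidate diacritized forms for each word, trains bigram counts on a tiny diacritized corpus, and runs the Viterbi algorithm over the candidates. The abstract does not say which smoothing techniques the systems use, so simple additive smoothing is used here as a stand-in. The function names, the `alpha` parameter, and the transliterated toy words are all illustrative assumptions, and the character-level phase is not covered.

```python
import math
from collections import Counter

def train_bigrams(sentences):
    """Count contexts and bigrams over diacritized training sentences."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sent in sentences:
        tokens = ["<s>"] + sent          # "<s>" marks the sentence start
        vocab.update(sent)
        contexts.update(tokens[:-1])     # tokens that have a successor
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams, len(vocab)

def log_transition(prev, cur, contexts, bigrams, vocab_size, alpha):
    """Additively smoothed log P(cur | prev); unseen transitions stay finite."""
    num = bigrams[(prev, cur)] + alpha
    den = contexts[prev] + alpha * vocab_size
    return math.log(num / den)

def viterbi(candidates, contexts, bigrams, vocab_size, alpha=0.1):
    """Select the most probable sequence from per-word candidate lists.

    Assumes every word has at least one candidate form.
    """
    # best[form] = (log-probability of the best path ending in form, path)
    best = {"<s>": (0.0, [])}
    for options in candidates:
        step = {}
        for cur in options:
            score, path = max(
                ((s + log_transition(prev, cur, contexts, bigrams,
                                     vocab_size, alpha), p)
                 for prev, (s, p) in best.items()),
                key=lambda item: item[0],
            )
            step[cur] = (score, path + [cur])
        best = step
    return max(best.values(), key=lambda item: item[0])[1]

# Toy diacritized corpus (transliterated, illustrative only).
corpus = [
    ["kataba", "Alwaladu", "Aldarsa"],
    ["kutiba", "Aldarsu"],
    ["kataba", "Alwaladu", "kitAbAF"],
]
contexts, bigrams, vocab_size = train_bigrams(corpus)

# Hypothetical analyzer output: candidate diacritizations for each word.
lattice_a = [["kataba", "kutiba", "kutubN"],
             ["Alwaladu", "Alwalada"],
             ["Aldarsa", "Aldarsu"]]
lattice_b = [["kataba", "kutiba"], ["Aldarsa", "Aldarsu"]]

decoded_a = viterbi(lattice_a, contexts, bigrams, vocab_size)
decoded_b = viterbi(lattice_b, contexts, bigrams, vocab_size)
print(decoded_a)
print(decoded_b)
```

In this sketch the same undiacritized first word receives different diacritizations in the two inputs, because the choice depends on the neighbouring words' candidates. The smoothing keeps transitions that never occur in the training corpus from receiving zero probability.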

Keywords

Arabic language Automatic diacritization Morphological analysis Smoothing method Hidden Markov model Large corpus Viterbi algorithm 

Notes

Compliance with ethical standards

Conflict of Interest

The authors declare that they have no conflict of interest.

Informed Consent

The authors declare that this study does not involve human participation.

Research Involving Human Participants and/or Animals

The authors declare that this research does not involve human subjects and/or animals.

Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  1. Department of Mathematics and Computer Science, Faculty of Sciences, University Mohamed First, Oujda, Morocco
