International Journal of Speech Technology, Volume 19, Issue 2, pp 289–302

A hybrid Arabic POS tagging for simple and compound morphosyntactic tags

  • N. Ababou
  • A. Mazroui

Abstract

The objective of this work is to develop a POS tagger for the Arabic language. This tagger uses a very rich tag set that gives syntactic information about the proclitics attached to words. The study employs a probabilistic model and a morphological analyzer to identify the right tag in context. Most published research on probabilistic analysis uses only a training corpus to find the probable tags for each word, and this sometimes degrades performance. In this paper, we propose a method that takes into account tags that are not included in the training data. These tags are proposed by the Alkhalil_Morpho_Sys analyzer (Bebah et al. 2011). We show that this consideration significantly increases the accuracy of the morphosyntactic analysis. In addition, the adopted tag set is very rich and contains compound tags that make it possible to analyze the proclitics attached to words.
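
The idea of combining a probabilistic model with analyzer-proposed tags can be illustrated with a minimal sketch. It is not the authors' implementation: the toy corpus, the `ANALYZER` dictionary (a hypothetical stand-in for the out-of-context output of a morphological analyzer such as Alkhalil_Morpho_Sys), the function names, and the add-k smoothing are all assumptions made for illustration. The sketch runs Viterbi decoding over a bigram hidden Markov model in which the candidate tags of a word are the union of the tags seen in training and the tags proposed by the analyzer.

```python
import math
from collections import defaultdict

# Toy tagged corpus, invented for illustration (not data from the paper).
TRAIN = [
    [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")],
    [("the", "DET"), ("run", "NOUN"), ("ends", "VERB")],
]

# Hypothetical stand-in for a morphological analyzer: word -> possible tags.
# "ADJ" never occurs in TRAIN, so only the analyzer can propose it.
ANALYZER = {"walks": {"NOUN", "VERB"}, "dog": {"NOUN"}, "quick": {"ADJ"}}


def train(corpus):
    """Count tag transitions, word emissions, and the tags seen per word."""
    trans = defaultdict(lambda: defaultdict(int))
    emit = defaultdict(lambda: defaultdict(int))
    seen = defaultdict(set)
    for sentence in corpus:
        prev = "<s>"  # sentence-start pseudo-tag
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            seen[word].add(tag)
            prev = tag
    return trans, emit, seen


def log_prob(counts, given, event, n_events, k=0.1):
    """Add-k smoothed log P(event | given); unseen events stay finite."""
    row = counts.get(given, {})
    total = sum(row.values())
    return math.log((row.get(event, 0) + k) / (total + k * n_events))


def tag(words, trans, emit, seen, analyzer):
    """Viterbi decoding restricted to training tags plus analyzer tags."""
    tagset = sorted(emit)
    vocab_size = len(seen) + 1  # one extra slot for unknown words

    def candidates(word):
        found = seen.get(word, set()) | analyzer.get(word, set())
        return sorted(found) if found else tagset

    best = {"<s>": (0.0, [])}  # tag -> (log score, best path ending there)
    for word in words:
        nxt = {}
        for t in candidates(word):
            score, path = max(
                ((s + log_prob(trans, p, t, len(tagset)), pth)
                 for p, (s, pth) in best.items()),
                key=lambda item: item[0],
            )
            nxt[t] = (score + log_prob(emit, t, word, vocab_size), path + [t])
        best = nxt
    return max(best.values(), key=lambda item: item[0])[1]


trans, emit, seen = train(TRAIN)
print(tag(["the", "dog", "walks"], trans, emit, seen, ANALYZER))
print(tag(["the", "quick"], trans, emit, seen, ANALYZER))
```

In this toy setting the word "quick" receives a tag that never appears in the training data, which a tagger relying on the training corpus alone could not propose. The smoothing scheme is a placeholder; the abstract does not specify which one the paper uses.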

Keywords

Part of speech tagging, Morphological analysis, Hidden Markov model, Smoothing, Training set, Testing set

References

  1. Al Shamsi, F., & Guessoum, A. (2006). A hidden markov model-based POS tagger for Arabic. In Proceedings of the 8th International Conference on the Statistical. Besançon, France.
  2. Al-Taani, A. T., & Al-Rub, S. A. (2009). A rule-based approach for tagging non-vocalized Arabic words. International Arab Journal of Information Technology, 6(3), 320–328.
  3. Altabba, M., Al-Zaraee, A., & Shukairy, M. A. (2010). An Arabic morphological analyzer and part-of-speech tagger. Thesis, Faculty of Informatics Engineering, Arab International University, Damascus.
  4. Antony, P. J., & Soman, K. P. (2011). Parts of speech tagging for Indian languages: A literature survey. International Journal of Computer Applications (0975-8887), 34(8), 22–29.
  5. Atiyya, M., Choukri, K., & Yaseen, M. (2005, September 29). NEMLAR Arabic written corpus. Retrieved June 11, 2015, from http://www.rdi-eg.com/Downloads/Lang%20Tech/Nemlar-specifications-resources-WC-V3.0_Final.doc.
  6. Attia, M., Yaseen, M., & Choukri, K. (2005). Specifications of the Arabic Written Corpus produced within the NEMLAR project. http://www.medar.info/The_Nemlar_Project/Publications/WC_design_final.pdf.
  7. Bebah, M. O. A. O., Meziane, A., Mazroui, A., & Lakhouaja, A. (2011). Alkhalil morpho sys. In 7th International computing conference in Arabic.
  8. Boudchiche, M., Mazroui, M., ould Abdallahi Ould Bebah, M., & Lakhouaja, A. (2014). L’analyseur Morphosyntaxique Alkhali Morpho Sys 2. In 1st National Doctoral Day of Engineering Arabic Language.
  9. Brill, E. (1992). A simple rule-based part of speech tagger. In Proceedings of the workshop on speech and natural language (pp. 112–116). Association for Computational Linguistics.
  10. Buckwalter, T. (2002). Buckwalter Arabic morphological analyzer version 2.0. Linguistic Data Consortium, University of Pennsylvania. LDC Catalog No. LDC2002L49. ISBN 1-58563-324-0.
  11. Chalabi, A. (2004). Sakhr Arabic lexicon. In NEMLAR international conference on Arabic language resources and tools (pp. 21–24).
  12. Darwish, K., Abdelali, A., & Mubarak, H. (2014). Using stem-templates to improve Arabic POS and gender/number tagging. In International conference on language resources and evaluation (LREC-2014).
  13. Diab, M. (2009). Second generation AMIRA tools for Arabic processing: Fast and robust tokenization, POS tagging, and base phrase chunking. In 2nd International conference on Arabic language resources and tools. Cairo, Egypt.
  14. Diab, M., Hacioglu, K., & Jurafsky, D. (2004). Automatic tagging of Arabic text: From raw text to base phrase chunks. In Proceedings of HLT-NAACL 2004: Short papers (pp. 149–152). Association for Computational Linguistics.
  15. El Jihad, A., & Yousfi, A. (2005). Etiquetage morpho-syntaxique des textes arabes par modèle de Markov caché. In Proceedings of Rencontre Des Etudiants Chercheurs En Informatique Pour Le Traitement Automatique Des Langues (pp. 649–654). Dourdan, France.
  16. El-Jihad, A., Yousfi, A., & Si-Lhoussain, A. (2011). Morpho-syntactic tagging system based on the patterns words for Arabic texts. International Arab Journal of Information Technology, 8(4), 350–354.
  17. Ghoul, D. (2011). Outils génériques pour l’étiquetage morphosyntaxique de la langue arabe: segmentation et corpus d’entraînement.
  18. Huang, L., Peng, Y., Wang, H., & Wu, Z. (2002). Statistical part-of-speech tagging for classical Chinese. In Text, speech and dialogue (pp. 115–122). Brno.
  19. Khoja, S. (2001). APT: Arabic part-of-speech tagger. In Proceedings of the student workshop at NAACL (pp. 20–25).
  20. Manning, C. D., & Schütze, H. (1999). Foundations of statistical natural language processing. Cambridge, MA: MIT Press.
  21. Nakagawa, T., & Uchimoto, K. (2007). A hybrid approach to word segmentation and POS tagging. In Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions (pp. 217–220). Association for Computational Linguistics.
  22. Neuhoff, D. L. (1975). The Viterbi algorithm as an aid in text recognition (Corresp.). IEEE Transactions on Information Theory, 21(2), 222–226.
  23. Ney, H., Essen, U., & Kneser, R. (1994). On structuring probabilistic dependences in stochastic language modelling. Computer Speech and Language, 8(1), 1–38.
  24. Pasha, A., Al-Badrashiny, M., Diab, M., El Kholy, A., Eskander, R., Habash, N., et al. (2014). A fast, comprehensive tool for morphological analysis and disambiguation of Arabic. Reykjavik: LREC.
  25. Schmid, H. (1994). Probabilistic part-of-speech tagging using decision trees. In Proceedings of the international conference on new methods in language processing (Vol. 12, pp. 44–49). Manchester.
  26. Thibeault, M. (2004). La catégorisation grammaticale automatique: adaptation du catégoriseur de Brill au français et modification de l’approche. Université Laval.
  27. Toutanova, K., Klein, D., Manning, C. D., & Singer, Y. (2003). Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 conference of the North American chapter of the Association for Computational Linguistics on Human Language Technology (Vol. 1, pp. 173–180). Association for Computational Linguistics.
  28. Viterbi, A. J. (1967). Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13(2), 260–269.

Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  1. Department of Mathematics and Computer Science, Faculty of Sciences, University Mohamed First, Oujda, Morocco