Developing a Competitive HMM Arabic POS Tagger Using Small Training Corpora

  • Mohammed Albared
  • Nazlia Omar
  • Mohd. Juzaiddin Ab Aziz
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6591)


Part-of-speech (POS) tagging is the task of computationally determining which POS of a word is activated by its use in a particular context. POS tagging is an important processing step in many natural language systems, such as information extraction and question answering. This paper presents a study that seeks an appropriate strategy for developing a fast and accurate statistical Arabic POS tagger when only a limited amount of training material is available. This is an essential consideration for languages like Arabic, for which annotated resources are scarce and not easily available. Different configurations of an HMM tagger are studied: bigram and trigram models are tested, as well as different smoothing techniques. In addition, a new lexical model is defined to guess the POS of unknown words, based on the linear interpolation of word suffix probability and word prefix probability. Several experiments are carried out to determine the performance of the different HMM configurations with two small training corpora. The first corpus contains about 29,300 words from both Modern Standard Arabic and Classical Arabic. The second is the Quranic Arabic Corpus, which consists of 77,430 words of Quranic Arabic.
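
The abstract describes the unknown-word lexical model only as a linear interpolation of a suffix-based and a prefix-based probability. The following is a minimal sketch of that idea, not the authors' implementation. The function names, the maximum affix length of 3, the interpolation weight `lam`, the relative-frequency estimates, the back-off to the longest affix seen in training, and the transliterated toy data are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def train_affix_counts(tagged_words, max_len=3):
    """Count how often each prefix/suffix (1..max_len chars) occurs with each tag."""
    prefix_counts = defaultdict(Counter)
    suffix_counts = defaultdict(Counter)
    for word, tag in tagged_words:
        for n in range(1, min(max_len, len(word)) + 1):
            prefix_counts[word[:n]][tag] += 1
            suffix_counts[word[-n:]][tag] += 1
    return prefix_counts, suffix_counts


def longest_known(counts, candidates):
    """Return the first candidate affix seen in training, or None."""
    for affix in candidates:
        if affix in counts:
            return affix
    return None


def affix_tag_probs(counts, affix, tags):
    """Relative-frequency estimate of P(tag | affix); uniform if the affix is unseen."""
    seen = counts.get(affix)
    if not seen:
        return {t: 1.0 / len(tags) for t in tags}
    total = sum(seen.values())
    return {t: seen[t] / total for t in tags}


def guess_tag_probs(word, prefix_counts, suffix_counts, tags, lam=0.5, max_len=3):
    """Interpolate: P(tag | word) = lam * P(tag | suffix) + (1 - lam) * P(tag | prefix)."""
    n = min(max_len, len(word))
    # Candidates are ordered longest first, so the longest known affix wins.
    suffixes = [word[-k:] for k in range(n, 0, -1)]
    prefixes = [word[:k] for k in range(n, 0, -1)]
    p_suf = affix_tag_probs(suffix_counts, longest_known(suffix_counts, suffixes), tags)
    p_pre = affix_tag_probs(prefix_counts, longest_known(prefix_counts, prefixes), tags)
    return {t: lam * p_suf[t] + (1 - lam) * p_pre[t] for t in tags}


# Hypothetical toy data (transliterated), for illustration only.
toy_corpus = [
    ("alkitab", "N"),
    ("almadrasa", "N"),
    ("yaktubun", "V"),
    ("yadrusun", "V"),
]
tags = ["N", "V"]
prefix_counts, suffix_counts = train_affix_counts(toy_corpus)
probs = guess_tag_probs("alqalam", prefix_counts, suffix_counts, tags)
print(probs)
```

In this toy run the unseen word shares the prefix "al" with the two nouns, while none of its suffixes were seen, so the prefix estimate pulls the guess toward the noun tag and the suffix estimate stays uniform. In a full HMM tagger, such a distribution would stand in for the emission probability of words absent from the training lexicon.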


Keywords: Arabic languages; Hidden Markov model; Unknown words





Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Mohammed Albared (1)
  • Nazlia Omar (1)
  • Mohd. Juzaiddin Ab Aziz (1)
  1. Faculty of Information Science and Technology, Department of Computer Science, University Kebangsaan Malaysia, Malaysia
