Abstract
In this paper, we present an efficient part-of-speech (POS) tagger for Arabic based on a Hidden Markov Model, and we explore several enhancements to the baseline system. Despite the morphological complexity of Arabic, our approach is data-driven and, unlike many other Arabic POS taggers, uses neither a morphological analyzer nor a lexicon. This makes the approach simple, very efficient, and well suited to real-life applications, while its accuracy remains comparable to that of other Arabic POS taggers. In the experiments, we thoroughly investigate aspects of Arabic POS tagging that have not been examined in detail before, including tag sets and prefix and suffix analyses. Our part-of-speech tagger achieves an accuracy of 95.57% on a standard tagset for Arabic. A detailed error analysis is provided for a better evaluation of the system. We also apply the same approach to other languages, such as Farsi and German, to show that it is language independent, and we report accuracy rates for these languages.
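As a rough illustration of the kind of data-driven HMM tagging the abstract describes, the following minimal sketch trains a bigram HMM from tagged sentences and decodes with the Viterbi algorithm. It is not the paper's implementation: the model order, the add-one smoothing, the function names, and the toy English corpus are all assumptions made for illustration, and the prefix and suffix analyses mentioned in the abstract are not modeled here.

```python
import math
from collections import Counter, defaultdict

def train(tagged_sentences):
    """Collect tag-transition and word-emission counts from (word, tag) data."""
    trans = defaultdict(Counter)  # trans[previous_tag][tag]
    emit = defaultdict(Counter)   # emit[tag][word]
    vocab = set()
    for sentence in tagged_sentences:
        prev = "<s>"
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            vocab.add(word)
            prev = tag
        trans[prev]["</s>"] += 1
    return trans, emit, sorted(emit), vocab

def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability (placeholder smoothing, an assumption)."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))

def viterbi(words, model):
    """Return the most probable tag sequence for `words` under the bigram HMM."""
    if not words:
        return []
    trans, emit, tags, vocab = model
    n_tags = len(tags) + 1    # all tags plus the end-of-sentence marker
    n_words = len(vocab) + 1  # vocabulary plus one slot for unknown words
    # best[i][tag] = (log score of best path ending in tag, previous tag)
    best = [{t: (log_prob(trans["<s>"], t, n_tags)
                 + log_prob(emit[t], words[0], n_words), None) for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            e = log_prob(emit[t], words[i], n_words)
            column[t] = max(
                (best[i - 1][p][0] + log_prob(trans[p], t, n_tags) + e, p)
                for p in tags
            )
        best.append(column)
    # Pick the final tag, including the transition to the end marker.
    last = max(tags, key=lambda t: best[-1][t][0]
               + log_prob(trans[t], "</s>", n_tags))
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])  # follow back-pointers
    return path[::-1]

# Toy corpus (hypothetical data, English only for readability).
corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]
model = train(corpus)
print(viterbi(["the", "cat", "barks"], model))
```

Because the sketch relies only on counts from tagged text, it needs no lexicon or morphological analyzer; unknown words simply receive the smoothed emission probability, which is where affix-based estimates could plausibly be substituted.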
This work was supported by Applications Technology, Inc. (Apptek).
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Köprü, S. (2011). An Efficient Part-of-Speech Tagger for Arabic. In: Gelbukh, A.F. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2011. Lecture Notes in Computer Science, vol 6608. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19400-9_16
DOI: https://doi.org/10.1007/978-3-642-19400-9_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-19399-6
Online ISBN: 978-3-642-19400-9
eBook Packages: Computer Science, Computer Science (R0)