An Efficient Part-of-Speech Tagger for Arabic

Köprü, Selçuk

doi:10.1007/978-3-642-19400-9_16

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 6608)

Included in the following conference series:

International Conference on Intelligent Text Processing and Computational Linguistics

Abstract

In this paper, we present an efficient part-of-speech (POS) tagger for Arabic based on a hidden Markov model (HMM). We explore several enhancements to improve the baseline system. Despite the morphological complexity of Arabic, our approach is data-driven and, unlike many other Arabic POS taggers, does not use a morphological analyzer or a lexicon. This makes the approach simple, very efficient, and well suited to real-life applications, while its accuracy remains comparable to that of other Arabic POS taggers. In the experiments, we also thoroughly investigate aspects of Arabic POS tagging that were not examined in detail before, including tag sets and prefix and suffix analyses. Our part-of-speech tagger achieves an accuracy of 95.57% on a standard tagset for Arabic. A detailed error analysis is provided for a better evaluation of the system. We also applied the same approach to other languages, such as Farsi and German, to show that it is language independent, and we report accuracy rates for these languages as well.
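
For readers unfamiliar with HMM tagging, the following is a minimal sketch of a generic bigram HMM tagger with Viterbi decoding. It is not the system described in the paper: the abstract does not give the model order, the smoothing method, or the details of the prefix and suffix analyses, so the add-one smoothing, the toy English corpus, and all names (`train_hmm`, `log_prob`, `viterbi`, `TOY_CORPUS`) are assumptions made purely for illustration.

```python
import math
from collections import defaultdict

# Hypothetical toy corpus (not the paper's data): lists of (word, tag) pairs.
TOY_CORPUS = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]

def train_hmm(tagged_sentences):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    trans = defaultdict(lambda: defaultdict(int))  # trans[prev_tag][tag]
    emit = defaultdict(lambda: defaultdict(int))   # emit[tag][word]
    vocab = set()
    for sentence in tagged_sentences:
        prev = "<s>"
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            vocab.add(word)
            prev = tag
        trans[prev]["</s>"] += 1
    return trans, emit, sorted(emit), vocab

def log_prob(counts, context, event, num_outcomes, k=1.0):
    """Add-k smoothed log P(event | context); k=1 is an illustrative choice."""
    row = counts.get(context, {})
    total = sum(row.values())
    return math.log((row.get(event, 0) + k) / (total + k * num_outcomes))

def viterbi(words, trans, emit, tags, vocab):
    """Return the most probable tag sequence for `words` under the bigram HMM."""
    if not words:
        return []
    n_trans = len(tags) + 1   # every tag plus the end-of-sentence marker
    n_emit = len(vocab) + 1   # known words plus one slot for unknown words
    # best[i][tag] = (log score of the best path ending in tag, previous tag)
    best = [{}]
    for tag in tags:
        score = (log_prob(trans, "<s>", tag, n_trans)
                 + log_prob(emit, tag, words[0], n_emit))
        best[0][tag] = (score, None)
    for i in range(1, len(words)):
        best.append({})
        for tag in tags:
            e = log_prob(emit, tag, words[i], n_emit)
            best[i][tag] = max(
                (best[i - 1][p][0] + log_prob(trans, p, tag, n_trans) + e, p)
                for p in tags
            )
    # Pick the best final tag, including the transition to the end marker.
    _, last = max(
        (best[-1][t][0] + log_prob(trans, t, "</s>", n_trans), t) for t in tags
    )
    # Follow the back-pointers from the last position to the first.
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return path[::-1]

trans, emit, tags, vocab = train_hmm(TOY_CORPUS)
print(viterbi(["the", "cat", "barks"], trans, emit, tags, vocab))
# ['DET', 'NOUN', 'VERB']
```

In this sketch an unseen word receives the same smoothed emission probability under every tag, so its tag is decided by the transition model alone. A tagger that uses no lexicon would typically refine this step, for example with statistics over word prefixes and suffixes, which is the kind of analysis the abstract mentions.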

This work was supported by Applications Technology, Inc. (Apptek).

Author information

Authors and Affiliations

Teknoloji Yazilimevi, Ltd., METU Technopolis, 06531, Ankara, TR, Turkey
Selçuk Köprü

Editor information

Editors and Affiliations

Center for Computing Research, National Polytechnic Institute, Mexico
Alexander F. Gelbukh

About this paper

Cite this paper

Köprü, S. (2011). An Efficient Part-of-Speech Tagger for Arabic. In: Gelbukh, A.F. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2011. Lecture Notes in Computer Science, vol 6608. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19400-9_16

DOI: https://doi.org/10.1007/978-3-642-19400-9_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-19399-6
Online ISBN: 978-3-642-19400-9
eBook Packages: Computer Science, Computer Science (R0)
