An Efficient Part-of-Speech Tagger for Arabic

  • Conference paper
Computational Linguistics and Intelligent Text Processing (CICLing 2011)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 6608)

Abstract

In this paper, we present an efficient part-of-speech (POS) tagger for Arabic based on a Hidden Markov Model. We explore different enhancements to improve the baseline system. Despite the morphological complexity of Arabic, our approach is data-driven and, unlike many other Arabic POS taggers, does not use a morphological analyzer or a lexicon. This makes the approach simple, very efficient, and well suited to real-life applications, while the accuracy obtained remains comparable to that of other Arabic POS taggers. In the experiments, we also thoroughly investigate aspects of Arabic POS tagging that were not examined in detail before, including tag sets and prefix and suffix analyses. Our part-of-speech tagger achieves an accuracy of 95.57% on a standard tagset for Arabic. A detailed error analysis is provided for a better evaluation of the system. We also applied the same approach to other languages, such as Farsi and German, to show its language-independent nature. Accuracy rates for these languages are also provided.
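
To make the idea of a purely data-driven Hidden Markov Model tagger concrete, the following is a minimal sketch and not the system described in the paper. It assumes a bigram tag model, add-k smoothing, and a tiny English placeholder corpus; the function names (train_hmm, log_prob, viterbi), the toy data, and the smoothing constant are all hypothetical choices made for illustration. The abstract does not specify the model order, the smoothing method, or how prefixes and suffixes are used, so none of those details are reproduced here.

```python
import math
from collections import Counter, defaultdict

def train_hmm(tagged_sentences):
    """Collect tag-transition and word-emission counts from tagged sentences."""
    trans = defaultdict(Counter)  # trans[prev_tag][tag] = count
    emit = defaultdict(Counter)   # emit[tag][word] = count
    vocab = set()
    for sentence in tagged_sentences:
        prev = "<s>"
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            vocab.add(word)
            prev = tag
        trans[prev]["</s>"] += 1
    return trans, emit, sorted(emit), vocab

def log_prob(counter, key, n_outcomes, k=1.0):
    """Add-k smoothed log probability (an illustrative assumption)."""
    return math.log((counter[key] + k) / (sum(counter.values()) + k * n_outcomes))

def viterbi(words, model):
    """Return the most probable tag sequence for `words` under a bigram HMM."""
    if not words:
        return []
    trans, emit, tags, vocab = model
    n_trans = len(tags) + 1   # all tags plus the end-of-sentence marker
    n_emit = len(vocab) + 1   # known words plus one slot for unknown words
    # best[i][tag] = (log score of best path ending in tag, previous tag)
    best = [{t: (log_prob(trans["<s>"], t, n_trans)
                 + log_prob(emit[t], words[0], n_emit), None) for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            e = log_prob(emit[t], words[i], n_emit)
            column[t] = max(
                (best[i - 1][p][0] + log_prob(trans[p], t, n_trans) + e, p)
                for p in tags
            )
        best.append(column)
    # Pick the best final tag, including the transition to the end marker.
    last = max(tags, key=lambda t: best[-1][t][0]
               + log_prob(trans[t], "</s>", n_trans))
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return path[::-1]

# Placeholder training data; a real tagger would be trained on a treebank.
corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]
model = train_hmm(corpus)
print(viterbi(["the", "cat", "barks"], model))
```

In this sketch an unseen word receives the same smoothed emission probability under every tag, so its tag is decided by the surrounding tag context alone. A tagger that analyzes prefixes and suffixes, as the abstract mentions, would presumably replace that uniform fallback with an affix-based estimate.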

This work was supported by Applications Technology, Inc. (Apptek).

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Köprü, S. (2011). An Efficient Part-of-Speech Tagger for Arabic. In: Gelbukh, A.F. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2011. Lecture Notes in Computer Science, vol 6608. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19400-9_16

  • DOI: https://doi.org/10.1007/978-3-642-19400-9_16

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-19399-6

  • Online ISBN: 978-3-642-19400-9

  • eBook Packages: Computer Science, Computer Science (R0)
