Training and Evaluation of TreeTagger on Amazigh Corpus

  • Amri Samir
  • Zenkouar Lahbib
  • Outahajala Mohamed
Conference paper
Part of the Lecture Notes in Networks and Systems book series (LNNS, volume 37)


Part-of-speech (POS) tagging is a central task in Natural Language Processing (NLP). A POS tagger assigns each token a grammatical category such as noun, verb, or adjective, together with features such as person and gender. Some words are ambiguous between categories, and tagging resolves the ambiguity from the context in which the word occurs. Many taggers have been designed, using different approaches, to reach high accuracy. In this paper we present a new tagging algorithm based on machine learning. The algorithm combines a decision tree model with a hidden Markov model (HMM) to tag Amazigh unknown words.
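As a rough illustration of how an HMM resolves an ambiguous word from its context, the following minimal sketch runs Viterbi decoding over a bigram tag model. Everything in it is an assumption made for illustration: the function name `viterbi`, the three-tag tagset, the English toy tokens, and all probability values are invented and are not taken from the paper or from the Amazigh corpus. The plain transition table also stands in for the decision tree, which in TreeTagger estimates the transition probabilities from the preceding tags.

```python
import math

# Toy model. All tags, tokens and probabilities below are hypothetical.
TAGS = ["DET", "NOUN", "VERB"]
START_P = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
# TRANS_P[prev][tag]: a plain table standing in for decision-tree estimates.
TRANS_P = {
    "DET": {"DET": 0.05, "NOUN": 0.90, "VERB": 0.05},
    "NOUN": {"DET": 0.10, "NOUN": 0.30, "VERB": 0.60},
    "VERB": {"DET": 0.50, "NOUN": 0.40, "VERB": 0.10},
}
# EMIT_P[tag][token]; "can" is ambiguous between NOUN and VERB.
EMIT_P = {
    "DET": {"the": 1.0},
    "NOUN": {"can": 0.5, "rusts": 0.1},
    "VERB": {"can": 0.5, "rusts": 0.5},
}
FLOOR = 1e-6  # small probability for (tag, token) pairs missing from EMIT_P


def viterbi(tokens, tags=TAGS, start_p=START_P, trans_p=TRANS_P, emit_p=EMIT_P):
    """Return the most probable tag sequence for tokens under a bigram HMM."""
    # best[i][t] = (log-probability of the best path ending in tag t, previous tag)
    best = [{}]
    for t in tags:
        emit = math.log(emit_p[t].get(tokens[0], FLOOR))
        best[0][t] = (math.log(start_p[t]) + emit, None)
    for i in range(1, len(tokens)):
        best.append({})
        for t in tags:
            emit = math.log(emit_p[t].get(tokens[i], FLOOR))
            best[i][t] = max(
                (best[i - 1][p][0] + math.log(trans_p[p][t]) + emit, p)
                for p in tags
            )
    # Backtrack from the best final tag to recover the whole sequence.
    last = max(tags, key=lambda t: best[-1][t][0])
    path = [last]
    for i in range(len(tokens) - 1, 0, -1):
        last = best[i][last][1]
        path.append(last)
    return path[::-1]


print(viterbi(["the", "can", "rusts"]))
```

In this toy model "can" is equally likely as a noun or a verb when taken alone, so the preceding determiner is what selects the noun reading.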

Part-of-speech (POS) tagging is an essential part of text processing applications. A POS tagger assigns each word of its input text a tag specifying its grammatical properties. One popular POS tagger is TreeTagger, which has been shown to reach high accuracy on English and several other languages. Seeing how a method developed for one language performs on another is informative, because it gives insight into the differences and similarities between the languages. For statistical methods such as TreeTagger, it also has added practical advantages. This paper presents the creation of a POS-tagged corpus and an evaluation of TreeTagger on Amazigh text. In our experiments on Amazigh text, TreeTagger reaches an overall tagging accuracy of 93.15%: 93.78% on known words and 65.10% on unknown words.
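The known/unknown split reported above can be computed along the following lines. This is a minimal sketch that assumes a word counts as "known" when it occurs in the training vocabulary; the function name `accuracy_breakdown`, its arguments, and the toy data are hypothetical and do not come from the paper.

```python
def accuracy_breakdown(gold, predicted, train_vocab):
    """Compute overall, known-word and unknown-word tagging accuracy.

    gold, predicted: aligned lists of (token, tag) pairs.
    train_vocab: set of tokens seen in training (assumed definition of "known").
    """
    # counts[key] = [number of correct tags, number of tokens]
    counts = {"overall": [0, 0], "known": [0, 0], "unknown": [0, 0]}
    for (token, gold_tag), (_, pred_tag) in zip(gold, predicted):
        bucket = "known" if token in train_vocab else "unknown"
        for key in ("overall", bucket):
            counts[key][1] += 1
            counts[key][0] += int(gold_tag == pred_tag)
    # A bucket with no tokens has no defined accuracy, so report None.
    return {k: (c / n if n else None) for k, (c, n) in counts.items()}


# Toy data, invented for illustration only.
gold = [("a", "X"), ("b", "Y"), ("c", "X"), ("d", "Y")]
pred = [("a", "X"), ("b", "X"), ("c", "X"), ("d", "Y")]
print(accuracy_breakdown(gold, pred, train_vocab={"a", "b", "c"}))
```

Reporting the two buckets separately matters because unknown words are usually a small share of the test tokens, so the overall figure stays close to the known-word figure even when unknown-word accuracy is much lower.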


Keywords: Amazigh, Corpus, TreeTagger, Machine learning, POS tagging



Copyright information

© Springer International Publishing AG 2018

Authors and Affiliations

  • Amri Samir (1)
  • Zenkouar Lahbib (1)
  • Outahajala Mohamed (2)
  1. LEC Laboratory, EMI School, University Med V of Rabat, Rabat, Morocco
  2. CESIC Laboratory, IRCAM Institute, Rabat, Morocco
