Annotated Amharic Corpora

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9924)

Abstract

Amharic is an under-resourced language. The paper presents two text corpora. The first is a substantially cleaned version of the existing morphologically annotated WIC Corpus (210,000 words). The second is the largest Amharic text corpus (17 million words), created from Web pages automatically crawled in 2013, 2015 and 2016. This second corpus is part-of-speech annotated by a tagger trained and evaluated on the WIC Corpus.

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. NLP Centre, Faculty of Informatics, Masaryk University, Brno, Czech Republic
