New Kazakh Parallel Text Corpora with On-line Access

  • Zhandos Zhumanov
  • Aigerim Madiyeva
  • Diana Rakhimova
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10449)

Abstract

This paper presents a new parallel resource for the Kazakh language: text corpora with on-line access. We describe three approaches to collecting parallel text and how much data we collected with each; the resulting parallel Kazakh-English text corpora, gathered from various sources and aligned at the sentence level; and a web-accessible corpus management system set up with open source tools, namely the corpus manager Manatee and the web GUI KonText. The outcome of this work is a working web-accessible corpus management system for the collected corpora.

Keywords

Parallel text corpora, Kazakh language, Corpus management system

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Zhandos Zhumanov (1)
  • Aigerim Madiyeva (1)
  • Diana Rakhimova (1)
  1. Laboratory of Intelligent Information Systems, Al-Farabi Kazakh National University, Almaty, Kazakhstan
