Design and Development of Media-Corpus of the Kazakh Language

  • Madina Mansurova
  • Gulmira Madiyeva
  • Sanzhar Aubakirov
  • Zhantemir Yermekov
  • Yermek Alimzhanov
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10449)


The aim of this work was the design and development of a media-corpus of the Kazakh language. The media-corpus is hosted by al-Farabi Kazakh National University and serves linguists as an empirical basis for research on contemporary written Kazakh. The information system for the media-corpus is built on a component-based software architecture. To automate the collection, storage and analysis of media texts in the Kazakh language, four components of the information system were designed and developed. The text files are saved in XML format. The analysis stage performs text normalization, stop-word removal, metadata annotation and morphological analysis. The morphological analyzer takes plain text as input and outputs the text in XML format, which is convenient for further processing because it is easily converted to JSON. The XML format is defined using XML Schema Definition (XSD). XSD makes it possible to convert the data into other formats, which simplifies data exchange between systems. For cases of incomplete morphological markup and homonymy, a special interface for manual markup was developed.
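The analysis stage described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the stop-word list, the element names `text` and `w`, the attributes `source`, `date`, `lemma` and `pos`, and the sample sentence are all assumptions made for illustration, and the morphological analysis itself is left as empty placeholders. The sketch only shows normalization, stop-word removal, metadata attachment, and the conversion of the XML markup to JSON.

```python
import json
import re
import xml.etree.ElementTree as ET

# Hypothetical stop-word list; a real corpus would use a curated Kazakh list.
STOP_WORDS = {"және", "мен", "бұл"}


def normalize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def to_xml(text, source, date):
    """Build a hypothetical <text> element with metadata and one <w> per token."""
    root = ET.Element("text", {"source": source, "date": date})
    for token in normalize(text):
        if token in STOP_WORDS:
            continue
        # "lemma" and "pos" are placeholders a morphological analyzer would fill in.
        ET.SubElement(root, "w", {"lemma": "", "pos": ""}).text = token
    return root


def xml_to_json(root):
    """Convert the element tree into a JSON string."""
    doc = {
        "meta": dict(root.attrib),
        "words": [{"form": w.text, **w.attrib} for w in root.iter("w")],
    }
    return json.dumps(doc, ensure_ascii=False)


root = to_xml("Бұл кітап және дәптер", source="example.kz", date="2017-01-01")
result = json.loads(xml_to_json(root))
```

Keeping the metadata as attributes of the root element and one element per word form means the JSON view is a direct projection of the XML tree, so either representation can be regenerated from the other.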


Keywords: Media-corpus; Corpus linguistics; Morphological parsing



This work was supported in part by a grant from the Foundation of the Ministry of Education and Science of the Republic of Kazakhstan, “Development of intellectual high-performance information-analytical search system of processing of semi-structured data” (2015–2017).



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Madina Mansurova (1)
  • Gulmira Madiyeva (1)
  • Sanzhar Aubakirov (1)
  • Zhantemir Yermekov (1)
  • Yermek Alimzhanov (1)
  1. Al-Farabi Kazakh National University, Almaty, Kazakhstan
