Design and Development of Media-Corpus of the Kazakh Language

Mansurova, Madina; Madiyeva, Gulmira; Aubakirov, Sanzhar; Yermekov, Zhantemir; Alimzhanov, Yermek

doi:10.1007/978-3-319-67077-5_49

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10449)

Included in the following conference series:

International Conference on Computational Collective Intelligence

Abstract

The aim of this work was the design and development of a media-corpus of the Kazakh language. The media-corpus is hosted by the al-Farabi Kazakh National University and serves linguists as an empirical basis for research on contemporary written Kazakh. The information system for the media-corpus is built on a component-based software architecture. To automate the collection, storage and analysis of media texts in the Kazakh language, four components of the information system were designed and developed. The text files are saved in XML format. At the analysis stage, the system performs text normalization, stop-word removal, metadata annotation and morphological analysis. The morphological analyzer takes plain text as input and outputs XML, which is convenient for further processing because it is easily converted to JSON. The XML format is defined using XML Schema Definition (XSD). Having a formal schema makes it possible to convert the data into other formats, which simplifies data exchange between systems. For cases of incomplete morphological markup or homonymy, a special interface for manual markup was developed.
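
The analysis steps named in the abstract (normalization, stop-word removal, metadata, XML output, conversion to JSON) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the element and attribute names (`document`, `w`, `tag`, `source`, `date`), the tiny stop-word set and the `UNKNOWN` placeholder tag are all assumptions made for illustration, and the morphological analysis itself is stubbed out.

```python
import json
import re
import unicodedata
import xml.etree.ElementTree as ET

# Illustrative stop words only (assumption); a real list would be curated by linguists.
STOP_WORDS = {"және", "мен", "да", "де"}


def normalize(text: str) -> str:
    """Apply Unicode NFC normalization, lowercase, and collapse whitespace."""
    text = unicodedata.normalize("NFC", text).lower()
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text: str) -> list:
    """Split into runs of word characters (Unicode-aware in Python 3)."""
    return re.findall(r"\w+", text)


def build_xml(text: str, source: str, date: str) -> ET.Element:
    """Wrap filtered tokens and metadata in a hypothetical XML layout."""
    doc = ET.Element("document", {"source": source, "date": date})
    for token in tokenize(normalize(text)):
        if token in STOP_WORDS:
            continue  # stop-word removal
        word = ET.SubElement(doc, "w")
        word.text = token
        # Placeholder: a morphological analyzer would supply real tags here.
        word.set("tag", "UNKNOWN")
    return doc


def xml_to_json(doc: ET.Element) -> str:
    """Convert the hypothetical XML layout into a JSON string."""
    payload = {
        "meta": dict(doc.attrib),
        "words": [{"form": w.text, "tag": w.get("tag")} for w in doc.findall("w")],
    }
    # ensure_ascii=False keeps Kazakh Cyrillic letters readable in the output.
    return json.dumps(payload, ensure_ascii=False)


if __name__ == "__main__":
    doc = build_xml("Қазақстан  және Алматы", "example-news", "2017-01-01")
    print(ET.tostring(doc, encoding="unicode"))
    print(xml_to_json(doc))
```

In a schema-driven system such as the one described, the XML produced here would additionally be validated against the XSD before conversion; that step is omitted because the Python standard library does not include an XSD validator.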

Acknowledgments

This work was supported in part by a grant from the Foundation of the Ministry of Education and Science of the Republic of Kazakhstan, “Development of intellectual high-performance information-analytical search system of processing of semi-structured data” (2015–2017).

Author information

Authors and Affiliations

Al-Farabi Kazakh National University, Almaty, Kazakhstan
Madina Mansurova, Gulmira Madiyeva, Sanzhar Aubakirov, Zhantemir Yermekov & Yermek Alimzhanov

Corresponding author

Correspondence to Madina Mansurova.

Editor information

Editors and Affiliations

Department of Information Systems, Faculty of Computer Science and Management, Wrocław University of Science and Technology, Wrocław, Poland
Ngoc Thanh Nguyen
Department of Computer Science, University of Cyprus, Nicosia, Cyprus
George A. Papadopoulos
Department of Information Systems, Gdynia Maritime University, Gdynia, Poland
Piotr Jędrzejowicz
Department of Information Systems, Faculty of Computer Science and Management, Wrocław University of Science and Technology, Wrocław, Poland
Bogdan Trawiński
Department of Information Systems, University of Münster, Münster, Germany
Gottfried Vossen

Cite this paper

Mansurova, M., Madiyeva, G., Aubakirov, S., Yermekov, Z., Alimzhanov, Y. (2017). Design and Development of Media-Corpus of the Kazakh Language. In: Nguyen, N., Papadopoulos, G., Jędrzejowicz, P., Trawiński, B., Vossen, G. (eds) Computational Collective Intelligence. ICCCI 2017. Lecture Notes in Computer Science, vol 10449. Springer, Cham. https://doi.org/10.1007/978-3-319-67077-5_49

DOI: https://doi.org/10.1007/978-3-319-67077-5_49
Published: 07 September 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-67076-8
Online ISBN: 978-3-319-67077-5
eBook Packages: Computer Science, Computer Science (R0)
