Abstract
The article discusses promising directions for applying modern automatic methods of natural language data analysis to a wide range of practical problems. The technology of creating electronic text corpora (collections) is considered as a tool for the transition from model-based linguistics to linguistics based on annotated data. The principles of creating annotated text corpora, and the possibilities and limitations of their use, are considered. The creation of an annotated corpus is described, in which language data downloaded from the Internet is processed sequentially before the results are delivered to users. The pipeline consists of the following steps: downloading data from the Internet; identifying the language in which the text is written; extracting metadata; splitting the texts into paragraphs and sentences; deduplication; tokenization; automatic linguistic annotation; and publishing the cleaned and annotated data online. The prospects for the development of language data analysis systems are presented. Requirements are developed for corpora intended for solving problems of public administration and strategic planning. The properties that such corpora should have are considered: corpus format, corpus size, depth of linguistic analysis, and corpus manager structure. The annotated text corpora developed at the Artificial Intelligence Research Center (AIReC) of the Ailamazyan Program Systems Institute of the Russian Academy of Sciences are described, with reference to the tasks of extracting information about persons, events and situations from news texts. A retrospective review is given of the development of systems for automatic natural language text processing in the areas of machine translation and human-machine interaction.
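The middle stages of the processing sequence listed in the abstract (splitting into paragraphs and sentences, deduplication, tokenization) can be illustrated with a minimal Python sketch. It is not the authors' implementation: all function names are hypothetical, the sentence splitter and tokenizer are naive regular-expression rules, and deduplication is assumed to mean exact matching after case and whitespace normalization. Downloading, language identification, metadata extraction, linguistic annotation and publication are omitted.

```python
import hashlib
import re


def split_paragraphs(text):
    # Assumption: paragraphs are separated by one or more blank lines.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_sentences(paragraph):
    # Naive rule: break after '.', '!' or '?' when followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]


def deduplicate(sentences):
    # Exact-match deduplication on a hash of the lowercased,
    # whitespace-normalized sentence; the first occurrence is kept.
    seen = set()
    unique = []
    for sentence in sentences:
        normalized = " ".join(sentence.lower().split())
        key = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(sentence)
    return unique


def tokenize(sentence):
    # A run of word characters is one token; each punctuation mark is its own token.
    return re.findall(r"\w+|[^\w\s]", sentence)


def process(text):
    # Chain the stages: paragraphs -> sentences -> deduplication -> tokens.
    sentences = []
    for paragraph in split_paragraphs(text):
        sentences.extend(split_sentences(paragraph))
    return [tokenize(s) for s in deduplicate(sentences)]
```

Calling `process` on raw text returns one token list per unique sentence. A production corpus pipeline would more likely use trained sentence splitters and near-duplicate detection than these simple rules.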
Acknowledgments
The publication was prepared with the support of the state program AAAA-A19-119020690042-2 «Research and development of data mining methods».
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Vinogradov, A.N., Vlasova, N., Kurshev, E.P., Podobryaev, A. (2020). Modern Approaches to the Language Data Analysis. Using Language Analysis Methods for Management and Planning Tasks. In: Arseniev, D., Overmeyer, L., Kälviäinen, H., Katalinić, B. (eds) Cyber-Physical Systems and Control. CPS&C 2019. Lecture Notes in Networks and Systems, vol 95. Springer, Cham. https://doi.org/10.1007/978-3-030-34983-7_46
DOI: https://doi.org/10.1007/978-3-030-34983-7_46
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-34982-0
Online ISBN: 978-3-030-34983-7
eBook Packages: Engineering, Engineering (R0)