Modern Approaches to the Language Data Analysis. Using Language Analysis Methods for Management and Planning Tasks

  • Conference paper
Cyber-Physical Systems and Control (CPS&C 2019)

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 95)

Abstract

The article discusses promising directions for applying modern automatic methods of natural language data analysis to a wide range of practical problems. The technology of creating electronic corpora (collections) of texts is considered as a tool for the transition from model-based linguistics to linguistics based on tagged data. The principles of creating annotated text corpora are considered, along with the possibilities and limitations of their use. The creation of an annotated corpus is described in which language data downloaded from the Internet is processed sequentially before the results are delivered to users. The pipeline consists of the following steps: downloading data from the Internet; identifying the language in which the text is written; extracting metadata; splitting the texts into paragraphs and sentences; deduplication; tokenization; automatic linguistic annotation; and publishing the cleaned and annotated data to the network. The prospects for the development of language data analysis systems are presented. Requirements are formulated for corpora intended for solving problems of public administration and strategic planning. The properties that such corpora should have are considered: corpus format, corpus volume, depth of linguistic analysis, and corpus manager structure. The annotated text corpora developed at the Artificial Intelligence Research Center (AIReC) of the Ailamazyan Program Systems Institute of the Russian Academy of Sciences are described with reference to the tasks of extracting information about persons, events, and situations from news reports. A retrospective review is given of the development of systems for automatic natural language text processing in the areas of machine translation and human-machine interaction.
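
The processing steps listed in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: every name here (`Document`, `detect_language`, `run_pipeline`, and so on) is hypothetical, the language detector is a toy Cyrillic-ratio heuristic, and the "markup" step only labels tokens as words, numbers, or punctuation as a stand-in for real linguistic annotation. Downloading and publishing are left out; the texts are assumed to be already fetched.

```python
import hashlib
import re
from dataclasses import dataclass, field


@dataclass
class Document:
    url: str                      # metadata: source address of the text
    text: str                     # raw text, assumed already downloaded
    lang: str = ""
    sentences: list = field(default_factory=list)


def detect_language(text: str) -> str:
    """Toy heuristic (assumption): 'ru' if most letters are Cyrillic, else 'en'."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return "unknown"
    cyrillic = sum(1 for ch in letters if "\u0400" <= ch <= "\u04ff")
    return "ru" if cyrillic / len(letters) > 0.5 else "en"


def split_sentences(text: str) -> list:
    """Split into paragraphs on blank lines, then into sentences after . ! ?"""
    sentences = []
    for paragraph in re.split(r"\n\s*\n", text):
        for sent in re.split(r"(?<=[.!?])\s+", paragraph.strip()):
            if sent:
                sentences.append(sent)
    return sentences


def deduplicate(sentences: list, seen: set) -> list:
    """Drop sentences whose normalized form was already seen in the corpus."""
    unique = []
    for sent in sentences:
        normalized = " ".join(sent.lower().split())
        key = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(sent)
    return unique


def tokenize(sentence: str) -> list:
    """Separate runs of word characters from single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def annotate(tokens: list) -> list:
    """Placeholder markup: tag each token as NUM, WORD, or PUNCT."""
    tagged = []
    for tok in tokens:
        if tok.isdigit():
            kind = "NUM"
        elif tok[0].isalnum():
            kind = "WORD"
        else:
            kind = "PUNCT"
        tagged.append((tok, kind))
    return tagged


def run_pipeline(raw_docs: list) -> list:
    """Apply the steps in order to (url, text) pairs; dedup spans all documents."""
    seen = set()
    corpus = []
    for url, text in raw_docs:
        doc = Document(url=url, text=text, lang=detect_language(text))
        unique = deduplicate(split_sentences(text), seen)
        doc.sentences = [annotate(tokenize(s)) for s in unique]
        corpus.append(doc)
    return corpus
```

Calling `run_pipeline` on a list of `(url, text)` pairs returns one `Document` per input, with its sentences stored as lists of `(token, tag)` pairs. Sharing the `seen` set across documents is a design choice made for this sketch, so that a sentence repeated on several pages is kept only once in the corpus.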

Acknowledgments

The publication was prepared with the support of the state program AAAA-A19-119020690042-2 «Research and development of data mining methods».

Author information

Corresponding author

Correspondence to Andrei N. Vinogradov.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Vinogradov, A.N., Vlasova, N., Kurshev, E.P., Podobryaev, A. (2020). Modern Approaches to the Language Data Analysis. Using Language Analysis Methods for Management and Planning Tasks. In: Arseniev, D., Overmeyer, L., Kälviäinen, H., Katalinić, B. (eds) Cyber-Physical Systems and Control. CPS&C 2019. Lecture Notes in Networks and Systems, vol 95. Springer, Cham. https://doi.org/10.1007/978-3-030-34983-7_46

  • DOI: https://doi.org/10.1007/978-3-030-34983-7_46

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-34982-0

  • Online ISBN: 978-3-030-34983-7

  • eBook Packages: Engineering, Engineering (R0)
