
Robust Word Vectors: Context-Informed Embeddings for Noisy Texts

Journal of Mathematical Sciences

We propose RoVe, a new language-independent architecture for robust word vectors. It is designed to alleviate the problem of typos and misspellings, which are common in almost any user-generated content and hinder automatic text processing. The model is morphologically motivated, which allows it to handle unseen word forms in morphologically rich languages. We report results on a number of natural language processing (NLP) tasks and languages for a variety of related architectures and show that the proposed architecture is robust to typos.
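The full RoVe architecture is specified in the paper itself; as a rough, self-contained illustration of the kind of character-level encoding that makes word vectors resilient to typos, the Python sketch below keeps a word's first and last characters in separate slots and treats the interior characters as an unordered bag, so transpositions inside a word leave the vector essentially unchanged. Everything here (the alphabet, the helper names char_bag and bme_encode, the vector dimensions) is an illustrative assumption, not the model from the paper.

```python
# Minimal sketch of a typo-robust, order-insensitive word encoding.
# Hypothetical names and dimensions; not the exact RoVe model.
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
CHAR_INDEX = {c: i for i, c in enumerate(ALPHABET)}


def char_bag(chars: str) -> np.ndarray:
    """Sum of one-hot character vectors: insensitive to character order."""
    v = np.zeros(len(ALPHABET))
    for c in chars:
        if c in CHAR_INDEX:
            v[CHAR_INDEX[c]] += 1.0
    return v


def bme_encode(word: str) -> np.ndarray:
    """Encode a word as concatenated Begin/Middle/End character bags.

    Separate slots for the first and last characters preserve word
    boundaries, while the unordered middle bag makes the vector
    invariant to transpositions inside the word.
    """
    word = word.lower()
    return np.concatenate(
        [char_bag(word[:1]), char_bag(word[1:-1]), char_bag(word[-1:])]
    )


if __name__ == "__main__":
    clean = bme_encode("recognition")
    noisy = bme_encode("reocginiton")  # scrambled middle, intact boundaries
    cos = clean @ noisy / (np.linalg.norm(clean) * np.linalg.norm(noisy))
    print(f"cosine similarity: {cos:.3f}")  # 1.000: identical middle bags
```

For this particular pair the similarity is exactly 1.0, since the transpositions never touch the boundary characters. Such a bag encoding alone cannot distinguish anagrams with shared boundaries, which is why a context encoder over the surrounding words (as in the RoVe approach) is needed on top of it.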



Author information


Corresponding author

Correspondence to V. Malykh.

Additional information

Published in Zapiski Nauchnykh Seminarov POMI, Vol. 499, 2021, pp. 248–266.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Malykh, V., Khakhulin, T. & Logacheva, V. Robust Word Vectors: Context-Informed Embeddings for Noisy Texts. J Math Sci 273, 614–627 (2023). https://doi.org/10.1007/s10958-023-06523-w

