
Robust Word Vectors: Context-Informed Embeddings for Noisy Texts

Journal of Mathematical Sciences

We propose RoVe, a new language-independent architecture for robust word vectors. It is designed to alleviate the problem of typos and misspellings, which are common in almost any user-generated content and hinder automatic text processing. The model is morphologically motivated, which allows it to handle unseen word forms in morphologically rich languages. We report results on a number of natural language processing (NLP) tasks and languages for a variety of related architectures and show that the proposed architecture is robust to typos.
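The full RoVe architecture is specified in the paper itself; as a rough, self-contained illustration of the kind of character-level encoding that makes word vectors resilient to typos, the Python sketch below keeps a word's first and last characters in separate slots and treats the interior characters as an unordered bag, so transpositions inside a word leave the vector essentially unchanged. Everything here (the alphabet, the helper names char_bag and bme_encode, the vector dimensions) is an illustrative assumption, not the model from the paper.

```python
# Minimal sketch of a typo-robust, order-insensitive word encoding.
# Hypothetical names and dimensions; not the exact RoVe model.
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
CHAR_INDEX = {c: i for i, c in enumerate(ALPHABET)}


def char_bag(chars: str) -> np.ndarray:
    """Sum of one-hot character vectors: insensitive to character order."""
    v = np.zeros(len(ALPHABET))
    for c in chars:
        if c in CHAR_INDEX:
            v[CHAR_INDEX[c]] += 1.0
    return v


def bme_encode(word: str) -> np.ndarray:
    """Encode a word as concatenated Begin/Middle/End character bags.

    Separate slots for the first and last characters preserve word
    boundaries, while the unordered middle bag makes the vector
    invariant to transpositions inside the word.
    """
    word = word.lower()
    return np.concatenate(
        [char_bag(word[:1]), char_bag(word[1:-1]), char_bag(word[-1:])]
    )


if __name__ == "__main__":
    clean = bme_encode("recognition")
    noisy = bme_encode("reocginiton")  # scrambled middle, intact boundaries
    cos = clean @ noisy / (np.linalg.norm(clean) * np.linalg.norm(noisy))
    print(f"cosine similarity: {cos:.3f}")  # 1.000: identical middle bags
```

For this particular pair the similarity is exactly 1.0, since the transpositions never touch the boundary characters. Such a bag encoding alone cannot distinguish anagrams with shared boundaries, which is why a context encoder over the surrounding words (as in the RoVe approach) is needed on top of it.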



Author information


Corresponding author

Correspondence to V. Malykh.

Additional information

Published in Zapiski Nauchnykh Seminarov POMI, Vol. 499, 2021, pp. 248–266.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Malykh, V., Khakhulin, T. & Logacheva, V. Robust Word Vectors: Context-Informed Embeddings for Noisy Texts. J Math Sci 273, 614–627 (2023). https://doi.org/10.1007/s10958-023-06523-w

