A detailed review on word embedding techniques with emphasis on word2vec

Published in: Multimedia Tools and Applications

Abstract

Text data has grown drastically in recent years because of digitalization. With the Internet flooded by millions of new documents every day, processing text manually is impractical: it neither scales nor adapts. Most machine learning algorithms cannot interpret raw text in its original form, since they require numerical inputs to accomplish any task (say, classification or regression). A better way of representing text is therefore needed so that computers can understand and process it efficiently and effectively. Word embedding, the encoding of words as dense vectors, is one such technique and has received much interest as a feature learning method for natural language processing in recent years. This review presents a structured way of understanding and working with word embeddings. Researchers who are not experts in text processing often do not know where to start their exploration because comprehensive material is lacking. This review provides an overview of several word embedding strategies and the complete working procedure of word2vec, from both theoretical and mathematical perspectives, giving researchers enough detail to begin their own work quickly. Results reported for standard word embedding techniques are also included to show how word embeddings have improved from earlier years to the most recent findings.
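To make the idea of encoding words as vectors concrete, the sketch below trains a tiny word2vec model and inspects the resulting vectors. It is a minimal illustration only, assuming the gensim library (version 4 or later) and a toy corpus invented for this example; it is not code from the reviewed article.

```python
# Minimal word2vec sketch (assumes gensim >= 4.0; the toy corpus is invented for illustration).
from gensim.models import Word2Vec

# Each document is pre-tokenized into a list of lowercase words.
corpus = [
    ["word", "embeddings", "encode", "words", "as", "vectors"],
    ["word2vec", "learns", "vectors", "from", "word", "context"],
    ["similar", "words", "get", "similar", "vectors"],
]

# sg=1 selects the skip-gram architecture; sg=0 would use CBOW.
model = Word2Vec(
    sentences=corpus,
    vector_size=50,   # dimensionality of each word vector
    window=2,         # context window size
    min_count=1,      # keep every word in this tiny corpus
    sg=1,
    epochs=100,
)

vec = model.wv["vectors"]                      # 50-dimensional numpy array
print(vec.shape)
print(model.wv.most_similar("word", topn=3))   # nearest neighbours by cosine similarity
```

In word2vec's terminology, the skip-gram model (sg=1 above) predicts surrounding context words from a centre word, while CBOW predicts the centre word from its context; both yield dense vectors in which words appearing in similar contexts end up close together.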



Ethics declarations

Conflict of interest

The authors certify that there is no conflict of interest in the subject matter discussed in this manuscript.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Johnson, S.J., Murty, M.R. & Navakanth, I. A detailed review on word embedding techniques with emphasis on word2vec. Multimed Tools Appl 83, 37979–38007 (2024). https://doi.org/10.1007/s11042-023-17007-z

