A detailed review on word embedding techniques with emphasis on word2vec

Published in: Multimedia Tools and Applications

Abstract

Text data has grown drastically in recent years because of digitalization. With the Internet flooded by millions of new documents every day, processing text manually is impractical: it neither scales nor adapts. Most machine learning algorithms cannot interpret raw text in its original form, since they require numerical inputs to accomplish any task (say, classification or regression). A better way of representing text is therefore needed so that computers can understand and process it efficiently and effectively. Word embedding, the encoding of words as dense vectors, is one such technique and has received much interest as a feature learning method for natural language processing in recent years. This review presents a structured way of understanding and working with word embeddings. Researchers who are not experts in text processing often do not know where to start their exploration because comprehensive material is lacking. This review provides an overview of several word embedding strategies and the complete working procedure of word2vec, from both theoretical and mathematical perspectives, giving researchers enough detail to begin their own work quickly. Results reported for standard word embedding techniques are also included to show how word embeddings have improved from earlier years to the most recent findings.
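To make the idea of encoding words as vectors concrete, the sketch below trains a tiny word2vec model and inspects the resulting vectors. It is a minimal illustration only, assuming the gensim library (version 4 or later) and a toy corpus invented for this example; it is not code from the reviewed article.

```python
# Minimal word2vec sketch (assumes gensim >= 4.0; the toy corpus is invented for illustration).
from gensim.models import Word2Vec

# Each document is pre-tokenized into a list of lowercase words.
corpus = [
    ["word", "embeddings", "encode", "words", "as", "vectors"],
    ["word2vec", "learns", "vectors", "from", "word", "context"],
    ["similar", "words", "get", "similar", "vectors"],
]

# sg=1 selects the skip-gram architecture; sg=0 would use CBOW.
model = Word2Vec(
    sentences=corpus,
    vector_size=50,   # dimensionality of each word vector
    window=2,         # context window size
    min_count=1,      # keep every word in this tiny corpus
    sg=1,
    epochs=100,
)

vec = model.wv["vectors"]                      # 50-dimensional numpy array
print(vec.shape)
print(model.wv.most_similar("word", topn=3))   # nearest neighbours by cosine similarity
```

In word2vec's terminology, the skip-gram model (sg=1 above) predicts surrounding context words from a centre word, while CBOW predicts the centre word from its context; both yield dense vectors in which words appearing in similar contexts end up close together.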



Ethics declarations

Conflict of interest

The authors certify that there is no conflict of interest in the subject matter discussed in this manuscript.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Johnson, S.J., Murty, M.R. & Navakanth, I. A detailed review on word embedding techniques with emphasis on word2vec. Multimed Tools Appl 83, 37979–38007 (2024). https://doi.org/10.1007/s11042-023-17007-z

