1 Introduction

This research investigates the efficacy of word embedding in a deep learning environment for conducting text analytics tasks and summarizes the significant aspects. A systematic literature review provides an overview of existing word embedding and deep learning models. The overall structure of the paper is shown in Fig. 1.

Fig. 1 Overall structure of the paper

1.1 Natural language processing (NLP)

NLP is a branch of linguistics, computer science, and artificial intelligence concerned with computer–human interaction, mainly how to design computers to process and evaluate huge volumes of natural language data. NLP integrates statistical, machine learning, and deep learning models with rule-based computational linguistics modeling of human language. Speech recognition, natural language interpretation or understanding (NLI or NLU), and natural language production or generation (NLP or NLG) are all common challenges in natural language processing, as shown in Fig. 2. These technologies allow computers to understand and process human language.

Fig. 2 Challenges and evolution of natural language processing

NLP research has progressed from punch cards and batch processing to the world of Google and others, where millions of web pages may be analyzed in under a second. NLP progresses from symbolic to statistical to neural NLP. Many NLP applications leverage deep neural network design and produce state-of-the-art results due to technological advancements, increased computer power, and abundant corpus availability (Young et al. 2018) (Lavanya and Sasikala 2021).

1.2 Text analytics

The majority of text data is unstructured and dispersed across the internet. This text data can yield helpful knowledge if it is properly obtained, aggregated, formatted, and analyzed. Text analytics can benefit corporations, organizations, and social movements in various ways. The easiest way to execute text analytics tasks is to use manually specified rules to link the keywords closely. However, the performance of such rules deteriorates in the presence of polysemous words. Machine learning, deep learning, and natural language processing methods are used in text analytics to extract meaning from large quantities of text. Businesses can use these insights to improve profitability, consumer satisfaction, innovation, and even public safety. Techniques for analyzing unstructured text include text classification, sentiment analysis, named entity recognition (NER) and recommendation systems, biomedical text mining, topic modeling, and others, as shown in Fig. 3. Each of these techniques is employed in a variety of contexts.

Fig. 3 NLP techniques

1.3 Deep learning models

Deep learning methods have become increasingly popular in NLP in recent years. Artificial neural networks (ANN) with several hidden layers between the input and output layers are known as deep neural networks (DNN). This survey reviews 193 articles published in the last three years focusing on word embedding and deep learning models for various text analytics tasks. Deep learning models are categorized based on their neural network topologies, such as recurrent neural networks (RNN) and convolutional neural networks (CNN). RNN detects patterns over time, while CNN identifies patterns over space.

1.3.1 Convolutional neural networks

CNN is a neural network with many successes and inventions in image processing and computer vision. The underlying architecture of CNN is depicted in Fig. 4. A CNN consists of several layers: an input layer, a convolutional layer, a pooling layer, and a fully connected layer. The input layer receives the image pixel value as input and passes it to the convolutional layer. The convolution layer computes output using kernel or filter values, subsequently transferred to the pooling layer. The pooling layer shrinks the representation size and speeds up computation. Local and location-consistent patterns are easily recognized using CNN. These patterns could be key sentences that indicate a specific objective. CNN has grown in popularity as a text analytics model architecture.

Fig. 4 Architecture of CNN

1.3.2 Recurrent neural networks

RNN models view text as a sequence of words and are designed to capture word relationships and sentence patterns for text analytics. A typical representation of RNN and backpropagation through time is shown in Fig. 5. At time t, the RNN accepts input \({x}_{t}\) and computes output \({y}_{t}\). In addition to the output, it updates the internal hidden state vector \({h}_{t}\) and passes this state from the current time step to the next. The update of the internal state is represented by Eq. (1).

Fig. 5 A typical representation of RNN

$${h}_{t}= {f}_{w}\left({h}_{t-1},{x}_{t}\right)$$
(1)

where \({h}_{t}\) represents the current state of the cell, \({f}_{w}\) represents a function parameterized by a set of weights w, and \({h}_{t-1}\) represents the previous state. \({W}_{xh}\) is the weight matrix that transforms the input to the hidden state, \({W}_{hh}\) is the weight matrix that transforms the previous hidden state to the next hidden state, and \({W}_{hy}\) is the weight matrix that transforms the hidden state to the output.

RNN passes the intermediate information through a non-linear transformation function like tanh, as shown in Eq. (2). The intermediate output is passed through the softmax function, which outputs values between 0 and 1 that add up to 1, as represented by Eq. (3). RNN uses the backpropagation through time algorithm to learn from the data sequence and improve its prediction capabilities. Backpropagation is the recursive application of the chain rule to the total loss, L, which is the sum of the losses at each time step, as represented in Eq. (4) and shown in Fig. 5. RNN suffers from the vanishing and exploding gradient problems. The vanishing gradient problem can be addressed using the Gated Recurrent Unit (GRU) or Long Short Term Memory (LSTM) network architecture.

$${h}_{t}= \tanh\left({W}_{hh}^{T} {h}_{t-1}+ {W}_{xh}^{T} {x}_{t}\right)$$
(2)
$${y}_{t}= \mathrm{softmax}\left({W}_{hy}^{T} {h}_{t}\right)$$
(3)
$$L= {L}_{1}+ {L}_{2}+ \dots + {L}_{t}$$
(4)
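The recurrence in Eqs. (2)–(4) can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the dimensions, the random initialization, the function names, and the choice of cross-entropy as the per-step loss are all assumptions made for the example.

```python
import numpy as np

def softmax(z):
    # Subtract the maximum for numerical stability; the result sums to 1.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def rnn_forward(xs, W_xh, W_hh, W_hy, h0):
    """Apply Eqs. (2)-(3) at every time step and collect states and outputs."""
    h, hs, ys = h0, [], []
    for x in xs:
        h = np.tanh(W_hh.T @ h + W_xh.T @ x)  # Eq. (2)
        y = softmax(W_hy.T @ h)               # Eq. (3)
        hs.append(h)
        ys.append(y)
    return hs, ys

def total_loss(ys, targets):
    # Eq. (4): total loss as the sum of per-step losses.
    # Cross-entropy per step is an assumption of this sketch.
    return sum(-np.log(y[t]) for y, t in zip(ys, targets))

# Toy sizes and random data (hypothetical values).
rng = np.random.default_rng(0)
d_in, d_h, d_out, T = 4, 3, 5, 6
W_xh = rng.normal(scale=0.1, size=(d_in, d_h))
W_hh = rng.normal(scale=0.1, size=(d_h, d_h))
W_hy = rng.normal(scale=0.1, size=(d_h, d_out))
xs = rng.normal(size=(T, d_in))
targets = rng.integers(0, d_out, size=T)

hs, ys = rnn_forward(xs, W_xh, W_hh, W_hy, np.zeros(d_h))
loss = total_loss(ys, targets)
print(loss)
```

Because the same weight matrices are reused at every step, the gradient of the total loss passes through repeated products involving \({W}_{hh}\), which is where vanishing and exploding gradients originate.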

In an LSTM cell at time t, the input vector \({x}_{t}\) is combined with the previous hidden state and the cell state through three gate vectors. The LSTM architecture is shown in Fig. 6. The input gate \({i}_{t}\) receives the input signal and controls how the values of the current cell state are modified, using Eq. (5).

Fig. 6 The architecture of LSTM

The forget gate \({f}_{t}\) updates its state using Eq. (6) and removes the irrelevant information. The output gate \({o}_{t}\) generates the output using Eq. (7) and sends it to the network in the next step. Sigma (\(\sigma\)) represents the sigmoid function, and tanh represents the hyperbolic tangent function. The ⊙ operator denotes the element-wise product. The input modulation gate \({m}_{t}\) is represented by Eq. (8). The network uses weight matrices W and bias vectors b to update the cell state \({c}_{t}\) at time t, as defined by Eq. (9). The network updates the hidden states using these memory units, as shown in Eq. (10).

$${i}_{t}= \sigma\left({W}_{xi} {x}_{t}+ {W}_{hi} {h}_{t-1}+ {b}_{i}\right)$$
(5)
$${f}_{t}= \sigma\left({W}_{xf} {x}_{t}+ {W}_{hf} {h}_{t-1}+ {b}_{f}\right)$$
(6)
$${o}_{t}= \sigma\left({W}_{xo} {x}_{t}+ {W}_{ho} {h}_{t-1}+ {b}_{o}\right)$$
(7)
$${m}_{t}= \tanh\left({W}_{xc} {x}_{t}+ {W}_{hc} {h}_{t-1}+ {b}_{c}\right)$$
(8)
$${c}_{t}={f}_{t}\odot {c}_{t-1}+ {i}_{t}\odot {m}_{t}$$
(9)
$${h}_{t}= {o}_{t}\odot \tanh\left({c}_{t}\right)$$
(10)
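A single LSTM step following Eqs. (5)–(10) might look as follows in NumPy. The dictionary layout for the weights, the toy dimensions, and the random initialization are assumptions of this sketch, not part of the reviewed models.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step.

    W maps keys such as "xi" and "hi" to weight matrices, and b maps
    "i", "f", "o", "c" to bias vectors (a layout chosen for this example).
    """
    i_t = sigmoid(W["xi"] @ x_t + W["hi"] @ h_prev + b["i"])  # Eq. (5)
    f_t = sigmoid(W["xf"] @ x_t + W["hf"] @ h_prev + b["f"])  # Eq. (6)
    o_t = sigmoid(W["xo"] @ x_t + W["ho"] @ h_prev + b["o"])  # Eq. (7)
    m_t = np.tanh(W["xc"] @ x_t + W["hc"] @ h_prev + b["c"])  # Eq. (8)
    c_t = f_t * c_prev + i_t * m_t                            # Eq. (9)
    h_t = o_t * np.tanh(c_t)                                  # Eq. (10)
    return h_t, c_t

# Toy sizes and random weights (hypothetical values).
rng = np.random.default_rng(0)
d_in, d_h = 4, 3
W = {}
for g in "ifoc":
    W["x" + g] = rng.normal(scale=0.1, size=(d_h, d_in))
    W["h" + g] = rng.normal(scale=0.1, size=(d_h, d_h))
b = {g: np.zeros(d_h) for g in "ifoc"}

h, c = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.normal(size=(5, d_in)):
    h, c = lstm_step(x_t, h, c, W, b)
print(h, c)
```

The additive form of Eq. (9) is what lets information in the cell state persist across many steps when the forget gate stays close to 1.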

1.4 Word to vector representation models

Recent breakthroughs in deep learning have significantly improved several NLP tasks that deal with text semantic analysis, such as text classification, sentiment analysis, NER and recommendation systems, biomedical text mining, and topic modeling. Pre-trained word embeddings are fixed-length vector representations of words that capture generic phrase semantics and linguistic patterns in natural language. Researchers have proposed various methods for obtaining such representations. Word embedding has been shown to be helpful in multiple NLP applications (Moreo et al. 2021).

Word embedding techniques can be categorized into conventional, distributional, and contextual word embedding models, as shown in Fig. 7. Conventional word embedding, also called count-based or frequency-based models, comprises the bag of words (BoW), n-gram, and term frequency-inverse document frequency (TF-IDF) models. Distributional word embedding, also called static word embedding, consists of probabilistic-distributional models, such as the vector space model (VSM), latent semantic analysis (LSA), latent Dirichlet allocation (LDA), neural probabilistic language model (NPLM), word to vector (Word2Vec), global vector (GloVe), and fastText models. The contextual word embedding models are classified into auto-regressive and auto-encoding models, such as embeddings from language models (ELMo), generative pre-training (GPT), and bidirectional encoder representations from transformers (BERT) models.

Fig. 7 Approaches to represent a word

1.5 Related work

Selecting an effective word embedding and deep learning approach for text analytics is difficult because the dataset's size, type, and purpose vary. Researchers have presented different word embedding models to effectively describe a word's meaning and provide the embedding for processing. Word embedding models have improved over the years to effectively represent out-of-vocabulary words and capture the significance of the contextual word. Previous studies have shown that a deep learning model can successfully predict outcomes by deriving significant patterns from the data (Wang et al. 2020).

The systematic studies on deep learning based emotion analysis (Xu et al. 2020), deep learning based classification of text (Dogru et al. 2021), and survey on training and evaluation of word embeddings (Torregrossa et al. 2021) focus on comparing the performance of word embedding and deep learning models for the domain-specific task. Studies also present an overview of other related approaches used for similar tasks. The focus of this research is to explore the effectiveness of word embedding in a deep learning environment for performing text analytics tasks and recommend its use based on the key findings.

1.6 Motivation and contributions

The primary motivation of this study is to cover the recent research trends in NLP and provide a detailed understanding of how to use word embedding and deep learning models to achieve efficient results on text analytics tasks. Existing systematic studies on word embedding models and deep learning approaches focus on a specific application. However, none provides a reference for selecting suitable word embedding and deep learning models for text analytics tasks or presents their strengths and weaknesses.

The key contributions of this paper are as follows:

  1. This study examines the contributions of researchers to the overall development of word embedding models and their different NLP applications.

  2. A systematic literature review is done to develop a comprehensive overview of existing word embedding and deep learning models.

  3. The relevant literature is classified according to criteria to review the essential uses of text analytics and word embedding techniques.

  4. The study explores the effectiveness of word embedding in a deep learning environment for performing text analytics tasks and discusses the key findings. The review includes a list of prominent datasets, tools, and APIs available and a list of notable publications.

  5. A reference for selecting a suitable word embedding approach for text analytics tasks is presented based on a comparative analysis of different word embedding techniques to perform text analytics tasks. The comparative analysis is presented in both tabular and graphical forms.

  6. This paper provides a concise overview of the fundamentals, advantages, and challenges of various word representation approaches and deep learning models, as well as a perspective on future research.

The overall structure of the paper is shown in Fig. 1. Section 1 introduces the overview of NLP techniques for performing text analytics tasks, deep learning models, approaches to represent word to vector form, related work, motivation, and key contribution of the study. Section 2 presents the overall development of word embedding models. Section 3 explains the methodology of the conducted systematic literature review. It also covers the eligibility criteria, data extraction process, list of popular journals, and available tools and API. Sections 4 and 5 discuss studies on significant text analytics applications, word embedding models, and deep learning environments. Section 6 discusses a comparative analysis and a reference for selecting a suitable word embedding approach for text analytics tasks. Section 7 concludes the paper with a summary and recommendations for future work, followed by Annexures A and B, which contain an overview of all review papers and the benefits and challenges of various word embedding models.

2 Word representation models

This section will examine the techniques for word embedding training, describing how they function and how they differ from one another.

2.1 Conventional word representation models

2.1.1 Bag of words

The BoW model is a simplifying representation used in NLP and information retrieval. A text is represented as an unordered collection of its words, disregarding grammar and word order. For text categorization, a word in a document is given a weight based on how frequently it appears in the document and how frequently it appears in different documents. The BoW representations of two statements, consisting of words and their weights, are as follows.

Statement 1: One cat is sleeping, and the other one is running.

Statement 2: One dog is sleeping, and the other one is eating.

 

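|    | One | Cat | Is | Sleeping | And | The | Other | Dog | Running | Eating |
|----|-----|-----|----|----------|-----|-----|-------|-----|---------|--------|
| S1 | 2   | 1   | 2  | 1        | 1   | 1   | 1     | 0   | 1       | 0      |
| S2 | 2   | 0   | 2  | 1        | 1   | 1   | 1     | 1   | 0       | 1      |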

The two statements contain ten distinct words, so each statement is represented as a ten-element vector. Statement-1 is represented by [2,1,2,1,1,1,1,0,1,0], and statement-2 is represented by [2,0,2,1,1,1,1,1,0,1]. Each vector element is the count of the corresponding entry in the dictionary.
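The two count vectors can be reproduced with a short Python sketch. The tokenizer (lowercasing and keeping alphabetic runs) and the fixed vocabulary order are assumptions chosen to match the table, not a prescribed preprocessing pipeline.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep alphabetic runs only (a simplifying assumption).
    return re.findall(r"[a-z]+", text.lower())

s1 = "One cat is sleeping, and the other one is running."
s2 = "One dog is sleeping, and the other one is eating."

# Dictionary in the same column order as the table.
vocab = ["one", "cat", "is", "sleeping", "and", "the",
         "other", "dog", "running", "eating"]

def bow_vector(text, vocab):
    counts = Counter(tokenize(text))
    # Counter returns 0 for words that do not occur in the text.
    return [counts[w] for w in vocab]

v1 = bow_vector(s1, vocab)
v2 = bow_vector(s2, vocab)
print(v1)  # [2, 1, 2, 1, 1, 1, 1, 0, 1, 0]
print(v2)  # [2, 0, 2, 1, 1, 1, 1, 1, 0, 1]
```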

BoW suffers from several limitations. The vectors are sparse, and for long sentences it takes considerable time to obtain the vector representation and to compute sentence similarity. Frequent words dominate: the more often a word occurs, the higher its frequency count and, ultimately, the higher the similarity scores. BoW ignores word order, so it can generate the same vector for totally different sentences and loses the contextual meaning of the sentence. It also cannot handle unseen (out-of-vocabulary) words.

2.1.2 n-grams

An n-gram is a contiguous sequence of n tokens. For n = 1, 2, and 3, it is termed a 1-gram, 2-gram, and 3-gram, also known as a unigram, bigram, and trigram, respectively. The n-gram model divides the sentence into word-level or character-level tokens. Consider two statements,

Statement-1: One cat is sleeping, and the other one is running.

Statement-2: One dog is sleeping, and the other one is eating.

The unigram and bigram word and character level representation is shown in the example below.

1-gram (unigram)

Word level tokens

[One, cat, is, sleeping, and, the, other, one, is, running]

[One, dog, is, sleeping, and, the, other, one, is, eating]

Character level tokens

[O, n, e, _, c, a, t, _, i, s, _, s, l, e, e, p, i, n, g, _, a, n, d, _, t, h, e, _, o, t, h, e, r, _, o, n, e, _, i, s, _, r, u, n, n, i, n, g]

[O, n, e, _, d, o, g, _, i, s, _, s, l, e, e, p, i, n, g, _, a, n, d, _, t, h, e, _, o, t, h, e, r, _, o, n, e, _, i, s, _, e, a, t, i, n, g]

2-gram (bigram)

Word level tokens

[One cat, cat is, is sleeping, sleeping and, and the, the other, other one, one is, is running]

[One dog, dog is, is sleeping, sleeping and, and the, the other, other one, one is, is eating]

Character level tokens

[On, ne, e_, _c, ca, at, t_, _i, is, s_, _s, sl, le, ee, ep, pi, in, ng, g_, _a, an, nd, d_, _t, th, he, e_, _o, ot, th, he, er, r_, _o, on, ne, e_, _i, is, s_, _r, ru, un, nn, ni, in, ng]

[On, ne, e_, _d, do, og, g_, _i, is, s_, _s, sl, le, ee, ep, pi, in, ng, g_, _a, an, nd, d_, _t, th, he, e_, _o, ot, th, he, er, r_, _o, on, ne, e_, _i, is, s_, _e, ea, at, ti, in, ng]
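The word-level and character-level n-grams listed above can be generated with one sliding-window helper. In this sketch, punctuation is dropped and spaces are written as underscores to mirror the listing; the function name `ngrams` is chosen for the example.

```python
def ngrams(tokens, n):
    # Slide a window of length n over the token sequence.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "One cat is sleeping and the other one is running"

# Word-level bigrams.
words = sentence.split()
word_bigrams = [" ".join(g) for g in ngrams(words, 2)]

# Character-level bigrams, with spaces shown as underscores.
chars = list(sentence.replace(" ", "_"))
char_bigrams = ["".join(g) for g in ngrams(chars, 2)]

print(word_bigrams)
print(char_bigrams)
```

A sequence of m tokens yields m - n + 1 n-grams, so the ten-word statement gives nine word bigrams.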

2.1.3 Term frequency-inverse document frequency

TF-IDF measures how relevant a word is to a document. Word relevance is the amount of information the word gives about the context. Term frequency (TF) measures how frequently a term occurs in a document; a term that occurs more often is treated as more relevant to the document than other terms. Consider two statements,

Statement-1: One cat is sleeping, and the other one is running.

Statement-2: One dog is sleeping, and the other one is eating.

The TF score of a word in sentences is shown in the example below.

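Statement 1

| Words    | One  | Cat  | Is   | Sleeping | And  | The  | Other | Running |
|----------|------|------|------|----------|------|------|-------|---------|
| TF score | 2/10 | 1/10 | 2/10 | 1/10     | 1/10 | 1/10 | 1/10  | 1/10    |
| Value    | 0.2  | 0.1  | 0.2  | 0.1      | 0.1  | 0.1  | 0.1   | 0.1     |

Statement 2

| Words    | One  | Dog  | Is   | Sleeping | And  | The  | Other | Eating |
|----------|------|------|------|----------|------|------|-------|--------|
| TF score | 2/10 | 1/10 | 2/10 | 1/10     | 1/10 | 1/10 | 1/10  | 1/10   |
| Value    | 0.2  | 0.1  | 0.2  | 0.1      | 0.1  | 0.1  | 0.1   | 0.1    |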

The TF scores alone are misleading: in both statements, the words “one” and “is” appear more important than the other words because they obtain the highest score of 0.2. This result motivates calculating the inverse document frequency (IDF), the logarithm of the number of statements divided by the number of statements that contain the word.

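Statement 1

| Words     | One      | Cat      | Is       | Sleeping | And      | The      | Other    | Running  |
|-----------|----------|----------|----------|----------|----------|----------|----------|----------|
| IDF score | log(2/2) | log(2/1) | log(2/2) | log(2/2) | log(2/2) | log(2/2) | log(2/2) | log(2/1) |
| Value     | 0        | 0.3      | 0        | 0        | 0        | 0        | 0        | 0.3      |

Statement 2

| Words     | One      | Dog      | Is       | Sleeping | And      | The      | Other    | Eating   |
|-----------|----------|----------|----------|----------|----------|----------|----------|----------|
| IDF score | log(2/2) | log(2/1) | log(2/2) | log(2/2) | log(2/2) | log(2/2) | log(2/2) | log(2/1) |
| Value     | 0        | 0.3      | 0        | 0        | 0        | 0        | 0        | 0.3      |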

The TF-IDF score is shown in the example below.

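Statement 1

| Words        | One | Cat  | Is  | Sleeping | And | The | Other | Running |
|--------------|-----|------|-----|----------|-----|-----|-------|---------|
| TF score     | 0.2 | 0.1  | 0.2 | 0.1      | 0.1 | 0.1 | 0.1   | 0.1     |
| IDF score    | 0   | 0.3  | 0   | 0        | 0   | 0   | 0     | 0.3     |
| TF-IDF value | 0   | 0.03 | 0   | 0        | 0   | 0   | 0     | 0.03    |

Statement 2

| Words        | One | Dog  | Is  | Sleeping | And | The | Other | Eating |
|--------------|-----|------|-----|----------|-----|-----|-------|--------|
| TF score     | 0.2 | 0.1  | 0.2 | 0.1      | 0.1 | 0.1 | 0.1   | 0.1    |
| IDF score    | 0   | 0.3  | 0   | 0        | 0   | 0   | 0     | 0.3    |
| TF-IDF value | 0   | 0.03 | 0   | 0        | 0   | 0   | 0     | 0.03   |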

The TF-IDF values identify the more informative words in each statement: cat and running for statement-1, and dog and eating for statement-2. TF-IDF thus captures the relevance of a word within the document collection, and the more informative words outweigh the merely frequent ones, such as “one” and “is”, which had the highest TF scores.
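The TF, IDF, and TF-IDF values in the tables can be checked with a small script. The base-10 logarithm is inferred from the table entry log(2/1) = 0.3; the tokenizer and helper names are assumptions of this sketch.

```python
import math
import re
from collections import Counter

docs = [
    "One cat is sleeping, and the other one is running.",
    "One dog is sleeping, and the other one is eating.",
]
# Lowercase and keep alphabetic runs only (a simplifying assumption).
tokenized = [re.findall(r"[a-z]+", d.lower()) for d in docs]

def tf(tokens):
    # Term frequency: count of the word divided by the document length.
    counts = Counter(tokens)
    return {w: c / len(tokens) for w, c in counts.items()}

def idf(word, tokenized_docs):
    # Inverse document frequency with a base-10 logarithm.
    df = sum(1 for toks in tokenized_docs if word in toks)
    return math.log10(len(tokenized_docs) / df)

tfidf = [
    {w: score * idf(w, tokenized) for w, score in tf(toks).items()}
    for toks in tokenized
]

for i, scores in enumerate(tfidf, start=1):
    print(f"Statement {i}:", {w: round(s, 2) for w, s in scores.items()})
```

Only the words that occur in a single statement (cat and running, dog and eating) receive a nonzero score of about 0.03; every shared word has an IDF of log(2/2) = 0.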

The cosine similarity of statements 1 and 2 is calculated using the formulas below. In BoW, the frequency of words affects the cosine similarity.

Cosine similarity

\(\frac{\mathrm{A}\cdot \mathrm{B}}{|\mathrm{A}|\,|\mathrm{B}|}\)

Cosine similarity using BoW

\(\frac{[2,1,2,1,1,1,1,0,1,0]\cdot [2,0,2,1,1,1,1,1,0,1]}{\sqrt{4+1+4+1+1+1+1+0+1+0}\,\sqrt{4+0+4+1+1+1+1+1+0+1}}\) = \(\frac{12}{14}\) ≈ 0.857
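The BoW figure can be verified numerically. The sketch below implements the cosine similarity formula directly with the standard library; the function name is chosen for the example.

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the Euclidean norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

s1 = [2, 1, 2, 1, 1, 1, 1, 0, 1, 0]
s2 = [2, 0, 2, 1, 1, 1, 1, 1, 0, 1]

similarity = cosine_similarity(s1, s2)
print(round(similarity, 3))  # 0.857, i.e. 12/14
```

Most of the dot product (8 of 12) comes from the repeated words “one” and “is”, which illustrates how frequent words drive BoW similarity.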

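Cosine similarity using TF-IDF

Both statements must be expressed over the same ten-word vocabulary (one, cat, is, sleeping, and, the, other, dog, running, eating) so that matching positions refer to the same word. The nonzero TF-IDF values then fall at different positions (cat and running for statement-1, dog and eating for statement-2):

\(\frac{[0,0.03,0,0,0,0,0,0,0.03,0]\cdot [0,0,0,0,0,0,0,0.03,0,0.03]}{\sqrt{0.0009+0.0009}\,\sqrt{0.0009+0.0009}}\) = \(\frac{0}{0.0018}\) = 0

TF-IDF discards the shared frequent words, so only the words that distinguish the two statements contribute to the similarity.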

2.2 The distributional representation model

In the distributional representation model, the context in which a word is used determines its meaning in a sentence. Distributional models predict semantic similarity based on the similarity of observable contexts. If two words have similar meanings, they frequently appear in the same contexts (Harris 1954) (Firth 1957) (Ekaterina Vylomova 2021). VSM is an algebraic representation of text as a vector of identifiers. Each document \({D}_{i}\) in a document space is identified by index terms \({T}_{j}\), which may be weighted 0 or 1 or according to their importance. Each document is represented by a t-dimensional vector \({D}_{i} = ({d}_{i1}, {d}_{i2}, \dots, {d}_{it})\), with weights assigned using the TF-IDF scheme to represent the difference in information provided by each term. The term \({d}_{ij}\) represents the weight assigned to the jth term in the ith document.

The similarity coefficient between two documents \({D}_{i}\) and \({D}_{j}\), represented as \(S({D}_{i}, {D}_{j})\), is computed to express the degree of similarity between their terms and weights. Two documents with similar index terms are close to each other in the space. The distance between two document points in the space is inversely correlated with the similarity between the corresponding vectors (Salton et al. 1975). A distributional model represents a word or phrase in context, but a VSM represents meaning in a high-dimensional space (Erk 2012). VSM suffers from the curse of dimensionality, which results in a relatively sparse vector space for larger datasets.

2.2.1 Latent semantic analysis

LSA is an automatic statistical technique for extracting and inferring relations of expected contextual usage of words in passages of discourse. The latent semantic indexing (LSI) technique is based on singular value decomposition (SVD). The term-document matrix is first created, capturing the correlation structure that defines the semantic relationship between the words in the documents. SVD extracts the patterns associated with the data, ignoring the less important terms. Consistent phrases emerge in the document, indicating that it is associated with the data. SVD decomposes the term-document (t x d) matrix X into three sub-matrices, \(\mathrm{X}= {\mathrm{T}}_{0}{\mathrm{S}}_{0}{\mathrm{D}}_{0}^{\prime}\), where \({\mathrm{T}}_{0}\) and \({\mathrm{D}}_{0}\) are the matrices of left and right singular vectors, which have orthogonal unit-length columns, \({\mathrm{D}}_{0}^{\prime}\) is the transpose of \({\mathrm{D}}_{0}\), and \({\mathrm{S}}_{0}\) is the diagonal matrix of singular values. SVD takes a long time to map new terms and documents and faces complexity issues. The LSI approach addresses the synonymy problem by allowing numerous terms to refer to the same thing. It also offers a partial solution to polysemy (Scott Deerwester et al. 1990) (Flor and Hao 2021).
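The decomposition and its rank-k truncation can be illustrated with NumPy. The term-document counts below are made up for the example, and k = 2 is an arbitrary choice; the variable names follow the notation \({\mathrm{T}}_{0}\), \({\mathrm{S}}_{0}\), \({\mathrm{D}}_{0}^{\prime}\) used above.

```python
import numpy as np

# Toy term-document matrix X (5 terms x 4 documents); counts are illustrative.
X = np.array([
    [2, 0, 1, 0],
    [1, 0, 1, 0],
    [0, 3, 0, 1],
    [0, 1, 0, 2],
    [1, 1, 1, 1],
], dtype=float)

# Thin SVD: X = T0 @ diag(s0) @ D0t, singular values in descending order.
T0, s0, D0t = np.linalg.svd(X, full_matrices=False)

# Keep the k largest singular values to obtain the latent semantic space.
k = 2
X_k = T0[:, :k] @ np.diag(s0[:k]) @ D0t[:k, :]

# One k-dimensional vector per document in the reduced space.
doc_vectors = (np.diag(s0[:k]) @ D0t[:k, :]).T
print(doc_vectors)
```

Dropping the smaller singular values is the step that ignores the less important terms: `X_k` is the best rank-k approximation of `X` in the least-squares sense.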

2.2.2 Latent Dirichlet allocation

The LDA model is a probabilistic corpus model that assigns high probability to corpus members and other comparable texts. It is a three-tier hierarchical Bayesian model in which each collection item is represented as a finite mixture across a set of underlying topics. Each topic is, in turn, modeled as an infinite mixture over topic probabilities. For text modeling, topic probabilities provide an explicit description of a document. The latent topic is determined by the likelihood that a word appears in the topic. LDA relies entirely on topic information and cannot capture syntactic information (Campbell et al. 2015). The LSA and LDA models construct embeddings using statistical data. The LSA model is based on matrix factorization and is subject to the non-negativity requirement. In contrast, the LDA model is based on the word distribution and is expressed by the Dirichlet prior distribution, which is the multinomial distribution's conjugate (Li and Yang 2018).

2.3 Neural probabilistic language model

Learning the joint probability function of sequences of words in a language is one of the goals of statistical language modeling. The curse of dimensionality is addressed by an NPLM that learns a distributed representation for words. Language modeling is the prediction of the probability distribution of the next word given a sequence of preceding words, as shown in Eq. (11). The joint probability of a sequence is the product of the conditional probabilities of each word given all the words before it, as represented by Eq. (12).

$$P\left({x}_{t+1}/{x}_{t}, \dots, {x}_{1}\right)$$
(11)
$$P\left({x}_{1}, \dots, {x}_{t}\right) = P\left({x}_{1}\right)P\left({x}_{2}/{x}_{1}\right)P\left({x}_{3}/{x}_{2},{x}_{1}\right) \dots P\left({x}_{t}/{x}_{t-1}, \dots, {x}_{1}\right) = \prod_{k=1}^{t} P\left({x}_{k}/{x}_{1}^{k-1}\right)$$
(12)

where the term \({x}_{t}\) is the tth word and \({x}_{1}^{t-1}\) denotes the sequence of words preceding it. The conditional probability is decomposed into two parts: a mapping C from each word in the vocabulary V to a real-valued feature vector, and a function g that maps the feature vectors of the context words to a conditional probability distribution over the words in V for the next word \({x}_{t}\), as shown in Eq. (13).

$$f\left(i, {w}_{t-1}, \dots, {w}_{t-n+1}\right)= g\left(i, C\left({w}_{t-1}\right), \dots, C\left({w}_{t-n+1}\right)\right)$$
(13)

The output of function g represents the estimated probability \(P\left({x}_{t}= i/{x}_{1}^{t-1}\right)\). Language models based on neural networks outperform n-gram models substantially (Bengio et al. 2003) (See 2019).

2.3.1 Word2Vec model

Conventional and static word representation methods treat words as atomic units represented as indices in a dictionary. These methods do not represent the similarity between words. Word2Vec is a collection of model architectures and optimizations for learning word embeddings from massive datasets. The distributed representation technique uses neural networks to express word similarity adequately.

In several NLP applications, Word2Vec models such as the continuous bag-of-words (CBOW) and Skip-Gram models are used to efficiently describe the semantic meanings of words (Mikolov et al. 2013a). The Word2Vec model takes a text corpus as input, processes it in the hidden layer, and outputs word vectors in the output layer. The model identifies the distinct words, creates a vocabulary, builds context, and learns vector representations of words in vector space using training data, as depicted in Fig. 8. Each unique word in the training set corresponds to a specific vector in space. Words can have various degrees of similarity, and words with similar contexts are more closely related.

Fig. 8 Word2Vec model

The CBOW and Skip-Gram model architecture is shown in Fig. 9. The CBOW uses context words to forecast the target word. For a given input word, the Skip-Gram model predicts the context word.

Fig. 9 The architecture of (a) CBOW model, (b) Skip-Gram model

The input is a one-hot encoded vector. The weights between the input and hidden layers are represented by the input weight matrix W, a V x N matrix. Each row of W is the N-dimensional vector representation \({\mathrm{v}}_{\mathrm{w}}\) of the associated word of the input layer; formally, the ith row of W is \({\mathrm{v}}_{\mathrm{w}}^{\mathrm{T}}\). The weights between the hidden and output layers are represented by the output weight matrix W', an N x V matrix. The input and output weight matrices are used to assign a score to each word in the vocabulary. Given a context word, assume \({\mathrm{x}}_{\mathrm{k}}=1\) and \({\mathrm{x}}_{{\mathrm{k}}^{\prime}}=0\) for \({\mathrm{k}}^{\prime}\ne \mathrm{k}\). The hidden layer activation function is linear, passing information from the previous layer to the next layer, i.e., it copies the kth row of matrix W to the hidden state value h, as shown in Eq. (14), where \({\mathrm{v}}_{{\mathrm{w}}_{\mathrm{I}}}\) is the vector representation of the input word \({\mathrm{w}}_{\mathrm{I}}\). The output weight matrix \({\mathrm{W}}^{\prime}=\{{\mathrm{w}}_{\mathrm{ij}}^{\prime}\}\) is used to compute a score \({\mathrm{u}}_{\mathrm{j}}\) for each word in the vocabulary, as shown in Eq. (15), where \({\mathrm{v}}_{{\mathrm{w}}_{\mathrm{j}}}^{\prime}\) is the jth column of the matrix W'.

$$\mathrm{h}= {\mathrm{W}}^{\mathrm{T}}\mathrm{x}= {\mathrm{v}}_{{\mathrm{w}}_{\mathrm{I}}}^{\mathrm{T}}$$
(14)
$${\mathrm{u}}_{\mathrm{j}}= {{\mathrm{v}}_{{\mathrm{w}}_{\mathrm{j}}}^{\prime}}^{\mathrm{T}}\mathrm{h}$$
(15)

The output layer uses the softmax activation function to compute the multinomial probability distribution of words. The output of the jth unit, \({\mathrm{y}}_{\mathrm{j}}\), combines the word representations from the input weight vector \({\mathrm{v}}_{\mathrm{w}}\) and the output weight vector \({\mathrm{v}}_{\mathrm{w}}^{\prime}\), as illustrated in Eq. (16).

$$\mathrm{p}\left({\mathrm{w}}_{\mathrm{j}}/{\mathrm{w}}_{\mathrm{I}}\right)= {\mathrm{y}}_{\mathrm{j}}= \frac{\exp\left({{\mathrm{v}}_{{\mathrm{w}}_{\mathrm{j}}}^{\prime}}^{\mathrm{T}}{\mathrm{v}}_{{\mathrm{w}}_{\mathrm{I}}}\right)}{\sum_{{\mathrm{j}}^{\prime}=1}^{\mathrm{V}}\exp\left({{\mathrm{v}}_{{\mathrm{w}}_{{\mathrm{j}}^{\prime}}}^{\prime}}^{\mathrm{T}}{\mathrm{v}}_{{\mathrm{w}}_{\mathrm{I}}}\right)}$$
(16)
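The forward pass in Eqs. (14)–(16) for a single input word can be traced with NumPy. The vocabulary size, embedding dimension, word index, and random weights below are toy values assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
V, N = 10, 4                                   # vocabulary size, embedding dimension (toy values)
W = rng.normal(scale=0.1, size=(V, N))         # input weight matrix, V x N
W_prime = rng.normal(scale=0.1, size=(N, V))   # output weight matrix, N x V

k = 3                                          # index of the input word w_I (hypothetical)
x = np.zeros(V)
x[k] = 1.0                                     # one-hot input vector

h = W.T @ x                                    # Eq. (14): copies the kth row of W
u = W_prime.T @ h                              # Eq. (15): one score per vocabulary word
y = np.exp(u - u.max())                        # Eq. (16): softmax, shifted for stability
y /= y.sum()

print(np.allclose(h, W[k]))
print(y.sum())
```

Because the input is one-hot, the matrix product in Eq. (14) reduces to a row lookup, which is why trained rows of `W` can be read off directly as word vectors.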

For a window size of 2, the words \({\mathrm{w}}_{\mathrm{t}-2}, {\mathrm{w}}_{\mathrm{t}-1}, {\mathrm{w}}_{\mathrm{t}+1}, {\mathrm{w}}_{\mathrm{t}+2}\) are the context words for the target word \({\mathrm{w}}_{\mathrm{t}}\). The Skip-Gram model is the opposite of the CBOW model: based on the input word, it predicts the context words. For a window size of 2, the word \({\mathrm{w}}_{\mathrm{t}}\) is the input word for the output context words \({\mathrm{w}}_{\mathrm{t}-2}, {\mathrm{w}}_{\mathrm{t}-1}, {\mathrm{w}}_{\mathrm{t}+1}, {\mathrm{w}}_{\mathrm{t}+2}\). The input weight matrix is computed using a similar approach to the CBOW model. Instead of a single multinomial distribution, the output layer produces C multinomial distributions, one per context position (panel). For the input word \({\mathrm{w}}_{\mathrm{I}}\), \({\mathrm{y}}_{\mathrm{c},\mathrm{j}}\) is the output of the jth unit on the cth panel, and \({\mathrm{u}}_{\mathrm{c},\mathrm{j}}\) is the input to that unit. \({\mathrm{w}}_{\mathrm{c},\mathrm{j}}\) is the jth word on the cth panel of the output layer, and \({\mathrm{w}}_{\mathrm{o},\mathrm{c}}\) is the actual cth output context word. The output for each word is computed using the output weight matrix, as represented in Eq. (17).

$$\mathrm{p}\left({\mathrm{w}}_{\mathrm{c},\mathrm{j}}= {\mathrm{w}}_{\mathrm{o},\mathrm{c}}/{\mathrm{w}}_{\mathrm{I}}\right)= {\mathrm{y}}_{\mathrm{c},\mathrm{j}}= \frac{\exp\left({\mathrm{u}}_{\mathrm{c},\mathrm{j}}\right)}{\sum_{{\mathrm{j}}^{\prime}=1}^{\mathrm{V}}\exp\left({\mathrm{u}}_{{\mathrm{j}}^{\prime}}\right)}$$
(17)
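The training examples that Skip-Gram sees for a window size of 2 can be enumerated with a short helper. This sketch only generates (input word, context word) pairs; it does not train a model, and the function name and example sentence are chosen for illustration.

```python
def skipgram_pairs(tokens, window=2):
    """Return (input word, context word) pairs within the given window."""
    pairs = []
    for t, target in enumerate(tokens):
        # Clip the window at the sentence boundaries.
        lo = max(0, t - window)
        hi = min(len(tokens), t + window + 1)
        for c in range(lo, hi):
            if c != t:
                pairs.append((target, tokens[c]))
    return pairs

tokens = "one cat is sleeping and the other one is running".split()
pairs = skipgram_pairs(tokens, window=2)

# Context pairs generated for the first occurrence of "is" (position 2).
print([p for p in pairs if p[0] == "is"][:4])
```

CBOW uses the same windows with the roles reversed: the context words form the input and the center word is the prediction target.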

Multiplying the one-hot input by the input weight matrix between the input and the hidden layer yields the hidden layer vector. The output layer computes the multinomial distributions using the hidden-output weight matrix. The prediction errors of the output panels are summed element-wise, and the summed error is propagated back to update the weights. The weights obtained between the hidden and output layers after training are called the word vector representation (Mikolov et al. 2013b).

2.3.2 GloVe

Word embeddings learned through Word2Vec capture word semantics and exploit word relatedness well. However, Word2Vec focuses solely on information collected from the local context window and neglects global statistical information. GloVe is a hybrid of LSA and CBOW that is efficient and scalable for large corpora (Jiao and Zhang 2021). GloVe is a popular model based on the global co-occurrence matrix X, where each element \({\mathrm{X}}_{\mathrm{ij}}\) indicates the frequency with which the words \({\mathrm{w}}_{\mathrm{i}}\) and \({\mathrm{w}}_{\mathrm{j}}\) co-occur in a given context window. The number of times any word appears in the context of the word i is denoted by \({\mathrm{X}}_{\mathrm{i}}\). \({\mathrm{P}}_{\mathrm{ij}}\) represents the probability of the word j appearing in the context of the word i, as presented in Eqs. (18)–(19).

$${\mathrm{X}}_{\mathrm{i}}= \sum_{\mathrm{k}}{\mathrm{X}}_{\mathrm{ik}}$$
(18)
$${\mathrm{P}}_{\mathrm{ij}}=\mathrm{P}\left(\mathrm{j}/\mathrm{i}\right)= \frac{{\mathrm{X}}_{\mathrm{ij}}}{{\mathrm{X}}_{\mathrm{i}}}$$
(19)

A weighted least squares regression model approximates the relationship between a word embedding and a co-occurrence matrix. The function \(f({X}_{ij})\) is a weighting function, and V is the size of the vocabulary. \(\mathrm{w}\) represents the word vectors and \(\widetilde{\mathrm{w}}\) represents the context word vectors. The terms \({b}_{i}\) and \({\widetilde{b}}_{j}\) are biases for the words \({w}_{i}\) and \({w}_{j}\) that restore the symmetry. When the co-occurrence frequency is too high, the weighting function f(x) ensures that the weight does not increase significantly, as shown in Eqs. (20)–(21).

$$J= \sum_{i,j=1}^{V}f\left({X}_{ij}\right){({w}_{i}^{T} {\widetilde{w}}_{j}+ {b}_{i}+ {\widetilde{b}}_{j}-\mathrm{log}{X}_{ij})}^{2}$$
(20)
$$f\left(x\right)=\begin{cases}{\left(x/{x}_{max}\right)}^{3/4} & \text{if } x<{x}_{max}\\ 1 & \text{otherwise}\end{cases}$$
(21)
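The objective in Eqs. (20)–(21) can be sketched in a few lines of Python. This is a minimal illustration rather than the reference implementation: the dictionary-based storage of the counts, the function names, and the default x_max = 100 are assumptions made for the example.

```python
import math

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f(x) of Eq. (21): grows as (x/x_max)^alpha, capped at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_loss(X, w, w_tilde, b, b_tilde, x_max=100.0):
    """Weighted least-squares objective J of Eq. (20).

    X maps (i, j) index pairs to co-occurrence counts. Zero counts are
    skipped: log(0) is undefined and f(0) = 0, so they contribute nothing.
    """
    J = 0.0
    for (i, j), x_ij in X.items():
        if x_ij <= 0:
            continue
        dot = sum(a * c for a, c in zip(w[i], w_tilde[j]))
        residual = dot + b[i] + b_tilde[j] - math.log(x_ij)
        J += glove_weight(x_ij, x_max) * residual ** 2
    return J
```

Because only non-zero entries are visited, the cost of evaluating J scales with the number of observed co-occurrences rather than with the square of the vocabulary size.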

GloVe is an unsupervised learning technique for constructing word vector representations. It is trained on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations exhibit significant linear substructures of the word vector space. Pre-trained GloVe embeddings are available with a vocabulary of 400 K words, trained on Wikipedia 2014 and Gigaword 5, in 50, 100, 200, and 300 dimensions (Pennington et al. 2014).

2.3.3 fastText

The fastText model, released as open source by the Facebook AI Research lab, uses internal subword information in the form of character n-grams to acquire information about the local word order and to handle rare, out-of-vocabulary terms. The method creates word vectors that reflect the grammatical and semantic similarity of words and produces vectors for unseen words based on their morphology. Each word w is expressed as a set of n-gram features w1, w2, …, wn, which are used as input to the fastText model. For example, with the boundary symbols < and > added, the character trigrams of the word “sleeping” are <sl, sle, lee, eep, epi, pin, ing, ng>. Each n-gram has its own vector, and the vector of the original word is combined with the vectors of all its n-grams during the training phase, as shown in Fig. 10.

Fig. 10
figure 10

The model architecture of fastText
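The trigram example for “sleeping” can be reproduced with a short Python sketch. The function names and the n-gram vector table are hypothetical. Averaging the n-gram vectors is one simple way to compose a vector for an unseen word and is used here only for illustration.

```python
def char_ngrams(word, n=3):
    """Return the character n-grams of `word` wrapped in the boundary symbols < and >."""
    wrapped = f"<{word}>"
    return [wrapped[k:k + n] for k in range(len(wrapped) - n + 1)]

def compose_vector(word, ngram_vectors, dim, n=3):
    """Average the vectors of the word's known n-grams; unknown n-grams are skipped."""
    grams = [g for g in char_ngrams(word, n) if g in ngram_vectors]
    if not grams:
        return [0.0] * dim
    return [sum(ngram_vectors[g][d] for g in grams) / len(grams) for d in range(dim)]

trigrams = char_ngrams("sleeping")
# trigrams == ['<sl', 'sle', 'lee', 'eep', 'epi', 'pin', 'ing', 'ng>']
```

Because an out-of-vocabulary word usually shares some n-grams with words seen during training, `compose_vector` can still return a non-zero vector for it.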

Input to the model contains entire word vectors and character-level n-gram vectors, which are combined and averaged (Joulin et al. 2017). Pre-trained word vectors generated from fastText using Common Crawl and Wikipedia are available for 157 languages. These models are trained using CBOW in dimension 300, with character n-grams of length 5, a window of size 5, and 10 negatives (Footnote 1).

2.4 Contextual representation models

The conventional and distributional representation approaches learn static word embeddings: after training, each word has a single, fixed representation. However, the meaning of a polysemous word varies with its context, and understanding the actual context is required for most downstream tasks in natural language processing. For example, “apple” is a fruit but usually refers to a company in technical articles. In contextualized word embeddings, the vector of a word is adapted to the input context using neural language models.

2.4.1 Embeddings from language models

The ELMo representations use vectors derived from a bidirectional LSTM (BiLSTM) trained on a large text corpus. The ELMo model effectively addresses the problem of comprehending the syntax and semantic meaning of words and the language contexts in which they are used. ELMo considers the complete sentence when assigning an embedding to each word. It employs a bidirectional design, embedding depending on the sentence's next and preceding words, as shown in Fig. 11.

Fig. 11
figure 11

The architecture of ELMo

For a sequence of N tokens (t1, t2, …, tN), the aim is to maximize the likelihood of the sequence under a language model in both directions. A forward language model computes the likelihood of the sequence by modeling the probability of token tk given the history (t1, t2, …, tk−1). A backward language model is identical to a forward language model but runs over the sequence in reverse, predicting the previous token from the future context. The forward and backward language models and the joint expression that maximizes the log probability in both directions are shown in Eqs. (22)–(24) (Peters et al. 2018).

$$p\left({t}_{1}, {t}_{2}, \dots , {t}_{N}\right)= \prod_{k=1}^{N}p\left({t}_{k} \right| {t}_{1}, {t}_{2, }\dots , {t}_{k-1})$$
(22)
$$p\left({t}_{1}, {t}_{2}, \dots , {t}_{N}\right)= \prod_{k=1}^{N}p\left({t}_{k} \right| {t}_{k+1}, {t}_{k+2, }\dots , {t}_{N})$$
(23)
$$\sum_{k=1}^{N}(\mathit{log}p\left({t}_{k} \right| {t}_{1}, {t}_{2, }\dots , {t}_{k-1})+ \mathit{log}p\left({t}_{k} \right| {t}_{k+1}, {t}_{k+2, }\dots , {t}_{N}))$$
(24)
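Eq. (24) can be evaluated directly once the two conditional distributions are available. The sketch below assumes they are supplied as callables; `forward_prob` and `backward_prob` are hypothetical stand-ins for the forward and backward LSTM language models, and a uniform distribution over a four-word vocabulary is used purely as a placeholder.

```python
import math

def bilm_log_likelihood(tokens, forward_prob, backward_prob):
    """Eq. (24): sum over k of log p(t_k | t_1..t_{k-1}) + log p(t_k | t_{k+1}..t_N)."""
    total = 0.0
    for k, tok in enumerate(tokens):
        history = tuple(tokens[:k])       # t_1 .. t_{k-1}
        future = tuple(tokens[k + 1:])    # t_{k+1} .. t_N
        total += math.log(forward_prob(tok, history))
        total += math.log(backward_prob(tok, future))
    return total

def uniform(token, context):
    """Placeholder model: every token has probability 1/4 regardless of context."""
    return 0.25

score = bilm_log_likelihood(["the", "cat", "sat"], uniform, uniform)
```

With the placeholder model, each of the three tokens contributes two terms of log 0.25, so the score equals 6 log 0.25. Training would adjust the model parameters to increase this quantity.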

2.4.2 Generative pre-training

The morphology of words in the application domain can be extensively exploited with GPT. GPT uses a unidirectional (left-to-right) transformer language model to extract features, whereas ELMo employs a BiLSTM. The architecture of GPT is shown in Fig. 12.

Fig. 12
figure 12

The architecture of GPT

A standard language modeling objective, which maximizes the likelihood of a sequence of tokens (t1, t2, …, tN), is shown in Eq. (25). The language model employs a multi-layer transformer decoder with a self-attention mechanism to predict the current word from the preceding N words (Vaswani et al. 2017). To produce an output distribution over target words, the GPT model applies a multi-headed self-attention operation over the input context tokens, followed by position-wise feed-forward layers, as shown in Eqs. (26)–(28).

$${L}_{1}\left(X\right)= \sum_{i}\mathrm{log P}\left({t}_{i} \right| {t}_{i-N}, \dots , {t}_{i-1}; \theta )$$
(25)
$${\mathrm{h}}_{0}={\mathrm{UW}}_{\mathrm{e}}+ {\mathrm{W}}_{\mathrm{p}}$$
(26)
$${\mathrm{h}}_{\mathrm{l}}=\mathrm{transformer\_block}\left({\mathrm{h}}_{\mathrm{l}-1}\right)\quad \forall \mathrm{l}\in [1,\mathrm{n}]$$
(27)
$$\mathrm{P}\left(\mathrm{u}\right)=\mathrm{softmax}({\mathrm{h}}_{\mathrm{n}}{\mathrm{W}}_{\mathrm{e}}^{\mathrm{T}})$$
(28)

Here n is the number of layers, \({W}_{e}\) is the token embedding matrix, \({W}_{p}\) is the position embedding matrix, and U is the context vector of tokens (Radford et al. 2018).
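The data flow of Eqs. (26)–(28) can be sketched in plain Python. This is a structural illustration under strong assumptions: each transformer block is passed in as a callable (an identity function in the usage below), the context U is represented by token ids so that multiplying by the embedding matrix becomes a row lookup, and all matrices are small hypothetical lists.

```python
import math

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def softmax(row):
    m = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def gpt_forward(token_ids, W_e, W_p, blocks):
    """Eqs. (26)-(28), with each transformer block supplied as a callable."""
    # Eq. (26): h_0 = U W_e + W_p (row lookup replaces the one-hot product).
    h = [[e + p for e, p in zip(W_e[t], W_p[pos])] for pos, t in enumerate(token_ids)]
    # Eq. (27): h_l = transformer_block(h_{l-1}) for l = 1..n.
    for block in blocks:
        h = block(h)
    # Eq. (28): P(u) = softmax(h_n W_e^T), reusing the token embedding matrix.
    logits = matmul(h, transpose(W_e))
    return [softmax(row) for row in logits]

# Hypothetical toy setup: vocabulary of 3 tokens, embedding size 2, one identity block.
W_e = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_p = [[0.0, 0.0], [0.0, 0.0]]
P = gpt_forward([0, 1], W_e, W_p, blocks=[lambda h: h])
```

Each row of P is a probability distribution over the vocabulary for one input position. A real block would add masked multi-head self-attention and a position-wise feed-forward layer in place of the identity function.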

2.4.3 Bidirectional encoder representations from transformers

The ELMo model takes a feature-based approach and adds the pre-trained representations as additional features. The GPT model takes a fine-tuning approach: it introduces only a minimal number of task-specific parameters and is trained on downstream tasks by fine-tuning the pre-trained parameters. The BERT model architecture is a multi-layer bidirectional transformer encoder, as depicted in Fig. 13.

Fig. 13
figure 13

BERT Architecture

BERT is optimized with masked language modeling and combines position embeddings with static word embeddings as model inputs. It follows a two-step framework of pre-training and fine-tuning. During pre-training, the model is trained on unlabeled data over several pre-training tasks. For fine-tuning, the BERT model is first initialized with the pre-trained parameters, and all parameters are then fine-tuned using labeled data from the downstream tasks (Devlin et al. 2019).

BERT uses word-piece embeddings. A special classification token [CLS] is always the first token of every sequence, and the special token [SEP] separates the sentences. BERT uses a deep, pre-trained neural network with transformer architecture to create dense vector representations for natural language. The BERT base and large models on TF Hub have L = 12/24 hidden layers (transformer blocks), a hidden size of H = 768/1024, and A = 12/16 attention heads, respectively (TensorFlow Hub).
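The use of the [CLS] and [SEP] tokens can be illustrated with a small helper that assembles a model input from already tokenized sentences. The function name and the example word pieces are hypothetical. The segment ids, which mark the sentence each token belongs to, follow the input scheme of Devlin et al. (2019) and are not described in the text above.

```python
def build_bert_input(tokens_a, tokens_b=None):
    """Assemble [CLS] A [SEP] (B [SEP]) and the matching segment ids."""
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    segment_ids = [0] * len(tokens)
    if tokens_b is not None:
        tokens += list(tokens_b) + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)
    return tokens, segment_ids

# Hypothetical word pieces; a real tokenizer would produce them from raw text.
tokens, segments = build_bert_input(["play", "##ing"], ["now"])
# tokens   == ['[CLS]', 'play', '##ing', '[SEP]', 'now', '[SEP]']
# segments == [0, 0, 0, 0, 1, 1]
```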

3 Search strategy

A comprehensive search for potentially relevant literature was undertaken in three electronic data sources (EDS), namely Institute of Electrical and Electronics Engineers (IEEE) Xplore, Scopus, and Science Direct, following the systematic guidelines of Kitchenham (2004) and Okoli and Schabram (2010). The search covered journal and peer-reviewed conference articles published from 2019 to 2021. The search included the keywords “word embedding” or Word2Vec or GloVe in conjunction with deep learning. The set of search phrases and words used for each EDS is shown in Table 1.

Table 1 Set of search phrases and words for each of the EDS

3.1 Eligibility criteria

Article eligibility and inclusion is an essential and strict inspection method for including the best potential articles in the study. The following points are defined to choose research examining the impact of word embedding models on text analytics in deep learning environments. The primary study selection criteria are categorized into inclusion criteria and exclusion criteria.

3.1.1 Inclusion criteria

  • Studies focus primarily on word embedding models that have been applied or reviewed for analytics.

  • Any analytics task, such as text classification, sentiment analysis, text summarization, and other text analysis activities utilizing word embedding models, will be included in the articles.

  • The research article from the database is selected only from the subject of computer science.

  • Research papers have been accepted and published in important and determinant peer-reviewed conferences focusing on word embedding and natural language processing and published in reputed journals.

  • Studies were published from 2019 to 2021.

3.1.2 Exclusion criteria

  • Studies not in the English language.

  • Studies focused only on understanding deep learning models, such as their architectural behaviors or motivation to utilize them.

  • Articles that do not meet the inclusion criteria are excluded.

  • Articles that were already examined in other EDS will be excluded.

Each EDS is searched for literature with the keywords “word embedding OR Word2Vec OR GloVe” and “deep learning” in the title, abstract, and keywords sections. The overall number of articles returned is very large. When the search is confined to 2019 to 2021, the number drops to 207. Further filtering is needed to ensure the quality of the review. Only articles in English are selected, and the subject area is restricted to computer science. Articles published in important and determinant peer-reviewed conferences focusing on word embedding and natural language processing and in reputed journals are included for the study's reliability and quality. The PRISMA diagram in Fig. 14 depicts the criteria for selecting articles and the records retained for review.

Fig. 14
figure 14

PRISMA diagram

The summary of articles selected for review is shown in Table 2. Nine studies are excluded as duplicates found in more than one EDS, and five studies irrelevant to this review are also excluded. The remaining 193 articles on word embedding models in conjunction with deep learning and their applications in text analytics are selected to analyze the literature and identify gaps and research directions.
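The two exclusion steps can be sketched as a small screening routine. The record layout, the matching of duplicates by normalized title, and the sample records are assumptions for illustration; the text does not state how duplicates were detected. The final line only re-checks the counts reported above.

```python
def screen(records, irrelevant_ids):
    """Return (kept, n_duplicates, n_irrelevant) after the two exclusion steps."""
    seen_titles = set()
    unique = []
    for rec in records:
        key = " ".join(rec["title"].lower().split())  # normalize case and spacing
        if key not in seen_titles:
            seen_titles.add(key)
            unique.append(rec)
    kept = [rec for rec in unique if rec["id"] not in irrelevant_ids]
    return kept, len(records) - len(unique), len(unique) - len(kept)

# Hypothetical records retrieved from different EDS.
records = [
    {"id": 1, "title": "Word Embedding for Sentiment Analysis", "eds": "Scopus"},
    {"id": 2, "title": "word embedding for  sentiment analysis", "eds": "IEEE Xplore"},
    {"id": 3, "title": "Deep Learning Architectures", "eds": "Science Direct"},
]
kept, n_dup, n_irr = screen(records, irrelevant_ids={3})

# Counts reported in the text: 207 retrieved, 9 duplicates, 5 irrelevant.
remaining = 207 - 9 - 5
```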

Table 2 Summary of articles selected for review

3.2 Data extraction process

A detailed data extraction format is prepared in the spreadsheet to minimize any bias in the data extraction process. The spreadsheet was primarily used to extract and maintain each chosen research study data. A detailed overview of the data extraction procedure is discussed in Table 3.

Table 3 Description of data extraction

3.3 Popular journals and year-wise studies

The research is restricted to important and determinant peer-reviewed conferences focusing on word embedding and natural language processing and reputed journal publications published between 2019 and 2021. The terms word embedding, deep learning, and their applications in text analytics were used in the search. Only papers that meet the inclusion and exclusion criteria are chosen for review. The study began in the fourth quarter of 2021; hence, fewer publications are included for 2021 than for 2020. More publications are expected in the coming years. Articles selected for the study are shown year-wise in Fig. 15(a). Google Trends (Footnote 2) is used to analyze word embedding and NLP topics in Google search queries worldwide from 2019 through 2021. The comparison of the search volume of queries over time is displayed in Fig. 15(b). According to recent trends, embedding techniques for natural language processing tasks have evolved significantly. The choice of an effective embedding strategy is critical to the success of an NLP task.

Fig. 15
figure 15

(a) Year-wise publication records selected for review, (b) Analysis of search query on word embedding and NLP in Google Trends

For review, articles published in important and determinant peer-reviewed conferences focusing on word embedding and natural language processing and reputed journals are chosen. It has been discovered that Elsevier publishes nearly 50% of the selected publications, almost 25% are published by IEEE, and Springer Nature publishes nearly 10%.

The Elsevier journals Information Processing and Management, Knowledge-Based Systems, and Applied Soft Computing had 34 papers selected for review, the largest share of the selected publications. IEEE Access is ranked second on the list, with 27 articles chosen for review. The third journal on the list is Springer's Neural Computing and Applications. A circular dendrogram depicting the names of the peer-reviewed conferences and journals selected for the current review by year is shown in Fig. 16. The names and abbreviations of the peer-reviewed conferences and journals are listed in Table 13 in Annexure A.

Fig. 16
figure 16

Peer-reviewed conferences and journals selected for the current review

3.4 Tools and APIs available for implementing word embedding models

This section provides an overview of the available tools and API for implementing word embedding models.

Natural Language Toolkit: Natural Language Toolkit (NLTK) (Footnote 3) is a free and open-source Python library for natural language processing. NLTK provides text processing packages for stemming, lowercasing, categorization, tokenization, spell checking, lemmatization, and semantic reasoning. It also gives access to lexical resources such as WordNet.

Scikit-learn: Scikit-learn (Footnote 4) is a Python toolkit for machine learning that supports supervised and unsupervised learning. It also includes tools for model construction, selection, assessment, and other features, such as data preprocessing. For the development of traditional machine learning algorithms, two Python libraries, NumPy and SciPy, are useful.

TensorFlow: TensorFlow (Footnote 5) is a free and open-source library for creating machine learning models. It provides a Keras-based high-level API for designing and building neural networks. TensorFlow was created by researchers on the Google Brain team to conduct machine learning and deep neural network research. Its flexible architecture enables computation to be deployed across platforms such as CPU, GPU, and TPU and makes it significantly easier for developers to transition from model development to deployment.

Keras: Keras (Footnote 6) is a Google-developed high-level deep learning API for implementing neural networks. It is written in Python and simplifies neural network implementation. It supports multiple backend frameworks for neural network computation, such as TensorFlow, Theano, and Microsoft Cognitive Toolkit. Keras allows users to deploy deep models on smartphones, in browsers, and on the Java Virtual Machine. It also allows distributed training of deep learning models on clusters of GPUs and TPUs.

PyTorch: PyTorch (Footnote 7) is an open-source machine learning framework initially created by the Facebook AI Research lab (FAIR) to speed up the transition from research development to commercial implementation. PyTorch has a user-friendly interface that allows quick, flexible experimentation and output. It supports NLP, machine learning, and computer vision technologies and frameworks. It enables GPU-accelerated tensor computations and the creation of computational graphs. At the time of writing, the most recent version of PyTorch is 1.11, which includes data loading primitives for quickly building a flexible and highly functional data pipeline.

Pandas: Pandas (Footnote 8) is an open-source Python library that provides high-performance, user-friendly data structures and data analysis tools. Pandas is applied in various scientific and corporate disciplines, including banking, business, and statistics. At the time of writing, Pandas 1.4.1 is the most recent version and is more stable with respect to regressions.

NumPy: Travis Oliphant created Numerical Python (NumPy) (Footnote 9) in 2005 as an open-source package that facilitates numerical processing with Python. It provides functions for matrices, linear algebra, and the Fourier transform. The array object in NumPy is named ndarray, and it comes with many helper functions that make it easy to work with. At the time of writing, the latest version of NumPy is 1.22.3, and it is used to interface with a wide range of databases smoothly and quickly.

SciPy: NumPy provides a multidimensional array with excellent speed and array manipulation features. SciPy (Footnote 10) is a free Python library built on NumPy. It consists of several functions that work with NumPy arrays and are helpful for a variety of scientific and engineering tasks. At the time of writing, the latest version of the SciPy toolkit is 1.8.0, and it offers excellent functions and methods for data processing and visualization.

4 Key applications of text analytics

Techniques for analyzing unstructured text include text classification, sentiment analysis, NER and recommendation systems, biomedical text mining, and topic modeling.

4.1 Text analytics

4.1.1 Text classification

Text classification is the process of categorizing texts into organized groups. Text gathered from a variety of sources offers a great deal of knowledge. It is difficult and time-consuming to extract usable knowledge from unstructured data. Text classification can be done manually or automatically, as shown in Fig. 17.

Fig. 17
figure 17

Approaches for text classification

Automatic text classification is becoming progressively essential due to the availability of enormous corpora. Automatic text classification can be done using either a rule-based or data-driven technique. A rule-based technique uses domain knowledge and a set of predefined criteria to classify text into multiple groups. Text is organized using a data-driven approach based on data observations. Machine learning or deep learning algorithms can be used to discover the intrinsic relationship between text and its labels based on data observation.

A data-driven technique that relies solely on handcrafted features fails to extract relevant knowledge from a large dataset. An embedding technique maps the text into a low-dimensional feature vector, which aids in extracting relationships and meaningful knowledge (Dhar et al. 2020).
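As a minimal sketch of how an embedding maps text to a low-dimensional feature vector for classification, the snippet below averages word vectors into a document vector and assigns the label of the most similar class centroid. The two-dimensional embeddings, the centroids, and the nearest-centroid rule are hypothetical choices made for illustration. The studies reviewed here use trained embeddings and neural classifiers instead.

```python
import math

def average_embedding(tokens, embeddings, dim):
    """Map a token list to one feature vector by averaging the known word vectors."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return [0.0] * dim
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def classify(tokens, embeddings, centroids, dim):
    """Return the label whose centroid is most similar to the document vector."""
    doc = average_embedding(tokens, embeddings, dim)
    return max(centroids, key=lambda label: cosine(doc, centroids[label]))

# Hypothetical 2-dimensional embeddings and class centroids.
EMB = {"goal": [1.0, 0.1], "match": [0.9, 0.2], "stock": [0.1, 1.0], "market": [0.2, 0.9]}
CENTROIDS = {"sports": [1.0, 0.0], "finance": [0.0, 1.0]}
label = classify(["the", "stock", "market"], EMB, CENTROIDS, dim=2)
```

Words without an embedding, such as "the" in this toy vocabulary, are ignored when the document vector is formed.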

4.1.2 Sentiment analysis

Sentences can be articulated in a variety of ways. They may express various emotions, judgments, visions or insights, or people's perspectives. The meaning of individual words has an impact on readers and writers. The writer uses specific words to communicate feelings, and readers strive to interpret the emotion depending on their ability to analyze. Deep learning systems have already demonstrated outstanding performance in NLP applications such as sentiment classification and emotion detection on many datasets. These models do not require predefined, hand-selected features. Instead, they learn advanced representations of the input datasets on their own (Dessì et al. 2021). Sentiment analysis techniques are divided into lexicon-based approaches, machine-learning approaches, and a combination of the two (Mohamed et al. 2020). The internet is an unorganized and rich source of knowledge that contains many text documents offering thoughts and reviews. Personal decisions, businesses, and institutions can benefit from sentiment recognition (Onan 2021).
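The lexicon-based family of approaches can be illustrated with a few lines of Python. The lexicon, the negator list, and the scoring rule are hypothetical and intentionally tiny. Practical systems rely on curated sentiment lexicons and richer handling of negation and intensifiers.

```python
# Hypothetical polarity lexicon and negators, for illustration only.
LEXICON = {"good": 1, "great": 2, "excellent": 2, "bad": -1, "poor": -1, "terrible": -2}
NEGATORS = {"not", "never", "no"}

def lexicon_sentiment(text):
    """Sum word polarities; a negator flips only the token immediately after it."""
    score = 0
    negate = False
    for raw in text.lower().split():
        token = raw.strip(".,!?;:")
        if token in NEGATORS:
            negate = True
            continue
        polarity = LEXICON.get(token, 0)
        score += -polarity if negate else polarity
        negate = False
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

For example, the sentence "The food was great!" is scored as positive and "not good" as negative. A phrase such as "not very good" is misclassified by this simple rule, which is one reason why learned representations are preferred.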

4.1.3 Named entity recognition

A named entity is a word or phrase that distinguishes one object from a set of entities that share similar features. It restricts the range of entities that describe a subject by using one or more restrictive identifiers. The term Named Entity was first used at the sixth Message Understanding Conference to describe the problem of recognizing names of organizations, persons, and geographic locations in text, as well as currency, time, and percentage expressions. Interest in NER then surged, with numerous researchers devoting significant time and effort to the subject (Grishman and Sundheim 1996), (Nasar et al. 2021). The extraction of intelligent information from text relies heavily on NER. The NER task is difficult due to the polymorphemic behavior of many words (Khan et al. 2020). NER is used in various NLP applications, including text interpretation, information extraction, question answering, and automatic text summarization. Four main approaches are used in NER: (1) rule-based approaches, which rely on hand-crafted rules; (2) unsupervised learning methods, which use unsupervised algorithms rather than hand-labeled training instances; (3) feature-based supervised learning techniques, which rely on supervised learning algorithms with carefully engineered features; and (4) deep-learning-based techniques, which generate the representations necessary for classification and identification from the training dataset in an end-to-end way.
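The first of the four approaches, hand-crafted rules, can be sketched with regular expressions for currency, percentage, and time expressions. The patterns and labels below are illustrative assumptions that cover only simple surface forms. Names of persons, organizations, and locations usually require gazetteers or learned models.

```python
import re

# Hypothetical hand-crafted rules: one regular expression per entity type.
PATTERNS = [
    ("MONEY", re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?")),
    ("PERCENT", re.compile(r"\d+(?:\.\d+)?%")),
    ("TIME", re.compile(r"\b\d{1,2}:\d{2}\b")),
]

def rule_based_ner(text):
    """Return (entity_text, label, start_offset) tuples in order of appearance."""
    entities = []
    for label, pattern in PATTERNS:
        for match in pattern.finditer(text):
            entities.append((match.group(), label, match.start()))
    return sorted(entities, key=lambda entity: entity[2])

found = rule_based_ner("Shares rose 4.5% to $1,250.75 at 09:30.")
```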

4.1.4 Biomedical text mining

Healthcare experts struggle to classify diseases based on the available data. Clinical named entities must be recognized to assess massive electronic medical records effectively. Conventional rule-based systems require a significant amount of human effort to create rules and vocabularies, whereas machine learning-based approaches require time-consuming feature extraction. Deep learning models such as LSTM with a conditional random field (CRF) have performed well on several datasets. Clinical named entity recognition identifies specific concepts, such as medical tests and therapies, in unstructured text. It is crucial for converting unstructured electronic medical record material into structured medical information (Yang et al. 2019).

4.1.5 Topic modeling

Topic modeling aims to discover how the underlying document collections are structured. Topic models were first created to retrieve information from massive document collections. Without relying on metadata, topic models can be used to explore sets of journals by article subject. LSA uses SVD to extract the fundamental themes from a term-document matrix, resulting in mathematically independent topics. Similar to how principal component analysis reduces the number of features in a prediction task, topic models are essentially a compression technique that maximizes topic variance on a simplified representation of a document collection (Zhao et al. 2021). Text classification organizes text to extract valuable information from it, whereas topic modeling determines an abstract topic for a group of texts or documents. Topic modeling is commonly used to extract semantic information from textual material (Kumar et al. 2021).

4.2 Datasets used for text analytics

This section outlines the datasets commonly used for text analytics purposes, as shown in Table 4. Researchers have offered several text analytics datasets. Text classification, sentiment analysis, NER, recommendation systems, and topic modeling are among the application fields found in the literature. The overview of attributes in terms of application area, datasets, model architecture, embedding methods, and performance evaluation are illustrated in Annexure A.

Table 4 Dataset used for text analytics purpose

Amazon dataset: Customer reviews of products purchased through the Amazon website are included in the dataset. The dataset consists of binary and multiclass classifications for review categories. The data is arranged into training and testing sets for both product classification categories.

Arabic news datasets: The Arabic newsgroups dataset contains documents posted to several newsgroups on various themes. Different versions of this dataset are used for text classification, text clustering, and other tasks. The Arabic news texts corpus is organized into nine categories: culture, diversity, economy, international news, local news, politics, society, sports, and technology. It contains 10,161 documents with a total of 1.474 million words.

Fudan dataset: This is an image database containing pedestrian detection images. The photographs were taken in various locations around campus and on city streets. At least one pedestrian will appear in each photograph. The heights of tagged pedestrians lie between (180, 390) pixels. All of the pedestrians who have been classified are standing up straight. There are 170 photos in all, with 345 pedestrians tagged, with 96 photographs from the University of Pennsylvania and 74 from Fudan University.

i2b2: Informatics for Integrating Biology & the Bedside (i2b2) is a fully accessible clinical data processing and analytics exploration platform allowing heterogeneous healthcare and research data to be shared, integrated, standardized, and analyzed. All labeled and unannotated, de-identified hospital discharge reports are provided for academic purposes.

Movie review dataset: The movie review dataset is a set of movie reviews created to identify the sentiment of each review and decide whether it is favorable or unfavorable. There are 10,662 sentences, with an equal number of negative and positive examples.

Yelp dataset: Two sentiment analysis tasks are included in the Yelp dataset. One task (Yelp-5) predicts sentiment labels with finer granularity. The other (Yelp-2) predicts positive and negative polarity. Yelp-5 has 650,000 training samples and 50,000 testing samples, while Yelp-2 has 560,000 training samples and 38,000 testing samples for the negative and positive classes.

SemEval: SemEval is a domain-specific dataset with reviews of laptops and restaurant services thoroughly annotated by humans. The SemEval dataset is frequently used when the overall aspect of a sentence, section, or text span is of interest, irrespective of the entities or their characteristics. The dataset comprises over three thousand reviews in English for each product category.

Sogou dataset: The Sogou news dataset combines the news corpora from SogouCA and SogouCS. This Chinese dataset includes around 2.7 billion words and is published by a Chinese commercial search engine.

Stanford Sentiment Treebank (SST) dataset: The SST dataset is a more extended version of the movie review data. The SST1 includes fine-grained labels in a multiclass movie review dataset with training, testing, and validation sets. The binary label dataset in SST2 is split into three sections: training, testing, and validation.

Twitter dataset: With the tremendous increase in online social networking websites like blogs, vital information in sentiments, thoughts, opinions, and epidemic outbreaks is being conveyed. Twitter generates vast data about epidemic outbreaks, customer reviews about the product, and survey information. The Twitter Streaming API can be used to obtain a dataset from Twitter that includes disease information and a geographical study of Twitter users.

Wikipedia: Wikipedia pages are taken as the corpus to train the model. The preprocessing operations on the pages extract helpful information such as an article abstract. Processing takes place using a dictionary of selected terms.

WordSim: WordSim is a set of tests for determining the similarity or relatedness of words. The WordSim353 dataset consists of two groups: the first contains 153 word pairs with similarity scores assigned by 13 subjects, and the second contains 200 word pairs with relatedness scores assigned by 16 subjects.
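A word-pair benchmark of this kind is commonly used by comparing the cosine similarity of two word vectors with the human score and reporting a rank correlation over all pairs. The sketch below assumes Spearman correlation as the agreement measure, which the text above does not specify, and uses hypothetical two-dimensional embeddings and scores.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            result[order[k]] = mean_rank
        pos = end + 1
    return result

def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    return pearson(ranks(x), ranks(y))

def evaluate(pairs, embeddings):
    """pairs: (word1, word2, human_score); pairs with unknown words are skipped."""
    model, human = [], []
    for w1, w2, score in pairs:
        if w1 in embeddings and w2 in embeddings:
            model.append(cosine(embeddings[w1], embeddings[w2]))
            human.append(score)
    return spearman(model, human)

# Hypothetical embeddings and human scores.
EMB = {"cat": [1.0, 0.0], "dog": [0.9, 0.1], "car": [0.0, 1.0]}
PAIRS = [("cat", "dog", 9.0), ("cat", "car", 1.0), ("dog", "car", 2.0)]
rho = evaluate(PAIRS, EMB)
```

A correlation close to 1 indicates that the embedding ranks the word pairs in the same order as the human annotators.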

5 Review on text analytics, word embedding application, and deep learning environment

For many domains, researchers have created numerous text analytics models. When creating text analytics models, the primary concern that comes to mind is “what type of embedding method is suited for which application area and the appropriate deep learning strategy”. A description of various text analytics strategies with different embedding methods and deep learning algorithms is shown in Annexure A. It depicts the multiple approaches utilized and their performance as a function of the application domain.

5.1 Text classification

Text categorization issues have been extensively researched and solved in many real-world applications. Text classification is the process of grouping together texts like tweets, news articles, and customer evaluations. The construction of text classification and document classification techniques includes extracting features, dimension minimization, classifier selection, and assessments (Jang et al. 2020). Recent advances have focused on learning low-dimension and continuous vector representations of words, known as word embedding, which may be applied directly to downstream applications, including machine translation, natural language interpretation, and text analytics (El-Alami et al. 2021) (Elnagar et al. 2020). Word embedding uses neural networks to represent the context and relationships between the target word and its context words (Almuzaini and Azmi 2020). An attention mechanism and feature selection using LSTM and character embedding achieve an accuracy of 84.2% in classifying Chinese text (Zhu et al. 2020b). Deep feedforward neural network with the CBOW model achieves an accuracy of 89.56% for fake consumer review detection (Hajek et al. 2020).

LSTM with the Word2Vec model achieves an F1-score of 98.03% for word segmentation in the Arabic language (Almuhareb et al. 2019). Neural network-based word embedding efficiently models a word and its context and has become one of the most widely used methods of distributed word representation (N.H. Phat and Anh 2020), (Alharthi et al. 2021).

Machine learning algorithms such as the Naive Bayes classifier (NBC), support vector machine (SVM), decision tree (DT), and random forest (RF) have been widely used for information retrieval, document categorization, image, video, and human activity classification, bioinformatics, and safety and security (Shaikh et al. 2021). A deep learning model combining CNN and GloVe embedding improves citation screening and achieves an accuracy of 84.0% (V Dinter et al. 2021). To classify meaningful information into various categories, the deep learning model GRU with GloVe embedding achieves an accuracy of 84.8% (Zulqarnain et al. 2019). Information retrieval systems are applications that commonly use text classification methods (Greiner-Petter et al. 2020), (Kastrati et al. 2019). Text classification can be used for a variety of purposes, such as the classification of news articles (Spinde et al. 2021), (Roman et al. 2021), (Choudhary et al. 2021), (de Mendonça and da Cruz Júnior 2020), (Roy et al. 2020). The performance of Word2Vec, GloVe, and fastText is compared to match the corresponding activity pair. The experimental evaluation shows that the fastText embedding approach achieves an F1-score of 91.00% (Shahzad et al. 2019). Extracting meta-textual features and word-level features using the BERT approach gains an accuracy of 95% for classifying insincere questions on question-answering websites (Al-Ramahi and Alsmadi 2021). CNN with the Word2Vec model achieves an accuracy of 90% for text classification tasks (Kim and Hong 2021), (Ochodek et al. 2020). It is challenging to extract discriminative semantic characteristics from text that contains polysemic words. Capsule networks with routing-on-hyperplane (HCapsNet) are proposed to construct a vectorized representation of semantics and to use hyperplanes to decompose each capsule and acquire the individual senses. Experimental investigation of a dynamic routing-on-hyperplane approach utilizing Word2Vec for text classification tasks like sentiment analysis, question classification, and topic classification reveals that HCapsNet achieves the highest accuracy of 94.2% (Du et al. 2019). A hierarchical attention network based on Word2Vec embedding achieves an accuracy of 84.57% for detecting fraud in an annual report (Craja et al. 2020). Text classification by transferring knowledge from one domain to another using LSTM and the Word2Vec embedding model achieves an accuracy of 90.07% (Pan et al. 2019a). Text classification is also applied to the analysis of social media tweets (Hammar et al. 2020). Domain-specific word embedding outperforms the BERT embedding model and achieves an F1-score of 94.45% (Grzeça et al. 2020), (Zuheros et al. 2019), (Xiong et al. 2021). Ensemble deep learning model with RoBERT embedding achieves an accuracy of 90.30% to classify tweets for information collection (Malla and Alphonse 2021), (Hasni and Faiz 2021), (Zheng et al. 2020). CNN with a domain-specific word embedding model achieves an F1-score of 93.4% to classify tweets into positive and negative (Shin et al. 2020).

Text categorization algorithms have been successfully applied to Korean, French, Arabic, Tigrinya, and Chinese for document and tweet classification (Kozlowski et al. 2020), (Jin et al. 2020). CNN with the CBOW model achieves an accuracy of 93.41% for classifying text in the Tigrinya language (Fesseha et al. 2021). LSTM with Word2Vec achieves 99.55% for tagging morphemes in the Arabic language (Alrajhi and ELAffendi 2019). With Word2Vec, CNN achieves an accuracy of 96.60% on Chinese microblogs. This result demonstrates that vectors employing Chinese characters as feature components produce better accuracy than word-level vectors (Xu et al. 2020). The lexical consistency of the Hungarian language can be improved by embedding techniques based on sub-word units, such as character n-grams and lemmatization (Döbrössy et al. 2019). To accurately assess pre-trained word embeddings for downstream tasks, it is necessary to capture word similarity. Traditionally, similarity is evaluated by comparison with human judgment. A Wikipedia Agent Using Local Embedding Similarities (WALES) is proposed as an alternative and valuable metric for evaluating word similarity. The WALES metric depends on an agent traversing the Wikipedia hyperlink graph. A performance evaluation of this graph-based technique on English Wikipedia demonstrates that it effectively measures similarity without explicit human labeling (Giesen et al. 2022). A Doc2Vec embedding model is used to extract features from the text and pass them through a CNN for classification. The experimental evaluation on the Turkish Text Classification 3600 (TTC-3600) dataset shows that the model efficiently classifies the text with an accuracy of 94.17% (Dogru et al. 2021). LSTM with CBOW achieves an accuracy of 90.5% for comparing the semantic similarity between words in the Chinese language (Liao and Ni 2021). 
The review of text classification techniques in terms of data source, application area, datasets, and performance evaluation is illustrated in Table 7 of Annexure A.
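
Because the studies surveyed above report either accuracy or F1-score, their figures are not directly comparable. The following minimal Python sketch shows how both measures are derived from the same binary confusion counts. The function names and the toy labels are illustrative assumptions and are not taken from any of the cited works.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary task."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    return tp, fp, fn, tn


def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall, and F1 from confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical, imbalanced toy labels: 3 positive and 7 negative documents.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(classification_metrics(*confusion_counts(y_true, y_pred)))
```

On these toy labels the accuracy is 0.8 while the F1-score is only 0.5, which illustrates why an accuracy reported on an imbalanced dataset should be read with care alongside F1-scores reported elsewhere.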

5.2 Sentiment analysis

Sentiment analysis determines the sentiment and perspective of points of view in textual data. The problem can be expressed as a binary or multi-class problem. Multi-class sentiment analysis divides texts into fine-grained categories or multilevel intensities, whereas binary sentiment analysis divides texts into positive and negative classes (Birjali et al. 2021). Social communication platforms such as websites, which include comments, discussion forums, blogs, microblogs, and Twitter, are among the sources for sentiment analysis. Sentiment analysis provides information on what customers like and dislike, and the company better understands its product's qualities (Liu et al. 2021b). Using lexicon-based and Word2Vec embedding and a Bidirectional enhanced dual attention model, the aspect-based sentiment analysis task gets an F1-score of 87.21% (Rida-e-fatima et al. 2019). Sentiment analysis includes emotion classification, qualitative or quantitative analysis, and opinion extraction. Consumer data are evaluated to actively analyze public opinion and aid decision-making (Harb et al. 2020), (Vijayvergia and Kumar 2021). Sentiments and opinion analyses are examined at the document level, sentence level, or aspect level (Liu and Shen 2020), (Alamoudi and Alghamdi 2021). Using a hybrid framework of Word2Vec, GloVe, and BOW with an SVM classifier, an extended ensemble sentiment classifier approach achieves an accuracy of 92.88% (Mohamed et al. 2020). Sentiment analysis efficiently determines customer opinion to analyze patient mental health via social media posts (Dadkhah et al. 2021), (Agüero-Torales et al. 2021), (Sharma et al. 2021). An LSTM model with imitated and polarised word embedding yields an F1-score of 96.55% for human–robot interaction (Atzeni and Reforgiato Recupero 2020).

The advancement of big data, cloud technology, and blockchain has broadened the scope of applications, allowing sentiment analysis to be employed in virtually any subject. Customers' impressions of goods or services are evaluated to make informed decisions (Ayu and Khotimah 2019), (Onan 2021). Bidirectional GRU with refined global word embedding achieves an F1-score of 91.3% for the sentiment analysis task (Wang et al. 2021a). Aspect-based sentiment analysis for Arabic/Korean/Russian/Turkish language can efficiently classify text into lexicon-based, machine learning-based, and deep learning-based categories (Song et al. 2019), (Smetanin and Komarov 2021), (Kilimci and Duvar 2020), (Alwehaibi et al. 2021). Sentiment analysis on Arabic Twitter data using domain-specific embedding and the CNN model achieves an accuracy of 73.86% (Fouad et al. 2020).

Despite the many mood and emotion recognition approaches, researchers confront significant problems, such as handling context, sarcasm, statements expressing multiple emotions, evolving Web jargon, and semantic and grammatical ambiguity (Naderalvojoud and Sezer 2020). Establishing an effective technique to express people's feelings and emotions is a time-consuming undertaking (Hao et al. 2020), (Naderalvojoud and Sezer 2020). In a low-resource language, extracting numerous features and emotions from a multi-opinion statement is challenging. Word embedding approaches are used to acquire meanings, compare text, and determine the text's relevance for decision-making (Wang et al. 2021c). Profanity detection using LSTM and fastText achieves an accuracy of 96.15% (Yi et al. 2021). Contextualized word embedding is based on the context of a particular word, and its representation changes dynamically depending on the context. The use of a word embedding strategy in conjunction with deep learning models can detect hate, toxicity, irony, and objectionable content in text and assign it to a specific category (Kapil and Ekbal 2020), (Alatawi et al. 2021), (González et al. 2020), (Beddiar et al. 2021). Machine learning and deep learning models such as DT, RF, multilayer perceptron (MLP), CNN, LSTM, and BiLSTM are compared in terms of performance using Word2Vec, BERT, and a domain-specific embedding technique. The LSTM model with domain-trained embedding achieves an accuracy of 95.7% in detecting whether reviews on social media contain toxic comments (Dessì et al. 2021). An offensive stereotype technique is suggested as a systematic way to detect hate speech and profanity on social media platforms. The proposed method locates the quantitative indicator of bias in the pre-trained embedding model, which effectively classifies the text as containing hate speech (Elsafoury et al. 2022). The prejudices connected to various social categories are investigated. 
The study demonstrates how the biases associated with multiple social categories are mitigated and how they overlap over a one-dimensional subspace for each individual (Cheng et al. 2022). Metric learning maps data into an embedding space in which similar items lie close to each other and dissimilar items lie far apart. It is suggested that a pre-trained transformer-based language model be used in a self-supervised manner to generate appropriate sentence embedding. Deep Contrastive Learning for Unsupervised Textual Representations (DeCLUTR) requires fewer trainable parameters. The universal sentence encoder performed well in the unsupervised evaluation of the SentEval task (Giorgi et al. 2021). A deep canonical correlation analysis-based network called the Interaction Canonical Correlation Network is suggested to learn correlations between text, audio, and video. The features retrieved from all three modalities are then used to create the multimodal embedding, which is used for multimodal sentiment analysis and emotion recognition. On the CMU-MOSI movie review dataset, the suggested network attains the best accuracy of 83.07% (Sun et al. 2020b). An unordered structure model is suggested to build phrase embedding for sentiment analysis tasks in various Arabic dialects, independent of the order and grammar of the context's words. On the Arabic Twitter Dataset, the suggested method outperforms others in classifying the sentiment of various dialects with an accuracy of 88.2% (Mulki et al. 2019). To learn the contextual word relationships within each document and to support the inductive learning of new words, a graph neural network (GNN) is built for each document and generates the embedding for all the words in the document. TextING with GloVe is used for inductive learning utilising the GNN. The experimentation is performed on four datasets: the movie reviews dataset, the Reuters newswire 8-category (R8) and 52-category (R52) datasets, and the cardiovascular diseases dataset. 
The result shows that the TextING approach achieves the highest accuracy of 98.04% on the R8 dataset in modeling local word-word relations and word significances in the text (Zhang et al. 2020). To predict Bitcoin price using text sentiment, the LSTM model with fastText embedding achieves the highest accuracy of 89.13% compared to Word2Vec and GloVe with RNN and CNN (Kilimci 2020). Compared to GloVe and ELMo with LSTM, the CNN model with BERT embedding extracts linguistic and psycholinguistic information with an accuracy of 72.10% to detect a person's personality (El-Demerdash et al. 2022), and the multilayer CNN model with BERT embedding achieves 80.35% (Ren et al. 2021). The review of sentiment analysis techniques in terms of data source, application area, datasets, and performance evaluation is illustrated in Table 8 of Annexure A.

5.3 Biomedical text mining

Integrating deep learning and an NLP model in a healthcare environment improves diagnosis. Massive amounts of health-related information are available for processing, including digital text in electronic health records (EHR), medical text on social networks, and text in computerized reports. Image annotation and labeling are done using medical images and radiological reports. NLP can be used to complete annotations and labeling in less time with less effort. NLP assists in extracting relationships between entities, allowing for a more accurate medical diagnosis (Pandey et al. 2021), (Moradi et al. 2020). The biomedical literature's unique character, quantity, and complexity present challenges for automated classification algorithms. In a multilabel situation, word embedding techniques can be helpful for biomedical text categorization. Medical Subject Headings (MeSH) are represented as ontologies, giving machine-readable labels and specifying the issue space's dimensionality. ELMo embedding-based automated biomedical literature classification efficiently classifies biomedical text and gets an F1-score of 77% (Koutsomitropoulos and Andriopoulos 2021). A biomedical word sense disambiguation strategy using the BiLSTM model obtains a macro average of 96.71% to improve medical text classification (Li et al. 2019b). The BiLSTM model with Word2Vec embedding yields an F1-score of 98% when handling acronyms within the text and classifying the text into the respective diseases (Magna et al. 2020).

The performance of the deep contextualized attention BiLSTM model utilizing ELMo, fastText, Word2Vec, GloVe, and TF-IDF is compared. The BiLSTM model correctly classifies malignant and normal cells with an accuracy of 86.3% (Jiang et al. 2020a). Using an ontology-based strategy to preserve data-driven and knowledge-driven information in pre-trained embedding enhances the model's similarity measure (Racharak 2021). Domain-specific embedding is used for disease diagnosis to analyze patients' medical inquiries and structured symptoms. The fusion-based technique obtains the maximum accuracy of 84.9% and effectively supports telemedicine for meaningful drug prescriptions (Faris et al. 2021).

The LSTM with the CBOW model achieves the highest accuracy of 94% in recognizing disease-infected people from tweets about disease outbreaks on online social networking sites (OSNS) (Amin et al. 2020). Colloquial phrases are collected from tweets available on OSNS using BERT embedding, and the model achieves an accuracy of 89.95% in categorizing health information (Kalyan and Sangeetha 2021). An attention-based BiLSTM-CRF (Att-BiLSTM-CRF) model with ELMo achieves an F1-score of 88.78% in analyzing electronic health information for the clinical named entity recognition (CNER) challenge (Yang et al. 2019). Similarly, BiLSTM with CRF and BERT embedding achieves an F1-score of 98.32% for the CNER task (Catelli et al. 2021). EHR analysis for identifying cause and effect relationships using CNN and Att-BiLSTM models achieves an F1-score of 52% (Akkasi and Moens 2021). The use of the domain-specific embedding BioWordVec improves visual prognostic predictions from EHR and reaches a 99.5% accuracy (Wang et al. 2021b). The domain-specific embedding ClinicalBERT enhances the performance of EHR categorization into clinical and non-clinical categories (Goodrum et al. 2020), (Pattisapu et al. 2019). Multi-label classification of health records using bidirectional GRU (BiGRU) and ELMo achieves an accuracy of 63.16% and enhances the EHR classification based on diseases (Blanco et al. 2020). BiLSTM with CRF and GloVe embedding achieves an F1-score of 75.62% for biomedical NER tasks (Ning and Bai 2021). For Spanish clinical cases, domain-specific embedding achieves an F1-score of 90.84% to improve NER (Akhtyamova et al. 2020). The CNN with Word2Vec embedding achieves an accuracy of 90.20% in predicting a therapeutic peptide's illness (Wu et al. 2019). A deep learning model such as CNN with Word2Vec embedding achieves an accuracy of 90.31% for predicting protein family (Yusuf et al. 2021). 
For type III secreted effector prediction, a model combining CNN and Word2Vec embedding with a position-specific scoring matrix for feature extraction obtains an accuracy of 81.20% (Fu and Yang 2019). An enhancer prediction model comprising CNN with Word2Vec embedding achieves an accuracy of 77.50% for detecting eukaryotic gene expression (Khanal 2020). An enhancer prediction model made up of a sequence generative adversarial network (GAN) with a Skip-Gram model obtains an accuracy of 95.10% (Yang et al. 2021b). A model comprising an Att-CNN and BiGRU with Word2Vec embedding yields an accuracy of 92.14% in predicting chromatin accessibility (Guo et al. 2020). A model utilizing BERT with language embedding obtains an accuracy of 94% in detecting adverse medication events (Fan et al. 2020). The review of biomedical text mining techniques in terms of data source, application area, datasets, and performance evaluation is illustrated in Table 9 of Annexure A.

5.4 Named entity recognition and recommendation system

Information retrieval, question answering, machine translation, and other downstream applications use NER as a pre-processing step. In an end-to-end multitasking context, word embedding methods like Word2Vec and fastText are used to improve speech translation (Chuang et al. 2021). Cross-domain adversarial learning models comprising CNN, BiLSTM, and Word2Vec embedding are utilized to categorize the information from EHR available in the Chinese language and achieve an F1-score of 74.39% (Wen et al. 2020). The Chinese word embedding-based model with LSTM attains an F1-score of 95.53% in understanding the semantics of words and efficiently analyzing the features (Zhang et al. 2021). A domain-specific word embedding approach with a fuzzy metric that focuses on a named entity recognition task is proposed to adapt cooking recipes from a set of all available recipes. The model achieves 95% confidence in selecting appropriate recipes (Morales-Garzón et al. 2021). For the Chinese clinical NER task, the LSTM, CRF, and BERT models obtain an accuracy of 91.60% for EHR categorization (Li et al. 2020b). An LSTM with the domain-specific word embedding Tex2Vec is utilized to extract valuable insights from Urdu literature and attains an F1-score of 81.10% (Khan et al. 2020). The BiLSTM with BERT embedding yields a greater accuracy of 90.84% than the ELMo or GloVe embedding model in performing biochemical named entity identification tasks (Liu et al. 2021a). BiLSTM with domain-specific embedding defined for clinical de-identification on COVID-19 Italian data gains a micro F1-score of 94.48% (Catelli et al. 2020). The localization of software bugs using GloVe and the POS tagging methodology achieves a maximum average precision of 30.70% (Liu et al. 2019). A single neural network model that jointly learns the tasks of POS and semantic annotation is proposed to enhance the performance of existing rule-based systems for the Welsh language. 
The proposed approach achieves an accuracy of 99.23% for multitask taggers and improves out-of-vocabulary coverage for the Welsh language using fastText pre-trained embedding (Ezeani et al. 2019). The discrete nature of the text is handled using a GAN2vec technique. The suggested method produces real-valued vectors like the Word2Vec paradigm. The experimental GAN2vec evaluation on the dataset of Chinese poetry yields a BLEU score of 66.08% (Budhkar et al. 2019). An ensemble approach is suggested to classify brief text sequences from the texts of various Arabic-speaking nations. The results of the experiments demonstrate that the performance of the proposed ensemble model is comparable to the Prediction by Partial Matching (PPM) character language model. It obtains an F1-score of 63.4% on the Arabic Dialect Corpus dataset (Lippincott et al. 2019). A sparse self-attention LSTM (SSALSTM) approach is proposed to learn sentiment lexicons from Twitter. The method employs a self-attention approach to determine the sentiment polarity associated with each word, demonstrating that the sparse characters are semantically and emotionally equivalent. The suggested SSALSTM approach effectively determines sentiment polarity and is helpful for named entity recognition. The sentiment-aware word embedding is used for evaluation on the SemEval dataset, which shows that the SSALSTM approach achieves an accuracy of 84.32% in generating the sentiment lexicon (Deng et al. 2019). To recognize a software flaw on large datasets, BiGRU with Dec2Vec yields an F1-score of 96.11%, whereas fastText performs better on short datasets (Jeon and Kim 2021). Drug name extraction and recognition from the text for clinical application are performed using BiLSTM, CNN with CRF, and Sense2Vec embedding and achieve an F1-score of 80.30% (Suárez-Paniagua et al. 2019). 
The CNN model and Word2Vec embedding create an efficient recommender system for e-commerce applications based on user preferences with an RMSE of 0.863 (Khan et al. 2021). For a word-level NER task on code-mixed English and Hindi text, a multichannel neural network model consisting of BiLSTM and Word2Vec embedding gets an F1-score of 83.90% (Shekhar et al. 2019). A hierarchical attention network for reviewing toys and games products extracts meaning at the word and sentence level and obtains an accuracy of 85.13% (Yang et al. 2021a). An attention distribution directed information transmission network gets the lowest mean square error of 1.031% (Sun et al. 2020a). Deep learning models are applied to collect relevant characteristics from product reviews on musical instruments, and for the item recommendation task, the model obtains a mean absolute error of 9.04% (Dau et al. 2021). The Word2Vec model recognizes entities in Chinese news articles and performs public opinion orientation analysis with an accuracy of 87.23% for the product assessment and recommendation task (Wang et al. 2019). A deep learning model such as CNN with Skip-Gram embedding achieves a 94% accuracy for question categorization and entity identification on a Turkish question dataset (Kapil and Ekbal 2020). The review of NER techniques and recommendation systems in terms of data source, application area, datasets, and performance evaluation is illustrated in Table 10 of Annexure A.
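
The recommendation studies above report error measures (RMSE, mean square error, and mean absolute error) rather than accuracy. As a minimal sketch of how these quantities relate, the following Python code computes all three for predicted versus actual ratings. The function names and the rating values are hypothetical and are not drawn from the cited studies.

```python
import math


def mean_squared_error(actual, predicted):
    """Average of squared differences between actual and predicted ratings."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Square root of the MSE, expressed in the same unit as the ratings."""
    return math.sqrt(mean_squared_error(actual, predicted))


def mean_absolute_error(actual, predicted):
    """Average of absolute differences; less sensitive to large errors than RMSE."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical ratings on a 1-5 scale.
actual = [4, 3, 5]
predicted = [3, 3, 3]
print(mean_squared_error(actual, predicted))
print(root_mean_squared_error(actual, predicted))
print(mean_absolute_error(actual, predicted))
```

Since RMSE squares each error before averaging, a single large miss raises it more than it raises the mean absolute error, so the two measures can rank recommender models differently.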

5.5 Topic modelling

The technique of providing an overview of the themes mentioned in documents is known as topic modeling. For topic modeling and recommendation tasks, the semantic similarity of word vectors is employed to extract keywords. Word2Vec effectively expresses the relationship between job and worker, improving the system's overall performance (Pan et al. 2019b). An ontology-based word embedding is utilized to extract key geoscience terms and gets an F1-score of 40.7% (Qiu et al. 2019). A CNN with Word2Vec is used for bug localization to the associated bug file and yields an accuracy of 81.00% (Xiao et al. 2018). In topic modeling, the Lead2Trend embedding achieves an accuracy of 80% compared to the Skip-Gram model of Word2Vec embedding (Dridi et al. 2019). A multimodal word representation model achieves an accuracy of 78.23%, utilizing syntactic and phonetic information (Zhu et al. 2020a). The feelings and views connected with text in Arabic subjects are utilized for efficient sentiment analysis and topic modeling (Nassif et al. 2021). The learning of bilingual word embeddings (BWE) for the Arabic to English (Ar-En) language pair is investigated using the Bilingual Bag-of-Words without Alignment (Bil-BOWA) model. This model considers different morphological segmentations and various training settings, including sentence length and embedding size. Experimental evaluation shows that increasing the size of word embedding enhances the learning process of Ar-En BWE (Alqaisi and O’Keefe 2019). It is suggested to use multilingual word embedding to represent the lexicon of many languages. The proposed BilLex is tested against English, French, and Spanish texts to pinpoint the precise fine-grained word alignment based on lexical meanings. The outcome demonstrates the BilLex application's effectiveness in obtaining the cross-lingual equivalents of words and sentences in other languages (Shi et al. 2019). 
As part of the Multi-Arabic Dialect Applications and Resources (MADAR) shared challenge, LSTM with fastText predicts the Arabic dialect from a collection of Arabic tweets with an accuracy of 50.59% (Talafha et al. 2019). Urdu is a low-resource language that needs a framework for interpretable subject modeling. Pre-trained embedding models, like Word2Vec and BERT, perform well when applied to datasets of Urdu tweets, demonstrating their effectiveness in classifying the text into useful topics (Nasim 2020). For Chinese and English language datasets, a topic modeling based item recommendation approach using sense-based embedding obtains the smallest RMSE of 0.0697 (Xiao et al. 2019). Software vulnerability identification from a vast corpus using domain-specific word embedding achieves 82% accuracy in identifying admitted coding errors (Flisar and Podgorelec 2019). The subject evolution study of scientific literature utilizing Word2Vec and geographical correlation yields a better result, with an RMSE of 3.259 for the spatial lagging model (Hu et al. 2019). The embedding method extracts semantic similarity between terms at a low abstraction level, achieving a standard deviation of 0.5 and reducing the amount of feedback necessary for efficient processing (El-Assady et al. 2020). Word2API embedding maps the relationship between words and APIs and achieves an average mean precision of 43.6% to extract a topic based on relatedness (Li et al. 2018). The review of topic modeling in terms of data source, application area, datasets, and performance evaluation is illustrated in Table 11 of Annexure A.

Table 5 The most prominent word embedding models published from 2013 to 2020

5.6 Importance of word embedding

In a nutshell, word embedding is the representation of text as vectors. The use of vector representations of text can aid in the discovery of word similarities. With the advancement of embedding techniques, deep learning is currently being employed efficiently in NLP (Verma and Khandelwal 2019) (Wang et al. 2020). The Skip-Gram model of Word2Vec efficiently represents the CNN model's architecture for performing image classification tasks (Dharmaretnam et al. 2021), efficiently explores the semantic correlations in music (Chuan et al. 2020), and effectively utilizes computational resources when the technique is parallelized in shared and distributed memory environments (Ji et al. 2019). Pre-trained embedding models assign similar embedding vectors to words with similar meanings. A distinct embedding should be given to a word in each context because its meaning varies with context. The results of an experimental evaluation of a word similarity test demonstrate that the global relationship between the individual words and sub-words effectively represents the word vector. The suggested method minimizes the pre-trained model size while retaining the word embedding standard (Ohashi et al. 2020). An alternative word model called a graph of words is suggested to address the shortcomings of the Bag of Words model. The word order and distance are taken into account by the graph-of-words model. The experiment demonstrates that the graph-of-words model performs well on various tasks, including text summarization, ad-hoc information retrieval, and document keyword extraction (Vazirgiannis 2017). A model utilizing Skip-Gram is presented to determine whether spelling changes impact the effectiveness of word embedding. The study of spelling variation focuses on words with the same meaning but various spellings. In contrast to the non-conventional form, which represents spelling variants, the conventional form represents words without spelling variation. 
The results of the experiment indicate that the word embedding model partially encodes the patterns of spelling variation (Nguyen and Grieve 2020). In contrast to the skip-gram negative sampling (SGNS) technique, which uses both word and context vectors, the context-free (CF) algorithm employs a word vector. The suggested CF method effectively distinguishes between positive and negative word similarity. It produces results comparable to those of the SGNS algorithm (Zobnin and Elistratova 2019). An isotropic iterative quantization (IIQ) method is suggested for compacting embedding feature vectors into binary ones to satisfy the required isotropic property of pointwise mutual information (PMI)-based approaches. This approach uses the iterative quantization technique, which is well-established for image retrieval (Liao et al. 2020). A method for obtaining vector representations of noun phrases is suggested. Each noun phrase's semantic meaning is assumed to be represented as a vector of the phrase's meaning. The bigram composition method is used to comprehend the semantic meaning of a word, which effectively teaches the importance of a phrase. A specific dimension is essential for improving the phrase's semantic characteristics. Experiment evaluation of proposed constraints on the WordNet dataset efficiently represents the grammatically informed and understandable conceptual phrase vectors (Kalouli et al. 2019). An approach combining principal component analysis and a post-processing algorithm is proposed to minimize the dimensionality of Word2Vec, GloVe, and fastText pre-trained embedding models. The suggested method creates efficient word embeddings in lower dimensions for the binary text classification problem. It achieves the highest Spearman rank correlation coefficient (91.6) compared to other baseline models (Raunak et al. 2019). 
The reduction of the dimension of word embedding without sacrificing accuracy is achieved using a distillation ensemble strategy, which uses an intelligent transformation of word embedding. The Word2Vec model is used to extract the features, and the LSTM and CNN models are used to train them. The experiment evaluation reveals that the distillation ensemble strategy achieves 93.48% accuracy (Shin et al. 2019). A self-supervised post-processing strategy is suggested to obtain pre-trained embedding for domain-specific tasks, which improves end-task performance by choosing from a menu of reconstructing transformations (MORTY). In a multi-task environment using GloVe embedding, the MORTY technique yields smaller but more consistent benefits and works particularly well with smaller corpora (Rethmeier and Plank 2019). The performance of pre-trained words embedding models such as Word2Vec (CBOW and Skip-Gram), fastText, and the BERT model on a Kannada language text classification task is evaluated. The experimentation evaluation reveals that the CBOW model gives more efficient results than the Skip-Gram model, and the fastText model outperforms the Word2Vec model on the News Classification dataset (Ebadulla et al. 2021). An iterative mimicking (IM) strategy is suggested to treat out-of-vocabulary (OOV) terms. The IM framework iteratively improves the word and character embedding model, assigning a vector to the input sequence for any OOV word. Evaluation of experimental results demonstrates that the suggested framework performs better on the word similarity task than the baseline strategy (Ha et al. 2020). The BiGRU with domain-specific embedding and fastText yields up to 64% micro-average precision for downstream tasks in the patent categorization (Risch et al. 2019). The fastText embedding strategy and the RMSProp optimizer extract relationships between word pairs from the Turkish corpus, with a 90.76% accuracy (Yildirim 2019). 
The Skip-Gram model shows the highest semantic clustering accuracy, with a mean of 6.7 words out of 10, utilizing Korean word embedding (Ihm et al. 2019). A sequence-to-sequence autoencoder is efficiently utilized to understand phonetic information using audio Word2Vec embedding (Chen et al. 2019). The Gaussian LDA model provides adequate service discovery queries by acquiring meaningful information in the discovery process (Tian et al. 2019). Scaling to big corpora is achieved using Word2Vec, with a 7.5 times acceleration on GPU without an accuracy drop (Li et al. 2019a). The adaptive cross-contextual word embedding model achieves an F1-score of 76.9% when considering word polysemy (Li et al. 2021). The LSTM with Word2Vec embedding model efficiently utilizes the log information to predict the next alarm in process plants and achieves an accuracy of 81.40% (Cai et al. 2019). Mirror Vector Space (MVS) embedding is an ensemble of Concept-Net, Word2Vec, GloVe, and BERT. The MVS model enhances the performance and achieves an accuracy of 83.14% for the text classification task (Kim and Jeong 2021). The improved word vector (IWV), created by combining CNN with Word2Vec, GloVe, Pos2Vec, Lexicon2Vec, and Word-position2Vec, improves sentiment analysis task performance and reaches 87% accuracy (Rezaeinia et al. 2019). BiLSTM with CRF and the Law2Vec embedding technique for representing legal texts obtains an F1-score of 88% (Chalkidis and Kampas 2019). Hyperparameter optimization of the BiLSTM model with Word2Vec embedding reaches a classification task accuracy of 93.8% (Yildiz and Tezgider 2021). The meaning of polysemous words is efficiently extracted utilising sentence BERT, which improves the overall textual similarity task performance (Wang and Kuo 2020). The examination of pooling procedures in conjunction with basic correlation coefficients produces the best results on downstream semantic textual similarity problems. 
It demonstrates the value of applying statistical correlation coefficients to groups of word vectors as a strategy for computing similarity (Zhelezniak et al. 2019). The LDA topic model and Word2Vec are utilized to determine how similar the two terms are. Based on their similarity, the terms' semantic graph is created. By grouping the terms into various communities, each of which serves as a concept, the community detection algorithms are utilised to automatically extract concepts from text (Qiu et al. 2020a). The performance of biometric-based surveillance systems for monitoring user activity is improved using GloVe embedding with the BiLSTM model (Toor et al. 2019). The review of the importance of word embedding in terms of data source, application area, datasets, and performance evaluation is illustrated in Table 12 of Annexure A.
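
The idea that opens this subsection, namely that vector representations make word similarity measurable, can be sketched with cosine similarity, a measure commonly used for comparing embedding vectors. The three-dimensional vectors below are hypothetical toy values chosen for illustration; they are not taken from Word2Vec, GloVe, or any other pre-trained model, and the function names are assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings (real models use hundreds of dimensions).
TOY_EMBEDDINGS = {
    "king": [0.8, 0.6, 0.1],
    "queen": [0.7, 0.7, 0.1],
    "apple": [0.1, 0.2, 0.9],
}


def most_similar(word, embeddings, top_n=1):
    """Rank all other words by cosine similarity to the given word."""
    target = embeddings[word]
    scored = [
        (other, cosine_similarity(target, vector))
        for other, vector in embeddings.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


print(most_similar("king", TOY_EMBEDDINGS))
```

With these toy vectors, "queen" is ranked as the nearest neighbour of "king" because their vectors point in almost the same direction, whereas "apple" lies far away in the embedding space.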

Table 6 A reference for selecting a suitable word embedding approach and deep learning model for text analytics tasks

5.7 Deep learning environment

Deep learning technology, which grew out of artificial neural networks, is now a prominent topic in computing and is used extensively in a wide range of fields, including cyber security, healthcare, visual recognition, and many more. Nevertheless, the dynamic nature and variability of real-world problems and data make it difficult to build a suitable DL model. Additionally, a lack of understanding of the underlying principles turns DL techniques into black boxes, which hinders further development. This section gives a concise overview of deep learning techniques and includes a taxonomy that takes important application domains into account.

Deep learning is becoming an increasingly important component of security systems. In the field of computer security, Warnecke et al. (2020) cover the appropriate approaches and the criteria for comparing and assessing methods. The performance of deep learning architectures such as MLP, CNN, and LSTM is compared using 4 to 6 layers of different types. Additionally, the study suggests adopting and implementing intrusion detection systems and vulnerability identification techniques in computer security (Warnecke et al. 2020). A dynamic prototype network based on sample adaptation for few-shot malware detection was presented to formalize the identification of unknown malware. The method detects malware by enabling dynamic feature extraction based on sample adaptation and by using a metric-based method to determine the distance between the query sample and the prototype. The suggested method performs better than existing few-shot malware detection algorithms (Chai et al. 2022). A deep reinforcement learning-based data poisoning attack approach is developed to help malicious workers compromise TruthFinder while remaining undetected. The workers experiment with various attack methods and refine their poisoning techniques to maximize their attack strategy and limit information extraction (Li et al. 2020a).

A system called DeepAutoD is proposed that uses a deep convolutional neural network to learn feature information for malicious code identification while removing the influence of reinforcement. The system increases the effectiveness of mobile communication and the security of networked computers (Lu et al. 2022). An ensemble deep learning-based web attack detection system is proposed to protect IoT network environments from web attacks. Three distinct deep learning models (MRN, LSTM, and CNN) first identify web attacks independently and are then combined into an ensemble classifier that determines the final outcome. The feature vector is formed using TF-IDF, word2vec, and FastText. The experimental results on the HTTP CSIC dataset demonstrate that the proposed ensemble system can accurately identify web attacks with low false positive and false negative rates (Luo et al. 2021).
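The ensemble step described above, in which several independently trained detectors feed a final decision, can be sketched as follows. The combination rule is not specified in this section, so the sketch assumes simple soft voting (averaging the attack probabilities); the function name, the threshold, and the example scores are hypothetical and do not represent the method of Luo et al. (2021).

```python
def soft_vote(model_scores, threshold=0.5):
    """Average per-model attack probabilities and flag an attack if the
    average reaches the threshold. Soft voting is an assumed combination
    rule, not the one reported in the cited study."""
    if not model_scores:
        raise ValueError("at least one model score is required")
    average = sum(model_scores.values()) / len(model_scores)
    return average, average >= threshold

# Hypothetical attack probabilities from three base detectors for one request.
scores = {"MRN": 0.9, "LSTM": 0.4, "CNN": 0.8}
average, is_attack = soft_vote(scores)
print(f"average score = {average:.2f}, attack = {is_attack}")
```

A learned meta-classifier trained on the base model outputs is a common alternative to fixed averaging when the base models differ in reliability.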

For spatiotemporal data mining applications, deep learning models like CNN and RNN have shown remarkable success (Wang et al. 2022). In one such model, a CNN is used for feature extraction from spatial–temporal data, while a GRU is used to improve query trajectory prediction accuracy. Experiments on the Porto dataset demonstrate that the suggested model achieves a mean absolute percentage error of 0.070% while approximating the properties of each segment of trajectory data at the time level (Qiu et al. 2020b). Identifying information that discriminates based on gender depends heavily on the meaningful classification of text from digital media. Word embedding is done using the ELMo and GloVe models, and sentence embedding is done using a BERT model. The experiments show that the suggested deep learning models effectively perform multi-label classification (Parikh et al. 2019).
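For readers unfamiliar with the metric reported above, the mean absolute percentage error (MAPE) averages the absolute relative errors and expresses the result as a percentage. The following is a minimal sketch of the standard definition; the function name and the sample values are illustrative and are not taken from the Porto dataset.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    Assumes no actual value is zero, since each error is divided by it.
    """
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)

# Illustrative values only.
print(mape([100.0, 200.0], [110.0, 180.0]))
```

Because each error is scaled by the true value, MAPE is comparable across trajectories of different magnitudes but is undefined when a true value is zero.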

Knowledge graphs, particularly domain knowledge graphs, already play significant roles in the field of knowledge engineering and serve as the foundation for knowledge-driven intelligent Internet applications. A Graph Convolutional Network (GCN) (Kipf and Welling 2017) is a multilayer neural network that operates directly on a graph and generates embedding vectors of nodes based on the characteristics of their neighborhood, achieving state-of-the-art classification.
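To make the neighborhood aggregation concrete, the sketch below implements one GCN layer in plain Python, following the propagation rule of Kipf and Welling (2017): the adjacency matrix with added self-loops is symmetrically normalized, multiplied by the node features and a weight matrix, and passed through ReLU. The helper names and the two-node toy graph are hypothetical, and a practical implementation would use a tensor library and learned weights.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def normalize_adjacency(adj):
    """Return D^(-1/2) (A + I) D^(-1/2), where D is the degree matrix of A + I."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    degree = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
            for i in range(n)]

def gcn_layer(adj, features, weights):
    """One GCN layer: ReLU(normalized adjacency x features x weights)."""
    propagated = matmul(normalize_adjacency(adj), features)
    transformed = matmul(propagated, weights)
    return [[max(0.0, value) for value in row] for row in transformed]

# Toy graph: two connected nodes with one feature each (illustrative only).
adjacency = [[0.0, 1.0], [1.0, 0.0]]
node_features = [[1.0], [3.0]]
layer_weights = [[1.0]]
print(gcn_layer(adjacency, node_features, layer_weights))
```

Stacking two such layers lets each node embedding depend on neighbors up to two hops away.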

A GCN is also proposed for text classification (Text-GCN). The Text-GCN learns the embedding for both words and documents after initializing each with a one-hot representation. The experimental results demonstrate Text-GCN's robustness to limited training data in text classification. Text-GCN can effectively use the limited labeled documents and capture global word co-occurrence information (Yao et al. 2019). For text categorization, the graph-of-docs paradigm is proposed to represent multiple documents as a single graph. The suggested method captures a term's significance across an entire document collection and supports relationship edges between documents. Experimental results demonstrate that the suggested model outperforms the baseline models with an accuracy score of 97.5% (Giarelis et al. 2020).
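The global word co-occurrence information mentioned above can be illustrated with a small sketch that weights word-to-word edges by pointwise mutual information (PMI) computed over sliding windows, keeping only positive values, which is how the Text-GCN paper (Yao et al. 2019) describes its word edges. The function names, the window size, and the toy corpus are hypothetical, and document-to-word edges are omitted.

```python
import math
from collections import Counter
from itertools import combinations

def sliding_windows(tokens, size):
    """Split a token list into overlapping windows of the given size."""
    if len(tokens) <= size:
        return [tokens]
    return [tokens[i:i + size] for i in range(len(tokens) - size + 1)]

def pmi_edges(documents, window_size=3):
    """Return {(word_a, word_b): PMI} for word pairs with positive PMI."""
    windows = [w for doc in documents for w in sliding_windows(doc, window_size)]
    total = len(windows)
    word_count, pair_count = Counter(), Counter()
    for window in windows:
        unique = sorted(set(window))
        word_count.update(unique)                    # windows containing each word
        pair_count.update(combinations(unique, 2))   # windows containing both words
    edges = {}
    for (a, b), n_ab in pair_count.items():
        pmi = math.log(n_ab * total / (word_count[a] * word_count[b]))
        if pmi > 0:  # keep only pairs that co-occur more often than chance
            edges[(a, b)] = pmi
    return edges

# Toy tokenized corpus (illustrative only).
corpus = [["graph", "neural", "network"], ["graph", "text"], ["neural", "network"]]
for pair, weight in sorted(pmi_edges(corpus).items()):
    print(pair, round(weight, 3))
```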

Graph-based NLP combines the structural information in text with the representation learning ability of deep neural networks. Graph-based NLP approaches are extensively used in text clustering and multitask learning (Wu et al. 2022). Deep neural networks are also proposed for producing compositional word embeddings and for sentence processing. The model multiplies matrices to create unitary matrices for larger units that encode lexical data. These lexicons depict the embedding without diluting the information or considering the context (Bernardy and Lappin 2022).

6 Remarks and critical discussion

The selection of appropriate word embedding methods and deep learning models is essential in text analytics. This research examines the steps that different word embedding methods take and the behavior of various deep learning models in terms of text analytics task performance. In this section, the study's practical implications are examined. Advances in deep learning models directly affect the growth of NLP techniques. The in-depth analysis covers methods for analyzing unstructured text, including text classification, sentiment analysis, NER and recommendation system, biomedical text mining, and topic modeling, as shown in Fig. 3. Each of these strategies is employed in a variety of contexts.

6.1 The model architecture used for word embedding

Complex deep neural network models are becoming easier to train as technology advances on both the hardware and software fronts. As a result, researchers have begun integrating the characteristics of numerous deep neural networks and adding innovative features to their designs. Section 1 discusses the architectural constraints used in developing deep learning models. Section 2 discusses the development of word embedding methods for efficiently and accurately representing the word’s meaning. The most prominent word embedding models discussed in Section 2 are summarized in Table 5, together with their citation counts.

Table 5 shows that the paper that proposed the Word2Vec embedding model has the most citations among all the models. Word2Vec is a predictive model that assigns probabilities to terms and performs well in word similarity tests. In contrast, GloVe is a count-based model that combines the local context window approach with global matrix factorization. The GloVe model was proposed in 2014 and has accumulated a considerable number of citations, reflecting its wide use by researchers. The current review reflects the same picture for the Word2Vec and GloVe models, as shown in Fig. 18, indicating that researchers have explored the performance of both models on specific tasks in almost all domains. Each language has specific rules and patterns that require the base model to be modified for better results. These models learn static word embeddings, with each word’s representation fixed after training. To handle out-of-vocabulary words, the embedding model was enhanced further, which led to the fastText model. fastText is a Word2Vec extension that represents words as character n-grams. It generates an efficient and effective vector representation of infrequent words.
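The character n-gram view of a word can be illustrated with a short sketch. It assumes the conventions of the published fastText description: the word is wrapped in the boundary markers < and >, and n-grams of length 3 to 6 are extracted. The function names are hypothetical, and the hashing and vector summation steps of the actual model are omitted.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of a word wrapped in boundary markers."""
    wrapped = "<" + word + ">"
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams

def shared_ngrams(word_a, word_b, n_min=3, n_max=6):
    """N-grams that two words have in common."""
    return set(char_ngrams(word_a, n_min, n_max)) & set(char_ngrams(word_b, n_min, n_max))

print(char_ngrams("where", 3, 3))
print(sorted(shared_ngrams("embedding", "embeddings")))
```

Because a rare or unseen word shares many n-grams with frequent words, its vector can be composed from n-gram vectors learned elsewhere in the corpus, which is what allows out-of-vocabulary words to receive a representation.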

Fig. 18 Overview of the embedding approach used by the researchers

Embedding models are further enhanced to handle polysemous words and to represent a word’s contextual meaning in different languages so that more domain-specific tasks can be performed. The meaning of a polysemous word can change depending on the context. In a contextualized word embedding approach, each word’s vector representation varies with the input context, sentence, or document. Domain-specific word embedding (DSWE), on the other hand, is an effective strategy for tasks within a specific research domain. DSWE has emerged as a more valuable solution than general word embedding since it concentrates on one particular aim of text analytics, as shown in Fig. 18.

Among the recently published models, the BERT contextual embedding model has the most citations. The current review of embedding models for text analytics tasks shows that researchers have explored the BERT model more deeply than the ELMo and GPT models. A recently proposed variant of the GPT model has also been utilized for domain-specific operations and is expected to attract more citations and exploration among researchers. The description, benefits, and drawbacks of various word representation models are discussed in Table 14 of Annexure B. As per the current review, several model designs and methodologies have emerged to perform text analytics tasks. The remainder of this section summarizes and compares numerous word embedding and deep learning models and explains how to use these models to achieve efficient results on text analytics tasks.

6.2 Comparative analysis of word embedding models for text analytics tasks

The performance of word embedding techniques and deep learning models for various text analytics tasks observed in the current review is shown in Fig. 19. The study shows that domain-specific word embedding performs better than the generalized embedding approach on domain-specific text analytics tasks. For the text classification task specifically, the CBOW model of Word2Vec and domain-specific embedding perform similarly in the current review. The GloVe, fastText, and BERT embedding models show considerable performance but are limited to a few applications. As per the current review, researchers utilize the ELMo and GPT models for text classification in only a few cases.

Fig. 19 Based on the current review, (a) performance of word embedding models and (b) performance of deep learning models for various text analytics tasks

Domain-specific word embedding is the preferred choice of researchers for sentiment analysis tasks. Researchers focus on the character, word, or sentence level to identify the sentiment associated with the text. The performance of domain-specific embedding, which focuses on a specific granularity of text for evaluation, is higher than that of the generalized embedding approach, as shown in Fig. 19(a). The CBOW and BERT models also perform efficiently when specific evaluation features are considered to identify sentiments. The GloVe and fastText models also perform well in a limited number of situations. In contrast, the performance of the ELMo and GPT models is not competitive with the BERT model for sentiment analysis tasks as per the current review.

Generalized word embedding models fail to capture the ontological information available in domain-specific structured resources. The subword information from unlabeled biomedical text is combined with the MeSH vocabulary to form BioWordVec, a domain-specific word embedding that provides an essential foundation for biomedical NLP. As per the current review, researchers use domain-specific embedding as an efficient approach for biomedical text mining classification, as shown in Fig. 19(a). The CBOW, ELMo, and BERT embedding models are also good choices for biomedical text mining when following a generic approach. Researchers utilize CBOW and domain-specific word embedding to perform the named entity recognition and recommendation tasks. The other embedding models, such as Skip-Gram, GloVe, fastText, and BERT, are also explored and give better results in a limited number of situations, as shown in Fig. 19(a). Researchers utilize domain-specific embedding more heavily for the topic modeling task than the Skip-Gram and ELMo embedding models.

It is observed from the review that CBOW and domain-specific word embedding models are used frequently by researchers. They perform better when analyzing the impact of word embedding models on domain-specific text analytics. At the same time, other models, such as Skip-Gram, GloVe, fastText, and BERT, are also explored for the possibility of a better outcome in a few instances.

6.3 Comparative analysis of deep learning models for text analytics tasks

The performance of deep learning models in various application areas is shown in Fig. 19(b). The current review finds that researchers heavily recommend the CNN model for text classification tasks. The LSTM model is an efficient alternative for text classification, whereas there are few instances where the GRU or hybrid model achieves better performance. The LSTM model is strongly recommended for sentiment analysis tasks, with the CNN model as an alternative. Researchers found that both CNN and LSTM can be used for text classification tasks in the biomedical domain. Based on its performance, the LSTM model is strongly recommended for named entity recognition and recommendation system tasks, as shown in Fig. 19(b). The other deep learning models, such as GRU, CNN, and hybrid models, prove their effectiveness in a few cases. The CNN and GRU models can be utilized for topic modeling tasks. The current review also indicates that analyzing the impact of the word embedding model on the text analytics domain needs a powerful deep learning model. The LSTM model is preferred over the CNN and GRU models for analyzing the performance of the embedding model. Apart from the LSTM model, the CNN model can also be explored to perform the analysis.

6.4 Selection criteria for word embedding and deep learning models to perform text analytics tasks

Text analytics uses machine learning, deep learning, and NLP to extract meaning from vast amounts of text. Businesses may use this information to boost revenue, customer satisfaction, innovation, and public safety. This study explores the effectiveness of utilizing word embedding techniques in a deep learning environment for text analytics tasks. The review reveals three main types of word embedding: conventional, distributional, and contextual representation models. Deep learning models such as CNN, GRU, LSTM, and a hybrid approach are utilized by most researchers to accomplish text analytics tasks. Selecting the word embedding and deep learning models that give better outcomes is a vital step. It requires thorough knowledge of the various types of embedding and deep learning models to accomplish the designated task in a specified time. A reference for selecting a suitable word embedding and deep learning model for text analytics tasks is given in Table 6. The current review reveals that domain-specific word embedding is the first preference as the most suitable embedding for the majority of application areas related to text analytics.

The CBOW model is also a first preference for text classification tasks, whereas the GloVe, fastText, and BERT models are second preferences, as shown in Table 6. The CBOW and BERT models are second preferences for the sentiment analysis task. The CBOW, BERT, and ELMo models are second preferences for biomedical text mining tasks. The CBOW model is the second choice for NER and recommendation system tasks. The Skip-Gram and GloVe models are second preferences for topic modeling tasks. The domain-specific word embedding and CBOW embedding models are recommended as first preferences, whereas the Skip-Gram model is recommended as a second preference, for analyzing the impact of the word embedding model on text analytics tasks.

Various deep learning models have been proposed and utilized to perform text analytics tasks. It is revealed from the current review that the CNN model achieves the first preference and the LSTM model attains the second preference to perform text classification tasks. Similarly, the LSTM model reaches the first preference for sentiment analysis tasks, named entity recognition and recommendation system tasks, and the hybrid approach is the second preference. The CNN and the LSTM model achieve the first preference for biomedical domain text classification tasks, and the hybrid approach achieves the second preference. The CNN and the GRU model attain the first preference for topic modeling tasks. As per the current review, for analyzing the impact of word embedding, the LSTM model achieves the first preference, and the CNN model achieves the second preference.
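The deep learning preferences stated in the preceding paragraph can be encoded as a simple lookup, sketched below. The dictionary contents restate the first and second preferences reported in this review; the variable name, key spellings, and helper function are hypothetical, and an empty list means that no second preference is stated for that task.

```python
# Preferences as reported in the review text; key names are illustrative.
DL_MODEL_PREFERENCES = {
    "text classification": {"first": ["CNN"], "second": ["LSTM"]},
    "sentiment analysis": {"first": ["LSTM"], "second": ["hybrid"]},
    "ner and recommendation system": {"first": ["LSTM"], "second": ["hybrid"]},
    "biomedical text classification": {"first": ["CNN", "LSTM"], "second": ["hybrid"]},
    "topic modeling": {"first": ["CNN", "GRU"], "second": []},
    "impact of word embedding": {"first": ["LSTM"], "second": ["CNN"]},
}

def recommend_models(task):
    """Return the first and second preference models for a task name."""
    key = task.strip().lower()
    if key not in DL_MODEL_PREFERENCES:
        raise KeyError(f"no recommendation recorded for task: {task!r}")
    return DL_MODEL_PREFERENCES[key]

print(recommend_models("Sentiment analysis"))
```

Such a table is only a starting point: the preferences summarize how often models performed well in the reviewed studies, not a guarantee for a new dataset.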

In the current review, comparing the performance of various word embedding and deep learning models for text analytics tasks reveals specific word embedding and deep learning models as the preferred choice to perform particular tasks. In conclusion, using the domain-specific word embedding and LSTM model can improve the overall performance of text analytics tasks.

7 Conclusion and future directions

7.1 Concluding remarks

In recent years, interest in using word embedding and deep learning for analysis and prediction has increased, and the research community has proposed various approaches. This paper conducts a systematic literature review to capture the state-of-the-art word embedding and deep learning models for text analytics tasks and discusses the key findings.

Three different electronic data sources were used to find and classify relevant articles about the influence and use of word embedding models on text analytics in a deep learning context. The relevant literature is categorized based on criteria to review the key applications of text analytics and word embedding techniques. Techniques for analyzing unstructured text include text classification, sentiment analysis, NER, recommendation systems, biomedical text mining, and topic modeling.

Deep learning models utilize multiple computing layers to learn hierarchical representations of data. Several model designs and methodologies have emerged for text analytics. This paper reviews the performance of various word embedding methodologies proposed by the researchers and the deep learning models employed to get better results. The review contains a summary of prominent datasets, tools, and APIs available and a list of notable publications. A reference for selecting a suitable word embedding approach and deep learning model for text analytics tasks is presented in Section 6. The comparative analysis is presented in both tabular and graphical forms.

According to the current review, domain-specific word embedding is the first preference for performing text analytics tasks. The CBOW model can also be a first preference for text classification or for analyzing the impact of word embedding. For several of the remaining text analytics tasks, the CBOW and BERT models attain the second preference. The review shows that researchers preferred the CNN and LSTM models over the GRU and hybrid approaches for text analytics tasks. It can be concluded from the findings of this study that domain-specific word embedding and the LSTM model can be used to improve overall text analytics task performance.

7.2 Future directions

The selection of appropriate word embedding models plays an important role in the success of NLP applications. It is difficult to predict what kind of semantic or syntactic information is captured inherently in a contextualized word embedding. Extrinsic tasks are the only way to evaluate contextualized word embeddings. It would be crucial to identify whether the goal of context-dependent representation has been achieved and to assess the scope of this possible achievement. For sentence representations, the expressiveness of each embedding depends strongly on the individual task. The basic components of a sentence required by various tasks are at different levels. In the future, it is necessary to understand how to learn sentence representations and even higher levels of text representation for various languages.

Moreover, even though present word vector models have produced significant results in various NLP tasks, these approaches have some limitations. For example, the number of model parameters is excessively large, the training process is lengthy, and existing neural network-based systems are difficult to interpret. As a result, figuring out how to cut the cost of neural network training while improving model interpretability is another area of research. The size of the corpus should be considered when evaluating an embedding. Future work should also analyze the outcomes of reducing the embedding dimension and the steps that must be followed for a particular task in a given domain.

Pre-trained embedding models have a large number of word vectors and need more storage space. On a system with limited resources, this expense represents a deployment constraint. Future work should examine the best ways to increase isotropy and decrease dimension in pre-trained embedding models. It should also investigate approaches for learning multilingual lexicons in a single embedding space, enhance ways of learning multilingual word embedding, and employ semantic information to transfer knowledge across a range of cross-lingual NLP tasks.
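The storage constraint can be made concrete with a back-of-the-envelope calculation: a dense embedding table needs roughly vocabulary size times dimension times bytes per value. The sketch below assumes 32-bit floats and ignores the vocabulary strings and file overhead; the function names and the example sizes (3 million words, 300 and 100 dimensions) are illustrative assumptions, not figures from this review.

```python
def embedding_size_bytes(vocab_size, dimension, bytes_per_value=4):
    """Approximate size of a dense embedding table (float32 by default)."""
    return vocab_size * dimension * bytes_per_value

def to_gigabytes(num_bytes):
    """Convert bytes to decimal gigabytes (1 GB = 10**9 bytes)."""
    return num_bytes / 1e9

# Illustrative sizes only.
full = embedding_size_bytes(3_000_000, 300)
reduced = embedding_size_bytes(3_000_000, 100)
print(f"300 dimensions: {to_gigabytes(full):.1f} GB")
print(f"100 dimensions: {to_gigabytes(reduced):.1f} GB")
```

Under these assumptions, the footprint scales linearly with the dimension, which is why dimension reduction is attractive for resource-limited deployment as long as task accuracy is preserved.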

Contextualized word embeddings have achieved outstanding results in significant NLP tasks. Further research is required to develop a reliable contextual language model for text analytics problems using a strategy that combines the contextual word embedding model with a multitask learning approach. Contextual embeddings and other sorts of spelling variation can be investigated in future studies. Various classifiers and feature representations can be investigated to capture the interaction between two embeddings for diagnostic classifiers. Future work can also explore how to obtain the correlation between text, audio, and video using enhanced deep canonical correlation analysis, with these distinctive features collected to provide a multimodal embedding for the downstream task. Finally, transformer-based language models can be extended to generate representations that reduce the dependency on human-labeled training data and transfer efficiently to other downstream tasks.

8 Annexure A

Text analytics techniques include text classification, sentiment analysis, biomedical text mining, named entity recognition, recommendation system, and topic modeling. Tables 7, 8, 9, 10, 11, and 12 present the technique-wise reviews of the word embedding and deep learning models employed, in terms of data source, application area, datasets, and performance evaluation.

Table 7 Review of text classification
Table 8 Review of sentiment analysis
Table 9 Review of biomedical text mining
Table 10 Review of named entity recognition and recommendation system
Table 11 Review of topic modeling
Table 12 Review of the importance of word embedding
Table 13 List of publishers/journals

9 Annexure B

See Table 14.

Table 14 The description, benefits, and drawbacks of various word representation models