1 Introduction

In recent years, internet technologies have substantially changed how society works [18]. People place orders on e-commerce websites, communicate on social media, and use other web services. Social networking platforms such as Facebook and Twitter have grown into venues for business expansion, political campaigning, sharing innovative ideas, and promoting goods and services [67]. Users share their perspectives on these platforms and appraise services and products by giving star ratings [7] or thumbs up or down [17, 64]. Consequently, these platforms are significant sources of consumer opinion. In this context, Sentiment Analysis (SA) is an important research area focused on analyzing user opinions and partitioning them into mutually exclusive classes. SA is considered a subfield of Natural Language Processing (NLP), an extensive area within Artificial Intelligence [5] that deals with the interaction between computers and natural human languages. SA is also known as opinion mining, which helps extract the emotion expressed in text [9, 33]. It classifies text into positive, neutral, or negative classes [11] and also quantifies the polarity level [66]. SA can be applied to a variety of languages to capture sentiment [7, 39].

Over the past decades, researchers have explored SA extensively [4, 48, 49]. The most popular approaches in SA include lexicon-based, rule-based and Machine Learning (ML) based techniques [13, 37]. In lexicon-based techniques, the sentiment polarity of a text is decided from the semantic orientations of its constituents [55]. Semantic orientation is a measure of the subjectivity and opinion in textual information. The rule-based approach [70] relies on association rule mining algorithms, which discover rules with high confidence and support that can effectively decide the underlying sentiment of features of interest. The rule-based technique finds the opinion words in the text and then classifies the text according to the number of negatively and positively polarized words. It considers classification rules such as dictionary polarity, boosting words, and negation words. ML employs learning algorithms that use labeled data sets to train a classifier for determining the sentiments [7, 57] present in the data. Alongside these mainstream techniques, hybrid methods are also growing in popularity for sentiment classification.

Among ML techniques, neural networks have recently gained great popularity in applications including text mining [16, 32, 38], image processing [44, 54], video processing [34, 59] and audio processing [71]. Nowadays, with the steady evolution of ML, more complex deep neural network models can be handled on large data sets [23, 31, 36]. Convolutional Neural Networks (CNN) have become a prominent Deep Learning (DL) technique that achieves very accurate results in computer vision [31]. Within a CNN, stacking multiple convolution and pooling layers helps extract a hierarchical representation of the input data step by step [29, 35]. Beyond these domains, CNN is also showing improvement in SA [27, 28]. CNN can capture the non-linearity within the data and tries to learn local embeddings. Besides CNN, the Recurrent Neural Network (RNN) [10] has recently become prominent; it provides a memory element for capturing dependencies in long sequences. In various language processing tasks, RNN has shown very fruitful results [23, 61]. The main issues with the simple RNN are the vanishing gradient problem, where the gradient tends to zero, and the exploding gradient problem, where the gradient tends to a very high value. Long Short Term Memory (LSTM) is a variant of the simple RNN that tries to mitigate both problems [10, 24]. Term Frequency-Inverse Document Frequency (TF-IDF) [53] is a text processing technique for quantifying the tokens in textual data. It is a numerical statistic that indicates how important a constituent is to a document, and it is prevalent for scoring words in text categorization algorithms [62]. When weighting a token, the technique considers both its importance in a single document and its relevance in the whole corpus. This weighting factor gives the TF-IDF text vectorization technique an edge over the Bag of Words (BoW) scheme [73]. TF-IDF based approaches have shown promising results for automated text analysis tasks such as information retrieval [72] and text classification [21, 62].

Given the growing importance of integrated sentiment classification approaches, we present a deep hybrid sentiment analysis framework that combines state-of-the-art deep learning techniques with a modified TF-IDF based text vectorization approach. The proposed framework is summarized in the following steps:

  1. (i)

    The data pass through the cleaning (garbage removal, slang correction, stopwords removal) and preprocessing (tokenization of text, parts of speech tagging) phase.

  2. (ii)

    A modified TF-IDF based approach combined with the k-best selection method is devised for feature vectorization.

  3. (iii)

    The pre-trained Word2Vec model on Google News corpus [22] is used for the embedding of the feature vector in 300 dimensional space. The pre-trained Word2Vec models are capable of giving high end embedding vectors [41, 52].

  4. (iv)

    At the final stage, we apply DL techniques by integrating CNN and LSTM. The CNN helps extract the locally encoded dominant features, and the LSTM is used to capture the historical information.

We show the efficacy of the proposed framework against different baseline models using performance metrics such as accuracy, precision, recall and F-Measure. The remainder of this paper is organized as follows. Section 2 describes the related research in the domain of Sentiment Analysis and the techniques that are combined within the hybrid approach. Section 3 discusses the proposed method and the framework design strategy in detail. Section 4 presents the experimental setup and the analysis of the empirical observations, and finally, Section 5 briefly concludes the paper and outlines future prospects.

2 Related work

In this section, we discuss related work in the field of SA and the methods we have used. A thorough overview of different algorithms, their extensions, and real-life applications in SA is given in [40]. It also discusses how to build resources for sentiment analysis tasks. As a whole, it gives an overall picture of SA and the associated fields with a brief introduction. One of the most popular techniques for weighting the constituents of textual data is TF-IDF, a statistical strategy for scoring the elements of a text. The mathematical foundation of the simple TF-IDF scheme is well explained in [53]. That paper provides theoretical arguments, points out some problems with information theory based approaches, and gives theoretical justification that the TF-IDF scheme is a comparatively better probabilistic model. The impact of the TF-IDF scheme as a feature extraction scheme in the context of sentiment analysis is discussed in [3], whose experimental results show that the TF-IDF scheme can give 3% to 4% higher performance than the N-gram feature. A comparative study of the Bag of Words model and the TF-IDF scheme is done in [15]. Experimental observations show that the TF-IDF scheme performs better than the Bag of Words model of feature selection in the context of sentiment analysis. When the authors employed the Bag of Words model for feature selection on their movie review dataset, they achieved an accuracy of about 86.6%, whereas the TF-IDF scheme improved the accuracy to 89.0%. Similar experiments have been conducted on movie review data in [21], where TF-IDF achieves an accuracy of 73.8% while its counterpart achieves 66.5%. The performance of the simple TF-IDF scheme can be improved further if weighting factors are imposed on top of the basic scheme. In [62], linear weighting factors are imposed on the basic TF-IDF scheme, and it is shown experimentally that the weighted TF-IDF gives better results, with accuracy improvements in the range of 2.9% to 7.4%. Similar experiments have been carried out in [50]. Here the authors consider data from different categories and show experimentally that the TF-IDF scheme works better when a weighting factor is imposed on it. For example, the f-measure increases from 0.74 to 0.79 in the programmer category, from 0.67 to 0.80 in the data configuration category, and from 0.81 to 0.83 in the operating system category. Accordingly, to achieve better performance, we have modified the basic TF-IDF scheme by introducing a local weighting factor and a non-linear global weighting factor for feature extraction.

For mathematical modeling, textual data elements must be converted to floating-point values; this process is popularly termed word-to-vector embedding. Word2Vec is one of the most widely used embedding approaches [26], in which each constituent of the text is converted to a floating-point vector. A comparative study of bag of words embedding and Word2Vec embedding is presented in [20]. Here the authors use cross-domain text classification results to show that Word2Vec embedding performs better than bag of words embedding. For example, in the books domain, bag of words embedding achieves a classification accuracy of 66.90%, whereas Word2Vec embedding improves it to 68.74%. Similarly, in the music domain, the accuracy with bag of words embedding is 67.94%, whereas Word2Vec embedding improves it to 68.53%. A comparative study of two pre-trained embeddings, GloVe and Word2Vec, is presented in [52]. The authors show experimentally that Word2Vec embedding works better than GloVe embedding. With a static model, the GloVe embedding gives accuracies of 80.1%, 84.5%, 80.7% and 43.2% for the MR, SST, TR, and SST-1 datasets, respectively, and the Word2Vec embedding improves these to 80.6%, 84.9%, 81.0% and 46.1%. With a non-static model, the GloVe embedding gives accuracies of 81.0%, 85.0%, 81.2% and 45.6% for the respective datasets, and the Word2Vec embedding improves these to 81.2%, 85.3%, 81.5% and 46.1%. How the performance varies with the underlying model of the embedding process is discussed in [41]. There, the authors show experimentally that Skip-gram model based embedding works better than NNLM and CBOW model based embedding: the average accuracies for NNLM and CBOW based models are 50.8% and 63.7%, respectively, whereas Skip-gram model based embedding reaches 65.6%. A similar study has been done in [42], where the CBOW model based Word2Vec architecture achieves an accuracy of 0.772 and the Skip-gram model based architecture achieves 0.834. That study also examines the effect of embedding dimensionality: 100-dimensional and 200-dimensional embeddings have accuracies of 0.798 and 0.804 respectively, and the 300-dimensional embedding outperforms them with the maximum accuracy of 0.808. Based on this previous research, we use the Skip-gram model based 300-dimensional pre-trained Word2Vec embedding, which is trained on the Google News corpus.

Among Machine Learning models, neural networks are gaining popularity. They show better performance in sentiment detection than other traditional approaches, as observed experimentally in [11]. When the authors of that study employ a neural network for sentiment categorization, the f-measure improves from 0.778 to 0.812 for the EC dataset. Similarly, the f-measure improves from 0.683 to 0.735 for the MP3 dataset and from 0.698 to 0.875 for the Blog dataset. CNN is a kind of neural network that supports sequential extraction of features and hierarchical data processing. CNN first showed great success in image processing [30] and has gradually made its presence felt in language processing and text analysis. In [29], sentiment analysis is performed on Twitter data using unigram and bigram features; SVM and MaxEnt give the third-highest and second-highest performance with accuracies of 81.6% and 83.0% respectively, and CNN outperforms them with an accuracy of 87.4%. The fruitfulness of CNN in sentiment analysis is likewise shown experimentally in [28]. On the SED dataset, the SVM and logistic regression models give f-scores of 85.08% and 86.08% respectively, and CNN improves the f-score to 87.66%. On the SSTd dataset, the SVM and logistic regression models have f-scores of 77.90% and 76.18%, respectively, and CNN again gives the highest f-score of 80.72%. A similar pattern is observed on the STSGd dataset, where the SVM and logistic regression models show f-scores of 69.21% and 70.66%, while CNN shows a remarkable improvement with an f-score of 82.65%. In [27], the authors collect reviews from different websites to create customized datasets and perform sentiment analysis on them. On the hotel dataset, Random Forest gives the third-highest performance with an accuracy of 83.0%, SVM shows the second-highest accuracy of 92.3%, and CNN gives the highest accuracy of 94.3%. On the automobiles dataset, SVM and Random Forest have accuracies of 84.8% and 85.6%, and CNN outperforms them with an accuracy of 93.4%. Previous studies have thus confirmed the success of CNN in sentiment categorization tasks. RNN is a kind of neural network that can store historical details and model sequence data through its feedback connections. For the simple vanilla RNN, the vanishing gradient (the gradient becomes too small) and the exploding gradient (the gradient becomes too large) problems complicate practical applications. LSTM is a variant of RNN that mitigates both obstacles; how it does so is discussed theoretically in [24]. LSTM has been applied successfully in sentiment analysis, and its performance there is analyzed in [6]. Here the authors show the superiority of LSTM in sentiment analysis over the multi-layer perceptron and traditional machine learning approaches. On a movie dataset, the logistic regression model, support vector machine, and multi-layer perceptron achieve accuracies of 85.50%, 82.89% and 87.70% respectively, and the LSTM outperforms them with an accuracy of 88.46%. Similar experiments have been carried out in [74]. Here the authors perform sentiment analysis on micro-blogging data, where the Bayesian network, random forest, and support vector machine achieve f-measures of 63.97%, 67.18%, and 62.35% respectively, and the LSTM improves the f-measure to 72.20%. In [63], COVID-19 related tweets are analyzed using the Naïve Bayes classifier, SVM, and LSTM. Here SVM and Naïve Bayes show accuracies of 69% and 71%, respectively, and the LSTM further improves the accuracy to 79%. Previous research has thus demonstrated a good foothold of LSTM in sentiment analysis tasks.

In this paper, different techniques are modified and combined to provide a better hybrid framework for sentiment polarity analysis. After preprocessing of the data, the basic TF-IDF scheme is modified by introducing a local weighting factor and a non-linear global weighting factor, and k-best selection is made based on that modified scheme. Next, pre-trained Word2Vec embedding is employed. Finally, a deep neural network is designed by combining CNN and LSTM to take advantage of both sequential feature extraction and dependency preservation. The deep neural network architecture takes the embedded vectors as input and produces the sentiment label as output. By testing the framework on various datasets with different performance metrics, we show that the proposed hybrid framework performs better than the traditional machine learning approaches.

3 Proposed method and framework design

The broad view of the proposed framework is depicted in Fig. 1. The framework has four stages: Data Preprocessing, Text Vectorization, Embedding of Feature and, finally, the application of the deep network, which are described in detail in the following subsections. In the Data Preprocessing phase, unnecessary details are removed and the data are organized in a structured format. In the Text Vectorization phase, the basic TF-IDF scheme is modified by introducing a global weighting factor; the modified scheme is applied to the data, and the best features are selected. After vectorization, the textual features are converted into numerical form by the pre-trained Google News corpus based embedding so that they can be processed by neural networks. In the deep network phase, CNN and LSTM are combined to take advantage of both: CNN helps in the hierarchical extraction of predominant features in the data, and LSTM assists in preserving dependencies within the data. The embedded features are passed as input to the deep network, which ultimately provides the sentiment label as output.

Fig. 1 Broad view of hybrid framework

3.1 Preprocessing of data

In the proposed framework, preprocessing of data is the first step. Collected data typically contain various kinds of noise, non-dictionary terms, emoji, acronyms, grammatical irregularities, and poorly constructed sentences. Such deformities degrade performance, so data preprocessing steps are essential to eliminate them. The whole procedure can also be made more efficient by representing the data in a structured form. Data preprocessing consists of the following sub-stages.

  1. (i)

    Garbage elimination: Web links, URLs, and numeric values do not carry sentiment-related information. As a result, URLs, numeric values, non-ASCII characters and non-alphabetic phrases are eliminated from the data with the help of a custom regular expression that we designed.

  2. (ii)

    Emoji substitution: An emoji expresses the writer’s instant frame of mind with an icon. In this step, each emoji is substituted with its equivalent text by using the emoji package of Python [19]. This helps capture the sentiment information carried by the icon.

  3. (iii)

    Slang substitution: Slang and abbreviations are expanded to their complete form so that their meaning can be interpreted. For example, the word “ttyl” is substituted by “talk to you later”. We have created a custom dictionary by aggregating data from [25] and [14], and this dictionary is used for slang substitution. The primary working mechanism of replacing slang is described in Algorithm 1.

  4. (iv)

    Apostrophe reference substitution: At this stage, the short apostrophe forms are expanded to their full expression so that sentiment polarity can be evaluated better. For example, “we’ve” is substituted by “we have”, and a negative reference such as “can’t” becomes “can not”. The negative references in particular are vital in resolving the associated sentiment. For this task, we have designed a customized dictionary by taking data from [45]. The working mechanism behind this step is described in Algorithm 2.

  5. (v)

    Tokenization of text: Words, phrases, and symbols are the meaning-bearing units of the text and are termed tokens. In the tokenization step, the text is fragmented into tokens. The Punkt Sentence Tokenizer [46] and the Penn Treebank Word Tokenizer [47] are imported from the Python NLTK package to accomplish this task. A customized data structure has been created to store the tokens of each sentence and the sentences associated with each document.

  6. (vi)

    Stopword elimination: Certain words such as “an”, “the” etc. are quite common in discussion but do not hold much significance when deciding the sentiment of the overall text. Such words are labeled as stopwords. To eliminate them, we import the NLTK package [43] of Python. During the elimination of stopwords, negation words like “not”, “no” etc. are kept, as they are valuable in sentiment polarity classification.

  7. (vii)

    Tagging parts of speech: At this stage, each constituent within the text is tagged with its associated part of speech (POS) depending upon its use within the text. To accomplish the task of POS tagging, the POS tagger [8] from the NLTK package of Python has been employed. A custom data structure has been maintained to accumulate the sentence tokens and the corresponding POS tags. After POS tagging, we only keep the adjectives, adverbs and nouns, as the sentiments can mainly be recognized from these POS.
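
The garbage elimination and stopword elimination steps above can be sketched in plain Python. This is only an illustrative sketch: the regular expressions, the tiny `STOPWORDS` set and the function names are assumptions made here, not the custom expression or the NLTK stopword list used in the framework.

```python
import re

# Hypothetical, deliberately small stopword list (the framework uses NLTK's list).
STOPWORDS = {"a", "an", "the", "is", "was", "to", "of", "and", "not", "no"}
# Negation words are kept even though they appear in the stopword list.
NEGATIONS = {"not", "no"}

# Assumed patterns: URLs first, then anything that is not a letter,
# an apostrophe (kept for the later apostrophe expansion) or whitespace.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_ALPHA_RE = re.compile(r"[^A-Za-z'\s]+")


def remove_garbage(text):
    """Drop URLs, digits, non-ASCII and other non-alphabetic characters."""
    text = URL_RE.sub(" ", text)
    text = NON_ALPHA_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


def remove_stopwords(tokens):
    """Remove stopwords but keep negation words, which carry polarity."""
    return [t for t in tokens
            if t.lower() not in STOPWORDS or t.lower() in NEGATIONS]


cleaned = remove_garbage("Visit https://example.com now, it's 100% not bad")
kept = remove_stopwords(["the", "movie", "was", "not", "good"])
```

Keeping the apostrophe in the character class is a design choice of this sketch so that contractions survive until they are expanded in a later step.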

Algorithm 1 SLANG_SUBSTITUTION.
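
Since the algorithm listing itself is not reproduced here, a minimal dictionary-lookup sketch of slang substitution is given below. The `SLANG` table is a hypothetical two-entry stand-in for the custom dictionary built from [25] and [14] (only "ttyl" comes from the text), and the whitespace-based word splitting is an assumption.

```python
# Hypothetical stand-in for the custom slang dictionary.
SLANG = {
    "ttyl": "talk to you later",  # example given in the text
    "brb": "be right back",       # illustrative entry, not from the paper
}


def substitute_slang(text, slang=SLANG):
    """Replace every whitespace-separated word found in the slang table."""
    return " ".join(slang.get(word.lower(), word) for word in text.split())


expanded = substitute_slang("ok TTYL")
```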

Algorithm 2 APOSTROPHE_SUBSTITUTION.
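
Apostrophe expansion can be sketched in the same spirit. The `CONTRACTIONS` table below holds only the two examples mentioned in the text and stands in for the dictionary built from [45]; normalizing the typographic apostrophe and matching case-insensitively are assumptions of this sketch.

```python
import re

# Hypothetical stand-in for the customized apostrophe dictionary.
CONTRACTIONS = {"we've": "we have", "can't": "can not"}


def expand_apostrophes(text, table=CONTRACTIONS):
    """Expand known short apostrophe forms to their full expression."""
    # Map the typographic apostrophe (U+2019) to the ASCII one first.
    normalized = text.replace("\u2019", "'")
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(key) for key in table) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: table[m.group(0).lower()], normalized)


expanded = expand_apostrophes("We\u2019ve seen it but can't stay")
```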

3.2 Text vectorization

To accomplish the task of sentiment analysis, the sentences in each review need to be represented as a feature vector. In NLP, one of the popular feature vector representation schemes is the BoW [73] representation, where the text is considered as a bag of words. In BoW, the ordering of the words and the grammar are disregarded. The TF-IDF technique builds upon the BoW scheme and additionally weights the words in the text depending upon their occurrences. We have modified the basic TF-IDF scheme and then applied the k-best selection method. The first component of the TF-IDF scheme is TF, the term frequency, which is the frequency of occurrence of a term in a document. Term frequency generally tends to be higher for a longer document. To mitigate the effect of document length, the term frequency of a term in a document is normalized by the sum of the term frequencies of all the terms in the document. This normalized term frequency is the local weighting factor (LWF), as defined by Eq. (1), where \(lwf_{m,n}\) is the LWF of the mth term of the vocabulary (\(t_{m}\)) corresponding to the nth document (\(d_{n}\)) and \(f_{m,n}\) is the frequency of occurrence of \(t_{m}\) in \(d_{n}\). The second component of the TF-IDF scheme is IDF, the inverse document frequency. It indicates the importance of the term in the whole corpus and is computed by dividing the total number of documents by the number of documents in which the term occurs and taking the logarithm of the quotient. The inverse document frequency is the global weighting factor (GWF). The modified new GWF (MNGWF) has two modules: max GWF (MGWF) and smooth GWF (SGWF). The MGWF and the SGWF are specified in Eqs. (2) and (3) respectively, where \(mgwf_{m}\) and \(sgwf_{m}\) are the MGWF and SGWF of \(t_{m}\), \(f_{m}\) indicates the number of documents in which \(t_{m}\) occurs, and N is the total number of documents. The MNGWF is the sum of MGWF and SGWF, and its final simplified form is given in Eq. (4). Finally, the modified term frequency-inverse document frequency (MTFIDF) is the product of LWF and MNGWF, as shown in Eq. (5). In the sentiment categorization task, each review is considered as a document. After preprocessing of the data, MTFIDF is computed for each constituent of the individual review. Afterward, the k-best selection method is adopted for text vectorization. The whole mechanism is depicted in Algorithm 3.

$$ \begin{array}{@{}rcl@{}} lwf_{m,n}&=&\frac{f_{m,n}}{{\sum}_{p}f_{p,n}} \end{array} $$
(1)
$$ \begin{array}{@{}rcl@{}} mgwf_{m}&=&\log\left(\frac{\max_{\{m^{\prime} \in d\}}f_{m^{\prime}}}{1+f_{m}}\right) \end{array} $$
(2)
$$ \begin{array}{@{}rcl@{}} sgwf_{m}&=&\log\left(\frac{N}{1+f_{m}}\right) \end{array} $$
(3)

$$ \begin{array}{@{}rcl@{}} mngwf_{m}&=&mgwf_{m}+sgwf_{m}\\ &=&\log\left(\frac{\max_{\{m^{\prime} \in d\}}f_{m^{\prime}}}{1+f_{m}}\right)+\log\left(\frac{N}{1+f_{m}}\right)\\ &=&\log\left(\frac{\max_{\{m^{\prime} \in d\}}f_{m^{\prime}}\,N}{(1+f_{m})^{2}}\right)\\ &=&\log\left(\max_{\{m^{\prime} \in d\}}f_{m^{\prime}}\right)+\log N-\log(1+f_{m})^{2}\\ &=&\log\left(\max_{\{m^{\prime} \in d\}}f_{m^{\prime}}\right)+\log N-2\log(1+f_{m}) \end{array} $$

Ignoring \(\log N\), as it is fixed for a given set of documents, gives

$$ \begin{array}{@{}rcl@{}} mngwf_{m}&=&\log\left(\max_{\{m^{\prime} \in d\}}f_{m^{\prime}}\right)-2\log(1+f_{m}) \end{array} $$
(4)
$$ \begin{array}{@{}rcl@{}} mtfidf_{m,n}&=&\frac{f_{m,n}}{{\sum}_{p}f_{p,n}}\left[\log\left(\max_{\{m^{\prime} \in d\}}f_{m^{\prime}}\right)-2\log(1+f_{m})\right] \end{array} $$
(5)
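
As a quick numerical illustration of Eq. (5), consider hypothetical counts that are not taken from the experiments: a document of 5 tokens whose most widespread term occurs in 8 documents. A rare term occurs twice in the document and in only one document of the corpus (\(f_{m}=1\)); a common term occurs once in the document and in 8 documents (\(f_{m}=8\)). Assuming natural logarithms (the base is not stated above), the two scores are

$$ \begin{array}{@{}rcl@{}} mtfidf_{rare}&=&\frac{2}{5}\left[\log 8-2\log 2\right]=\frac{2}{5}\log 2\approx 0.277\\ mtfidf_{common}&=&\frac{1}{5}\left[\log 8-2\log 9\right]\approx \frac{1}{5}(-2.315)\approx -0.463 \end{array} $$

so the rare term is ranked above the common one. The example also shows that, once \(\log N\) is dropped, the score can be negative; only the relative ordering matters for the subsequent selection.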
Algorithm 3 MTIDF_K_BEST_SELECTION.
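
The scoring of Eq. (5) followed by k-best selection can be sketched as below. Several points are assumptions of this sketch rather than details confirmed by the algorithm listing: the maximum in Eq. (2) is read as the largest document frequency among the terms of the current document, natural logarithms are used, the k highest-scoring distinct terms are kept in order of first appearance, and missing slots are padded with `None`.

```python
import math
from collections import Counter


def mtfidf_scores(docs):
    """Return one {term: MTFIDF score} dict per document (Eq. 5).

    docs is a list of token lists, one list per review.
    """
    # f_m: number of documents in which term m occurs.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    all_scores = []
    for doc in docs:
        counts = Counter(doc)          # f_{m,n}
        total = sum(counts.values())   # sum_p f_{p,n}
        scores = {}
        if counts:
            # Assumed reading of max_{m' in d} f_{m'}.
            max_df = max(doc_freq[t] for t in counts)
            for term, count in counts.items():
                lwf = count / total
                mngwf = math.log(max_df) - 2 * math.log(1 + doc_freq[term])
                scores[term] = lwf * mngwf
        all_scores.append(scores)
    return all_scores


def k_best(doc, scores, k):
    """Keep the k best-scoring distinct terms of doc, padded with None."""
    top = set(sorted(scores, key=scores.get, reverse=True)[:k])
    kept = []
    for term in doc:
        if term in top and term not in kept:
            kept.append(term)
    return kept + [None] * (k - len(kept))


docs = [["good", "good", "movie", "plot"],
        ["bad", "movie"],
        ["movie", "boring", "plot"]]
scores = mtfidf_scores(docs)
review_vector = k_best(docs[0], scores[0], 2)
```

In this toy corpus "movie" occurs in every review, so it receives the lowest score in the first review and is dropped when k = 2.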

3.3 Embedding of feature

The set \(R_{V}\) computed by Algorithm 3 is a set of feature vectors, where each feature vector corresponds to an individual review and each component of the feature vector represents a feature of that review. In feature embedding, each component of the feature vector is converted to a floating-point vector so that further neural network based processing can be carried out. Google’s Word2Vec model [22], pre-trained on the Google News dataset of nearly 100 billion words, is used for embedding the features. The mechanism of the embedding process is depicted in Algorithm 4. Here, the vectorized review set \(R_{V}\) is passed as input to the algorithm, and it produces the embedded review set \(R_{eb}\) as output. Each component of an individual review is embedded as a 300-dimensional vector, so \(R_{{eb}_{p}}\) is a matrix with \(R_{{eb}_{p}} \in \mathbb{R}^{k\times 300}\). If a review vector has \(n_{c}\) non-null components with \(n_{c} < k\), then the matrix of dimension \(n_{c}\times 300\) is concatenated with a null matrix of dimension \((k-n_{c})\times 300\) to obtain the full matrix of dimension \(k\times 300\). \(R_{eb}\) is the complete embedding set that contains every \(R_{{eb}_{p}}\).

Algorithm 4 REVIEW_EMBEDDING.
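
The padding scheme described above can be sketched without loading the actual pre-trained model. In the sketch below, `make_toy_embeddings` is a hypothetical stand-in that returns random 300-dimensional vectors instead of Word2Vec vectors, and treating out-of-vocabulary terms like null components is an assumption.

```python
import random

EMBED_DIM = 300  # dimensionality of the pre-trained vectors


def make_toy_embeddings(vocab, dim=EMBED_DIM, seed=0):
    """Hypothetical lookup table with random vectors (not real Word2Vec)."""
    rng = random.Random(seed)
    return {word: [rng.uniform(-1.0, 1.0) for _ in range(dim)]
            for word in vocab}


def embed_review(review_vector, embeddings, k, dim=EMBED_DIM):
    """Map a k-best review vector to a k x dim matrix (list of rows).

    Null (None) components, and terms missing from the lookup table,
    are replaced by all-zero rows appended at the end.
    """
    rows = [embeddings[term] for term in review_vector[:k]
            if term is not None and term in embeddings]
    rows += [[0.0] * dim for _ in range(k - len(rows))]
    return rows


table = make_toy_embeddings(["good", "plot"])
matrix = embed_review(["good", "plot", None, None], table, k=4)
```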

3.4 Deep network

The deep network stage consists of three phases, namely the Convolutional Network, the Long Short Term Memory Network, and the Densely Connected Network, which are discussed below. The architecture of the deep network is depicted in the box section of Fig. 1.

  1. (i)

    Convolutional network: The Convolutional Network consists of a sequence of convolution operations, non-linear activation and max-pooling. The embedded review set Reb is given as input to the convolutional network and it is passed through the series of sub-operations of the convolutional network. The convolution, non-linear activation and max-pooling operations are discussed in the following subsections. The block diagram of sequential processes in the convolutional network is depicted in the upper fragment of the deep network section in Fig. 1.

    One of the most important components of a convolutional network is the convolution operation, which helps extract features from local regions. Using a convolution kernel reduces the number of parameters that have to be learned in the network by sharing weights. The kernel slides over the input, and the portion of the input within the window is multiplied point-wise with the w1 × w2 kernel. The products are summed to get a single representative value for the corresponding input window. Here, the convolution operation is performed with a stride value of 1, which means that the kernel is shifted by one position over the input in each step. For undefined values, zero padding is used so that the convolved output has the same size as the input. The output feature map after the convolution operation is expressed by Eq. (6), where f is the output feature map after convolution, I is the input and W is the kernel of size w1 × w2.

    $$ \begin{array}{@{}rcl@{}} f(x,y)=\sum\limits_{p=0}^{w_{1}-1}\sum\limits_{q=0}^{w_{2}-1}I(x-p,y-q)W(p,q) \end{array} $$
    (6)

    The convolved output is passed through a non-linear function that returns its input if the input is greater than zero and returns zero otherwise. This activation function is popularly known as the Rectified Linear Unit (ReLU). The ReLU activation function reduces the vanishing gradient problem. The mathematical expression of ReLU is given in Eq. (7).

    $$ \begin{array}{@{}rcl@{}} a(x)=max(0,x) \end{array} $$
    (7)

    One of the popular sub-sampling methods in a convolutional network is max pooling, which selects the most dominant feature from a region. It helps in dimensionality reduction while protecting the most valuable information. After max pooling, the number of parameters of the network is reduced, which assists in mitigating the overfitting problem. In max pooling, a window of size w1 × w2 slides over the input according to the stride size, and the maximum value of the input is selected from the corresponding window. The max-pooling operation is given in Eq. (8), where I denotes the input to the pooling operation and mw([x,x + w1 − 1][y,y + w2 − 1]) is the output after the max-pooling operation on the input window from position x to (x + w1 − 1) on the vertical axis and y to (y + w2 − 1) on the horizontal axis.

    $$ \begin{array}{@{}rcl@{}} m_{w}([x,x+w_{1}-1][y,y+w_{2}-1])&=&\max\limits_{1 \le i \le w_{1}}\ \max\limits_{1 \le j \le w_{2}} I(x+i-1,y+j-1) \end{array} $$
    (8)
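
The three operations of the convolutional network, Eqs. (6) to (8), can be sketched with plain Python lists to stay dependency-free. This is a sketch under stated assumptions: out-of-range input positions in Eq. (6) are read as zeros (the zero padding mentioned above), and the pooling stride defaults to the window size, which the text does not specify.

```python
def conv2d_same(I, W):
    """Eq. (6): f(x, y) = sum_p sum_q I(x - p, y - q) * W(p, q).

    Stride is 1; positions outside I count as zero, so the output
    has the same size as the input.
    """
    rows, cols = len(I), len(I[0])
    w1, w2 = len(W), len(W[0])

    def at(r, c):
        return I[r][c] if 0 <= r < rows and 0 <= c < cols else 0.0

    return [[sum(at(x - p, y - q) * W[p][q]
                 for p in range(w1) for q in range(w2))
             for y in range(cols)]
            for x in range(rows)]


def relu(F):
    """Eq. (7): a(x) = max(0, x), applied element-wise."""
    return [[max(0.0, v) for v in row] for row in F]


def max_pool(F, w1, w2, stride=None):
    """Eq. (8): maximum over each w1 x w2 window of F.

    Assumption: the stride defaults to the window size.
    """
    s1, s2 = (w1, w2) if stride is None else stride
    rows, cols = len(F), len(F[0])
    return [[max(F[x + i][y + j] for i in range(w1) for j in range(w2))
             for y in range(0, cols - w2 + 1, s2)]
            for x in range(0, rows - w1 + 1, s1)]


feature_map = conv2d_same([[1, 2], [3, 4]], [[1, 0], [0, -1]])
pooled = max_pool([[1, 2, 3, 4], [5, 6, 7, 8]], 2, 2)
```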
  2. (ii)

    Long short term memory network: A vanilla RNN preserves historical dependency, but vanishing and exploding gradients create problems. The Long Short Term Memory (LSTM) is an RNN variant that tries to mitigate these problems while memorizing past dependencies. The output of the Convolutional Network is flattened and then fed to the LSTM network as input. The LSTM network, which has two layers of 64 neurons each, is shown in the middle fragment of the deep network section in Fig. 1. The LSTM computes the output \(z_{t}\) at any time instant t from the current network input \(x_{t}\) and the previous internal state \(s_{t-1}\) at time instant t − 1, as shown in Eqs. (9) to (15). In those equations, ⊙ indicates the Hadamard product, σ(·) refers to the sigmoid function, and tanh(·) denotes the hyperbolic tangent function. On top of the basic RNN structure, the LSTM contains gating units that control the flow of information. First, there is the input gate \(i_{t}\): the network input \(x_{t}\) at time instant t is multiplied by the weight \(W_{x_{i}}\), the previous internal state \(s_{t-1}\) is multiplied by the weight \(W_{s_{i}}\), and the sigmoid function is applied to the sum of the two products and the input gate bias \(b_{i}\), as shown in Eq. (9). After the input gate, the forget gate \(f_{t}\) decides how much previous information is to be ignored and how much historical information is to be recalled for future processing. The expression for the forget gate output is similar to that of the input gate output; the only difference lies in the weights. Here, \(x_{t}\) is multiplied by the weight \(W_{x_{f}}\) and \(s_{t-1}\) by the weight \(W_{s_{f}}\). The products are summed with the forget gate bias \(b_{f}\), and the sigmoid function is applied to obtain the forget gate output given in Eq. (10). Next comes the modulation gate \(g_{t}\). Its output has the same form as that of the input and forget gates, except that it uses its own weights and the hyperbolic tangent function, as shown in Eq. (11). Then the current memory content \(c_{t}\) at time instant t is computed from the memory content at the previous time instant and the input gate, forget gate and modulation gate outputs using Eq. (12). Subsequently, the output gate \(o_{t}\) is computed as expressed in Eq. (13). Next, the current internal state \(s_{t}\) is calculated from the output gate \(o_{t}\) and the current memory content \(c_{t}\) using Eq. (14), and finally the output of the network is computed from the current internal state \(s_{t}\) following Eq. (15).

    $$ \begin{array}{@{}rcl@{}} {}i_{t}&=&\sigma (W_{x_{i}}x_{t}+W_{s_{i}}s_{t-1}+b_{i}) \end{array} $$
    (9)
    $$ \begin{array}{@{}rcl@{}} f_{t}&=&\sigma (W_{x_{f}}x_{t}+W_{s_{f}}s_{t-1}+b_{f}) \end{array} $$
    (10)
    $$ \begin{array}{@{}rcl@{}} {\kern8pt}g_{t}&=&\tanh (W_{x_{g}}x_{t}+W_{s_{g}}s_{t-1}+b_{g}) \end{array} $$
    (11)
    $$ \begin{array}{@{}rcl@{}} {}c_{t}&=&g_{t} \odot i_{t}+c_{t-1} \odot f_{t} \end{array} $$
    (12)
    $$ \begin{array}{@{}rcl@{}} {}o_{t}&=&\sigma (W_{x_{o}}x_{t}+W_{s_{o}}s_{t-1}+b_{o}) \end{array} $$
    (13)
    $$ \begin{array}{@{}rcl@{}} {}s_{t}&=&o_{t} \odot \tanh (c_{t}) \end{array} $$
    (14)
    $$ \begin{array}{@{}rcl@{}} {}z_{t}&=&\sigma (W_{s_{z}}s_{t}+b_{z}) \end{array} $$
    (15)
  3. (iii)

    Densely connected network: The Densely Connected Network (DCN) is essentially a multilayer perceptron. The output of the LSTM network is passed to the DCN as input. The first layer of the DCN has 64 neurons, the next layer has 32 neurons, and the final layer has c neurons, where c is the number of output classes. The softmax function then decides the sentiment label. The DCN block is shown in the lower fragment of the deep network section in Fig. 1. The output of an individual unit within a layer is computed using Eq. (16), and the softmax operation at the final layer is shown in Eq. (17). Here \({O^{l}_{q}}\) represents the output of the qth node of the current layer and \(O^{l-1}_{p}\) is the output of the pth node in the previous layer. \(W_{pq}\) is the weight of the link between the pth node of the previous layer and the qth node of the current layer. \({b^{l}_{q}}\) indicates the bias associated with the qth node of the current layer. \({O^{f}_{i}}\) indicates the output of the ith node of the final decision-making layer.

    $$ \begin{array}{@{}rcl@{}} {O^{l}_{q}}&=&\sigma(\sum\limits_{p}O^{l-1}_{p}W_{pq}+{b^{l}_{q}}) \end{array} $$
    (16)
    $$ \begin{array}{@{}rcl@{}} softmax({O^{f}_{i}})&=&\frac{e^{{O^{f}_{i}}}}{\sum\limits_{j}e^{{O^{f}_{j}}}} \end{array} $$
    (17)
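
The LSTM and DCN computations in Eqs. (9) to (17) can be sketched in plain Python as follows. This is only an illustrative sketch, not the Keras implementation used in the experiments: the helper names (`matvec`, `add`, `hadamard`, `lstm_step`, `lstm_output`, `dense`) and the layout of the parameter dictionary are assumptions made for this example, and Eq. (13) is read as defining the output gate \(o_{t}\).

```python
import math

def matvec(W, v):
    """Multiply a row-major matrix W (a list of rows) by a vector v."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def add(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(parts) for parts in zip(*vectors)]

def hadamard(a, b):
    """Element-wise (Hadamard) product of two vectors."""
    return [p * q for p, q in zip(a, b)]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def tanh(v):
    return [math.tanh(a) for a in v]

def lstm_step(x_t, s_prev, c_prev, p):
    """One LSTM time step; `p` maps a gate name to (W_x, W_s, b) (hypothetical layout)."""
    def pre(name):
        W_x, W_s, b = p[name]
        return add(matvec(W_x, x_t), matvec(W_s, s_prev), b)

    i_t = sigmoid(pre("i"))                               # Eq. (9), input gate
    f_t = sigmoid(pre("f"))                               # Eq. (10), forget gate
    g_t = tanh(pre("g"))                                  # Eq. (11), modulation gate
    c_t = add(hadamard(g_t, i_t), hadamard(c_prev, f_t))  # Eq. (12), memory content
    o_t = sigmoid(pre("o"))                               # Eq. (13), output gate
    s_t = hadamard(o_t, tanh(c_t))                        # Eq. (14), internal state
    return s_t, c_t

def lstm_output(s_t, W_sz, b_z):
    """Eq. (15): network output computed from the current internal state."""
    return sigmoid(add(matvec(W_sz, s_t), b_z))

def dense(prev, W, b):
    """Eq. (16): W[p][q] links node p of the previous layer to node q."""
    return sigmoid([
        sum(prev[p] * W[p][q] for p in range(len(prev))) + b[q]
        for q in range(len(b))
    ])

def softmax(o):
    """Eq. (17); subtracting max(o) only improves numerical stability."""
    m = max(o)
    exps = [math.exp(a - m) for a in o]
    total = sum(exps)
    return [e / total for e in exps]
```

Calling `lstm_step` once per time step and carrying `s_t` and `c_t` forward reproduces the recurrence; the predicted label would then be the index of the largest `softmax` output.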

4 Experimental observation and analysis

This section first describes the experimental setup and introduces the datasets used for comparative performance analysis. The baseline models employed for comparison are described next, followed by a brief discussion of the performance metrics. Finally, the results of the proposed hybridized framework are compared with those of the baseline models on each metric.

4.1 Collection of dataset and setup of experiment

To compare the results of the proposed deep hybrid sentiment analysis framework with the baselines, seven datasets have been gathered from various sources: the Amazon Alexa dataset (AN) [56], the ETSY dataset (EY) [1], the “Big Basket” app reviews dataset (BB) [65], the Facebook dataset (FB) [69], the Financial News dataset (FN) [58], the Twitter dataset (TT) [2] and the Wine dataset (WN) [51]. The experiments have been carried out on an HP PC with a Core i7 Pentium processor, 32 GB of RAM and an NVIDIA GeForce GTX 1080 GPU. Python has been used for development, with a main focus on two packages: NLTK for text processing and Keras with a TensorFlow backend for the implementation of the deep network. The deep network has been trained using the backpropagation algorithm, and the dropout technique [60] is employed with a dropout rate of 0.1. With dropout, randomly chosen neurons in the network are temporarily dropped during training with some probability: in the forward pass, the activations of the dropped neurons are not passed on to the downstream neurons, and in the backward pass, no weight updates are applied for those neurons. Dropout is an important regularization technique that helps to prevent the network from overfitting.
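
The dropout mechanism described above can be illustrated with a minimal sketch. The function names are hypothetical, and the rescaling of the surviving activations by 1/(1 - rate) (so-called inverted dropout) is an assumption that is not stated in the text; it mirrors what common deep learning libraries do during training.

```python
import random

def dropout_forward(activations, rate, rng):
    """Drop each activation with probability `rate` (training mode only).

    Assumption: surviving activations are scaled by 1 / (1 - rate) so that
    their expected value is unchanged (inverted dropout). Requires rate < 1.
    """
    keep = 1.0 - rate
    mask = [1.0 / keep if rng.random() >= rate else 0.0 for _ in activations]
    dropped = [a * m for a, m in zip(activations, mask)]
    return dropped, mask

def dropout_backward(upstream_grad, mask):
    """Dropped neurons receive zero gradient, so their weights are not updated."""
    return [g * m for g, m in zip(upstream_grad, mask)]

# Example with the dropout rate of 0.1 reported in the text.
rng = random.Random(0)
hidden = [0.2, 0.7, 0.1, 0.9]
dropped, mask = dropout_forward(hidden, 0.1, rng)
grads = dropout_backward([1.0] * len(hidden), mask)
```

At inference time no neurons are dropped, so the activations would be passed through unchanged.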

4.2 Baselines

To analyze performance, several baseline classification models have been used, including IB-k, J48, JRip, NB, PART, RF, Logistic, SMO, CNN and LSTM, which are implemented using the Weka [68] toolkit. A brief introduction to the baseline models is given below.

  1. (i)

    IB-k is an instance-based classification model based on k-nearest neighbors. IB-k is an acceptable choice for classification when very little is known about the data distribution.

  2. (ii)

    J48 is a very popular decision tree-based classifier where entropy-based measure information gain is used to select attributes. The attribute giving the highest information gain is selected for further splitting.

  3. (iii)

    JRip learns propositional rules and reduces the error through incremental pruning.

  4. (iv)

    The NB classifier applies Bayes' theorem under the assumption that all the features of an instance are independent of one another.

  5. (v)

    PART follows the divide and conquer strategy for constructing a rule and further accomplishes the task of classification based on that rule base.

  6. (vi)

    RF is an ensemble classifier where the decisions of a large number of decision trees are combined to give the final label.

  7. (vii)

    The logistic classifier follows the logistic regression method and finally computes the posterior probability with the help of the sigmoid function.

  8. (viii)

    SMO is the acronym for Sequential Minimal Optimization technique, which is used for finding solutions to quadratic programming problems that arise during the training of Support Vector Machines (SVM).

  9. (ix)

    Convolutional Neural Network is popularly termed CNN. It is a neural network based classifier capable of hierarchical extraction of dominant features from the data.

  10. (x)

    LSTM is the acronym for Long Short Term Memory. This neural network based model contains memory cells that help preserve historical dependency in data.
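
As an illustration of the attribute selection described for J48, the following sketch computes the entropy-based information gain of a categorical attribute. It follows the description in the list above and is not Weka's implementation; the function names and the toy data are made up for this example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())

def information_gain(values, labels):
    """Entropy reduction obtained by splitting `labels` on a categorical attribute."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    # Weighted entropy of the subsets produced by the split.
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Toy data (hypothetical): does a review contain the word "great"?
labels = ["pos", "pos", "neg", "neg"]
contains_great = ["yes", "yes", "no", "no"]
gain = information_gain(contains_great, labels)
```

A decision tree learner of this kind would evaluate the gain of every candidate attribute and split on the one with the largest value.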

4.3 Metrics of performance

For the analysis of performance, six metrics have been considered. The first is accuracy: the accuracy for each class is computed using Eq. (18), and the weighted average is then calculated by Eq. (19), where \(ac_{i}\) is the set of instances that belong to class i and \(pc_{i}\) is the set of instances that are predicted as class i. The next metric is precision. As with accuracy, precision is first computed for each class with Eq. (20), and the weighted average is then taken using Eq. (21). Recall for each class is measured using Eq. (22), and the overall recall is computed as the weighted average in Eq. (23). The f-measure for each class is the harmonic mean of the precision and recall for that class, as given in Eq. (24), and its weighted mean is computed by Eq. (25). The next metric is the Area Under the Curve (AUC), that is, the area under the Receiver Operating Characteristic (ROC) curve. For each class, the ROC curve is drawn by plotting the True Positive Rate (TPR) on the y-axis against the False Positive Rate (FPR) on the x-axis, and the area under that curve is calculated. The overall AUC is the weighted average of the per-class AUC values. A higher AUC indicates a better model. Finally, a statistical measure named Cohen’s kappa coefficient [12] is used for the evaluation of the model.

$$ \begin{array}{@{}rcl@{}} accuracy_{i}=\frac{|\{e:e \in ac_{i} \land e \in pc_{i} \}|}{{\sum}_{j=1}^{c}|ac_{j}|} \end{array} $$
(18)
$$ \begin{array}{@{}rcl@{}} {}accuracy=\sum\limits_{i=1}^{c}\frac{|ac_{i}|*accuracy_{i}}{{\sum}_{j=1}^{c}|ac_{j}|} \end{array} $$
(19)
$$ \begin{array}{@{}rcl@{}} precision_{i}=\frac{|\{e:e \in ac_{i} \land e \in pc_{i} \}|}{|\{e:e \in ac_{i} \land e \in pc_{i} \}|+|\{e:e \notin ac_{i} \land e \in pc_{i} \}|} \end{array} $$
(20)
$$ \begin{array}{@{}rcl@{}} {}precision=\sum\limits_{i=1}^{c}\frac{|ac_{i}|*precision_{i}}{{\sum}_{j=1}^{c}|ac_{j}|} \end{array} $$
(21)
$$ \begin{array}{@{}rcl@{}} {\kern15pt}recall_{i}=\frac{|\{e:e \in ac_{i} \land e \in pc_{i} \}|}{|\{e:e \in ac_{i} \land e \in pc_{i} \}|+|\{e:e \in ac_{i} \land e \notin pc_{i} \}|} \end{array} $$
(22)
$$ \begin{array}{@{}rcl@{}} {}recall=\sum\limits_{i=1}^{c}\frac{|ac_{i}|*recall_{i}}{{\sum}_{j=1}^{c}|ac_{j}|} \end{array} $$
(23)
$$ \begin{array}{@{}rcl@{}} {}f-measure_{i}=\frac{2*precision_{i}*recall_{i}}{precision_{i}+recall_{i}} \end{array} $$
(24)
$$ \begin{array}{@{}rcl@{}} {}f-measure=\sum\limits_{i=1}^{c}\frac{|ac_{i}|*f-measure_{i}}{{\sum}_{j=1}^{c}|ac_{j}|} \end{array} $$
(25)
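
The class-weighted precision, recall and f-measure of Eqs. (20) to (25) can be sketched as follows. The function name `weighted_metrics` is hypothetical, and returning 0 when a denominator is zero is a convention assumed here, since the equations do not specify that case.

```python
def weighted_metrics(actual, predicted):
    """Class-weighted precision, recall and f-measure, Eqs. (20) to (25)."""
    n = len(actual)
    totals = {"precision": 0.0, "recall": 0.0, "f_measure": 0.0}
    for c in sorted(set(actual)):
        pairs = list(zip(actual, predicted))
        tp = sum(1 for a, p in pairs if a == c and p == c)  # e in ac_i and e in pc_i
        fp = sum(1 for a, p in pairs if a != c and p == c)  # e not in ac_i, e in pc_i
        fn = sum(1 for a, p in pairs if a == c and p != c)  # e in ac_i, e not in pc_i
        precision = tp / (tp + fp) if tp + fp else 0.0      # Eq. (20)
        recall = tp / (tp + fn) if tp + fn else 0.0         # Eq. (22)
        if precision + recall:
            f_measure = 2 * precision * recall / (precision + recall)  # Eq. (24)
        else:
            f_measure = 0.0
        weight = (tp + fn) / n  # |ac_i| divided by the total number of instances
        totals["precision"] += weight * precision           # Eq. (21)
        totals["recall"] += weight * recall                 # Eq. (23)
        totals["f_measure"] += weight * f_measure           # Eq. (25)
    return totals

# Toy example with two classes (made-up labels).
actual = ["pos", "pos", "neg", "neg"]
predicted = ["pos", "neg", "neg", "neg"]
scores = weighted_metrics(actual, predicted)
```

Because each class is weighted by its share of the actual instances, frequent classes dominate the overall scores, which matters when the sentiment classes are imbalanced.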

4.4 Performance analysis

The proposed hybrid framework is compared with the baseline models with respect to the performance metrics described above.

  1. (i)

    Accuracy analysis: Accuracy is the percentage of test instances that are classified correctly. The accuracy of the various models on the different datasets is compared in Table 1. For the Amazon dataset, LSTM and RF show the 3rd highest and 2nd highest accuracy of 82.3% and 83.2%, respectively, and the hybridized framework improves the accuracy to 84.4%. For the ETSY dataset, CNN and LSTM reach 74.2% and 74.7%, and the hybridized method improves on both with 75.0%. For the Big Basket dataset, CNN achieves 62.6% and RF 62.8%; the hybridized approach performs almost the same as RF, with 62.9%. For the Facebook dataset, RF and CNN give 57.8% and 59.3%, respectively, whereas the hybridized model shows the highest accuracy of 62.1%. For the Finance dataset, the 3rd highest and 2nd highest accuracies are given by the Logistic model and LSTM, with 78.6% and 79.4%, respectively, and the hybridized framework improves further to 80.3%. For the Twitter dataset, CNN and LSTM have classification accuracies of 73.8% and 74.4%, respectively, and the hybridized approach reaches 76.9%. For the Wine dataset, RF and LSTM have 84.9% and 86.8% accuracy, and the hybridized framework improves this to 89.2%. The proposed hybridized approach achieves the highest accuracy for all the datasets, as shown in boldface in Table 1.

  2. (ii)

    Precision analysis: Precision is the proportion of the instances predicted as a class that actually belong to that class. The precision of the various models on the different datasets is compared in Table 2. For the Amazon dataset, CNN and LSTM have the same precision of .810, while RF and the hybridized approach share the highest precision of .835. For the ETSY dataset, LSTM shows the 3rd highest and RF the highest precision, with .712 and .722, respectively; the proposed hybridized framework is slightly below RF with .715. For the Big Basket dataset, CNN and RF have precision values of .622 and .629, respectively, and the hybrid model improves further to .637. For the Facebook dataset, CNN and SMO show the 3rd highest and 2nd highest precision of .502 and .504, respectively, and the hybridized approach improves on them with .508. For the Finance dataset, the Logistic model and LSTM give precision values of .774 and .788, respectively, and the proposed hybridized framework improves further to .806. For the Twitter dataset, the Logistic model and LSTM have precision values of .734 and .788, respectively; the hybridized model shows a clear improvement with .819. For the Wine dataset, LSTM gives the 3rd highest and RF the highest precision, with .808 and .850, respectively. The proposed hybridized method, with .809, is less precise than RF but slightly better than LSTM. The highest precision values for the different datasets are marked in boldface in Table 2.

  3. (iii)

    Recall analysis: Recall is the proportion of the actual instances of a class that are identified correctly. The recall of the various models on the different datasets is compared in Table 3. For the Amazon dataset, CNN and LSTM give recall values of .814 and .823, respectively, and the hybridized framework improves on them with .841. For the ETSY dataset, the Logistic model and LSTM have recall values of .741 and .743, respectively, and the proposed hybridized approach achieves a higher value of .750. For the Big Basket dataset, CNN gives the 3rd highest and RF the highest recall, with .598 and .626, respectively; the hybridized model is slightly below RF with .624. For the Facebook dataset, RF and CNN give recall values of .579 and .602, respectively, and the proposed hybridized framework improves on them with .620. For the Finance dataset, the Logistic model gives the 3rd highest and LSTM the 2nd highest recall, with .786 and .791, respectively, and the hybridized model achieves the highest value of .803. For the Twitter dataset, CNN and LSTM have recall values of .737 and .744, respectively, and the hybridized approach improves further to .769. For the Wine dataset, the Logistic model shows the 3rd highest recall of .848 and LSTM the 2nd highest of .868, while the proposed hybridized framework achieves the highest value of .892. The highest recall values for the different datasets are marked in boldface in Table 3.

  4. (iv)

    F-measure analysis: The f-measure of a class is the harmonic mean of the precision and recall of that class. The f-measure of the various models on the different datasets is compared in Table 4. For the Amazon dataset, LSTM and RF give f-measure values of .819 and .825, respectively, and the hybridized framework performs better with .836. For the ETSY dataset, CNN and LSTM have f-measure values of .718 and .723, respectively, and the proposed hybridized method achieves .728. For the Big Basket dataset, CNN and RF give the 3rd highest and 2nd highest f-measure values of .614 and .627, respectively; the hybridized approach shows the highest value, a slight improvement to .629. For the Facebook dataset, SMO and CNN give f-measure values of .515 and .524, respectively, and the proposed hybridized framework improves further to .526. For the Finance dataset, the Logistic model and LSTM have f-measure values of .778 and .789, respectively, and the hybridized model improves on them with .804. For the Twitter dataset, CNN and LSTM give f-measure values of .734 and .760, respectively, and the hybridized method reaches .772. For the Wine dataset, LSTM and RF show the 3rd highest and 2nd highest f-measure values of .821 and .841, respectively, and the hybridized approach achieves .842. The proposed hybridized framework gives the highest f-measure values for all the datasets, which are marked in boldface in Table 4.

  5. (v)

    AUC analysis: The higher the area under the ROC curve, the better the model. The AUC of the various models on the different datasets is compared in Table 5. For the Amazon dataset, LSTM gives the 3rd highest and RF the highest AUC, with .862 and .904, respectively; the hybridized model is slightly below RF with .899. For the ETSY dataset, RF and the Logistic model give the 3rd highest and highest AUC of .875 and .881, respectively. The hybridized approach, with an AUC of .876, is slightly below the Logistic model and very close to RF. For the Big Basket dataset, CNN and RF have AUC values of .849 and .857, respectively, and the proposed hybridized framework achieves a higher AUC of .861. For the Facebook dataset, RF gives the highest AUC of .771 and the Logistic model the 2nd highest of .754, while the hybridized model gives the 3rd highest of .734. For the Finance dataset, the Logistic model and LSTM have AUC values of .843 and .852, respectively, and the hybridized approach improves on them with .867. For the Twitter dataset, CNN and LSTM show AUC values of .891 and .897, respectively, and the hybridized model improves further to .901. For the Wine dataset, LSTM and RF have the 3rd highest and 2nd highest AUC of .772 and .796, respectively, and the proposed hybridized framework achieves the highest AUC of .802. The highest AUC values for the different datasets are marked in boldface in Table 5.

  6. (vi)

    Cohen’s Kappa coefficient analysis: Cohen’s Kappa coefficient (ckce) is a popular statistical measure for assessing classification performance. The statistic measures inter-rater reliability for categorical data; here it measures the agreement between the predicted and the actual labels beyond what is expected by chance. The higher the value of ckce, the better the model. The ckce values of the various models on the different datasets are compared in Table 6. For the Amazon dataset, CNN and LSTM show ckce values of .578 and .582, and the hybridized framework improves on them with .686. For the ETSY dataset, the 3rd highest ckce of .513 is given by CNN and the 2nd highest of .536 by LSTM, while the hybridized model achieves the highest ckce of .618. For the Big Basket dataset, CNN and RF give ckce values of .532 and .534, respectively, and the proposed hybridized approach improves on them with .576. For the Facebook dataset, LSTM and CNN give ckce values of .416 and .438, respectively, and the hybridized framework improves further to .526. For the Finance dataset, the Logistic model and LSTM give ckce values of .587 and .602, respectively, and the hybridized model shows a higher value of .637. For the Twitter dataset, the Logistic model and LSTM have the 3rd highest and 2nd highest ckce of .589 and .598, respectively, and the proposed hybridized approach achieves the highest value of .625. For the Wine dataset, CNN and LSTM give ckce values of .268 and .277, respectively, and the hybridized framework improves the ckce to .293. For all the datasets, the proposed hybridized framework shows higher ckce values than the baseline models; the highest values are marked in boldface in Table 6.
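
For reference, Cohen’s Kappa can be computed from the actual and predicted labels as sketched below. The text does not spell out the formula; the sketch uses the standard definition (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. The function name is hypothetical.

```python
from collections import Counter

def cohens_kappa(actual, predicted):
    """Cohen's Kappa between actual and predicted labels (standard definition)."""
    n = len(actual)
    # Observed agreement: fraction of instances whose prediction is correct.
    p_o = sum(1 for a, p in zip(actual, predicted) if a == p) / n
    # Chance agreement: product of the marginal class frequencies, summed over classes.
    actual_counts = Counter(actual)
    predicted_counts = Counter(predicted)
    p_e = sum(actual_counts[c] * predicted_counts[c] for c in actual_counts) / (n * n)
    # Note: undefined (division by zero) when p_e == 1, i.e. a single class throughout.
    return (p_o - p_e) / (1.0 - p_e)

# Toy example (made-up labels).
actual = ["pos", "pos", "neg", "neg"]
predicted = ["pos", "neg", "neg", "neg"]
kappa = cohens_kappa(actual, predicted)
```

A value of 1 corresponds to perfect agreement and a value of 0 to agreement no better than chance, which is why Kappa can be low on strongly imbalanced data even when accuracy is high.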

Table 1 Accuracy analysis (in %)
Table 2 Precision analysis
Table 3 Recall analysis
Table 4 F-measure analysis
Table 5 AUC analysis
Table 6 Cohen’s Kappa coefficient analysis

5 Conclusion and future prospect

In this paper, a modified TF-IDF based integrated deep hybrid framework has been designed, deployed and evaluated. It comprises modified TF-IDF based text vectorization and a Google News corpus-based pre-trained embedding, followed by a deep network. With respect to accuracy, the proposed hybridized framework performs better than the baseline models for all datasets. With precision as the performance metric, the hybridized model shows the 2nd highest performance for the ETSY and Wine datasets and the highest precision for all other datasets. For recall, the hybridized framework gives the 2nd highest value for the Big Basket dataset and the highest value for all other datasets. For the f-measure, the hybridized approach produces the highest value for all the datasets. For AUC, the hybridized model shows the 2nd highest performance compared to the baseline models for the Amazon and ETSY datasets and the 3rd highest value for the Facebook dataset; for all other datasets, it shows the best AUC values. From the statistical perspective, the proposed hybridized approach outperforms all the baselines for all the datasets in terms of Cohen’s Kappa coefficient. In summary, the hybridized framework shows the best accuracy for all the datasets. Although it does not provide the best precision and recall in every case, it gives the best f-measure, the harmonic combination of precision and recall, for every dataset. From the overall performance summary, it is concluded that the proposed hybrid deep framework adds value in the area of Sentiment Analysis.

As future work, it can be investigated how the performance of the framework varies with the depth of the network. In a multiclass sentiment classification problem, the dataset may contain vague and uncertain data with overlapping classes, and the instances may be imbalanced across the classes. Such uncertainty, vagueness, class overlap and imbalance may be addressed with various machine learning and soft computing techniques, which form the further scope of this work.