1 Introduction

Due to the rapidly growing access to the Internet, a huge amount of data, such as text, images, and videos, is generated freely every day on social media platforms such as Facebook, Twitter, and Instagram. This data expresses social media users’ feelings, opinions, and experiences about events, topics, services, and products. Analyzing this data helps politicians make better decisions and helps consumers and service providers develop business strategies and improve their products and services [1]. As a result, Natural Language Processing (NLP) is getting more attention every day, and Sentiment Analysis (SA), a branch of NLP that concentrates on finding the sentiment orientation of a sentence or a document, is getting increasingly popular [2, 3].

SA approaches commonly fall into Machine Learning (ML) and knowledge-based approaches. Knowledge-based methods usually employ general-purpose lexicons or knowledge graphs to find the sentiment orientation of sentences [4, 5]. Both kinds of approaches have their pros and cons. Knowledge-based approaches do not need labeled data; they are computationally efficient and scalable. However, they cannot detect the labels correctly when the margin between the labels is too small, the samples are short, or the data is noisy, complex, and ambiguous. Additionally, their performance varies remarkably across domains [6, 7]. On the other hand, ML-based approaches require labeled data, and providing labeled data is commonly expensive and time-consuming. As a result, remarkable research attention has been paid to hybrid approaches that combine knowledge- and ML-based methods using a two-stage pipeline. First, a general-purpose lexicon is utilized to classify the samples. Second, a classifier is trained on these labeled samples to predict the sentiment of unseen data [8,9,10,11,12,13,14,15].

However, the hybrid approaches suffer from not correctly recognizing the sentiments of domain- or context-dependent words, since SA is a domain- or context-dependent task and the sentiment of words varies across domains. For instance, “low price” conveys a positive sentiment in a product review, but “low salary” conveys a negative sentiment in a job description. Similarly, the adjective “easy” has a positive sentiment in the phrase “easy to use” in a software product review but a negative sentiment in the phrase “easy game” in a computer game review. As a result, a domain-independent lexicon cannot recognize the sentiment of domain-dependent words. Additionally, an ML model trained on a specific domain cannot be utilized in another domain [16, 17].

In recent years, self-supervised methods, a kind of unsupervised method, have been employed to solve these problems. These methods automatically generate pseudo-labels from the contents of the samples: they borrow a list of sentiment words and their sentiments from lexicons, extract the sentiment words of the samples, and estimate the pseudo-labels. A classifier is then trained using these pseudo-labeled samples to predict the sentiments of unseen data. These methods widely use the numbers of positive and negative words in each sample to estimate the pseudo-labels. For instance, Sazzed et al. [9] and Rendon et al. [18] used the numbers of positive and negative words in each sample to estimate the pseudo-labels and trained SVM and LR classifiers with these pseudo-labeled data. However, these methods utilize domain-independent lexicons, so they cannot recognize the sentiment of domain-dependent words.

To address the previously described problem of not correctly recognizing the sentiments of domain- or context-dependent words, this paper proposes a novel self-supervised SA approach that does not need any labeled data and considers the context of samples to generate pseudo-labels. To do this, the proposed method offers a semantic-based pseudo-label generator that estimates the pseudo-label of each sample using contextual embeddings and the semantic similarity between the context of the sample and its corresponding sentiment words. It uses two newly introduced concepts: the Soft-Cosine Similarity [19] of a sample with its Positive words (SCSP) and the Soft-Cosine Similarity of a document with its Negative words (SCSN). The Soft-Cosine similarity is a text similarity measure that calculates the semantic similarity between two sentences, even if they share no common words but have the same meaning. The semantic-based pseudo-label generator converts all the words into dense feature vectors, calculates SCSP and SCSN, and estimates the pseudo-labels. Additionally, when SCSP and SCSN, together with the numbers of sentiment words, do not determine the label, two further concepts are calculated and used: the Cosine Similarity [20] of a document with its Positive words (CSP) and the Cosine Similarity of a document with its Negative words (CSN). Later on, a new method is proposed to find the samples with highly accurate pseudo-labels. Finally, a hybrid classifier, composed of a Convolutional Neural Network (CNN) [21] and a Gated Recurrent Unit (GRU) [22], is trained with these pseudo-labeled samples. To the best of our knowledge, this is the first time that semantic similarity, contextual embeddings, and the number of sentiment words are considered jointly to estimate pseudo-labels based on the context of samples.

The main contributions of this paper can be summarized as follows:

  • Proposing a self-supervised SA method that does not require labeled data.

  • Proposing a novel semantic-based pseudo-label generator that estimates the pseudo-labels of samples based on semantic similarity and the number of sentiment words.

  • Proposing a hybrid sentiment classifier composed of a CNN and a GRU.

The rest of the paper is structured as follows: Section 2 explains the research basics, which are the proposed method’s building blocks. Section 3, Literature Review, describes the previous similar methods, Section 4 explains the proposed method in detail, and Section 5 contains metrics and evaluation results. Finally, Section 6 consists of the conclusion and future work.

2 Basics of research

In this section, the building blocks of the proposed method, including document embedding, the Soft-Cosine similarity, CNN, and GRU, are explained in detail.

2.1 Document embedding

ML algorithms need their input to be represented as fixed-length feature vectors. Bag-of-words (BOW) and bag-of-n-grams are widely used methods to convert texts into fixed-length feature vectors. BOW does not capture word order, so sentences composed of the same words in different orders have the same representation. Bag-of-n-grams captures local word order, but it suffers from data sparsity and high dimensionality. Additionally, neither representation captures the distance between words correctly; in other words, they do not consider the semantics of words. For instance, the words “powerful”, “strong”, and “Paris” are all equally distant, while “powerful” is semantically closer to “strong” [23]. Paragraph Vector, or Doc2Vec, inspired by [24], is a framework that converts every sentence, paragraph, or text of varying length into a fixed-length feature vector. It concatenates or averages word vectors to predict the next word in a sentence, and it works in two different modes: the Distributed Memory Model of Paragraph Vectors (PV-DM) and the Distributed Bag of Words of Paragraph Vectors (PV-DBOW). Like the continuous bag-of-words model, the former is more complex but performs better; it either concatenates or averages all the word embeddings of a document to calculate the document embedding. Like skip-gram, the latter is simpler and usually leads to a higher error rate [23].
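As a concrete illustration, the sketch below builds both Doc2Vec variants with the gensim library and infers a fixed-length vector for an unseen text. The toy corpus and the parameter values (vector_size, window, epochs) are placeholders, not the settings used in this work.

```python
# Minimal Doc2Vec sketch with gensim; corpus and parameters are illustrative only.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = ["the movie was surprisingly good",
          "the plot was predictable and the acting was poor"]
tagged = [TaggedDocument(words=doc.split(), tags=[i]) for i, doc in enumerate(corpus)]

# PV-DM (dm=1): predicts a word from the paragraph vector plus its context words;
# dm_concat=1 concatenates the vectors, dm_mean=1 would average them instead.
pv_dm = Doc2Vec(tagged, vector_size=100, window=5, min_count=1, dm=1, dm_concat=1,
                epochs=50)

# PV-DBOW (dm=0): predicts words sampled from the paragraph, ignoring word order.
pv_dbow = Doc2Vec(tagged, vector_size=100, min_count=1, dm=0, epochs=50)

# Fixed-length document vector for an unseen text.
vec = pv_dm.infer_vector("an unpredictable but enjoyable film".split())
print(vec.shape)  # (100,)
```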

2.2 Cosine and soft-cosine similarity

Calculating the similarity of texts is essential in various tasks of NLP, such as question answering, plagiarism detection, SA, etc. The Cosine similarity [20] is widely used to measure the similarity between the texts. It calculates the Cosine of the angle between the feature vectors. To use the Cosine similarity, each text should be represented as a vector of feature values, and each feature corresponds to a dimension in the Vector Space Model (VSM). In the field of NLP, the most widely used features are words and n-grams. The Cosine similarity is calculated as below:

$$\begin{aligned} Cosine(A,B)=\frac{\sum _{i=1}^n A_i B_i}{\sqrt{\sum _{i=1}^n A_i^2} \times \sqrt{\sum _{i=1}^n B_i^2}} \end{aligned}$$
(1)

Where A and B are two vectors. However, the Cosine similarity does not consider the number of features the texts share or the number of zero features. In other words, it does not measure the similarity between the features of the vectors, which, in the context of NLP, are the words [25]. For instance, consider the following sentences:

a: a player will play a game they like to play

b: they play the game they like

The bag-of-words representations of a and b are the following vectors:

a = (2, 1, 1, 2, 1, 1, 1, 1, 0)

b = (0, 0, 0, 1, 1, 2, 1, 0, 1)

where the values indicate the number of occurrences of the words a, player, will, play, game, they, like, to, and the in each sentence. The Cosine similarity in (1) only rewards exact word matches: two sentences that share no common words obtain a similarity of zero, even if they have the same meaning [19]. Soft-Cosine similarity [19] is a semantic measure that additionally considers the similarity between the features and is calculated as below:

$$\begin{aligned} Soft-Cosine(A,B)=\frac{\sum _{i,j}^{N} s_{ij} a_i b_j}{\sqrt{\sum _{i,j}^{N} s_{ij} a_i a_j} \times \sqrt{\sum _{i,j}^{N} s_{ij} b_i b_j}} \end{aligned}$$
(2)
$$\begin{aligned} s_{ij} = cosine(e^i, e^j) \end{aligned}$$
(3)

Where \(e^i\) and \(e^j\) are the word vectors corresponding to features i and j. When the two features are completely unrelated, \(s_{ij}\) is 0. As shown in (2), the Soft-Cosine similarity takes the similarity between the features into account.
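To make the two measures concrete, the following NumPy sketch implements (1)-(3) directly. The random word vectors stand in for trained Word2Vec embeddings, so only the mechanics, not the resulting similarity values, are meaningful.

```python
# Illustrative NumPy implementation of (1)-(3); random vectors replace trained embeddings.
import numpy as np

def cosine(a, b):
    """Plain Cosine similarity between two feature vectors, as in (1)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def soft_cosine(a, b, word_vectors):
    """Soft-Cosine similarity, as in (2): a and b are bag-of-words vectors over the
    same vocabulary, and word_vectors[i] is the embedding of feature i."""
    n = len(a)
    # s[i, j] = Cosine(e^i, e^j), the feature-to-feature similarity matrix of (3).
    s = np.array([[cosine(word_vectors[i], word_vectors[j]) for j in range(n)]
                  for i in range(n)])
    return float(a @ s @ b / (np.sqrt(a @ s @ a) * np.sqrt(b @ s @ b)))

vocab = ["a", "player", "will", "play", "game", "they", "like", "to", "the"]
rng = np.random.default_rng(0)
word_vectors = rng.normal(size=(len(vocab), 50))   # stand-in word embeddings

a = np.array([2, 1, 1, 2, 1, 1, 1, 1, 0], dtype=float)
b = np.array([0, 0, 0, 1, 1, 2, 1, 0, 1], dtype=float)
print(cosine(a, b), soft_cosine(a, b, word_vectors))
```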

2.3 Convolutional neural network

Deep learning is a sub-area of ML inspired by artificial neural networks. A Deep Neural Network (DNN) is a sequence of layers that learns data representations. Since DNNs can automatically identify and extract text features, they are increasingly used in various NLP tasks, such as SA. They are inspired by the structure of the human brain and consist of a large number of information processing units, called neurons, organized in a sequence of layers. They can learn to perform tasks such as regression and classification by adjusting the connection weights between neurons, mimicking the learning process of the human brain [26].

Among the different types of DNNs, Recurrent Neural Networks (RNNs) and CNNs have been widely used in SA. It has been shown that CNNs can improve the accuracy of text classification since they extract local and deep features [27]. CNNs are feed-forward neural networks composed of three kinds of layers: convolution, pooling, and fully connected. The convolution layer applies different filters to the embeddings to extract features and create feature maps. The pooling layer reduces the feature dimensions, which lowers the computational workload and speeds up the following layers. The last layer is a fully connected network with an activation function that maps the extracted text or image features to the target classes [21].

2.4 Gated recurrent unit

The GRU and Long Short-Term Memory (LSTM) models, two kinds of RNNs, were proposed to solve the vanishing gradient problem. The GRU is similar to the LSTM but does not have a memory cell [28, 29]. It has a simpler architecture and has shown better results in different NLP tasks such as text classification [22]. It uses two gates. The update gate, denoted \(z_t\), decides how much information needs to be kept for the future, whereas the reset gate, denoted \(r_t\), decides how much information can be forgotten. The \(h_{t-1}\) contains the information of the previous state, and \(\hat{h_t}\) is the candidate hidden state, computed from the reset previous state. The following equations show how the GRU network works. In these equations, \(\sigma \) represents the sigmoid function, and \(\odot \) means element-wise multiplication [30,31,32].

$$\begin{aligned} r_t= \sigma (W_{r}x_t + U_{r}h_{t-1}) \end{aligned}$$
(4)
$$\begin{aligned} z_t= \sigma (W_{z}x_t + U_{z}h_{t-1}) \end{aligned}$$
(5)
$$\begin{aligned} \hat{h_t}= tanh(W X_t + U(r_t\odot h_{t-1})) \end{aligned}$$
(6)
$$\begin{aligned} \ h_t= (1-z_t)h_{t-1} + z_t \hat{h}_t \end{aligned}$$
(7)
Fig. 1 Sentiment Analysis Methods

3 Literature review

In this section, we first discuss some unsupervised and self-supervised SA methods and then review hybrid SA methods.

3.1 Unsupervised and Self-supervised SA methods

In general, unsupervised methods rely on statistical features of the document, such as word co-occurrence or the presence of sentiment words. In contrast, as shown in Fig. 1, self-supervised methods are a subset of unsupervised learning in which the output labels are generated automatically by extracting patterns from data.

A self-supervised and syntax-based method (SESS) was proposed in [33]. It first calculated the sentiment score of each document using the positive and negative seeds provided by the Subjclueslen1-HLTEMNLP05 dictionary, updating the list of seeds iteratively in each step. Second, a Naïve Bayes (NB) classifier was trained using these labeled documents. In the last step, the trained NB classifier was applied to the whole dataset to find the labels of the documents. Additionally, to improve the quality of the labels, they identified three types of compound and complex sentences, i.e., coordination, concession, and condition, and considered their sentiment while calculating the sentiment score of documents. SESS was evaluated on the Amazon product review dataset [34].

Qiu et al. [35] proposed a Self-Supervised Model for Sentiment Classification (SELC) of Chinese reviews that includes two steps. First, the HowNet dictionary and a negation list were employed to classify the reviews. Second, a Support Vector Machine (SVM) classifier was trained on these labeled samples and then used to predict the sentiment of unseen reviews. The authors used the TF-IDF method to create the text feature vectors, which does not consider the semantic meaning of words. They evaluated SELC on reviews from ten domains: monitors, mobile phones, digital cameras, MP3 players, computer parts, video cameras and lenses, networking, office equipment, printers, and computer peripherals. He and Zhou [36] proposed a self-supervised method that borrowed a list of sentiment words from the MPQA lexicon and trained a classifier on the Amazon Review [34] and the Cornell Movie Review datasets [37].

Zhou et al. [38] proposed an unsupervised method called graph co-regularized non-negative matrix tri-factorization (GNMTF) from the geometric perspective. GNMTF assumes that if two words (or documents) are sufficiently close to each other, they have the same sentiment. They constructed the nearest neighbor graphs in conjunction with a non-negative matrix tri-factorization framework.

In [39], the authors proposed a lexicon-based method called SmartSA that predicts sentiments by extracting sentiment from sentiment lexicons. Their analysis and experimental observations showed that this method works well and performs better than the SentiStrength [40] method. Jimenz et al. [41] presented an unsupervised aspect-based sentiment classification method. First, they extracted different aspects of each entity. Second, they used the Bing Liu, MPQA, and SentiWordNet [42] lexicons to extract the sentiment corresponding to each aspect.

Fernandez et al. [43] proposed an unsupervised dependency-parsing-based text classification method that borrows a list of seeds from SO-CAL and then uses linguistic rules to find the sentiments. They evaluated their method on the Cornell Movie Review [37], Obama-McCain Debate, and SemEval-2015 datasets.

Vilares et al. [44] proposed an unsupervised SA method based on compositional syntax-based rules. They borrowed a list of seeds as prior knowledge from SO-CAL and evaluated their method on the Cornell Movie Review [37], German, and Spanish datasets.

Vanishta and Suzan [45] proposed an unsupervised method based on fuzzy logic that includes four major steps: tokenization, formulation of a bag-of-words model, formulation of a fuzzy sentiment score, and assignment of polarity. They calculated the cardinality of positive and negative words using the SentiWordNet [42] and AFINN [46] dictionaries separately. When the cardinality of positive words is equal to or greater than that of negative words, the label of the document is considered positive; when it is smaller, the label is considered negative. They evaluated the proposed method on the polarity dataset v2.0 by Pang and Lee [37] and on IMDB [47]; a third dataset provides reviews of a single hotel.

Also, Sazzed et al. [9] proposed a method called SSentiA that uses an opinion lexicon to generate pseudo-labels. They then utilized these labeled data to train SVM and LR classifiers and predicted the sentiment of unseen data. Rendon-Cardona et al. [18] extended the SSentiA self-supervised method for sentiment analysis of Spanish texts by adding a module that translates the Spanish texts into English with high accuracy.

Seilsepour et al. [5] used the sum of the Cosine and Soft-Cosine similarities between the samples and their corresponding sentiment words to estimate the pseudo-labels and then trained a RoBERTa-GRU classifier using these labeled data. Additionally, they used the Whale Optimization Algorithm to fine-tune the hyperparameters of the GRU network. In another work, Seilsepour et al. [15] employed the semantic similarity and the WMD distance [48] simultaneously to estimate the pseudo-labels and trained a RoBERTa-LSTM classifier with the samples having highly accurate pseudo-labels.

3.2 Hybrid SA methods

Hybrid SA methods, a subset of unsupervised methods, utilize a lexicon to find the appropriate labels corresponding to each sample. Later on, these labeled samples are used to train an ML model to predict the labels of unseen data. For instance, an entity-level sentiment analysis method was proposed in [49] that used a vocabulary-based approach for document labeling and trained an SVM classifier to predict the sentiment of unseen data.

Iqbal et al. [8] employed SentiWordNet [42] to find the appropriate labels of samples and Bag of Words (BOW) to create the feature vectors. Later on, a Genetic Algorithm (GA) was used to reduce the number of features. Finally, these labeled samples were used to train a Naive Bayes (NB) classifier. The proposed method was evaluated on the IMDB [47], Yelp, and Amazon review datasets [34].

Aljedaani et al. [10] utilized TextBlob [50] to calculate the sentiment scores of reviews about US airlines. They used Bag of Words and TF-IDF to create feature vectors. Later on, they trained ML models, such as Random Forest (RF), Decision Tree (DT), Logistic Regression (LR), Extra Trees Classifier (ETC), and Support Vector Classifier (SVC), and DNN models, such as CNN, GRU, LSTM, LSTM-GRU, and CNN-LSTM. The LSTM-GRU and LSTM achieved the highest accuracy. Azlinah et al. [12] used VADER [51] to find the appropriate labels and Word2vec and GloVe to create feature vectors. In the next step, they trained SVM, CNN, LSTM, CNN-LSTM, and LSTM-CNN classifiers. The comparison results showed that CNN outperformed the other methods. Khan et al. [13] estimated the labels using several lexicons, such as AFINN, GL, OL, SentiWordNet, SO-CAL, the Subjectivity lexicon, WordNet-Affect, NRC, SenticNet5, and SentiSense. In the next step, they created sentence embeddings using BERT and trained a hybrid network composed of a BiLSTM and a CNN to classify unseen data.

Mardjo et al. [11] collected 3.5 million tweets and used VADER [51] to estimate the labels. Later on, they used TF-IDF to create feature vectors and trained an RF classifier. Additionally, they used Grey Wolf Optimizer (GWO) to fine-tune the hyperparameters of the RF classifier.

Kathuria et al. [14] estimated the labels for the feedback of postgraduate students using SentiWordNet. Additionally, they used TF-IDF to make feature vectors. Later on, they trained classifiers such as SVM, Multinomial Naive Bayes (MNB), LR, RF, DT, and K-Nearest Neighbors (KNN). The comparison results showed that RF performed best.

Table 1 shows the list of SA methods.

Table 1 Unsupervised (U), Self-Supervised (S), and Hybrid (H) SA methods

4 Proposed method

Since SA is a domain- or context-dependent task, the sentiment of words varies in different domains. For instance, the word “unpredictable” conveys a positive sentiment in the phrase “unpredictable plot” in a movie review but a negative sentiment in the phrase “unpredictable steering” in a car review. Hence, SA approaches based on a domain-independent lexicon or on an ML model trained on a specific domain cannot recognize the sentiment of domain-dependent words correctly [16]. To address this problem, this paper proposes a novel hybrid self-supervised SA approach that does not need labeled data. The proposed method offers a semantic-based pseudo-label generator that captures the semantic relationships between the samples and their corresponding sentiment words, together with the number of sentiment words, to estimate the pseudo-labels. The proposed method utilizes the Cosine [20] and Soft-Cosine [19] similarity measures to capture the semantic similarity. As described in Section 2.2, the Cosine similarity is widely used to calculate the semantic similarity between texts, but it does not consider the similarity between the features of the vectors, whereas the Soft-Cosine similarity also captures the similarity between the features. As a result, the proposed method uses both the Cosine and the Soft-Cosine. It introduces four new semantic concepts: the Soft-Cosine Similarity of a document with its Positive words (SCSP), the Soft-Cosine Similarity of a document with its Negative words (SCSN), the Cosine Similarity of a document with its Positive words (CSP), and the Cosine Similarity of a document with its Negative words (CSN). The semantic-based pseudo-label generator borrows a list of sentiment words from the Opinion Lexicon [55] and calculates SCSP and SCSN. When SCSP is bigger than SCSN and the number of positive words is bigger than the number of negative words, the pseudo-label is considered positive. When SCSP is less than SCSN and the number of positive words is less than the number of negative words, the pseudo-label is considered negative. In all other cases, CSP and CSN are calculated using the Doc2vec embedding technique: when CSP is bigger than CSN, the pseudo-label is considered positive, and vice versa. The semantic-based pseudo-label generator also introduces a novel method to find highly accurate pseudo-labels. Later on, the samples with highly accurate pseudo-labels are fed into a CNN-GRU classifier. Fig. 2 shows that the proposed method includes four steps: preprocessing, the semantic-based pseudo-label generator, finding highly accurate pseudo-labels, and the CNN-GRU classifier. In the following subsections, each step is explained in more detail.

4.1 Preprocessing

The preprocessing step is essential in NLP tasks since social media comments are usually full of links, emoticons, etc., and some SA methods are sensitive to errors and mistakes in user-generated content [15]. This step removes all non-alphabetical characters, links, Unicode characters, punctuation marks, and stop words. Moreover, all characters are converted into lowercase, and the tokens are stemmed with the Porter Stemmer.
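A minimal sketch of such a preprocessing pipeline is shown below, using regular expressions and NLTK's Porter stemmer; the exact cleaning rules and stop-word list used in this work may differ.

```python
# Illustrative preprocessing sketch: remove links and non-alphabetical characters,
# lowercase, drop stop words, and stem the remaining tokens.
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))   # requires nltk.download("stopwords")

def preprocess(text):
    text = re.sub(r"https?://\S+", " ", text)   # remove links
    text = re.sub(r"[^A-Za-z\s]", " ", text)    # keep alphabetical characters only
    tokens = text.lower().split()               # lowercase and tokenize
    tokens = [t for t in tokens if t not in stop_words]
    return [stemmer.stem(t) for t in tokens]

print(preprocess("The battery life is NOT good!! See https://example.com :("))
```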

4.2 Semantic-based pseudo-label generator

As explained earlier, the semantic pseudo-label generator captures the semantic relationships between the samples and their corresponding sentiment words. In the following sections, each step is explained in more detail.

Fig. 2 Architecture of Proposed Method

4.2.1 Discovering the sentiment words

The semantic-based pseudo-label generator borrows a list of sentiment words and their sentiments from the Opinion Lexicon [55], including 4783 negative and 2006 positive terms. This domain-independent word list is employed as seeds to extract the sentiment words of each sample. For instance, “nice” conveys a positive sentiment in all domains. We suppose that the dataset DS is a collection of N samples, \({S_1, S_2, S_3, ..., S_N}\). \(PW_i\) and \(NW_i\) are the positive and negative words of sample \(S_i\), respectively. DS, \(PW_i\), and \(NW_i\) are defined as below:

$$\begin{aligned} DS = \{S_1, S_2, S_3, ..., S_N\} \end{aligned}$$
(8)
$$\begin{aligned} PW_i = \{PW_1, PW_2, PW_3, ..., PW_{nPW_i}\} \end{aligned}$$
(9)
$$\begin{aligned} NW_i = \{NW_1, NW_2, NW_3, ..., NW_{nNW_i}\} \end{aligned}$$
(10)

Where \(nPW_i\) and \(nNW_i\) are the numbers of positive and negative words of \(S_i\), respectively.

4.2.2 Discovering the negations and polarity shifters

Negations like no, never, and not reverse the polarity of the word that follows them. For instance, “The taste of the food is not good” conveys a negative sentiment because of not. In addition, modal verbs such as should and could can reverse a sentence’s sentiment orientation. For instance, “The quality of the food could be better” conveys a negative sentiment [9]. As a result, for each sample \(S_i\), the words following negations and polarity shifters are found and appended to the opposite set: positive words are added to \(NW_i\), and vice versa.
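The sketch below illustrates Sections 4.2.1 and 4.2.2 together: it collects the positive words \(PW_i\) and the negative words \(NW_i\) of a sample and flips the word that follows a negation or a polarity shifter. The tiny seed lists are illustrative; the actual seeds come from the Opinion Lexicon [55].

```python
# Hedged sketch of Sections 4.2.1-4.2.2 with illustrative seed lists.
POSITIVE_SEEDS = {"good", "nice", "great"}
NEGATIVE_SEEDS = {"bad", "poor", "terrible"}
SHIFTERS = {"no", "not", "never", "should", "could"}

def sentiment_words(tokens):
    pw, nw = [], []
    for idx, tok in enumerate(tokens):
        prev = tokens[idx - 1] if idx > 0 else ""
        shifted = prev in SHIFTERS
        if tok in POSITIVE_SEEDS:
            (nw if shifted else pw).append(tok)   # e.g. "not good" counts as negative
        elif tok in NEGATIVE_SEEDS:
            (pw if shifted else nw).append(tok)   # e.g. "not bad" counts as positive
    return pw, nw

print(sentiment_words("the taste of the food is not good".split()))  # ([], ['good'])
```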

4.2.3 Calculating SCSP and SCSN

In this step, the semantic similarity of each sample \(S_i\) with its positive and negative words is calculated individually. To this end, the semantic-based pseudo-label generator uses the Soft-Cosine Similarity [19] measure, an extended version of the Cosine similarity that considers the similarity between the features of sentences. As explained in Section 2.2, the Soft-Cosine calculates the similarity between two sentences even if they share no common words. The Soft-Cosine similarity is calculated according to (2). The Soft-Cosine similarities of \(S_i\) with its positive and negative sentiment words are defined below:

$$\begin{aligned} SCSP(S_i) = \sum _{j=1}^{nPW_i} Soft-Cosine(S_i,PW_j), \forall i \in \{1,2,...,N\} \end{aligned}$$
(11)
$$\begin{aligned} SCSN(S_i) = \sum _{j=1}^{nNW_i} Soft-Cosine(S_i,NW_j), \forall i \in \{1,2,...,N\} \end{aligned}$$
(12)

Where \(SCSP(S_i)\) is the Soft-Cosine Similarity of \(S_i\) with its Positive words, and \(SCSN(S_i)\) is the Soft-Cosine Similarity of \(S_i\) with its Negative words. To calculate the Soft-Cosine similarity, all the words of the samples should be converted into feature vectors. Here, we use the Word2Vec [24] technique. It is a commonly used technique based on the distributional hypothesis, which assumes that words used in similar contexts convey similar meanings. It uses a shallow neural network to create dense feature vectors of words in a simple and reasonably fast way.
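As an illustration of (11) and (12), the sketch below trains a toy Word2Vec model with gensim and sums the Soft-Cosine similarities between a sample and each of its sentiment words; the corpus, the sentiment word lists, and all hyperparameters are placeholders, and the soft_cosine helper mirrors (1)-(3).

```python
# Hedged sketch of (11)-(12) on a toy corpus; SCSP/SCSN are sums of Soft-Cosine values.
import numpy as np
from gensim.models import Word2Vec

corpus = [["the", "movie", "was", "great", "and", "the", "acting", "was", "good"],
          ["the", "plot", "was", "boring", "and", "predictable"]]
w2v = Word2Vec(corpus, vector_size=50, window=3, min_count=1, sg=1, epochs=100)

def soft_cosine(tokens_a, tokens_b, wv):
    vocab = sorted(set(tokens_a) | set(tokens_b))
    a = np.array([tokens_a.count(t) for t in vocab], dtype=float)
    b = np.array([tokens_b.count(t) for t in vocab], dtype=float)
    e = np.array([wv[t] for t in vocab])
    e = e / np.linalg.norm(e, axis=1, keepdims=True)
    s = e @ e.T                                     # s_ij = Cosine(e^i, e^j), as in (3)
    return float(a @ s @ b / (np.sqrt(a @ s @ a) * np.sqrt(b @ s @ b)))

sample = corpus[0]
pos_words, neg_words = ["great", "good"], []        # PW_i and NW_i from Section 4.2.1
scsp = sum(soft_cosine(sample, [w], w2v.wv) for w in pos_words)
scsn = sum(soft_cosine(sample, [w], w2v.wv) for w in neg_words)
print(scsp, scsn)
```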

4.2.4 Estimating the pseudo-labels

Here, the pseudo-label of each sample \(S_i\) is estimated based on the Soft Cosine similarity between its sentiment words and the sample \(S_i\) itself and the number of its positive/negative words. The sentiment orientation of sample \(S_i\) is determined by:

$$\begin{aligned} Pseudo-label(S_i) = {\left\{ \begin{array}{ll} 1&{} if\ SCSP(S_i)>SCSN(S_i)\ and\ nPW(S_i)>nNW(S_i) \\ 0&{} if\ SCSP(S_i )<SCSN(S_i )\ and\ nPW(S_i)<nNW(S_i) \\ tie&{} Otherwise \end{array}\right. } \end{aligned}$$
(13)

Where nPW(\(S_i\)) and nNW(\(S_i\)) are the numbers of positive and negative words of \(S_i\), as denoted in (9) and (10). When the Soft-Cosine similarity of \(S_i\) with its positive words is bigger than the Soft-Cosine similarity of \(S_i\) with its negative words, and the number of positive words is bigger than the number of negative words, the pseudo-label of \(S_i\) is considered 1 (positive), and vice versa. However, in some cases, the Soft-Cosine similarities and the numbers of sentiment words do not agree. In these cases, the Cosine similarity measure, which calculates the Cosine of the angle between two vectors, is utilized. The Cosine between two document vectors, A and B, is calculated according to (1). The Cosine similarities of \(S_i\) with its positive and negative words are defined as below:

Algorithm 1 Semantic-based Pseudo-label Generator

$$\begin{aligned} CSP(S_i) = \sum _{j=1}^{nPW_i} Cosine(S_i,PW_j), \forall i \in \{1,2,...,N\} \end{aligned}$$
(14)
$$\begin{aligned} CSN(S_i) = \sum _{j=1}^{nNW_i} Cosine(S_i,NW_j), \forall i \in \{1,2,...,N\} \end{aligned}$$
(15)

Where CSP(\(S_i\)) is the Cosine Similarity of \(S_i\) with its positive words, and CSN(\(S_i\)) is the Cosine Similarity of \(S_i\) with its negative words. The pseudo-label of a tie case is determined as below:

$$\begin{aligned} Pseudo-label(tie) = {\left\{ \begin{array}{ll} 1 &{} if \quad CSP(S_i) \ge CSN(S_i) \ \\ 0 &{} if \quad CSP(S_i ) < CSN(S_i ) \end{array}\right. } \end{aligned}$$
(16)

As shown in (16), when the Cosine similarity of \(S_i\) with its positive words is greater than or equal to its Cosine similarity with its negative words, the pseudo-label is considered 1 (positive), and vice versa. All dataset samples should be converted into dense document feature vectors to calculate the Cosine similarity. Here, we use the Doc2Vec [23] technique to convert the samples of the datasets into document feature vectors. As described in Section 2.1, it averages or concatenates the feature vectors of the words composing a text or paragraph to create the document feature vector. We reuse the word feature vectors created in Section 4.2.3 to create the document feature vectors with Doc2Vec in a simple and reasonably fast way.

Algorithm 1 shows the process of generating pseudo-labels in more detail. As shown in Algorithm 1, first, sentiment words, negations, and polarity shifters of documents are extracted separately. Then, SCSP, SCSN, CSP, and CSN are calculated, and the pseudo-labels are estimated.
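A compact sketch of the decision rule in (13) and (16) is given below; the similarity values and word counts are assumed to have been computed as in the previous subsections, and the numbers passed in the example call are made up.

```python
# Sketch of the pseudo-label decision rule in (13) and (16).
def pseudo_label(scsp, scsn, n_pos_words, n_neg_words, csp, csn):
    if scsp > scsn and n_pos_words > n_neg_words:
        return 1                       # positive, by (13)
    if scsp < scsn and n_pos_words < n_neg_words:
        return 0                       # negative, by (13)
    # Tie: the Soft-Cosine evidence and the word counts do not agree, so fall
    # back to the Cosine similarities of the Doc2Vec vectors, as in (16).
    return 1 if csp >= csn else 0

print(pseudo_label(scsp=0.82, scsn=0.35, n_pos_words=3, n_neg_words=1,
                   csp=0.0, csn=0.0))  # 1
```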

4.3 Finding the samples with highly accurate pseudo-labels

The classifier needs to be trained on samples having highly accurate pseudo-labels. To select highly accurate pseudo-labels, we utilize the ratio of the positive and negative polarity scores of a review to determine its confidence score. If the review r consists of n sentences, \(s_1, s_2, s_3,\ldots , s_n\), with positive polarity scores \(P_{pos} (s_1 ), P_{pos} (s_2 ),\ldots , P_{pos} (s_n )\) and negative polarity scores \(P_{neg} (s_1 ), P_{neg} (s_2 ),\ldots ,\) \(P_{neg} (s_n )\), then the overall positive polarity score of review r is calculated as \(P_{pos} (r)=\sum _{i=1}^n P_{pos}(s_i)\), and the negative polarity score as \(P_{neg} (r)=\sum _{i=1}^n P_{neg}(s_i)\). The confidence score of the review r is determined by:

$$\begin{aligned} ConfScore = \frac{abs\big (P_{pos}(r) + P_{neg}(r)\big )}{abs(P_{pos}(r)) + abs(P_{neg}(r))} \end{aligned}$$
(17)

We calculate the mean confidence score (mcs) and the standard deviation (std) of the confidence scores across all the predictions to find the threshold value thr, calculated as \(thr = mcs + std\), which separates the confidence groups. The confidence group of review r, ConfGroup(r), is determined as follows:

$$\begin{aligned} ConfGroup(r) = {\left\{ \begin{array}{ll} high &{} if \quad ConfScore \ge thr \ \\ low &{} if \quad ConfScore < thr \end{array}\right. } \end{aligned}$$
(18)

The predicted reviews with a confidence score above the thr value fall into the high confidence group. The other category (low) contains predictions with confidence scores below the thr value. Three criteria are considered while categorizing the predictions into these two groups.

  1. a)

    Minimize the inclusion of wrong predictions in a group (i.e., keep its pseudo-labels highly accurate) so that it can be used as training data for the classifier with minimal error propagation.

  2. b)

    Maximize the number of reviews (a more extensive training set) utilized as pseudo-labeled training data for the classifier.

  3. c)

    Show the correlation between the confidence score and the accuracy (i.e., a high confidence score implies high accuracy).

Criteria (a) and (b) are both important for obtaining good performance from machine learning and deep learning classifiers: highly accurate pseudo-labels mean less error propagation to the classifier, and a higher number of pseudo-labels means a more extensive training set, which is needed for good accuracy. Criterion (c) is important for group selection, as it determines which groups should be used as training data and which as testing data. We find that discretizing the reviews’ predictions into two categories best fulfills these criteria. After identifying the highly confident predictions (the high confidence group), we utilize them as pseudo-labeled training data for the classifier.
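The following sketch illustrates (17) and (18): it computes a confidence score per review from illustrative sentence-level polarity scores (positive scores are non-negative, negative scores are non-positive), derives the threshold thr = mcs + std, and assigns each review to the high or low confidence group.

```python
# Sketch of (17)-(18) with made-up sentence-level polarity scores.
import numpy as np

def conf_score(sentence_pos_scores, sentence_neg_scores):
    p_pos = sum(sentence_pos_scores)               # P_pos(r)
    p_neg = sum(sentence_neg_scores)               # P_neg(r)
    return abs(p_pos + p_neg) / (abs(p_pos) + abs(p_neg))

reviews = [([0.8, 0.6], [-0.1]),                   # (pos scores, neg scores) per review
           ([0.3], [-0.4, -0.2]),
           ([0.5, 0.4], [-0.5])]
scores = np.array([conf_score(p, n) for p, n in reviews])

thr = scores.mean() + scores.std()                 # thr = mcs + std
groups = ["high" if s >= thr else "low" for s in scores]
print(list(zip(scores.round(2), groups)))
```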

4.4 CNN-GRU classifier

This research uses a combination of CNN and GRU for sentiment classification. As explained in Section 2.3, CNNs can improve text classification accuracy since they have a strong capacity for extracting local and deep features from text using convolutional layers [56]. On the other hand, GRUs, explained in Section 2.4, can learn long-term dependencies, so they are appropriate for modeling sequential data such as text, where a sentence can be considered a sequence of words from left to right. GRU networks also offer lower computational complexity and a simpler architecture than LSTMs. Considering these facts, and inspired by the results of [56] showing that the combination of CNN and GRU achieves higher accuracy in text classification tasks, we adopt this hybrid architecture. Additionally, we use the Word2vec [24] embedding method to convert the words into dense feature vectors. As shown in Fig. 3, the proposed CNN-GRU architecture includes the embedding, convolution, max-pooling, GRU, and fully connected layers:

Fig. 3 Architecture of CNN-GRU Classifier

  1. a)

    Embedding layer: This layer receives the labeled samples as word embeddings. Assume v is the vocabulary size of the corpus, and d is the size of word embedding (dimension size). Then, an embedding matrix \(EM \in R^{d * v}\) containing all words of the vocabulary is created. Subsequently, a sentence and its embedding can be represented as (19) and (20), respectively:

    $$\begin{aligned} Sentence= [w_1, w_2, \ldots , w_l] \end{aligned}$$
    (19)
    $$\begin{aligned} Sentence\_Embedding= [we_1 ,we_2,\ldots ,we_l] , \quad Sentence\_Embedding \in R^{d * l} \end{aligned}$$
    (20)

    Where \(w_i\) indicates the \(i-th\) word of the sentence, l is the length of the sentence and the column \(we_i\) denotes the word embedding of \(w_i\), \(we_i=EM[w_i]\), \(we_i \in R^d\).

  2. b)

    Convolution layer: This layer extracts the local features. Suppose \(K \in R^{d * w}\) is the kernel, which is applied to each window of size w; a bias term is added to the result of the convolution operation, and a feature map \(FM \in R^{l-w+1}\) is created as follows:

    $$\begin{aligned} FM= [fm_1, fm_2,\ldots ,fm_{l-w+1}], \quad FM \in R^{l-w+1} \end{aligned}$$
    (21)

    Then, the following equation shows the \(i-th\) element of the feature map:

    $$\begin{aligned} fm_i= \sigma \Big ( \sum \big (EM[*, i:i + w] \circ K \big )+b \Big ) \end{aligned}$$
    (22)

    Where \(\sigma \) is a non-linear activation function such as ReLU or tanh.

  3. c)

    Pooling layer: In the next step, the feature maps are fed into the pooling layer to find the essential features and reduce the dimensions. The pooling layer, widely used after convolution layers, performs dimension reduction and consequently decreases the computation time for the following layers. We use a Max-pooling layer with a pool size of 2, which reduces a feature map of size \(l-w+1\) to \(\left\lfloor \frac{l-w+1}{2}\right\rfloor \). The output of the pooling layer is:

    $$\begin{aligned} P= \Big [p_1, p_2, \ldots , p_{\left\lfloor \frac{l-w+1}{2}\right\rfloor } \Big ],p \in R^{\left\lfloor \frac{l-w+1}{2}\right\rfloor } \end{aligned}$$
    (23)

    Where \(p_i\) is calculated as follows:

    $$\begin{aligned} p_i=\text {max}(fm_{2*i-1} ,fm_{2*i}) \end{aligned}$$
    (24)
  4. d)

    GRU layer: The GRU layer receives the features obtained by the pooling layer to find the long-term dependencies. The output of GRU is \(g \in R^n\), encoding a complete sentence.

  5. e)

    Fully connected layer: The output of the GRU layer is sent to a fully connected layer that uses the sigmoid activation function. Passing the feature vector through the sigmoid function yields a probability score over the sentiment classes. The sigmoid function is calculated as follows:

    $$\begin{aligned} sigmoid(g)=\frac{1}{(1+e^{-g})} \end{aligned}$$
    (25)

Where g denotes the advanced feature vector created by the GRU.
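A minimal Keras sketch of this architecture is given below; the vocabulary size, sequence length, filter count, and GRU units are placeholders, and Table 3 lists the hyperparameters actually used.

```python
# Minimal Keras sketch of the CNN-GRU classifier in Fig. 3 (placeholder sizes).
from tensorflow.keras import layers, models

vocab_size, embed_dim, max_len = 20000, 200, 400

model = models.Sequential([
    layers.Input(shape=(max_len,), dtype="int32"),
    layers.Embedding(vocab_size, embed_dim),                       # (19)-(20)
    layers.Conv1D(filters=128, kernel_size=3, activation="relu"),  # (21)-(22)
    layers.MaxPooling1D(pool_size=2),                              # (23)-(24)
    layers.GRU(64),                                                # long-term dependencies
    layers.Dense(1, activation="sigmoid"),                         # (25)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
# model.fit(...) would then be called on the high-confidence pseudo-labeled
# samples selected in Section 4.3.
```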

5 Evaluation

In this section, we first describe the proposed method’s evaluation setup and metrics, datasets, and hyperparameter settings. Later on, we compare the results of the proposed method with other lexicons, baseline classifiers, and similar methods. Finally, we calculate the computational complexity of the proposed method.

5.1 Evaluation setup and metrics

We used the Google Colab platform, with a K80 GPU and 12 GB of RAM, to run the proposed method. The proposed method was implemented in Python 3.8, using the Keras library [57] for the DNNs. Since DNNs use random initialization, they give different results in each run, so we ran each algorithm ten times and report the average results.

Since the purpose of the proposed method is to predict the label of texts as positive or negative, the numbers of correctly and incorrectly predicted labels play an important role in evaluating it. Hence, we utilized Accuracy, Precision, Recall, and F1-Score, which are widely used to evaluate classification tasks. Additionally, similar existing SA approaches used these metrics to evaluate their methods, so we can compare our method with them [28].

Accuracy, which is the number of correct choices relative to all choices, is calculated as follows:

$$\begin{aligned} Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \end{aligned}$$
(26)

Where:

  • TP: the number of samples where the predicted class label and the actual class label are positive.

  • FP: the number of samples where the predicted class label is positive, but the actual class label is negative.

  • FN: the number of samples where the predicted class label is negative, but the actual class is positive.

  • TN: the number of samples where the predicted class label is negative and the actual class label is negative.

Precision measures the fraction of samples predicted as positive that are truly positive. It is calculated as follows:

$$\begin{aligned} Precision = \frac{TP}{TP + FP} \end{aligned}$$
(27)

The recall metric measures the fraction of actual positive samples that are correctly predicted and is calculated as follows:

$$\begin{aligned} Recall = \frac{TP}{TP + FN} \end{aligned}$$
(28)

F1-score is the harmonic mean of precision and recall metrics.

$$\begin{aligned} F1 - score = \frac{2*Precision*Recall}{Precision + Recall} \end{aligned}$$
(29)
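For completeness, the four metrics in (26)-(29) can be computed with scikit-learn, as in the short sketch below, using made-up predictions.

```python
# Computing (26)-(29) with scikit-learn on made-up predictions.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
print(accuracy_score(y_true, y_pred),    # 0.75
      precision_score(y_true, y_pred),   # 0.75
      recall_score(y_true, y_pred),      # 0.75
      f1_score(y_true, y_pred))          # 0.75
```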

5.2 Dataset description

To show that the proposed method is independent of domain, we chose five English review datasets from different domains: movies, books, DVDs, electronics, and kitchens. The first dataset is the second version of the Pang and Lee dataset [37], known as Movie Review (MR02), containing 1000 positive and 1000 negative movie reviews collected from IMDB. In addition, we employed the Multi-Domain Dataset (MDS), collected by Blitzer et al. [34], containing Amazon reviews from four different domains (Book, DVD, Electronics, and Kitchen). Table 2 shows the number of positive samples (#ps), the number of negative samples (#ns), the number of positive words (#pw), the number of negative words (#nw), and the number of negations (#neg) corresponding to each dataset.

Table 2 Description of datasets
Table 3 Hyperparameters OF CNN-GRU classifier

5.3 Hyperparameter setting

To set the hyperparameters of the proposed method, we ran it with different hyperparameter values and selected the best ones. We tested vector sizes of 100, 200, and 300 for the dense feature vectors; a vector size of 200 achieved the lowest error rate.

Word2vec works in two modes, skip-gram and Continuous Bag-of-Words (CBOW). We tested both, and skip-gram obtained the lowest error.

In addition, we used the Doc2vec model to convert the samples into document vectors. As proposed in [23], Doc2vec works in PV-DM and PV-DBOW modes. Like the continuous bag of words, the former is more complex but performs better; the PV-DM method either concatenates or averages all word embeddings of a document to calculate the document embedding. Like skip-gram, the latter is simpler and usually leads to a higher error rate. As a result, we built three Doc2vec models: D2V-DM-Concat, D2V-DM-Average, and D2V-DBOW. To compare these models, we built a logistic regression model on each dataset whose input is the document embeddings and whose target is the sentiment labels. Le and Mikolov suggest that concatenating the document embeddings created by PV-DM and PV-DBOW improves the performance of the Doc2Vec model [23]. To this end, we concatenated the two distributed-memory models (D2V-DM-Concat and D2V-DM-Average) with D2V-DBOW separately and created two concatenated models named D2V-BOW-Concat and D2V-BOW-Average. The error rates of the two concatenated models decreased remarkably, and D2V-BOW-Concat achieved the lowest error, so we used this model in the following steps. Moreover, we configured the CNN-GRU classifier as listed in Table 3.
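The sketch below mirrors the Doc2Vec comparison described above on a toy corpus: it builds a PV-DM (concatenation) model and a PV-DBOW model with gensim, concatenates their document vectors (D2V-BOW-Concat), and probes them with a logistic regression classifier. The corpus, labels, and hyperparameters are illustrative only.

```python
# Hedged sketch of the D2V-BOW-Concat comparison with a toy corpus and labels.
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.linear_model import LogisticRegression

docs = [("great movie with a touching story", 1),
        ("boring plot and terrible acting", 0),
        ("a nice surprise really enjoyable", 1),
        ("poor pacing i could not finish it", 0)]
tagged = [TaggedDocument(text.split(), [i]) for i, (text, _) in enumerate(docs)]
labels = np.array([y for _, y in docs])

dm_concat = Doc2Vec(tagged, vector_size=50, dm=1, dm_concat=1, min_count=1, epochs=100)
dbow = Doc2Vec(tagged, vector_size=50, dm=0, min_count=1, epochs=100)

# D2V-BOW-Concat: concatenate the PV-DM(concat) and PV-DBOW document vectors.
X = np.hstack([[dm_concat.dv[i] for i in range(len(docs))],
               [dbow.dv[i] for i in range(len(docs))]])
probe = LogisticRegression().fit(X, labels)
print("training error rate:", 1 - probe.score(X, labels))
```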

Table 4 Comparison of the proposed method with other lexicons
Table 5 Comparison of the proposed method with other classifiers

5.4 Comparison with lexicons and other classifiers

First, this section compares the proposed method with other widely used lexicons and tools, such as TextBlob, SentiStrength, AFINN, VADER, and Flair. Later on, we compare it with other classifiers. For the comparison with the lexicons, we used the APIs provided by these tools to classify the sentiments of the datasets. TextBlob [50] provides an API for NLP tasks such as SA, part-of-speech tagging, and noun phrase extraction. SentiStrength [58] employs word-matching tools to classify the text and outputs a number indicating its polarity. The AFINN [46] lexicon scores words between -5 and +5 to indicate the sentiment of texts. VADER (Valence Aware Dictionary and sEntiment Reasoner) [51] is a rule-based SA tool. Flair [59] employs an embedding method called contextualized string embeddings. As shown in Table 4, Flair, which uses contextual embeddings, obtained the closest results to the proposed method, but the proposed method still outperforms the others.

Next, the proposed method is compared with other classifiers. As listed in Table 5, the results obtained by CNN-LSTM are close to those of the proposed method, and in the case of the DVD dataset, the accuracy of CNN-LSTM (0.83) is higher than that of the proposed method (0.75). In all other cases, the proposed method outperforms the other classifiers.

Table 6 Comparison of the proposed method with other similar methods

5.5 Comparison with other methods

In this section, we compare the proposed method with unsupervised methods such as Zhou et al. [38], Fernandez et al. [43], and Vilares et al. [44], and with self-supervised methods such as He and Zhou [36], SSentiA [9], and SESS [33], explained in Section 3.1. These methods employed the MR02 and MDS datasets, the same as the proposed method. As can be seen in Table 6, the results obtained by the proposed method are better than the results reported by the authors of the other methods; only in the case of the Kitchen dataset does SESS [33] achieve a higher F1-score.

As explained in Section 3.1, Zhang et al. [33], He and Zhou [36], Fernandez et al. [43], and Sazzed et al. [9] borrow a list of seeds from domain-independent lexicons to calculate the sentiment score of each document, train a classifier with the resulting labels, and finally employ this trained classifier to predict the labels of unseen data. Mostly, they aggregate the sentiment scores of the words forming a document to calculate its sentiment score. In contrast to the proposed method, these methods do not consider the semantic relationships between the documents and their sentiment words using the Soft-Cosine measure.

Zhou et al. [38] use graph co-regularized non-negative matrix tri-factorization to find the labels of documents. However, this method only uses the Cosine similarity to measure the similarity between the documents and the sentiment words, which, as explained in Section 2.2, is not sufficient, so its results are not close to those of the proposed method.

On the other hand, these methods usually use TF-IDF to convert the texts into feature vectors. Unlike contextual embedding methods such as Doc2Vec, TF-IDF does not consider the context and semantic meaning of texts and becomes slow for larger vocabularies. As a result, these methods are not scalable and cannot be utilized for larger datasets with larger vocabularies. The proposed method does not have this limitation and can be used in different domains, as we evaluated it on datasets from various domains.

Additionally, these methods train classifiers like SVM or LR that do not capture long-term dependencies while processing sequential data such as text. In comparison, as explained in Sections 2.3 and 2.4, the proposed method utilizes the CNN-GRU classifier, which finds local and deep features using the CNN and captures long-term dependencies using the GRU.

5.6 Computational complexity

Regarding the computational complexity, the complexity of the proposed method is non-trivial and depends on its building blocks, namely Doc2vec, the Soft-Cosine measure, and the CNN-GRU classifier. The complexity of the Doc2vec embedding method is linear since it is a single-layer model, so it is \(O(N)\), where N is the number of documents. The complexity of the Soft-Cosine computation is at most \(O(L \times N)\), where L is the length of each document.

Additionally, the complexity of the CNN and GRU are \(O( s \times n \times d^2 )\) and \(O( n \times d^2 )\) where s, n, and d are kernel size, sequence length, and representation dimension, respectively [60]. So, the complexity of CNN-GRU is calculated as below:

$$\begin{aligned} O(CNNGRU) = O( s \times n \times d^2 ) + O( n \times d^2 ) \end{aligned}$$
(30)

Finally, the complexity of the proposed method is:

$$\begin{aligned} O(Proposed Method) = O( N ) + O( N \times L ) + O( s \times n \times d^2 ) + O( n \times d^2 ) \end{aligned}$$
(31)

In real-world datasets, \(N\) is much bigger than \(L\): the number of samples in the dataset \((N)\) is always much higher than the length of a sample \((L)\), which can be treated as a constant, so \(L\) can be omitted in (31). Moreover, \(s\) is constant. Therefore, the complexity of the proposed method can be calculated by the equation below:

$$\begin{aligned} O(Proposed Method) = O( N ) + O( n \times d^2 ) \end{aligned}$$
(32)

6 Conclusion and future work

SA is a domain-dependent task, so knowledge-based SA methods that use domain-independent lexicons cannot recognize the sentiment of domain-dependent words, and ML methods trained on a specific domain cannot be utilized in other domains. To address this problem, this research proposes an SA method that considers the domain of samples using contextual embeddings, the Soft-Cosine similarity, and a CNN-GRU classifier. The proposed method offers a semantic-based pseudo-label generator that estimates the pseudo-labels based on the Soft-Cosine similarity and the number of sentiment words. It uses a list of positive and negative seeds to extract the sentiment words of each sample.

In addition, since the classifier needs to be trained on samples having highly accurate pseudo-labels, another method based on a confidence score is proposed to find such samples. Then, the samples having highly accurate pseudo-labels are fed into a hybrid CNN-GRU classifier, in which the CNN extracts deep local features and the GRU captures long-term dependencies. The evaluation results demonstrate that the proposed method outperforms existing similar approaches.

The comparison of the semantic-based pseudo-label generator with similar existing SA tools such as TextBlob, SentiStrength, VADER, AFINN, and Flair demonstrates that using contextual embeddings and semantic similarity jointly can solve the problem of SA methods that do not consider the domain of domain-dependent sentiment words while extracting their sentiments. Contextual embedding methods such as Doc2Vec convert the text into dense feature vectors, and the Soft-Cosine similarity calculates the semantic similarity between the feature vectors of samples and their corresponding sentiment words. Later on, these similarities are used to estimate the pseudo-labels.

Additionally, the comparison of CNN-GRU with other classifiers such as CNN, LSTM, GRU, and CNN-LSTM shows that the proposed CNN-GRU classifier outperforms the other classifiers in terms of accuracy, precision, recall, and F1. Only in the case of the DVD dataset does CNN-LSTM perform better than CNN-GRU.

In the future, Transformers could enhance the proposed method by creating more meaningful feature vectors. Transformers such as RoBERTa and ALBERT were trained on huge amounts of data, such as Wikipedia and book corpora, and utilize the attention mechanism to overcome the vanishing gradient problem. As a result, they create rich feature vectors that capture the semantic aspects of texts.

In addition, other text similarity measures, such as the Jaccard measure, could be compared with the Cosine and Soft-Cosine measures, and the proposed method could be extended to estimate the pseudo-labels as a range of numbers expressing the intensity of sentiments. Moreover, the proposed method could be evaluated on datasets in other languages.