1 Introduction

Sentiment Analysis (SA) [1], a sub-branch of Natural Language Processing (NLP), identifies the emotion or opinion expressed in a given text, speech, or other communication. Aspect-Based Sentiment Analysis (ABSA) [2] refines sentiment analysis to a finer granularity and provides crucial information for downstream NLP tasks. Aspect-Category Sentiment Analysis (ACSA) and Aspect-Term Sentiment Analysis (ATSA) form two sub-tasks of ABSA [3, 4]. ACSA predicts the \(sentiment polarity\) for a given \(aspect\) drawn from a predefined set of categories that may not appear explicitly in the sentence. For the sentence "The food is tasty but restaurant is untidy", ACSA can predict the \(sentiment polarity\) for the aspect category "eatable" even though that word is absent from the sentence. ATSA, on the other hand, predicts the \(sentiment polarity\) of an aspect term that occurs in the sentence itself, such as "The food" in the example above [5].

Sentiment polarity differs across independent aspects, and for an ABSA task the given aspect term is the pivotal point. Moreover, many words in a sentence contribute nothing to sentiment prediction for a given term. With "restaurant" as the given aspect, weakly associated words such as "food" and "tasty" are irrelevant to the prediction and can lead to inconsistent results.

Despite the excellent performance of Neural Networks (NN) [6,7,8] in research domains such as \(language translation\), \(paraphrase recognition\), \(question answering\) and \(text summarization\), they are still in their infancy for aspect sentiment classification (ASC). Some works on target-dependent classification benefit only from target information and ignore aspect information, which is crucial for ASC.

The original ASC approaches, built on pre-trained word embeddings and task-centric neural frameworks, face a bottleneck even when accuracy or F1-score improves. The bottleneck stems from a task-agnostic embedding layer initialized with Word2Vec [9,10,11] or GloVe [12], which provides only context-independent word-level features and is insufficient to capture intricate semantic relationships in the sentence. The limited size of the datasets available to train task-centric frameworks is addressed by a deep LSTM [13, 14] with a pre-trained word embedding layer.

Despite the promise of attention-based models [15,16,17], they can fail to capture the dependencies between \(context\) and \(aspect\) in a sequence, which leads the given aspect to mistakenly attend to syntactically irrelevant context words as \(descriptors\). In the example "Its model is ideal and function is excellent.", "excellent" is mistakenly taken as a descriptor of the aspect "model". Moreover, certain models do not fully exploit the syntactic structure but merely impose syntactic constraints on the attention weights.

Convolutional Neural Networks (CNN) [18,19,20] address some issues of attention-based mechanisms by treating features as continuous word spans and applying convolution operations over word sequences to predict the sentiment of an aspect, but they fail to capture the sentiment polarity conveyed by multiple words that are not consecutive. For the sentence "The workers should do a bit more work sincerely", a CNN-based model makes an incorrect prediction by taking "workers" as the aspect and "more work sincerely" as the descriptive phrase, which reverses the sentiment.

\(BERT\) [21,22,23] overcomes these bottlenecks by taking the entire sentence as input to compute token-level representations, whereas \(Word2Vec\)- or \(GloVe\)-based embeddings provide only a single \(context-independent\) representation per word. This paper investigates the modeling power of \(BERT\), a well-known Transformer-based pre-trained model, on ASC. The SSA-GRU-AE model is proposed to strengthen the classifier by attending to the important part of the sentence with respect to a particular aspect.

The principal contributions are:

  1. The SSA-GRU-AE algorithm is proposed to achieve ASC by examining different portions of the sentence when several aspects are considered.

  2. The input and aspect embeddings are generated by the pre-trained BERT model.

  3. The aspect, which plays a major role in this work, is employed in two ways: first, the \(aspect embedding\) vector is concatenated with the input embedding vector; second, it is combined with the hidden vectors of the sentence generated by the GRU to compute attention weights, which efficiently extracts the contextual semantic relationship between the aspect and the sentence.

  4. The sparse self-attention mechanism proposed in our method effectively filters out unimportant words in the sequence that do not contribute to sentiment analysis, and also learns sentiment-aware word embeddings by applying weights to the word embeddings of the input and aspect terms.

  5. An \({{\text{L}}}_{1}\)-regularizer applied to the attention weights ensures that only a minimal number of words influence the semantics and sentiment of the sequence.

  6. Results on ASC datasets reveal that the proposed model outperforms the \(state-of-the-art\) models discussed in the literature.

The remainder of this article is structured as follows: Sect. 2 discusses the related work, Sect. 3 presents the proposed model, Sect. 4 analyses the results, and Sect. 5 concludes the paper (Table 1).

Table 1 List of abbreviations

2 Related Work

The various models put forth in the literature for the ASC task are reviewed here. ASC is a fine-grained classification task within ABSA. Several modern approaches can identify the polarity of an entire sequence even when aspects are unavailable. Conventional approaches design \(bag-of-words\) and \(sentiment lexicon\) features to train an SVM classifier [24] for ABSA. Despite the intensive labor invested in feature engineering, the results depend heavily on the quality of the features.

With the advent of learned distributed representations, Neural Network (NN) approaches became popular for ASC. Classical models such as RecNN [25, 26], CNN [27, 28], RNN, LSTM [29,30,31], GRU [32], GCN [33] and Tree-LSTMs [34] have been applied to ASC. Tree-based LSTMs, which use the syntactic structure of sequences, proved effective for ASC but suffer from syntax parsing errors, which are common in resource-poor languages. LSTMs and GRUs have achieved great success in ASC. Liang et al. introduced a \(deep transition\) model called AGDT [35] with a GRU encoder that utilizes the aspect from scratch to improve feature selection and extraction. TD-LSTM and TC-LSTM [36] made remarkable progress on the ASC task by exploiting target information; however, the target vector in TC-LSTM is obtained by averaging the word vectors of the aspect, which is insufficient to capture its semantics and results in poor performance.

Attention or gating mechanisms have been added to several existing models to capture context-related features. Ma et al. [37] elucidated a hierarchical \(attention\) model that first attends to the aspect words, then reads the whole sentence, and finally integrates this information with external commonsense knowledge using Sentic-LSTM, which resolves the word-conflict problems of basic LSTM models. The position information of the aspect plays a key role in ASC, which [38] exploits by proposing HAPN, capable of learning the position information of the aspect and then fusing aspects and contexts to produce the final sentence representation. Zheng et al. [39] proposed a rotatory \(attention\) based neural network that associates the aspect with its left/right contexts using three LSTMs: one for the \(left context\), one for the \(target phrase\) and one for the \(right context\). Laddha et al. [40] introduced an \(attention\) based \(Bi-LSTM\) model that effectively captures the relation between multiple aspects and the context words while ignoring the effect of one \(aspect\) on another; finally, a CRF is used to model dependencies among output labels. Lin et al. proposed DSMN [41], which guides a multi-hop attention mechanism by computing the distance between an aspect and its context to capture aspect-aware context information.

Numerous existing models combine CNN with LSTM to capture context-related word-level information. Liu et al. [19] combined a regional \(CNN\) with a \(Bi-LSTM\) to obtain contextual information and the relationship between \(aspect\) and \(context\), together with a gating mechanism that improves the word vector representation and makes the model language independent. Zhang et al. [27] proposed CMA-MemNet to obtain semantic information from aspects and sentences: the convolution captures context-related information, while multi-head self-attention obtains semantic information. Nevertheless, CNN-based models inadequately determine sentiments conveyed by multiple non-consecutive words.

Graph Convolutional Network (GCN) models have been developed to capture the dependency relation between \(aspect\) and \(context\) more effectively, given the shortcomings of existing attention-based models. Zheng et al. [42] proposed ASGCN, which combines an LSTM with a multi-layer GCN to fully leverage the syntactic dependency structure within a sentence: the LSTM generates contextual word embeddings that respect word order, and the multi-layer GCN filters out unimportant words, leaving the important information to be fed into an attention-based LSTM that generates aspect-based features for sentiment prediction. Hou et al. [43] proposed SAGCN with self-attention, which enables interaction between an \(aspect\) and its \(opinion words\) even if the aspect term is far away from them, and then considers the connection between the target and its syntactic neighbors. Sun et al. [33] proposed a \(convolution over dependency tree\) model that uses a Bi-LSTM to capture the important features of the sentence, feeds them into a GCN, and then transfers information from the opinion words to the aspect words. Liang et al. [44] proposed an interactive \(multi task\) learning model with a new message-passing mechanism that uses a dependency-relation-embedded GCN to fully exploit syntactic knowledge for end-to-end ABSA. Wu et al. [45] introduced a GCN with \(attention\) that utilizes BERT to capture the relation between an \(aspect\) and its \(context\), where the attention controls information flow in the GCN.

The overall contextual scores are degraded by the failure of dependency-tree-based models to produce hidden vector representations tailored to the aspect. Veyseh et al. [46] elucidated a \(graph-based model\) with gate vectors that customize the hidden vectors towards the aspect terms, together with a \(dependency tree\) based mechanism that assigns an importance score to every word in the sequence.

Although existing GCN models process the entire dependency tree, which complicates optimization, only a small part of the \(dependency tree\) is actually needed for the ASC task. Wang et al. [47] reshaped and pruned the dependency tree to keep only its important part and to focus specifically on the target aspects. The pruned tree is then fed into an \(R-GAT\) to encode the \(dependency relations\) and to establish connections between the \(aspects\) and \(contexts\).

A few issues are observed in the existing models:

  • Despite the promise of attention-based models, they can fail to capture the dependencies between \(context\) and \(aspect\) in a sequence, which leads the given aspect to mistakenly attend to syntactically irrelevant context words as descriptors.

  • Certain models do not fully exploit the syntactic structure but merely impose syntactic constraints on the attention weights.

  • Large-scale corpus training improves neural network models. Manually labeling aspect targets to generate aspect-level training data is difficult.

  • As comments and other corpora with \(document-level\) sentiment labels are hard to obtain, gathering users' preferences about multiple aspect categories becomes infeasible.

Differentiating sentiment polarities at a fine-grained aspect level remains highly demanding despite the efficiency of all these methods. Hence, designing a powerful neural network that fully engages aspect information for ASC is vital. To address these issues, this work proposes a novel SSA-GRU-AE to classify sentiments with respect to aspects effectively. In particular, the sparse self-attention mechanism introduced in our work down-weights the unimportant words and emphasizes the words related to the aspect, helping the model outperform the existing models in the literature.

3 Proposed Work

This paper focuses on ASC for a given input sequence. A pre-trained BERT model computes contextualized word embedding vectors for the sentence and the aspect terms, which then form the input to the SSA-GRU-AE to classify aspect-level sentiments.

3.1 Input Layer

The given input sentence \({\text{S}}=\left\{{{\text{s}}}_{1},{{\text{s}}}_{2},{{\text{s}}}_{3},\dots ,{{\text{s}}}_{{\text{a}}},{{\text{s}}}_{{\text{a}}+1},{{\text{s}}}_{{\text{a}}+2},\dots {{\text{s}}}_{{\text{a}}+\left({\text{m}}-1\right)},\dots ,{{\text{s}}}_{{\text{N}}}\right\}\) of length \({\text{N}}\) with \({\text{m}}\) aspect words \(\left\{{{\text{s}}}_{{\text{a}}},{{\text{s}}}_{{\text{a}}+1},{{\text{s}}}_{{\text{a}}+2},\dots {{\text{s}}}_{{\text{a}}+({\text{m}}-1)}\right\}\) is recast into contextualized word embedding vectors using the \(pre-trained\) BERT model. The input to BERT for the given sentence \({\text{S}}\) takes the form \("[{\text{CLS}}]+{\text{sentence}}+[{\text{SEP}}]+{\text{aspect term}}+[{\text{SEP}}]"\), which makes the interactions between the sentence and the aspect term explicit. The sub-word embeddings generated by BERT are averaged (average pooling) to produce the final embedding vector \({\text{X}}\in {{\text{R}}}^{{\text{N}}\times {\text{d}}}\), where \({\text{d}}\) is the dimension of the BERT output. The vector \({\text{X}}\) is then represented as the vector sequences \(\left\{{{\text{w}}}_{1},{{\text{w}}}_{2},\dots {{\text{w}}}_{{\text{n}}}\right\}\) for the sentence and \(\left\{{{\text{v}}}_{{\text{a}}},{{\text{v}}}_{{\text{a}}+1},{{\text{v}}}_{{\text{a}}+2},\dots {{\text{v}}}_{{\text{a}}+({\text{m}}-1)}\right\}\) for the aspect terms, respectively.
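For concreteness, the following sketch shows one way to build this input format and perform word-level average pooling with the Hugging Face transformers library; the checkpoint name, the library choice, and the pooling details are our assumptions rather than the paper's implementation.

```python
import torch
from transformers import BertTokenizerFast, BertModel

# Sketch (not the authors' code): build "[CLS] sentence [SEP] aspect [SEP]"
# and average-pool sub-word embeddings back to word level.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

sentence = "The food is tasty but restaurant is untidy"
aspect = "food"

# Passing two text segments yields exactly [CLS] sentence [SEP] aspect [SEP].
enc = tokenizer(sentence, aspect, return_tensors="pt")
with torch.no_grad():
    hidden = bert(**enc).last_hidden_state.squeeze(0)   # (num_tokens, d=768)

# Average the sub-word vectors of each original word in the sentence segment
# to obtain the word-level embedding matrix X of shape (N, d).
word_ids = enc.word_ids(batch_index=0)
seq_ids = enc.sequence_ids(batch_index=0)
word_vectors = {}
for tok_idx, (w_id, s_id) in enumerate(zip(word_ids, seq_ids)):
    if s_id == 0 and w_id is not None:       # keep only sentence-segment tokens
        word_vectors.setdefault(w_id, []).append(hidden[tok_idx])
X = torch.stack([torch.stack(v).mean(dim=0) for _, v in sorted(word_vectors.items())])
print(X.shape)  # (N, 768)
```

The same pooling applied to the aspect-segment tokens would yield the aspect embedding vectors used later in the model.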

3.2 Gated Recurrent Unit

A \(vanilla RNN\) suffers from the vanishing or exploding \(gradient\) problem on long sequences because the entire hidden state is overwritten at every time step \({\text{t}}\). The Gated Recurrent Unit (GRU) solves this issue by including two gates in its architecture: an \(update gate\) and a \(reset gate\). These gates discard unimportant information and retain only the important information. The reset gate combines the current input with the important part of the previous \(hidden state\) to produce a candidate \(hidden state\). The update gate determines how much information from the previous hidden state is carried into the final hidden state, allowing the network to retain long-term dependencies. The workflow of the GRU is illustrated in Fig. 1. The calculation process of the GRU is given in Eqs. (1) to (5):

$$ {\text{g}}_{{\text{r}}} = {\upsigma }\left( {{\text{W}}_{{{\text{ir}}}} \cdot {\text{x}}_{{\text{t}}} + {\text{W}}_{{{\text{hr}}}} \cdot {\text{h}}_{{{\text{t}} - 1}} } \right) $$
(1)
$$ {\text{r}} = {\text{tanh}}\left( {{\text{g}}_{{\text{r}}} \odot \left( {{\text{W}}_{{\text{h}}} \cdot {\text{h}}_{{{\text{t}} - 1}} } \right) + {\text{W}}_{{\text{x}}} \cdot {\text{x}}_{{\text{t}}} } \right) $$
(2)
$$ {\text{g}}_{{\text{u}}} = {\upsigma }\left( {{\text{W}}_{{{\text{iu}}}} \cdot {\text{x}}_{{\text{t}}} + {\text{W}}_{{{\text{hu}}}} \cdot {\text{h}}_{{{\text{t}} - 1}} } \right) $$
(3)
$$ {\text{u}} = {\text{g}}_{{\text{u}}} \odot {\text{h}}_{{{\text{t}} - 1}} $$
(4)
$$ {\text{h}}_{{\text{t}}} = {\text{r}} \odot \left( {1 - {\text{g}}_{{\text{u}}} } \right) + {\text{u}} $$
(5)
Fig.1
figure 1

Gated recurrent unit

The class labels used in the proposed work are \(\left\{{\text{positive}},\ {\text{negative}},\ {\text{neutral}}\right\}\).
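As a concrete reference for Eqs. (1) to (5), the following minimal NumPy sketch transcribes one GRU time step; the toy dimensions and the omission of bias terms follow the equations as written and are not the paper's code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_ir, W_hr, W_h, W_x, W_iu, W_hu):
    """One GRU time step following Eqs. (1)-(5) (bias terms omitted as in the text)."""
    g_r = sigmoid(W_ir @ x_t + W_hr @ h_prev)        # Eq. (1): reset gate
    r = np.tanh(g_r * (W_h @ h_prev) + W_x @ x_t)    # Eq. (2): candidate hidden state
    g_u = sigmoid(W_iu @ x_t + W_hu @ h_prev)        # Eq. (3): update gate
    u = g_u * h_prev                                  # Eq. (4): retained previous state
    h_t = r * (1.0 - g_u) + u                         # Eq. (5): new hidden state
    return h_t

# toy dimensions: input size 4, hidden size 3
rng = np.random.default_rng(0)
d_in, d = 4, 3
W_ir, W_iu, W_x = (rng.normal(size=(d, d_in)) for _ in range(3))
W_hr, W_hu, W_h = (rng.normal(size=(d, d)) for _ in range(3))
h = np.zeros(d)
for x in rng.normal(size=(5, d_in)):                  # run over a length-5 sequence
    h = gru_step(x, h, W_ir, W_hr, W_h, W_x, W_iu, W_hu)
print(h)
```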

3.3 Self-Attention Based GRU with Aspect Embedding

The plain GRU struggles to identify the key part of a sentence for the ASC task. To overcome this problem, this work proposes the novel SSA-GRU-AE, which integrates a sparse self-attention mechanism with the GRU to capture the relevant part of the sentence with respect to a given aspect. Aspect information plays a key role in classifying the polarity of a given sentence; to utilize it effectively, this work generates an embedding vector for each aspect using BERT. Here \({{\text{V}}}_{{{\text{a}}}_{{\text{i}}}}\in {\mathbb{R}}^{{{\text{d}}}_{{\text{a}}}}\) is the embedding of aspect \({\text{i}}\), where \({{\text{d}}}_{{\text{a}}}\) is the dimension of the aspect embedding. \({\text{H}}\in {\mathbb{R}}^{{\text{d}}\times {\text{N}}}\) is the matrix formed by the hidden vectors \(\left[ {{\text{h}}_{1} ,{\text{h}}_{2} ,{\text{h}}_{3} , \ldots ,{\text{h}}_{{\text{N}}} } \right]\) generated by the GRU, where \({\text{d}}\) is the size of the hidden layers and \({\text{N}}\) is the length of the given sentence. \({\text{e}}_{{\text{N}}} \in {\mathbb{R}}^{{\text{N}}}\) is a vector of ones. The self-attention mechanism produces an attention weight vector \({\upalpha }\) and a weighted hidden representation \({\text{r}}\), as specified in Eqs. (6) to (8).

$$ {\text{M}} = {\text{tanh}}\left( {\left[ {\begin{array}{*{20}c} {{\text{W}}_{{\text{h}}} {\text{H}}} \\ {{\text{W}}_{{\text{V}}} {\text{V}}_{{\text{a}}} \otimes {\text{e}}_{{\text{N}}} } \\ \end{array} } \right]} \right) $$
(6)
$$ {\upalpha } = {\text{softmax}}\left( {{\text{W}}^{{\text{T}}} {\text{M}}} \right) $$
(7)
$$ {\text{r}} = {\text{H}}{\upalpha }^{{\text{T}}} $$
(8)

where \({\text{ M}} \in {\mathbb{R}}^{{\left( {{\text{d}} + {\text{d}}_{{\text{a}}} } \right) \times {\text{N}}}}\), \({\upalpha } \in {\mathbb{R}}^{{\text{N}}}\) and \({\text{r}} \in {\mathbb{R}}^{{\text{d}}}\). \({\text{W}}_{{\text{h}}} \in {\mathbb{R}}^{{{\text{d}} \times {\text{d}}}}\), \({\text{W}}_{{\text{V}}} \in {\mathbb{R}}^{{{\text{d}}_{{\text{a}}} \times {\text{d}}_{{\text{a}}} }}\) and \({\text{W}} \in {\mathbb{R}}^{{{\text{d}} + {\text{d}}_{{\text{a}}} }}\) are weight parameters. \(\otimes\) is a concatenation operator that repeats \({\text{V}}_{{\text{a}}}\) \({\text{N}}\) times; that is, \({\text{W}}_{{\text{V}}} {\text{V}}_{{\text{a}}} \otimes {\text{e}}_{{\text{N}}}\) repeats the linearly transformed \({\text{V}}_{{\text{a}}}\) once for every word in the given sentence. The final sentence representation is given in Eq. (9),

$$ {\text{h}}^{*} = {\text{tanh}}\left( {{\text{W}}_{{\text{p}}} {\text{r}} + {\text{W}}_{{\text{x}}} {\text{h}}_{{\text{N}}} } \right) $$
(9)

where \({\text{h}}^{*} \in {\mathbb{R}}^{{\text{d}}}\), \({\text{W}}_{{\text{p}}}\) and \({\text{W}}_{{\text{x}}}\) are weight parameters.
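The computation in Eqs. (6) to (9) can be sketched directly in NumPy; the random toy parameters below are illustrative only, and the shapes follow the definitions given above.

```python
import numpy as np

def aspect_attention(H, v_a, W_h, W_v, w, W_p, W_x):
    """Aspect-conditioned attention over GRU hidden states, following Eqs. (6)-(9).

    H   : (d, N) matrix of hidden vectors [h_1, ..., h_N]
    v_a : (d_a,) aspect embedding
    """
    d, N = H.shape
    Va_rep = np.tile((W_v @ v_a)[:, None], (1, N))   # W_V V_a repeated N times (⊗ e_N)
    M = np.tanh(np.vstack([W_h @ H, Va_rep]))        # Eq. (6), shape (d + d_a, N)
    scores = w @ M                                   # W^T M, shape (N,)
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                             # Eq. (7): softmax over the N words
    r = H @ alpha                                    # Eq. (8): attention-weighted vector
    h_star = np.tanh(W_p @ r + W_x @ H[:, -1])       # Eq. (9): h_N is the last hidden state
    return h_star, alpha

# toy usage with random parameters (shapes follow the "where" clause above)
d, d_a, N = 4, 3, 6
rng = np.random.default_rng(1)
h_star, alpha = aspect_attention(
    rng.normal(size=(d, N)), rng.normal(size=d_a),
    rng.normal(size=(d, d)), rng.normal(size=(d_a, d_a)),
    rng.normal(size=d + d_a), rng.normal(size=(d, d)), rng.normal(size=(d, d)))
print(alpha.round(3), h_star.shape)
```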

The self-attention mechanism captures the significant part of the sentence with respect to an aspect, and \({\text{h}}^{*}\) represents the feature of a particular sentence with respect to that aspect. A linear layer is then added to transform \({\text{h}}^{*}\) into a sentence vector \({\text{e}}\) whose length equals the number of classes \(\left| {\text{C}} \right|\). Finally, a softmax layer transforms \({\text{e}}\) into a conditional probability distribution, as specified in Eq. (10).

$$ {\text{y}} = {\text{softmax}}\left( {{\text{W}}_{{\text{s}}} {\text{h}}^{*} + {\text{b}}_{{\text{s}}} } \right) $$
(10)

where \({\text{W}}_{{\text{s}}}\) and \({\text{b}}_{{\text{s}}}\) are the weight and bias parameters of the \({\text{softmax}}\) layer. An \({\text{L}}_{1}\) regularizer is further applied to ensure that only a minimal number of words contribute to the semantics and sentiment of the sentence. The sparse self-attention mechanism is depicted in Eq. (11).

$$ \left| {\text{y}} \right|_{{{\text{L}}_{1} }} = \left| {{\text{softmax}}\left( {{\text{W}}_{{\text{s}}} {\text{h}}^{*} + {\text{b}}_{{\text{s}}} } \right)} \right| $$
(11)

The proposed sparse self-attention mechanism effectively removes the words that are unimportant for predicting the sentiment and identifies the key part of the sentence. After computing the important part of the sentence, a weighted summation is performed to predict the sentiment polarity, as specified in Eq. (12).

$$ {\text{d}}^{{\text{k}}} = {\text{y}}_{1} {\text{x}}_{1} + {\text{y}}_{2} {\text{x}}_{2} + \ldots + {\text{y}}_{{\text{n}}} {\text{x}}_{{\text{n}}} $$
(12)

where \({\text{x}}_{{\text{i}}}\) denotes the embedding of the \({\text{i}}^{{{\text{th}}}}\) word in the sentence and \({\text{n}}\) is the length of the sentence. Finally, the output of the sparse self-attention layer is obtained using Eq. (13).

$$ {\hat{\text{y}}} = {\text{softmax}}\left( {{\text{Wd}}^{{\text{k}}} + {\text{b}}} \right) $$
(13)

where \({\hat{\text{y}}}\) is the predicted sentiment polarity, a \(2 - {\text{D}}\) vector in which \(\left( {1,0} \right)\) and \(\left( {0,1} \right)\) denote the \(positive\) and \(negative\) labels respectively.
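A minimal sketch of Eqs. (10) to (13) is given below, under our reading that the vector \({\text{y}}\) of Eq. (10) supplies one weight per word and that the \({\text{L}}_{1}\) term of Eq. (11) is added to the training objective; the parameter shapes are assumptions consistent with those equations, not the paper's code.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sparse_self_attention_output(h_star, X, W_s, b_s, W, b, l1_lambda=0.01):
    """One reading of Eqs. (10)-(13): word weights from h*, L1 sparsity, weighted sum.

    h_star : (d,)    aspect-aware sentence feature from Eq. (9)
    X      : (n, d)  embeddings x_1..x_n of the sentence words
    """
    y = softmax(W_s @ h_star + b_s)           # Eq. (10): one weight per word (length n)
    l1_term = l1_lambda * np.abs(y).sum()     # Eq. (11): L1 penalty added to the loss
    d_k = y @ X                               # Eq. (12): weighted sum of word embeddings
    y_hat = softmax(W @ d_k + b)              # Eq. (13): sentiment distribution over classes
    return y_hat, l1_term

# toy usage (n words, embedding size d, C classes)
n, d, C = 5, 4, 3
rng = np.random.default_rng(2)
probs, penalty = sparse_self_attention_output(
    rng.normal(size=d), rng.normal(size=(n, d)),
    rng.normal(size=(n, d)), rng.normal(size=n),
    rng.normal(size=(C, d)), rng.normal(size=C))
print(probs.round(3), round(penalty, 4))
```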

The model is trained using back propagation, where a \(cross entropy\) loss function minimizes the error between \({\text{y}}\) and \({\hat{\text{y}}}\) over all sentences, as specified in Eq. (14).

$$ {\text{loss}} = - \mathop \sum \limits_{{\text{i}}} \mathop \sum \limits_{{\text{j}}} {\text{y}}_{{\text{i}}}^{{\text{j}}} {\text{log}}\widehat{{{\text{y}}_{{\text{i}}} }}^{{\text{j}}} $$
(14)

where \({\text{i}}\) and \({\text{j}}\) are the indices of the sentence and the class respectively.

An Adagrad optimizer [48] is then adopted to train the model over mini-batches. This optimizer improves on SGD by adapting the learning rate per parameter: it applies larger updates to infrequently updated parameters and smaller updates to frequently updated ones. The workflow of the proposed SSA-GRU-AE is illustrated in Fig. 2.
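An illustrative training loop under these choices is sketched below; the paper reports a TensorFlow implementation, so the PyTorch code, the placeholder `model` and `train_loader`, and the learning rate are our assumptions.

```python
import torch
from torch import nn

# Illustrative sketch (not the authors' code): cross-entropy loss of Eq. (14)
# minimized with Adagrad over mini-batches; `model` and `train_loader` are placeholders.
def train(model, train_loader, num_epochs=10, lr=0.01):
    criterion = nn.CrossEntropyLoss()                 # cross-entropy of Eq. (14) on logits
    optimizer = torch.optim.Adagrad(model.parameters(), lr=lr)
    for epoch in range(num_epochs):
        for sentence_ids, aspect_ids, labels in train_loader:
            optimizer.zero_grad()
            logits = model(sentence_ids, aspect_ids)  # (batch, |C|) class scores
            loss = criterion(logits, labels)
            loss.backward()                           # back propagation
            optimizer.step()                          # Adagrad parameter update
    return model
```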

Fig.2
figure 2

Proposed SSA-GRU-AE model

4 Experiments

The proposed SSA-GRU-AE model performs the ASC task in which both the \(word\) and \(aspect\) embedding vectors are produced by pre-trained BERT, and the length of the attention weight vector equals the length of the input sentence. The hidden size \(\left( {{\text{dim}}_{{\text{h}}} } \right)\) of BERT is 768 and the number of transformer layers \(\left( {\text{L}} \right)\) is 12. The pre-trained BERT model initializes the word and aspect embedding vectors, while all other weight parameters are initialized by random sampling from the normal distribution \({\mathcal{N}}\left( {0;{ }0:0.2} \right)\). The SSA-GRU-AE is implemented with a total of 8 attention heads and 12 training layers. The proposed and all baseline models are implemented in TensorFlow, and the hyper-parameters applied are specified in Table 2.

Table 2 Hyper parameters for baseline models
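The explicitly stated settings can be gathered into a small configuration sketch such as the one below; any entry not quoted from the text above (such as the checkpoint name) is an illustrative placeholder, and Table 2 remains the authoritative source for the remaining hyper-parameters.

```python
# Configuration sketch assembled from the settings stated in Sect. 4.
# Entries marked "assumed" are illustrative placeholders, not values from the paper.
config = {
    "bert_checkpoint": "bert-base-uncased",  # assumed; the text only gives dim_h and L
    "bert_hidden_size": 768,                 # dim_h
    "bert_transformer_layers": 12,           # L
    "attention_heads": 8,
    "training_layers": 12,
    "optimizer": "Adagrad",
    "classes": ["positive", "negative", "neutral"],
    "runs_for_reporting": 3,                 # results averaged over 3 random initializations
}
print(config["bert_hidden_size"], config["attention_heads"])
```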

The results, averaged over 3 runs with random initialization and evaluated with \(Accuracy\) and \(F1 - score\), are shown in Table 3. A \({\text{Friedman}}\) test was performed on both the \(Accuracy\) and \(F1 - score\) values to check the competency of our model against the baseline models, as shown in Table 10.

Table 3 Evaluation metrics

4.1 Datasets

Five datasets \(\left( {{\text{twitter}},{\text{ LAP}}14,{\text{ REST}}14,{\text{ REST}}15{\text{ and REST}}16} \right)\) are employed to show that the proposed method surpasses the existing baseline models. The five datasets contain user reviews annotated with a set of aspects and their polarities, from which sentences with contradictory polarities or inexplicit aspects are removed. The purpose of the proposed work is to ascertain the polarity of a sentence with respect to an aspect. The dataset details are exhibited in Table 4.

Table 4 Statistical representation of dataset

4.2 Baselines and Experimental Setting

Ten sentiment classification models are implemented as baselines for comparison with the proposed model:

  1. SVM: a conventional classifier [24] trained on hand-crafted features such as bag-of-words and sentiment lexicons (see Sect. 2).

  2. LSTM: a standard LSTM [36] that models the sentence representation for sentiment prediction.

  3. Interactive Attention Network (IAN): the IAN model [53] uses two LSTMs, one for the target and one for the context, to extract information independently via an interactive attention mechanism, and then concatenates the two representations to classify the sentiment.

  4. Memory Network (MemNet): MemNet [54] combines an attention mechanism with explicit memory; its multi-hop attention mechanism helps to improve sentiment classification.

  5. AOA [55]: this model learns the aspect and sentence representations jointly and explicitly captures the interaction between them.

  6. ASGCN [42]: this model uses an LSTM to generate contextual information, a GCN to obtain aspect-specific features, a masking mechanism to remove non-aspect words, and another LSTM to predict the sentiment.

  7. SAGCN [43]: SAGCN uses a GCN over the dependency tree to find the correlation between the aspect and the sentence.

  8. IGCN [28]: a bidirectional gating mechanism in IGCN computes the relation between the \(aspect\) and its \(context\).

  9. DSMN [41]: the multi-hop attention is guided by a dynamically selected context memory, which integrates the \(aspect\) information with the memory network.

  10. CMA-MemNet [27]: this memory network extracts the rich semantic information between the \(aspect\) and the \(sentence\).

5 Experimental Results

In the experiments, \(accuracy\) and \(F1 - score\) are the metrics used to evaluate the performance of the proposed method. To evaluate the stability of the model, the method is run three times and the \(mean accuracy\) and \(standard deviation\) are reported in Tables 5, 6, 7, 8 and 9. The Friedman test verifies the significance of the differences between the proposed approach and the other approaches at a \(p - value\) of 0.05.
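The evaluation protocol can be reproduced along the following lines with scikit-learn and SciPy; the predictions and per-dataset scores in the sketch are placeholders rather than the paper's results, and macro-averaged F1 is our assumption.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score
from scipy.stats import friedmanchisquare

# Per-run evaluation (placeholder predictions, not the paper's outputs).
y_true = np.array([0, 1, 2, 1, 0, 2])          # 0=positive, 1=negative, 2=neutral
y_pred = np.array([0, 1, 2, 0, 0, 2])
acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, average="macro")  # macro-F1 assumed
print(f"accuracy={acc:.3f}, macro-F1={f1:.3f}")

# Friedman test across models: each list holds one model's score on the five datasets.
# The values below are placeholders purely to show the call.
model_a = [0.71, 0.75, 0.80, 0.79, 0.88]
model_b = [0.72, 0.77, 0.81, 0.80, 0.89]
model_c = [0.74, 0.79, 0.83, 0.82, 0.91]
stat, p_value = friedmanchisquare(model_a, model_b, model_c)
print(f"Friedman statistic={stat:.3f}, p={p_value:.3f}")   # compare p with 0.05
```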

Table 5 \(Accuracy\) and \(F1 - score\) on Twitter dataset
Table 6 \(Accuracy\) and \(F1 - score\) on Lap14 dataset.
Table 7 \(Accuracy\) and \(F1 - score\) on Rest14 dataset.
Table 8 \(Accuracy\) and \(F1 - score\) on Rest15 dataset.
Table 9 \(Accuracy\) and \(F1 - score \) on Rest16 dataset.

5.1 Discussion

The comparison of the proposed work with eighteen baseline models clearly shows that it outperforms the existing models in terms of \(accuracy\) and \(F1 - score\). The SSA-GRU-AE model consistently performs better than all baseline models on all five datasets \(\left( {twitter, LAP14, REST14, REST15\ and\ REST16} \right)\), whereas SVM and LSTM perform poorly across all five datasets, owing to the manual feature engineering in SVM and the lack of aspect information in LSTM.

The IAN and AOA models use attention mechanisms to solve the ASC task by attending to all aspect and context words, which helps them outperform both the SVM and the baseline LSTM models, and the experimental results show that both perform consistently well on all five datasets. The issue with the attention mechanism is that, when the dataset is noisy or contains multiple aspects, it falsely assigns high scores to irrelevant words; moreover, these attention-based models attend to all aspect words with different weights, which incorrectly guides the \(aspect term\) to focus on syntactically \(unrelated words\). These issues cause the AOA and IAN models to perform only moderately on all five datasets in terms of \(accuracy\) and \(F1-score\).

GCN-based models effectively capture both syntactic word dependencies and long-range word relations. They work well on datasets that are rich in syntactic information and have a good grammatical structure; in particular, all the GCN-based models perform well on the \(LAP14, REST15\) and \(REST16\) datasets and improve both \(accuracy\) and \(F1-score\). However, they fail on datasets with little grammatical information and weak syntactic signals: on both the Twitter and REST14 datasets these GCN models produce lower \(accuracy\) and \(F1 - score\). The results in Tables 5, 6, 7, 8, 9 and 10 clearly show that the proposed work outperforms the existing GCN-based models in terms of \(accuracy\) and \(F1-score\).

Table 10 Friedman’s test

Memory-network-based models perform consistently well on all five datasets in terms of \(accuracy\) and \(F1 - score\). The base memory network captures aspect-sequence modeling effectively but fails to capture context and sequence information, which lowers the performance of MemNet on all five datasets. CMA-MemNet and DSMN capture both kinds of information effectively, which helps them perform better on all five datasets and outperform the basic, \(attention\) and \(GCN\) models. Although these models perform well on all datasets, they still fail to recognize all aspects correctly, which leads to a performance loss.

The proposed sparse self-attention GRU with aspect embedding built on BERT outperforms all baseline models on all five datasets in terms of \({\text{accuracy}}\) and \({\text{F}}1 - {\text{score}}\). The BERT model used in our work effectively captures contextual information, which in turn helps capture semantic information. The sparse self-attention mechanism introduced in our work effectively removes unimportant words and retains only the words that are important to the sentence, and the \({\text{L}}1\)-regularizer applied to the attentions ensures that only a few words contribute to the sentiment of each sentence. This helps the proposed model perform better on datasets that are grammatically poor and noisy \(\left( {{\text{Twitter and REST}}14} \right)\). These advantages allow our model to outperform the existing baselines on all five datasets, and in particular to perform consistently well on the \({\text{Twitter and REST}}14\) datasets. Figures 3, 4, 5, 6 and 7 show the comparative results of all baseline models and the proposed model on the five datasets.

Fig. 3
figure 3

Result analysis on Twitter dataset

Fig. 4
figure 4

Result analysis on Lap14 dataset

Fig. 5
figure 5

Result analysis on Rest14 dataset

Fig. 6
figure 6

Result analysis on Rest15 dataset

Fig. 7
figure 7

Result analysis on Rest16 dataset

5.2 Ablation Study

An ablation study is carried out to examine the significance of each component of the proposed model. First, the pre-trained BERT model is replaced with GloVe and the resulting model is run on the five datasets. The proposed model without BERT performs worse in terms of \(accuracy\) and \(F1 - score\) than the model with BERT: it yields 2.9%, 2.7%, 1.6%, 1.4%, and 1.3% lower \(accuracy\) and 2.6%, 2.4%, 1.3%, 1.1% and 1% lower \(F1 - score\) on the \(Twitter, REST14, LAP14, REST15\) and \(REST16\) datasets respectively, and also falls behind some of the GCN and memory network models. The BERT model effectively captures the semantic relation between aspect and context, which improves the performance of the proposed model.

To identify the importance of the \(sparse self - attention\) mechanism, a model without it is designed and run on the five datasets. The proposed model without sparse self-attention performs worse in terms of \({\text{accuracy}}\) and \({\text{F}}1 - {\text{score}}\) than the full model: it yields 2.6%, 2.3%, 1.4%, 1.2%, and 1.1% lower \({\text{accuracy}}\) and 2.3%, 2.1%, 1.1%, 0.9% and 0.7% lower \({\text{F}}1 - {\text{score}}\) on the \({\text{Twitter}},{\text{ REST}}14,{\text{ LAP}}14,{\text{ REST}}15{\text{ and REST}}16\) datasets respectively. These experimental results prove that the proposed sparse self-attention mechanism improves the performance of the model: it effectively captures the importance of each context word with respect to the \(aspect\), removes the unimportant words from the sentence, and keeps only the important part, which yields the improvement over the variant without sparse self-attention.

5.3 Case Study

For instance, consider the following phrase: "\(\user2{Even if it^{\prime}s a good day}\), \(\user2{I don^{\prime}t feel it}\). \(\user2{I^{\prime}m really miserable}\)." The terms "\(miserable\)", "\(feel\)" and "\(don^{\prime}t\)" are far more significant for predicting the sentiment polarity of this sentence than the words "\(good\)" and "\(day\)". Many other words, including "the", "it" and "I'm", are unimportant. It is therefore crucial to build a model that can accurately reflect the significance of each word while remaining sparse enough that only a \(few words\) determine the sentiment label of the sentence. The self-attention layer of the proposed SSA-GRU-AE is used to determine the significance of each word in the sentence, and an L1 regularization is then applied to these weights to ensure that only a handful of words are needed to identify the sentiment polarity.

Consider the sentence "The meal is tasty, but the restaurant is untidy.": the proposed model predicts the \(sentiment polarity\) for the aspect "meal" as \(positive\) and for "restaurant" as \(negative\). The self-attention layer accurately links the \(aspect\) term "meal" with its \(context\) word "tasty" and the \(aspect\) term "restaurant" with its \(context\) word "untidy". The sparse nature of the \(self - attention\) mechanism helps retain only the important words of the sentence, such as "meal", "tasty", "restaurant" and "untidy", and discard unimportant words such as "The", "is" and "but". This ability to retain only the important words helps the ASC task to be carried out efficiently. The self-attention mechanism thus serves as a sparsification mechanism, a form of regularization that can enhance the quality of the model by reducing the noise within it. Owing to the sparse nature of the proposed SSA-GRU-AE, neurons whose output activations are very similar are combined, their biases adjusted, and the network rewired to reflect these changes. Such sparsification improves the efficiency of models in high-dimensional feature spaces: it reduces the complexity of the representation, since only a subset of dimensions is used at any given moment, and it decreases complexity further by nullifying specific subsets of the model parameters. As a result, many useless words in the sentence are excluded from the \(sentiment polarity\) prediction for a given \(aspect\). Because of these advantages, the proposed model identifies the \(aspect\) phrases and their corresponding \(context\) words within sentences, such as "meal" and "tasty" in the statement above.

Another example is "The laptop's model is ok, and its performance is great.". The proposed SSA-GRU-AE model correctly identifies the aspect term "model" with its context "ok" and predicts its sentiment polarity as neutral, while for the aspect term "performance" the context is identified as "great" and the sentiment polarity is predicted as positive. The sparse self-attention mechanism assigns high weights to the terms "laptop", "model", "ok", "performance" and "great" and low weights to the terms "The", "is", "its" and "and". This example shows the importance of sparsification in the self-attention mechanism for ASC tasks. The L1 regularizer applied to the attentions ensures that only a few words contribute to the \(sentiment\) of a sentence. When entire neurons or filters are eliminated, the principles of associativity and distributivity can be used to convert the sparsified structure into a more compact dense structure; by contrast, when arbitrary elements of a weight matrix are eliminated, the indices of the remaining non-zero items must be retained. Model sparsification alters the model's characteristics but does not modify the sparsity pattern observed across successive inferences or forward passes. This helps the proposed model perform better on datasets that are grammatically poor and noisy.

Consider the example "The workers should do more work truly.", with "workers" as the aspect and "more work truly" as its context phrase. Existing DNN models identify the aspect term "workers" and the context term "work truly" and predict the sentiment polarity as positive, yet the weight of the term "more" is important in this context. The sparse self-attention in the proposed model identifies the full context term "more work truly" and predicts the actual sentiment polarity, "negative". This elucidates the importance of the sparse nature of the self-attention mechanism and shows that the proposed model is able to capture context words with implicit meaning (Fig. 8).

Fig.8
figure 8

Case study example

6 Conclusion

SSA-GRU-AE is proposed to perform the ASC task and contains three parts: a BERT embedding layer, a GRU layer, and a sparse self-attention layer. The pre-trained BERT model effectively captures the contextual word embeddings of both the sentence and the aspect, as well as the relation between them. The sparse self-attention mechanism proposed in our work effectively captures the important part of the sentence with respect to the aspect. The experimental results on 5 datasets \(\left( {{\text{twitter}},{\text{ LAP}}14,{\text{ REST}}14,{\text{ REST}}15{\text{ and REST}}16} \right)\) prove that the proposed method outperforms the existing baseline models in terms of \({\text{accuracy}}\) and \({\text{F}}1 - {\text{score}}\). The ablation study and discussion further demonstrate the proficiency of the proposed model.