1 Introduction

Opinion Extraction (OE) is a form of information retrieval that derives the overall opinion about a product by observing web data and analyzing reviews. It is also referred to as sentiment analysis, a branch of natural language processing. It is commonly used to assess public opinion or emotion toward a subject or product, so that marketers can determine whether the public's attitude is favorable or unfavorable (Tan et al. 2019; Haque et al. 2018). This can help them address their products' weaknesses and adjust their strategies to achieve better results. By understanding why a given product received high ratings and what makes it stand out, a company or manufacturer can capitalize on these insights and create new products with similar attributes.

Opinion extraction, according to Pang and Lee (2008), is concerned with determining the orientation of opinion expressed in a piece of text with respect to a specific topic. Opinion extraction involves three components: (1) opinion holder: the person who offers the opinion; (2) opinion object: the subject of the opinion; (3) opinion orientation: whether the polarity toward the object is positive, neutral or negative (Qiu et al. 2011; Turney 2002). For example, in the review "the camera quality of the Samsung mobile is superb," the user who posted the review is the opinion holder, the opinion object is the Samsung mobile's "camera quality," and the opinion term is "superb," which has a positive orientation.

The primary task of opinion extraction is identifying the semantic orientation of text at different levels of granularity. It takes place at three levels (Liu 2012): (1) document level, where the entire document is classified as positive, neutral or negative; (2) sentence level, where each sentence is evaluated as positive, neutral or negative; (3) aspect/feature level, where each feature mentioned in the document or sentence is classified as positive, negative or neutral (Ratmele and Thakur 2020).

Different approaches have been proposed, depending on how the rules for analyzing textual data are obtained. Rule-based, statistical, and hybrid approaches (Wu et al. 2018) are the most commonly applied methodologies in opinion extraction. In this work, a statistical approach based on deep learning with a HAN model is used to extract accurate opinions from Amazon Smartphone reviews.

1.1 Need

Opinion extraction is needed in a variety of applications, including recommendation systems, search engines, web ad filtering, email filtering and question-answering systems. It is useful in everyday life because it can improve human–computer interaction, government intelligence, business intelligence and citation analysis. Machine learning can be used to extract opinions from small amounts of data, but for large amounts of data deep learning produces better results. In this work, a large dataset of 56,000 Amazon Smartphone reviews is used to train the model for opinion extraction; deep learning is therefore well suited to this volume of data.

1.2 Motivation

In the digital world, e-commerce platforms allow millions of people to buy a wide variety of products. On these platforms, customers can openly post their thoughts and reviews about specific products, so such websites are a rich source of information for determining consumer opinions and estimating product quality. With the rapid growth of online platforms, the volume of reviews has become enormous, and interpreting the general opinion of consumers about a particular product is a tedious task. Other customers can use this information to make purchase decisions, company executives can improve their products, and governments can improve product quality standards (Chen and Xie 2008; Abrahams et al. 2013). For example, during the COVID-19 pandemic a nationwide lockdown was imposed. People were confined to their homes and could not purchase products in person, so online portals were used to buy goods and essentials. A person who wants to purchase a mobile phone online must judge its quality from the available reviews, but there are far too many reviews to process and evaluate manually. This motivates the problem of opinion extraction, which involves identifying the polarity of reviews, i.e., whether they are positive, negative, or neutral. The need to interpret statements written in unstructured form has further increased the demand for opinion prediction. To extract correct opinions from a vast number of reviews, deep learning and semantic-based classification are used.

The main focus of this work is on deciphering the opinions conveyed in Amazon product reviews written in an unstructured format. Polarity identification allows reviews to be classified into five classes: extremely positive, positive, neutral, negative and extremely negative.

1.3 Challenges

Opinions are frequently written informally and vary widely, which gives rise to a number of challenges (Balahur et al. 2012). Some of these challenges are discussed below:

  1. False review detection: Some people post reviews for a product even though they never bought it. Such reviews are false, and identifying them is difficult.

  2. Different writing styles: Individual writing styles vary, and the same sentence can be written in many ways. An opinion may contain both positive and negative characteristics. As a result, analyzing an opinion at the phrase level, where positive and negative feelings are mixed, is challenging.

  3. Implication of polarity: If the true polarity of an individual word is not examined in relation to its context, the overall accuracy of the polarity recognition process may suffer. For example, in the review "Camera size of this mobile is small," the word small is used in a positive sense, while in "The battery duration is similarly small," the term small is used in a negative sense (Dhokrat et al. 2015). One drawback of earlier opinion extraction methods is that they determine the polarities of words individually without taking into account the polarity of surrounding words. Failing to use the meaning of opinions in their target context frequently results in poor performance.

  4. Effects of syntax on semantics: Reviews are written casually and are often not syntactically correct. The structure of reviews, contextual opinions and heterogeneous reviews can present major challenges.

  5. Limitations of lexicon-based methods: Context-dependent words must be handled carefully in their true context for effective opinion extraction. Lexicon-based opinion extraction methods are ineffective when dealing with such context-dependent terms.

1.4 Problem statement

In general, people express their thoughts casually, and these casual/unstructured reviews contain a wealth of data that must be examined in order to extract opinions or overall sentiment. The aim of this research is to determine how to distinguish and summarize Amazon Smartphone characteristics from large collections of consumer reviews. It can be regarded as an opinion extraction study because the goal is to extract relevant knowledge from customer reviews.

1.5 Our contribution

In this research, a deep learning framework is proposed for opinion extraction from unstructured product reviews. The proposed approach performs word embedding and classification of opinions with the help of a HAN model. This paper's contributions can be outlined as follows:

  (a) To scrape a dataset of Amazon Smartphone reviews.

  (b) To design a framework for opinion extraction that identifies the true polarity of customer reviews of a product.

  (c) To normalize and pre-process the noisy text collected from Amazon's website for appropriate sentiment classification.

  (d) To design a HAN (hierarchical attention network)-based framework that recognizes opinions expressed in free-form text. In this framework, GloVe (Pennington et al. 2014) is used for embedding and a bi-directional GRU is used for semantic-based classification of opinions.

  (e) To assign a five-point polarity scale to individual words/statements as well as to sets of statements containing product details, labeling them as neutral, positive, negative, extremely positive or extremely negative.

1.6 Paper organization

The remainder of the paper is organized as follows: Section 2 outlines the important research related to opinion extraction techniques. Section 3 presents the proposed framework and methodology for opinion extraction based on semantic classification using HAN. Section 4 presents the experiments and evaluation, Sect. 5 discusses the experimental results and evaluates the proposed methods, and Sect. 6 concludes the paper.

2 Literature survey

In recent years, with the expansion of online reviews, opinion extraction has received increasing attention as a means of obtaining users' opinions. To enhance online business, user reviews cannot be underestimated. Therefore, many researchers have developed systems to extract opinions or information from reviews (Khan et al. 2014; Vo et al. 2018; Hussein 2018; Wang et al. 2019; Ratmele and Thakur 2019; Hajek et al. 2020). Some of the most relevant research is presented here.

Wei et al. (2020) presented an approach for sentiment analysis based on a bi-directional LSTM with multi-polarity orthogonal attention. To improve the Bi-LSTM model for implicit sentiment analysis, a novel multi-polarity orthogonal attention mechanism was introduced. They initialize attentions by employing embeddings of words that differ with respect to sentiment polarity; as a result, their model can better capture the properties of each sentiment polarity. Abdalgader and Al Shibli (2020) suggested a lexicon-based word-polarity recognition approach for customer reviews. In this approach, they compute the semantic relationship between the context expansion list of the target word and a synonym expansion list of all words around the target word inside the review text. Using this approach, they obtain a high score of semantic and sentimental knowledge, and readers can more easily understand the context of a long review. Yi and Liu (2020) designed a hybrid recommendation system (HRS) for customer sentiment analysis using a machine learning regression model. This technique has been shown to be useful in identifying a customer's preferred shop based on the items they have purchased. The most notable characteristic of the HRS strategy is that no human intervention is required when determining customer shopping preferences. The reported mean absolute percentage error value for HRS was close to 98 percent, indicating a high level of accuracy. Kauffman et al. (2019) proposed sentiment analysis on customer reviews to support marketing decisions. They used a quantity-based star rating and a sentiment analysis mechanism to check subjective customer reviews and classify buyers' opinions. To support purchase decisions, they designed a framework in which NLP, sentiment analysis, data mining and clustering methods are used to calculate the sentiment score of a particular product feature. The framework calculates the total score of the product on the basis of price and the aforementioned sentiment score.

A substantial amount of research has been done on sentiment analysis for opinion extraction (Ratmele and Thakur 2021), and these works can be compared in terms of advantages and disadvantages. Do et al. (2019) presented a review of 40 deep learning methods for sentiment analysis, in which they classified deep learning models into CNN, RNN, recursive NN and hybrid approaches and discussed the strengths and weaknesses of each. The advantage of the CNN model is fast computation: it is capable of learning regional patterns from training data and modeling nonlinear dynamics. Its disadvantage is a high demand for data. The advantage of RNNs is that they need less data and fewer parameters; their disadvantage is that they cannot store long-term dependencies and use only the last hidden state to represent the statement, which can result in inaccurate predictions. The recursive neural network has a simple architecture and can readily model tree structures, but it requires a parser to build the tree, which may be slow. It has been stated that RNN-based models outperform CNN-based models and that further exploratory research into recurrent neural networks is needed. Park (2018) investigated the qualities embedded in product reviews for five distinct categories of products and examined how they affect review usefulness. Four data mining algorithms were tested on five real-life review datasets obtained from Amazon.com to see which one best predicts review helpfulness for each product type. The research shows that reviews for different kinds of products have different psychological and linguistic aspects, as well as different factors influencing review usefulness.

Jianqiang et al. (2018) proposed a deep CNN-based approach for sentiment analysis of Twitter data. The approach first integrates word embedding features generated by GloVe on the basis of sentiment polarity. The feature-based training data were processed through a deep CNN, which produced a model file for predicting sentiments from Twitter data. Text preprocessing plays an important role in the analysis of Twitter data, and the authors compared and analyzed various text preprocessing methods in a related work (Jianqiang and Xiaolin 2017). For document classification, Yang et al. (2016) proposed the hierarchical attention network (HAN) model. The model has two key features: (1) a hierarchical structure that mirrors the hierarchy of documents, and (2) two levels of attention mechanisms, applied at the word and sentence levels, allowing it to attend differently to more and less relevant content when building the document representation. The model gradually builds a document vector by aggregating important words into sentence vectors and subsequently aggregating important sentence vectors into document vectors. It is useful for extracting important words and sentences from documents (Table 1).

Table 1 Previous research on opinion extraction and sentiment analysis

3 Proposed work

3.1 OpExHAN framework

In this paper, the OpExHAN framework is proposed for opinion extraction; it is based on word embedding and a hierarchical attention mechanism. Opinions are extracted from an Amazon Smartphone review dataset scraped from amazon.in. The overall OpExHAN framework is shown in Fig. 1, which contains the different phases of opinion extraction. The process starts with collecting a real dataset through web scraping from amazon.in and then preprocessing the data to obtain clean and normalized reviews. GloVe embedding is then applied to extract word vector representations of reviews, which capture contextual information of words. These word vectors are fed into the hierarchical attention network, which produces vectors at the word level and sentence level. Finally, reviews are classified into five classes: extremely positive, positive, extremely negative, negative and neutral. Every phase of this framework is described in detail in the following sections. In Fig. 1, OE represents Opinion Extraction, P, R and F represent Precision, Recall and F-Score, respectively, and Acc represents Accuracy.

Fig. 1
figure 1

Framework for OpExHAN

3.2 Scraping the Amazon’s dataset

Web scraping is performed by extracting data from the amazon.in website, where the search result page is "smartphone". The required URLs are stored in a text file, and the relevant information is then extracted from each URL. The extracted results are organized into columns and saved in CSV (comma-separated) format. The extracted fields are 'mobile name', 'asin number', 'title review', 'user review' and 'star rating score'.
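The scraping step can be sketched as follows with BeautifulSoup (the library named in the experimental setup). The file names, the use of the `requests` library and the CSS selectors for Amazon's review markup are illustrative assumptions; the paper does not specify them, and Amazon's actual page structure may differ.

```python
# Minimal sketch of the scraping step. Selectors and file names are
# hypothetical; only the extracted field names come from the paper.
import csv
import requests
from bs4 import BeautifulSoup

def text_or_empty(node):
    return node.get_text(strip=True) if node else ""

def scrape_reviews(url_file="smartphone_urls.txt", out_file="amazon_reviews.csv"):
    with open(url_file) as f:
        urls = [line.strip() for line in f if line.strip()]

    rows = []
    for url in urls:
        html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}).text
        soup = BeautifulSoup(html, "html.parser")
        name = text_or_empty(soup.select_one("h1#title"))              # hypothetical selector
        asin = url.split("/dp/")[-1].split("/")[0] if "/dp/" in url else ""
        for review in soup.select("div[data-hook='review']"):          # hypothetical selector
            rows.append({
                "mobile name": name,
                "asin number": asin,
                "title review": text_or_empty(review.select_one("a[data-hook='review-title']")),
                "user review": text_or_empty(review.select_one("span[data-hook='review-body']")),
                "star rating score": text_or_empty(review.select_one("i[data-hook='review-star-rating']")),
            })

    with open(out_file, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["mobile name", "asin number",
                                               "title review", "user review", "star rating score"])
        writer.writeheader()
        writer.writerows(rows)
```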

3.3 Preprocessing

Customers write reviews casually on the online platform, so the text contains a lot of noise such as informal words, multilingual words, hyperlinks and insignificant words. This noisy information increases the dimensionality of the problem. The problem is addressed by preprocessing, which is performed on each review and generates a clean review file for the subsequent steps. In the proposed framework, preprocessing consists of the steps listed below (a code sketch of these steps is given after the list).

  • Remove hyperlinks.

  • Remove surplus spaces between words.

  • Remove special characters.

  • Convert informal words into their formal form, e.g., 'I've' to 'I have'.

  • Perform sentence tokenization and word tokenization.

  • Convert to lower case.

  • Remove stop words.

  • Apply Lemmatization (Gupta et al. 2016) on word tokens to identify root words.
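A minimal sketch of this preprocessing pipeline, assuming NLTK (as stated in the experimental setup), is given below. The contraction map and regular expressions are illustrative, and the special-character removal is deferred until after sentence tokenization so that sentence boundaries survive.

```python
# Minimal preprocessing sketch. Requires the NLTK data packages
# 'punkt', 'stopwords' and 'wordnet' to be downloaded once.
import re
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

CONTRACTIONS = {"i've": "i have", "don't": "do not", "can't": "cannot"}  # assumed subset
STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def preprocess_review(text):
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)        # remove hyperlinks
    text = re.sub(r"\s+", " ", text).strip()                   # remove surplus spaces
    text = text.lower()                                        # convert to lower case
    for informal, formal in CONTRACTIONS.items():              # expand informal words
        text = text.replace(informal, formal)
    cleaned = []
    for sentence in sent_tokenize(text):                       # sentence tokenization
        sentence = re.sub(r"[^a-z0-9 ]+", " ", sentence)       # remove special characters
        tokens = word_tokenize(sentence)                       # word tokenization
        tokens = [t for t in tokens if t not in STOP_WORDS]    # remove stop words
        tokens = [LEMMATIZER.lemmatize(t) for t in tokens]     # lemmatization to root words
        cleaned.append(tokens)
    return cleaned  # list of sentences, each a list of word tokens
```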

3.4 Word embedding

Word embedding is an approach for converting words into vector representations. In the proposed method, pretrained GloVe vectors are used to embed words (Pennington et al. 2014). The pretrained GloVe model is trained on a corpus of billions of tokens and has a vocabulary of 400,000 words. Although GloVe offers different embedding sizes such as 50, 100, 200 and 300 dimensions, 100 dimensions are selected in this work. A review is treated as a document, denoted d, with m sentences {s1, s2, s3, ..., sm}, and the length of the ith sentence si is Mi words. The pretrained GloVe model creates a word-level vector representation, denoted wewij, as given in Eq. (1).

$$ \mathbf{w_{e}w_{ij}} = \left\{ w_{i1},\; w_{i2},\; w_{i3},\; \ldots,\; w_{iM_{i}} \right\} $$
(1)
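A common way to realize this embedding step in Keras is to read the pretrained GloVe file into a dictionary and build an embedding matrix indexed by the tokenizer's word index. The sketch below assumes the standard `glove.6B.100d.txt` distribution file and a Keras-style `word_index`; neither name appears in the paper.

```python
# Minimal sketch of building a 100-dimensional GloVe embedding matrix.
import numpy as np

EMBEDDING_DIM = 100

def load_glove(path="glove.6B.100d.txt"):
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return vectors

def build_embedding_matrix(word_index, glove_vectors, max_features=200000):
    num_words = min(max_features, len(word_index) + 1)
    matrix = np.zeros((num_words, EMBEDDING_DIM))
    for word, idx in word_index.items():
        if idx < num_words and word in glove_vectors:
            matrix[idx] = glove_vectors[word]   # words absent from GloVe stay zero
    return matrix
```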

3.5 Hierarchical Attention Network (HAN)

In the proposed work, a hierarchical attention network (HAN) is used for opinion extraction. The HAN model is designed around the hierarchical structure review → sentences → words, where each review is treated as a single document. The model incorporates an attention mechanism that identifies the contextually most important words and sentences in a review. Figure 2 depicts the overall architecture of the hierarchical attention network. The architecture has a word level and a sentence level: the word level includes a word encoder and word attention, and the sentence level includes a sentence encoder and sentence attention. HAN uses gated recurrent units (GRU) for encoding at both the word level and the sentence level. Each component is described in the following subsections. In Fig. 2, ExPos represents Extremely Positive, Pos represents Positive, and Neu, Neg and ExNeg represent Neutral, Negative and Extremely Negative, respectively.

Fig. 2
figure 2

Process of opinion extraction using hierarchical attention network

3.5.1 Bi-GRU

A bidirectional gated recurrent unit (Bi-GRU) is employed in the proposed work to map a sequence of word vectors of the review to opinion classes (Bahdanau et al. 2014). Bi-GRU is a variant of the recurrent neural network (RNN) in which the gates of the GRU address the vanishing gradient problem so that long-distance information can be preserved. A GRU contains two gates: an update gate, denoted \(y_{t}\), and a reset gate, denoted \(rs_{t}\). The reset gate regulates how previous information affects the candidate state \(\widetilde{{hd_{t} }}\), and the update gate controls how much old information is kept and how much new information is inserted. The GRU calculates the new hidden state \(hd_{t}\) at time t as in Eq. (2), a linear interpolation, based on the new sequence information, between the previous state \(hd_{t - 1}\) and the candidate state \(\widetilde{{hd_{t} }}\). The gate \(y_{t}\), given in Eq. (3), controls how much old information is retained and how much new information is added, where \(x_{t}\) denotes the sequence vector at time t.

$$ hd_{t} = \left( {1 - y_{t} } \right) \odot hd_{t - 1} + y_{t} \odot \widetilde{{hd_{t} }} $$
(2)
$$ y_{t} = \sigma \left( {W_{z} x_{t} + U_{z} hd_{t - 1} + b_{z} } \right) $$
(3)

The candidate state \(\widetilde{{hd_{t} }}\) is mentioned in Eq. (4), which is calculated in the same way as a typical recurrent neural network (RNN). When the reset gate \(rs_{t}\) equals to zero, then it fails to remember the previous state, in that case \(rs_{t}\) is updated, mentioned in Eq. (5).

$$ \widetilde{{hd_{t} }} = \tanh \left( {W_{h} x_{t} + rs_{t} \odot \left( {U_{h} hd_{t - 1} } \right) + b_{h} } \right) $$
(4)
$$ rs_{t} = \sigma \left( {W_{r} x_{t} + U_{r} hd_{t - 1} + b_{r} } \right) $$
(5)
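For illustration, Eqs. (2)-(5) can be written directly as a single GRU step in NumPy. The weight names and shapes below are assumptions; in practice the experiments rely on Keras' built-in GRU and Bidirectional layers.

```python
# Minimal NumPy sketch of one GRU step following Eqs. (2)-(5).
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, hd_prev, params):
    Wz, Uz, bz = params["Wz"], params["Uz"], params["bz"]
    Wr, Ur, br = params["Wr"], params["Ur"], params["br"]
    Wh, Uh, bh = params["Wh"], params["Uh"], params["bh"]

    y_t  = sigmoid(Wz @ x_t + Uz @ hd_prev + bz)               # update gate, Eq. (3)
    rs_t = sigmoid(Wr @ x_t + Ur @ hd_prev + br)               # reset gate, Eq. (5)
    hd_cand = np.tanh(Wh @ x_t + rs_t * (Uh @ hd_prev) + bh)   # candidate state, Eq. (4)
    hd_t = (1.0 - y_t) * hd_prev + y_t * hd_cand               # new hidden state, Eq. (2)
    return hd_t
```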

3.5.2 Hierarchical attention mechanism

The HAN architecture consists of two levels, word level and sentence level, where the GRU is applied first at the word level and subsequently at the sentence level. At each level, the model consists of an encoder, which provides informative contexts, and an attention mechanism, which computes the importance weights of these contexts and aggregates them into a single vector.


Word level

At the word level, word tokens, denoted Wim, are passed to GloVe to obtain embeddings wewij, as shown in Eq. (1). Embedding produces vectorized tokens, denoted xim, as shown in Eq. (6), where document d contains M words, wm denotes the mth word in d and xm denotes the mth word vector.

$$ X_{im} = W_{E} W_{im} ,m \in \left[ {1,M} \right] $$
(6)

Word encoder

Vectorized tokens are given as input to a Bi-GRU (Yang et al. 2016) for encoding. A GRU model maintains a hidden state, which can be considered a memory cell that carries information. The Bi-GRU acquires word annotations by summarizing information from both directions and thereby integrates contextual information into the annotation. The GRU reads sentence si from wi1 to wiM in the forward direction, producing \(\overrightarrow{hd_{im}}\), and from wiM to wi1 in the backward direction, producing \(\overleftarrow{hd_{im}}\), as shown in Eqs. (7) and (8), respectively. By concatenating the forward and backward hidden states, an annotation is obtained for a specific word wim, as given in Eq. (9), which summarizes the content of the entire sentence centered around wim. In the implementation, the 100-dimensional embeddings are given as input to the GRU, with 50 dimensions for the forward direction and 50 for the backward direction, because of the bidirectional nature of the model. The output dimensionality of each GRU is therefore 50; running it in both directions returns 100 dimensions, matching the dimensionality of the inputs.

$$ \overrightarrow{hd_{im}} = \overrightarrow{GRU}\left( x_{im} \right),\quad m \in \left[ 1, M \right] $$
(7)
$$ \overleftarrow{hd_{im}} = \overleftarrow{GRU}\left( x_{im} \right),\quad m \in \left[ M, 1 \right] $$
(8)
$$ hd_{im} = \left[ \overrightarrow{hd_{im}},\; \overleftarrow{hd_{im}} \right] $$
(9)

Word attention

Word attention extracts the words that are important to the meaning of the sentence and combines the representations of these relevant words to produce a sentence vector. In the attention mechanism, the word annotation \(hd_{im}\) is fed into a one-layer MLP (multilayer perceptron) (Sukhbaatar et al. 2015) to obtain the hidden representation \(ud_{im}\), shown in Eq. (10). The importance of the word is then calculated as the similarity of \(ud_{im}\) with a word-level context vector \(uv_{w}\), normalized through a softmax function to obtain a significance weight \(\alpha_{im}\), as in Eq. (11). Subsequently the sentence vector si is computed as the weighted sum of the word annotations with these significance weights, as shown in Eq. (12).

$$ ud_{im} = \tanh \left( W_{w} hd_{im} + b_{w} \right) $$
(10)
$$ \alpha_{im} = \frac{\exp \left( ud_{im}^{\top } uv_{w} \right)}{\sum_{m} \exp \left( ud_{im}^{\top } uv_{w} \right)} $$
(11)
$$ s_{i} = \sum_{m} \alpha_{im} hd_{im} $$
(12)
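Eqs. (10)-(12) can be sketched as a custom Keras layer as shown below. The layer and variable names are illustrative and not taken from the paper; the same layer can be reused at the sentence level for Eqs. (16)-(18).

```python
# Minimal Keras sketch of the attention mechanism of Eqs. (10)-(12).
import tensorflow as tf
from tensorflow.keras import layers

class HierAttention(layers.Layer):
    """Collapses a sequence of annotations into a single attended vector."""

    def build(self, input_shape):
        dim = int(input_shape[-1])
        self.W = self.add_weight(name="W", shape=(dim, dim), initializer="glorot_uniform")
        self.b = self.add_weight(name="b", shape=(dim,), initializer="zeros")
        self.u = self.add_weight(name="u", shape=(dim, 1), initializer="glorot_uniform")

    def call(self, hd):                                            # hd: (batch, steps, dim)
        ud = tf.tanh(tf.tensordot(hd, self.W, axes=1) + self.b)    # Eq. (10)
        scores = tf.tensordot(ud, self.u, axes=1)                  # similarity with context vector
        alpha = tf.nn.softmax(scores, axis=1)                      # Eq. (11)
        return tf.reduce_sum(alpha * hd, axis=1)                   # Eq. (12): weighted sum
```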

Sentence level

The same network that is applied at the word level is also applied at the sentence level, but the focus is on sentences rather than words. The sentence vector si has already been generated at the word level, so no embedding step is required at this stage.


Sentence encoder

The sentence vectors si are fed to a Bi-GRU, which encodes the sentences and creates a final document vector for the review; this vector is used as the feature for opinion classification. The Bi-GRU summarizes the contexts of the sentences by traversing the document in the forward and backward directions, as given in Eqs. (13) and (14), respectively. Afterward, \(\overrightarrow{hd_{i}}\) and \(\overleftarrow{hd_{i}}\) are concatenated to obtain an annotation of sentence i, given in Eq. (15), where sentence i is the center point and \(hd_{i}\) summarizes the neighboring sentences around sentence i.

$$ \overrightarrow{hd_{i}} = \overrightarrow{GRU}\left( s_{i} \right),\quad i \in \left[ 1, K \right] $$
(13)
$$ \overleftarrow{hd_{i}} = \overleftarrow{GRU}\left( s_{i} \right),\quad i \in \left[ K, 1 \right] $$
(14)
$$ hd_{i} = \left[ \overrightarrow{hd_{i}},\; \overleftarrow{hd_{i}} \right] $$
(15)

Sentence attention

The attention mechanism is used again to reward sentences that are cues for correctly classifying a document. A sentence-level context vector \(u_{s}\) is introduced and used to quantify the significance of the sentences. This is expressed in Eqs. (16), (17) and (18), where FV is the document vector that summarizes the complete information of the sentences in the review.

$$ u_{i} = \tanh \left( W_{s} hd_{i} + b_{s} \right) $$
(16)
$$ \alpha_{i} = \frac{\exp \left( u_{i}^{\top } u_{s} \right)}{\sum_{i} \exp \left( u_{i}^{\top } u_{s} \right)} $$
(17)
$$ FV = \sum_{i} \alpha_{i} hd_{i} $$
(18)

3.5.3 Opinion extraction

The probability pr that a review belongs to each opinion class is computed from the final vector FV, as given in Eq. (19); FV serves as the feature representation of the document. The negative log-likelihood of the correct labels is used as the training loss, given in Eq. (20), where g is the label of document d.

$$ pr = softmax\left( {W_{c} FV + b_{c} } \right) $$
(19)
$$ loss = - \mathop \sum \limits_{d} \log pr_{dg} $$
(20)
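For illustration, Eqs. (19) and (20) correspond to the following NumPy sketch, where the shapes (a batch of document vectors and five opinion classes) are assumptions consistent with the model description.

```python
# Minimal NumPy sketch of the softmax classifier (Eq. 19) and the
# negative log-likelihood training loss (Eq. 20).
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)       # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def classify_and_loss(FV, Wc, bc, labels):
    """FV: (batch, dim); Wc: (dim, 5); bc: (5,); labels: (batch,) integer class ids."""
    pr = softmax(FV @ Wc + bc)                                       # Eq. (19)
    loss = -np.sum(np.log(pr[np.arange(len(labels)), labels]))       # Eq. (20)
    return pr, loss
```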

3.6 Proposed algorithm

The proposed algorithm for opinion extraction consists of word embedding using GloVe followed by a Bi-GRU with a hierarchical attention mechanism. The inputs to the procedure are the training and test datasets, and the outputs are the opinion classes together with accuracy, precision, recall and F-score. Different hyper-parameter values are set at experimentation time. The deep learning model is built at two levels, first at the word level and then at the sentence level, using one Bi-GRU layer and one hidden layer with a ReLU activation function at each level and one dropout layer after the sentence level. Initially, the embedding layer performs word embedding and creates word vectors of similar words. At the word level, the Bi-GRU is applied to the word vectors, followed by identification of contextual similarity using the attention mechanism to generate the sentence vector. At the sentence level, a time-distributed wrapper applies all word-level layers to each sentence, linking the sentences. The sentence vectors are passed to a Bi-GRU with attention to identify contextually similar sentences. After obtaining the sentence sequences, a dropout layer is applied to prevent overfitting. The final document vector is created, and the opinion classes are extracted at the output layer with a softmax activation function. The model is compiled with the Adam optimizer and categorical cross-entropy loss. After compilation, the model is validated over a number of epochs and batch sizes in terms of accuracy and loss. The performance of the model is then evaluated using precision, recall, accuracy and F-score. A Keras sketch of this architecture is given after the algorithm listing.

figure a
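A minimal Keras sketch of this architecture, under the hyper-parameters stated in the experiments (40 sentences of 50 words per review, 100-dimensional GloVe embeddings, Bi-GRU with 50 units per direction, five opinion classes), is given below. The hidden-layer size, dropout rate and layer names are illustrative assumptions, and `HierAttention` refers to the attention sketch given earlier.

```python
# Minimal Keras sketch of the OpExHAN architecture described above.
from tensorflow.keras import layers, models

MAX_SENTS, MAX_WORDS, EMBEDDING_DIM, NUM_CLASSES = 40, 50, 100, 5

def build_opexhan(embedding_matrix):
    # Word level: embeddings -> Bi-GRU -> ReLU hidden layer -> attention -> sentence vector
    word_input = layers.Input(shape=(MAX_WORDS,), dtype="int32")
    emb = layers.Embedding(embedding_matrix.shape[0], EMBEDDING_DIM,
                           weights=[embedding_matrix], trainable=False)(word_input)
    word_seq = layers.Bidirectional(layers.GRU(50, return_sequences=True))(emb)
    word_seq = layers.Dense(100, activation="relu")(word_seq)
    sent_vector = HierAttention()(word_seq)
    word_encoder = models.Model(word_input, sent_vector)

    # Sentence level: apply the word encoder to every sentence,
    # then Bi-GRU -> ReLU hidden layer -> attention -> dropout -> softmax over classes
    review_input = layers.Input(shape=(MAX_SENTS, MAX_WORDS), dtype="int32")
    sent_seq = layers.TimeDistributed(word_encoder)(review_input)
    sent_seq = layers.Bidirectional(layers.GRU(50, return_sequences=True))(sent_seq)
    sent_seq = layers.Dense(100, activation="relu")(sent_seq)
    doc_vector = HierAttention()(sent_seq)
    doc_vector = layers.Dropout(0.5)(doc_vector)          # dropout rate is illustrative
    output = layers.Dense(NUM_CLASSES, activation="softmax")(doc_vector)
    return models.Model(review_input, output)
```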

4 Experiments and evaluation

4.1 Dataset

The OpExHAN model is applied to the Amazon Smartphone review dataset scraped from amazon.in web pages. 150,000 reviews were scraped, from which 56,000 Smartphone reviews remained after preprocessing. Of these 56,000 reviews, 80% (44,800 reviews) are used for training, 10% (5,600 reviews) for validation and the remaining 10% (5,600 reviews) for testing. The maximum number of word features used in the experiments is 200,000, the maximum number of sentences per review is 40, the maximum number of words per sentence is 50 and the embedding dimension is 100.
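The 80/10/10 split can be sketched as follows; the use of scikit-learn's `train_test_split` and the random seed are assumptions, since the paper does not name the splitting utility.

```python
# Minimal sketch of the 80/10/10 train/validation/test split of the
# 56,000 preprocessed reviews.
from sklearn.model_selection import train_test_split

def split_dataset(X, y, seed=42):
    # 80% train, 20% held out
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, test_size=0.20, random_state=seed)
    # split the held-out 20% evenly into validation and test (10% each)
    X_val, X_test, y_val, y_test = train_test_split(
        X_rest, y_rest, test_size=0.50, random_state=seed)
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)
```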

4.2 Evaluation metrics

The performance of the OpExHAN model is evaluated using accuracy, precision, recall and F-score. Recall is the ratio of the number of correctly predicted opinion classes to the total number of reference opinion classes. Precision is the ratio of the number of correctly predicted opinion classes to the number of all predicted opinion classes. The F1 score is the harmonic mean of recall and precision. Recall, precision and F-score are given in Eqs. (21), (22) and (23), respectively, where Sreference denotes the reference opinion classes and Spredicted the system-generated opinion classes. Precision and recall must be computed before the F-measure.

$$ R = \frac{\left| S_{reference} \cap S_{predicted} \right|}{\left| S_{reference} \right|} $$
(21)
$$ P = \frac{\left| S_{reference} \cap S_{predicted} \right|}{\left| S_{predicted} \right|} $$
(22)
$$ F = \frac{2*P*R}{{P + R}} $$
(23)
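These measures can be computed with scikit-learn as sketched below; the use of scikit-learn and macro averaging over the five opinion classes are assumptions, since the paper does not specify the metrics implementation.

```python
# Minimal sketch of the evaluation measures (Eqs. 21-23 plus accuracy).
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def evaluate(y_true, y_pred):
    acc = accuracy_score(y_true, y_pred)
    precision, recall, f_score, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0)
    return {"accuracy": acc, "precision": precision, "recall": recall, "f_score": f_score}
```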

4.3 Experimental setup

The proposed model is implemented in Python using the high-level neural network API Keras for deep learning, NLTK for preprocessing and BeautifulSoup for web scraping. Keras allows the model to be built and trained quickly and efficiently. Table 2 presents the candidate values of the various hyper-parameters together with the best value selected for the final result. The model is evaluated on the validation dataset to tune hyper-parameters such as batch size, number of epochs and dropout rate, as listed in Table 2. Hyper-parameters are tuned using the Adam optimizer and categorical cross-entropy as the loss function, with the class labels one-hot encoded. The OpExHAN model is then run on the test dataset using the selected best hyper-parameter values and achieves its best results.
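Compiling and training the model with the selected hyper-parameters (Adam optimizer, categorical cross-entropy, batch size 16, 25 epochs) can be sketched as follows; `build_opexhan` and the data arrays refer to the earlier sketches.

```python
# Minimal sketch of compiling and training with the selected hyper-parameters.
from tensorflow.keras.utils import to_categorical

model = build_opexhan(embedding_matrix)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

history = model.fit(
    X_train, to_categorical(y_train, num_classes=5),
    validation_data=(X_val, to_categorical(y_val, num_classes=5)),
    batch_size=16, epochs=25)

test_loss, test_acc = model.evaluate(X_test, to_categorical(y_test, num_classes=5))
```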

Table 2 Hyper-parameter setup

5 Results and discussion

The OpExHAN model is trained on the Amazon Smartphone dataset and its performance is compared with state-of-the-art methods and models on the same dataset. The performance of the proposed model is also compared across the different hyper-parameters listed in Table 2. Performance comparisons for different batch sizes of OpExHAN are presented in Table 3. It is observed that small batch sizes give the best training consistency and generalization performance. A batch size of one requires a very large number of updates and is too expensive and time-consuming, while a very large batch size works inefficiently because the gradient noise of the samples is averaged out. With a large batch size, the noise is smaller and training accuracy may improve, but validation and test accuracy can be lower because of worse generalization and a lost regularizing effect. Smaller batch sizes provide more weight updates, and therefore more information, during training, which helps generate effective results and reduce noise. The OpExHAN model gave the best result with batch size 16, where performance is outstanding in terms of accuracy, precision, recall and F-score, whereas with batch size 128 the results are poor. From Table 3, it can be seen that the optimal batch size for the proposed model is 16.

Table 3 Performance comparisons of different batch sizes of OpExHAN with epoch 25 in terms of evaluation measures

The model is also evaluated with different numbers of epochs, namely 15, 25 and 50, for batch sizes 16 and 32, and the results are shown in Table 4. The model obtains its best performance at epoch 25 with a batch size of 16. At epoch 50, accuracy is only 1.48% lower than at epoch 25, so no further experiments were performed beyond epoch 50. A higher number of epochs does not always mean more accurate results: additional epochs can improve accuracy up to a certain point, after which the model can overfit, while an extremely low number of epochs leads to underfitting. When the batch size is large, the number of weight updates per epoch is lower than when the batch size is small. With a small batch size, many weight updates are performed per epoch because each loss value is calculated from fewer data points, and there are more batches per epoch, since an epoch requires passing through all training data points in the dataset. As a result, for the same number of epochs, the smaller the batch size, the greater the accuracy.

Table 4 Performance comparisons at different epochs with batch size 16 and 32 in terms of evaluation measures

The accuracy and loss of the model on the training and validation data can also be examined. Model accuracy and model loss for batch size 16 are shown in Figs. 3 and 4, respectively, and for batch size 32 in Figs. 5 and 6. Overfitting can be detected from the difference between training and validation accuracy: the larger the gap, the higher the overfitting. The proposed model performs best at epoch 25 with batch size 16, as shown in Figs. 7 and 8. The optimal accuracy is 94.69% and the loss is 30.25%. The experiments also indicate that low accuracy with a large loss occurs when many errors are made on large amounts of data, low accuracy with a low loss occurs when a few errors are made on large amounts of data, and good accuracy with a low loss occurs when a few errors are made on a small amount of data, which can be considered the ideal case.

Fig. 3
figure 3

Model accuracy at epoch 25 with batch size 16

Fig. 4
figure 4

Model loss at epoch 25 with batch size 16

Fig. 5
figure 5

Model accuracy at epoch 25 with batch size 32

Fig. 6
figure 6

Model loss at epoch 25 with batch size 32

Fig. 7
figure 7

Comparison of model accuracy and loss at epoch 15, 25, 50

Fig. 8
figure 8

Comparison of model accuracy and loss at batch sizes 8, 16, 32, 64 and 128

The performance of the OpExHAN model is compared with existing methods proposed by different authors. These existing methods are first applied to the scraped dataset described in this paper and then compared with the proposed algorithm; the comparison is shown in Fig. 9. Extensive research has been done in the field of review opinion classification, but in this paper only the methods and models that performed best in previous research are selected for comparison. Table 5 presents the performance comparison in terms of precision, recall, F-score and accuracy against recent existing opinion extraction methods on the same scraped Amazon review dataset. The OpExHAN model, which uses hierarchical attention with Bi-GRU and GloVe-based word embedding, outperforms the state-of-the-art methods; its accuracy is 94.6%, which is better than the other methods. Alharbi et al. (2021) proposed GloVe with GRU and GloVe with LSTM approaches on an Amazon review dataset, but their performance is inadequate. Another method, LSTM with word2vec (Zhang et al. 2021), achieves better accuracy when applied to the scraped dataset, but its precision and recall are unsatisfactory. LSTM is applied with both GloVe and word2vec, and GloVe performs better than word2vec on the Amazon Smartphone review dataset. Two different methods combined sequentially in a single model are also applied to the same dataset (Shrestha and Nasoz 2019), where GRU is used for feature extraction and SVM for classification of opinions; the performance of this model is also unsatisfactory. In the proposed model, the use of the attention mechanism with a hierarchical approach improves the performance of the GRU. Reviews are analyzed first at the word level and then at the sentence level, so contextual information is captured properly at both levels. The Bi-GRU with attention is better able to capture long-term dependencies between explicit words by using the contextual information of previous and following words. In addition, lemmatization during preprocessing yields root words, which helps GloVe identify contextually similar words; this also contributes to the excellent performance.

Fig. 9
figure 9

Comparison of OpExHAN model with baseline existing models

Table 5 Performance comparison of opinion extraction against existing methods on scrapped Amazon Reviews Dataset

The precision and recall of the proposed model are both around 91%, which is an admirable value, but an 8–9% improvement is still possible. This deficiency is caused by several issues identified during error analysis. The OpExHAN model is unable to identify valid Smartphone features when there is co-referencing between sentences of a review, which degrades the recall value. A few reviews contain multiple categories of product features with their polarities in a single sentence; these are missed and also affect the recall value, although such reviews are rare in the Amazon dataset, so performance is not greatly affected. The precision value is lowered because the model sometimes identifies terms as the polarity of Smartphone features even though they are contextually irrelevant in the sentence. These issues could be resolved by using product feature-based opinion prediction. Grammar and spell checking should also be included in the proposed model to improve performance, and comparative opinions are not identified in this research. A few examples of reviews from the Amazon review dataset that degrade performance are given below.

  • Review 1: "Samsung Galaxy M02 (Blue, 3 GB RAM, 32 GB Storage)", Undoubtedly- High Specs. at such Low and Very Affordable Cost. Full Value for Money. Loving it…, Battery Life: Considerably Good…Value for Money: 100% Paisa Vasool…Sheerness: Considerably High Specs. at such Low and Affordable Cost…Improvement Suggestion: Charger could have been C-Type. This may be considered for Updated Version or New Production., Extremely Positive.

  • ISSUE: Multilingual content that is not handled by GloVe. The GloVe dictionary contains only English words; in the above review, the term "Paisa Vasool" is Hindi (roughly "value for money") and is therefore not detected by GloVe.

  • Review 2: Dont expect much from this phone.," Phone is only for Calling, WhatsApp and Youtube. I gifted this phone to my mother, so as per her usage i felt its okay. Worst part is the camera…. I even had Redmi Note 4…. I seriously felt Note 4 camera is much better that redmi 9.I got this phone in 7500/-", Negative.

  • ISSUE: One part of the review shows positive polarity and another shows negative polarity, so the overall polarity should be neutral, but the review is labeled as negative. There is a mismatch between polarities, and the review is ambiguous.

  • Review 3: front camera, use front cem I am not happy, Neutral.

  • ISSUE: Camera is written as 'cem', which is not identified correctly by the proposed method.

6 Conclusion and future work

In this paper, a HAN-based Bi-GRU approach is used to extract opinions from a large Amazon Smartphone review dataset. The OpExHAN model uses hierarchical attention with Bi-GRU and GloVe-based word embedding and outperforms state-of-the-art methods. The performance of the proposed method is compared with several baseline existing methods through various experiments on the Amazon dataset, and its effectiveness is illustrated through several evaluation measures. The precision and recall of the proposed model are both around 91% and the accuracy is 94.6%, which are admirable numbers. Future research will focus on the extraction of aspect-based opinions, in order to support purchasing decisions with product feature-based opinion prediction. The OpExHAN model is unable to resolve some issues, which can be targeted in future research:

  1. Co-referencing between sentences of the review.

  2. Multiple categories of products, with their features, mentioned in a single sentence.

  3. Pruning of contextually irrelevant features and words from reviews.

  4. Grammar and spell checking before lemmatization and word embedding.

  5. Comparative opinion identification in a review.

  6. Multilingual review analysis for correct opinion identification.