
1 Introduction

To achieve aspect-level analysis on product or service reviews, the first task is aspect mining (or aspect extraction), which aims to extract aspect terms from a review, e.g., “operating system” and “preloaded software” from the laptop review “I love the operating system and preloaded software”. Existing aspect mining techniques can be divided into three categories, namely unsupervised, supervised, and semi-supervised.

Unsupervised learning models based on Latent Dirichlet Allocation (LDA) [13, 36] and word embeddings [9] do not need labeled reviews. However, it is hard to constrain a fully unsupervised model so that it focuses only on the aspects of interest. Supervised sequential learning methods such as Hidden Markov Models (HMM) [12] and Conditional Random Fields (CRF) [11, 29] have been applied to extract aspects, as the task can be formulated as a sequence labeling problem. Some supervised deep learning models [17, 25, 31, 32] achieve better performance than previous works by introducing additional supervision from lexicons and other hand-crafted features. However, we argue that automated feature learning is always preferable. Moreover, because the manual annotation of training data is usually very costly, especially for domain-dependent aspects (i.e., different domains may have different aspect spaces), researchers are motivated to develop more effective semi-supervised learning models for aspect mining.

Semi-supervised approaches follow two directions: one is to guide unsupervised models by encoding prior domain knowledge [2, 3, 15, 20], and the other is to enhance supervised models with unlabeled reviews in the corresponding domains [33]. The latter approach outperforms the former as it benefits from both labeled and unlabeled reviews. However, the existing model [33] is trained in two separate phases: pre-training on unlabeled reviews in the corresponding domain, and then supervised learning on labeled reviews. The representations (or embeddings) learned in the pre-training phase do not take advantage of labeled reviews, i.e., they are domain-specific but task-free. This motivates us to ask whether we can learn task- and domain-specific representations from both labeled and unlabeled reviews at the same time and perform aspect mining in an end-to-end architecture.

In this paper, we propose a new semi-supervised End-to-end MOVing-window Attentive framework (called EMOVA) to enhance aspect mining on customer reviews. Instead of separate pre-training and supervised learning phases, EMOVA alternately learns from a mini-batch of labeled reviews and a mini-batch of unlabeled reviews from the same domain based on Cross-View Training (CVT) [5]. Specifically, EMOVA derives the representations of reviews with two Bidirectional Long Short-Term Memory (BiLSTM) [8] layers, guided by two important observations on reviews: (1) customer reviews often contain misspelled words; (2) multiple aspects may appear in one sentence in a coordinate structure. EMOVA therefore derives char-features from words as extra embeddings, because general pre-trained word embeddings (e.g., GloVe [21]) may not cover all misspelled words. Moreover, the nearby preceding words provide useful semantic clues for finding new aspects. For instance, in a coordinate structure, the previous aspect (e.g., “operating system”) should carry more weight than other words in guiding the extraction of subsequent aspects (e.g., “preloaded software”). To capture this contextual significance, EMOVA employs an attention mechanism to encode the information within a moving window.

In general, the contributions of this paper can be summarized as follows.

  • We are the first to propose a semi-supervised deep learning framework for aspect mining, which introduces CVT to use unlabeled reviews to improve the representation learning within a unified end-to-end architecture.

  • We make the first attempt to develop a moving-window attention mechanism on top of two BiLSTM layers to capture significant nearby past information for aspect prediction.

  • We conduct extensive experiments to evaluate the performance of EMOVA based on four real-world review datasets. Experimental results show that EMOVA performs better than the state-of-the-art techniques.

The remainder of this paper is organized as follows. Section 2 discusses related work. Then, we present our EMOVA framework in Sect. 3. Section 4 shows the experimental results. Finally, Sect. 5 concludes this paper.

2 Related Work

2.1 Aspect Mining as Sequence Labeling

Sequence labeling is a very common problem in natural language processing (e.g., part-of-speech tagging and named-entity recognition) and aims to assign a label to each element in a sequential input. The aspect mining task can be formulated as a sequence labeling problem, in which a label (aspect or not) is given to each word in the review. Formally, the problem can be described as predicting a label sequence \(\{y_{1}...y_{n}\}\) for a given word sequence \(\{x_{1}...x_{n}\}\), where \(y_{i}\in \{ASPECT,~NONASPECT\}\). For instance, the work in [12] defines a set of labels to distinguish feature aspects, component aspects, and function aspects, and trains an HMM to label each word in the review. The authors of [11] simplify these labels and apply the \(\{B, I, O\}\) scheme, where B identifies the beginning of an aspect, I the continuation of an aspect, and O other words. The \(\{B, I, O\}\) scheme handles aspects expressed as phrases well and has been applied to aspect mining [17, 33] and aspect-opinion term co-extraction [31, 32]. Our EMOVA also uses the \(\{B, I, O\}\) labeling scheme.

2.2 Semi-supervised Approaches

Our EMOVA framework relates to semi-supervised models for aspect mining. Most existing methods use prior knowledge to guide an unsupervised topic model. For instance, some methods manually choose domain-specific seed words [15, 20] or must and cannot sets [3] for topic modeling. By introducing lifelong topic modeling [2], researchers propose a continual modeling system that automatically mines knowledge from previous results to supervise subsequent tasks. However, these methods often need manually defined domain knowledge and do not fully use existing labeled reviews. The other direction of semi-supervised learning is to take advantage of unlabeled reviews in the same domain to improve a supervised model. The idea of pre-training has been applied in the aspect mining model [33] to learn domain-specific word embeddings from unlabeled reviews in advance; these word embeddings represent the domain better than general word embeddings and are fed into normal supervised models. However, such pre-trained domain-specific representations are still not specific enough for the aspect mining task. In contrast, our EMOVA framework learns both task- and domain-specific representations of reviews in a unified framework, which then enhances aspect mining.

2.3 Cross-View Training

Normally, a deep learning model works best when trained on a large amount of data with reliable labels. However, for domain (or even entity) dependent aspects, manual annotation could be a huge investment. One solution is to apply effective semi-supervised learning to leverage unlabeled reviews. Current semi-supervised deep learning models separate the training process into two phases: pre-training and supervised learning. A key disadvantage of such models is that the first phase on representation learning does not benefit from labeled reviews.

Cross-View Training (CVT) [5] performs semi-supervised learning by alternating the training process between labeled data and unlabeled data. It restricts the views on the input data while training on unlabeled examples. Through auxiliary prediction modules, CVT improves the representation learning of the supervised model. The idea of CVT is as follows: (1) a primary prediction module is trained with standard supervised learning on labeled examples; (2) on unlabeled examples, a number of auxiliary prediction modules with different views on the input data are trained to agree with the primary prediction module; (3) by alternately training on labeled data and unlabeled data, both the representation learning and the prediction modules improve. Our EMOVA framework is based on the idea of CVT but adds a task-specific architecture (i.e., moving-window attentions on two BiLSTM layers) for aspect mining.

3 The Framework EMOVA

In this section, we present our semi-supervised deep learning framework for aspect mining. First, we formulate the aspect mining task into a sequence labeling problem. Then, we present the technical details of the four key components in EMOVA. The architecture of our EMOVA framework is shown in Fig. 1.

Fig. 1. The architecture of our EMOVA framework.

3.1 Problem Statement

Suppose we have a set of labeled (\(D_l\)) and unlabeled (\(D_u\)) reviews for an entity. The aspect mining task is to learn a classifier from both \(D_l\) and \(D_u\) to extract a set of aspects for the entity. This task can be formulated as a sequence labeling problem by using the \(\{B, I, O\}\) scheme, where B, I, and O indicate the beginning of an aspect, the continuation of an aspect, and a word outside any aspect, respectively (refer to Sect. 2.1). Each word \(x_i\) in the sentence \(X=\{x_1,...,x_T\}\) is assigned one of \(\{B, I, O\}\). For instance, the input sentence “I love the operating system and preloaded software” has the label sequence \(\{O,O,O,B,I,O,B,I\}\).
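To make the scheme concrete, the following minimal Python snippet (our illustration, not part of EMOVA) pairs the example sentence with its label sequence and recovers the aspect phrases from the \(\{B, I, O\}\) labels:

```python
tokens = ["I", "love", "the", "operating", "system", "and", "preloaded", "software"]
labels = ["O", "O",    "O",   "B",         "I",      "O",   "B",         "I"]

def extract_aspects(tokens, labels):
    """Recover aspect phrases from a {B, I, O} label sequence."""
    aspects, current = [], []
    for token, label in zip(tokens, labels):
        if label == "B":                 # a new aspect begins
            if current:
                aspects.append(" ".join(current))
            current = [token]
        elif label == "I" and current:   # continuation of the current aspect
            current.append(token)
        else:                            # "O": flush any open aspect
            if current:
                aspects.append(" ".join(current))
            current = []
    if current:
        aspects.append(" ".join(current))
    return aspects

print(extract_aspects(tokens, labels))   # ['operating system', 'preloaded software']
```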

3.2 Representation Learning

As recurrent neural networks can naturally represent sequential information, our framework employs BiLSTM [8] to build the memory of contextualized representations for sequence labeling in the aspect mining task. Because combining general embeddings and char-features helps to handle misspelled words [19], we represent each word in the input sequence as the concatenation of an embedding vector and the char-feature produced by a character-level Convolutional Neural Network (CNN). This concatenated vector is fed into two BiLSTM layers, a depth that often achieves the best performance in building such memories for sequential tasks [26]. Let \(V=\{v_1,...,v_T\}\) be the concatenated vectors of words. Their hidden representations are derived by concatenating the outputs of the forward \(\overrightarrow{LSTM}\) and backward \(\overleftarrow{LSTM}\) as follows:

$$\begin{aligned} h_t^1=[\overrightarrow{LSTM}(v_t)\oplus \overleftarrow{LSTM}(v_t)], t\in [1,T],~\text {and} \end{aligned}$$
(1)
$$\begin{aligned} h_t^2=[\overrightarrow{LSTM}(h_t^1)\oplus \overleftarrow{LSTM}(h_t^1)], t\in [1,T], \end{aligned}$$
(2)

in which \(\oplus \) denotes the concatenation operation, \(h_t^1\) is the hidden representation from the first BiLSTM layer at time step t, and \(h_t^2\) is that from the second layer.
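The following PyTorch sketch illustrates how this representation layer could be assembled. The module names, the char-CNN design (one convolution with max-pooling over characters), and the half-sized LSTM directions are our assumptions for illustration, not the authors' exact implementation:

```python
import torch
import torch.nn as nn

class EMOVAEncoder(nn.Module):
    """A sketch of the representation layers (Eqs. 1-2); design details
    beyond the text (e.g., the char-CNN shape) are assumptions."""
    def __init__(self, vocab_size, char_vocab_size,
                 word_dim=300, char_dim=50, hidden_dim=300):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)   # e.g., GloVe-initialized [21]
        self.char_emb = nn.Embedding(char_vocab_size, char_dim)
        # Character-level CNN producing one char-feature vector per word.
        self.char_cnn = nn.Conv1d(char_dim, char_dim, kernel_size=3, padding=1)
        self.bilstm1 = nn.LSTM(word_dim + char_dim, hidden_dim // 2,
                               bidirectional=True, batch_first=True)
        self.bilstm2 = nn.LSTM(hidden_dim, hidden_dim // 2,
                               bidirectional=True, batch_first=True)

    def forward(self, word_ids, char_ids):
        # word_ids: (batch, T); char_ids: (batch, T, max_word_len)
        B, T, L = char_ids.shape
        chars = self.char_emb(char_ids).view(B * T, L, -1).transpose(1, 2)
        char_feat = self.char_cnn(chars).max(dim=2).values.view(B, T, -1)
        v = torch.cat([self.word_emb(word_ids), char_feat], dim=-1)  # input vectors V
        h1, _ = self.bilstm1(v)    # h^1: forward/backward states concatenated, Eq. (1)
        h2, _ = self.bilstm2(h1)   # h^2, Eq. (2)
        return h1, h2
```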

3.3 Moving-Window Attention

In the aspect labeling task, the information from nearby past steps provides useful clues for a prediction, e.g., the label “I” cannot follow “O”, and previous aspects can guide the extraction of subsequent aspects. To capture this important nearby past information, our framework adds a moving-window attention component [16] after the two-layer BiLSTM network, as attention mechanisms have become essential components for modeling the significance and dependencies of sequential terms in various tasks [30]. Specifically, the moving-window attention caches only the most recent \(N_A\) hidden states. At step t, we calculate the normalized significance score \(s_i^t\) of each cached state \(h_i^2\) (\(i\in [t-N_A,t-1]\)) as follows:

$$\begin{aligned} s_i^t=Softmax(U^A\cdot tanh(W_1^Ah_i^2+W_2^Ah_t^2+W_3^Ah_i^A)), \end{aligned}$$
(3)

where tanh is the activation function, \(h_i^2\) and \(h_t^2\) denote a cached past state and the current state from the second BiLSTM layer, respectively, and \(h_i^A\) denotes a previous attentive representation in the moving window. \(U^A\), \(W_1^A\), \(W_2^A\), and \(W_3^A\) are the model parameters.

To calculate the current moving-window attentive aspect representation \(h_t^A\) at step t, our framework computes the weighted sum of the cached previous moving-window attentive aspect representations \(h_i^A\) with the weights \(s_i^t\), applies the ReLU activation function, and adds the result to the current state \(h_t^2\), given by

$$\begin{aligned} h_t^A=h_t^2+ReLU(\sum \nolimits _{i=t-N_A}^{t-1}s_i^t\times h_i^A). \end{aligned}$$
(4)
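A possible realization of Eqs. 3 and 4 in PyTorch is sketched below; the per-step loop and the handling of early steps (where the cache holds fewer than \(N_A\) states) are our assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MovingWindowAttention(nn.Module):
    """A sketch of Eqs. (3)-(4): attention over the last N_A cached states."""
    def __init__(self, hidden_dim, window=5):
        super().__init__()
        self.window = window
        self.W1 = nn.Linear(hidden_dim, hidden_dim, bias=False)  # W_1^A
        self.W2 = nn.Linear(hidden_dim, hidden_dim, bias=False)  # W_2^A
        self.W3 = nn.Linear(hidden_dim, hidden_dim, bias=False)  # W_3^A
        self.U = nn.Linear(hidden_dim, 1, bias=False)            # U^A

    def forward(self, h2):                      # h2: (batch, T, hidden_dim)
        B, T, H = h2.shape
        h_att = []                              # attentive representations h^A
        for t in range(T):
            lo = max(0, t - self.window)
            if lo == t:                         # empty cache at the first step
                h_att.append(h2[:, t])
                continue
            cached_h2 = h2[:, lo:t]                       # h_i^2 in the window
            cached_hA = torch.stack(h_att[lo:t], dim=1)   # h_i^A in the window
            e = self.U(torch.tanh(self.W1(cached_h2)
                                  + self.W2(h2[:, t]).unsqueeze(1)
                                  + self.W3(cached_hA))).squeeze(-1)
            s = F.softmax(e, dim=1)                       # Eq. (3)
            ctx = (s.unsqueeze(-1) * cached_hA).sum(dim=1)
            h_att.append(h2[:, t] + F.relu(ctx))          # Eq. (4)
        return torch.stack(h_att, dim=1)        # (batch, T, hidden_dim)
```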

3.4 Prediction Modules

In our framework, CVT trains on labeled data with a primary prediction module. Suppose \(y_t\) is the label for the word \(x_t\in X\). The primary prediction module determines the probability distribution \(p(y_t|x_t)\) over labels from the outputs of the first BiLSTM layer (\(h_t^1\)) and the moving-window attention layer (\(h_t^A\)) with a simple one-hidden-layer neural network (denoted by nn), given by

$$\begin{aligned} p(y_t|x_t)=nn(h_t^1\oplus h_t^A)=Softmax(U^P\cdot ReLU(W^P(h_t^1\oplus h_t^A))+b), \end{aligned}$$
(5)

where \(U^P\), \(W^P\), and b are the model parameters.
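A minimal PyTorch sketch of this primary module, assuming the three \(\{B, I, O\}\) labels and returning log-probabilities for training convenience:

```python
import torch
import torch.nn as nn

class PrimaryPrediction(nn.Module):
    """A minimal sketch of Eq. (5) over [h^1 (+) h^A]; the choice of
    log-probabilities as output is ours, for loss computation."""
    def __init__(self, hidden_dim, num_labels=3):
        super().__init__()
        self.W = nn.Linear(2 * hidden_dim, hidden_dim, bias=False)  # W^P
        self.U = nn.Linear(hidden_dim, num_labels)                  # U^P (+ b)

    def forward(self, h1, hA):            # each: (batch, T, hidden_dim)
        z = torch.relu(self.W(torch.cat([h1, hA], dim=-1)))
        return torch.log_softmax(self.U(z), dim=-1)
```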

Further, the proposed framework shares the first BiLSTM layer with the auxiliary prediction modules, which have restricted views of unlabeled reviews. There are four different auxiliary prediction modules (\(p_{\text {left}}\), \(p_{\text {fwd}}\), \(p_{\text {bwd}}\), and \(p_{\text {right}}\)) in the framework: for the prediction at the current word, \(p_{\text {left}}\) views only the past words to the left of the current word in the sentence; \(p_{\text {fwd}}\) views the past and current words; \(p_{\text {bwd}}\) views the current and future words; and \(p_{\text {right}}\) views only the future words to the right, as shown in Fig. 1. BiLSTM can easily provide these restricted views without additional computation as follows:

$$\begin{aligned} p_{\text {left}}(y_t|x_t)=nn_{\text {left}}(\overrightarrow{h}_{t-1}^1),~p_{\text {fwd}}(y_t|x_t)=nn_{\text {fwd}}(\overrightarrow{h}_t^1),~p_{\text {bwd}}(y_t|x_t)=nn_{\text {bwd}}(\overleftarrow{h}_t^1),~p_{\text {right}}(y_t|x_t)=nn_{\text {right}}(\overleftarrow{h}_{t+1}^1), \end{aligned}$$
(6)

where \(nn_{\text {left}}\), \(nn_{\text {fwd}}\), \(nn_{\text {bwd}}\), and \(nn_{\text {right}}\) denote neural networks with the same structure as in Eq. 5. Since the second BiLSTM layer has already seen all words, we feed only the hidden representations \(\overrightarrow{h}^1\) and \(\overleftarrow{h}^1\) from the first BiLSTM layer to the auxiliary prediction modules, in order to restrict their views on the input sequence.
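The restricted views can be obtained by splitting the concatenated first-layer states into their forward and backward halves and shifting them by one step where needed; the following sketch is our reading of this setup, following CVT [5]:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AuxiliaryPredictions(nn.Module):
    """A sketch of the four restricted-view modules in Eq. (6)."""
    def __init__(self, hidden_dim, num_labels=3):
        super().__init__()
        d = hidden_dim // 2   # size of one LSTM direction
        self.heads = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(d, d), nn.ReLU(),
                                nn.Linear(d, num_labels))
            for name in ("left", "fwd", "bwd", "right")})

    def forward(self, h1):                        # h1: (batch, T, hidden_dim)
        d = h1.size(-1) // 2
        fwd, bwd = h1[..., :d], h1[..., d:]       # forward / backward halves
        views = {
            "fwd": fwd,                                  # past + current words
            "bwd": bwd,                                  # current + future words
            "left": F.pad(fwd, (0, 0, 1, 0))[:, :-1],    # strictly past words
            "right": F.pad(bwd, (0, 0, 0, 1))[:, 1:],    # strictly future words
        }
        return {k: torch.log_softmax(self.heads[k](v), dim=-1)
                for k, v in views.items()}
```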

3.5 Cross-View Training

The key idea of CVT is to use unlabeled reviews from the same domain as the labeled reviews to enhance the representation learning. During CVT, the model alternately learns from a mini-batch of labeled reviews and a mini-batch of unlabeled reviews.

For the labeled reviews \(D_l\), the Cross-Entropy (CE) loss is utilized to train the primary prediction module \(p(y_t|x_t)\):

$$\begin{aligned} L_{\text {SUP}}=\frac{1}{|D_l|}\sum \nolimits _{x_t,y_t\in D_l}CE(y_t,p(y_t|x_t)). \end{aligned}$$
(7)

For the unlabeled reviews \(D_u\), the framework first infers \(p(y_i|x_i)\) (\(x_i \in D_u\)) based on the primary prediction module and then trains the auxiliary prediction modules to match the primary prediction module by using the Kullback-Leibler (KL) divergence function as the loss:

$$\begin{aligned} L_{\text {CVT}}=\frac{1}{|D_u|}\sum \nolimits _{x_i\in D_u}\sum \nolimits _{j}KL(p(y_i|x_i),p_j(y_i|x_i)), \end{aligned}$$
(8)

where \(j\in \{\text {left},\text {fwd},\text {bwd},\text {right}\}\) and the parameters of the primary prediction module are fixed during this step. The auxiliary prediction modules enhance the shared representations, because new terms that do not appear in labeled reviews may be encoded into the model and become useful for predicting new aspects.

Further, we combine the supervised and CVT losses and minimize the total loss L with stochastic gradient descent:

$$\begin{aligned} L=L_{\text {SUP}}+L_{\text {CVT}}. \end{aligned}$$
(9)

In particular, we alternately minimize \(L_{\text {SUP}}\) over a mini-batch of labeled reviews and \(L_{\text {CVT}}\) over a mini-batch of unlabeled reviews.
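The following PyTorch sketch outlines one alternation of this training procedure; the batch format and the model interface (returning primary and auxiliary log-probabilities) are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def cvt_step(model, labeled_batch, unlabeled_batch, optimizer):
    """One alternation of CVT training (Eqs. 7-9); hypothetical interface."""
    # (1) Supervised loss on a mini-batch of labeled reviews, Eq. (7).
    words, chars, gold = labeled_batch           # gold: (batch, T) label ids
    log_p, _ = model(words, chars)               # primary log-probs: (batch, T, 3)
    loss_sup = F.nll_loss(log_p.transpose(1, 2), gold)
    optimizer.zero_grad()
    loss_sup.backward()
    optimizer.step()

    # (2) CVT loss on a mini-batch of unlabeled reviews, Eq. (8). Detaching the
    # primary prediction fixes it as the teacher; gradients reach the shared
    # first BiLSTM layer only through the auxiliary modules.
    words, chars = unlabeled_batch
    log_p, aux_log_p = model(words, chars)
    target = log_p.exp().detach()
    loss_cvt = sum(F.kl_div(aux, target, reduction="batchmean")
                   for aux in aux_log_p.values())
    optimizer.zero_grad()
    loss_cvt.backward()
    optimizer.step()
    return loss_sup.item(), loss_cvt.item()
```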

4 Experiments

In this section, we evaluate the performance of our proposed EMOVA framework and compare it with the state-of-the-art supervised and semi-supervised approaches.

4.1 Experimental Settings

Datasets: We conduct experiments over four benchmark datasets from the SemEval workshops [22,23,24]. Table 1 shows their statistics. \(D_{\text {laptop}1}\) and \(D_{\text {laptop}2}\) contain reviews of the laptop domain, while \(D_{\text {rest}1}\) and \(D_{\text {rest}2}\) are for the restaurant domain. In these datasets, aspect words have been labeled by the task organizers.

Table 1. Statistics of datasets.

The framework EMOVA needs unlabeled reviews for CVT. We collect unlabeled reviews corresponding to the four labeled training datasets to train the model, including laptop reviews from the Amazon Review Dataset (230,373 sentences) [10] and restaurant reviews from the Yelp Review Dataset (2,677,025 sentences) [34]. For comparison, we also train the model on a general unlabeled dataset (the One Billion Word Language Model Benchmark) [1] to see whether performing CVT on general texts can improve the supervised model for aspect mining. As some sentences in the testing datasets may also appear in the unlabeled reviews, we remove these sentences from the unlabeled reviews to make the comparison fair.

Baselines: We compare our EMOVA with four groups of baselines. The first group is the winner of each dataset in the SemEval workshops, including IHS_RD [4] (\(D_{\text {laptop}1}\) winner), DLIREC [29] (\(D_{\text {laptop}2}\) winner), EliXa [27] (\(D_{\text {rest}1}\) winner), and NLANGP [28] (\(D_{\text {rest}2}\) winner). The second group is traditional supervised models including:

  • CRF [14] is the most commonly used method for sequence labeling.

  • WDEmb [35] is an enhanced CRF model with word embeddings, context embeddings, and dependency embeddings.

  • LSTM [18] is a vanilla BiLSTM with domain embeddings.

The third group takes advantage of gold-standard opinion terms, sentiment lexicons, and other additional resources for training:

  • CMLA [32] applies a multi-layer architecture with coupled-attentions to model aspects and opinion words.

  • MIN [17] consists of three LSTM layers for multi-task learning, in which a sentiment lexicon and dependency rules are used to find opinion words.

  • DE-CNN [33] is the state-of-the-art model based on CNN and utilizes both general word embeddings and domain-specific embeddings for aspect mining.

  • BERT [6] is one of the key innovations in the recent progress of language modeling and achieves state-of-the-art performance on many natural language processing tasks. We fine-tune \(\text {BERT}_{\text {BASE}}\) on the datasets as a baseline.

The fourth group consists of variants of EMOVA.

  • EMOVA-S is our supervised model but without CVT on unlabeled data, so it is a purely supervised learning model.

  • EMOVA-G only performs CVT on the general unlabeled text (One Billion Word Language Model Benchmark) [1] which is not specific to the laptop or restaurant domain.

We report the results of these baselines as published in their original works, since we use exactly the same datasets.

Training Settings: We use pre-trained GloVe 840B 300-dimension vectors [21] to initialize the word embeddings, and the char-feature size is 50. All of the weight matrices except those in the LSTMs are initialized from the uniform distribution \(U(-0.2, 0.2)\); for the matrices in the LSTMs, we adopt the Glorot Uniform strategy [7]. We apply dropout with a rate of 0.5 for labeled reviews and 0.8 for unlabeled reviews. The hidden state size is set to 300, and the learning rate is 0.05. We set the mini-batch size to 50 sentences, and the moving-window size (i.e., the number of cached nearby past aspect representations) \(N_A\) is 5.
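For reference, the settings above can be collected into a single configuration; the key names in this sketch are ours, while the values are those stated in this section:

```python
# Hyperparameters of EMOVA as reported above (key names are ours).
CONFIG = {
    "word_embeddings": "GloVe 840B, 300-dim [21]",  # pre-trained initialization
    "char_feature_size": 50,
    "hidden_size": 300,
    "learning_rate": 0.05,
    "dropout_labeled": 0.5,      # dropout rate on labeled reviews
    "dropout_unlabeled": 0.8,    # dropout rate on unlabeled reviews
    "mini_batch_size": 50,       # sentences per mini-batch
    "window_size_N_A": 5,        # cached nearby past representations
    "weight_init": "U(-0.2, 0.2); Glorot uniform for LSTM matrices [7]",
}
```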

4.2 Experimental Results

Main Results: We report the F1 scores (%) in Table 2. The results show that EMOVA performs best. Compared with the challenge winners (IHS_RD on \(D_{\text {laptop}1}\), DLIREC on \(D_{\text {laptop}2}\), EliXa on \(D_{\text {rest}1}\), and NLANGP on \(D_{\text {rest}2}\)), EMOVA achieves absolute gains of 7.17%, 1.79%, 2.22%, and 2.84%, respectively. Even EMOVA-S (without CVT) performs better than the supervised baselines in the first and second groups on three of the four datasets (all except the second laptop dataset). The main reason is likely the effectiveness of our moving-window attention layer, which helps to discover aspects under the guidance of frequent aspects in coordinate structures. The results also show that EMOVA-G, trained with general unlabeled texts, improves on the purely supervised model EMOVA-S.

Table 2. Comparison results in F1 score.

The third group of baselines can be considered special cases of semi-supervised learning, as they all rely on additional resources (e.g., hand-crafted features, lexicons, pre-trained domain embeddings, and pre-trained language models) to improve performance. In the pre-training step of the two-phase models (e.g., DE-CNN and BERT), labeled reviews are not exploited. More specifically, BERT learns better representations by training a deep language model on large amounts of text, and DE-CNN attempts to learn domain-specific but general-purpose representations rather than the both domain- and task-specific representations of our EMOVA. As a result, EMOVA works better than these two-phase (i.e., pre-training then supervised learning) models.

Table 3. Ablation study on the key components of EMOVA.
Fig. 2. Effects of the moving-window size \(N_A\).

Ablation Study: The key components of EMOVA include char-features, BiLSTM layers, moving-window attentions, and the primary and auxiliary prediction modules, as shown in Fig. 1. To show the significance of each component, we remove each of them and evaluate the F1 score, as reported in Table 3. First, we disable the char-features; the row w/o char-features shows only a slight effect. Then, we remove the moving-window attention layer; the results drop significantly on all datasets (row w/o attentions), which shows that the moving-window attention is essential. To explore which auxiliary prediction modules are more important, we enable only two of them (\(p_{\text {fwd}}\) and \(p_{\text {bwd}}\), or \(p_{\text {left}}\) and \(p_{\text {right}}\)) at a time. We find that EMOVA w/o fwd & bwd, whose remaining modules do not see the current word, is better than EMOVA w/o left & right, which may be explained by the more restricted views on the unlabeled input.

Fig. 3. Performance vs. the percentage of the labeled training set used.

Effects of the Moving-Window Size: We also evaluate the effects of the moving-window size in the attention layer of our EMOVA framework; the results are shown in Fig. 2. Simply increasing the moving-window size hardly improves the overall performance, i.e., EMOVA achieves better aspect mining accuracy by focusing attention on a certain number of nearby past words. To reduce the computation cost, the moving-window size \(N_A\) is set to 5 in our experiments.

Less Labeled Training Data: A very common situation in aspect mining is that some domains (or products) may not have large volumes of labeled data. To this end, we explore how EMOVA scales with less data by feeding it only a subset (25%, 50%, 75%) of the labeled training data, as presented in Fig. 3. EMOVA with half of the training data performs as well as EMOVA-S (without CVT) trained on all of it. Thus, EMOVA is particularly useful when only a small set of labeled reviews is available, which greatly reduces the cost of manual annotation.

5 Conclusion

In this paper, we have proposed the first semi-supervised End-to-end MOVing-window Attentive framework (EMOVA) for aspect mining on customer reviews. The framework derives the representations of reviews with two Bidirectional Long Short-Term Memory (BiLSTM) layers. Cross-View Training (CVT) is employed to train auxiliary prediction modules on unlabeled reviews, improving the representation learning in a unified end-to-end architecture. Further, EMOVA exploits a moving-window attention mechanism to capture significant nearby past semantic contexts. Experimental results over four datasets from the SemEval workshops show that EMOVA outperforms the state-of-the-art models, even with small labeled training datasets.