
1 Introduction

Consumers often express their opinions about products and services through online reviews and discussion forums. These reviews may include useful suggestions that can help companies better understand consumer needs and improve their products and services. However, manually mining suggestions from vast numbers of non-suggestions is cumbersome, akin to finding needles in a haystack. Therefore, designing systems that can automatically mine suggestions is essential. The recent SemEval [6] challenge on Suggestion Mining saw many researchers use different techniques to tackle the domain-specific task (in-domain suggestion mining). However, open-domain suggestion mining, which obviates the need to develop separate suggestion mining systems for different domains, is still an emerging research problem. We formally define the problem of open-domain suggestion mining as follows:

Definition 1

(Open-domain Suggestion Mining). Given a set of reviews \(\mathcal {R} = \{ r_1, r_2, \ldots , r_n\}\) from multiple domains in \(\mathcal {D} = d_1 \cup d_2 \cup \ldots \cup d_m \), train a classifier C using \(\mathcal {D}\) to predict the nature (suggestion or non-suggestion) of each review \(r_i\).

Building on the work of [5], we design a framework to detect suggestions from multiple domains. We formulate a multitask classification problem to identify both the domain and nature (suggestion or non-suggestion) of reviews. Furthermore, we also propose a novel language model-based text over-sampling approach to address the class imbalance problem.

2 Methodology

2.1 Dataset and Pre-processing

We use the first publicly available annotated dataset for suggestion mining from multiple domains, created by [5]. It comprises reviews from four domains, namely hotel, electronics, travel and software. During pre-processing, we remove all URLs (e.g. https:// ...) and punctuation marks, convert the reviews to lower case and lemmatize them. We also pad the text with start \({{\mathbf {\mathtt{{{S}}}}}}\) and end \({{\mathbf {\mathtt{{{E}}}}}}\) symbols for over-sampling.
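The pre-processing steps above can be sketched as follows. This is a minimal illustration, not the exact pipeline used in our experiments: the URL pattern, the token spellings `<S>` and `<E>` for the start and end symbols, and the `lemmatize` hook are assumptions made for the example. The hook defaults to the identity function, so a lemmatizer from any NLP library can be plugged in.

```python
import re
import string

# Assumed spellings of the start and end symbols used for padding.
START, END = "<S>", "<E>"

# Assumed URL pattern: anything starting with http(s):// or www. up to the next space.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(review, lemmatize=lambda token: token):
    """Remove URLs and punctuation, lower-case, lemmatize and pad one review."""
    text = URL_RE.sub(" ", review)                 # drop URLs first, while they are intact
    text = text.translate(PUNCT_TABLE).lower()     # strip punctuation, then lower-case
    tokens = [lemmatize(tok) for tok in text.split()]
    return [START] + tokens + [END]
```

For example, `preprocess("Visit https://example.com NOW, please!")` yields the token list `["<S>", "visit", "now", "please", "<E>"]`. Note that deleting ASCII punctuation also removes apostrophes, so "doesn't" becomes "doesnt".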

Table 1. Datasets and their sources used in our study [5]. The class ratio column highlights the extent of class imbalance in the datasets. The travel datasets have lower inter-annotator agreement than the rest, indicating that they may contain confusing reviews which are hard to confidently classify as suggestions or non-suggestions. This is also reflected in our classification results.
Table 2. Most frequent 5-grams and their corresponding suggestions sampled using LMOTE. While the suggestions as a whole may not be grammatically correct, their constituent phrases are nevertheless semantically sensible.

2.2 Over-Sampling Using Language Model: LMOTE

One of the major challenges in mining suggestions is the imbalanced distribution of classes, i.e., the number of non-suggestions greatly outweighs the number of suggestions (see Table 1). To address this, studies frequently use the Synthetic Minority Over-sampling Technique (SMOTE) [1] to over-sample the minority class, using text embeddings as features. However, SMOTE operates in Euclidean space, so its synthetic samples are embedding vectors rather than readable text. This prevents an intuitive understanding of the over-sampled data, which is essential for qualitative and error analysis of the classification models. We introduce a novel over-sampling technique designed exclusively for text data, the Language Model-based Over-sampling Technique (LMOTE), and observe performance comparable to (and sometimes slightly better than) that of SMOTE. We use LMOTE to over-sample suggestions before training our classification model. For each domain, LMOTE uses the following procedure to over-sample suggestions:

Find Top \(\eta \) \(\texttt {n}\)-Grams: From all reviews labelled as suggestions (positive samples), select the \(\eta =100\) most frequently occurring \(\texttt {n}\)-grams (\(\texttt {n}=5\)). For example, the phrase “nice to be able to” occurred frequently in many domains.

Train Language Model on Positive Samples: Train a BiLSTM language model on the positive samples (suggestions). The BiLSTM model predicts the probability distribution of the next word (\(w_t\)) over the whole vocabulary (\(V \cup \{{{\mathbf {\mathtt{{{E}}}}}}\}\)) based on the last \(\texttt {n}=5\) words (\(w_{t-5},\ldots , w_{t-1}\)), i.e., the model learns the distribution \(P(w_i \ | \ w_{t-5} \ w_{t-4} \ w_{t-3} \ w_{t-2} \ w_{t-1})\), and the predicted next word is \(w_t = \underset{w_i}{{{\,\mathrm{arg\,max}\,}}} \, P(w_i \ | \ w_{t-5} \ w_{t-4} \ w_{t-3} \ w_{t-2} \ w_{t-1})\).

Generate Synthetic Text Using Language Model and Frequent \(\texttt {n}\)-Grams: Using the language model and a randomly chosen frequent 5-gram as the seed, we generate text by repeatedly predicting the most probable next word (\(w_t\)), until the end symbol \({{\mathbf {\mathtt{{{E}}}}}}\) is predicted.
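The first of the three steps above, finding the most frequent n-grams, can be sketched as follows. The function name `top_ngrams` and the representation of each suggestion as a list of tokens are assumptions made for this illustration.

```python
from collections import Counter


def top_ngrams(suggestions, eta=100, n=5):
    """Return the eta most frequent n-grams (as tuples) over tokenized suggestions."""
    counts = Counter()
    for tokens in suggestions:
        # Slide a window of width n over the review; short reviews contribute nothing.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return [gram for gram, _ in counts.most_common(eta)]
```

Each returned n-gram can then serve as a seed for text generation in the third step.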

Algorithm 1. LMOTE over-sampling methodology.

Table 2 lists the most frequent 5-grams and the corresponding suggestions ‘sampled’ using LMOTE. In our study, we generate synthetic positive reviews until the numbers of suggestion and non-suggestion samples in the training set are equal.

Algorithm 1 summarizes the LMOTE over-sampling methodology. A brief description of the sub-procedures used in the algorithm follows:

  • NGrams\((\mathcal {D}_{sugg}, \eta , n)\): Returns the top \(\eta \) n-grams from the set of suggestions, \(\mathcal {D}_{sugg}\).

  • TrainLanguageModel\((\mathcal {D}_{sugg}, n)\): Trains an n-gram BiLSTM language model on \(\mathcal {D}_{sugg}\).

  • random\((n\_grams)\): Randomly selects an n-gram from the input set.

  • LMOTEGenerate\((language\_model, seed)\): Takes as input the trained language model and, as the seed, an n-gram randomly chosen from the set of top \(\eta \) n-grams, and generates a review until the end symbol \({{\mathbf {\mathtt{{{E}}}}}}\) is produced. The procedure is repeated until we have a total of \(\mathcal {N}\) suggestion reviews.
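The generation loop described by these sub-procedures can be sketched as follows. To keep the example self-contained, the BiLSTM is replaced by a simple count-based n-gram model that plays the role of TrainLanguageModel. This stand-in is an assumption made for illustration and is not the model used in our experiments. The `<E>` token spelling, the `max_len` safety cap and the function names are likewise hypothetical.

```python
import random
from collections import Counter, defaultdict

END = "<E>"  # assumed spelling of the end symbol


def train_language_model(suggestions, n=5):
    """Stand-in for TrainLanguageModel: count next-token frequencies per n-token context."""
    model = defaultdict(Counter)
    for tokens in suggestions:
        for i in range(len(tokens) - n):
            model[tuple(tokens[i:i + n])][tokens[i + n]] += 1
    return model


def lmote_generate(language_model, seed, n=5, max_len=50):
    """Greedily extend the seed n-gram until END is predicted or max_len is reached."""
    text = list(seed)
    while len(text) < max_len:
        candidates = language_model.get(tuple(text[-n:]))
        if not candidates:                           # context never seen in training: stop
            break
        next_word = candidates.most_common(1)[0][0]  # arg max over observed next words
        text.append(next_word)
        if next_word == END:
            break
    return text


def lmote(suggestions, n_grams, target_size, n=5, rng=random):
    """Append synthetic suggestions until there are target_size suggestions in total."""
    language_model = train_language_model(suggestions, n)
    augmented = list(suggestions)
    while len(augmented) < target_size:
        seed = rng.choice(n_grams)                   # random(n_grams)
        augmented.append(lmote_generate(language_model, seed, n))
    return augmented
```

Setting `target_size` to the number of non-suggestions in the training set gives the balanced setting used in our study. Because decoding is greedy, the same seed always yields the same text, so the diversity of the synthetic reviews comes from the choice of seed.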

2.3 Mining Suggestion Using Multi-task Learning

Multi-task learning (MTL) has been successful in many applications of machine learning, since sharing representations with auxiliary tasks allows models to generalize better on the primary task. Figure 1B illustrates a 3-dimensional UMAP [4] visualization of text embeddings of suggestions, coloured by their domain. These embeddings are outputs of the penultimate layer (the dense layer before the final softmax layer) of the single-task learning (STL) ensemble baseline. The visualization shows that suggestions from different domains have distinct feature representations. We therefore hypothesize that we can identify suggestions better by leveraging domain-specific information using MTL. Accordingly, in the MTL setting, given a review \(r_i\) in the dataset \(\mathcal {D}\), we aim to identify both the domain of the review and its nature.

2.4 Classification Model

We use an ensemble of three architectures, namely a CNN [2] to mirror the spatial perspective and preserve the n-gram representations; an Attention Network to learn the most important features automatically; and a BiLSTM-based text RCNN [3] model to capture the context of a text sequence (Fig. 2). In the MTL setting, the ensemble has two output softmax layers, which predict the domain and the nature of a review. The STL baselines, in contrast, have only a single softmax layer to predict the nature of the review. We use ELMo [7] word embeddings trained on the dataset as input to the models.
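The way the two softmax outputs are combined during training is not specified above. One common choice, shown here purely as an assumption, is to minimise a weighted sum of the cross-entropy losses of the two heads. The function names and the `domain_weight` parameter are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target_index):
    """Negative log-likelihood of the correct class."""
    return -math.log(probs[target_index])


def multitask_loss(nature_logits, domain_logits, nature_label, domain_label,
                   domain_weight=1.0):
    """Assumed joint objective: nature loss plus weighted auxiliary domain loss."""
    nature_loss = cross_entropy(softmax(nature_logits), nature_label)
    domain_loss = cross_entropy(softmax(domain_logits), domain_label)
    return nature_loss + domain_weight * domain_loss
```

With `domain_weight=0`, this objective reduces to the single-task loss on the nature of the review, which corresponds to the STL baselines.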

Fig. 1. (A) Receiver operating characteristics (TPR vs. Log FPR) curve pooled across all domains for all models used in this work demonstrates that LMOTE coupled with our multi-task model outperforms other considered alternatives across domains. (B) 3-dimensional UMAP visualization of text embeddings of suggestions coloured by domain. Suggestions from different domains have distinct feature representations.

3 Results and Discussion

We conducted experiments to assess the impact of over-sampling, the performance of LMOTE, and the performance of the multi-task model. We used the train-test split provided with the dataset. All comparisons are made in terms of the F-1 score of the suggestion class, for a fair comparison with prior work on representational learning for open domain suggestion mining [5] (see Baseline in Table 3). For a more insightful evaluation, we also compute the area under the Receiver Operating Characteristic (ROC) curve for all models used in this work. Tables 3 and 4 and Figs. 1A and 3 summarize the results of our experiments, and there are several interesting findings:

Fig. 2. Our multi-task classification model, which consists of an ensemble of RCNN, CNN and BiLSTM attention network. The primary task is predicting the nature of a review (suggestion), while the auxiliary task involves predicting its domain (hotel).

Over-Sampling Improves Performance. To examine the impact of over-sampling, we compared the performance of our ensemble classifier with and without over-sampling, i.e., we compared the results under the STL, STL + SMOTE and STL + LMOTE columns. Our results confirm that, in general, over-sampling suggestions to obtain a balanced dataset improves the performance (F-1 score and AUC) of our classifiers.

LMOTE Performs Comparably to SMOTE. We compared the performance of SMOTE and LMOTE in the single task settings (STL + SMOTE and STL + LMOTE) and found that LMOTE performs comparably to SMOTE (and even outperforms it in the electronics and software domains). LMOTE also has the added advantage of resulting in intelligible samples which can be used to qualitatively analyze and troubleshoot deep learning based systems. For instance, consider suggestions created by LMOTE in Table 2. While the suggestions may not be grammatically correct, their constituent phrases are nevertheless semantically sensible.

Table 3. Performance evaluation using F-1 score. Multi-task learning with LMOTE outperforms other alternatives in open-domain suggestion mining. Furthermore, owing to potentially confusing reviews in the travel domain (Table 1), its F-1 scores are significantly lower than the other domains.
Table 4. Performance evaluation using area under ROC with \(95\%\) confidence intervals. Multi-task learning with LMOTE outperforms other alternatives in open-domain suggestion mining. Multi-task learning leads to a significant improvement in AUC over its single task counterpart. (AUCs for baseline models proposed by [5] were unavailable.)
Fig. 3. Domain-wise receiver operating characteristics (ROC) curves.

Multi-task Learning Outperforms Single-Task Learning. We compared the performance of our classifier in single-task and multi-task settings (STL + LMOTE and MTL + LMOTE) and found that multi-task learning improves the performance of our classifier. We qualitatively analysed the single-task and multi-task models, and found many instances where, by leveraging domain-specific information, the multi-task model was able to correctly identify suggestions. For instance, consider the following review: “Bring a Lan cable and charger for your laptop because house-keeping doesn’t provide it.” While the review appears to be an assertion (non-suggestion), by predicting its domain (hotel), the multi-task model was able to correctly classify it as a suggestion.
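For reference, the two metrics reported in this section can be computed as in the sketch below. It assumes binary labels in which 1 denotes a suggestion, and it computes AUC through its rank interpretation, i.e. the probability that a randomly chosen suggestion receives a higher score than a randomly chosen non-suggestion. The function names are hypothetical, and the confidence intervals of Table 4 are not covered.

```python
def f1_positive(y_true, y_pred):
    """F-1 score of the suggestion class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0                     # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison (ties count one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise formulation is quadratic in the number of samples, which is acceptable for illustration. A sort-based implementation would be preferable for large test sets.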

4 Conclusion

In this work, we proposed a multi-task learning framework for open-domain suggestion mining, along with LMOTE, a novel language model-based over-sampling technique for text. Our experiments revealed that multi-task learning combined with LMOTE over-sampling outperformed the considered alternatives in terms of both the F-1 score of the suggestion class and AUC.