SSAM: toward Supervised Sentiment and Aspect Modeling on different levels of labeling

In recent years, people have come to express opinions on almost every online service or product, and a huge number of opinions now exist on social media, online stores and blogs. However, most of these opinions are presented as plain text, so a powerful method is required to analyze this volume of unlabeled reviews and obtain the relevant details in minimum time and with high accuracy. In this paper, we propose a supervised model to analyze large unlabeled opinion data sets. The model has two phases: preprocessing and a Supervised Sentiment and Aspect Model (SSAM), which is an extended version of the Latent Dirichlet Allocation model. In the preprocessing phase, we input thousands of unlabeled opinions and receive a set of (key, value) pairs in which a key holds a word or an opinion and a value holds supervised information such as the sentiment label of that word or opinion. We then give these pairs to the proposed SSAM algorithm, which incorporates different levels of supervised information, such as the (document and sentence) levels or the (document and term) levels, to extract and cluster aspects related to a sentiment label and to classify opinions based on their sentiments. We applied SSAM to reviews of electronic devices and books from Amazon. The experiments show that SSAM finds more important aspects that are closely coupled with a sentiment label, and that in sentiment classification SSAM outperforms other topic models and comes close to supervised methods.


Introduction
Unsupervised extraction of aspects from unlabeled documents is a common challenge, and topic modeling has been used to address it. Supervised methods for aspect extraction (Liu et al. 2015; Poria et al. 2016) are not applicable to unlabeled datasets, and they may fail when applied to a new domain; for example, a model learned on electronic product data is not applicable to sports data. Latent Dirichlet Allocation (LDA) (David et al. 2003) is a popular and widely used topic model. It assumes that, for each word position in a document, an aspect is randomly chosen from a document-specific distribution, and then a word is randomly chosen according to a distribution specified by the chosen aspect. The document-aspect and aspect-word distributions that generate the document are unknown, but can be inferred using Gibbs sampling.
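The generative assumption of LDA described above can be illustrated with a minimal sketch. The sizes K and V, the prior values and the helper names (sample_dirichlet, sample_index, generate_document) are hypothetical choices made only for this illustration.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [x / total for x in draws]

def sample_index(probs, rng):
    """Draw an index from a discrete distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_document(theta, phi, length, rng):
    """Generate (aspect, word) pairs for one document.

    theta is the document-aspect distribution and phi[k] is the
    aspect-word distribution of aspect k.
    """
    doc = []
    for _ in range(length):
        k = sample_index(theta, rng)    # choose an aspect for this word
        w = sample_index(phi[k], rng)   # choose a word from that aspect
        doc.append((k, w))
    return doc

rng = random.Random(0)
K, V = 3, 6                             # hypothetical numbers of aspects and terms
theta = sample_dirichlet([0.5] * K, rng)
phi = [sample_dirichlet([0.5] * V, rng) for _ in range(K)]
doc = generate_document(theta, phi, 8, rng)
```

Inference runs in the opposite direction: only the word ids are observed, and theta and phi are estimated from them.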
Extending these models with further assumptions about the data-generating process makes them more general and effective. Modeling sentiment and topics simultaneously (Lin et al. 2012; Jo and Oh 2011; Titov and McDonald 2008; Mei et al. 2007) is an informative task carried out by topic-modeling-based sentiment analysis methods. None of the existing topic models for sentiment analysis considers supervised information, such as the sentiment label of a review or a term, in its generative process. If we could constrain the sentiment labels of all words in a review to be generated from one sentiment, based on the review's sentiment label, it would help to extract more coherent and specific aspects and to categorize reviews into different sentiment classes. However, real opinion datasets impose some limitations, such as a huge volume of unlabeled opinion data and a lack of any knowledge about a review's trend or sentiment. Many works on topic-modeling-based sentiment analysis (Lin et al. 2012; Jo and Oh 2011; Titov and McDonald 2008; Mei et al. 2007; Poria et al. 2016; Rahman and Wang 2016; Lim and Buntine 2014) use small sentiment lexicons to assign sentiment labels to the sentiment words that appear in reviews. Although they can extract better aspects, they have some problems: they extract irrelevant aspects in different sentiment classes, have lower classification accuracy, and are very time consuming because they require many Gibbs sampling iterations to reach stable convergence, and they may be unable to sample on this volume of data.
In this paper, we propose a supervised topic model called the Supervised Sentiment and Aspect Model (SSAM) for classifying reviews into different sentiment classes. We reformulate the generative process of LDA and adapt it to incorporate sentiment, so that the resulting model represents probability distributions over words for pairs of sentiment and aspect. Aspects are drawn conditioned on the review's sentiment label, and words are drawn conditioned on the review's aspect and sentiment label. SSAM can consider different types of supervision, such as a review's sentiment label and a term's label, all of which are calculated in the preprocessing phase of the model. SSAM is distinguished from other related sentiment and topic models by its ability to accommodate a big unlabeled corpus of reviews, which is achieved by implementing it on the Spark big data framework (Zaharia et al. 2010). We tested SSAM on thousands of Amazon reviews in the Electronics and Ebooks domains, and the experimental results show that our proposed model significantly outperforms two strong supervised methods (SVM and NB) as well as two closely related sentiment and topic models (JST and ASUM) in sentiment classification accuracy. Aside from sentiment classification, SSAM has lower time complexity than LDA and can extract coherent and meaningful aspects. We summarize our contributions as follows:
• SSAM extends the LDA model by adding a sentiment layer and considers both review sentiment labels and term sentiment labels.
• SSAM can be regarded as a full framework for classifying unlabeled reviews and clustering related words with high accuracy.
• Our proposed model is capable of extracting implicit aspects, negation sentiments and intensified sentiments, and can also consider sentence structure and term order instead of a bag of words.
• The implementation is on the Spark big data framework, to adapt to the explosive growth of opinions on the web.
• A thorough analysis of SSAM compared to other sentiment and topic models (e.g., JST and ASUM) and different supervised methods (e.g., SVM and NB) is presented.
The paper is organized as follows: Sect. 3 reviews some works on supervised topic models that are related to our proposed model; SSAM and its inference procedure are described in Sect. 4; and results and experiments on the Amazon reviews dataset are discussed in Sect. 6. Finally, conclusions and future work are outlined in Sect. 7.

Terminology
In this section, we define the terminology used in this paper.
• Multiword aspect or sentiment: an n-gram phrase that conveys an aspect or sentiment, for example, "portable DVD player", "well designed".
• Negation sentiment: a multiword with at least one sentiment word and one negation word (such as no, not, none or cannot) as the previous word, for example, "not bad", "not clear".
• Intensified sentiment: a multiword with at least one sentiment word and one intensifier word (such as so, very or extremely) as the previous word, for example, "very well", "so expensive".
• Aspect: a topic in topic modeling methods.
• Explicit aspect: an aspect expression in a sentence that is a noun or noun phrase, for example, "camera", "battery".
• Implicit aspect: an aspect expression in a sentence that is of another type, such as an adjective or adverb, for example, "not fit", "expensive".
• Sentiment lexicons: sets of words with positive (+1) or negative (−1) sentiment, such as good (+1) or bad (−1), which are used in the scoring levels of the preprocessing phase.
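The definitions of negation and intensified sentiments above can be made concrete with a small sketch. The word lists come from the examples in the definitions; the toy LEXICON and the function name tag_multiword_sentiments are hypothetical and stand in for the real sentiment lexicons.

```python
NEGATIONS = {"no", "not", "none", "cannot"}
INTENSIFIERS = {"so", "very", "extremely"}
# Hypothetical toy lexicon: +1 for positive words, -1 for negative words.
LEXICON = {"good": 1, "well": 1, "clear": 1, "bad": -1, "expensive": -1}

def tag_multiword_sentiments(tokens):
    """Return (phrase, type) pairs for sentiment words whose previous word
    is a negation word or an intensifier."""
    found = []
    for prev, word in zip(tokens, tokens[1:]):
        if word not in LEXICON:
            continue
        if prev in NEGATIONS:
            found.append((f"{prev} {word}", "negation"))
        elif prev in INTENSIFIERS:
            found.append((f"{prev} {word}", "intensified"))
    return found

result = tag_multiword_sentiments(
    "the screen is not clear and so expensive".split()
)
# result == [("not clear", "negation"), ("so expensive", "intensified")]
```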

Related works
Several modifications of the LDA model that incorporate supervised information have been proposed in the literature. These models often incorporate supervised information as prior knowledge for model learning and as a restriction on topic assignment. Depending on how the supervised information is incorporated, there are two types of such topic models, named downstream and upstream. Downstream topic models incorporate metadata such as time, author, publication date or publication venue in their generative process, generating both words and metadata simultaneously conditioned on the topic assignment. Examples of such "downstream" models include the Topics over Time model (TOT) (Wang and McCallum 2006), the CorrLDA model (Mimno and McCallum 2012), supervised latent Dirichlet allocation (Blei and McAuliffe 2010) and Labeled LDA (Ramage et al. 2009).
Upstream topic models start with the supervised information and represent each topic as a mixture of distributions conditioned on the supervised information. Examples of the upstream type include the Joint Sentiment-Topic (JST) model (Lin et al. 2012), the Aspect and Sentiment Unification Model (ASUM) (Jo and Oh 2011), DiscLDA (Lacoste-Julien et al. 2009), feaLDA (Lin et al. 2012), SenticLDA (Poria et al. 2016), HTSM (Rahman and Wang 2016) and TOTM (Lim and Buntine 2014). The works most closely related to our proposed model are upstream topic models. In the JST model, sentiment is integrated with an aspect in a single language model, and sentiment and aspect words are discovered simultaneously to form a sentiment-bearing aspect, which can be used to capture sentiment association among words from different domains. The sentiment-bearing aspects detected by JST have been used for sentiment classification. JST is a weakly-supervised model because it uses a small sentiment lexicon as supervised information to modify the Dirichlet prior of the sentiment-topic-word distribution. ASUM is very similar to JST, as it extracts sentiment and aspects simultaneously by modeling each document with a sentiment distribution and a set of sentiment-specific aspect proportions. The main difference between ASUM and JST on the one hand and SSAM on the other is that ASUM and JST make use of a small seed set of sentiment words and have no mechanism to incorporate supervised information such as document or term sentiment labels in model inference, whereas SSAM can handle labeled and unlabeled data and classify unlabeled data based on the learned model. SSAM is a general model that is capable of operating on different levels of supervised information and works as a semi-supervised or supervised method.
FeaLDA is a supervised generative topic model for the automatic detection of Web API documents from a pre-labeled corpus of web documents. DiscLDA associates a single supervised label with each document and associates a topic mixture with each label; it applies a document-label transformation matrix to modify the Dirichlet prior of the document-topic distribution in the LDA model.
SenticLDA incorporates a set of seed words, user feedback and semantic relationships between words into the model to extract more coherent aspects.
Previous works incorporate only document labels as prior knowledge, or use a small sentiment lexicon as supervised information in model learning. In contrast, we propose a novel Supervised Sentiment and Aspect Model (SSAM) which is capable of incorporating supervised information derived from both the document and term sentiment labels calculated in the preprocessing phase. This information enters the generative process to constrain model inference and to constrain the sentiment-document and sentiment-aspect-term distributions, which provides SSAM with a more stable statistical foundation.

Preprocessing phase
Raw text is usually not suitable for mining for various reasons; hence, it needs to be broken down into smaller elements such as sentences or words and transformed through several preprocessing steps. In this paper, we use several transformations: stop word removal, stemming, bigram and n-gram extraction, implicit aspect detection, negation and intensified sentiment extraction, and finally scoring on three different levels (term, sentence and document). Bigram and n-gram extraction is based on the approaches described in Church and Hanks (1990). By applying these techniques we can find all useful n-grams, which include almost all multiword aspects and negation and intensified sentiments.
Table 1 contains the examples of these extracted n-grams from Amazon Electronics dataset.
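As an illustration of association-based bigram extraction in the spirit of Church and Hanks (1990), the following sketch scores adjacent word pairs by pointwise mutual information. The min_count filter, the toy corpus and the function name pmi_bigrams are assumptions made for this example and are not taken from the preprocessing pipeline itself.

```python
import math
from collections import Counter

def pmi_bigrams(sentences, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (log base 2).

    Pairs seen fewer than min_count times are skipped, because PMI is
    unreliable for rare events.
    """
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (x, y), count in bigrams.items():
        if count < min_count:
            continue
        p_xy = count / n_bi
        p_x, p_y = unigrams[x] / n_uni, unigrams[y] / n_uni
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores

corpus = [
    "the picture quality is good".split(),
    "picture quality is not clear".split(),
    "the lcd screen is not clear".split(),
    "good lcd screen".split(),
]
scores = pmi_bigrams(corpus)
```

On this toy corpus the pair ("picture", "quality") scores higher than ("is", "not"), because "is" also occurs in other contexts.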
As shown in Fig. 1, the scoring step has three different levels, depending on how the calculated scores are spread over the document, sentence and term levels. At the document level, the words in a document are assumed to be generated from the same sentiments and aspects. At this level, a sentiment label vector σ is calculated according to Algorithm 3, using two manually pre-defined threshold vectors, min and max, of length S (the number of sentiments, set by the user) to assign values to the elements of σ.
For example, suppose we have three sentiment labels (negative, neutral and positive), so S = 3, the score value of document d is +1, and the threshold vectors are min = {−10, −1, 1} and max = {0, 1, 10}. Then σ_d = (0, 1, 1), which means document d has both sentiment labels 2 and 3. The output of this algorithm is document d and its sentiment vector σ_d. Scoring at the term level captures dependencies and neighborhoods between the words in a sentence (e.g., the words to the left and right of a sentiment word) and assumes that a sentence may contain one or more aspects and one or more sentiments. The score value at this level is calculated by Algorithm 1, in which w − 1 refers to the neighbor word on the left and w + 1 to the neighbor word on the right of word w in a sentence. The sentence level of scoring assumes that one sentence tends to represent one sentiment and one aspect.
Algorithm 2 shows the process of calculating the score value at the sentence level. The output of both the term and sentence levels of scoring is a corpus of documents in which every document has a set of (key, value) pairs.
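The document-level example with the min and max threshold vectors can be reproduced with a short sketch. It assumes that label s is switched on when min[s] ≤ score ≤ max[s], which is consistent with the worked example but is not a transcription of Algorithm 3; the function name document_sentiment_vector is hypothetical.

```python
def document_sentiment_vector(score, min_thresholds, max_thresholds):
    """Return a 0/1 vector with one entry per sentiment label.

    Assumption: label s is switched on when
    min_thresholds[s] <= score <= max_thresholds[s].
    """
    return [
        1 if low <= score <= high else 0
        for low, high in zip(min_thresholds, max_thresholds)
    ]

# Labels are ordered (negative, neutral, positive), as in the example.
sigma_d = document_sentiment_vector(1, [-10, -1, 1], [0, 1, 10])
# sigma_d == [0, 1, 1]
```

Because the neutral and positive intervals overlap at a score of +1, the document receives both labels.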
Here N_{d_i} is the length of document d_i, each key in a (key, value) pair is a word from a vocabulary of V distinct terms {1, ..., V}, and the value is the label of this word. Also, let each l_s ∈ {0, 1}, where S is the number of sentiment labels. The formal definition of the generative process of SSAM is given in Algorithm 4. The procedure for generating a word in document d under SSAM may be summarized in three steps. First, one draws a sentiment label s from the per-document sentiment proportion π_d. Next, one chooses an aspect z from the per-document aspect distribution θ_{d,s}, conditioned on the sampled sentiment label s. Finally, one chooses a word from the sentiment-aspect word distribution ϕ_{s,z}. The JST and ASUM models draw a multinomial mixture distribution π_d over all S sentiment labels, for each document d, from a Dirichlet prior γ. In contrast, we would like to restrict π_d to be defined only over the sentiments that correspond to the document's sentiment labels σ_d. Since the document-sentiment assignments s_i (see line 12 in Algorithm 4) are drawn from this distribution, this restriction ensures that all sentiment assignments are limited to the document's sentiment labels.
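The three generation steps can be sketched as follows. The distributions are passed in as plain nested lists, and the sizes, probability values and function names (draw, generate_word) are hypothetical; the sketch only illustrates the order of the draws, not the full generative process.

```python
import random

def draw(probs, rng):
    """Draw an index from a discrete distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_word(pi_d, theta_d, phi, rng):
    """Generate one word of document d.

    pi_d[s]      : per-document sentiment proportion
    theta_d[s][z]: per-document aspect distribution given sentiment s
    phi[s][z][w] : sentiment-aspect word distribution
    """
    s = draw(pi_d, rng)           # step 1: sentiment label
    z = draw(theta_d[s], rng)     # step 2: aspect conditioned on s
    w = draw(phi[s][z], rng)      # step 3: word conditioned on (s, z)
    return s, z, w

rng = random.Random(0)
# pi_d is restricted to labels 1 and 2 (sigma_d = (0, 1, 1)).
pi_d = [0.0, 0.4, 0.6]
theta_d = [[0.5, 0.5], [0.9, 0.1], [0.2, 0.8]]
phi = [[[0.25] * 4 for _ in range(2)] for _ in range(3)]
words = [generate_word(pi_d, theta_d, phi, rng) for _ in range(50)]
```

Since π_d places zero mass on label 0, no generated word is ever assigned that sentiment.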
It is worth noting that if we use only the term level of scoring and set γ to a pre-defined constant (e.g., 0.1), then SSAM reduces to the JST model. If we use the sentence level of scoring but do not incorporate the document's sentiment label, then SSAM behaves like the ASUM model, and if we consider the term level of scoring with a pre-labeled corpus, our model works like feaLDA. The generative processes of JST, ASUM and feaLDA differ from that of SSAM in that our proposed model incorporates learned supervised information by introducing a transformation matrix λ and a document labels vector σ. These encode the knowledge obtained from the preprocessing phase and modify the Dirichlet priors of both the sentiment-aspect word distributions and the document-specific sentiment distributions. SSAM exploits supervised information by using the asymmetric priors γ and β.
In the following, we discuss how to incorporate prior knowledge into the proposed model.

Incorporating document's sentiment labels
SSAM incorporates a document's sentiment labels by introducing the document labels vector σ. To achieve this, we generate the document's sentiment labels σ_d using a Bernoulli coin toss for each sentiment label s, with the sentiment labeling prior ε_s, as shown in line 8 of the SSAM generative process (Algorithm 4). We use the σ vector to restrict the document-sentiment Dirichlet prior γ = (γ_1, ..., γ_S) as follows:
γ_d = σ_d × γ (1)
where × denotes the element-wise product. For example, suppose we have three sentiment labels {negative, neutral, positive}, so S = 3, and a document d has a vector of sentiment labels given by σ_d = {0, 1, 1}. If π_d is drawn from a Dirichlet distribution with prior γ_d = σ_d × γ = (0, γ, γ), the Dirichlet is restricted to the sentiments neutral and positive. This fulfills our requirement that the document's sentiment assignments are restricted to its own sentiment labels. The dependency of π on both γ and σ is indicated by directed edges from σ and γ to π in the plate notation in Fig. 3.
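A minimal sketch of drawing π_d under the restricted prior is given below. It assumes that labels with σ_{d,s} = 0 receive exactly zero probability and that the remaining labels follow an ordinary Dirichlet draw; the function name restricted_sentiment_proportions and the value γ = 1.0 are hypothetical choices for the illustration.

```python
import random

def restricted_sentiment_proportions(sigma_d, gamma, rng):
    """Draw pi_d from a Dirichlet whose prior is gamma_d = sigma_d * gamma.

    Entries with sigma_d[s] == 0 get probability exactly 0; the other
    entries are a Dirichlet draw over the allowed labels only.
    """
    gamma_d = [s * g for s, g in zip(sigma_d, gamma)]
    if not any(g > 0 for g in gamma_d):
        raise ValueError("document needs at least one sentiment label")
    # A Dirichlet sample is a vector of Gamma draws divided by their sum.
    draws = [rng.gammavariate(g, 1.0) if g > 0 else 0.0 for g in gamma_d]
    total = sum(draws)
    return [x / total for x in draws]

rng = random.Random(0)
sigma_d = [0, 1, 1]                 # neutral and positive only
gamma = [1.0, 1.0, 1.0]             # symmetric prior before restriction
pi_d = restricted_sentiment_proportions(sigma_d, gamma, rng)
```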

Incorporating terms or sentences label
Another type of supervised information is the term labels, which are calculated from the term and sentence levels of scoring in the preprocessing phase. In existing supervised topic models, the Dirichlet prior β of the sentiment-aspect word distribution is usually set to a symmetric value. Our experiments showed that incorporating term labels into the model could potentially improve its classification performance. We encode the labeled terms into SSAM by introducing a word-sentiment association transformation matrix λ with dimension V × S. For word w_i, its sentiment label association vector λ_{w_i} is calculated as follows:
λ_{w_i,s} = count(w_i, s) / Σ_{s'=1}^{S} count(w_i, s') (2)
where count(w_i, s) is the number of occurrences of word w_i that are labeled with sentiment s, so that Σ_{s=1}^{S} λ_{w_i,s} = 1. For example, suppose there are three sentiment labels (S = 3) and the word camera, with index w_i, occurred 200 times under sentiment label 1, 80 times under sentiment label 2 and 20 times under sentiment label 3. Its association vector is then λ_{w_i} = (200/300, 80/300, 20/300). We can then incorporate term labels into SSAM by setting the sentiment-aspect word prior β from λ. In this state, SSAM can ensure that a labeled term such as "camera" has a higher probability of being drawn from aspects associated with sentiment label 1. This initialization of β distinguishes SSAM from other supervised and unsupervised topic models.
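The association vector for the "camera" example can be computed as in the following sketch. The function name association_vector and the uniform fallback for words with no labeled occurrences are assumptions made for the illustration.

```python
def association_vector(counts):
    """Normalize per-sentiment occurrence counts of one word so they sum to 1.

    counts[s] is the number of times the word was labeled with sentiment s.
    Assumption: a word with no labeled occurrences gets a uniform vector.
    """
    total = sum(counts)
    if total == 0:
        return [1.0 / len(counts)] * len(counts)
    return [c / total for c in counts]

# "camera": 200, 80 and 20 occurrences under sentiment labels 1, 2 and 3.
lambda_camera = association_vector([200, 80, 20])
```

The resulting vector is (200/300, 80/300, 20/300), so most of the prior mass for "camera" goes to sentiment label 1.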

Learning and inference
From the SSAM graphical model shown in Fig. 3, the joint distribution of all variables (observed and hidden) can be factored into three terms (Eq. 5). By integrating out π, θ and ϕ in the first, second and third terms on the RHS of Eq. (5), respectively, we obtain the collapsed form of each term. The notations are described in Table 2. In SSAM, we assume that the documents and terms are multiply tagged in the preprocessing phase. At inference time, when the labels σ_d of the documents are observed, the document labeling prior ε is d-separated from the rest of the model given σ_d, and the per-document sentiment prior γ_d is restricted to the labels σ_d of document d. We therefore use collapsed Gibbs sampling (Griffiths and Steyvers 2004) to infer the latent variables θ, ϕ and π at each iteration of the Markov chain. The sampling probability for choosing the sentiment and aspect of the ith word is conditioned on the assignments of all other words; counts with the superscript −i, such as N_d^{−i}, exclude word i. Gibbs sampling (Algorithm 5) sequentially samples each variable S and Z from its distribution conditioned on all other variables and the data, until a stationary state of the Markov chain has been reached. The samples obtained from the Markov chain are then used to approximate the per-corpus sentiment-aspect word distribution ϕ, the per-document sentiment-aspect distribution θ and the per-document sentiment distribution π.
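To make the sampling loop more tangible, the sketch below runs collapsed Gibbs sweeps over a toy corpus. It assumes a JST-style update (word term × aspect term × sentiment term) in which only sentiments allowed by σ_d are candidates; the symmetric aspect prior alpha, the scalar gamma, the toy data and the function name gibbs_sweep are assumptions, so this is an illustration rather than a transcription of Algorithm 5.

```python
import random

def gibbs_sweep(docs, sigma, assign, S, Z, V, alpha, beta, gamma, rng):
    """Run one collapsed Gibbs sweep, updating assign[d][i] = (s, z) in place.

    docs[d] is a list of word ids, sigma[d] is the 0/1 label vector of
    document d and beta[s][w] is the (possibly asymmetric) word prior.
    """
    # Count tables rebuilt from the current assignments.
    n_szw = [[[0] * V for _ in range(Z)] for _ in range(S)]
    n_sz = [[0] * Z for _ in range(S)]
    n_dsz = [[[0] * Z for _ in range(S)] for _ in docs]
    n_ds = [[0] * S for _ in docs]

    def update(d, s, z, w, delta):
        n_szw[s][z][w] += delta
        n_sz[s][z] += delta
        n_dsz[d][s][z] += delta
        n_ds[d][s] += delta

    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            s, z = assign[d][i]
            update(d, s, z, w, 1)

    beta_sum = [sum(row) for row in beta]
    for d, doc in enumerate(docs):
        allowed = [s for s in range(S) if sigma[d][s] == 1]
        for i, w in enumerate(doc):
            s, z = assign[d][i]
            update(d, s, z, w, -1)      # exclude word i from the counts
            pairs, weights = [], []
            for s2 in allowed:          # the prior is zero outside sigma_d
                for z2 in range(Z):
                    p_word = (n_szw[s2][z2][w] + beta[s2][w]) / (
                        n_sz[s2][z2] + beta_sum[s2])
                    p_aspect = (n_dsz[d][s2][z2] + alpha) / (
                        n_ds[d][s2] + Z * alpha)
                    p_sent = n_ds[d][s2] + gamma  # shared denominator dropped
                    pairs.append((s2, z2))
                    weights.append(p_word * p_aspect * p_sent)
            s, z = rng.choices(pairs, weights=weights, k=1)[0]
            assign[d][i] = (s, z)
            update(d, s, z, w, 1)
    return assign

rng = random.Random(0)
docs = [[0, 1, 2, 1], [3, 4, 3, 5]]     # toy word ids
sigma = [[0, 1, 1], [1, 0, 0]]          # allowed sentiment labels per document
S, Z, V = 3, 2, 6
beta = [[0.01] * V for _ in range(S)]
assign = [
    [(rng.choice([s for s in range(S) if sigma[d][s]]), rng.randrange(Z))
     for _ in doc]
    for d, doc in enumerate(docs)
]
for _ in range(20):
    gibbs_sweep(docs, sigma, assign, S, Z, V,
                alpha=0.5, beta=beta, gamma=0.3, rng=rng)
```

After any number of sweeps, every word in a document keeps a sentiment assignment that belongs to that document's label set, which is the effect of the restricted prior.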

Implementing SSAM on Spark framework
Spark (Zaharia et al. 2010) is a fast, general-purpose engine for large-scale data processing which provides features not previously available in Hadoop, including caching and ease of use. The detailed implementation of SSAM on Spark is shown in Algorithm 6. We first distribute the data and parameters, such as the per-review sentiment distribution π and the sentiment-aspect word distribution ϕ, over P processors, with π_p = π/P and ϕ_p = ϕ on each processor. The collapsed Gibbs sampling procedure is then executed on each processor, and π_p and ϕ_p are updated locally at the same time. After the sampling, we calculate ϕ by collecting all locally updated ϕ_p using Eq. 13, and then broadcast the updated ϕ to all processors.
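The collect-and-broadcast step can be illustrated without Spark itself. In the sketch below, plain Python lists stand in for the partitions and the broadcast copy, and the merge rule (adding each worker's change relative to the shared starting counts) is an assumption in the style of approximate distributed topic models; it may differ from the merge rule in Eq. 13. The function names merge_counts and normalize are hypothetical.

```python
def merge_counts(global_counts, local_counts):
    """Combine per-worker count tables into one global table.

    Each worker starts from a copy of global_counts and changes it while
    sampling its own documents, so only its delta is added back.
    """
    merged = list(global_counts)
    for local in local_counts:
        for i, old in enumerate(global_counts):
            merged[i] += local[i] - old
    return merged

def normalize(counts, prior):
    """Turn smoothed counts into a probability vector."""
    total = sum(counts) + prior * len(counts)
    return [(c + prior) / total for c in counts]

# Word counts of one (sentiment, aspect) pair over a 3-term vocabulary.
global_counts = [5, 3, 2]
worker_a = [6, 2, 2]    # worker A moved one token from term 1 to term 0
worker_b = [4, 3, 3]    # worker B moved one token from term 0 to term 2
merged = merge_counts(global_counts, [worker_a, worker_b])
phi_row = normalize(merged, prior=0.01)
# merged == [5, 2, 3]; this table would then be broadcast to all workers.
```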
Experimental setup

Dataset
In this paper, we use two different sets of Amazon reviews, on electronic devices and on books, which we name Electronics and Book, respectively. These datasets are publicly available on the internet. We preprocessed the reviews by removing non-English characters and stop words (based on a stop word list), stemming, and extracting n-gram phrases and replacing them in the reviews. The final Book dataset contains 38,473 documents, 87,836 unique words and 1,272,683 word tokens in total; the Electronics dataset contains 143,828 documents, 224,725 unique words and 6,493,136 word tokens. Statistics before and after the preprocessing phase are summarized in Table 3.
In our experiments, two sentiment lexicons, namely the MPQA and appraisal lexicons, were used to give a score to terms and documents in the preprocessing phase.

Sentiment classification accuracy
To determine the sentiment label of a review, we use the per-document sentiment distribution π (Eq. 12): a review is positive if its positive sentiment probability is equal to or higher than its negative sentiment probability, and negative otherwise. For all datasets used here, each review is accompanied by a user rating on a scale of 1-5. Reviews rated 1 or 2 stars are treated as negative and reviews with other ratings (3, 4 or 5) as positive.
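The decision rule and the mapping from star ratings to gold labels can be sketched as follows. The function names and the index arguments pos_index and neg_index are hypothetical; they only fix which entries of π_d are compared.

```python
def predict_label(pi_d, pos_index, neg_index):
    """Positive if the positive probability is at least the negative one."""
    return "positive" if pi_d[pos_index] >= pi_d[neg_index] else "negative"

def gold_label(stars):
    """Map a 1-5 star rating to a sentiment label (1-2 negative, 3-5 positive)."""
    if not 1 <= stars <= 5:
        raise ValueError("rating must be between 1 and 5")
    return "negative" if stars <= 2 else "positive"

def accuracy(predicted, gold):
    """Fraction of reviews whose predicted label matches the gold label."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

# pi_d is ordered (negative, neutral, positive) in this toy example.
pis = [[0.7, 0.1, 0.2], [0.2, 0.1, 0.7], [0.4, 0.2, 0.4]]
ratings = [1, 5, 2]
predicted = [predict_label(p, pos_index=2, neg_index=0) for p in pis]
gold = [gold_label(r) for r in ratings]
acc = accuracy(predicted, gold)
```

The third review is a tie between positive and negative and is therefore predicted positive, while its 2-star rating makes the gold label negative.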

Precision, recall and F-score
Average precision, recall, and F-score are used to evaluate the correctness of classified reviews in every sentiment label.
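A minimal sketch of these metrics is shown below. It assumes that "average" means an unweighted (macro) average over the sentiment labels, which is one common reading; the function names are hypothetical.

```python
def precision_recall_f1(predicted, gold, label):
    """Precision, recall and F-score of one sentiment label."""
    tp = sum(p == label and g == label for p, g in zip(predicted, gold))
    fp = sum(p == label and g != label for p, g in zip(predicted, gold))
    fn = sum(p != label and g == label for p, g in zip(predicted, gold))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

def macro_average(predicted, gold, labels):
    """Unweighted mean of the per-label (precision, recall, F-score) triples."""
    rows = [precision_recall_f1(predicted, gold, label) for label in labels]
    return tuple(sum(col) / len(rows) for col in zip(*rows))

predicted = ["positive", "positive", "negative", "negative"]
gold = ["positive", "negative", "negative", "negative"]
pos_scores = precision_recall_f1(predicted, gold, "positive")
neg_scores = precision_recall_f1(predicted, gold, "negative")
averaged = macro_average(predicted, gold, ["positive", "negative"])
```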

Experiments
In this section, we present the experimental results of the SSAM model. We performed different experiments to evaluate our proposed model, such as evaluating the sentiment-aspects discovered by SSAM, presenting the sentiment classification performance of SSAM and comparing it against two existing supervised methods and two weakly-supervised sentiment-topic models.

Aspects discovery evaluation
In this experiment, the discovered aspects coupled with a sentiment are evaluated. We use three criteria for the extracted aspects: coherence, specificity and internal correlation. We applied SSAM to the Electronics and Book review datasets and evaluated its modeling power based on these three criteria. In this evaluation, we compared the SSAM results with other sentiment-topic models such as JST and ASUM. Here we analyze the aspects extracted under the positive and negative sentiment labels. Some of the sentiment-aspects that SSAM discovered are presented in Table 4; they were generated under both the positive and negative sentiment labels, and each is shown by its top 15 aspect words. Inspection shows that the aspects extracted by SSAM are specific within every sentiment label. For example, camera size is an aspect of camera that is classified as a negative sentiment, and negative features such as bulky, heavy and not fit support this.
Another example of such extracted sentiment-aspects is politics, which is classified as a negative sentiment because of negative sentiment words such as inconsistent and dissatisfaction.
The extracted aspects are coherent and informative in each class. For example, the aspect computer network has a set of closely related and coherent words such as internet, connect, DSL router and wireless access point, and the aspect picture quality has words such as low quality, not clear, contrast and resolution. Another advantage of SSAM is its ability to extract multiword aspects and sentiments such as picture quality, not clear, camera size, middle east, camera bag, lcd screen, not work and low quality. Two hyperparameters, β and γ, are tuned using the incorporated supervised information. These two hyperparameters play a major role in extracting coherent aspects that are related to a specific sentiment.

Performance comparison of SSAM with two existing supervised methods
Our second experiment reports the results of SSAM in classifying a review as positive or negative sentiment, and compares our model with two supervised methods, Naive Bayes (NB) and Support Vector Machines (SVMs). Besides the classification accuracy, three metrics (recall, precision and F1 score) are reported in Table 5.
The results in Table 5 demonstrate the effectiveness of SSAM in incorporating supervised information into the model inference. A sentiment classifier such as SSAM, which offers high precision and high recall on both negative and positive sentiment, is therefore useful when the majority of reviews are positive.

Performance comparison of SSAM with different levels of scoring
In this section, we show how the proposed model behaves with different aspect number settings on the above-mentioned datasets when different levels of supervised information (term level, sentence level, document level and mixtures of these) are incorporated, with aspect numbers taken from {…, 10, 20, 30, 40, 50, 60}. Table 6 shows the best classification accuracy of SSAM when incorporating prior information extracted from the preprocessing phase at different levels. As can be seen from Fig. 4a, b, incorporating several levels of supervised information together (i.e., term and document) performs better than a single level across the aspect settings on the Book and Electronics datasets. Tables 6 and 7 both show that at the term level of scoring SSAM and JST have almost the same results, and that at the sentence and document levels SSAM and ASUM have similar accuracy, but SSAM with both the document and term levels of scoring gives a significant improvement over the others on all datasets.

Performance comparison of SSAM with existing weakly-supervised sentiment-topic modeling
In this experiment, we compare the sentiment classification performance of SSAM with other existing supervised or weakly-supervised sentiment-topic models, i.e., the Aspect and Sentiment Unification Model (ASUM) and the Joint Sentiment-Topic model (JST). The sentiment classification accuracy results are presented in Fig. 5a, b, and the best classification results are summarized in Table 7. In all aspect number settings, SSAM outperforms the other supervised and weakly-supervised sentiment-topic models. It can be seen from Table 7 that SSAM outperforms JST in accuracy by almost 14% and ASUM by 5% on the Electronics dataset when the aspect number is set to Z = 1. On the Books dataset with the aspect number set to Z = 30, SSAM outperforms JST by almost 11%; although ASUM improves upon JST, its accuracy is nearly 4% lower than that of SSAM. The baseline results in Fig. 5 are calculated from the updated sentiment lexicon by counting its overlap with each review in the corpus: if the count of positive sentiment words in a review is greater than the count of negative words, the review is classified as positive, and vice versa. As can be seen, the baseline results are below 65% for both datasets.
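The lexicon-overlap baseline can be sketched in a few lines. The toy word sets are hypothetical stand-ins for the sentiment lexicon, and the handling of ties (equal counts) is an assumption exposed through the tie parameter, since the description above does not specify it.

```python
def lexicon_baseline(tokens, positive_words, negative_words, tie="positive"):
    """Classify a review by counting its overlap with a sentiment lexicon.

    Assumption: when the two counts are equal, the label given by `tie`
    is returned.
    """
    pos = sum(t in positive_words for t in tokens)
    neg = sum(t in negative_words for t in tokens)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return tie

POSITIVE = {"good", "great", "clear"}       # hypothetical toy lexicon
NEGATIVE = {"bad", "bulky", "heavy"}

label_a = lexicon_baseline("great picture and good sound".split(),
                           POSITIVE, NEGATIVE)
label_b = lexicon_baseline("good lens but bulky and heavy".split(),
                           POSITIVE, NEGATIVE)
```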

Conclusion
In this paper, we described the Supervised Sentiment and Aspect Model (SSAM), which provides a novel framework for sentiment classification. While most other supervised sentiment classification methods can only learn from labeled reviews, SSAM is capable of incorporating different levels of supervision, calculated in the preprocessing phase, to improve sentiment classification performance. These supervised values are used to constrain the asymmetric Dirichlet priors of the document-sentiment and sentiment-aspect word distributions. Results from different experiments show that SSAM outperforms two supervised models (i.e., SVM and NB) and two weakly-supervised sentiment and topic models (i.e., JST and ASUM). Like JST and ASUM, our proposed model requires only a small sentiment lexicon as supervised information, which it uses in the preprocessing phase. SSAM can extract implicit aspects, multiword aspects and multiword sentiments. Our proposed model uses sentence structure and word order in the preprocessing phase and in model inference.

Fig. 1 Document, sentence and term levels of scoring in preprocessing phase

Fig. 3 Supervised Sentiment and Aspect graphical model

Fig. 4 Sentiment classification accuracy by the three different levels of scoring (term, sentence and document) versus different aspect number settings: a Books dataset, b Electronics dataset

Fig. 5 Sentiment classification accuracy by the three topic models (SSAM, ASUM and JST) and the baseline versus different aspect number settings: a Books dataset, b Electronics dataset

Table 1
Extracted N-grams and their types

Table 2
Meanings of the notations
N_{d,k,j}: the number of words that are assigned sentiment label k and aspect j in review d
N_d: the total number of words in review d

Table 5
Performance comparison of SSAM with two supervised approaches
Units are in %; numbers in bold face denote the best result in each metric

Table 6
Performance comparison of SSAM with different levels of scoring

Table 7
Performance comparison of SSAM with two weakly-supervised sentiment-topic models

Compliance with ethical standards
Conflict of interest The authors declare that they have no conflict of interest.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.