1 Introduction

With the expansion of e-commerce and social networking sites, a vast amount of information now exists in social media. Product reviews expressed on social network sites play an influential role in market and business analysis. As the number of reviews increases at a rapid pace, it becomes difficult for end users to analyze the opinions expressed in social media. Thus, there is a need for an opinion mining system that enables the retrieval of opinions. Such a system can be used by enterprises to determine how users perceive their products and how they stand with respect to the competition. It is human nature to depend on other people's opinions and experiences while buying products, so customers can also benefit from automated opinion mining systems. The features of a product distinguish it from similar products and from other brands, and such product features play a crucial role in the decision making process of the potential customer [1, 2, 36]. Thus, our focus in this work is on feature level opinion mining.

As online reviews are too numerous for people to read through, automatically classifying them into opinion orientation categories (e.g. positive/negative) has become an important research problem. Machine learning classifiers dominate opinion classification in the literature [1, 7–10]. Many of the previous studies, however, used a single classifier for the classification task. A few works in the opinion classification literature have shown that combining individual classifiers is an effective technique for improving classification accuracy [11, 12, 28]. One major difficulty of an opinion classification system is the dimensionality of the features used to describe texts. The high dimensionality of the features makes it difficult to apply machine learning algorithms to opinion classification. Thus, reducing the feature set by removing irrelevant features is essential in opinion classification [3, 13, 14].

In this work, we apply supervised machine learning methods to classify reviews. Specifically, we use ensemble based methods on the dataset. The classification models are empirically validated on a data set obtained by crawling opinions about digital cameras from the Amazon website. We pursue several goals. First, we apply principal component analysis (PCA) and perform a component level analysis to obtain the reduced feature set. Second, we compare the effectiveness of the ensemble based methods with two individual statistical learning based models, i.e. logistic regression (LR) and support vector machine (SVM). Finally, we evaluate our models over several n-gram combinations. The experimental and statistical results indicate that the ensemble based hybrid method is effective for review text opinion classification.

The remainder of the paper is organized as follows. Section 2 reviews the related work on opinion mining. Section 3 describes the problem outline used in this work. Section 4 reports the various steps involved in data analysis. The methods used to model the prediction system are introduced in Sect. 5. Section 6 presents the results and Sect. 7 concludes our work.

2 Review of literature

Many interesting works exist that focus on extracting opinions from customer reviews [5, 15–18]. Though many researchers have investigated opinion classification from different perspectives, the use of machine learning for opinion classification dominates [1, 10, 11, 19–24]. Among the machine learning techniques, SVM, naive Bayes (NB) and decision tree approaches have achieved great success in opinion categorization [1, 10, 11, 13, 14, 21, 25]. Besides these, other machine learning methods such as K-nearest neighbor, ID3, C5, the centroid classifier and the winnow classifier are also used for opinion mining [10, 11, 19, 21, 24]. NB in particular has achieved great success in opinion categorization [3, 17, 26]. Although some foundational studies have investigated potential ensemble approaches in the area of opinion classification, research has been limited and more in-depth empirical comparative work is needed [11, 12, 21, 27, 28].

The literature also reveals that the result of opinion mining varies according to the composition and selection method of the features [3, 13, 14]. Different levels of word granularity are used as features for opinion classification: unigrams are used as features in [29, 30], and the combination of unigrams, bigrams and trigrams is used as features in [31]. The high dimensionality of the features obtained from text reviews increases the complexity of text opinion classification. Various feature selection and reduction approaches such as information gain, mutual information, the Chi square test and Fisher's discriminant ratio have been employed in opinion classification [13, 14, 29, 32, 33]. Apart from the work of [25], the opinion mining literature does not report any work using PCA as a feature reduction technique.

2.1 Motivation and contribution

Opinion mining systems are highly domain dependent. The results can vary significantly from one domain to another, which makes opinion mining a very interesting and challenging task. Prior studies have shown that many works in opinion mining address the product domain using a single classifier [1, 16, 32, 34, 35–37]. This motivates us to conduct our analysis on the product domain. PCA is a popular and effective feature reduction technique applied in various other applications [25]. Research on opinion mining combining a feature reduction method with an ensemble learning algorithm has not been reported in the literature so far. We therefore seek to integrate the feature reduction method and ensemble classification algorithms in an efficient way to enhance classification performance. PCA is applied as a feature reduction technique to extract the reduced principal components. The reduced principal components thus obtained from PCA are further analyzed to eliminate the least influential attributes based on the attribute weights. To evaluate the prediction models, different quality parameters are used to capture the various aspects of model quality.

Another contribution of this work is to study the effect of different levels of features (unigrams, bigrams and trigrams) employed to build the opinion mining models. To analyze this relationship clearly, three data models are developed. Model I uses only unigram product attributes as features for classification. Model II uses a combination of unigrams and bigrams. Model III is developed using unigram, bigram and trigram product attributes. For each data model (models I, II and III), a wide range of comparative experiments is conducted by comparing the ensemble based hybrid methods with individual classification methods. Given the importance of text sentiment classification in real-world applications, we believe a comparative study of ensemble based hybrid models in text sentiment classification will greatly benefit application development as well as researchers in related areas.

3 Problem outline

This section describes the problem outline used to develop the prediction models. Figure 1 shows the outline of our work.

Fig. 1 Problem outline

4 Data source

We collected review sentences from the publicly available customer review site (http://www.amazonreviews.com) using a web crawler. In total, we collected 937 customer reviews of digital cameras. Of these, 272 are negative, 355 are positive and 310 are neutral reviews. Outlier analysis is performed as suggested in Briand et al. [38], and outliers are not considered for further processing. In order to obtain a balanced data distribution for our binary classification problem, we consider only 250 positive and 250 negative reviews (500 reviews in total). For each of the positive and negative review sentences, the product attributes discussed in the sentence are collected manually (bag of words). From the bag of words, unique product features are grouped, resulting in a final list of 115 product attributes (features). Among these 115 product attributes, 96 are unigrams, 12 are bigrams and 7 are trigrams. The descriptions of the review dataset models used in the experiments are given in Table 1.

Table 1 Properties of data source

In order to study the influence of word size on classification, three word vector models (models I, II and III) are developed using the respective features mentioned in Table 1. To create the word vector models, the review sentences are preprocessed by tokenization, stop word removal and stemming. After pre-processing, the reviews are represented as bags of words. Model I is represented as a word vector with only unigram attributes. Model II is represented as a word vector with a combination of unigram and bigram attributes. Model III is represented as a word vector with a combination of unigram, bigram and trigram attributes. The word vector models are created based on term occurrences. Each preprocessed product review sentence in the resulting polarity data set is labelled as positive or negative.
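
As an illustrative sketch, the three term-occurrence models can be built with scikit-learn's CountVectorizer. Note that this paper restricts the vocabulary to the manually collected product attributes and applies stemming, whereas the sketch below vectorizes every term; the `reviews` list is a stand-in.

```python
# Sketch of the three word-vector models based on term occurrences.
from sklearn.feature_extraction.text import CountVectorizer

reviews = ["the lens quality is great", "poor battery life"]  # stand-in sentences

ngram_ranges = {"I": (1, 1), "II": (1, 2), "III": (1, 3)}  # unigrams / +bigrams / +trigrams
word_vectors = {
    name: CountVectorizer(ngram_range=r, stop_words="english").fit_transform(reviews)
    for name, r in ngram_ranges.items()
}
```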

4.1 Feature reduction (Independent variable)

Principal component analysis is a widely used statistical method for reducing the dimensionality of a feature set. Assuming X is an n × m matrix of standardized word vector data, with n reviews and m product attributes, the principal components algorithm works as follows.

i. Calculate the covariance matrix.

ii. Calculate the eigenvalues and eigenvectors.

iii. Reduce the dimensionality of the data.

iv. Calculate a standardized transformation matrix T.

v. Calculate the domain features (p) for the reviews.

The final result is an n × p matrix of domain features. Using the RapidMiner tool, the principal components for each of models I, II and III are identified. The stopping rule used is 'eigenvalue > 1'. Due to this stopping rule, the number of principal components for each model (I, II and III) is cut down to one (PC1). PC1 represents the reduced dimension obtained by the stopping rule. One component (PC1) with 50.7 % variance is obtained for model I; one component with 52.9 % variance is obtained for model II and one with 53.7 % for model III. Due to the stopping rule chosen, the percentage of variance explained is modest. In order to justify the choice of PC1 alone as the reduced component, a component level analysis is done. Much of the literature shows that SVM and NB are strong methods in opinion classification [1, 6, 7, 39–43], and SVM and NB are the base classifiers in our ensemble based approaches, so we use the SVM and NB classifiers in this component level empirical analysis. The accuracy is measured using the SVM and NB classifiers in conjunction with (and without) the use of PCA and PC1. Table 2 shows the results of evaluations using tenfold cross validation.
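
For concreteness, here is a minimal numerical sketch of the steps above with the 'eigenvalue > 1' stopping rule (the paper itself uses RapidMiner's PCA operator), assuming X is the standardized n × m word vector matrix:

```python
# Sketch of PCA feature reduction with the 'eigenvalue > 1' stopping rule.
import numpy as np

def pca_reduce(X):
    # i. covariance matrix of the standardized data
    cov = np.cov(X, rowvar=False)
    # ii. eigenvalues and eigenvectors, sorted in decreasing eigenvalue order
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # iii./iv. keep components whose eigenvalue exceeds 1 (stopping rule)
    keep = eigvals > 1.0
    T = eigvecs[:, keep]                       # standardized transformation matrix
    explained = eigvals[keep] / eigvals.sum()  # variance explained per component
    # v. project the reviews onto the retained components (n x p matrix)
    return X @ T, explained
```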

Table 2 PCA performance (accuracy) of SVM and NB

The accuracy is better with PC1 alone as the component model (Table 2). To reduce the attributes of PC1, an empirical analysis is done to find the influence of the attributes in PC1. Figure 2 shows the RapidMiner work flow for the PCA component level analysis.

Fig. 2 RapidMiner work flow for PCA component level analysis

4.2 Component level analysis

In order to find the dominant attributes of the reduced principal component PC1, a component level analysis is done. In this analysis, the accuracy of the SVM and NB classifiers is measured in conjunction with different attribute weights of PC1. The attributes in PC1 are sorted in decreasing order of weight, ranging from 1 to 0. The numbers of attributes chosen for models I, II and III are based on the attribute weights, as shown in Table 3.

Table 3 Number of attributes for attribute Weights of PC1

The classification performance is measured using tenfold cross validation. It can be observed from Figs. 3 and 4 that accuracy increases with the number of attributes, but once the accuracy reaches a certain level, the performance of the classifiers stays the same or worsens. Thus, it is evident that the accuracy of the classifiers is influenced by the choice of the number of attributes, which in turn is based on the attribute weights of the principal component (PC1). When the number of attributes of PC1 is 25, 36 and 42 for models I, II and III respectively, both classifiers improve significantly in classification accuracy; beyond these points, the classification accuracy degrades, with little variation between classifiers for all models. This suggests that model I with 25 attributes, model II with 36 attributes and model III with 42 attributes are sufficiently optimal for the classifiers to perform a good input/output mapping. Thus, the irrelevant attributes of PC1 can be removed to improve classifier performance. As a result of this analysis, the reduced feature lists for models I, II and III are shown in Tables 4, 5 and 6 respectively.
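
A sketch of this attribute-weight sweep, assuming X, y and the PC1 loading weights are available (the names are illustrative placeholders; the paper performs this analysis in RapidMiner):

```python
# Accuracy of SVM and NB over increasing numbers of top-weighted PC1
# attributes, measured by tenfold cross validation.
import numpy as np
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

def accuracy_by_attribute_count(X, y, pc1_weights, counts):
    ranked = np.argsort(pc1_weights)[::-1]  # attributes by decreasing PC1 weight
    results = {}
    for k in counts:
        cols = ranked[:k]                   # keep only the k most influential attributes
        for name, clf in [("SVM", SVC()), ("NB", GaussianNB())]:
            results[(name, k)] = cross_val_score(clf, X[:, cols], y, cv=10).mean()
    return results
```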

Fig. 3 Accuracy of SVM with varying number of attributes in PC1

Fig. 4 Accuracy of NB with varying number of attributes in PC1

Table 4 Attribute list for Model I
Table 5 Attribute list for Model II
Table 6 Attribute list for Model III

To perform classification, the word vectors for models I, II and III are reconstructed for all review sentences using the reduced feature sets in Tables 4, 5 and 6. These vector models are used to compare the classification performance of two ensemble based classification models, i.e. bagging and Bayesian boosting, with that of two individual classifier models.

5 Classification methods

This section discusses the classification methods used in this work to develop the prediction system. The classification methods are employed using the Weka tool with default values for all parameters.

5.1 Baseline methods

Support vector machines are powerful classifiers arising from statistical learning theory that have proven to be efficient for various classification tasks in text categorization. SVM belongs to a family of generalized linear classifiers. It is a supervised machine learning approach that classifies by finding the hyperplane maximizing the minimum distance between the plane and the training points. LR is a standard technique based on maximum likelihood estimation. The first step in logistic methods is identifying which combination of independent variables best estimates the dependent variable; this is known as model selection. The model is used with default values for the classification parameters [10].
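
As a hedged sketch, both baselines can be run under tenfold cross validation as follows; the synthetic data is a stand-in for the reduced word vectors.

```python
# Baseline SVM and LR classifiers evaluated with tenfold cross validation.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=25, random_state=0)  # placeholder data
for name, clf in [("SVM", SVC()), ("LR", LogisticRegression(max_iter=1000))]:
    print(name, cross_val_score(clf, X, y, cv=10).mean())
```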

5.2 Bagging

The main idea of bagging is to construct each member of the ensemble from a different training dataset and to combine their predictions by uniform averaging over class labels [44]. The bagging algorithm creates an ensemble of models for a learning scheme where each model gives an equally weighted prediction [11, 21, 28]. A bootstrap sample of S items is selected uniformly at random with replacement; that is, each classifier is trained on a sample of examples drawn with replacement from the training set, with each sample equal in size to the original training set. The individual classifiers are then aggregated to make a collective decision using majority voting. Bagging therefore produces a combined model that often performs better than a single model built from the original training set.
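
A minimal sketch of the bagged-SVM hybrid (scikit-learn ≥ 1.2 API); the number of bootstrap replicates is an assumption, as the paper does not state it, and the synthetic data stands in for the PCA-reduced word vectors.

```python
# Bagged SVM: each base SVM is trained on a bootstrap sample the size of
# the original training set, and predictions are aggregated by voting.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=25, random_state=0)  # placeholder data
bagged_svm = BaggingClassifier(
    estimator=SVC(),   # base learner, trained per bootstrap sample
    n_estimators=10,   # number of bootstrap replicates (assumed, not stated in the paper)
    max_samples=1.0,   # each sample as large as the original training set
    bootstrap=True,    # sampling with replacement
)
bagged_svm.fit(X, y)
```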

5.3 Bayesian boosting

Boosting is an iterative process that adaptively changes the distribution of training examples so that the base classifiers focus on examples that are hard to classify. Boosting has become an alternative framework for classifier design, alongside more established classifiers such as the Bayesian classifier. Here, the NB classifier is used as the inner classifier and the number of iterations for combining the classifiers is 10. Other parameters are used with default values [11, 12, 21, 27, 28].
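
The Bayesian boosting used here is a RapidMiner operator with no direct scikit-learn counterpart; as a rough stand-in for illustration, AdaBoost over a naive Bayes base learner with 10 iterations captures the same boosted-NB idea.

```python
# Boosting with a naive Bayes inner classifier and 10 iterations
# (an AdaBoost approximation of the paper's Bayesian boosting setup).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, n_features=25, random_state=0)  # placeholder data
boosted_nb = AdaBoostClassifier(estimator=GaussianNB(), n_estimators=10)
boosted_nb.fit(X, y)
```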

6 Results and discussion

The prediction systems are developed using each of the methods discussed in Sect. 5 for models I, II and III. The results are shown in Tables 7, 8, 9, 10, 11, 12 and 13. For 10-fold cross validation, the data set is first partitioned into ten equal-sized sets, and each set is in turn used as the test set while the classifier trains on the other nine. The results obtained on the test data are evaluated first using the misclassification rate.

Table 7 Results of LR
Table 8 Results of SVM
Table 9 Results of bagged SVM
Table 10 Results of bayesian boosting
Table 11 Results of correctness of classifiers
Table 12 Results of completeness of classifiers
Table 13 Results of effectiveness of classifiers

The misclassification rate is defined as the ratio of the number of wrongly classified reviews to the total number of reviews classified by the prediction system. The wrong classifications fall into two categories: if negative reviews are classified as positive (C1), this is termed a type I error; if positive reviews are classified as negative (C2), this is termed a type II error.

$$ \text{Type I error} = \frac{C1}{\text{total no. of positive reviews}} $$
$$ \text{Type II error} = \frac{C2}{\text{total no. of negative reviews}} $$
$$ \text{Overall misclassification rate} = \frac{C1 + C2}{\text{total no. of reviews}} $$
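
As a minimal sketch, these three measures translate directly into code (the argument names are illustrative):

```python
# Error measures as defined above: c1 = negatives predicted positive,
# c2 = positives predicted negative, n_pos/n_neg = actual class sizes.
def error_rates(c1, c2, n_pos, n_neg):
    type1 = c1 / n_pos                     # C1 / total no. of positive reviews
    type2 = c2 / n_neg                     # C2 / total no. of negative reviews
    overall = (c1 + c2) / (n_pos + n_neg)  # overall misclassification rate
    return type1, type2, overall
```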

The obtained results are compared to the actual opinions and the four quality parameters are computed. Tables 7, 8, 9 and 10 summarize the misclassification results. G1 refers to the positive group and G2 refers to the negative group. The possible misclassification outcomes are presented in the inner matrix of the tables: G1G2 (actual positive, predicted negative, i.e. a type II error) and G2G1 (actual negative, predicted positive, i.e. a type I error). The overall misclassification rate is given at the bottom of each matrix.

6.1 Performance of individual classifiers

The classification results obtained for the LR and SVM methods are given in Tables 7 and 8 respectively. In Table 7, the classification results for LR show that the type II error is lower than the type I error for all three models (I, II and III). This indicates that the LR method predicts positive reviews more accurately than negative reviews. Among the models, model I performs better in terms of both type I and type II error than the other two data models (II and III); consequently, the overall misclassification of LR is also lowest for model I. Table 8 gives the classification results in terms of the error measures for the SVM method. The type I and type II errors are considerably lower than for the LR method, which shows the superiority of SVM, and the overall misclassification is accordingly lower than for LR across all three models. Like LR, the SVM method predicts positive reviews more accurately than negative reviews (the type II error is lower than the type I error) for models II and III. Among the models, SVM again performs better for model I than for models II and III.

6.2 Performance of ensemble based hybrid classifiers

Tables 9 and 10 present the results of the hybrid bagged SVM and hybrid boosting predictions respectively. Table 9 shows that the overall misclassification rate of bagged SVM is reduced considerably for models I, II and III compared to SVM (the best individual classification method found in Sect. 6.1). This reflects the higher prediction accuracy of the bagged ensemble based method compared to the individual classifiers. The classification results also show that the type II error is lower than the type I error for all three models, indicating that the bagged SVM based hybrid method likewise predicts positive reviews more accurately than negative reviews. In general, bagged SVM with PCA reduction performs better for model I than for models II and III.

Table 10 gives the results of the Bayesian boosting based hybrid prediction. The overall misclassification rate is reduced considerably for models I, II and III compared to the best individual classification method identified in Sect. 6.1 (SVM); the Bayesian boosting ensemble based hybrid method thus performs better than SVM. However, the bagged SVM based hybrid method dominates the Bayesian boosting based method in terms of type I error, type II error, and hence overall misclassification rate, for model I, while the Bayesian boosting method dominates the bagged SVM based model for models II and III with lower type I and type II errors.

In general, the results in Tables 7, 8, 9 and 10 show that the ensemble based hybrid approach performs better than the individual classification methods. The performance of the bagged SVM ensemble model is appreciable for model I, and Bayesian boosting is better for the other models (II and III). Among the models, model I achieves high accuracy for all classification methods used except Bayesian boosting (Figs. 5, 6).

Fig. 5 Overall misclassification rate of data models

Fig. 6 Overall misclassification rate of classifiers

6.3 Quality metrics

In addition to the misclassification rate, the following quality metrics are evaluated.

6.3.1 Correctness

Correctness is defined as the ratio of the number of reviews correctly classified as positive to the total number of reviews classified as positive.

6.3.2 Completeness

Completeness is defined as the ratio of the number of positive reviews classified as positive to the total number of actual positive reviews.

6.3.3 Effectiveness

Effectiveness is defined as the proportion of positive reviews considered high risk out of all reviews. Letting the type II misclassification rate be Pr(nfp | fp),

$$ \text{Effectiveness} = \Pr(fp \mid fp) = 1 - \Pr(nfp \mid fp) $$
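
A sketch of these quality metrics from binary predictions follows; note that the effectiveness line follows the formula above literally, reading Pr(nfp | fp) as the type II misclassification rate, which is one plausible interpretation of the terse definition.

```python
# Correctness, completeness and effectiveness from predictions,
# assuming labels are 1 (positive) and 0 (negative).
import numpy as np

def quality_metrics(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))   # negatives classified positive
    fn = np.sum((y_true == 1) & (y_pred == 0))   # positives classified negative
    correctness = tp / (tp + fp)                 # correct positives / classified positive
    completeness = tp / (tp + fn)                # correct positives / actual positives
    effectiveness = 1 - fn / (tp + fn)           # 1 - Pr(nfp | fp), per the formula above
    return correctness, completeness, effectiveness
```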

Tables 11, 12 and 13 summarize the various quality measures for all the classification methods used in the analysis; the measures themselves are defined above. From the results in Table 11, it is found that the SVM and LR based models (I, II and III) lead to low correctness, which implies that a large number of sentences that are not truly positive/negative would have to be inspected. The correctness value is much higher for the hybrid ensemble methods than for the other methods across models I, II and III. Among the hybrid classifiers, the highest correctness of 83.3 % is achieved by model III of Bayesian boosting. Within each classification method, the classification results are best for model I for SVM, LR and bagged SVM. This indicates that unigrams alone, rather than the combination of unigrams, bigrams and trigrams, have a strong relationship to review classification. In general, the hybrid ensemble based models classify the reviews very accurately, with high correctness.

The completeness of the classification models is shown in Table 12, which shows that the hybrid ensemble based methods recover the most positive and negative reviews compared to the other methods for all models. Among the three models, model I of bagged SVM predicts the maximum number of positive and negative reviews, with a high completeness of 81.6 %. The effectiveness of the models is presented in Table 13. Effectiveness captures the productive effort spent in inspecting the truly positive and negative review sentences. Bagged SVM proves to be the most effective for model I, and Bayesian boosting is the most effective for models II and III. Among the classification models used, the hybrid methods classify the reviews with better effectiveness. The high effectiveness of model III for Bayesian boosting (81.3 %) indicates that the effort wasted during analysis is minimal.

In general, our experimental results show that, among the classification methods used, the hybrid ensemble methods perform better on all quality measures. Among the hybrid classification methods, bagged SVM achieves the best performance on all quality measures for model I, while models II and III suit the Bayesian boosting classification method better. Thus, for Bayesian boosting, the inclusion of bigrams and trigrams provides better performance than using unigrams alone. Moreover, the results also show that PCA is a suitable dimension reduction method for ensemble based methods.

6.4 Statistical significance test

We applied the nonparametric McNemar's statistical test to compare the performance of the best trained classifiers. The comparison based on McNemar's test showed that the ensemble based hybrid method performs better. The null hypothesis (H0) for this experimental design states that the classifiers perform similarly, whereas the alternative hypothesis (H1) claims that at least one of the classifiers performs differently. The z scores indicate whether we should accept H0 and reject H1 or vice versa. In order to calculate the z scores, the classification result of each classifier must be recorded for each individual instance.
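
A minimal sketch of the pairwise McNemar z score, assuming pred_a and pred_b are two classifiers' predictions on the same instances and y holds the gold labels; |z| > 2.58 roughly corresponds to the 99 % confidence level cited below.

```python
# McNemar z score with continuity correction for two classifiers.
import numpy as np

def mcnemar_z(y, pred_a, pred_b):
    a_only = np.sum((pred_a == y) & (pred_b != y))  # A correct, B wrong
    b_only = np.sum((pred_b == y) & (pred_a != y))  # B correct, A wrong
    return (abs(a_only - b_only) - 1) / np.sqrt(a_only + b_only)
```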

In Tables 14, 15 and 16, the arrowhead ↑ denotes that the classifier in the table row header performed better on the given dataset and ← denotes that the classifier in the table column performed better. Z scores are given next to the arrowheads as a measure of how statistically significant the results are. From the McNemar's test results for model I (Table 14), it is deduced that bagged SVM produced significantly better results than the SVM, LR and Bayesian boosting classifiers (H1 is accepted with a confidence level of more than 99.5 %). The SVM classifier performed better than LR for model I. In Table 15, the McNemar's test results for the model II data set show that bagged SVM performs better than SVM and LR, and that Bayesian boosting performs better than bagged SVM. The performance differences between the ensemble classifiers and the individual classifiers were found to be statistically significant for model I (H1 is accepted with a confidence level of more than 99 %). For all three models, SVM performs statistically better than LR. For model III (Table 16), Bayesian boosting performs better than bagged SVM among the ensemble methods, so hypothesis H1 is accepted with a confidence level of more than 99.5 %. We calculated the p values for one-tailed McNemar's tests comparing our ensemble based approaches with the baselines. The resulting p values show that the bagged SVM based hybrid approach is significantly better than the other approaches for model I; this improvement is statistically significant at p < 0.005. For model II, Bayesian boosting is statistically significantly better than the other approaches (p < 0.001), and for model III, Bayesian boosting is statistically significantly better than the other approaches (p < 0.005).

Table 14 McNemar’s test results: model I
Table 15 McNemar’s test results: model II
Table 16 McNemar’s test results: model III

6.5 Threats to validity

This work does not consider neutral reviews for classification, i.e. it does not address multi-class classification. Moreover, the performance of the classifiers is evaluated on product reviews only, and opinion analysis is domain specific, so the hybrid methods need to be evaluated on other application domains. The product attributes are selected from the review sentences manually, which cannot be assured to be 100 % accurate; a suitable part of speech tagging approach may be employed instead.

7 Conclusion

In the development of prediction models to classify reviews, more reliable approaches are expected to reduce misclassifications. In this paper, two ensemble based hybrid approaches are introduced that perform better than the statistical baseline approaches. Among the methods used, the combination of ensembles and PCA was highly robust for models I, II and III, as shown through the various quality parameters. Bagged SVM dominates for the unigram model, with a reduction in overall misclassification rate of 3.3 % compared to the Bayesian boosting based hybrid model. Bayesian boosting performs better for the combination of unigrams, bigrams and trigrams, with a reduction in overall misclassification rate of 2.9 % compared to the bagged SVM based method. In future, the performance of the hybrid classifiers should be evaluated on various other domains. Different hybrid combinations of soft computing techniques can also be investigated. The use of PCA as a feature reduction technique should be compared against other feature selection methods such as information gain and mutual information. The effect of feature reduction methods (PCA, Fisher's linear discriminant ratio, latent semantic indexing) combined with other ensemble methods such as stacking and voting can be explored as an extension of this work.