1 Introduction

With the expansion of e-commerce and social networking sites, a vast amount of information now exists in social media. Product reviews expressed on social network sites play an influential role in market and business analysis. As the number of reviews increases at a rapid pace, it becomes difficult for end users to analyze the opinions expressed in social media. Thus, there is a need for an opinion mining system that enables the retrieval of opinions. Such a system can be used by enterprises to determine how users perceive their products and how they stand with respect to the competition. It is human nature to depend on other people's opinions and experiences while buying products, so customers can also benefit from automated opinion mining systems. The features of a product distinguish it from similar products and from other brands, and such product features play a crucial role in the decision making process of the potential customer [1, 2, 36]. Thus, our focus in this work is on feature level opinion mining.

As online reviews are too numerous for people to read through, automatically classifying them into opinion orientation categories (e.g. positive/negative) has become an important research problem. Machine learning classifiers dominate opinion classification in the literature [1, 7–10]. Many of the previous studies, however, used a single classifier for the classification task. A few works in the opinion classification literature have shown that combining individual classifiers is an effective technique for improving classification accuracy [11, 12, 28]. One major difficulty of an opinion classification system is the dimensionality of the features used to describe texts. The high dimensionality of the features makes it difficult to apply machine learning algorithms to opinion classification. Thus, reducing the feature set by removing irrelevant features is essential in opinion classification [3, 13, 14].

In this work, we apply supervised machine learning methods to classify reviews. Specifically, we use ensemble based methods on the dataset. The classification models are empirically validated on a data set obtained by crawling opinions about digital cameras from the Amazon website. We pursue several goals. First, we apply principal component analysis (PCA) and perform a component level analysis to obtain the reduced feature set. Second, we compare the effectiveness of the ensemble based methods with two individual statistical learning based models, i.e. logistic regression (LR) and support vector machine (SVM). Finally, we evaluate our models over several n-gram combinations. The experimental and statistical results indicate that the ensemble based hybrid method is effective for review text opinion classification.

The remainder of the paper is organized as follows. Section 2 reviews the related work on opinion mining. Section 3 describes the problem outline used in this work. Section 4 reports the various steps involved in data analysis. The methods used to model the prediction system are introduced in Sect. 5. Section 6 presents the results and Sect. 7 concludes our work.

2 Review of literature

Many interesting works exist that focus on extracting opinions from customer reviews [5, 15–18]. Though many researchers have investigated opinion classification from different perspectives, the use of machine learning for opinion classification dominates [1, 10, 11, 19–24]. Among the machine learning techniques, SVM, naive Bayes (NB) and decision tree approaches have achieved great success in opinion categorization [1, 10, 11, 13, 14, 21, 25]. Besides these, other machine learning methods such as K-nearest neighbor, ID3, C5, the centroid classifier and the winnow classifier are also used for opinion mining [10, 11, 19, 21, 24]. NB in particular has achieved great success in opinion categorization [3, 17, 26]. Although some foundational studies have investigated potential ensemble approaches in the area of opinion classification, research has been limited and more in-depth empirical comparative work is needed [11, 12, 21, 27, 28].

The literature also reveals that the result of opinion mining varies according to the composition and selection method of the features [3, 13, 14]. Different levels of word granularity are used as features for opinion classification: unigrams are used as features in [29, 30], and the combination of unigrams, bigrams and trigrams is used as features in [31]. The high dimensionality of the features obtained from text reviews increases the complexity of text opinion classification. Various feature selection and reduction approaches such as information gain, mutual information, the Chi square test and Fisher's discriminant ratio have been employed in opinion classification [13, 14, 29, 32, 33]. Apart from the work of [25], the opinion mining literature does not report any work using PCA as a feature reduction technique.

2.1 Motivation and contribution

Opinion mining systems are highly domain dependent. The results can vary significantly from one domain to another, which makes opinion mining a very interesting and challenging task. Prior studies have shown that many works in opinion mining address the product domain using a single classifier [1, 16, 32, 34, 35–37]. This motivates us to conduct our analysis on the product domain. PCA is a popular and effective feature reduction technique applied in various other applications [25]. Research on opinion mining combining a feature reduction method with an ensemble learning algorithm has not been reported in the literature so far. We therefore seek to integrate the feature reduction method and ensemble classification algorithms in an efficient way to enhance classification performance. PCA is applied as a feature reduction technique to extract the reduced principal components. The reduced principal components thus obtained from PCA are further analyzed to eliminate the least influential attributes based on the attribute weights. To evaluate the prediction models, different quality parameters are used to capture the various aspects of model quality.

Another contribution of this work is to study the effect of different levels of features (unigrams, bigrams and trigrams) employed to build the opinion mining models. To analyze this relationship clearly, three data models are developed. Model I uses only unigram product attributes as features for classification. Model II uses a combination of unigrams and bigrams. Model III is developed using unigram, bigram and trigram product attributes. For each data model (models I, II and III), a wide range of comparative experiments is conducted by comparing the ensemble based hybrid methods with individual classification methods. Given the importance of text sentiment classification in real-world applications, we believe a comparative study of ensemble based hybrid models in text sentiment classification will greatly benefit application development as well as researchers in related areas.

3 Problem outline

This section describes the problem outline used to develop the prediction models. Figure 1 shows the outline of our work.

Fig. 1 Problem outline

4 Data source

We collected review sentences from the publicly available customer review site (http://www.amazonreviews.com) using a web crawler. In total, we collected 937 customer reviews of digital cameras. Of these, 272 are negative, 355 are positive and 310 are neutral reviews. Outlier analysis is performed as suggested in Briand et al. [38], and outliers are not considered for further processing. In order to obtain a balanced data distribution for our binary classification problem, we consider only 250 positive and 250 negative reviews (500 reviews in total). For each of the positive and negative review sentences, the product attributes discussed in the sentence are collected manually (bag of words). From the bag of words, unique product features are grouped, resulting in a final list of 115 product attributes (features). Among these 115 product attributes, 96 are unigrams, 12 are bigrams and 7 are trigrams. The descriptions of the review dataset models used in the experiments are given in Table 1.

Table 1 Properties of data source

In order to study the influence of word size on classification, three word vector models (models I, II and III) are developed using the respective features mentioned in Table 1. To create the word vector models, the review sentences are preprocessed by tokenization, stop word removal and stemming. After pre-processing, the reviews are represented as bags of words. Model I is represented as a word vector with only unigram attributes. Model II is represented as a word vector with a combination of unigram and bigram attributes. Model III is represented as a word vector with a combination of unigram, bigram and trigram attributes. The word vector models are created based on term occurrences. Each preprocessed product review sentence in the resulting polarity data set is labelled as positive or negative.
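
As an illustrative sketch, the three term-occurrence models can be built with scikit-learn's CountVectorizer. Note that this paper restricts the vocabulary to the manually collected product attributes and applies stemming, whereas the sketch below vectorizes every term; the `reviews` list is a stand-in.

```python
# Sketch of the three word-vector models based on term occurrences.
from sklearn.feature_extraction.text import CountVectorizer

reviews = ["the lens quality is great", "poor battery life"]  # stand-in sentences

ngram_ranges = {"I": (1, 1), "II": (1, 2), "III": (1, 3)}  # unigrams / +bigrams / +trigrams
word_vectors = {
    name: CountVectorizer(ngram_range=r, stop_words="english").fit_transform(reviews)
    for name, r in ngram_ranges.items()
}
```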

4.1 Feature reduction (Independent variable)

Principal component analysis is a widely used statistical method for reducing the dimensionality of a feature set. Assuming X is an n × m matrix of standardized word vector data, with n reviews and m product attributes, the principal components algorithm works as follows.

i. Calculate the covariance matrix.

ii. Calculate the eigenvalues and eigenvectors.

iii. Reduce the dimensionality of the data.

iv. Calculate a standardized transformation matrix T.

v. Calculate the domain features (p) for the reviews.

The final result is an n × p matrix of domain features. Using the RapidMiner tool, the principal components for each of models I, II and III are identified. The stopping rule used is 'eigenvalue > 1'. Due to this stopping rule, the number of principal components for each model (I, II and III) is cut down to one (PC1). PC1 represents the reduced dimension obtained by the stopping rule. One component (PC1) with 50.7 % variance is obtained for model I; one component with 52.9 % variance is obtained for model II and one with 53.7 % for model III. Due to the stopping rule chosen, the percentage of variance explained is modest. In order to justify the choice of PC1 alone as the reduced component, a component level analysis is done. Much of the literature shows that SVM and NB are strong methods in opinion classification [1, 6, 7, 39–43], and SVM and NB are the base classifiers in our ensemble based approaches, so we use the SVM and NB classifiers in this component level empirical analysis. The accuracy is measured using the SVM and NB classifiers in conjunction with (and without) the use of PCA and PC1. Table 2 shows the results of evaluations using tenfold cross validation.
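
For concreteness, here is a minimal numerical sketch of the steps above with the 'eigenvalue > 1' stopping rule (the paper itself uses RapidMiner's PCA operator), assuming X is the standardized n × m word vector matrix:

```python
# Sketch of PCA feature reduction with the 'eigenvalue > 1' stopping rule.
import numpy as np

def pca_reduce(X):
    # i. covariance matrix of the standardized data
    cov = np.cov(X, rowvar=False)
    # ii. eigenvalues and eigenvectors, sorted in decreasing eigenvalue order
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # iii./iv. keep components whose eigenvalue exceeds 1 (stopping rule)
    keep = eigvals > 1.0
    T = eigvecs[:, keep]                       # standardized transformation matrix
    explained = eigvals[keep] / eigvals.sum()  # variance explained per component
    # v. project the reviews onto the retained components (n x p matrix)
    return X @ T, explained
```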

Table 2 PCA performance (accuracy) of SVM and NB

The accuracy is better with PC1 alone as the component model (Table 2). To reduce the attributes of PC1, an empirical analysis is done to find the influence of the attributes in PC1. Figure 2 shows the RapidMiner work flow for the PCA component level analysis.

Fig. 2 RapidMiner work flow for PCA component level analysis

4.2 Component level analysis

In order to find the dominant attributes of the reduced principal component PC1, a component level analysis is done. In this analysis, the accuracy of the SVM and NB classifiers is measured in conjunction with different attribute weights of PC1. The attributes in PC1 are sorted in decreasing order of weight, ranging from 1 to 0. The numbers of attributes chosen for models I, II and III are based on the attribute weights, as shown in Table 3.

Table 3 Number of attributes for attribute Weights of PC1

The classification performance is measured using tenfold cross validation. It can be observed from Figs. 3 and 4 that accuracy increases with the number of attributes, but once the accuracy reaches a certain level, the performance of the classifiers stays the same or worsens. Thus, it is evident that the accuracy of the classifiers is influenced by the choice of the number of attributes, which in turn is based on the attribute weights of the principal component (PC1). When the number of attributes of PC1 is 25, 36 and 42 for models I, II and III respectively, both classifiers improve significantly in classification accuracy; beyond these points, the classification accuracy degrades, with little variation between classifiers for all models. This suggests that model I with 25 attributes, model II with 36 attributes and model III with 42 attributes are sufficiently optimal for the classifiers to perform a good input/output mapping. Thus, the irrelevant attributes of PC1 can be removed to improve classifier performance. As a result of this analysis, the reduced feature lists for models I, II and III are shown in Tables 4, 5 and 6 respectively.
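
A sketch of this attribute-weight sweep, assuming X, y and the PC1 loading weights are available (the names are illustrative placeholders; the paper performs this analysis in RapidMiner):

```python
# Accuracy of SVM and NB over increasing numbers of top-weighted PC1
# attributes, measured by tenfold cross validation.
import numpy as np
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

def accuracy_by_attribute_count(X, y, pc1_weights, counts):
    ranked = np.argsort(pc1_weights)[::-1]  # attributes by decreasing PC1 weight
    results = {}
    for k in counts:
        cols = ranked[:k]                   # keep only the k most influential attributes
        for name, clf in [("SVM", SVC()), ("NB", GaussianNB())]:
            results[(name, k)] = cross_val_score(clf, X[:, cols], y, cv=10).mean()
    return results
```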

Fig. 3 Accuracy of SVM with varying number of attributes in PC1

Fig. 4 Accuracy of NB with varying number of attributes in PC1

Table 4 Attribute list for Model I
Table 5 Attribute list for Model II
Table 6 Attribute list for Model III

To perform classification, the word vectors for models I, II and III are reconstructed for all review sentences using the reduced feature sets in Tables 4, 5 and 6. These vector models are used to compare the classification performance of two ensemble based classification models, i.e. bagging and Bayesian boosting, with that of two individual classifier models.

5 Classification methods

This section discusses the classification methods used in this work to develop the prediction system. The classification methods are employed using the Weka tool with default values for all parameters.

5.1 Baseline methods

Support vector machines are powerful classifiers arising from statistical learning theory that have proven to be efficient for various classification tasks in text categorization. SVM belongs to a family of generalized linear classifiers. It is a supervised machine learning approach that classifies by finding the hyperplane maximizing the minimum distance between the plane and the training points. LR is a standard technique based on maximum likelihood estimation. The first step in logistic methods is identifying which combination of independent variables best estimates the dependent variable; this is known as model selection. The model is used with default values for the classification parameters [10].
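
As a hedged sketch, both baselines can be run under tenfold cross validation as follows; the synthetic data is a stand-in for the reduced word vectors.

```python
# Baseline SVM and LR classifiers evaluated with tenfold cross validation.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=25, random_state=0)  # placeholder data
for name, clf in [("SVM", SVC()), ("LR", LogisticRegression(max_iter=1000))]:
    print(name, cross_val_score(clf, X, y, cv=10).mean())
```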

5.2 Bagging

The main idea of bagging is to construct each member of the ensemble from a different training dataset and to combine their predictions by uniform averaging over class labels [44]. The bagging algorithm creates an ensemble of models for a learning scheme where each model gives an equally weighted prediction [11, 21, 28]. A bootstrap sample of S items is selected uniformly at random with replacement; that is, each classifier is trained on a sample of examples drawn with replacement from the training set, with each sample equal in size to the original training set. The individual classifiers are then aggregated to make a collective decision using majority voting. Bagging therefore produces a combined model that often performs better than a single model built from the original training set.
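
A minimal sketch of the bagged-SVM hybrid (scikit-learn ≥ 1.2 API); the number of bootstrap replicates is an assumption, as the paper does not state it, and the synthetic data stands in for the PCA-reduced word vectors.

```python
# Bagged SVM: each base SVM is trained on a bootstrap sample the size of
# the original training set, and predictions are aggregated by voting.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=25, random_state=0)  # placeholder data
bagged_svm = BaggingClassifier(
    estimator=SVC(),   # base learner, trained per bootstrap sample
    n_estimators=10,   # number of bootstrap replicates (assumed, not stated in the paper)
    max_samples=1.0,   # each sample as large as the original training set
    bootstrap=True,    # sampling with replacement
)
bagged_svm.fit(X, y)
```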

5.3 Bayesian boosting

Boosting is an iterative process that adaptively changes the distribution of training examples so that the base classifiers focus on examples that are hard to classify. Boosting has become an alternative framework for classifier design, alongside more established classifiers such as the Bayesian classifier. Here, the NB classifier is used as the inner classifier and the number of iterations for combining the classifiers is 10. Other parameters are used with default values [11, 12, 21, 27, 28].
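
The Bayesian boosting used here is a RapidMiner operator with no direct scikit-learn counterpart; as a rough stand-in for illustration, AdaBoost over a naive Bayes base learner with 10 iterations captures the same boosted-NB idea.

```python
# Boosting with a naive Bayes inner classifier and 10 iterations
# (an AdaBoost approximation of the paper's Bayesian boosting setup).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, n_features=25, random_state=0)  # placeholder data
boosted_nb = AdaBoostClassifier(estimator=GaussianNB(), n_estimators=10)
boosted_nb.fit(X, y)
```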

6 Results and discussion

The prediction systems are developed using each of the methods discussed in Sect. 5 for models I, II and III. The results are shown in Tables 7, 8, 9, 10, 11, 12 and 13. For 10-fold cross validation, the data set is first partitioned into ten equal-sized sets, and each set is in turn used as the test set while the classifier trains on the other nine. The results obtained on the test data are evaluated first using the misclassification rate.

Table 7 Results of LR
Table 8 Results of SVM
Table 9 Results of bagged SVM
Table 10 Results of bayesian boosting
Table 11 Results of correctness of classifiers
Table 12 Results of completeness of classifiers
Table 13 Results of effectiveness of classifiers

The misclassification rate is defined as the ratio of the number of wrongly classified reviews to the total number of reviews classified by the prediction system. The wrong classifications fall into two categories: if negative reviews are classified as positive (C1), this is termed a type I error; if positive reviews are classified as negative (C2), this is termed a type II error.

$$ \text{Type I error} = \frac{C1}{\text{total no. of positive reviews}} $$
$$ \text{Type II error} = \frac{C2}{\text{total no. of negative reviews}} $$
$$ \text{Overall misclassification rate} = \frac{C1 + C2}{\text{total no. of reviews}} $$
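
As a minimal sketch, these three measures translate directly into code (the argument names are illustrative):

```python
# Error measures as defined above: c1 = negatives predicted positive,
# c2 = positives predicted negative, n_pos/n_neg = actual class sizes.
def error_rates(c1, c2, n_pos, n_neg):
    type1 = c1 / n_pos                     # C1 / total no. of positive reviews
    type2 = c2 / n_neg                     # C2 / total no. of negative reviews
    overall = (c1 + c2) / (n_pos + n_neg)  # overall misclassification rate
    return type1, type2, overall
```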

The obtained results are compared to the actual opinions and the four quality parameters are computed. Tables 7, 8, 9 and 10 summarize the misclassification results. G1 refers to the positive group and G2 refers to the negative group. The possible misclassification outcomes are presented in the inner matrix of the tables: G1G2 (actual positive, predicted negative, i.e. a type II error) and G2G1 (actual negative, predicted positive, i.e. a type I error). The overall misclassification rate is given at the bottom of each matrix.

6.1 Performance of individual classifiers

The classification results obtained for the LR and SVM methods are given in Tables 7 and 8 respectively. In Table 7, the classification results for LR show that the type II error is lower than the type I error for all three models (I, II and III). This indicates that the LR method predicts positive reviews more accurately than negative reviews. Among the models, model I performs better in terms of both type I and type II error than the other two data models (II and III); consequently, the overall misclassification of LR is also lowest for model I. Table 8 gives the classification results in terms of the error measures for the SVM method. The type I and type II errors are considerably lower than for the LR method, which shows the superiority of SVM, and the overall misclassification is accordingly lower than for LR across all three models. Like LR, the SVM method predicts positive reviews more accurately than negative reviews (the type II error is lower than the type I error) for models II and III. Among the models, SVM again performs better for model I than for models II and III.

6.2 Performance of ensemble based hybrid classifiers

Tables 9 and 10 present the results of the hybrid bagged SVM and hybrid boosting predictions respectively. Table 9 shows that the overall misclassification rate of bagged SVM is reduced considerably for models I, II and III compared to SVM (the best individual classification method found in Sect. 6.1). This reflects the higher prediction accuracy of the bagged ensemble based method compared to the individual classifiers. The classification results also show that the type II error is lower than the type I error for all three models, indicating that the bagged SVM based hybrid method likewise predicts positive reviews more accurately than negative reviews. In general, bagged SVM with PCA reduction performs better for model I than for models II and III.

Table 10 gives the results of the Bayesian boosting based hybrid prediction. The overall misclassification rate is reduced considerably for models I, II and III compared to the best individual classification method identified in Sect. 6.1 (SVM); the Bayesian boosting ensemble based hybrid method thus performs better than SVM. However, the bagged SVM based hybrid method dominates the Bayesian boosting based method in terms of type I error, type II error, and hence overall misclassification rate, for model I, while the Bayesian boosting method dominates the bagged SVM based model for models II and III with lower type I and type II errors.

In general, the results in Tables 7, 8, 9 and 10 show that the ensemble based hybrid approach performs better than the individual classification methods. The performance of the bagged SVM ensemble model is appreciable for model I, and Bayesian boosting is better for the other models (II and III). Among the models, model I achieves high accuracy for all classification methods used except Bayesian boosting (Figs. 5, 6).

Fig. 5 Overall misclassification rate of data models

Fig. 6 Overall misclassification rate of classifiers

6.3 Quality metrics

In addition to the misclassification rate, the following quality metrics are evaluated.

6.3.1 Correctness

Correctness is defined as the ratio of the number of reviews correctly classified as positive to the total number of reviews classified as positive.

6.3.2 Completeness

Completeness is defined as the ratio of the number of positive reviews classified as positive to the total number of actual positive reviews.

6.3.3 Effectiveness

Effectiveness is defined as the proportion of positive reviews considered high risk out of all reviews. Letting the type II misclassification rate be Pr(nfp | fp),

$$ \text{Effectiveness} = \Pr(fp \mid fp) = 1 - \Pr(nfp \mid fp) $$
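
A sketch of these quality metrics from binary predictions follows; note that the effectiveness line follows the formula above literally, reading Pr(nfp | fp) as the type II misclassification rate, which is one plausible interpretation of the terse definition.

```python
# Correctness, completeness and effectiveness from predictions,
# assuming labels are 1 (positive) and 0 (negative).
import numpy as np

def quality_metrics(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))   # negatives classified positive
    fn = np.sum((y_true == 1) & (y_pred == 0))   # positives classified negative
    correctness = tp / (tp + fp)                 # correct positives / classified positive
    completeness = tp / (tp + fn)                # correct positives / actual positives
    effectiveness = 1 - fn / (tp + fn)           # 1 - Pr(nfp | fp), per the formula above
    return correctness, completeness, effectiveness
```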

Tables 11, 12 and 13 summarize the various quality measures for all the classification methods used in the analysis; the measures themselves are defined above. From the results in Table 11, it is found that the SVM and LR based models (I, II and III) lead to low correctness, which implies that a large number of sentences that are not truly positive/negative would have to be inspected. The correctness value is much higher for the hybrid ensemble methods than for the other methods across models I, II and III. Among the hybrid classifiers, the highest correctness of 83.3 % is achieved by model III of Bayesian boosting. Within each classification method, the classification results are best for model I for SVM, LR and bagged SVM. This indicates that unigrams alone, rather than the combination of unigrams, bigrams and trigrams, have a strong relationship to review classification. In general, the hybrid ensemble based models classify the reviews very accurately, with high correctness.

The completeness of the classification models is shown in Table 12, which shows that the hybrid ensemble based methods recover the most positive and negative reviews compared to the other methods for all models. Among the three models, model I of bagged SVM predicts the maximum number of positive and negative reviews, with a high completeness of 81.6 %. The effectiveness of the models is presented in Table 13. Effectiveness captures the productive effort spent in inspecting the truly positive and negative review sentences. Bagged SVM proves to be the most effective for model I, and Bayesian boosting is the most effective for models II and III. Among the classification models used, the hybrid methods classify the reviews with better effectiveness. The high effectiveness of model III for Bayesian boosting (81.3 %) indicates that the effort wasted during analysis is minimal.

In general, our experimental results show that, among the classification methods used, the hybrid ensemble methods perform better on all quality measures. Among the hybrid classification methods, bagged SVM achieves the best performance on all quality measures for model I, while models II and III suit the Bayesian boosting classification method better. Thus, for Bayesian boosting, the inclusion of bigrams and trigrams provides better performance than using unigrams alone. Moreover, the results also show that PCA is a suitable dimension reduction method for ensemble based methods.

6.4 Statistical significance test

We applied the nonparametric McNemar's statistical test to compare the performance of the best trained classifiers. The comparison based on McNemar's test showed that the ensemble based hybrid method performs better. The null hypothesis (H0) for this experimental design states that the classifiers perform similarly, whereas the alternative hypothesis (H1) claims that at least one of the classifiers performs differently. The z scores indicate whether we should accept H0 and reject H1 or vice versa. In order to calculate the z scores, the classification result of each classifier must be recorded for each individual instance.
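
A minimal sketch of the pairwise McNemar z score, assuming pred_a and pred_b are two classifiers' predictions on the same instances and y holds the gold labels; |z| > 2.58 roughly corresponds to the 99 % confidence level cited below.

```python
# McNemar z score with continuity correction for two classifiers.
import numpy as np

def mcnemar_z(y, pred_a, pred_b):
    a_only = np.sum((pred_a == y) & (pred_b != y))  # A correct, B wrong
    b_only = np.sum((pred_b == y) & (pred_a != y))  # B correct, A wrong
    return (abs(a_only - b_only) - 1) / np.sqrt(a_only + b_only)
```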

In Tables 14, 15 and 16, the arrowhead ↑ denotes that the classifier in the table row header performed better on the given dataset and ← denotes that the classifier in the table column performed better. Z scores are given next to the arrowheads as a measure of how statistically significant the results are. From the McNemar's test results for model I (Table 14), it is deduced that bagged SVM produced significantly better results than the SVM, LR and Bayesian boosting classifiers (H1 is accepted with a confidence level of more than 99.5 %). The SVM classifier performed better than LR for model I. In Table 15, the McNemar's test results for the model II data set show that bagged SVM performs better than SVM and LR, and that Bayesian boosting performs better than bagged SVM. The performance differences between the ensemble classifiers and the individual classifiers were found to be statistically significant for model I (H1 is accepted with a confidence level of more than 99 %). For all three models, SVM performs statistically better than LR. For model III (Table 16), Bayesian boosting performs better than bagged SVM among the ensemble methods, so hypothesis H1 is accepted with a confidence level of more than 99.5 %. We calculated the p values for one-tailed McNemar's tests comparing our ensemble based approaches with the baselines. The resulting p values show that the bagged SVM based hybrid approach is significantly better than the other approaches for model I; this improvement is statistically significant at p < 0.005. For model II, Bayesian boosting is statistically significantly better than the other approaches (p < 0.001), and for model III, Bayesian boosting is statistically significantly better than the other approaches (p < 0.005).

Table 14 McNemar’s test results: model I
Table 15 McNemar’s test results: model II
Table 16 McNemar’s test results: model III

6.5 Threats to validity

This work does not consider neutral reviews for classification, i.e. it does not address multi-class classification. Moreover, the performance of the classifiers is evaluated on product reviews only, and opinion analysis is domain specific, so the hybrid methods need to be evaluated on other application domains. The product attributes are selected from the review sentences manually, which cannot be assured to be 100 % accurate; a suitable part of speech tagging approach may be employed instead.

7 Conclusion

In the development of prediction models to classify reviews, more reliable approaches are expected to reduce misclassifications. In this paper, two ensemble based hybrid approaches are introduced that perform better than the statistical baseline approaches. Among the methods used, the combination of ensembles and PCA was highly robust for models I, II and III, as shown through the various quality parameters. Bagged SVM dominates for the unigram model, with a reduction in overall misclassification rate of 3.3 % compared to the Bayesian boosting based hybrid model. Bayesian boosting performs better for the combination of unigrams, bigrams and trigrams, with a reduction in overall misclassification rate of 2.9 % compared to the bagged SVM based method. In future, the performance of the hybrid classifiers should be evaluated on various other domains. Different hybrid combinations of soft computing techniques can also be investigated. The use of PCA as a feature reduction technique should be compared against other feature selection methods such as information gain and mutual information. The effect of feature reduction methods (PCA, Fisher's linear discriminant ratio, latent semantic indexing) combined with other ensemble methods such as stacking and voting can be explored as an extension of this work.