Introduction to sentiment analysis

The popularity of rapidly growing online social networks and electronic media has encouraged young researchers to pursue work on sentiment analysis. Organizations are now keen to assess customer and public opinion about their products from social media text [1]. Online service providers assess social media data from blogs, online forums, comments, tweets and product reviews, and use this assessment for decision making or for improving their services and product quality. The applications of sentiment analysis include social event planning, election campaigning, healthcare monitoring, consumer products and awareness services. With the heavy use of the internet by business organizations around the globe, opinionated web text has come to shape business practices and socio-economic systems. The available computational power is complemented by the rapid growth of machine learning techniques. This work focuses on four text classifiers used for sentiment analysis, viz. the Naïve Bayes, J48, BFTree and OneR algorithms. The “Machine learning techniques for sentiment analysis” section of this paper provides the intuition behind the task of sentiment classification through the modeling of these four classifiers. The architecture of the proposed model using the four sentiment classifiers is presented in the “Proposed methodology for optimization of sentiment prediction using weka” section. Related work with recent contributions of machine learning in the field of sentiment classification is described in the “Related work” section. In the “Datasets taken” section, the three manually annotated datasets are described along with their preprocessing. The experimental results and a discussion of the efficacy of the classifiers are presented in the “Results and discussions” section, followed by concluding remarks and a future direction in the “Conclusion” section.

Levels of sentiment

Because little opinion text was available in digital form, computational linguistics attracted very little research interest in this area during the last decade of the twentieth century [2,3,4]. The growth of social media text on the internet has since led researchers to define levels of granularity for text. Web text is classified at three levels, viz. document level, sentence level and word level. In [5], a fourth level of granularity is defined using a deep convolutional neural network. This fourth level is a character-level feature extraction approach that extracts features from each character window of a given word (Table 1).

Table 1 Levels of sentiment along with their attributes

Machine learning techniques for sentiment analysis

Social networking sites make their data conveniently and freely available on the web. This availability of data draws young researchers into the field of sentiment analysis. People express their emotions and perspectives on social media discussion forums [6]. Business organizations employ researchers to uncover hidden facts about their products and services. Automatic determination of sentiment from reviews is a main concern of multinational organizations [7,8,9,10]. Machine learning techniques have improved the accuracy of sentiment analysis and expedited the automatic evaluation of data. This work uses four machine learning techniques for the task of sentiment analysis. The modeling of the four techniques is briefly discussed below.

Naïve Bayes used for sentiment classification

The polarity of a sentiment is generally decided by the mindset of the author of a text, that is, whether the author is positively or negatively oriented towards what is said [6, 11,12,13]. The Naïve Bayes classifier is a popular supervised classifier that provides a way to identify positive, negative and neutral feelings in web text. It uses conditional probability to classify words into their respective categories. A benefit of using Naïve Bayes for text classification is that it needs only a small dataset for training. The raw data from the web undergoes preprocessing, in which numerals, foreign words, HTML tags and special symbols are removed, yielding a set of words. The words are tagged manually with positive, negative and neutral labels by human experts. This preprocessing produces word-category pairs for the training set. Consider a word ‘y’ from the test set (the unlabeled word set) and a window of n words (x1, x2, …, xn) from a document. The conditional probability that the data point ‘y’ belongs to the category of the n words from the training set is given by:

$$P(y/x_{1} ,x_{2} , \ldots ,x_{n} ) = \frac{{P\left( y \right) \times \mathop \prod \limits_{i = 1}^{n} P(x_{i} /y)}}{{P(x_{1} ,x_{2} , \ldots ,x_{n} )}}$$
(1)
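A minimal Python sketch of Eq. (1) is given here for illustration. It reads the equation in the usual way, with y as the category label and x1, …, xn as the observed words. The toy training pairs, the function names `train_nb` and `posterior`, and the add-one smoothing are assumptions of this sketch and are not taken from the experiments reported in this paper.

```python
from collections import Counter, defaultdict

def train_nb(pairs):
    """Count class frequencies and word-given-class frequencies.

    pairs: list of (words, label) tuples (hypothetical toy data).
    """
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in pairs:
        class_counts[label] += 1
        for w in words:
            word_counts[label][w] += 1
            vocab.add(w)
    return class_counts, word_counts, vocab

def posterior(words, model):
    """Return P(y / x1..xn) for every class y, following Eq. (1)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / total_docs  # prior P(y)
        n_words = sum(word_counts[label].values())
        for w in words:
            # P(xi / y) with add-one smoothing (an assumption of this sketch)
            score *= (word_counts[label][w] + 1) / (n_words + len(vocab))
        scores[label] = score
    # P(x1..xn): the shared denominator, obtained by summing over classes
    evidence = sum(scores.values())
    return {label: s / evidence for label, s in scores.items()}

# Hypothetical word-category pairs
training = [
    (["horrible", "acting"], "Neg"),
    (["bad", "writing"], "Neg"),
    (["very", "misleading"], "Neg"),
    (["more", "enjoyable"], "Pos"),
]
model = train_nb(training)
probs = posterior(["bad", "acting"], model)
predicted = max(probs, key=probs.get)
```

Because the denominator of Eq. (1) is the same for every category, the predicted label depends only on the numerator; the division is kept here so that the returned values sum to one.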

Consider an example of a movie review for movie “Exposed”. The experimentation with Naïve Bayes yields the following results.

J48 algorithm used for sentiment prediction

The hierarchical mechanism divides the feature space into distinct regions and then assigns each sample to a category label. J48 is a decision-tree-based classifier used to generate rules for the prediction of target terms. It is able to deal with larger training datasets than other classifiers [14]. The word features for the sentences of the corpus, taken from the labeled ARFF file of the training set, are represented in the leaf nodes of the decision tree. In the test set, each time a nearby feature satisfies the label condition of an internal feature node, its level is lifted up within the same branch of the decision tree. The assignment of labels to the word features of the test set gradually generates two different branches of the decision tree. The J48 algorithm uses the entropy function to test the classification of terms from the test set.

$$Entropy \left( {Term} \right) = - \mathop \sum \limits_{j = 1}^{n} \frac{{|Term_{j} |}}{|Term|}\log_{2} \frac{{|Term_{j} |}}{|Term|}$$
(2)

where Term can be a unigram, bigram or trigram. In this study we have considered unigrams and bigrams. In the example in Table 2, bigrams such as “Horrible acting”, “Bad writing” and “Very misleading” are labeled with negative sentiment, whereas the term “More enjoyable” reflects positive sentiment towards the movie. The decision tree of the J48 algorithm for obtaining sentiment from text is shown in Fig. 1 below.

Table 2 Initial four reviews of the training set and two reviews of the test set
Fig. 1 J48’s decision tree for the terms of the example in Table 2
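The entropy function of Eq. (2) can be sketched in Python as follows. The label list is hypothetical toy data, and the helper name `entropy` is chosen for this illustration only.

```python
import math
from collections import Counter

def entropy(labels):
    """Base-2 entropy of a list of class labels, as in Eq. (2)."""
    total = len(labels)
    counts = Counter(labels)
    # Each count plays the role of |Term_j|, and total the role of |Term|.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical labels: three negative terms and one positive term
labels = ["Neg", "Neg", "Neg", "Pos"]
h = entropy(labels)
```

The value is zero when all terms carry the same label and reaches one bit when the two labels are equally frequent.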

BFTree algorithm used for sentiment prediction

BFTree is another classification approach; it outperforms J48, C4.5 and CART by expanding only the best node first rather than following depth-first order. The BFTree algorithm searches the training file to locate the best supporting matches of positive and negative terms in the test file. It uses information gain as a heuristic to identify the best node by probing all collected word features. The only difference between the J48 and BFTree classifiers is the order in which the decision tree is built. The decision tree separates the feature terms of plain text taken from movie reviews and classifies them at the document level by tagging them with appropriate labels. BFTree extracts the best node from the labeled and trained binary tree nodes to reduce the error computed from information gain.

$$Info_{gain} \left( {S, A} \right) = Entropy \left( S \right) - \mathop \sum \limits_{i \in V(A)} \frac{{\left| {S_{i} } \right|}}{{\left| S \right|}} \times Entropy(S_{i} )$$
(3)

where S is the word feature term of the test set and A is the attribute of the sampled term from the training set. V(A) denotes the set of all possible values of A. The binary tree stops growing when an attribute A takes a single value or when the information gain vanishes.
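As an illustration of Eq. (3), the sketch below computes the information gain of a single attribute in Python. The attribute `contains_bad`, the labels and the helper names are hypothetical and serve only to show the arithmetic.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Base-2 entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def info_gain(values, labels):
    """Information gain of one attribute, following Eq. (3).

    values: attribute value of each sample (the values in V(A))
    labels: class label of each sample
    """
    subsets = defaultdict(list)
    for v, y in zip(values, labels):
        subsets[v].append(y)  # S_i: the samples that share value i
    total = len(labels)
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

# Hypothetical binary attribute: does the review contain the word "bad"?
contains_bad = [1, 1, 0, 0]
sentiment = ["Neg", "Neg", "Pos", "Pos"]
gain = info_gain(contains_bad, sentiment)
```

An attribute that separates the two labels perfectly, as here, yields a gain equal to the entropy of the whole set, while an attribute that is unrelated to the labels yields a gain of zero.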

OneR algorithm used for sentiment prediction

The OneR algorithm is a classification approach that restricts the decision tree to one level, thereby generating a single rule. This rule makes predictions on word feature terms with a minimal error rate owing to the repeated assessment of word occurrences. The most frequent terms of a particular sentence are classified on the basis of the class of the featured terms in the training set. The steps of the OneR algorithm for sentiment prediction with the smallest classification error are given below:

Step 1: Select a featured term from the training set.

Step 2: Train a model using step 3 and step 4.

Step 3: For each predictor term:
    For each value of that predictor:
        Count the frequency of each value of the target term.
        Find the most frequent class.
        Make a rule and assign that class to the predictor.

Step 4: Calculate the total error of the rules of each predictor.

Step 5: Choose the predictor with the smallest error.
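The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the WEKA implementation used in the experiments: the function name `one_r`, the dictionary-based row format and the toy data are all hypothetical.

```python
from collections import Counter, defaultdict

def one_r(rows, labels):
    """Return (best_predictor, rules, errors) following the OneR steps.

    rows: list of dicts mapping predictor name -> value
    labels: class label of each row
    """
    best = None
    for predictor in rows[0]:
        # Count the class frequency for each value of this predictor.
        counts = defaultdict(Counter)
        for row, label in zip(rows, labels):
            counts[row[predictor]][label] += 1
        # One rule per value: assign the most frequent class.
        rules = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        # Total error: rows whose label differs from the class of their rule.
        errors = sum(sum(c.values()) - c[rules[v]] for v, c in counts.items())
        # Keep the predictor with the smallest total error.
        if best is None or errors < best[2]:
            best = (predictor, rules, errors)
    return best

# Hypothetical toy data: presence (1) or absence (0) of two terms
rows = [
    {"bad": 1, "very": 0},
    {"bad": 1, "very": 1},
    {"bad": 0, "very": 1},
    {"bad": 0, "very": 0},
]
labels = ["Neg", "Neg", "Pos", "Pos"]
predictor, rules, errors = one_r(rows, labels)
```

In this toy data the term "bad" separates the two classes without error, so it is chosen over "very", which misclassifies half of the rows.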

Proposed methodology for optimization of sentiment prediction using weka

The raw text from the web is preprocessed in Python 3.5 using the NLTK and bs4 libraries. Each review in the first dataset is parsed with NLTK’s parser, and the title of the review is considered as a feature. We obtained 15 features from the first dataset and 42 features from each of the second and third datasets. The CSV files generated from Python are converted to ARFF files for WEKA 3.8. Only two sentiment labels, namely Pos for positive and Neg for negative, are assigned to sentences. The working methodology of the proposed work for optimization of sentiment prediction is shown in Fig. 2 below.

Fig. 2 Proposed methodology
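The CSV-to-ARFF conversion step can be illustrated with a short Python sketch. The conversion code used in this work is not shown in the paper, so the following is only an assumption of how such a step might look: the function name `csv_to_arff`, the column names, the treatment of every feature as numeric and the placement of the class label in the last column are all hypothetical.

```python
import csv
import io

def csv_to_arff(csv_text, relation="reviews", class_values=("Pos", "Neg")):
    """Convert CSV text (header row, numeric features, class label last)
    into ARFF text."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    lines = ["@relation " + relation, ""]
    # Every feature column is declared numeric (an assumption of this sketch).
    for name in header[:-1]:
        lines.append("@attribute " + name + " numeric")
    # The last column is declared as a nominal class attribute.
    lines.append("@attribute " + header[-1] + " {" + ",".join(class_values) + "}")
    lines += ["", "@data"]
    lines += [",".join(row) for row in data]
    return "\n".join(lines) + "\n"

# Hypothetical two-review sample
sample = "title_length,bad_count,sentiment\n4,1,Neg\n6,0,Pos\n"
arff = csv_to_arff(sample)
```

The resulting text has an `@relation` line, one `@attribute` line per column and the data rows after `@data`, which is the layout the ARFF loader expects.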

After the files are loaded with the ARFF loader, the class assigner picks up the appropriate class labels from the dataset and performs feature selection on the basis of frequently used headings and the most frequent titles. The feature selector module is implemented using three feature selection methods, namely Document Frequency (DF), Mutual Information (MI) and Information Gain (IG). The mathematical modeling of these feature selection methods requires some probability distributions and statistical notation, described below:

P(w): Probability that a document ‘d’ contains term ‘w’.

P(c’): Probability that document ‘d’ does not belong to category ‘c’.

P(w, c): Joint probability that document ‘d’ contains word term ‘w’ of category ‘c’.

P(c/w): Conditional probability that a document ‘d’ belongs to category ‘c’ under the condition that ‘d’ contains word term ‘w’.

Other notations such as P(w’), P(w/c), P(w/c’), P(c/w’) and P(c’/w) are defined similarly, and {c} is the set of categories.

N1: Number of documents that exhibit category ‘c’ and contain term ‘w’.

N2: Number of documents that do not belong to category ‘c’ but contain term ‘w’.

N3: Number of documents that belong to category ‘c’ and do not contain term ‘w’.

N4: Number of documents that neither belong to category ‘c’ nor contain term ‘w’.

N: Total number of document reviews.

The DF method scores a term by the number of documents in which it occurs and retains only the terms with a higher document frequency.

$$DF = \mathop \sum \limits_{i = 1}^{m} N_{1i}$$
(4)

The MI method measures features of text by computing the similarity between word term ‘w’ and category ‘c’.

$$Sim_{Info} \left( {w, c} \right) = \log \frac{{P\left( {w/c} \right)}}{P\left( w \right)}$$
(5)
$$MI = \log \frac{{N_{1} \times N}}{{(N_{1} + N_{3} )(N_{1} + N_{2} )}}$$
(6)
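Equation (6) can be evaluated directly from the four document counts, as the following Python sketch shows. The counts are hypothetical, the function name `mutual_information` is chosen for this illustration, and the natural logarithm is an assumption because the base of the logarithm is not stated.

```python
import math

def mutual_information(n1, n2, n3, n4):
    """MI of term w and category c from document counts, as in Eq. (6).

    n1: in c, contains w      n2: not in c, contains w
    n3: in c, lacks w         n4: not in c, lacks w
    """
    n = n1 + n2 + n3 + n4  # total number of document reviews N
    return math.log((n1 * n) / ((n1 + n3) * (n1 + n2)))

# Hypothetical counts for one term and one category
mi = mutual_information(n1=30, n2=10, n3=20, n4=40)
```

The score is positive when the term occurs in the category more often than its overall frequency would suggest, zero when the term and the category are independent, and undefined when N1 is zero.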

The IG construct measures similarity information for a category by exploiting the probabilities of the presence or absence of a term in a document review.

$$\begin{aligned} IG\left( w \right) = - \mathop \sum \nolimits P\left( c \right) \cdot \log P\left( c \right) + P\left( w \right)\left[ {\mathop \sum \nolimits P\left( {c/w} \right) \cdot \log P\left( {c/w} \right)} \right] \\+\, P\left( {w^{'} } \right)\left[ {\mathop \sum \nolimits P\left( {c/w^{'} } \right) \cdot \log P\left( {c/w^{'} } \right)} \right] \\ \end{aligned}$$
(7)
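For the two-category case used in this work (Pos and Neg), Eq. (7) can be computed from the counts N1 to N4. The Python sketch below is an illustration under stated assumptions: the function names are hypothetical, the logarithm is taken to base 2, and 0 · log 0 is treated as 0.

```python
import math

def plogp(p):
    """Return p * log2(p), with the convention that 0 * log 0 = 0."""
    return p * math.log2(p) if p > 0 else 0.0

def information_gain(n1, n2, n3, n4):
    """IG(w) for the categories {c, c'} from document counts, per Eq. (7)."""
    n = n1 + n2 + n3 + n4
    p_c, p_not_c = (n1 + n3) / n, (n2 + n4) / n
    p_w, p_not_w = (n1 + n2) / n, (n3 + n4) / n
    # First term: minus the sum over categories of P(c) log P(c)
    prior = -(plogp(p_c) + plogp(p_not_c))
    # Second term: sum of P(c/w) log P(c/w) over documents containing w
    with_w = 0.0
    if n1 + n2 > 0:
        with_w = plogp(n1 / (n1 + n2)) + plogp(n2 / (n1 + n2))
    # Third term: sum of P(c/w') log P(c/w') over documents lacking w
    without_w = 0.0
    if n3 + n4 > 0:
        without_w = plogp(n3 / (n3 + n4)) + plogp(n4 / (n3 + n4))
    return prior + p_w * with_w + p_not_w * without_w

# Hypothetical counts: the term appears only in documents of category c
ig_perfect = information_gain(50, 0, 0, 50)
# Hypothetical counts: the term is unrelated to the category
ig_none = information_gain(25, 25, 25, 25)
```

A term that occurs only in one category attains the full entropy of the category distribution, whereas a term that is equally likely in both categories scores zero.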

The normalization module converts all letters to lowercase, removes punctuation marks and special symbols, converts numbers into words, expands abbreviations and limits the average sentence length to twenty words. Each sentence is delimited by a newline character. Python’s NLTK and bs4 libraries are used for this purpose. The data splitter uses a ratio of 80:20 for the train and test subsets. The datasets were split manually at the time the data were retrieved from the web. The four classifiers are trained on the training subsets, followed by performance evaluation. The evaluation metrics used in the experiment are precision, recall, accuracy and F-measure.
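The four evaluation metrics can be computed from actual and predicted labels as in the Python sketch below. It treats Pos as the class of interest and uses hypothetical labels; the function name `evaluate` is an assumption, and the figures reported by WEKA may be averaged over classes differently.

```python
def evaluate(actual, predicted, positive="Pos"):
    """Return precision, recall, accuracy and F-measure for one class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    tn = len(pairs) - tp - fp - fn
    # Guard against empty denominators when a class is never predicted.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(pairs)
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f_measure": f_measure}

# Hypothetical labels for five test reviews
actual = ["Pos", "Pos", "Neg", "Neg", "Pos"]
predicted = ["Pos", "Neg", "Neg", "Pos", "Pos"]
scores = evaluate(actual, predicted)
```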

Related work

Existing approaches to sentiment prediction and optimization widely include SVM and Naïve Bayes classifiers. Hierarchical machine learning approaches yield moderate performance in classification tasks, whereas SVM and Multinomial Naïve Bayes have proved better in terms of accuracy and optimization. Sentiment analysis using neural network architectures has appeared in very few works. Sentiment prediction methods using recursive neural networks and deep convolutional neural networks are somewhat complex in capturing the compositionality of words. Extracting character-level features and embeddings of complex words is found to be hard in many neural network architectures, whereas sentence-level or word-level features such as morphological tags and stems are extracted more effectively by convolutional neural networks. Very few researchers have used J48, BFTree and OneR for the task of sentiment prediction. These three classifiers have been used for other classification tasks such as emotion recognition from text and categorization of Twitter text. The benchmarks related to machine learning techniques, in terms of classification accuracy, are summarized in Table 3. SVM and Naïve Bayes perform better on these benchmarks than other machine learning techniques.

Table 3 Benchmarks of classifier’s accuracies

Datasets taken

Three datasets are manually annotated. The first two are taken from http://www.amazon.in. The first dataset consists of product reviews of Woodland’s wallet: 88 reviews taken from 12th October 2016 to 25th October 2016 form the training set, and 12 randomly chosen reviews taken from 25th October 2016 to 30th October 2016 form the test set, whose sentiments are predicted using the four machine learning algorithms. The second dataset consists of digital camera reviews of Sony: 7465 reviews taken from 01st October 2016 to 25th October 2016 form the training set, and 1000 reviews taken from 25th October 2016 to 30th October 2016 form the test set. The third dataset consists of movie reviews taken from http://www.imdb.com. It contains 2421 reviews for the training set and 500 reviews for the test set.

Results and discussions

The experiment is carried out using the freeware WEKA software tool for the classification of sentiments in text. Standard implementations of the Naïve Bayes, J48, BFTree and OneR algorithms from WEKA version 3.8 are used. On the first dataset, Naïve Bayes reaches 100% classification accuracy in some of the epochs because of the small size of the dataset. The average over 29 epochs for all four classifiers on the second and third datasets is presented in Table 4 below. Naïve Bayes shows the fastest learning among the four classifiers, whereas J48 is found to be slower. The OneR classifier leads the other three classifiers in the percentage of correctly classified instances. The J48 algorithm is promising in terms of true positive and false positive rates.

Table 4 Performance evaluation of four classifiers

Classification accuracies for the test subsets with 42 and 15 attributes are recorded. The average accuracies over 29 runs on the three datasets are presented in Table 5 below. All four classifiers improved in accuracy as the number of features increased from 15 to 42. This suggests that the learning capability of the machine learning algorithms improves with the number of features.

Table 5 Test accuracies of classification algorithms for three datasets

Conclusion

This paper uses four machine learning classifiers for sentiment analysis on three manually annotated datasets. The mean of 29 epochs of experimentation recorded in Table 4 shows that OneR is more precise in terms of the percentage of correctly classified instances. On the other hand, Naïve Bayes exhibits a faster learning rate and J48 shows adequacy in the true positive and false positive rates. Table 5 shows that J48 and OneR perform better on the smaller dataset of Woodland’s wallet reviews. The preprocessing in the proposed methodology is limited in its ability to extract foreign words, emoticons and elongated words with their appropriate sentiments. Future work on sentiment analysis has scope to improve preprocessing with word embeddings using deep neural networks and can also extend this study through convolutional neural networks.