
1 Introduction

The explosion in the use of social networks, microblogs, forums, and e-commerce pages has increased interest in discovering the opinions, feelings, and sentiments that authors express in such content. The essence of this information is subjectivity. Although sentiment can also be attached to objective text, many decisions people make are influenced by others’ opinions. One clear example is stock prices, which are subject not only to objective parameters but also to speculation. This subjective information also drives interest in discovering trends and changes in society’s opinions.

Sentiment analysis, also known as opinion mining, tries to cope with this subjectivity problem. It frames the subjectivity problem as a polarity problem, where the main goal is to determine the polarity of a text. Sentiment analysis has a wide variety of applications, such as marketing trends, end-user satisfaction surveys, and political opinion trends. In this manner, detecting sentiment polarity facilitates the understanding of such subjective information.

In Spanish texts, as in other languages, sentiment analysis poses challenges for Natural Language Processing (NLP) due to the inherent nature of the language, the ambiguity in the use of words, misspelling, grammatical mistakes, missing or misused punctuation, use of slang, and the lack of standardization while tagging corpora.

This paper presents a hybrid approach to extract features from a corpus consisting of Spanish opinion sentences. After feature extraction, we construct a model using support vector machines (SVM) and a Spanish Sentiment Lexicon created semi-automatically. The model achieved 87.9 % accuracy in 5-fold cross-validation. To further test its accuracy, two previously annotated corpora were used, yielding 52.6 % accuracy on the 3-class TASS 2014 corpus and 65.4 % on the SFU Review Corpus.

The rest of the paper is organized as follows: In Sect. 2 a brief synthesis of related work about sentiment analysis on Spanish texts is presented; in Sect. 3 the process of supervised classification to construct the model is explained; in Sect. 4 implementation details of a sentiment analysis system are described; in Sect. 5 the accuracy of the model is evaluated; and finally, in Sect. 6 conclusions and future work are outlined.

2 Related Work

Sentiment analysis has many interesting applications, such as determining customer satisfaction, identifying opinion trends, gauging market sentiment, tracking political opinion trends, determining an entity’s reputation and, more importantly, listening to the voice of the majority with respect to a topic of interest. Excellent work has been done on sentiment analysis in English. However, it is important to increase the effort devoted to other languages.

Specifically for Spanish, more work is needed on freely available lexicons in order to improve algorithms that detect sentiment polarity. In this regard, a few resources are open and freely available, such as the work of Molina-González et al. [10], where a Spanish Opinion Lexicon (SOL) was constructed by applying machine translation to the lexicon built by Hu and Liu [8]. In addition to machine translation, the SOL lexicon was improved manually and enhanced with domain information from a movie reviews corpus. Another example of a freely available lexicon is the work of Perez-Rosas et al. [13], where two lexicons, full-strength and medium-strength, were obtained using Latent Semantic Analysis over WordNet and SentiWordNet. Lexicons that are open only for academic and research purposes are also relevant: Redondo et al. [14] adapted the ANEW corpus, originally developed in English by Bradley and Lang [3], through manual translation of 1034 words; and Sidorov et al. [15] manually annotated the Spanish Emotions Lexicon (SEL) with the six basic human emotions: anger, fear, joy, disgust, surprise, and sadness.

Other works aim to extract lexicons automatically from annotated corpora, such as the work of Gutierrez et al. [7], in which graph-based algorithms are used to annotate words extracted from previously annotated corpora. In their approach, every word is potentially positive or negative if it appears in a phrase tagged as positive or negative, respectively, and semantic relations are captured with a graph-based representation. Montejo-Raez et al. [11] implemented a strategy to obtain a set of polarity words from Twitter by using the query “me siento” (I feel) and then manually tagging the polarity of the words.

Regarding lexicon-based methods for sentiment analysis, Taboada et al. [16] presented a heuristic Semantic Orientation Calculator (SO-CAL), which uses annotated dictionaries of polarity words (adjectives, adverbs, verbs, and nouns), negation words, and intensifiers. Moreno-Ortiz et al. [12] also applied a heuristic calculation to obtain what they call the Global Sentiment Value (GSV); however, their GSV formula has poor accuracy.

Machine learning approaches are often used to solve sentiment polarity classification problems. Martinez-Camara et al. [9] performed several tests with different features, such as Term Frequency-Inverse Document Frequency (TF-IDF) and Binary Term Occurrence (BTO), using SVM and Naive Bayes classifiers. Anta et al. [1] also performed a series of tests using N-grams obtained through a combination of preprocessing tasks (lemmatization, stemming, spell-checking) with Bayesian classifiers and decision tree learners.

Our hybrid approach is similar to [5, 18] in that it combines a lexicon-based approach with a supervised-learning approach. Del-Hoyo et al. [5] formed feature vectors by concatenating TF-IDF features with the output of their Semantic Tool for detecting affect in texts. Vilares et al. [18] used part-of-speech tags with syntactic dependencies. Nevertheless, our work differs from [5, 18] in that no bag of words is needed to construct the feature vector. Therefore, the dimensionality of the feature vector is reduced.

3 Classification Model

The work presented in this paper aims to cope with challenges in sentence-level sentiment analysis. The methodology followed is depicted in Fig. 1: a supervised-learning approach was used to determine the sentiment polarity of sentences. First, a process collects sentences from Twitter and several e-commerce pages to construct a corpus. The Sentiment Group features were extracted from this corpus. Furthermore, a sentiment lexicon was obtained from the corpus using the most representative polarity words. Using support vector machines (SVM) with a radial basis kernel, three binary classification models were obtained. Then, 5-fold cross-validation was used to obtain the accuracy and the F1-measure, although similar results were obtained using 6-, 8-, and 10-fold cross-validation.

Fig. 1. Supervised learning in sentiment analysis

3.1 Tweets + Reviews Corpus

The Tweets + Reviews (TR) Corpus was obtained by downloading tweets and a subset of the reviews from e-commerce pages available in the work of Dubiau and Ale [6]. Once downloaded, tweets and reviews were split into sentences. A total of 6687 sentences were obtained. Each sentence was tagged according to its polarity as: terrible (N+), bad (N), neutral (NEU), good (P), or excellent (P+).

To ensure the quality of the corpus, sentences were tagged and selected in a three-step process:

  (i) Manual tagging: A group of five trained colleagues tagged each sentence, and each sentence was reviewed by at least two different taggers.

  (ii) Sentiment polarity identification: A heuristic sentiment algorithm was used to tag the sentences automatically. This heuristic calculator determines sentiment polarity using basic linguistic rules and polarity words and is explained in detail in Sect. 4.2.

  (iii) Corpus construction: Sentences for which the heuristic sentiment calculator and the manual tagging agreed were selected to construct the corpus.

With this procedure we obtained 3084 sentences as shown in Table 1.

Table 1. Tweets & reviews corpus
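As an illustration, the selection in step (iii) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the Tinga implementation: the `TaggedSentence` type, the `select_agreeing` function, and the three sample sentences with their tags are hypothetical and only show the agreement filter.

```python
from typing import List, NamedTuple


class TaggedSentence(NamedTuple):
    """Hypothetical record: one sentence with its two independent tags."""
    text: str
    manual_tag: str     # label agreed on by the human taggers
    heuristic_tag: str  # label produced by the heuristic calculator


def select_agreeing(sentences: List[TaggedSentence]) -> List[TaggedSentence]:
    """Keep only the sentences whose manual and heuristic tags coincide."""
    return [s for s in sentences if s.manual_tag == s.heuristic_tag]


# Invented sample data, for illustration only.
candidates = [
    TaggedSentence("Me encanta este teléfono", "P+", "P+"),
    TaggedSentence("El servicio fue malo", "N", "N"),
    TaggedSentence("Llegó el lunes", "NEU", "P"),  # tags disagree: discarded
]
corpus = select_agreeing(candidates)
```

Filtering on agreement trades corpus size for label reliability, which is consistent with the reduction from 6687 to 3084 sentences reported above.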

3.2 Spanish Sentiment Lexicon

Our Spanish Sentiment Lexicon (SSL) is one of the major components of the sentiment analysis system. For this work, the SSL was constructed semi-automatically:

  1. Extracting the most representative (frequent) words, selected statistically from the TR Corpus, and tagging them with the polarity of the sentence they came from and with their part-of-speech (PoS) tag.

  2. Adding to the SSL the most common polarity adjectives and adverbs from on-line dictionaries. The words coming from these external sources had to be tagged as well.

  3. Validating the Spanish Sentiment Lexicon manually in order to increase its quality.

Even though this has been an iterative and time-consuming process, we have constructed a consistent Spanish Sentiment Lexicon. Currently, the Spanish Sentiment Lexicon consists of 4583 words divided into adjectives, adverbs, nouns, negation adverbs, and intensifier words. Intensifier words depend on the language and can be quantitative adverbs or comparative/superlative adjectives. This is summarized in Table 2.

Table 2. Spanish sentiment lexicon

Words were labeled as follows: -2 for N+, -1 for N, 0 for NEU, 1 for P, and 2 for P+. Negative and positive labels are assigned according to intentionality rather than interpretation. For instance, some words, such as “government”, “politicians”, and “acne”, are often interpreted with negative feeling, but they are not negative in themselves, even though they are frequently used in contexts where negative feelings are directed towards them. In this sense, the words that are part of our lexicon are those whose intentionality is clearly positive or negative. Other words whose intentionality is ambiguous or depends on context, such as “ambitious”, “small”, “big”, and “long”, are tagged as context-dependent. The Spanish Sentiment Lexicon is freely available.
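The labeling scheme can be illustrated with a small lookup sketch. The numeric mapping is the one stated above; the `SAMPLE_LEXICON` entries, their PoS tags and labels, and the `valence` function are hypothetical and do not reproduce the actual lexicon file or its format.

```python
from typing import Optional

# Numeric valence for each polarity label, as described in the text.
LABEL_TO_VALENCE = {"N+": -2, "N": -1, "NEU": 0, "P": 1, "P+": 2}

# Hypothetical lexicon entries: word -> (PoS tag, polarity label).
SAMPLE_LEXICON = {
    "excelente": ("JJ", "P+"),
    "bueno": ("JJ", "P"),
    "malo": ("JJ", "N"),
    "terrible": ("JJ", "N+"),
}


def valence(word: str, lexicon=SAMPLE_LEXICON) -> Optional[int]:
    """Return the numeric valence of a word, or None if it is not in the lexicon."""
    entry = lexicon.get(word.lower())
    if entry is None:
        return None
    _pos_tag, label = entry
    return LABEL_TO_VALENCE[label]
```

Returning `None` for unknown words, rather than 0, keeps words that are absent from the lexicon distinct from words explicitly labeled NEU.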

3.3 Feature Selection: Sentiment Groups

In sentiment analysis, commonly selected features are: frequency or occurrence of terms, frequency or occurrence of n-grams (especially bigrams and trigrams) and part-of-speech (PoS) tags. In this paper the notion of Sentiment Groups as features for sentiment analysis classification is introduced.

A Sentiment Group is a group of related words. These words can be polarity words (bueno-good, malo-bad, etc.), intensity words (poco-little, mucho-much; quantitative adverbs in the case of Spanish) or negation adverbs (no-not, nada-nothing, etc.). A special case in Spanish is double negation, which we identify and treat as an intensification of single negation (No me gustó nada la película, I did not like the movie at all). Neutral and objective words, as well as conjunctions and prepositions, are not part of a Sentiment Group. However, they are essential for delimiting Sentiment Groups. Several Sentiment Groups can occur within a sentence. This often happens when punctuation is used incorrectly, which is common in informal text such as tweets and reviews.

A Sentiment Group is defined by these rules:

  • All words within a Sentiment Group are at most two words apart.

  • A Sentiment Group can contain double or triple negation (a characteristic of Spanish; other languages may allow neither double nor triple negation).

  • Sentiment Group separators are conjunctions (y-and, o-or, etc.) and punctuation: comma, semicolon.

Examples of Sentiment Groups are shown in Table 3. In addition to being grouped, the words that belong to Sentiment Groups are tagged with their PoS tag and their polarity (+ for positive and – for negative).

Table 3. Examples of sentiment groups

Sentiment Groups are essentially atomic units of sentiment that are independent of each other. In this manner, we can extract several atomic units of sentiment and use their characteristics to build the feature vector. A Sentiment Group can be seen as a fast heuristic approximation of a syntactic dependency parser for informal text.
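The grouping rules of this section can be sketched as follows. This is a simplified illustration, not the Tinga implementation. It assumes a tiny hypothetical word list in `SENTIMENT_WORDS`, a regex tokenizer, and one reading of the distance rule: a new group starts when more than two token positions separate consecutive sentiment words.

```python
import re
from typing import List

# Separators named in the text: conjunctions (y, o) and comma/semicolon.
SEPARATORS = {",", ";", "y", "o"}
# Hypothetical mini-lexicon of polarity, intensity, and negation words.
SENTIMENT_WORDS = {"no", "nada", "muy", "poco", "mucho",
                   "bueno", "buena", "malo", "mala", "gustó"}
MAX_GAP = 2  # assumed reading of "at most two words apart"


def sentiment_groups(sentence: str) -> List[List[str]]:
    """Split a sentence into Sentiment Groups (lists of sentiment words)."""
    tokens = re.findall(r"\w+|[,;]", sentence.lower())
    groups, current, last_pos = [], [], None
    for pos, tok in enumerate(tokens):
        if tok in SEPARATORS:
            # A separator always closes the group in progress.
            if current:
                groups.append(current)
            current, last_pos = [], None
        elif tok in SENTIMENT_WORDS:
            # Too far from the previous sentiment word: start a new group.
            if current and pos - last_pos > MAX_GAP:
                groups.append(current)
                current = []
            current.append(tok)
            last_pos = pos
        # Neutral words are skipped; they only contribute to the distance.
    if current:
        groups.append(current)
    return groups


example = sentiment_groups(
    "No me gustó nada la película, y la música es muy buena")
```

In this example the comma closes the first group (no, gustó, nada), and the second group (muy, buena) is formed after the conjunction.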

3.4 Feature Vector

We performed some tests on the corpus and found that the average sentence length was 9 words, with a standard deviation of 8. However, we also found some anomalous cases where sentences were up to 45 words long. Taking the maximum expected sentence length to be 20 words (slightly more than the average plus one standard deviation), a sentence with no punctuation and very poor grammar can contain up to ten Sentiment Groups. Hence, we decided to construct the feature vector using ten Sentiment Groups. In the feature vector, each Sentiment Group is characterized by all of its polarity, intensity, and negation words and their PoS tags. The feature vector is then formed as in (1).

$$ F = \left\{ NEG_{n}, DNEG_{n}, INT_{n}, +NN_{n}, -NN_{n}, +JJ_{n}, -JJ_{n}, +VB_{n}, -VB_{n}, +RB_{n}, -RB_{n} \mid 1 \le n \le 10 \right\} \tag{1} $$

It is important to note that order matters while constructing the feature vector, so each word belongs to the corresponding Sentiment Group. Double negation is taken into account. Table 4 shows the interpretation for each Sentiment Group feature.

Table 4. Features in each sentiment group within feature vector
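A possible layout of the feature vector in (1) is sketched below. The eleven feature names and the limit of ten Sentiment Groups come from the text. The group-major ordering, the zero padding for missing groups, and the use of counts as feature values are assumptions made for illustration, and `feature_vector` is a hypothetical helper.

```python
from typing import Dict, List

# The eleven per-group features of Eq. (1), in the order given there.
FEATURES = ["NEG", "DNEG", "INT", "+NN", "-NN",
            "+JJ", "-JJ", "+VB", "-VB", "+RB", "-RB"]
MAX_GROUPS = 10


def feature_vector(groups: List[Dict[str, int]]) -> List[int]:
    """Flatten up to ten Sentiment Groups into a fixed-length vector.

    Each group is a dict mapping a feature name to its value (assumed here
    to be a count). Missing groups and features are filled with zeros, and
    groups beyond the tenth are ignored.
    """
    vector = []
    for n in range(MAX_GROUPS):
        group = groups[n] if n < len(groups) else {}
        vector.extend(group.get(name, 0) for name in FEATURES)
    return vector


# Two hypothetical groups: "no ... malo" and "muy bueno".
v = feature_vector([{"NEG": 1, "-JJ": 1}, {"INT": 1, "+JJ": 1}])
```

Under these assumptions the vector always has 11 × 10 = 110 components, regardless of vocabulary size, which is where the reduction in dimensionality with respect to a bag of words comes from.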

3.5 Support Vector Machines

Support Vector Machines (SVM) are widely used as classifiers in sentiment analysis tasks [1, 5, 9, 18]. Because an SVM is inherently a binary linear classifier, we decided to construct three binary models instead of one multiclass model, while still using a non-linear kernel, specifically the radial basis function kernel, for better performance. To this end, three balanced corpora were obtained using undersampling and were used to train three SVM models: a good-bad (P, N) model with 2839 sentences, an excellent-good (P+, P) model with 840 sentences, and a bad-terrible (N, N+) model with 282 sentences.
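The undersampling step can be illustrated with the sketch below. It shows plain random undersampling to the size of the smallest class, which is one common way to balance a corpus. The text does not specify the exact procedure used, so the `undersample` function and the fixed seed are assumptions.

```python
import random
from collections import defaultdict
from typing import List, Tuple

Sample = Tuple[list, str]  # (feature vector, class label)


def undersample(samples: List[Sample], seed: int = 0) -> List[Sample]:
    """Return a class-balanced subset by random undersampling.

    Every class is reduced to the size of the smallest class.
    Raises ValueError if `samples` is empty.
    """
    by_label = defaultdict(list)
    for sample in samples:
        by_label[sample[1]].append(sample)
    smallest = min(len(items) for items in by_label.values())
    rng = random.Random(seed)  # fixed seed for reproducibility
    balanced = []
    for items in by_label.values():
        balanced.extend(rng.sample(items, smallest))
    rng.shuffle(balanced)
    return balanced


# Toy data: five "P" samples and two "N" samples.
toy = [([i], "P") for i in range(5)] + [([i], "N") for i in range(2)]
balanced = undersample(toy)
```

The balanced subset can then be passed to any SVM trainer with an RBF kernel.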

4 Tinga Sentiment Analysis System

In order to test our proposed model, we implemented a testbed that we refer to as Tinga, which is part of a Scala library for Natural Language Processing. It also includes modules for text preprocessing, tokenizing, part-of-speech tagging, and basic text feature extraction, as well as a module that wraps a Java Support Vector Machine library.

4.1 Text Preprocessing

Much has been said about text preprocessing for sentiment analysis. Some approaches clean the text by removing punctuation, stopwords, and diacritics, and by stemming or lemmatizing words [1, 9]. In our proposal, however, we use minimal classic preprocessing, as in [2], and only follow the preprocessing steps below to normalize informal text (such as tweets):

  • Emoji and emoticon identification: A regex was used to identify emojis and emoticons present in informal text (such as tweets). Each emoji and emoticon was previously classified as positive or negative, and it is replaced in the text by a polarity word (excelente, buen, neutro, malo, terrible).

  • Hashtag split: Many hashtags are formed by several words. We implemented an algorithm to split hashtags into several words.

  • Repetition of characters: In Spanish, only a few characters can be repeated within a word; for instance, c, l, n, and r are the only consonants that can be repeated. Although all vowels can be repeated to form words, neither polarity nor intensity words are among those words. Therefore, we detect and remove repeated characters, including vowels, consonants other than {c, l, n, r}, and punctuation.

  • Upper-case words: In chat slang, upper case means yelling.

  • Detection of adversative conjunctions: In addition to negation, adversative conjunctions can change the sentiment of an opinion, so they are detected as well.

  • Special characters and punctuation: All Spanish characters are allowed but only basic punctuation marks are allowed.

  • Spell checking: Tinga implements a spell checker based on Bayes’ theorem.

  • Tokenizing and PoS tagging: A sentence is split into valid word tokens and tagged according to its grammatical category (PoS tag).
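The character-repetition step from the list above can be sketched with a single regular expression. This is an illustrative approximation rather than the Tinga code. It assumes that runs of c, l, n, or r are reduced to a double letter, that digits and whitespace are left untouched, and that every other repeated character, including vowels and punctuation, is reduced to one occurrence.

```python
import re

_REPEATED = re.compile(r"(.)\1+")  # any character followed by copies of itself


def _collapse(match: "re.Match") -> str:
    ch = match.group(1)
    if ch.isdigit() or ch.isspace():
        return match.group(0)   # assumption: keep numbers and spacing as-is
    if ch.lower() in "clnr":
        return ch * 2           # assumption: these consonants may be doubled
    return ch                   # vowels, other consonants, punctuation


def collapse_repetitions(text: str) -> str:
    """Remove emphatic character repetition from informal text."""
    return _REPEATED.sub(_collapse, text)


cleaned = collapse_repetitions("maloooo!!!")
```

As the text notes, collapsing vowels also alters the few valid words that contain a doubled vowel (for example "leer"). This is accepted because no polarity or intensity word is among them.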

4.2 Polarity Identification

Cascading SVM Classifiers. With the three models obtained from the training phase, it is possible to classify into two classes (P, N) or into four classes (P+, P, N, N+) by simply cascading the SVM models, as shown in Fig. 2. The neutral (NEU) and none (NONE) classes are excluded from the classifiers but are taken into account in our heuristic sentiment calculator.

Fig. 2. Cascading SVM classifiers
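One way to read the cascade is sketched below. The three classifiers are passed in as plain callables, so the sketch does not depend on any particular SVM library. The order of the cascade (good-bad first, then the refinement on each side) is an assumption based on the description of the three models, and the stub classifiers are hypothetical stand-ins for trained models.

```python
from typing import Callable, Sequence

Classifier = Callable[[Sequence[float]], str]


def cascade(features: Sequence[float],
            good_bad: Classifier,
            excellent_good: Classifier,
            bad_terrible: Classifier,
            fine_grained: bool = True) -> str:
    """Classify into (P, N), or into (P+, P, N, N+) when fine_grained is True."""
    coarse = good_bad(features)          # first stage: "P" or "N"
    if not fine_grained:
        return coarse
    if coarse == "P":
        return excellent_good(features)  # second stage: "P+" or "P"
    return bad_terrible(features)        # second stage: "N" or "N+"


# Hypothetical stubs that look only at the first feature value.
def stub_good_bad(x):
    return "P" if x[0] > 0 else "N"


def stub_excellent_good(x):
    return "P+" if x[0] > 1 else "P"


def stub_bad_terrible(x):
    return "N+" if x[0] < -1 else "N"


label = cascade([2.0], stub_good_bad, stub_excellent_good, stub_bad_terrible)
```

With this arrangement, each sentence passes through at most two binary decisions to reach one of the four classes.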

Heuristic Sentiment Calculator. A basic Heuristic Sentiment Calculator (HSC) was implemented to match manual tagging. HSC uses some basic rules:

  • Split the sentence into Sentiment Groups (SG).

  • Obtain the polarity of each SG by multiplying the values of all polarity and intensity words within the SG, taking negation and double negation into account.

  • Obtain the sentence-level sentiment as a weighted sum over all SGs present.

For a more formal calculator using syntactic dependencies, see [17].
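The rules of the heuristic calculator can be sketched as follows. The numeric details are assumptions, since the text does not give them: intensifier weights are arbitrary multipliers, a single negation flips the sign, double negation is modeled as a flipped sign scaled by the hypothetical `DOUBLE_NEGATION_BOOST`, a group without polarity words scores zero, and group weights default to 1.

```python
from typing import List, Optional, Sequence, Tuple

# A group is a list of (kind, value) pairs; kind is "POL", "INT", or "NEG".
Group = List[Tuple[str, Optional[float]]]

DOUBLE_NEGATION_BOOST = 1.5  # hypothetical intensification factor


def group_polarity(group: Group) -> float:
    """Multiply polarity and intensity values, then apply negation."""
    score, negations, has_polarity = 1.0, 0, False
    for kind, value in group:
        if kind == "NEG":
            negations += 1
        else:
            score *= value
            if kind == "POL":
                has_polarity = True
    if not has_polarity:
        return 0.0                      # assumption: no polarity word, no sentiment
    if negations >= 1:
        score = -score                  # single negation flips the polarity
    if negations >= 2:
        score *= DOUBLE_NEGATION_BOOST  # double negation intensifies it
    return score


def sentence_score(groups: Sequence[Group],
                   weights: Optional[Sequence[float]] = None) -> float:
    """Weighted sum of the polarities of all Sentiment Groups."""
    if weights is None:
        weights = [1.0] * len(groups)
    return sum(w * group_polarity(g) for w, g in zip(weights, groups))


# "No me gustó nada" (double negation) and "muy buena" (intensified).
g1 = [("NEG", None), ("POL", 1.0), ("NEG", None)]
g2 = [("INT", 1.5), ("POL", 1.0)]
total = sentence_score([g1, g2], weights=[2.0, 1.0])
```

The sign and magnitude of the result would then be mapped to one of the polarity labels; the thresholds for that mapping are not specified in the text and are left out of the sketch.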

5 Results

Our model was validated using 5-, 6-, 8-, and 10-fold cross-validation over balanced corpora (see Sect. 3.5). Cross-validation gives good results; however, relying only on cross-validation is not sufficient to assess performance on the sentiment analysis problem. Therefore, we also tested our approach against the TASS 2014 corpus (1k corpus) [19] and the SFU Review Corpus [16]. For TASS 2014, our accuracy was 52.6 % for the 3-class problem and 35.2 % for the 5-class problem. For the SFU Review Corpus, our accuracy was 65.4 %, as shown in Table 5.

Table 5. Accuracy and F-1 measure of proposed hybrid approach
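For reference, the two reported measures can be computed from gold and predicted labels as in the sketch below. These are the standard definitions of accuracy and of the F1-measure for one positive class. The function names and the toy labels are illustrative, and the sketch does not reproduce the evaluation scripts used for Table 5.

```python
from typing import Sequence


def accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of items whose predicted label equals the gold label."""
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


def f1_measure(gold: Sequence[str], predicted: Sequence[str],
               positive: str = "P") -> float:
    """Harmonic mean of precision and recall for the `positive` class."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    if tp == 0:
        return 0.0  # convention when there are no true positives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example with four sentences.
gold = ["P", "P", "N", "N"]
pred = ["P", "N", "N", "N"]
acc = accuracy(gold, pred)
f1 = f1_measure(gold, pred)
```

In k-fold cross-validation, these measures are computed on each held-out fold and then averaged over the k folds.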

Testing against the TASS corpus is a challenging task because of the nature of its tagging; for instance, we found some examples that are clearly neutral but tagged as positive. We did better on the SFU Review Corpus, but accuracy was still low because of the difference in structure: the TR corpus consists of sentences, while the SFU Review Corpus consists of long texts. When analyzing long texts, we need to take context into account in order to weight the most relevant sentences.

6 Conclusions and Future Work

We have presented a hybrid method to classify the polarity of Spanish comments with support vector machines trained on a vector of lexical-syntactic features: part-of-speech tags and the polarity valence of words. A Spanish Sentiment Lexicon was constructed to serve as the reference for the polarity valence of words. In addition, we implemented Tinga, our sentiment analysis system, to test the proposed model. Finally, the TASS 2014 corpus and the SFU Review Corpus were used to test accuracy and the F-1 measure. Our approach uses cascading classifiers, but in future work an SVM with a polynomial kernel will be tested in order to perform the classification in a single step.

The Spanish Sentiment Lexicon is the result of a semi-automatic process, and further formalization is needed by evaluating its reliability using kappa agreement. It is also necessary to increase its potential by adding context information to polarity words. In this manner, the polarity of a word will also be influenced by its context, giving better results. We are also working on a graph-based approach [4] to tackle both sentiment analysis classification and automatic lexicon annotation.

The limitations of this work are mainly due to the heuristic approach. More work on the validation of the lexicon and on feature extraction is needed to improve the robustness of the model.