Language Resources and Evaluation, Volume 50, Issue 1, pp 35–65

Developing a successful SemEval task in sentiment analysis of Twitter and other social media texts

  • Preslav Nakov
  • Sara Rosenthal
  • Svetlana Kiritchenko
  • Saif M. Mohammad
  • Zornitsa Kozareva
  • Alan Ritter
  • Veselin Stoyanov
  • Xiaodan Zhu
Original Paper

DOI: 10.1007/s10579-015-9328-1


Abstract

We present the development and evaluation of a semantic analysis task that lies at the intersection of two very trendy lines of research in contemporary computational linguistics: (1) sentiment analysis, and (2) natural language processing of social media text. The task was part of SemEval, the International Workshop on Semantic Evaluation, a semantic evaluation forum previously known as SensEval. The task ran in 2013 and 2014, attracting the highest number of participating teams at SemEval in both years, and there is an ongoing edition in 2015. The task included the creation of a large contextual and message-level polarity corpus consisting of tweets, SMS messages, LiveJournal messages, and a special test set of sarcastic tweets. The evaluation attracted 44 teams in 2013 and 46 in 2014, who used a variety of approaches. The best teams were able to outperform several baselines by sizable margins with improvement across the 2 years the task has been run. We hope that the long-lasting role of this task and the accompanying datasets will be to serve as a test bed for comparing different approaches, thus facilitating research.

Keywords

Sentiment analysis, Twitter, SemEval

1 Introduction

The Internet has democratized content creation, enabling a number of new technologies, media and tools of communication, and ultimately leading to the rise of social media and an explosion in the availability of short informal text messages that are publicly available. Microblogs such as Twitter, weblogs such as LiveJournal, social networks such as Facebook, and instant messengers such as Skype and Whatsapp are now commonly used to share thoughts and opinions about anything in the surrounding world, along with old-fashioned cell phone messages such as SMS. This proliferation of social media content has created new opportunities for studying public opinion, with Twitter being especially popular for research purposes due to its scale, representativeness, variety of topics discussed, as well as easy public access to its messages (Huberman et al. 2008; Java et al. 2007; Kwak et al. 2010).

Despite all these opportunities, the rise of social media has also presented new challenges for natural language processing (NLP) applications, which had largely relied on NLP tools tuned for formal text genres such as newswire, and thus were not readily applicable to the informal language and style of social media. That language proved to be quite challenging with its use of creative spelling and punctuation, misspellings, slang, new words, URLs, and genre-specific terminology and abbreviations, e.g., RT for re-tweet and #hashtags.1 In addition to the genre difference, there is also a difference in length: social media messages are generally short, often length-limited by design as in Twitter, i.e., a sentence or a headline rather than a full document. How to handle such challenges has only recently been the subject of thorough research (Barbosa and Feng 2010; Bifet et al. 2011; Davidov et al. 2010; Jansen et al. 2009; Kouloumpis et al. 2011; O’Connor et al. 2010; Pak and Paroubek 2010; Tumasjan et al. 2010).

The advance in NLP tools for processing social media text has enabled researchers to analyze people’s opinions and sentiments on a variety of topics, especially in Twitter. However, research in that direction was hindered by the unavailability of suitable datasets and lexicons for system training, development and testing.

Until the rise of social media, research on opinion mining and sentiment analysis had focused primarily on learning about the language of sentiment in general, meaning that it was either genre-agnostic (Baccianella et al. 2010) or focused on newswire texts (Wiebe et al. 2005) and customer reviews (e.g., from web forums), most notably about movies (Pang et al. 2002) and restaurants, but also about hotels, digital cameras, cell phones, MP3 and DVD players (Hu and Liu 2004), laptops, etc. This has given rise to several resources, mostly word and phrase polarity lexicons, which have proved to be very valuable for their respective domains and types of texts, but less useful for short social media messages such as tweets.

Over time, some Twitter-specific resources were developed, but initially they were either small and proprietary, such as the i-sieve corpus (Kouloumpis et al. 2011), created only for Spanish, such as the TASS corpus (Villena-Román et al. 2013), or relied on noisy labels obtained automatically from emoticons and hashtags (Go et al. 2009; Mohammad 2012; Mohammad et al. 2013; Pang et al. 2002). Moreover, they all focused on message-level sentiment only, rather than on expression-level sentiment in the context of a tweet. In fact, the first large-scale, freely available resources for sentiment analysis on Twitter were the datasets that we developed for SemEval-2013 Task 2 (Nakov et al. 2013) and further extended for SemEval-2014 Task 9 (Rosenthal et al. 2014) as well as for the upcoming SemEval-2015 Task 10 (Rosenthal et al. 2015). These datasets offered both message-level and expression-level annotations.

The primary goal of these SemEval tasks was to serve as a test bed for comparing different approaches, thus facilitating research that will lead to a better understanding of how sentiment is conveyed in social media. The tasks have been highly successful, attracting wide interest at SemEval and beyond: they were the most popular SemEval tasks in both 2013 and 2014, attracting 44 and 46 participating teams, respectively. They have also fostered the creation of additional freely available resources, such as NRC’s Hashtag Sentiment lexicon and the Sentiment140 lexicon (Mohammad et al. 2013), which the NRC team developed for their participation in SemEval-2013 Task 2 and which were key to their winning the competition. Last but not least, even though the tasks were named Sentiment Analysis in Twitter, they also included evaluation on SMS and LiveJournal messages, as well as on a special test set of sarcastic tweets.

In the remainder of this article, we first introduce the problem of contextual and message-level polarity classification (Sect. 2). We then describe the process of creating the training and the testing datasets (Sect. 3) and the evaluation setup (Sect. 4). Afterwards, we list and briefly discuss the participating systems, the results, and the lessons learned (Sects. 5 and 6). Finally, we compare the task to other related efforts (Sect. 7), and we point to possible directions for future research (Sect. 9).

2 Task description

SemEval-2013 task 2 (Nakov et al. 2013) and SemEval-2014 Task 9 (Rosenthal et al. 2014) had two subtasks: an expression-level subtask and a message-level subtask. Participants could choose to participate in either or both. Below we provide short descriptions of the objectives of these two subtasks.
  • Subtask A: Contextual polarity disambiguation. Given a message containing a marked instance of a word or a phrase, determine whether that instance is positive, negative or neutral in that context. The instance boundaries were provided: this was a classification task, not an entity recognition task.

  • Subtask B: Message polarity classification. Given a message, decide if it is of positive, negative, or neutral sentiment. For messages conveying both positive and negative sentiment, the stronger one is to be chosen.

Each participating team was allowed to submit results for two different systems per subtask: one constrained, and one unconstrained. A constrained system could only use the provided data for training, but it could also use other resources such as lexicons obtained elsewhere. An unconstrained system could use any additional data as part of the training process; this could be done in a supervised, semi-supervised, or unsupervised fashion.

Note that constrained/unconstrained refers to the data used to train a classifier. For example, if other data (excluding the test data) was used to develop a sentiment lexicon, and the lexicon was used to generate features, the system would still be constrained. However, if other, manually or automatically labeled data (excluding the test data) was used with the original data to train the classifier, then such a system would be considered unconstrained.2

3 Dataset creation

In this section, we describe the process of collecting and annotating our datasets of short social media text messages. We will focus our discussion on general tweets as collected for SemEval-2013 Task 2, but our testing datasets also include sarcastic tweets, SMS messages and sentences from LiveJournal, which we will also describe.

3.1 Data collection

First, we gathered tweets that express sentiment about popular topics. For this purpose, we extracted named entities using a Twitter-tuned NER system (Ritter et al. 2011) from millions of tweets, which we collected over a 1-year period spanning from January 2012 to January 2013; for downloading, we used the public streaming Twitter API.

We then identified popular topics as those named entities that are frequently mentioned in association with a specific date (Ritter et al. 2012). Given this set of automatically identified topics, we gathered tweets from the same time period which mentioned the named entities. The testing messages had different topics from training and spanned later periods; this is true for both Twitter2013-test, which used tweets from later in 2013, and Twitter2014-test, which included tweets from 2014.

The collected tweet data were greatly skewed towards the neutral class. In order to reduce the class imbalance, we removed messages that contained no sentiment-bearing words using SentiWordNet as a repository of sentiment words. Any word listed in SentiWordNet 3.0 with at least one sense having a positive or a negative sentiment score greater than 0.3 was considered a sentiment-bearing word.3
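This filtering step can be illustrated with a short sketch. The snippet below uses NLTK’s SentiWordNet interface, which is an assumption made purely for illustration; the organizers’ actual filtering script (and the threshold details in the footnote) may differ.

    # Sketch of the sentiment-word filter described above: a message is kept only
    # if it contains a word with at least one SentiWordNet sense whose positive
    # or negative score exceeds 0.3.
    import nltk
    from nltk.corpus import sentiwordnet as swn

    nltk.download("sentiwordnet", quiet=True)
    nltk.download("wordnet", quiet=True)

    def has_sentiment_word(tokens, threshold=0.3):
        for token in tokens:
            for sense in swn.senti_synsets(token.lower()):
                if sense.pos_score() > threshold or sense.neg_score() > threshold:
                    return True
        return False

    tweets = [["I", "adore", "this", "show"], ["Meeting", "at", "5", "pm"]]
    kept = [t for t in tweets if has_sentiment_word(t)]  # keeps only messages with a sentiment-bearing word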

We annotated the same Twitter messages for subtask A and subtask B. However, the final training and testing datasets overlap only partially between the two subtasks since we had to discard messages with low inter-annotator agreement, and this differed between the subtasks.

After the annotation process, we split the annotated tweets into training, development and testing datasets; for testing, we further annotated three additional out-of-domain datasets:4
  • SMS messages from the NUS SMS corpus5 (Chen and Kan 2013);

  • LiveJournal sentences from LiveJournal (Rosenthal and McKeown 2012);

  • Sarcastic tweets: a small set of tweets containing the #sarcasm hashtag.

3.2 Annotation process

Our datasets were annotated for sentiment on Mechanical Turk.6 Each sentence was annotated by five Mechanical Turk workers, also known as Turkers. The annotations for subtask A and subtask B were done concurrently. Each Turker had to mark all the subjective words/phrases in the tweet message by indicating their start and end positions and to say whether each subjective word/phrase was positive, negative, or neutral (subtask A). Turkers also had to indicate the overall polarity of the message (subtask B). The instructions we gave to the Turkers, along with an example, are shown in Fig. 1. Several additional examples (Table 1) were also available to the annotators.

Providing all the required annotations for a given message (a tweet, an SMS, or a sentence from LiveJournal) constituted a Human Intelligence Task, or a HIT. In order to qualify for the task, a Turker had to have an approval rate greater than 95 % and to have completed 50 approved HITs. We further discarded the following types of annotations (a filtering sketch is shown after the list):7
  • messages containing overlapping subjective phrases;

  • messages marked as subjective but having no annotated subjective phrases;

  • messages with every single word marked as subjective;

  • messages with no overall sentiment marked.
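The discard rules above can be sketched in a few lines of code. The HIT representation below (a dict with a token list, the subjective spans as token offsets, a flag for whether the message was marked as subjective, and the overall label) is a hypothetical format chosen only for illustration; the actual filtering script may differ.

    # Sketch of the discard rules listed above, over a hypothetical HIT format.
    def keep_hit(hit):
        spans = sorted(hit["subjective_spans"])          # (start, end) token offsets
        # 1) overlapping subjective phrases
        if any(next_start < end for (_, end), (next_start, _) in zip(spans, spans[1:])):
            return False
        # 2) marked as subjective but no subjective phrase annotated
        if hit["marked_subjective"] and not spans:
            return False
        # 3) every single word marked as subjective
        if sum(end - start for start, end in spans) >= len(hit["tokens"]):
            return False
        # 4) no overall sentiment marked
        if hit["overall_label"] is None:
            return False
        return True

    hit = {"tokens": "Great match today !".split(), "subjective_spans": [(0, 1)],
           "marked_subjective": True, "overall_label": "positive"}
    print(keep_hit(hit))  # True: none of the discard rules applies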

Fig. 1

Instructions given to workers on Mechanical Turk, followed by a screenshot

Table 1

List of example sentences with annotations that were provided to the Turkers

All subjective phrases are italicized. Positive phrases are in green, negative phrases are in red, and neutral phrases are in blue

For each message, the annotations provided by several Turkers were combined as follows. For subtask A, we combined the annotations using intersection as shown in the last row of Table 2. A word had to appear in 2/3 of the annotations in order to be considered subjective. Similarly, a word had to be labeled with a particular polarity (positive, negative, or neutral) 2/3 of the time in order to receive that label. We also experimented with other methods of combining annotations: (1) by computing the union of the annotations for the sentence, and (2) by taking the annotations provided by the worker who annotated the most HITs. However, we found that these methods were not as accurate. We plan to explore further alternatives in future work, e.g., using the MACE adjudication method (Hovy et al. 2013).
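A minimal sketch of this intersection-style combination is shown below; the per-token label encoding (with 'O' for tokens not marked as subjective) and the handling of exact two-thirds agreement are illustrative assumptions.

    # Sketch: a token is kept as subjective only if at least two thirds of the
    # annotations mark it, and it receives a polarity label only if at least two
    # thirds of the annotations agree on that polarity; otherwise it stays 'O'.
    from collections import Counter

    def combine_token(labels, min_agreement=2 / 3):
        """labels: one label per annotator for a single token, e.g.
        ['positive', 'positive', 'O', 'positive', 'negative']."""
        n = len(labels)
        marked = [lab for lab in labels if lab != "O"]
        if len(marked) / n < min_agreement:
            return "O"                                   # not marked often enough
        label, count = Counter(marked).most_common(1)[0]
        return label if count / n >= min_agreement else "O"

    print(combine_token(["positive", "positive", "O", "positive", "positive"]))  # positive
    print(combine_token(["positive", "positive", "O", "O", "negative"]))         # O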

For subtask B, the polarity of the entire sentence was determined based on the majority of the labels. If there was a tie, the sentence was discarded (these are likely to be controversial cases). In order to reduce the number of rejected sentences, we combined the objective and the neutral labels, which Turkers tended to mix up.
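The message-level adjudication can likewise be sketched in a few lines; the label names used below are assumptions.

    # Sketch: objective and neutral votes are merged, the majority label wins,
    # and a message with a tied vote is discarded (returned as None here).
    from collections import Counter

    def message_polarity(votes):
        merged = ["neutral" if v in ("neutral", "objective") else v for v in votes]
        ranked = Counter(merged).most_common()
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            return None                                   # tie: discard the message
        return ranked[0][0]

    print(message_polarity(["positive", "positive", "objective", "neutral", "positive"]))  # positive
    print(message_polarity(["positive", "positive", "negative", "negative", "neutral"]))   # None (tie)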
Table 2

Example of a sentence annotated for subjectivity on Mechanical Turk

Annotation                                                                      Accuracy
I would love to watch Vampire Diaries :) and some Heroes! Great combination      9/13
I would love to watch Vampire Diaries :) and some Heroes! Great combination     11/13
I would love to watch Vampire Diaries :) and some Heroes! Great combination     10/13
I would love to watch Vampire Diaries :) and some Heroes! Great combination     13/13
I would love to watch Vampire Diaries :) and some Heroes! Great combination     12/13
I would love to watch Vampire Diaries :) and some Heroes! Great combination       –

The words and the phrases that were marked as subjective are italicized and highlighted in bold. The first five rows show annotations provided by the Turkers, and the final row shows their intersection. The final column shows the accuracy of each annotation with respect to the intersection in the final row

For the sarcastic tweets, we slightly altered the annotation task. The tweets were shown to the Turkers without the #sarcasm hashtag, and the Turkers were asked to determine whether the tweet was sarcastic on its own. Furthermore, the Turkers had to indicate the degree of sarcasm as (a) definitely sarcastic, (b) probably sarcastic, or (c) not sarcastic. Although we do not use the degree of sarcasm at this time, it could be useful for analysis, as well as for excluding tweets that do not appear to be sarcastic. For the SMS and the LiveJournal messages, the annotation task was the same as for tweets, but without the annotations for sarcasm.

The obtained annotations were used as gold labels for the corresponding subtasks. Consecutive tokens marked as subjective serve as target terms in subtask A. The statistics for all datasets are shown in Tables 3 and 4 for subtasks A and B, respectively. Each dataset is marked with the year of the SemEval edition it was produced for. An annotated example from each source is shown in Table 5.

When building a system to solve a task, it is good to know how well we should expect it to perform. One good reference point is human performance and agreement between annotators. Unfortunately, as we derive annotations by agreement, we cannot calculate standard statistics such as Kappa directly. Instead, we decided to measure the agreement between our gold standard annotations (derived by agreement) and the annotations proposed by the best Turker, the worst Turker, and the average Turker (with respect to the gold/consensus annotation for a particular message). Given a HIT, we just calculate the overlaps as shown in the last column in Table 2, and then we calculate the best, the worst, and the average, which are respectively 13/13, 9/13 and 11/13, in the example. Finally, we average these statistics over all HITs that contributed to a given dataset, to produce lower, average, and upper averages for that dataset. The accuracy (with respect to the gold/consensus annotation) for different averages is shown in Table 6. Since the overall polarity of a message is chosen based on majority, the upper bound for subtask B is 100 %. These averages give a good indication about how well we can expect the systems to perform. For example, we can see that even if we used the best annotator for each HIT, it would still not be possible to get perfect accuracy, and thus we should also not expect perfect accuracy for an automatic system.
Table 3

Dataset statistics for subtask A

Corpus                  Positive   Negative   Objective/Neutral   Total
Twitter2013-train       5895       3131        471                9497
Twitter2013-dev          648        430         57                1135
Twitter2013-test        2734       1541        160                4435
SMS2013-test            1071       1104        159                2334
Twitter2014-test        1807        578         88                2473
Twitter2014-sarcasm       82         37          5                 124
LiveJournal2014-test     660        511        144                1315

Table 4

Dataset statistics for subtask B

Corpus                  Positive   Negative   Objective/Neutral   Total
Twitter2013-train       3662       1466       4600                9728
Twitter2013-dev          575        340        739                1654
Twitter2013-test        1572        601       1640                3813
SMS2013-test             492        394       1207                2093
Twitter2014-test         982        202        669                1853
Twitter2014-sarcasm       33         40         13                  86
LiveJournal2014-test     427        304        411                1142

Table 5

Example annotations for each source of messages

Source            Message                                                                                                                                    Message-level polarity
Twitter           Why would you [still]- wear shorts when it’s this cold?! I [love]+ how Britain see’s a bit of sun and they’re [like ’OOOH]+ LET’S STRIP!’  Positive
SMS               [Sorry]- I think tonight [cannot]- and I [not feeling well]- after my rest.                                                                Negative
LiveJournal       [Cool]+ posts , dude ; very [colorful]+ , and [artsy]+                                                                                     Positive
Twitter Sarcasm   [Thanks]+ manager for putting me on the schedule for Sunday                                                                                Negative

The target terms (i.e., subjective phrases) are marked in [...], and are followed by their polarity (subtask A); the message-level polarity is shown in the last column (subtask B)

Table 6

Average (over all HITs) overlap of the gold annotations with the worst, the average, and the best Turker for each HIT, for subtasks A and B

                          Subtask A                        Subtask B
Corpus                    Lower      Avg.       Upper      Avg.
Twitter2013-train         75.1       89.7       97.9       77.6
Twitter2013-dev           66.6       85.3       97.1       86.4
Twitter2013-test          76.8       90.3       98.0       75.9
SMS2013-test              75.9       89.6       97.5       77.5
LiveJournal2014-test      61.7       82.3       94.5       76.2
Twitter2014-test          75.3       88.9       97.5       74.7
Twitter2014-sarcasm       62.6       83.1       95.6       71.2

3.3 Tweet delivery

Due to Twitter’s terms of service, we could not deliver the annotated tweets to the participants directly. Instead, we released annotation indexes and labels, a list of corresponding Twitter IDs, and a download script8 that extracts the corresponding tweets via the Twitter API.

As a result, the task participants had access to a different number of training tweets depending on when they did the downloading,9 as over time some tweets were deleted. Another major reason for tweet unavailability was Twitter users changing the status of their accounts from public to private. Note that this status change goes in both directions and can happen frequently; thus, some task participants could actually download more tweets by trying several times on different dates.

4 Scoring

The participating systems were required to perform a three-way classification for both subtasks. A particular marked phrase (for subtask A) or an entire message (for subtask B) was to be classified as positive, negative or objective/neutral. We evaluated the systems by computing a score for predicting positive/negative phrases/messages. For instance, to compute positive precision, \(P_{pos}\), we find the number of phrases/messages that a system correctly predicted to be positive, and we divide that number by the total number it predicted to be positive. To compute positive recall, \(R_{pos}\), we find the number of phrases/messages correctly predicted to be positive and we divide that number by the total number of positives in the gold standard. We then calculate the \(F_1\)-score for the positive class as follows: \(F_{pos}=\frac{2 P_{pos} R_{pos}}{P_{pos} + R_{pos}}\). We carry out similar computations for the negative phrases/messages to obtain \(F_{neg}\). The overall score is then the average of the \(F_1\)-scores for the positive and negative classes: \(F=(F_{pos}+F_{neg})/2\).

We provided the participants with a scorer that outputs the overall score F, as well as P, R, and \(F_1\) scores for each class (positive, negative, neutral) and for each test set.
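A minimal re-implementation of this scoring scheme is sketched below; the input format and tie handling of the official scorer may differ.

    # Sketch of the official metric: the F1-scores of the positive and the
    # negative class are computed from the predictions, and the overall score is
    # their average; the neutral class does not enter the final score.
    def f1_for_class(gold, pred, cls):
        tp = sum(1 for g, p in zip(gold, pred) if g == cls and p == cls)
        n_pred = sum(1 for p in pred if p == cls)
        n_gold = sum(1 for g in gold if g == cls)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_gold if n_gold else 0.0
        return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    def overall_score(gold, pred):
        return (f1_for_class(gold, pred, "positive") + f1_for_class(gold, pred, "negative")) / 2

    gold = ["positive", "negative", "neutral", "positive"]
    pred = ["positive", "neutral", "neutral", "negative"]
    print(round(overall_score(gold, pred), 4))  # 0.3333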

5 Participants and results

In the first edition of the task (SemEval-2013), there were 28 submissions by 23 teams for subtask A, and 51 submissions by 38 teams for subtask B; a total of 44 teams took part in the task overall. In the second year (SemEval-2014), the task again attracted a high number of participants: there were 27 submissions by 21 teams for subtask A, and 50 submissions by 44 teams for subtask B, a total of 46 different teams.10 Eighteen teams participated in both years.

Most of the submissions were constrained, with just a few unconstrained ones; notably, the best systems in both years were constrained. Some teams participated with both a constrained and an unconstrained system, but the unconstrained system was not always better than the constrained one. There was a single ranking, which included both constrained and unconstrained systems, where the latter were marked accordingly.

5.1 Systems

Algorithms. In both years, most systems were supervised and used a variety of handcrafted features derived from n-grams, stems, punctuation, part-of-speech (POS) tags, and Twitter-specific encodings such as emoticons, hashtags, and abbreviations. The most popular classifiers included support vector machines (SVM), Maximum Entropy (MaxEnt), and Naïve Bayes.

Notably, only one of the top-performing systems in 2013, teragram (Reckman et al. 2013) (SAS Institute, USA), was entirely rule-based, relying fully on hand-written rules. We should also mention the emerging and quite promising approach of applying deep learning, as exemplified by the top-performing SemEval-2014 teams coooolll (Tang et al. 2014) (Harbin Institute of Technology and Microsoft Research China) and ThinkPositive (dos Santos 2014) (IBM Research Brazil).11

Preprocessing. In addition to standard NLP steps such as tokenization, stemming, lemmatization, stop-word removal and POS tagging, most teams applied some kind of Twitter-specific processing such as substitution/removal of URLs, substitution of emoticons, spelling correction, word normalization, abbreviation lookup, and punctuation removal. Several teams reported using Twitter-tuned NLP tools such as POS and named entity taggers (Gimpel et al. 2011; Ritter et al. 2011).

External lexical resources. Many systems relied heavily on existing sentiment lexicons. Sentiment lexicons are lists of words (and sometimes phrases) with prior associations to positive, negative, and sometimes neutral sentiment. Some lexicons provide a real-valued or a discrete sentiment score for a term to indicate its intensity. Most of the lexicons that were created by manual annotation tend to be domain-independent and include a few thousand terms, but larger lexicons can be built automatically or semi-automatically. The most popular lexicons used by participants in both years included the manually created MPQA Subjectivity Lexicon (Wilson et al. 2005), Bing Liu’s Lexicon (Hu and Liu 2004), as well as the automatically created SentiWordNet (Baccianella et al. 2010). The winning team at SemEval-2013, NRC-Canada (Mohammad et al. 2013), reported huge gains from their automatically created high-coverage tweet-specific sentiment lexicons (Hashtag Sentiment Lexicon and Sentiment140 lexicon).12 They also used the NRC Emotion Lexicon (Mohammad and Turney 2010, 2013) and the Bing Liu Lexicon (Hu and Liu 2004). The NRC lexicons were released to the community, and were used by many teams in the subsequent editions of the SemEval Twitter sentiment task.

In addition to using sentiment lexicons, many top-performing systems used word representations built from large external collections of tweets or other corpora. Such representations serve to reduce the sparseness of the word space. Two general approaches for building word representations are word clustering and word embeddings. The Brown clustering algorithm (Brown et al. 1992) groups syntactically or semantically close words in a hierarchy of clusters. The CMU Twitter NLP tool provides word clusters produced with the Brown clustering algorithm on 56 million English-language tweets. Recently, several deep learning algorithms have been proposed to build continuous dense word representations, called word embeddings (Collobert et al. 2011; Mikolov et al. 2013). Similar to word clusters, syntactically or semantically close words should have similar embedding vectors. The pre-trained word embeddings are publicly available,13 but they were generated from news articles. Therefore, some teams chose to train their own word embeddings on tweets using the available software word2vec (Mikolov et al. 2013).
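As an illustration of the last point, the snippet below trains skip-gram embeddings on a toy tweet corpus with gensim’s word2vec implementation; gensim and all hyper-parameter values here are assumptions made for the sketch (participants typically used the original word2vec tool and far larger tweet collections).

    # Sketch: training word embeddings on (pre-tokenized) tweets with gensim.
    from gensim.models import Word2Vec

    tweets = [
        ["i", "love", "this", "phone"],
        ["worst", "service", "ever"],
        ["new", "album", "out", "today"],
    ]

    model = Word2Vec(
        sentences=tweets,
        vector_size=100,   # embedding dimensionality (gensim >= 4; 'size' in older versions)
        window=5,
        min_count=1,       # keep rare words in this tiny toy corpus
        sg=1,              # skip-gram
        epochs=5,
    )
    vector = model.wv["love"]                        # 100-dimensional vector for "love"
    similar = model.wv.most_similar("love", topn=3)  # nearest neighbors in embedding space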

The coooolll team (Tang et al. 2014) (Harbin Institute of Technology and Microsoft Research China) went one step further and produced sentiment-specific word embeddings. They extended the neural network C&W model (Collobert et al. 2011) to incorporate the sentiment information on sentences and modified the loss function to be a linear combination of syntactic loss and sentiment loss. Similarly, at SemEval-2015, the UNITN team (Severyn and Moschitti 2015) used an unsupervised neural language model to initialize word embeddings that they further tuned by a deep learning model using a separate corpus and distant supervision; they then continued training in a supervised way on the SemEval data.

Further details on individual systems can be found in the proceedings of SemEval-2013 (Manandhar and Yuret 2013), SemEval-2014 (Nakov and Zesch 2014), and SemEval-2015 (Nakov et al. 2015).

5.2 Baselines

There are several baselines that one might consider for this task, and below we will explore some of the most interesting ones.

Majority class. This baseline always predicts the most frequent class as observed on the training dataset. As our official evaluation metric is an average of the F-score for the positive and for the negative classes, it makes sense to consider these two classes only. For our training dataset, this baseline predicts the positive class for both subtasks A and B as it is more frequent for both subtasks.

Target’s majority class. This baseline is only applicable to subtask A. For that subtask, we can calculate the majority class for individual target terms. If a target term (a word or a phrase) from the test set occurs as a target in the training dataset, this baseline predicts the most frequent class for that term. If the frequencies are tied between two classes, the priority order positive, negative, neutral is used to break the tie. For example, if a term appears the same number of times as positive and as negative in the training dataset, we predict the positive class for the term. If a target term does not occur in the training data, we predict the most frequent class from the entire training dataset, i.e., the positive class.
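A sketch of this baseline follows; the data format (a list of target/label pairs) and the lower-casing of targets are illustrative assumptions.

    # Sketch of the target's-majority-class baseline for subtask A: predict the
    # most frequent training label of the target term, breaking ties in the
    # order positive > negative > neutral, and back off to the global majority.
    from collections import Counter, defaultdict

    PRIORITY = {"positive": 0, "negative": 1, "neutral": 2}

    def build_baseline(train_targets):
        """train_targets: list of (target_term, label) pairs from the training set."""
        per_target = defaultdict(Counter)
        overall = Counter()
        for term, label in train_targets:
            per_target[term.lower()][label] += 1
            overall[label] += 1
        global_majority = max(overall, key=lambda l: (overall[l], -PRIORITY[l]))

        def predict(term):
            counts = per_target.get(term.lower())
            if not counts:
                return global_majority                  # unseen target
            return max(counts, key=lambda l: (counts[l], -PRIORITY[l]))

        return predict

    predict = build_baseline([("gr8", "positive"), ("gr8", "negative"),
                              ("gr8", "positive"), ("meh", "negative")])
    print(predict("gr8"), predict("unseen term"))  # positive positive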

Lexicon-based. We add up the scores for lexicon words or phrases matched in the target term (for subtask A) or in the entire message (for subtask B), and we predict a positive class if the cumulative sum is greater than zero, a neutral class if it is zero, and a negative class if it is \(<0\). If no matches are found, we predict neutral. We calculate this baseline using three different sentiment lexicons: MPQA Subjectivity Lexicon, Bing Liu’s Lexicon, and SentiWordNet 3.0. We use a score of 1 for a positive entry and a score of \(-1\) for a negative entry in the MPQA and Bing Liu’s lexicons. As SentiWordNet has a real-valued positive score and a real-valued negative score assigned to a word sense, for it we average positive and negative scores over all senses of a word and we subtract the average negative score from the average positive score to get the final sentiment score for the target word or phrase.
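A sketch with a toy lexicon is shown below; the entries stand in for MPQA or Bing Liu’s +1/−1 scores, and the SentiWordNet sense-averaging step is omitted for brevity.

    # Sketch of the lexicon-based baseline: sum the prior scores of matched
    # lexicon entries and predict by the sign of the sum (neutral when the sum
    # is zero or when no entry matches).
    TOY_LEXICON = {"love": 1.0, "great": 1.0, "hate": -1.0, "terrible": -1.0}

    def lexicon_baseline(tokens, lexicon=TOY_LEXICON):
        score = sum(lexicon.get(tok.lower(), 0.0) for tok in tokens)
        if score > 0:
            return "positive"
        if score < 0:
            return "negative"
        return "neutral"

    print(lexicon_baseline("I love this great but terrible show".split()))  # positive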

SVM unigrams. This is a more sophisticated baseline, which trains a Support Vector Machine classifier on the training dataset, using unigrams as features. In the experiments, we used the LibSVM package (Chang and Lin 2011) with a linear kernel and a value of the C parameter that we optimized on the development dataset.

SVM unigrams+bigrams. This baseline is similar to the previous one, with the exception that the feature set now includes unigrams and bigrams.
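The two SVM baselines can be sketched with scikit-learn, whose SVC class wraps the same LibSVM library; the tiny training set and the C value below are placeholders (in the task, C was tuned on the development data), and switching ngram_range to (1, 2) gives the unigrams+bigrams variant.

    # Sketch of the SVM unigrams baseline.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.pipeline import Pipeline
    from sklearn.svm import SVC

    train_texts = ["I love this movie", "Horrible service, never again", "Meeting at noon"]
    train_labels = ["positive", "negative", "neutral"]

    baseline = Pipeline([
        ("unigrams", CountVectorizer(ngram_range=(1, 1))),  # (1, 2) for unigrams+bigrams
        ("svm", SVC(kernel="linear", C=1.0)),               # C would be tuned on the dev set
    ])
    baseline.fit(train_texts, train_labels)
    print(baseline.predict(["I love meetings"]))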
Table 7

The macro-averaged F-scores for different baselines

                               2013                    2014
Baseline                       Tweet      SMS          Tweet      Tweet sarcasm   LiveJournal
Subtask A
  Majority class               38.10      31.50        42.22      39.81           33.42
  Target’s majority class      71.62      68.60        72.52      60.86           55.33
  Lexicon-based
    MPQA                       55.57      54.43        50.69      48.22           59.38
    Bing Liu’s                 58.88      49.89        50.07      45.72           59.21
    SentiWordNet               64.16      68.69        60.38      50.41           73.44
  SVM unigrams                 83.56      81.50        80.57      77.33           78.78
  SVM unigrams+bigrams         83.82      81.71        81.03      76.95           77.96
Subtask B
  Majority class               29.19      19.03        34.64      27.73           27.21
  Lexicon-based
    MPQA                       46.21      47.17        46.09      33.68           55.49
    Bing Liu’s                 53.59      53.32        49.96      31.67           61.09
    SentiWordNet               45.43      43.55        44.85      46.66           56.49
  SVM unigrams                 56.95      54.21        58.58      47.71           59.47
  SVM unigrams+bigrams         57.59      53.81        58.14      48.40           57.47

Table 7 shows the macro-averaged F-scores for different baselines. First, note that for almost all baselines the scores for subtask A are substantially higher than the corresponding scores for subtask B. Second, we can see that for subtask A the Target’s Majority Class baseline and the SVM unigrams baseline achieve remarkable results: by simply predicting a target’s majority class one can obtain F-scores in the low seventies, and by training an SVM model with only unigram features one can get F-scores in the low eighties. For comparison, for subtask B the SVM unigrams baseline only achieves F-scores in the fifties. We explore the differences between the subtasks and the corresponding datasets in more detail in Sect. 6.7.

Overall, for both subtasks, statistical machine learning yields stronger baseline results than simple lexicon-based methods. Therefore, it is not surprising that most participants relied on statistical learning from the training dataset and used sentiment lexicons to obtain additional features.

Finally, note that most baselines perform badly on sarcastic tweets, even though the Majority Class baseline score on this dataset does not significantly differ from the corresponding scores on the other test datasets.

5.3 Results

Table 8

Results for subtask A

#    System                  2013: Progress             2014: Official
                             Tweet        SMS           Tweet        Tweet sarcasm   LiveJournal
1    NRC-Canada              90.14 (1)    88.03 (4)     86.63 (1)    77.13 (5)       85.49 (2)
2    SentiKLUE               90.11 (2)    85.16 (8)     84.83 (2)    79.32 (3)       85.61 (1)
3    CMUQ-Hybrid             88.94 (4)    87.98 (5)     84.40 (3)    76.99 (6)       84.21 (3)
4    CMU-Qatar               89.85 (3)    88.08 (3)     83.45 (4)    78.07 (4)       83.89 (5)
5    ECNU (*)                87.29 (6)    89.26 (2)     82.93 (5)    73.71 (8)       81.69 (7)
6    ECNU                    87.28 (7)    89.31 (1)     82.67 (6)    73.71 (9)       81.67 (8)
7    Think_Positive (*)      88.06 (5)    87.65 (6)     82.05 (7)    76.74 (7)       80.90 (12)
8    Kea                     84.83 (10)   84.14 (10)    81.22 (8)    65.94 (17)      81.16 (11)
9    Lt_3                    86.28 (8)    85.26 (7)     81.02 (9)    70.76 (13)      80.44 (13)
10   senti.ue                84.05 (11)   78.72 (16)    80.54 (10)   82.75 (1)       81.90 (6)
11   LyS                     85.69 (9)    81.44 (12)    79.92 (11)   71.67 (10)      83.95 (4)
12   UKPDIPF                 80.45 (15)   79.05 (14)    79.67 (12)   65.63 (18)      81.42 (9)
13   UKPDIPF (*)             80.45 (16)   79.05 (15)    79.67 (13)   65.63 (19)      81.42 (10)
14   TJP                     81.13 (14)   84.41 (9)     79.30 (14)   71.20 (12)      78.27 (15)
15   SAP-RI                  80.32 (17)   80.26 (13)    77.26 (15)   70.64 (14)      77.68 (18)
16   senti.ue (*)            83.80 (12)   82.93 (11)    77.07 (16)   80.02 (2)       79.70 (14)
17   SAIL                    78.47 (18)   74.46 (20)    76.89 (17)   65.56 (20)      70.62 (22)
18   columbia_nlp            81.50 (13)   74.55 (19)    76.54 (18)   61.76 (22)      78.19 (16)
19   IIT-Patna               76.54 (20)   75.99 (18)    76.43 (19)   71.43 (11)      77.99 (17)
20   Citius (*)              76.59 (19)   69.31 (21)    75.21 (20)   68.40 (15)      75.82 (20)
21   Citius                  74.71 (21)   61.44 (25)    73.03 (21)   65.18 (21)      71.64 (21)
22   IITPatna                70.91 (23)   77.04 (17)    72.25 (22)   66.32 (16)      76.03 (19)
23   SU-sentilab             74.34 (22)   62.58 (24)    68.26 (23)   53.31 (25)      69.53 (23)
24   Univ. Warwick           62.25 (26)   60.12 (26)    67.28 (24)   58.08 (24)      64.89 (25)
25   Univ. Warwick (*)       64.91 (25)   63.01 (23)    67.17 (25)   60.59 (23)      67.46 (24)
26   DAEDALUS                67.42 (24)   63.92 (22)    60.98 (26)   45.27 (27)      61.01 (26)
27   DAEDALUS (*)            61.95 (27)   55.97 (27)    58.11 (27)   49.19 (26)      58.65 (27)

The systems are sorted by their score on the Twitter2014 test dataset; the ranking on each individual dataset is indicated in parentheses

The (*) indicates an unconstrained submission

Table 9

Results for subtask B

#    System                  2013: Progress             2014: Official
                             Tweet        SMS           Tweet        Tweet sarcasm   LiveJournal
1    TeamX                   72.12 (1)    57.36 (26)    70.96 (1)    56.50 (3)       69.44 (15)
2    coooolll                70.40 (3)    67.68 (2)     70.14 (2)    46.66 (24)      72.90 (5)
3    RTRGO                   69.10 (5)    67.51 (3)     69.95 (3)    47.09 (23)      72.20 (6)
4    NRC-Canada              70.75 (2)    70.28 (1)     69.85 (4)    58.16 (1)       74.84 (1)
5    TUGAS                   65.64 (13)   62.77 (11)    69.00 (5)    52.87 (12)      69.79 (13)
6    CISUC_KIS               67.56 (8)    65.90 (6)     67.95 (6)    55.49 (5)       74.46 (2)
7    SAIL                    66.80 (11)   56.98 (28)    67.77 (7)    57.26 (2)       69.34 (17)
8    SWISS-CHOCOLATE         64.81 (18)   66.43 (5)     67.54 (8)    49.46 (16)      73.25 (4)
9    Synalp-Empathic         63.65 (23)   62.54 (12)    67.43 (9)    51.06 (15)      71.75 (9)
10   Think_Positive (*)      68.15 (7)    63.20 (9)     67.04 (10)   47.85 (21)      66.96 (24)
11   SentiKLUE               69.06 (6)    67.40 (4)     67.02 (11)   43.36 (30)      73.99 (3)
12   JOINT_FORCES (*)        66.61 (12)   62.20 (13)    66.79 (12)   45.40 (26)      70.02 (12)
13   AMI_ERIC                70.09 (4)    60.29 (20)    66.55 (13)   48.19 (20)      65.32 (26)
14   AUEB                    63.92 (21)   64.32 (8)     66.38 (14)   56.16 (4)       70.75 (11)
15   CMU-Qatar               65.11 (17)   62.95 (10)    65.53 (15)   40.52 (38)      65.63 (25)
16   Lt_3                    65.56 (14)   64.78 (7)     65.47 (16)   47.76 (22)      68.56 (20)
17   columbia_nlp            64.60 (19)   59.84 (21)    65.42 (17)   40.02 (40)      68.79 (19)
18   LyS                     66.92 (10)   60.45 (19)    64.92 (18)   42.40 (33)      69.79 (14)
19   NILC_USP                65.39 (15)   61.35 (16)    63.94 (19)   42.06 (34)      69.02 (18)
20   senti.ue                67.34 (9)    59.34 (23)    63.81 (20)   55.31 (6)       71.39 (10)
21   UKPDIPF                 60.65 (29)   60.56 (17)    63.77 (21)   54.59 (7)       71.92 (7)
22   UKPDIPF (*)             60.65 (30)   60.56 (18)    63.77 (22)   54.59 (8)       71.92 (8)
23   SU-FMI                  60.96 (28)   61.67 (15)    63.62 (23)   48.34 (19)      68.24 (21)
24   ECNU                    62.31 (27)   59.75 (22)    63.17 (24)   51.43 (14)      69.44 (16)
25   ECNU (*)                63.72 (22)   56.73 (29)    63.04 (25)   49.33 (17)      64.08 (31)
26   Rapanakis               58.52 (32)   54.02 (35)    63.01 (26)   44.69 (27)      59.71 (37)
27   Citius (*)              63.25 (24)   58.28 (24)    62.94 (27)   46.13 (25)      64.54 (29)
28   CMUQ-Hybrid             63.22 (25)   61.75 (14)    62.71 (28)   40.95 (37)      65.14 (27)
29   Citius                  62.53 (26)   57.69 (25)    61.92 (29)   41.00 (36)      62.40 (33)
30   KUNLPLab                58.12 (33)   55.89 (31)    61.72 (30)   44.60 (28)      63.77 (32)
31   senti.ue (*)            65.21 (16)   56.16 (30)    61.47 (31)   54.09 (9)       68.08 (22)
32   UPV-ELiRF               63.97 (20)   55.36 (33)    59.33 (32)   37.46 (42)      64.11 (30)
33   USP_Biocom              58.05 (34)   53.57 (36)    59.21 (33)   43.56 (29)      67.80 (23)
34   DAEDALUS (*)            58.94 (31)   54.96 (34)    57.64 (34)   35.26 (44)      60.99 (35)
35   IIT-Patna               52.58 (40)   51.96 (37)    57.25 (35)   41.33 (35)      60.39 (36)
36   DejaVu                  57.43 (36)   55.57 (32)    57.02 (36)   42.46 (32)      64.69 (28)
37   GPLSI                   57.49 (35)   46.63 (42)    56.06 (37)   53.90 (10)      57.32 (41)
38   BUAP                    56.85 (37)   44.27 (44)    55.76 (38)   51.52 (13)      53.94 (44)
39   SAP-RI                  50.18 (44)   49.00 (41)    55.47 (39)   48.64 (18)      57.86 (40)
40   UMCC_DLSI_Sem           51.96 (41)   50.01 (38)    55.40 (40)   42.76 (31)      53.12 (45)
41   IBM_EG                  54.51 (38)   46.62 (43)    52.26 (41)   34.14 (46)      59.24 (38)
42   Alberta                 53.85 (39)   49.05 (40)    52.06 (42)   40.40 (39)      52.38 (46)
43   lsis_lif                46.38 (46)   38.56 (47)    52.02 (43)   34.64 (45)      61.09 (34)
44   SU-sentilab             50.17 (45)   49.60 (39)    49.52 (44)   31.49 (47)      55.11 (42)
45   SINAI                   50.59 (42)   57.34 (27)    49.50 (45)   31.15 (49)      58.33 (39)
46   IITPatna                50.32 (43)   40.56 (46)    48.22 (46)   36.73 (43)      54.68 (43)
47   Univ. Warwick           39.17 (48)   29.50 (49)    45.56 (47)   39.77 (41)      39.60 (49)
48   UMCC_DLSI_Graph         43.24 (47)   36.66 (48)    45.49 (48)   53.15 (11)      47.81 (47)
49   Univ. Warwick (*)       34.23 (50)   24.63 (50)    45.11 (49)   31.40 (48)      29.34 (50)
50   DAEDALUS                36.57 (49)   40.86 (45)    33.03 (50)   28.96 (50)      40.83 (48)

The systems are sorted by their score on the Twitter2014 test dataset; the ranking on each individual dataset is indicated in parentheses

The (*) indicates an unconstrained submission

The results for the 2014 edition of the task are shown in Tables 8 and 9, and the corresponding team affiliations are shown in Table 11. The tables show results on the two progress test datasets (tweets and SMS messages), which are the official test datasets from the 2013 edition of the task, and on the three official 2014 test sets (tweets, tweets with sarcasm, and LiveJournal). There is an index for each result showing its relative rank within the respective column. The systems are ranked by their score on the Twitter-2014 test set, which is the official ranking for the task; all remaining rankings are secondary.

5.3.1 Subtask A

Table 8 shows the results for subtask A, which attracted 27 submissions from 21 teams at SemEval-2014. There were seven unconstrained submissions: five teams submitted both a constrained and an unconstrained run, and two teams submitted an unconstrained run only. The best systems were constrained.

Comparing Table 8 to Table 7, we can see that all participating systems outperformed the Majority Class baseline by a sizable margin. However, some systems could not beat the Target’s Majority Class baseline, and most systems could not compete against the SVM-based baselines.

5.3.2 Subtask B

The results for subtask B are shown in Table 9. The subtask attracted 50 submissions from 44 teams at SemEval-2014. There were eight unconstrained submissions: six teams submitted both a constrained and an unconstrained run, and two teams submitted an unconstrained run only. As for subtask A, the best systems were constrained.

Comparing Table 9 to Table 7, we see that almost all participating systems outperformed the Majority Class baseline, but some ended up performing slightly lower on some of the datasets. Moreover, several systems could not beat the remaining stronger baselines; in particular, about a third of the systems could not compete against the SVM-based baselines.

6 Analysis

In this section, we analyze the results from several perspectives. In particular, we discuss the progress over the first 2 years of the SemEval task, the system independence of the training domain, the need for external lexical resources, the impact of different techniques for handling negation and context, and the differences between the two subtasks.

6.1 Progress over the first 2 years

As Table 11 shows, 18 of the 46 teams in 2014 had also participated in the 2013 edition of the task. Comparing the results on the progress Twitter test dataset (Nakov et al. 2013; Rosenthal et al. 2014), we can see that NRC-Canada, the 2013 winner for subtask A, improved their F-score from 88.93 (Mohammad et al. 2013) to 90.14 (Zhu et al. 2014b), which is the 2014 best. The best score on the progress SMS test dataset in 2014, 89.31, belongs to ECNU (Zhao et al. 2014); this is a big jump compared to their 2013 score of 76.69 (Tiantian et al. 2013), but it is lower than the 2013 best of 88.37 achieved by GU-MLT-LT (Günther and Furrer 2013).

For subtask B, on the Twitter progress test dataset, the 2013 winner, NRC-Canada, improves their 2013 result from 69.02 (Mohammad et al. 2013) to 70.75 (Zhu et al. 2014b), which is the second best in 2014; the winner in 2014, TeamX, achieves 72.12 (Miura et al. 2014). On the SMS progress test set, the 2013 winner, NRC-Canada, improves its F-score from 68.46 to 70.28. Overall, we see consistent improvements on the progress test datasets for both subtasks: 0–1 and 2–3 points absolute for subtasks A and B, respectively.

For both subtasks, the best systems on the Twitter2014-test dataset are those that performed best on the progress Twitter2013-test dataset: NRC-Canada for subtask A, and TeamX (Fuji Xerox Co., Ltd.) for subtask B. However, the best results on Twitter2014-test are substantially lower than those on Twitter2013-test for both subtask A (86.63 vs. 90.14) and subtask B (70.96 vs. 72.12). This is so despite the Majority Class baselines for Twitter2014-test being higher than those for Twitter2013-test: 42.2 versus 38.1 for subtask A, and 34.6 versus 29.2 for subtask B. Most likely, having had access to Twitter2013-test at development time, teams overfitted on it. It could also be the case that some of the sentiment lexicons that were built in 2013 had become somewhat outdated by 2014.

6.2 Performance on out-of-domain data

All participating systems were trained on tweets only. No training data were provided for the other test domains, SMS and blogs, nor were there training data for sarcastic tweets. Some teams, such as NRC-Canada, performed well across all test datasets. Surprisingly, on the out-of-domain test datasets they were able to achieve results comparable to those they obtained on tweets, or even better. Other teams, such as TeamX, chose to tune a weighting scheme specifically for class imbalances in tweets and, as a result, were only strong on Twitter datasets.

The Twitter2014-sarcasm dataset turned out to be the most challenging test dataset for most of the participants in both subtasks. For most systems, the difference in performance between general and sarcastic tweets was 5–10 points for subtask A and 10–20 points for subtask B.

6.3 Impact of training data size

As we mentioned above, due to Twitter’s terms of service, we could not deliver the annotated tweets to the participants directly, and they had to download them on their own, which caused problems as at different times different subsets of the tweets could be downloaded. Thus, task participants had access to different number of training tweets depending on when they did the downloading.

To give some statistics, in the 2014 edition of the task, the number of tweets that participants could download and use for subtask B varied between 5215 and 10,882. On average, the teams were able to collect close to 9000 tweets; teams that did not participate in 2013, and thus had to download the data later, could download about 8500 tweets.

The difference in training data size did not seem to have had a major impact. In fact, the top two teams in subtask B in 2014 [coooolll (Tang et al. 2014) and TeamX (Miura et al. 2014)] used \(<8500\) tweets for training.

6.4 Use of external resources

The participating systems were allowed to make use of external resources. As described in Sect. 2, a submission that directly used additional labeled data as part of the training dataset was considered unconstrained. In both 2013 and 2014, there were cases of a team submitting both a constrained and an unconstrained run, with the constrained run performing better. It is unclear why unconstrained systems did not always outperform the corresponding constrained ones; it could be because the participants did not use enough external data, or because the external data differed too much from our datasets in terms of domain or annotation scheme.

Several teams chose to use external (weakly) labeled tweet data indirectly, by creating sentiment lexicons or sentiment word representations, e.g., sentiment word embeddings. This approach allowed the systems to qualify as constrained, but it also offered some further benefits. First, it made it possible to incorporate large amounts of noisily labeled data quickly and efficiently. Second, the classification systems were robust to the introduced noise because the noisy data were incorporated not directly as training instances but indirectly as features. Third, the generated sentiment resources could be easily distributed to the research community and used in other applications and domains (Kiritchenko et al. 2014a).

These newly built sentiment resources, which leveraged large collections of tweets, yielded large performance gains and secured top ranks for the teams that made use of them. For example, NRC-Canada reported 2 and 6.5 points of absolute improvement for subtasks A and B, respectively, from using their tweet-specific sentiment lexicons. On top of that, the coooolll team achieved another 3–4 points of absolute improvement on the tweet test datasets for subtask B thanks to sentiment-specific word embeddings.

Most participants also benefited greatly from the use of existing general-domain sentiment lexicons. Even though the contribution of these lexicons on top of the Twitter-specific resources was usually modest on the Twitter test datasets (1–2 points absolute), the general-domain lexicons were particularly useful on out-of-domain data such as the SMS test dataset, where their use resulted in gains of up to 3.5 points absolute for some participants.

Similarly, general-domain word representations, such as word clusters and word embeddings, showed larger gains on the out-of-domain SMS test dataset (1–2 points absolute) than on Twitter test datasets (0.5–1 points absolute).

6.5 Negation handling

Many teams incorporated special handling of negation into their systems. The most popular approach transformed any word that appeared in a negated context by adding a suffix _NEG to it, e.g., good would become good_NEG (Das and Chen 2007; Pang et al. 2002). A negated context was defined as a text span between a negation word, e.g., no, not, shouldn’t, and a punctuation mark or the end of the message.
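A minimal sketch of this transformation is given below; the negation cue list and the punctuation set are illustrative and deliberately incomplete.

    # Sketch of the common negation handling: every token between a negation
    # cue and the next punctuation mark (or the end of the message) gets a
    # _NEG suffix appended.
    import re

    NEGATION_CUES = {"no", "not", "never", "cannot", "shouldn't", "don't"}
    PUNCTUATION = re.compile(r"^[.,:;!?]+$")

    def mark_negation(tokens):
        negated, out = False, []
        for tok in tokens:
            if PUNCTUATION.match(tok):
                negated = False            # negated context ends at punctuation
                out.append(tok)
            elif tok.lower() in NEGATION_CUES or tok.lower().endswith("n't"):
                negated = True
                out.append(tok)
            else:
                out.append(tok + "_NEG" if negated else tok)
        return out

    print(mark_negation("I do n't like this movie , but the cast is good".split()))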

Alternatively, some systems flipped the polarity of sentiment words when they occurred in a negated context, e.g., the positive word good would become negative when negated. The RTRGO team (Günther et al. 2014) reported an improvement of 1.5 points absolute for Subtask B on Twitter data when using both approaches together.

In Zhu et al. (2014a), the authors argued that negation affects different words differently, and that a simple reversing strategy cannot adequately capture this complex behavior. Therefore, they proposed an empirical method to determine the sentiment of words in the presence of negation by creating a separate sentiment lexicon for negated words (Kiritchenko et al. 2014b). Their system, NRC-Canada, achieved 1.5 points of absolute improvement for Subtask A and 2.5 points for Subtask B by using sentiment lexicons generated for affirmative and negated contexts separately.

6.6 Use of context in subtask A

As suggested by the name of subtask A, Contextual Polarity Disambiguation, a model built for this subtask is expected to explore the context around a target term. For example, the top-performing NRC-Canada system used unigrams and bigrams extracted within four words on either side of the target term. The system also extracted additional features from the entire message in the same way as it extracted features from the target terms themselves. The inclusion of these context features resulted in F-score improvements of 4.08 points absolute on Twitter2013-test and 2.41 points on SMS2013-test. The second-best system in 2013, AVAYA (Becker et al. 2013), used dependency parse features such as the paths between the head of the target term and the root of the entire message. Similarly, the third-best BOUNCE system (Kökciyan et al. 2013) used features and words extracted from neighboring target phrases, achieving 6.4 points of absolute improvement on Twitter2013-dev. The fourth-best LVIC-LIMSI system (Marchand et al. 2013) also used the words surrounding the target terms during development, but their effect on the overall performance was not reported. The SentiKLUE system (Evert et al. 2014), second-best in 2014, used context in the form of automatically predicted message-level polarity.
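As a concrete illustration of such context features, the sketch below extracts unigrams and bigrams from a window of four tokens on either side of the target span; the window size follows the NRC-Canada description above, while the feature encoding itself is an assumption.

    # Sketch: n-gram features from a +/-4 token window around the target span.
    def context_ngrams(tokens, target_start, target_end, window=4, max_n=2):
        left = tokens[max(0, target_start - window):target_start]
        right = tokens[target_end:target_end + window]
        feats = []
        for side, ctx in (("L", left), ("R", right)):
            for n in range(1, max_n + 1):
                feats += [side + ":" + "_".join(ctx[i:i + n])
                          for i in range(len(ctx) - n + 1)]
        return feats

    tokens = "I really do not like the new phone case at all".split()
    print(context_ngrams(tokens, target_start=4, target_end=5))  # target: "like"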

6.7 Why subtask A seems easier than subtask B

The performance of the sentiment analysis systems is significantly higher for subtask A than for subtask B. A similar difference can also be observed for many baselines, including the SVM-unigrams baseline. Furthermore, a simple Target’s Majority Class baseline showed surprisingly strong results on subtask A. Thus, we analyzed the data in order to determine why these baselines performed so well for subtask A. We found that 85.1 % of the target words in Twitter2013-test and 88.8 % of those in Twitter2014-test occurred as target tokens in the training data. Moreover, the distribution of occurrences of a target word that has been observed with different polarities is skewed towards one polarity.

Finally, the average ratio of instances pertaining to the dominant polarity of a target term to the total number of instances of that target term is 0.80. (Note that this ratio is calculated for all target words that occurred more than once in the training and in the test datasets.) These observations explain, at least in part, the high overall result and the dominant role of unigrams for subtask A.

We have conducted an experiment to examine the impact of sentiment resources in subtask A in the situation where the test targets would not appear in the training set. For this, we split the Twitter2013-test set into three subsets. In the first subset, “targets fully seen in training”, each instance has a target with the following property: there exist instances in the training data with exactly the same target; this subset comprises 55 % of the test set. In the second subset, “targets partially seen in training”, each instance has a target X with the following property: there exist instances in the training data whose target expression includes one or more, but not all, tokens in X; this subset comprises 31 % of the test set. In the third subset, “targets unseen in training”, each instance has a target X with the following property: there are no instances in the training data whose target includes any of the tokens in X; this subset comprises 14 % of the test set. We then ran the top-performing NRC-Canada system on each of the three subsets (a) using all features, (b) using all but the lexicon features, and (c) using all but the n-gram features. The results are shown in Table 10. We can see that on instances with unseen targets the sentiment lexicons play the most prominent role, yielding a gain of 14.54 points absolute.
Table 10

Subtask A: macro-averaged F-scores for the NRC-Canada system on different subsets of Twitter2013-test with one of the feature groups removed. The number in brackets shows the absolute difference compared to the scores in row (a)

Classifier                   Targets fully seen     Targets partially seen     Targets unseen
                             in training            in training                in training
(a) All features             93.31                  85.42                      84.09
(b) All, but no lexicons     92.96 (−0.35)          81.26 (−4.16)*             69.55 (−14.54)*
(c) All, but no n-grams      89.30 (−4.01)*         81.61 (−3.81)*             80.62 (−3.47)*

Scores marked with a * are statistically significantly different (\(p<.05\)) from the corresponding scores in row (a)

7 Related work

Sentiment analysis has enjoyed a lot of research attention over the last 15 years, especially in sub-areas such as detecting subjective and objective sentences; classifying sentences as positive, negative, or neutral; and more recently, detecting the target of the sentiment. Much of this work focused on customer reviews of products and services, but tweets, Facebook posts, and other social media data are now increasingly being explored. Recent surveys by Pang and Lee (2008) and Liu and Zhang (2012) give detailed summaries of research on sentiment analysis.

Initially, the problem was regarded as standard document classification into topics; e.g., Pang et al. (2002) experimented with various classifiers such as maximum entropy, Naïve Bayes and SVM, using standard features such as unigrams/bigrams, word counts/presence, word position, and POS tags. Around the same time, other researchers realized the importance of external sentiment lexicons; e.g., Turney (2002) proposed an unsupervised approach to learn the sentiment orientation of words/phrases: positive versus negative. Later work studied the linguistic aspects of expressing opinions, evaluations, and speculations (Wiebe et al. 2004), the role of context in determining the sentiment orientation (Wilson et al. 2005), of deeper linguistic processing such as negation handling (Pang and Lee 2008), of finer-grained sentiment distinctions (Pang and Lee 2005), of positional information (Raychev and Nakov 2009), etc. Moreover, it was recognized that in many cases it is crucial to know not just the polarity of the sentiment, but also the topic towards which this sentiment is expressed (Stoyanov and Cardie 2008).

Naturally, most research in sentiment analysis was done for English, and much less effort was devoted to other languages (Abdul-Mageed et al. 2011; Chetviorkin and Loukachevitch 2013; Jovanoski et al. 2015; Kapukaranov and Nakov 2015; Perez-Rosas et al. 2012; Tan and Zhang 2008).

Early sentiment analysis research focused on customer reviews of movies, and later also of hotels, phones, laptops, etc. With the emergence of social media, sentiment analysis in Twitter became a hot research topic. Yet, there was a lack of suitable datasets for training, evaluating, and comparing different systems. This situation changed with the emergence of the SemEval task on Sentiment Analysis in Twitter, which ran in 2013–2015 (Nakov et al. 2013; Rosenthal et al. 2014, 2015). The task created standard datasets of several thousand tweets annotated for sentiment polarity.

In fact, there was an even earlier shared task on sentiment analysis of text: the SemEval-2007 Affective Text Task (Strapparava and Mihalcea 2007). However, it was framed as an unsupervised task where newspaper headlines were to be labeled with eight affect categories—positive and negative sentiment, as well as six emotions (joy, sadness, fear, anger, surprise, and disgust). For each headline–affect category pair, human annotators assigned scores from 0 to 100 indicating how strongly the headline expressed the affect category. In contrast, in our task, we focus on tweets, SMS messages, and blog posts. Moreover, apart from our main subtask on message-level sentiment, we also include a subtask on determining phrase-level sentiment.

Since our 2013 shared task, several other shared tasks have been proposed that further explored various sub-problems in sentiment analysis. We describe them briefly below.

7.1 Aspect-based sentiment analysis

The goal of the SemEval-2014 Task 4 on Aspect-Based Sentiment Analysis (ABSA) was to identify aspect terms and the sentiment towards those aspect terms from customer reviews, where the focus was on two domains: laptops and restaurants (Pontiki et al. 2014).14

For example, a review may gush positively about the lasagna at a restaurant, but complain about the long wait before the food arrived. In the restaurant domain, the aspect terms were further aggregated into coarse categories such as food, service, ambiance, price, and miscellaneous. The goal was to identify these aspect categories and the sentiment expressed towards them.

The ABSA task attracted 32 teams, who contributed 165 submissions. There is substantial overlap in the approaches and resources used by the participants in our task and in the ABSA task. Moreover, one of the top performing systems in our competition, NRC-Canada, also participated in the ABSA task and achieved the best scores in three out of the six subtask-domain combinations, including two out of the three sentiment subtasks (Zhu et al. 2014c). The use of automatically created in-domain sentiment resources proved to be valuable for this task as well. Other useful features were derived from dependency parse trees in order to establish the relationship between aspect terms and sentiment expressions.

There is an ongoing follow-up task, SemEval-2015 Task 12 (Pontiki et al. 2015), which consolidates the subtasks from 2014 into a principled unified framework, where opinion target expressions, aspects and sentiment polarities are linked to each other in tuples. This is arguably useful when generating structured aspect-based opinion summaries from user reviews in real-world applications (e.g., customer review sites). The task is further extended to multiple sentences, and a new domain is added: reviews of hotels. Overall, this follow-up task has attracted 93 submissions by 16 teams.

7.2 Sentiment analysis of figurative language

Social media posts are often teeming with creative and figurative language, rich in irony, sarcasm, and metaphors. The SemEval-2015 Task 11 (Ghosh et al. 2015) on Sentiment Analysis of Figurative Language15 is interested in understanding how this creativity impacts perceived affect. For this purpose, tweets rich in irony, sarcasm, and metaphor were annotated on an 11-point discrete scale from \(-5\) (most negative) to \(+5\) (most positive). The participating systems were asked to predict this human-annotated fine-grained sentiment score, and were evaluated not only on the full dataset, but also separately on irony, sarcasm, and metaphor. One of the goals of the task was to explore how conventional sentiment analysis techniques can be altered to deal with non-literal content.

While our task also had evaluation on sarcastic tweets, for us this was just a separate (arguably harder) test set: we did not focus specifically on sarcasm and we did not provide specific training data for it. In contrast, SemEval-2015 Task 11 was fully dedicated to analyzing figurative language on Twitter (which includes not only sarcasm, but also irony and metaphor); moreover, they used an 11-point scale, while we were interested in predicting three classes. The task has attracted 15 teams, who submitted 29 runs.

7.3 Detecting events and polarity towards events

SemEval-2015 Task 9, CLIPEval Implicit Polarity of Events (Russo et al. 2015), focuses on the implicit sentiment polarity towards events.16 There are two subtasks: the first asks systems to determine the sentiment (positive, negative, or neutral) towards an event instance, while the second requires them to identify both event instantiations and their associated polarity values. The task is based on a dataset of events annotated as instantiations of pleasant and unpleasant events previously collected in psychological research (Lewinsohn and Amenson 1978; MacPhillamy and Lewinsohn 1982). It attracted two teams, who submitted three runs.

7.4 Sentiment analysis of movie reviews

A popular test bed for sentiment analysis systems has been the movie reviews dataset from rottentomatoes.com collected initially by Pang and Lee (2005). State-of-the-art results were obtained on this test set using a recursive deep neural network (Socher et al. 2013): an F-score of 85.4 on detecting review-level polarity (positive or negative). Even though this method does not require any handcrafted features or external semantic knowledge, it relies on extensive phrase-level sentiment annotations during training, which are expensive to acquire for most real-world applications.

Closely comparable results (an F-score of 85.5) were reported using more conventional machine learning techniques and, crucially, large-scale sentiment lexicons generated automatically from tweets (Kiritchenko et al. 2014b).

Finally, there is an ongoing 2015 Kaggle competition Classify the sentiment of sentences from the Rotten Tomatoes dataset, which aims to bring together sentiment analysis systems for fine-grained sentiment analysis of movie reviews.17

8 SemEval-2015 and beyond

8.1 The SemEval-2015 edition of the task

In addition to the two subtasks described above (contextual and message-level polarity), we added three new subtasks18 in 2015. The first two focus on the sentiment towards a given topic in a single tweet or in a set of tweets, respectively. The third new subtask asks systems to determine the strength of prior association of Twitter terms with positive sentiment; this acts as an intrinsic evaluation of automatic methods that build Twitter-specific sentiment lexicons with real-valued sentiment association scores.
  • Topic-based message polarity classification: Given a message and a topic, classify whether the message is of positive, negative, or neutral sentiment towards the given topic.

  • Detecting trends towards a topic: Given a set of messages on a given topic from the same period of time, classify the overall sentiment towards the topic in these messages as (a) strongly positive, (b) weakly positive, (c) neutral, (d) weakly negative, or (e) strongly negative.

  • Determining the strength of Twitter sentiment terms: Given a word or a phrase, propose a score between 0 (lowest) and 1 (highest) that is indicative of the strength of association of that word/phrase with positive sentiment. If a word/phrase is more positive than another one, it should be assigned a relatively higher score. (A minimal sketch of the kind of intrinsic evaluation used for such real-valued scores is given after this list.)
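
As an example of the kind of intrinsic evaluation this third subtask enables, the sketch below compares system-proposed term scores against gold association scores using rank correlations. The gold and predicted values are invented for illustration, and the choice of Kendall's tau/Spearman's rho here is an assumption: the official metric and evaluation data are defined by the task itself.

```python
# Hedged sketch: scoring a system's real-valued term-sentiment predictions
# against gold association scores with rank correlations. All scores below
# are invented; the official evaluation data and metric come from the task.
from scipy.stats import kendalltau, spearmanr

gold = {"#happy": 0.95, "great": 0.85, "meh": 0.40, "#fail": 0.10, "awful": 0.05}
predicted = {"#happy": 0.90, "great": 0.70, "meh": 0.55, "#fail": 0.20, "awful": 0.15}

terms = sorted(gold)                     # align the two score lists by term
gold_scores = [gold[t] for t in terms]
pred_scores = [predicted[t] for t in terms]

tau, _ = kendalltau(gold_scores, pred_scores)
rho, _ = spearmanr(gold_scores, pred_scores)
print(f"Kendall tau: {tau:.3f}  Spearman rho: {rho:.3f}")
```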

8.2 Outlook on SemEval-2016

There is a new edition of the task which will run as part of SemEval-2016. In this new edition,19 we will focus on sentiment with respect to a topic, but on a five-point scale, which is used for human review ratings on popular websites such as Amazon, TripAdvisor, Yelp, etc. From a research perspective, moving to an ordered five-point scale means moving from binary classification to ordinal regression.

We further plan to continue the trend detection subtask, which represents a move from classification to quantification,20 and is closer to what applications actually need: in real-world applications, the focus is often not on the sentiment of a particular tweet, but rather on the percentage of tweets that are positive or negative.
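
A small sketch can make the classification-versus-quantification distinction concrete: plain "classify and count" simply reports the fraction of tweets labeled positive, which is biased whenever the classifier's error rates are asymmetric, whereas an adjusted estimate in the spirit of Forman (2008) corrects for the classifier's true- and false-positive rates. The counts and rates below are invented for illustration.

```python
# Hedged sketch of why quantification differs from per-tweet classification:
# "classify and count" is biased under asymmetric errors, and the adjusted
# count corrects for that using tpr/fpr estimated on held-out validation data
# (adjusted classify-and-count in the sense of Forman 2008). Numbers invented.
def classify_and_count(predictions):
    # Fraction of tweets the classifier labels positive.
    return sum(predictions) / len(predictions)

def adjusted_classify_and_count(predictions, tpr, fpr):
    cc = classify_and_count(predictions)
    acc = (cc - fpr) / (tpr - fpr)          # invert the classifier's confusion rates
    return min(max(acc, 0.0), 1.0)          # clip to a valid proportion

# Suppose 1,000 tweets, 380 predicted positive, and validation data suggested
# the classifier has tpr = 0.80 and fpr = 0.10.
preds = [1] * 380 + [0] * 620
print("classify and count:", classify_and_count(preds))                     # 0.38
print("adjusted estimate: ", adjusted_classify_and_count(preds, 0.80, 0.10))  # 0.40
```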

Finally, we plan a new subtask on trend detection, but using a five-point scale, which would get us even closer to what businesses (e.g., in marketing studies) and researchers (e.g., in political science or public policy) want nowadays. From a research perspective, this is a problem of ordinal quantification (Esuli and Sebastiani 2010).

9 Conclusion

We have presented the development and evaluation of a SemEval task on Sentiment Analysis in Twitter. The task included the creation of a large contextual and message-level polarity corpus consisting of tweets, SMS messages, LiveJournal messages, and a special test set of sarcastic tweets. It ran in 2013, 2014, and 2015, attracting the highest number of participating teams in all 3 years, with new challenging subtasks added in 2015, and some coming in 2016.

The task has fostered the creation of freely available resources such as NRC's Hashtag Sentiment lexicon and the Sentiment140 lexicon (Mohammad et al. 2013), which the NRC-Canada team initially developed for their participation in SemEval-2013 task 2, and which were key to their winning the competition. Further specialized resources were developed for the 2014 and 2015 editions as well.

We hope that the long-lasting role of this task and the accompanying datasets, which we release freely21 under a Creative Commons Attribution 3.0 Unported License,22 will be to serve as a test bed for comparing different approaches and for fostering the creation of new relevant resources. This should facilitate research, lead to a better understanding of how sentiment is conveyed in social media, and ultimately yield better sentiment analysis systems.

In future work, we plan to extend the task with new data from additional domains. We further plan to work on getting the setup as close as possible to what real-world applications need; this could mean altering the task/subtask definition, the data filtering process, the data annotation procedure, and/or the evaluation setup. Last but not least, we are interested in comparing annotations obtained from crowdsourcing with annotations from experts (Borgholt et al. 2015).
Table 11

The teams that participated in SemEval-2014 task 9, their affiliations, and an indication whether each team participated in SemEval-2013 task 2

Team | Affiliation | 2013?
Alberta | University of Alberta |
AMI_ERIC | AMI Software R&D and Université de Lyon (ERIC LYON 2) | Yes
AUEB | Athens University of Economics and Business | Yes
BUAP | Benemérita Universidad Autónoma de Puebla |
CISUC_KIS | University of Coimbra |
Citius | University of Santiago de Compostela |
CMU-Qatar | Carnegie Mellon University, Qatar |
CMUQ-Hybrid | Carnegie Mellon University, Qatar |
columbia_nlp | Columbia University | Yes
cooolll | Harbin Institute of Technology |
DAEDALUS | Daedalus |
DejaVu | Indian Institute of Technology, Kanpur |
ECNU | East China Normal University | Yes
GPLSI | University of Alicante |
IBM_EG | IBM Egypt |
IITPatna | Indian Institute of Technology, Patna |
IIT-Patna | Indian Institute of Technology, Patna |
JOINT_FORCES | Zurich University of Applied Sciences |
Kea | York University, Toronto | Yes
KUNLPLab | Koç University |
lsis_lif | Aix-Marseille University | Yes
Lt_3 | Ghent University |
LyS | Universidade da Coruña |
NILC_USP | University of São Paulo | Yes
NRC-Canada | National Research Council Canada | Yes
Rapanakis | Stamatis Rapanakis |
RTRGO | Retresco GmbH and University of Gothenburg | Yes
SAIL | Signal Analysis and Interpretation Laboratory | Yes
SAP-RI | SAP Research and Innovation |
senti.ue | Universidade de Évora | Yes
SentiKLUE | Friedrich-Alexander-Universität Erlangen-Nürnberg | Yes
SINAI | University of Jaén | Yes
SU-FMI | Sofia University |
SU-sentilab | Sabanci University | Yes
SWISS-CHOCOLATE | ETH Zurich |
Synalp-Empathic | University of Lorraine |
TeamX | Fuji Xerox Co., Ltd. |
Think_Positive | IBM Research, Brazil |
TJP | University of Northumbria at Newcastle Upon Tyne | Yes
TUGAS | Instituto de Engenharia de Sistemas e Computadores, Investigação e Desenvolvimento em Lisboa | Yes
UKPDIPF | Ubiquitous Knowledge Processing Lab |
UMCC_DLSI_Graph | Universidad de Matanzas and Universidad de Alicante | Yes
UMCC_DLSI_Sem | Universidad de Matanzas and Universidad de Alicante | Yes
Univ. Warwick | University of Warwick |
UPV-ELiRF | Universitat Politècnica de València |
USP_Biocom | University of São Paulo and Federal University of São Carlos |
Footnotes
1

Hashtags are a type of tagging for Twitter messages.

 
2

We should note that the distinction between constrained and unconstrained systems is quite subtle. For example, the creation of a dedicated lexicon obtained from other annotated data could be regarded as a form of supervision beyond the dataset provided in the task. A similar argument could also be made about various NLP tools for Twitter processing such as Noah's ARK Tweet NLP, Alan Ritter's twitter_nlp, or GATE's TwitIE, which are commonly used for tweet tokenization, normalization, POS tagging (Gimpel et al. 2011), chunking, syntactic parsing (Kong et al. 2014), named entity recognition (Ritter et al. 2011), information extraction (Bontcheva et al. 2013), and event extraction (Ritter et al. 2012); all these tools are trained on additional tweets. Indeed, some participants in 2013 and 2014 did not understand the constrained versus unconstrained distinction well, and we had to check the system descriptions and reclassify some submissions as constrained/unconstrained. This was a hard and tedious job, and thus for the 2015 edition of the task we did not make a distinction between constrained and unconstrained systems, letting the participants use any additional data, resources, and tools they wished. In any case, our constrained/unconstrained definitions for the 2013 and 2014 editions of the task are clear, and the system descriptions for the individual systems are also available. Thus, researchers are free to view the final system rankings any way they like, e.g., as two separate constrained and unconstrained rankings or as one common ranking.

 
3

Filtering based on an existing lexicon does bias the dataset to some degree; however, note that the text still contains sentiment expressions outside those in the lexicon.

 
4

We pre-filtered the SMS messages and the sarcastic tweets with SentiWordNet, but we did not do it for LiveJournal sentences.

 
6

The use of Amazon's Mechanical Turk has been criticised from an ethical perspective (e.g., human exploitation) and from a legal one (e.g., tax evasion, minimum legal wage in some countries, absence of a work contract); see Fort et al. (2011) for a broader discussion. We have tried our best to stay fair, adjusting the pay per HIT so that the resulting hourly rate would be on par with what is currently considered good pay on Mechanical Turk. Indeed, Turkers were eager to work on our HITs, and the annotations were completed quickly.

 
7

Note that this discarding only happened if a single Turker had created contradictory annotations; it was not done at the adjudication stage.

 
9

However, this did not have major impact on the results; see Sect. 6.3 for detail.

 
10

In the ongoing third year of the task (SemEval-2015), there were submissions by 41 teams: 11 teams participated in subtask A and 40 in subtask B (Rosenthal et al. 2015).

 
11

Neural nets and deep learning were also used by top-performing teams in 2015, e.g., by UNITN (Severyn and Moschitti 2015) (University of Trento and Qatar Computing Research Institute).

 
20

Note that a good classifier is not necessarily a good quantifier, and vice versa (Forman 2008). See Esuli and Sebastiani (2015) for pointers to literature on text quantification.

 

Acknowledgments

We would like to thank Theresa Wilson, who was coorganizer of SemEval-2013 Task 2 and has contributed tremendously to the data collection and to the overall organization of the task. We would also like to thank Kathleen McKeown for her insight in creating the Amazon Mechanical Turk annotation task. For the 2013 Amazon Mechanical Turk annotations, we received funding by the JHU Human Language Technology Center of Excellence and the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), through the U.S. Army Research Lab. All statements of fact, opinion or conclusions contained herein are those of the authors and should not be construed as representing the official views or policies of IARPA, the ODNI or the U.S. Government. The 2014 Amazon Mechanical Turk annotations were funded by Kathleen McKeown and Smaranda Muresan. The 2015 Amazon Mechanical Turk annotations were partially funded by SIGLEX.

Copyright information

© Springer Science+Business Media Dordrecht 2016

Authors and Affiliations

  • Preslav Nakov (1)
  • Sara Rosenthal (2)
  • Svetlana Kiritchenko (3)
  • Saif M. Mohammad (3)
  • Zornitsa Kozareva (4)
  • Alan Ritter (5)
  • Veselin Stoyanov (6)
  • Xiaodan Zhu (3)

  1. Qatar Computing Research Institute, HBKU, Doha, Qatar
  2. Columbia University, New York, USA
  3. National Research Council Canada, Ottawa, Canada
  4. USC Information Sciences Institute, Marina del Rey, USA
  5. The Ohio State University, Columbus, USA
  6. Johns Hopkins University, Baltimore, USA
