6.1 Introduction

Sentiment analysis (sometimes called “opinion mining”) is a research topic that has been actively discussed and developed for some 20 years, particularly in the fields of natural language processing (NLP) and information retrieval (IR) (Pang and Lee 2008). In this paper, we introduce the multilingual opinion analysis task (MOAT) (Seki et al. 2010, 2008, 2007), which was included in NTCIR-6, 7, and 8 (2006–2010). We then discuss the role and novelty of the task in sentiment analysis research.

Sentiment analysis research began in 2002 (Pang et al. 2002; Turney 2002; Wiebe et al. 2002). Various frameworks for classifying documents as positive or negative have been proposed, using either supervised learning (Pang et al. 2002) or unsupervised learning (Turney 2002). In parallel, many researchers started to build opinion corpora based on newspaper articles (Wiebe et al. 2002) for multi-perspective question answering (MPQA). Other early research was published at the AAAI 2004 Spring Symposium: Exploring Attitude and Affect in Text: Theories and Applications (Shanahan et al. 2006).

At the Text Retrieval Conference (TREC) in 2006, a new "Blog Track" was introduced and continued until 2010. The organizers released the TREC Blogs06 Collection (Macdonald and Ounis 2006), which contains 100,649 blog posts (excluding duplicate documents) and over 3.2 million permalinks. This dataset was used for the opinion finding (blog post) retrieval task in the TREC 2006 Blog Track and for the polarity opinion finding (blog post) retrieval task in the TREC 2007 Blog Track. In addition, the MPQA opinion corpus from the University of Pittsburgh (Wiebe et al. 2005), which defines a framework for opinion annotation using multiple assessors, had been released.

Building on this previous work, we introduced our opinion analysis task at NTCIR-6 in 2006. The novel aspects of the NTCIR MOAT task can be summarized as follows:

  1. We have released an opinion annotation corpus for evaluation workshops. The annotation units include opinionatedness, topic relevance, polarity, opinion holder (from NTCIR-6), and opinion target (from NTCIR-7).

  2. We have provided a multilingual opinion corpus that includes material in English, Chinese, and Japanese.

  3. The topic set in the evaluation corpus is shared across languages.

In Sect. 6.2, we give details of the NTCIR MOAT design to clarify its novel features and suggest an opinion corpus annotation strategy for evaluation workshops. In Sect. 6.3, we describe the evolution of opinion analysis research since the introduction of MOAT. Finally, in Sect. 6.4, we offer concluding remarks and discuss future research directions.

6.2 NTCIR MOAT

6.2.1 Overview

NTCIR MOAT was held at NTCIR-6 (Seki et al. 2007), NTCIR-7 (Seki et al. 2008), and NTCIR-8 (Seki et al. 2010). The task definition evolved through the three sessions, as shown in Table 6.1.

Table 6.1 MOAT progress during NTCIR-6, 7, & 8

The goal of the task is to bridge element technologies, such as opinion/polarity sentence classification and opinion holder/target phrase recognition, and applications such as (opinion) IR or question answering. The target languages include English, Chinese (both Traditional and Simplified), and Japanese, and the topic set for IR or question answering is shared across languages. We prepared document sets relevant to the topics, retrieved from newspaper articles published in each target language, and evaluated the participating systems using these document sets annotated by multiple assessors.

6.2.2 Research Questions at NTCIR MOAT

Many researchers have focused on approaches to sentiment analysis that require few language-specific resources (Elming et al. 2014; Le et al. 2016). Blitzer et al. (2007) proposed a domain adaptation approach for sentiment classification. Wan (2009) addressed Chinese sentiment classification by using English sentiment corpora available on the Internet. This type of research can be categorized as a semi-supervised approach to opinion/sentiment analysis that aims to overcome the resource problem by using small labeled and large unlabeled datasets. We recognize that addressing language resource problems in sentiment analysis for non-native languages is an important research area. In addition, applications such as the Europe Media Monitor (EMM) News Explorer provide an excellent service by including viewpoints from different countries. We also believe that presenting these varied opinions from different countries offers opportunities for better worldwide communication. NTCIR MOAT was the first task to provide opportunities for non-native researchers to develop sentiment analysis systems for low-resource languages and to bridge cultures by clarifying opinion differences across languages.

6.2.3 Subtasks

With the broad range of information sources available on the web and in social media, there has been increasing interest from both commercial and governmental parties in automatically analyzing and monitoring the flow of prevailing attitudes expressed by anonymous users. As a result, the research community has given much attention to the automatic identification and processing of the following:

  • Sentences in which an opinion is expressed (Wiebe et al. 2004),

  • The polarity of the expression (Wilson et al. 2005),

  • The opinion holders of the expression (Choi et al. 2005),

  • The opinion targets of the expression (Ruppenhofer et al. 2008), and

  • Opinion question answering (Stoyanov et al. 2005; Dang 2008).

With these factors in mind, we defined the subtasks in NTCIR MOAT as follows (an illustrative annotation record covering these units is sketched after Fig. 6.1).

  1. Opinionated sentences

     The judgment of opinionated sentences is a binary decision for all sentences.

  2. Relevant sentences

     Each set contains documents found to be relevant to an opinion question, such as the one shown in Fig. 6.1. Participants in the relevance subtask judged each opinionated sentence as either relevant (Y) or non-relevant (N) to the opinion question. In NTCIR-8 MOAT, only opinionated sentences were annotated for relevance.

  3. Opinion polarities

     The polarity is determined for each opinion clause. If the sentence is relevant to the topic, the polarity is determined with respect to the topic description; if not, it is determined from the attitude of the opinion itself. The possible polarity values are positive (POS), negative (NEG), or neutral (NEU).

  4. Opinion holders

     Opinion holders are annotated for the opinion clauses that express an opinion; however, the opinion holder for a given opinion clause can occur anywhere in the document. When the opinion holder is expressed as an anaphoric reference, the assessors performed a form of co-reference resolution, marking the opinion holder with the opinion clause and noting the antecedent of the anaphora. Each opinion clause must have at least one opinion holder.

  5. Opinion targets

     The opinion targets were annotated in a similar manner to the opinion holders. Each opinion clause must have at least one opinion target.

  6. Cross-lingual opinion Q&A

     The cross-lingual subtask is defined as an opinion Q&A task: given questions in English, answer opinions should be extracted in the other languages. To keep the task simple, the extraction unit is defined as a sentence. The answer set is defined as the combination of the annotations from the conventional subtasks, with opinionatedness, polarity, and answeredness matched against the definition in the question description.

Fig. 6.1 Example: opinion question fields at NTCIR-8 MOAT
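
To make the subtask definitions above concrete, the following is a minimal, illustrative sketch (in Python) of a sentence-level annotation record. The field names and example values are hypothetical and merely mirror the annotation units described above (opinionatedness, relevance, polarity, holders, and targets); they do not reproduce the actual MOAT data format.

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class OpinionAnnotation:
        """One sentence-level, MOAT-style annotation record (illustrative only)."""
        sentence_id: str
        opinionated: bool                 # subtask 1: opinionated Y/N
        relevant: Optional[bool] = None   # subtask 2: relevance to the opinion question
        polarity: Optional[str] = None    # subtask 3: "POS", "NEG", or "NEU"
        holders: List[str] = field(default_factory=list)  # subtask 4: at least one per opinion clause
        targets: List[str] = field(default_factory=list)  # subtask 5: at least one per opinion clause

    # A made-up record for a relevant, negative opinion sentence.
    example = OpinionAnnotation(
        sentence_id="N16-D03-S12",
        opinionated=True,
        relevant=True,
        polarity="NEG",
        holders=["the demonstrators"],
        targets=["the Japanese government"],
    )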

6.2.4 Opinion Corpus Annotation Requirements

Opinion corpus annotation spanning multiple domains (as in news topics) usually requires expert linguistic knowledge, so crowdsourced annotation (e.g., via Amazon Mechanical Turk) does not fit the NTCIR MOAT annotation framework. We conducted our evaluation using the agreed (intersection) annotations of multiple expert assessors. To check the stability of this evaluation strategy, we compared the evaluation results obtained when using the agreed (intersection) annotation and the selective (union) annotation as the gold standard, based on NTCIR-8 MOAT submission data.

For the English cases in Table 6.2 (the \(\kappa \) coefficient between assessor annotations was 0.73) and the Traditional Chinese cases in Table 6.3 (\(\kappa \) coefficient 0.46), the ranks of the participants' systems differed between the two gold standards. Although the rank differences for the English cases were not statistically significant, for the Traditional Chinese cases the precision-oriented systems (CTL and WIA) tended to be ranked higher under the agreed (intersection) annotation, and the recall-oriented systems (KLELAB-1 and NTU) tended to be ranked lower. For the Simplified Chinese cases in Table 6.4 (\(\kappa \) coefficient 0.97) and the Japanese cases in Table 6.5 (\(\kappa \) coefficient 0.72), there was no rank difference between the two strategies, owing to either the high \(\kappa \) agreement (Simplified Chinese) or the small number of participants (Japanese). From these observations, we concluded that the \(\kappa \) coefficient between assessor annotations should exceed 0.7 for stable evaluation. We also found that clear opinion definitions and online annotation tools were helpful, but expert linguistic annotators remained necessary to achieve such high \(\kappa \) agreement.
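
For reference, the following is a minimal sketch of how Cohen's \(\kappa \) between two assessors can be computed for a binary judgment such as opinionated Y/N; the label lists are invented for illustration.

    from collections import Counter

    def cohens_kappa(labels_a, labels_b):
        """Cohen's kappa for two annotators labeling the same items."""
        assert len(labels_a) == len(labels_b)
        n = len(labels_a)
        observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
        # Chance agreement from each annotator's label distribution.
        freq_a, freq_b = Counter(labels_a), Counter(labels_b)
        labels = set(labels_a) | set(labels_b)
        expected = sum(freq_a[lab] * freq_b[lab] for lab in labels) / (n * n)
        return (observed - expected) / (1 - expected)

    # Toy example: two assessors judging eight sentences as opinionated (Y) or not (N).
    a = ["Y", "Y", "N", "N", "Y", "N", "Y", "N"]
    b = ["Y", "N", "N", "N", "Y", "N", "Y", "Y"]
    print(cohens_kappa(a, b))  # 0.5 for these invented labels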

Table 6.2 Evaluation strategy analysis using NTCIR-8 MOAT English raw submission data
Table 6.3 Evaluation strategy analysis using NTCIR-8 MOAT Traditional Chinese raw submission data
Table 6.4 Evaluation strategy analysis using NTCIR-8 MOAT Simplified Chinese raw submission data
Table 6.5 Evaluation strategy analysis using NTCIR-8 MOAT Japanese raw submission data

6.2.5 Cross-Lingual Topic Analysis

We ranked topics by averaging their F1-scores, the harmonic mean of precision and recall, obtained from all NTCIR-8 MOAT raw submissions in the opinionated judgment subtask. The best three (easy) topics, the worst three (difficult) topics, and the opinion percentage in the source documents are shown in Table 6.6.
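
For reference, the per-topic difficulty score described above can be sketched as follows: compute each submission's F1-score on the topic and average over submissions (a higher average F1 means an easier topic). The (precision, recall) pairs below are invented placeholders.

    def f1_score(precision, recall):
        """Harmonic mean of precision and recall (0 if both are 0)."""
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)

    def topic_difficulty(runs):
        """Average F1 over all submissions for one topic; higher means easier."""
        return sum(f1_score(p, r) for p, r in runs) / len(runs)

    # Invented (precision, recall) pairs for three submissions on a single topic.
    print(topic_difficulty([(0.60, 0.50), (0.45, 0.70), (0.80, 0.30)]))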

Table 6.6 Cross-lingual topic analysis using NTCIR-8 MOAT raw submission data

From these results, we found that topic difficulty is strongly language-dependent. We also found that topics whose source documents contained many opinions tended to be easier. Exceptions to this tendency included the opinion question for topic N16: “What reasons have been given for the anti-Japanese demonstrations that took place in April, 2005 in Peking and Shanghai in China?” We surmise that this was caused by the systems’ difficulty in judging the quite sensitive opinions expressed in newspaper articles in each language.

6.3 Opinion Analysis Research Since MOAT

6.3.1 Research Using the NTCIR MOAT Test Collection

Several researchers, particularly those working on cross-lingual sentiment analysis, have used the NTCIR MOAT test collection and presented their work at top-rated conferences. Two representative examples follow.

  1. Joint Bilingual Sentiment Classification

     Lu et al. (2011) hypothesized that sentences aligned between languages should be similar in opinion polarity and strength. They proposed a method for improving polarity classification performance that used the MPQA opinion corpus and the NTCIR MOAT corpus as labeled corpora, and aligned Chinese–English news corpora as unlabeled corpora. They later extended this work with a cross-lingual mixture model (Meng et al. 2012) to improve performance when learning polarity clues from unlabeled corpora.

  2. Cross-lingual Sentiment Lexicon Learning

     Gao et al. (2015) proposed a method for generating sentiment lexicons for low-resource languages from available English sentiment lexicons. They created Chinese sentiment lexicons using a bilingual word-graph label-propagation approach; a generic sketch of this idea follows. They evaluated sentence-level Chinese sentiment classification using the NTCIR MOAT corpus and found that classification became more effective when features were derived from the generated sentiment lexicon.
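
The sketch below illustrates generic label propagation over a word graph only as a reading aid; it is not Gao et al.'s bilingual construction, and the seed words, edges, and weights are invented.

    def propagate(edges, seeds, iterations=20, keep=0.5):
        """edges: {word: [(neighbor, weight), ...]}; seeds: {word: +1.0 or -1.0}."""
        scores = {w: seeds.get(w, 0.0) for w in edges}
        for _ in range(iterations):
            new_scores = {}
            for word, nbrs in edges.items():
                total = sum(weight for _, weight in nbrs)
                spread = sum(scores[n] * weight for n, weight in nbrs) / total if total else 0.0
                if word in seeds:
                    # Seed words keep part of their original label.
                    new_scores[word] = keep * seeds[word] + (1 - keep) * spread
                else:
                    new_scores[word] = spread
            scores = new_scores
        return scores

    # Invented toy graph linking translation/co-occurrence neighbors.
    edges = {
        "good": [("excellent", 1.0), ("bad", 0.2)],
        "excellent": [("good", 1.0)],
        "bad": [("good", 0.2), ("terrible", 1.0)],
        "terrible": [("bad", 1.0)],
    }
    print(propagate(edges, seeds={"good": 1.0, "bad": -1.0}))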

6.3.2 Opinion Corpus in News

Several opinion corpora for news have been developed since NTCIR MOAT was published. In this subsection, we introduce the SemEval-2007 Task 14 “Affective Text” corpus (Strapparava and Mihalcea 2007) and the sentiment-annotated quotation set (Balahur and Steinberger 2009; Balahur et al. 2010).

In the SemEval-2007 Affective Text corpus, six emotion labels and two polarity labels were annotated on 1,250 headlines collected from news websites and newspaper articles. The sentiment-annotated quotation set contains 1,590 English-language quotations (reported speech), manually annotated by two independent sets of annotators for the sentiment (positive, negative, or objective/neutral) expressed toward the entities mentioned inside the quotation. The news articles were crawled using the EMM (Steinberger et al. 2009) developed by the European Commission Joint Research Centre.

The NTCIR MOAT corpus, however, remains in use as a large cross-lingual news opinion corpus targeted at Chinese, Japanese, and English.

6.3.3 Current Opinion Analysis Research: The Social Media Corpus and Deep NLP

After NTCIR MOAT was published, Twitter and other microblogging media came into widespread use. NLP/IR researchers accordingly turned to tweet sentiment analysis (Martinez-Camara et al. 2013). Because a tweet is much shorter than a news article, specific clues were found to be useful for improving sentiment classification on Twitter, including the tweet context (Jiang et al. 2011), emoticons and hashtags (Purver and Battersby 2012), lengthened words (Brody and Diakopoulos 2011), and emoji (Felbo et al. 2017).

On the other hand, deep NLP research such as the Stanford Sentiment Treebank (Socher et al. 2013) has become mainstream from a technological point of view. In this line of research, the learning model builds up a representation of a whole sentence based on the sentence structure. The Stanford Sentiment Treebank was developed to evaluate compositionality in the sentiment detection task. It includes the fine-grained sentiment labels “very negative”, “negative”, “neutral”, “positive”, and “very positive” for 215,154 phrases in trees parsed with the Stanford Parser from 11,855 sentences extracted from movie reviews (Pang and Lee 2005).
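
To make the compositional idea concrete, here is a minimal sketch of bottom-up composition over a binarized parse tree. It is not the Recursive Neural Tensor Network of Socher et al.; the weights, word vectors, and tree are random or invented, so the output probabilities are meaningless and only show the mechanics.

    import numpy as np

    rng = np.random.default_rng(0)
    DIM, NUM_CLASSES = 8, 5   # five classes: very negative .. very positive
    W = rng.normal(scale=0.1, size=(DIM, 2 * DIM))          # composition weights (untrained)
    W_cls = rng.normal(scale=0.1, size=(NUM_CLASSES, DIM))  # sentiment classifier weights (untrained)

    def compose(left, right):
        """Combine two child vectors into a parent phrase vector."""
        return np.tanh(W @ np.concatenate([left, right]))

    def phrase_vector(node, embeddings):
        """Recursively build a vector for a node of a binarized parse tree."""
        if isinstance(node, str):            # leaf: a word
            return embeddings[node]
        left, right = node
        return compose(phrase_vector(left, embeddings), phrase_vector(right, embeddings))

    # Toy binarized tree for "not very good", with random word vectors.
    embeddings = {w: rng.normal(size=DIM) for w in ["not", "very", "good"]}
    vec = phrase_vector(("not", ("very", "good")), embeddings)
    probs = np.exp(W_cls @ vec)
    probs /= probs.sum()                      # softmax over the five sentiment classes
    print(probs.round(3))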

In SemEval 2018 (Mohammad et al. 2018), an opinion corpus was created from 10,983 English, 4,381 Arabic, and 7,094 Spanish tweets and used to evaluate the participating systems. Several tasks provide annotations for the mental state of the tweeter, including (1) the intensities of the four basic emotions (anger, fear, joy, and sadness), (2) the intensity of sentiment/valence (very negative, moderately negative, slightly negative, neutral or mixed, slightly positive, moderately positive, and very positive), and (3) multi-label emotion classification across 12 emotions (anger, anticipation, disgust, fear, joy, love, optimism, pessimism, sadness, surprise, trust, and neutral). The corpus was annotated using best–worst scaling (Louviere et al. 2015), a comparative annotation method in which assessors are shown n items (typically \(n = 4\)) and asked which is the best (highest in terms of the property) and which is the worst (lowest in terms of the property). Real-valued scores for the association between an item and the property were then computed from the number of times the item was chosen as the best and as the worst. The median number of assessors per tweet was seven. The inter-annotator agreements (Fleiss’s \(\kappa \)) for multi-label emotion classification were 0.21, 0.29, and 0.28 over the 12 classes, and 0.40, 0.48, and 0.45 over the four basic emotions, for English, Arabic, and Spanish, respectively. Most participants employed SVM/SVR, LSTMs, and Bi-LSTMs as machine learning algorithms, with word embeddings, affect lexicon features, and word n-grams as features.
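
The usual best–worst scaling scoring rule can be sketched as follows (score = fraction chosen best minus fraction chosen worst per item); the judgments below are invented, and this is only an illustration, not the exact SemEval-2018 pipeline.

    from collections import Counter

    def best_worst_scores(judgments):
        """judgments: list of (shown_items, best, worst) tuples, one per annotation.
        Returns, per item, the fraction of times chosen best minus the fraction chosen worst."""
        shown, best, worst = Counter(), Counter(), Counter()
        for items, b, w in judgments:
            shown.update(items)
            best[b] += 1
            worst[w] += 1
        return {item: (best[item] - worst[item]) / shown[item] for item in shown}

    # Invented toy judgments over three tweets.
    judgments = [
        (["t1", "t2", "t3"], "t1", "t3"),
        (["t1", "t2", "t3"], "t1", "t2"),
        (["t1", "t2", "t3"], "t2", "t3"),
    ]
    print(best_worst_scores(judgments))  # t1 about +0.67, t2 0.0, t3 about -0.67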

Although the document genres being focused on and the annotation properties have changed over time, cross-lingual opinion corpora remain important in current research.

6.4 Conclusion

In this paper, we have discussed the contributions made by our development of NTCIR MOAT. We created a cross-lingual opinion corpus in the news document genre, and several researchers have since conducted cross-lingual opinion research using our test collections. Although sentiment classification accuracy is improved by using a cross-lingual corpus, research into the linguistic opinion properties of languages rooted in different cultures, and into opinion retrieval strategies suited to different language characteristics, remains to be undertaken.

In recent research, high-quality contextual representations based on neural architectures such as ELMo (Peters et al. 2018a) and BERT (Devlin et al. 2019) have proven effective across NLP. In addition, linguistic properties such as morphology, local syntax, and longer-range semantics tend to be captured at different layers: the word-embedding layer, the lower contextual layers, and the upper layers, respectively (Peters et al. 2018b; Jawahar et al. 2019). As an extension of bilingual sentiment word-embedding frameworks (Zhou et al. 2015), cross-lingual sentiment retrieval research that considers syntax and semantics in different languages will be an interesting direction for future work.
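
As an illustration of this layer-wise view, the sketch below extracts per-layer hidden states from a pretrained BERT model with the Hugging Face transformers library (assumed to be installed); the model name and sentence are arbitrary examples, and probing classifiers over these layers are omitted.

    import torch
    from transformers import AutoModel, AutoTokenizer

    # Load a pretrained encoder and request hidden states for every layer.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

    inputs = tokenizer("The film was surprisingly moving.", return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # hidden_states[0] is the embedding layer; hidden_states[1..12] are the encoder layers.
    for i, layer in enumerate(outputs.hidden_states):
        print(i, tuple(layer.shape))   # (batch, tokens, hidden size)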