Encoding emotion in Chinese: a database of Chinese emotion words with information of emotion type, intensity, and valence

Despite the increasing interest in emotion and sentiment analysis in Chinese text, the field lacks reliable, normative ratings of the emotional content and valence of Chinese emotion words. This paper reports the first large-scale survey of average language users' judgment of perceived emotion type (e.g., ANGER, HAPPINESS), emotional intensity, and valence (e.g., POSITIVE, NEGATIVE) of Chinese emotion words. The results of the survey reveal significant differences from previously proposed Chinese emotion lexicons, which mostly relied on a few researchers' judgment or automatic annotation. Furthermore, the current study also explores the issue of lexical variation across different Chinese varieties with a comparison of emotion word perception by Chinese speakers from three different areas (Mainland China, Hong Kong, and Singapore). The emotion lexicons constructed in the current study will serve as an important reference for future research on emotion and language, including (but not limited to) topics related to sentiment detection and analysis, perception of affective language, and cross-regional lexical and semantic variation in Chinese.

Despite the possible difference in research interest, most emotion studies along both lines assume, and heavily rely on, the existence of certain emotion taxonomies. The two features that are most often discussed for categorizing emotions are emotion type and emotion intensity (e.g., see the emotion annotation scheme proposed in Wiebe et al. 2005). It is widely believed that there are some basic emotion types (e.g., HAPPINESS, ANGER, and SADNESS), and each emotion belongs to one or more basic types; on the other hand, emotions may also differ in their intensity: for example, scared is a stronger emotion than afraid, although they both belong to the category of FEAR. In the studies that are specific to emotion and language, where the center of concern is the encoding and decoding of emotion in language, one of the most important tasks is to map the emotion taxonomy to linguistic expressions of emotion (i.e., emotion words and phrases), which would result in a representational model of the emotion lexicon. In addition to emotion type and intensity, the third feature that is often coded in an emotion lexicon is valence, which, generally speaking, refers to the overall positivity or negativity of the word. Valence judgment (or sentiment classification) of emotion words and emotion-laden words has been carried out in both lines of emotion studies, by psycholinguists and computer scientists, respectively.
But how do we obtain an emotion lexicon annotated with emotion type, intensity, and valence? How can we be sure whether upset should be considered ANGER or SADNESS in emotion type, HIGH or LOW in emotion intensity, and POSITIVE, NEGATIVE, or NEUTRAL in valence? Obviously, questions of the latter type may elicit different opinions if we ask a group of English speakers, because the way people perceive emotion words and the emotional content in these words may be profoundly individual. Thus, models that rely on only a few people's judgment, even if they are experts, are subject to criticism regarding the representativeness and generalizability of their results. An alternative approach is to conduct judgment experiments with large groups of participants so as to obtain a normative perception of the emotion lexicon that better represents the comprehension of the general population.
So far, there have been a few experiment-based emotion word models published for English (e.g., Bradley and Lang 1999; Nabi 2002; Strauss and Allen 2008), but similar studies are still lacking for other languages, including Chinese. The past decade or so has witnessed a fast-growing body of literature on emotion expressions in Chinese, but most existing studies relied solely on a few individuals' (usually researchers') judgment (Chang et al. 2000; Lee 2010; Xu and Tao 许小颖, 陶建华 2003) or automatic annotation (e.g., Xu et al. 2008). To further complicate the issue, Chinese is widely used in a number of countries and regions (Mainland China, Hong Kong, Singapore, Taiwan, etc.) and consequently has evolved into several varieties over the years of parallel development. Cross-regional variations have been found in almost every linguistic aspect, especially in pronunciation, lexicon, and grammar (see Huang et al. 2014; Li 李宇明 2010; Lin et al. 2014; Tsou and You 邹嘉彦, 游汝杰 2010), which renders untenable the assumption of a homogeneous perception of the Chinese emotion lexicon. However, to the best of our knowledge, regional differences have not been examined comprehensively in previous research on Chinese emotion words.
Thus, the goal of the current study is twofold: (1) to collect judgment of Chinese emotion words from a sizable group of laymen and (2) to compare the judgment of Chinese speakers from different areas. Specifically, we focus on the judgment of emotion type, emotion intensity, and valence by native Chinese speakers from Mainland China, Hong Kong, and Singapore. To preview the results, the current study revealed both similarity and significant differences compared with previous studies, which can be attributed to either our participants' background (i.e., laymen) or the fact that the current results were based on a sizable group of participants' judgment; we also found important cross-regional differences in the participants' perception of Chinese emotion words. The current results will serve as an important reference for future research on emotion and the Chinese language.
In the rest of this paper, we will first review previous studies on emotion and language in more detail; we will then introduce the experimental methods of the current study, followed by the results and discussion.

Emotion type, emotion intensity, and valence of emotion words
As stated above, previous literature generally agreed on the existence of some basic emotion types, but the exact identity of the basic categories is still under debate (see Ekman 1992; John 1988; Kövecses 2000; Oatley and Johnson-Laird 1987; Plutchik 1980, 1994; Turner 1996, 2000). The number of basic types proposed in previous research ranged from four (e.g., ANGER, ANXIETY, HAPPINESS, and SADNESS as in John (1988)) to five (e.g., HAPPINESS, SADNESS, ANXIETY, ANGER, and DISGUST in Oatley and Johnson-Laird (1987) or ANGER, FEAR, HAPPINESS, SADNESS, and SURPRISE in Turner (1996)), six (e.g., ANGER, DISGUST, FEAR, HAPPINESS, SADNESS, and SURPRISE in Ekman (1992)), eight (e.g., JOY, SADNESS, TRUST, DISGUST, FEAR, ANGER, SURPRISE, and ANTICIPATION in Plutchik (1994) or ANGER, ANXIETY, DISGUST, FEAR, NEUTRAL, HAPPINESS, SADNESS, and SURPRISE in Strauss and Allen (2008)), and even 24 (e.g., Xu and Tao 许小颖, 陶建华 2003; see the discussion below). Regardless of how many basic types were proposed, most previous studies acknowledged that not all emotion words can be identified with a single emotion type; instead, there are complex emotions that must be understood as a blend of two or more basic emotion types. For example, in Turner's (1996) model, worry belonged to both FEAR and SADNESS and guilt was a mixture of SADNESS, FEAR, and ANGER. Needless to say, when we examine across models, the classification of complex emotions will vary with the number and identity of basic types assumed in the model. A model that assumes more basic types will have fewer complex emotions compared to a model with fewer basic types. Given a specific model, it is also possible for an emotion word to not belong to any basic type, in which case one can say that the emotion word fails to be represented in the model. Thus, one way to evaluate the goodness of a model is to count the number of emotion words that are accounted for (or not accounted for) by the model.
Compared to emotion type, the representation of emotion intensity is less contentious, as intensity is always measured on a one-dimensional scale. Previous studies used either broad intensity bands (e.g., LOW, MEDIUM, and HIGH in Plutchik (1980) and Turner (2000); see also Lee (2010)) or numerical scales with finer categorization (e.g., Strauss and Allen (2008) used a 7-point scale).
Coding valence presents another type of challenge. All previous studies agreed on using a one-dimensional scale to represent valence: either a numerical scale (e.g., Bradley and Lang (1999) used a 9-point scale; Khoo et al. (2015) used a 7-point scale) or a scale with ordered categories (e.g., POSITIVE, NEGATIVE, and NEUTRAL in Baccianella et al. (2010)). Nonetheless, the interpretation of a valence judgment is still unclear. In most studies, the valence scale is interpreted as a continuum from "positive" to "negative," via "neutral" in the middle. However, in the seminal work that produced the ANEW (Affective Norms for English Words) database, Bradley and Lang (1999) asked participants to rate a word on a scale of "pleasantness." Their methodology was replicated in a number of subsequent studies that built similar databases for other languages, e.g., Spanish (Redondo et al. 2007) and French (Monnier and Syssau 2014). Regardless of how the valence scale is labeled, the interpretation of a valence judgment can be ambiguous, as the perceived valence of a word may be either a judgment of the emotional experience encoded by the word (i.e., from the perspective of the experiencer) or an attitude toward the encoded emotion (i.e., from the perspective of a reporter). In the current study, by including both valence and the word's emotional content (emotion type and intensity) in the rating task, we hope to unveil the distinction between the two perspectives.

Construction of emotion word models
It is not uncommon for emotion word models to be based on a few researchers' judgment. Consider models with emotion type and intensity information first. For example, Turner's (2000) model was based on only one researcher's judgment, and John's (1988) model had three independent judges. More recently, Nabi (2002) suggested that researchers' conceptualization of emotion words could be different from that of average language users. Along this line, Strauss and Allen (2008) carried out the first large-scale experiment with laymen participants for emotion type categorization and intensity rating. The study recruited 200 participants to rate a list of 463 words, with each word judged by 50 participants on average. The main results of the study were a list of highly representative emotion words of each emotion type, which elicited a consensus of judgment from the participants, and a list of complex emotion words, which received mixed judgment. For example, angry, mad, and rage are all representative of ANGER, and cheerful, enjoy, and joy are representative of HAPPINESS; by contrast, examples of complex emotion words include helpless (SADNESS + FEAR + ANXIETY) and desire (HAPPINESS + ANXIETY). Strauss and Allen's study also revealed several significant differences from previous studies that relied on experts' judgment. In addition to categorization differences in individual words (e.g., the word doom was categorized as FEAR in Strauss and Allen (2008), but as SADNESS in John (1988)), Strauss and Allen's model also reflected richer emotional content of the test words, as many previously proposed single-type emotion words turned out to be complex, probably due to the large number of participants in Strauss and Allen's experiment.
While the benefits of experimental methods are obvious, the risk of using unsupervised laymen's judgment must not be overlooked. Nabi (2002) warned against the use of free recall tasks in emotion research because of the difference between the lay understanding of emotion words and the theoretical definitions of emotions assumed in scientific research; more recently, Bai's (2015) study of Chinese expressions also showed that completely unsupervised experiments may produce incongruous results (see the discussion of Bai (2015) below). Thus, the key to the success of emotion word judgment experiments is to provide sufficient control of the test materials and to ensure the reliability of the responses.
Similar to emotion type and intensity, the coding of valence information may be achieved based on the judgment of a large number of laymen participants (e.g., the series of ANEW databases) or only a few expert annotators (e.g., the WKWSCI Sentiment Lexicon by Khoo et al. (2015)) or completely automatically using corpus analysis and lexical information (see Tang et al. (2009) for a more detailed review). While the first method is mainly adopted by psycholinguistic studies, the latter two are more often used in computer science studies. As far as we know, there has not been a comparison of valence annotations obtained with different methods, which is probably a result of these methods being used by separate populations of researchers.

Models of Chinese emotion words
Most existing models of Chinese emotion words regarding emotion type and intensity relied only on expert judgment, among which we notice two slightly different approaches. While some scholars generated emotion word models from the Chinese lexicon per se, others based their models on the translation equivalents of the English emotion lexicon. Along the first line, Xu and Tao 许小颖, 陶建华 (2003) proposed seven basic emotion types (好 hao "love," 恶 wu "disgust," 喜 xi (乐 le) "happiness," 怒 nu "anger," 哀 ai "sadness," 惧 ju "fear," and 欲 yu "desire") following the tradition in Chinese classical philosophy. The same authors also proposed 24 finer emotion categories including, e.g., 羞 xiu "shame," 烦 fan "annoyed," 傲 ao "pride," 信 xin "trust," and 疑 yi "suspicion" based on 372 emotion-related words, but the relationship between the seven basic types and the 24 finer categories was not explained. In another study, Chang et al. (2000) categorized 33 frequently used emotion verbs, each of which had more than 40 occurrences in the Sinica Corpus (Chen et al. 1996), into seven basic categories (HAPPINESS, DEPRESSION, SADNESS, REGRET, ANGER, FEAR, and WORRY). In both studies, the authors did not explain the criteria of emotion word classification, so we had to speculate that the classification was based on the authors' judgment.
Along the second line, Lee (2010) published an emotion word model generated by mapping Chinese emotion words to the model of English emotion words in Turner (2000), with some additional emotion words from Plutchik (1980). As a result, Lee's model had the same structure as Turner's model, with five basic emotion types (ANGER, FEAR, HAPPINESS, SADNESS, and SURPRISE) and three intensity levels (HIGH, MODERATE, and LOW). The classification of each Chinese emotion word in Lee's model followed the classification of the word's English equivalent in the English model. Being the first to represent both emotion type and emotion intensity in Chinese emotion words, Lee's model has been applied in a number of subsequent emotion-related studies in Chinese (Lee 2010; Lee et al. 2010; Lee et al. 2013; Lee et al. 2014).
It should be noted that Lee's model relies on an assumption that the classification of emotion words (by emotion type and intensity) can be transferred across languages via translation equivalents. This assumption necessarily requires that (1) for each Chinese emotion word, there exists a precise translation equivalent in English, and (2) each pair of Chinese and English translation equivalent emotion words are perceived (by native speakers of Chinese and English, respectively) with the same emotion type and equivalent emotion intensity. However, previous studies have found that the encoding of emotional concepts may very well vary across languages. For example, Pavlenko (2008b) pointed out that the concept of fun is not encoded in the Russian lexicon. A more relevant example was noted by Bai (2015) about the difference between the English word shameless and its (near) equivalent in Chinese, i.e., 无耻 wuchi "shameless": while shameless may be used in a joking way in English (i.e., more similar to bold), 无耻 wuchi "shameless" is almost always insulting. In other words, even if translation equivalents do exist, the emotional content encoded in the words might not be exactly the same, which would further lead to differences in the perception of emotion type and emotion intensity across languages.
To be sure, there are a small number of Chinese emotion word studies that used laymen's judgment, but the scope of research in these studies was limited to certain emotion types. For example, both Li et al. (2004) and Bai (2015) focused on SHAME expressions in Chinese. Li et al.'s study started with a list of 83 words that were related to 羞 xiu "shame/shyness," 耻 chi "disgrace," and 辱 ru "humiliation/shame" in the dictionary; the list was then expanded to 113 words and phrases by 10 native speakers; finally, the complete list of SHAME expressions was submitted to a judgment experiment for emotion sub-type with a separate group of 52 native speakers. Bai (2015), on the other hand, fully relied on laymen's judgment for both generalization (via a free listing task) and categorization (via a similarity sorting task) of SHAME expressions in Chinese.
One unexpected result of Bai's study is that in the free listing task, a few emotion words typical of ANGER (e.g., 愤怒 fennu "angry," 生气 shengqi "angry"), SADNESS (e.g., 伤心 shangxin "sad"), and DISGUST (e.g., 讨厌 taoyan "hate," 厌恶 yanwu "disgust") were consistently proposed by lay participants as prototypical SHAME words. Granted that SHAME is often associated with ANGER, SADNESS, and DISGUST (in fact, the status of SHAME as a unique emotion type is still arguable), most researchers would not consider these words as core SHAME words in Chinese. In our view, these incongruous results are very likely attributable to the unsupervised nature of the free listing task; in addition, the lack of contrast with other emotions (as the experiment focused on SHAME expressions only) may also cause lay participants to mistake words of other related emotion categories for the core vocabulary of the emotion type of interest. Due to these concerns, the current experiment used a word list based on previous expert reports together with a categorization task that had a full range of emotion type options, in order to provide a more controlled experimental setting (see below for details of experimental methods).
Last but not least, it is worth noting that Bai (2015) was one of the first to examine the variation of emotion word perception by Chinese speakers from different regions (Mainland China, Singapore) and language backgrounds (monolingual, bi(multi)lingual). Bai's results showed significant differences in the perception of SHAME expressions across participant groups, although it is unclear whether the observed variation patterns may be generalized to the perception of other emotions.
As for valence information, the two major Chinese sentiment databases are Affective Lexicon Ontology (Xu et al. 2008) and Hownet (Dong and Dong 2006), both of which are constructed by automatic or semi-automatic techniques. To our knowledge, there has not been a valence model for Chinese words based on large-scale manual annotation.
To summarize, there have been a few attempts in previous literature to construct databases of Chinese emotion words with annotated information related to emotion, but a comprehensive, cross-regional model of Chinese emotion words as perceived by lay speakers is yet to be proposed. The current study set out to fill this gap.

Methods
The design of the current study, especially regarding the coding of emotion type and intensity, mainly follows Strauss and Allen's (2008) survey of English emotion words. In this section, we report how we determined our sample size, all data exclusions (if any), all manipulations, and all measures in the study.

Experimental materials
We used the list of Chinese emotion words from Xu and Tao 许小颖, 陶建华 (2003; cf. Lee 2010), which consisted of 372 words ranging from one to four syllables in length, with the majority being disyllabic compounds. Since the complete list was too long to be tested in one experimental session, we randomly divided the words into four word lists (lists 1-4), with 93 words per list, and each participant only completed one word list.
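The random division into four equal lists can be sketched as follows. This is a minimal illustration rather than the authors' actual procedure: the function name `split_into_lists`, the fixed seed, and the placeholder word identifiers are all assumptions introduced here.

```python
import random

def split_into_lists(words, n_lists=4, seed=0):
    """Shuffle the words and deal them into n_lists equally sized lists."""
    if len(words) % n_lists != 0:
        raise ValueError("word count must be divisible by the number of lists")
    shuffled = list(words)                 # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)  # seeded only for reproducibility (assumption)
    size = len(shuffled) // n_lists
    return [shuffled[i * size:(i + 1) * size] for i in range(n_lists)]

# Placeholder identifiers standing in for the 372 emotion words.
words = [f"word_{i:03d}" for i in range(372)]
word_lists = split_into_lists(words)
print([len(lst) for lst in word_lists])  # prints [93, 93, 93, 93]
```

The repeated items and the non-word item used for quality control would be added to each list after this split.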
A set of quality control measures for rating consistency was implemented: first, each word list had two test words that each appeared twice so that we could compare a participant's ratings of repeated items as a measure of rating consistency within the participant; second, each word list shared one test word with every other word list, resulting in a set of six test words repeated across lists, in order to assess rating consistency across participant groups working on different word lists; finally, we inserted a non-word item, 几几 jiji, into all four word lists, in order to check whether the participants were responding to the judgment task attentively. Thus, each word list contained 97-98 test tokens in total. Table 1 lists all the items that are repeated within or across word lists. It should be noted that all the repeated items, except for the non-word item, 几几 jiji, were intentionally filled by words that express strong and relatively unambiguous emotions, so that the evaluation of rater consistency would be minimally complicated by the complex meaning of the test words, even though such complication may not be completely avoided (see the discussion in the "Results" section).

Participants and procedure
Table 1 Items repeated within or across word lists
Repeated within a list
List 1: 厌倦 yanjuan "bored," 悲痛 beitong "grief"
List 2: 担忧 danyou "worried," 丧气 sangqi "discouraged"
List 3: 愉悦 yuyue "pleasure," 痛恨 tonghen "hate"
List 4: 愤慨 fenkai "resentful," 愁闷 choumen "depressed"
Repeated across lists
Lists 1, 2: 开心 kaixin "happy"
Lists 1, 3: 焦躁 jiaozao "anxious"
Lists 1, 4: 愤怒 fennu "angry"
Lists 2, 3: 沉痛 chentong "grief"
Lists 2, 4: 震惊 zhenjing "shocked"
Lists 3, 4: 畏惧 weiju "fear"
Non-word item
Lists 1, 2, 3, 4: 几几 jiji

The participants were 256 (192F, 64M) native Chinese speakers recruited from two university campuses, one in Hong Kong and one in Singapore. More than half of the participants (N = 149) were in the age range of 21-25 years, followed by the age range of 16-20 years (N = 81), and the rest of the participants (N = 26) were between 26 and 55 years. In terms of language background, 92 were born and raised in Mainland China and self-identified as Mainland Mandarin speakers, 87 were born and raised in Hong Kong and self-identified as Hong Kong Chinese speakers, and the remaining 77 were born and raised in Singapore and self-identified as Singaporean Chinese speakers. All the participants received cash compensation or partial course credit for participating in the study. The judgment task was administered in the form of an online survey on the Google Forms platform. Each participant was randomly assigned to work on one word list. The participant's task was to judge the emotion type, intensity, and valence of each test item on the list. For emotion type, we followed the emotion taxonomy in Strauss and Allen (2008; see also Lee 2010; Xu and Tao 许小颖, 陶建华 2003), assuming seven basic emotion types (怒 nu ANGER, 焦虑 jiaolü ANXIETY, 厌恶 yanwu DISGUST, 害怕 haipa FEAR, 喜 xi HAPPINESS, 哀 ai SADNESS, 惊讶 jingya SURPRISE) and the existence of complex emotions as blends of basic emotions.
The participant was asked to choose, if possible, the most appropriate emotion type for the test word; if none of the seven basic emotion types was appropriate, the participant could choose one of three additional options, acknowledging that the test word belonged to a different emotion type other than the seven basic types (其他情感类型 qita qinggan leixing OTHER EMOTIONS) or that the word was not an emotion word (中立/无情感色彩 zhongli/wu qinggan secai NEUTRAL/EMOTIONLESS, or 不理解词义 bu lijie ciyi UNFAMILIAR WITH THE WORD). If the word was recognized as a true emotion word (i.e., one of the seven basic types or OTHER EMOTIONS), the participant also needed to rate the emotion intensity of the word on a 7-point scale (1 = "basically no emotion"; 7 = "very strong emotion"). The participants were also asked to judge the valence of each word; following previous studies, we used three general terms to label valence, i.e., 褒义 baoyi POSITIVE, 贬义 bianyi NEGATIVE, and 中立 zhongli NEUTRAL.
All the instructions in the online surveys were in Chinese (in both simplified Chinese characters and traditional Chinese characters). Each test item was rated by 61-68 participants (see Table 2 for the number of participants by word list and region).
It should be noted that the emotion type categorization task only allowed single-category responses, i.e., the participants were forced to classify an emotion word with only one emotion type. As a result, words encoding complex emotions will be revealed in the pooled responses from a group of participants, but not in a single participant's responses. Another noteworthy point about the categorization task is the availability of additional options apart from the pre-assumed basic emotion types. For one thing, the fact that OTHER EMOTIONS was an available option made it possible for the results to shed light on the taxonomy of basic emotions: if a large group of words was categorized as OTHER EMOTIONS, this would suggest the existence of additional basic emotion types. In addition, the word list we used included not only typical emotion words (e.g., 悲痛 beitong "grief," 开心 kaixin "happy") but also words whose emotion-word status may be arguable (e.g., 理解 lijie "understand") and words that may only be recognized in certain dialectal regions (e.g., 背悔 beihui "regretful," 来劲 laijin "in high spirits"). Instead of forcing the participants to identify these words as emotion words, we provided the options NEUTRAL/EMOTIONLESS and UNFAMILIAR WITH THE WORD. Thus, results of this study can also distinguish the most typical emotion words from those that are less recognized or less commonly used.

Reliability of the results
Rating consistency within a participant was checked by test words repeated within a word list (N = 8; see Table 1). For each intra-list repeated word, we calculated the percentage of participants who classified both occurrences of the word as the same emotion type, and the average consistency rate was 77.6 % (s.d. = 13.57 %, range = 58.8-95.6 %; see Table 3). Further analysis showed that the two words with less than 70 % of the raters being consistent (担忧 danyou "worried" and 痛恨 tonghen "hate") tended to be associated with more than one basic emotion type, which explains why participants may switch categories when rating the same word repeatedly. Importantly, these words still elicited highly similar overall ratings of emotion type between the two occurrences (e.g., 痛恨 tonghen "hate" was rated mainly as DISGUST (54.4 %) + ANGER (38.2 %) at the first occurrence and as DISGUST (61.8 %) + ANGER (25.0 %) at the second occurrence).
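The within-participant consistency rate for one repeated word can be sketched as below. This is only an illustration under stated assumptions: the function name `consistency_rate`, the dictionary layout, and the four example responses are hypothetical and are not taken from the survey data.

```python
def consistency_rate(first, second):
    """Percentage of participants who chose the same category at both occurrences.

    first, second: dicts mapping participant id -> chosen emotion category.
    Only participants with a response at both occurrences are counted
    (an assumption made for this sketch).
    """
    shared = first.keys() & second.keys()
    if not shared:
        raise ValueError("no participant rated both occurrences")
    same = sum(1 for pid in shared if first[pid] == second[pid])
    return 100.0 * same / len(shared)

# Hypothetical responses from four participants to one repeated word.
first_occurrence = {"p1": "DISGUST", "p2": "ANGER", "p3": "DISGUST", "p4": "DISGUST"}
second_occurrence = {"p1": "DISGUST", "p2": "DISGUST", "p3": "DISGUST", "p4": "DISGUST"}
rate = consistency_rate(first_occurrence, second_occurrence)
print(rate)  # 75.0
```

Averaging this rate over the eight intra-list repeated words would give the kind of summary figure reported above.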
We also compared the emotion intensity ratings for the first and second occurrences of the intra-list repeated items. The results showed that all the intra-list repeated items elicited highly similar ratings of emotion intensity between the two occurrences (p > .05 in all t tests, correlation coefficient > 0.5 for seven out of eight words; see Table 3 for detail). Taken together, these results indicate that the participants' judgment of emotion type and intensity in the current experiment was stable and consistent overall. Within-rater consistency in valence ratings was between 70.3 and 94.1 % (mean = 86.1 %, s.d. = 7.4 %). Similar to the emotion category results, lexical items with lower within-rater consistency tend to have a greater degree of mixed ratings across raters. For example, 愤慨 fenkai "resentful" was rated as NEGATIVE (70.5 %) + NEUTRAL (21.3 %) + POSITIVE (8.2 %) at the first occurrence and NEGATIVE (64.5 %) + NEUTRAL (19.3 %) + POSITIVE (16.1 %) at the second occurrence.
Rating consistency across participants working on different lists was checked by the six real words repeated across word lists (see Table 1). Each of the six words appeared in two word lists and was thus rated by two separate groups of participants. As shown in Table 4, the majority of the participants agreed on the emotion type, and the ratings of emotion intensity were highly similar across groups (p > .05 in all t tests).
Finally, the non-word item, 几几 jiji, which appeared in all four lists, was recognized as UNFAMILIAR by most of the participants (84.8 %), followed by NEUTRAL/EMOTIONLESS (14.0 %). It should be noted that in Strauss and Allen's study of English emotion words, the non-word item (Ytzok) was used as a gatekeeper, as participants who failed to recognize the non-word item as UNFAMILIAR were excluded from the analysis. In the current study, we did not adopt this practice because not recognizing 几几 jiji as a non-word item does not necessarily reflect a lack of competence of the participant. Although 几几 jiji is not a compound in standard Chinese vocabulary, the component 几 ji is a real morpheme with several possible meanings, e.g., pronominal "how many" and nominal "bench," making it possible to assign meanings to 几几 jiji.
Furthermore, Chinese language users may well try to make sense of 几几 jiji by associating it with real compounds such as 磨磨叽叽 momojiji "grumble" and 叽叽喳喳 jijizhazha "twitter," both of which contain morphemes that are homophonic with 几几 jiji. Thus, 几几 jiji is more word-like in Chinese than Ytzok is in English, and a failure to recognize 几几 jiji as a non-word should not be taken as evidence of unreliable judgment.

Emotion type, intensity, and valence judgment of Chinese emotion words
This section reports the general rating results for real-word items. When analyzing the results of emotion type, intensity, and valence judgment, we first separated the survey results by participants' Chinese variety (Mainland China, Hong Kong, and Singapore), resulting in three subsets of data, which we refer to as "MC," "HK," and "SG" datasets, respectively. For each test word in each dataset, we compiled the list of emotion categories the word was classified with and the percentage of participants that voted for each category. The emotion categories with the highest and second highest percentages of voters were considered as the primary and secondary emotion categories of the word. Following Strauss and Allen's study, words whose primary emotion type was voted by no less than 70 % of the survey participants were defined as REPRESENTATIVE emotion words, whereas words whose primary emotion type was voted by less than 70 % of the survey participants were defined as BLENDED emotion words. As the names suggest, words in the former category are considered to be highly representative of a certain basic emotion type, and those in the latter category express a mixture of emotions of multiple basic types (i.e., complex emotions). Some examples of universally agreed REPRESENTATIVE emotion words are 欣喜 xinxi "happy" and 伤感 shanggan "sad," categorized as HAPPINESS (100 %) and SADNESS (>90 %), respectively, in all three datasets; by contrast, words such as 不安 bu'an "restless" (primary: ANXIETY, 55-68 %; secondary: FEAR, 22-32 %) and 忌恨 jihen "hate" (primary: DISGUST, 61-65 %; secondary: ANGER, 26-32 %) received consistent, mixed classification results across regions and are thus typical examples of BLENDED emotion words in Chinese.
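The primary/secondary classification and the 70 % cutoff can be sketched as follows. This is an illustration under assumptions, not the authors' code: the function name `classify_word`, the use of all responses as the denominator, the tie handling, and the example vote counts are hypothetical.

```python
from collections import Counter

REPRESENTATIVE_THRESHOLD = 70.0  # percent, the cutoff stated in the text

def classify_word(votes):
    """Return primary/secondary categories and the REPRESENTATIVE/BLENDED label.

    votes: list with one emotion category per participant response.
    Percentages are taken over all responses (an assumption of this sketch).
    Ties are broken by first appearance in the list, following Counter.most_common.
    """
    if not votes:
        raise ValueError("no votes for this word")
    ranked = Counter(votes).most_common(2)
    primary, primary_count = ranked[0]
    primary_pct = 100.0 * primary_count / len(votes)
    secondary = ranked[1][0] if len(ranked) > 1 else None
    label = "REPRESENTATIVE" if primary_pct >= REPRESENTATIVE_THRESHOLD else "BLENDED"
    return {"primary": primary, "primary_pct": primary_pct,
            "secondary": secondary, "label": label}

# Hypothetical vote counts for two words, 20 responses each.
blended = classify_word(["ANXIETY"] * 11 + ["FEAR"] * 6 + ["SADNESS"] * 3)
unanimous = classify_word(["HAPPINESS"] * 20)
print(blended["label"], unanimous["label"])  # BLENDED REPRESENTATIVE
```

Running the same function separately on the MC, HK, and SG responses would yield one classification per word per dataset.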
As for intensity judgment, we calculated the mean and standard deviation of intensity ratings of each test word in each dataset pooled from all the valid ratings (if a participant classified a test word as UNFAMILIAR, his or her rating of emotion intensity, if any, was excluded; if a participant classified a test word as NEUTRAL, his or her rating of emotion intensity was recognized as 1). To give some examples, we found that among REPRESENTATIVE ANGER words in the MC dataset, 暴怒 baonu "rage" was perceived as an extremely intense anger emotion (mean = 6.96, s.d. = 0.20), followed by other ANGER words such as 激愤 jifen "wrathful" (mean = 6.54, s.d. = 1.00), 愤怒 fennu "angry" (mean = 6.43, s.d. = 0.71), 愤懑 fenmen "resentful" (mean = 6.35, s.d. = 0.87), 愤慨 fenkai "resentful" (mean = 6.18, s.d. = 0.72), 忿怒 fennu "angry" (mean = 6.04, s.d. = 0.98), and 生气 shengqi "angry" (mean = 6.00, s.d. = 0.87) on the intensity continuum. We also noticed a few trends regarding cross-regional differences in the perception of emotion intensity and the relationship between emotion type and emotion intensity ratings (see more discussion below).
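The pooling rule for intensity ratings can be sketched as below. The function name `valid_intensities`, the tuple layout, the category labels as plain strings, and the example responses are assumptions made for illustration; the text does not say whether the sample or population standard deviation was used, so `statistics.stdev` (sample) is also an assumption.

```python
from statistics import mean, stdev

def valid_intensities(responses):
    """Apply the exclusion rules before computing mean and s.d.

    responses: list of (category, rating) pairs, rating may be None.
    UNFAMILIAR responses are dropped; NEUTRAL responses count as intensity 1.
    """
    values = []
    for category, rating in responses:
        if category == "UNFAMILIAR":
            continue                 # excluded even if a rating was given
        if category == "NEUTRAL":
            values.append(1)         # treated as the lowest intensity
        elif rating is not None:
            values.append(rating)
    return values

# Hypothetical responses to one word.
responses = [("ANGER", 7), ("ANGER", 6), ("NEUTRAL", None), ("UNFAMILIAR", 3)]
values = valid_intensities(responses)
print(values, round(mean(values), 2), round(stdev(values), 2))
```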
Based on the raw valence judgment, we calculated the percentage of participants who voted for POSITIVE, NEGATIVE, and NEUTRAL for each test word j . In general, valence judgment showed higher cross-rater agreement than emotion type ratings, probably as a direct result of the smaller set of valence categories. Furthermore, valence judgment also showed some general correlations with emotion type; however, such correlations were not strong enough to fully predict valence from emotion type, suggesting that the valence judgment was not simply based on the sentiment of the emotion described by the emotion word (see more discussion below).
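As a rough illustration of the valence tabulation, the following Python sketch computes the percentage of POSITIVE, NEGATIVE, and NEUTRAL votes for one word, dropping the valence ratings of participants who marked the word as UNFAMILIAR (as stated in endnote j). The function name `valence_percentages` and the input format are hypothetical.

```python
VALENCE_LABELS = ("POSITIVE", "NEGATIVE", "NEUTRAL")


def valence_percentages(responses):
    """Percentage of raters per valence category for one test word.

    `responses` is a hypothetical list of (emotion_category, valence) pairs.
    Valence ratings from participants who chose UNFAMILIAR are excluded.
    """
    kept = [
        valence
        for category, valence in responses
        if category != "UNFAMILIAR" and valence in VALENCE_LABELS
    ]
    if not kept:
        return None
    return {label: 100 * kept.count(label) / len(kept) for label in VALENCE_LABELS}


# Made-up responses: the last rater's valence vote is excluded.
shares = valence_percentages(
    [("HAPPINESS", "POSITIVE")] * 3
    + [("SURPRISE", "NEUTRAL")]
    + [("UNFAMILIAR", "NEGATIVE")]
)
```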
The full list of emotion words, including primary and secondary emotion types, emotion intensity, and valence, is provided in Additional file 1. The complete database of emotion type, intensity, and valence ratings is available upon request.

Major findings
As discussed above, the current survey results generate a list of words that are highly representative of each emotion type. Table 5 below further presents a list of words that were unanimously voted (100 %) into one emotion type and thus can be considered core members of the category. Table 6 lists the number of REPRESENTATIVE emotion words for each emotion type (excluding OTHER EMOTIONS and UNFAMILIAR; Table 7) in each region. Of all the emotion types, HAPPINESS has the largest number of representative words in all three regions. However, HAPPINESS is also the only emotion type that is more often rated as POSITIVE (see below for more discussion on valence ratings), so overall there are more representative words for negative types than for the positive type, consistent with previous observations for English (Pennebaker et al. 1997; Stone et al. 1966) and with the negative differentiation phenomenon in general (i.e., negative concepts have more elaborate expressions than positive ones). Comparing across regions, the MC dataset has the largest REPRESENTATIVE emotion lexicon (N = 163), followed by HK (N = 145) and SG (N = 107). Table 6 also shows that altogether 90 REPRESENTATIVE emotion words were shared by all three areas, which can be considered the core emotion lexicon in Chinese.
This study also found that all three datasets had words that were classified as OTHER EMOTIONS by the majority (≥70 %) of the participants (see Table 8 for the list), suggesting that the seven emotion types assumed in this study may not be sufficient to categorize all the emotions that are encoded in Chinese. Our data also suggest regional differences in the categorization of words as OTHER EMOTIONS. While MC and SG both have more than ten words that most raters placed in the OTHER EMOTIONS category, the HK dataset only has four such words.
Following Strauss and Allen (2008), we do not treat SHAME as a basic emotion in this study, although some contend that SHAME is one of the core, self-conscious emotions that is particularly important in Chinese culture (Li et al. 2004; see also Bai 2015). As mentioned earlier, previous studies have identified some prototypical SHAME expressions in Chinese, ranging from 40 in Bai (2015) to 113 in Li et al. (2004). A small subset of these words was also present in the current study (mostly in the MC and SG datasets): 6 from Bai's list generated by Chinese monolinguals, 11 from Bai's list by Chinese-English bilinguals, and 14 from Li et al.'s list. According to our results, a few of the overlapping words (不好意思 buhaoyisi "embarrassed," 害羞 haixiu "shy," 羞惭 xiucan "ashamed," 羞愧 xiukui "ashamed") were predominantly categorized as OTHER EMOTIONS in at least one regional dataset. Other overlapping, so-called SHAME words in these two studies, such as 生气 shengqi "angry," 讨厌 taoyan "hate," and 伤心 shangxin "sad," were mostly categorized under ANGER, DISGUST, and SADNESS as their primary and secondary categories in our study.
Thus, our results do not provide sufficient support for claiming SHAME as one of the basic emotion types lexicalized in Chinese; on the contrary, our results suggest that some of the SHAME expressions previously proposed by lay participants may indeed be an artifact of the eliciting task (see our discussion on Bai (2015) in Models of Chinese emotion words).
As for emotion intensity, our results showed that the test words covered a wide range of emotion intensity in all three datasets (MC: 1.17-6.96; HK: 1.29-6.45; SG: 1.58-6.47). We also found that MC participants overall rated the test words with higher emotion intensity than both HK and SG participants (p < .001 in both paired t tests; see Table 7 for by-region summaries), suggesting possible cross-regional differences in the perception of emotion intensity, although it is unclear at this stage what may have caused such differences.
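For readers who wish to reproduce this kind of regional comparison, the sketch below computes a paired t statistic with the Python standard library. It assumes that the pairing is by test word (each word's mean intensity in one dataset paired with its mean in another), which the text does not state explicitly; the function name `paired_t` and the sample values are hypothetical and are not taken from the survey data. Obtaining a p value additionally requires the t distribution with n − 1 degrees of freedom, which is not part of the standard library.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(x, y):
    """Paired t statistic and degrees of freedom for two matched samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length (n >= 2)")
    diffs = [a - b for a, b in zip(x, y)]
    standard_error = stdev(diffs) / sqrt(len(diffs))
    return mean(diffs) / standard_error, len(diffs) - 1


# Made-up per-word mean intensities for three words in two datasets.
mc_means = [6.0, 5.0, 4.0]
hk_means = [5.5, 4.0, 3.5]
t_value, df = paired_t(mc_means, hk_means)
```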
Furthermore, REPRESENTATIVE emotion words (excluding OTHER EMOTIONS and UNFAMILIAR) received on average higher intensity ratings than BLENDED emotion words, and the trend was significant in all three regions (p < .001 in all t tests). We propose two possible accounts for the observed pattern: first, ambiguity in emotion type may interfere with the perception of emotion intensity, so participants may find it difficult to evaluate emotion intensity for words belonging to multiple emotion categories; second, for some participants, the perception of emotion intensity may be conflated with emotion prototypicality, and thus BLENDED emotion words were perceived as less intense than REPRESENTATIVE emotion words. One reviewer suggested a possible effect of word frequency on perceived intensity, as high-frequency words may be perceived as less intense than low-frequency words. However, this hypothesis was not supported in a post hoc analysis of the MC dataset, with frequency counts gathered from the BCC Chinese corpus (the assorted subcorpus) k . A linear regression model showed that after controlling for emotion type (both primary and secondary), (log) word frequency had no significant effect on perceived emotion intensity (t = −1.18, p > .1). Whether the frequency effect exists in the other two datasets, as well as the origin of the effect (e.g., familiarity, register, or prototypicality) if it does exist, still awaits further investigation.
Valence ratings showed higher cross-rater agreement than emotion type ratings, probably because there are only three valence categories. In each dataset, more than 65 % of the test words (MC: 242; HK: 279; SG: 242) received concordant valence ratings from the majority (≥70 %) of the raters. Again, emotion type is an important predictor of valence, as words of the same emotion type tend to show similar valence ratings.
Figure 1 shows the average percentage of POSITIVE, NEGATIVE, and NEUTRAL valence ratings for REPRESENTATIVE words of each emotion type. As expected, HAPPINESS words were often rated as POSITIVE, whereas ANGER, ANXIETY, DISGUST, FEAR, and SADNESS words were often rated as NEGATIVE. Interestingly, although SURPRISE words were most often rated as NEUTRAL, the second most frequently chosen category was NEGATIVE, and these words were rarely rated as POSITIVE.
The strong correlation with emotion type suggests that our participants' valence ratings were to a large extent based on the perceived "pleasantness" of the emotional experience described by the emotion words. Nevertheless, emotion type cannot fully account for the valence ratings. For example, a simple analysis of REPRESENTATIVE HAPPINESS and SADNESS words in the MC dataset revealed that while these words were on average voted by 92.1 % and 87.9 % of the participants as HAPPINESS and SADNESS, respectively, only 87.1 % and 78.4 % of the participants voted these words as POSITIVE and NEGATIVE, respectively. That is to say, at least 5-10 % of the participants did not vote for the valence category predicted by the emotion type they chose; in most cases, these participants rated the valence as NEUTRAL, but there are also some cases where participants voted for the opposite of the predicted valence category.

Table 8 Words classified as OTHER EMOTIONS by the majority (≥70 %) of the raters and the corresponding percentage of raters in each language variety
MC: 害羞 haixiu "shy" 86 %; 不好意思 buhaoyisi "embarrassed" 83 %; 倚重 yizhong "rely heavily upon" 82 %; 怀疑 huaiyi "suspect" 82 %; 羞惭 xiucan "ashamed" 78 %; 敬慕 jingmu "admire" 78 %; 珍惜 zhenxi "cherish" 77 %; 同情 tongqing "empathy" 77 %; 挂念 guanian "reminisce" 74 %; 羞涩 xiuse "shy" 74 %; 敬重 jingzhong "revere" 73 %; 信服 xinfu "convinced" 73 %
HK: 无奈 wunai "helpless" 86 %; 羡慕 xianmu "admire" 77 %; 自傲 zi'ao "proud" 76 %; 同情 tongqing "empathy" 71 %
SG: 体贴 titie "considerate" 79 %; 挂念 guanian "reminisce" 79 %; 怜惜 lianxi "compassionate" 79 %; 自傲 zi'ao "proud" 79 %; 羞愧 xiukui "ashamed" 78 %; 害羞 haixiu "shy" 75 %; 溺爱 ni'ai "spoil" 75 %; 同情 tongqing "empathy" 75 %; 崇拜 chongbai "adore" 74 %; 羞惭 xiucan "ashamed" 74 %; 痴迷 chimi "obsessed" 73 %; 情愿 qingyuan "willingly" 73 %; 关切 guanqie "care" 70 %

Although all three words were predominately categorized as HAPPINESS by MC participants, no more than one third of the participants recognized these words as POSITIVE; instead, there was a sizable proportion of the participants who recognized these words as NEGATIVE. In our opinion, these results showed that the participants may have a neutral or negative attitude toward the emotional experience described by the word; in other words, the valence ratings reflected not only the polarity of the emotional experience (from the experiencer's perspective) but also the evaluation of such emotional experience (from a reporter's perspective).
In the current study, since participants were simply instructed to rate the goodness of word meaning, it is possible that some participants took the experiencer's perspective while others put themselves in an assessor position, thus resulting in mixed valence ratings when the two perspectives entail different (or even opposite) evaluations. Our results suggest the importance of examining the distinction between these two perspectives, which, as far as we know, has not received much attention in either psycholinguistic research on word valence or computational research on sentiment analysis.
Relatedly, our data in general suggest a certain degree of individual differences in all three emotional measures. An important source of such variation is raters' experience of emotion words used in specific contexts (for example, as discussed above, the emotional connotation of a word may change with perspective). However, the overall high cross-rater consistency in our dataset indicates that language users may extrapolate core meanings of emotion words based on relevant contexts and that such meanings can be queried in a context-free rating experiment.

Fig. 1 Mean percentage of raters who selected POSITIVE, NEGATIVE, and NEUTRAL valence categories for REPRESENTATIVE emotion words by primary emotion type and rater region. OTHER and UNKNOWN words are not included

Comparison with previous studies
When comparing the current results with Xu and Tao 许小颖, 陶建华 (2003), which used the same emotion word list, one significant difference we noticed is that a subset of the test words, previously proposed as true emotion words by Xu and Tao 许小颖, 陶建华 (2003), was consistently classified as NEUTRAL/EMOTIONLESS in the current study. As shown in Table 9, many of these words (e.g., 了解 liaojie "understand," 关注 guanzhu "pay close attention to," 理解 lijie "understand," 留神 liushen "careful") have meanings related to directing attention or understanding and thus may be more relevant to attitude or mood (see Frijda 1986, 1988, 1994, 2004) than to the narrow sense of emotion. These results exemplify the differences between a large number of speakers' judgment and a few individuals' judgment, as well as between laymen's language intuition and researchers' judgment.
We also conducted a more detailed comparison with Lee's (2010) emotion word model, which, like the current study, represented both emotion type and emotion intensity. Before we introduce the comparison results, it should be noted that the structure of Lee's model is slightly different from the current one: while Lee also distinguished emotion words belonging to one basic emotion type ("basic" emotions in Lee's terms) and those encoding a blend of two or (rarely) three basic emotion types ("first order" and "second order" emotions in Lee's terms), Lee's model assumed only five basic emotion types (ANGER, FEAR, HAPPINESS, SADNESS, SURPRISE) and used a broader classification system of emotion intensity (HIGH, MODERATE, LOW) than the 7-point scale in the current study.
Our comparison focused on the representation of emotion words shared between the two studies (N current = 372, N Lee = 236, N current∩Lee = 226), which yielded a few important differences. To start with, only about half of the shared emotion words were listed under the same primary emotion category in Lee's and the current datasets (116 in MC, 122 in HK, 109 in SG), although the discrepancy might be largely due to the different emotion classification systems used in the two studies. Secondly, among the shared words considered basic emotion words on Lee's list (N = 141), fewer than 75 % were also classified as REPRESENTATIVE emotion words in the current study (105 in MC, 90 in HK, 72 in SG). For example, 37 words in the shared lexicon were considered basic HAPPINESS words in Lee's model. While the majority of these words were indeed categorized as REPRESENTATIVE of HAPPINESS in our study, a good number of them (7 in MC, 7 in HK, 14 in SG) fell into the BLENDED category, some of which were categorized as HAPPINESS by less than half of the participants in at least two regions (e.g., 放松 fangsong "relaxed": <32 % in MC, 29 % in HK, <10 % in SG, whose primary category is NEUTRAL or OTHER EMOTIONS in all three datasets). Differences also exist in the rating of emotion intensity (see Table 10).

Table 9 Words classified with NEUTRAL/EMOTIONLESS as the primary emotion type and the corresponding percentage of raters in each dataset
MC HK SG 了解 liaojie "understand" 87 % 留神 liushen "careful" 74 % 关注 guanzhu "pay close attention to" 86 % 了解 liaojie "understand" 86 % 留神 liushen "careful" 86 % 沉静 chenjing "calm" 82 % 自爱 zi'ai "self-esteem" 71 % 理解 lijie "understand" 80 % 了解 liaojie "understand" 84 % 想 xiang "think" 79 % 拥护 yonghu "support" 73 %

On the one hand, Lee's intensity levels (HIGH, MODERATE, LOW) do correspond to different mean intensity ratings in the current study (p < .05 in all t tests comparing the mean intensity ratings of Lee's HIGH-, MODERATE-, LOW-intensity words in the current study). But on the other hand, words categorized with the same intensity level in Lee's model received highly variable ratings in the current study (see Table 10 for summary statistics). Each intensity level in Lee's model corresponds to a wide range of intensity ratings in the current study-so wide that the three intensity levels have significant overlaps in their range of intensity ratings in the current study. For instance, in the MC dataset, the shared words categorized as LOW-intensity in Lee's model received intensity ratings between 2.14 and 5.35, with the upper end going well into the range of intensity ratings for words categorized with MODERATE (2.55-6.26) and HIGH (4.61-6.96) intensity in Lee's model. The overlap is not caused by outlier ratings in individual words, as more than 60 % of the LOW-intensity words (13 out of 21) were rated with an intensity higher than 4 in the MC dataset, among which 14.2 % (3 out of 21) were rated higher than 5. In other words, a good proportion of Lee's LOW-intensity words were perceived with intermediate or even high intensity by our lay participants from MC. Similar trends of overlapping intensity ranges were found in HK and SG datasets, too.
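The overlap among the three ranges reported above for the MC dataset can be checked with a small Python sketch. The range endpoints are taken directly from the text; the helper name `overlap` and the dictionary layout are hypothetical conveniences for the example.

```python
from itertools import combinations

# Intensity ranges in the MC dataset for words at each of Lee's (2010)
# intensity levels, as reported in the text.
mc_ranges = {
    "LOW": (2.14, 5.35),
    "MODERATE": (2.55, 6.26),
    "HIGH": (4.61, 6.96),
}


def overlap(a, b):
    """Return the shared sub-range of two closed intervals, or None."""
    low, high = max(a[0], b[0]), min(a[1], b[1])
    return (low, high) if low <= high else None


# Pairwise overlaps between Lee's intensity levels.
overlaps = {
    (x, y): overlap(mc_ranges[x], mc_ranges[y])
    for x, y in combinations(mc_ranges, 2)
}
```

Every pair of levels shares a sub-range under this check, including LOW and HIGH, which is the pattern the paragraph above describes.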
Lastly, we compared the valence results with the polarity information in Xu et al.'s emotion word database (2008), which shared 326 out of the 372 test words in the current study. Xu et al. coded each word as a single valence category (POSITIVE, NEGATIVE, NEUTRAL) with the exception of only one word, 挂虑 gualu "worried," which was coded as POSITIVE + NEGATIVE. As shown in Table 11, Xu et al.'s valence classification of POSITIVE and NEGATIVE words largely corresponds to the current valence ratings, e.g., POSITIVE words in Xu et al. also received on average a relatively high percentage of POSITIVE ratings in the current study; nevertheless, there also exist some words whose classification in Xu et al. greatly differs from the current ratings. For example, six words (沉静 chenjing "calm," 得意 deyi "proud," 高亢 gaokang "resounding," 牵挂 qiangua "reminisce," 瞧得起 qiaodeqi "look up to," 羡慕 xianmu "admire") were classified as POSITIVE in Xu et al. but only received the POSITIVE rating from less than half of the participants in the current MC dataset (i.e., more than half of the MC participants rated these words as NEGATIVE or NEUTRAL). Greater differences were found for the NEUTRAL words in Xu et al., as on average less than one-third of the current participants agreed that these words are NEUTRAL. In our opinion, the differences between Xu et al.'s and the current valence ratings may be due to both raters' background and the number of raters involved in the task.

Table 10 Summary of current intensity ratings of the shared words by intensity categorization in Lee (2010) Intensity level in Lee (2010) Mean and range of intensity ratings in the current study

Taken together, results from the current study revealed both similarities to and discrepancies from previous models of Chinese emotion words. Given our discussion above, the discrepancies may be attributable to three sources: (1) differences in model structure, (2) researchers' vs. laymen's judgment and the number of raters, and (3) crosslinguistic differences in the emotional content of translation equivalents between English and Chinese. Determining which model would yield more accurate predictions or better performance in psychological/computational studies of Chinese emotion language is beyond the scope of the current paper; in fact, we suspect that the assessment results will be highly dependent on the nature of the task. The current study provided a dataset of Chinese emotion word categorization that is significantly different from previous works and provides a wider geographical coverage, which may serve as an additional reference for future studies of Chinese emotion language.

Conclusion
In this study, we conducted a large-scale survey of common Chinese speakers' judgment of Chinese emotion words in carefully controlled experimental settings. Results of this study generated the first database of normative ratings of emotion type, emotion intensity, and valence for a list of Chinese emotion words, based on the judgment of Chinese speakers from Mainland China, Hong Kong, and Singapore. A number of cross-regional differences in emotion word perception have been revealed in the database, although the cause of the observed variation is beyond the scope of the current study.
We believe that the current results will have important implications for further research on Chinese emotion language. Specifically, the current emotion word database will benefit both psycholinguistic research on emotion word perception and processing and computational research on identifying emotion types and sentiment at both lexical and text levels. Since the methodology we used follows the behavioral experimental tradition in psycholinguistic research, the results can be easily adapted or used for stimuli selection and construction in other psycholinguistic experiments. On the other hand, our results, which relied on multi-rater annotation in controlled settings, provide more fine-grained details of word meanings than typical databases used in computational research. For example, instead of categorizing a word as simply belonging to one category (of emotion type, intensity, or valence), our results make it possible to distinguish words that are core members of a certain category from words that are on the borderline between two (or more) categories. Such heavily annotated resources may be used as the gold standard or training data in (semi)-automatic sentiment detection and classification tasks. Furthermore, the database will also be a valuable resource for linguistic studies, especially those related to lexical semantics and the syntax-semantics interface. For instance, previous studies propose that Chinese BA and BEI constructions denote the change of state of the patient involved, and their use is conditioned by a number of factors including verb lexical semantics and object definiteness (e.g., Liu 1997, 2007; Zhang 张伯江 2001).

Table 11 Summary of current valence ratings of the shared words by valence categorization in Xu et al. (2008) Valence category in Xu et al. (2008) Mean and range of percentage of valence ratings for the same category (POSITIVE, NEGATIVE, or NEUTRAL)

In addition, these constructions may also express speakers' emotions (Jing-Schmidt 2005; Jing-Schmidt and Jing 2011). Thus, other things being equal, emotion words with higher intensity should be more likely to occur in BA/BEI constructions than words with lower intensity, and such a prediction can be tested given the intensity information in the current database. Last but not least, the current database will also shed light on research into cross-regional variation in Mandarin Chinese, with regard to both the patterns of variation in emotion word perception and the mechanisms of such variation.
Endnotes
a The precise definition of emotion is still under constant debate; the issue is further complicated by related concepts and terms such as affect, mood, attitude, sentiment, etc. (see Frijda (1986, 1988, 1994, 2004) for more detailed discussion of the distinction among these concepts). In this study, we use the term "emotion" in a narrow sense and focus on words that express the most widely recognized, core human emotions (e.g., happiness, anger, and sadness).
b This paper uses small caps and italic for English category names and English emotion words, respectively.
c Emotion words can be divided into two broad categories depending on their usage: expressive emotion words, which express the user's emotion directly (e.g., taboo and swear words like shit and interjections like wow), and descriptive emotion words, which describe certain emotion states or processes (e.g., happy, sad, rage) (see Clore et al. 1987; Kövecses 2000; Pavlenko 2008a; Potts 2007; etc.). This paper focuses on descriptive emotion words, which directly encode emotional content in the semantic meaning.
d Strictly speaking, variation in judgment may exist both in the understanding of the emotion word (e.g., what emotion is referred to by the word upset) and in the categorization of the encoded emotion (e.g., what emotion type and intensity level the emotion upset belongs to). In this paper, we do not make the distinction between an emotion word and the emotion it refers to (i.e., the word upset vs. the emotion upset), but see Nabi (2002) and Russell (1991) for a discussion on possible differences between the two.
e In this paper, we use the term "Chinese" to refer to Standard Mandarin Chinese, which has a standard orthography in either simplified Chinese characters (used in Mainland China, Singapore) or traditional Chinese characters (used in Taiwan, Hong Kong).
f Xu and Tao 许小颖, 陶建华 (2003) manually extracted emotion words from the adjective and verb categories in Modern Chinese Grammar Information Dictionary (Yu et al. 俞士汶等 2003). The authors distinguish the list of emotion words from words that express attitudes (e.g., 粗暴 cubao "gruff," 温和 wenhe "gentle," and 冷漠 lengmo "indifferent") and personality (e.g., 忠诚 zhongcheng "loyal," 善良 shanliang "kind," 英勇 yingyong "brave").
g Xu and Tao's 许小颖, 陶建华 (2003) list has 374 words in total. Our study excludes two words that are repeatedly listed (喜欢 xihuan "like" and 惦念 diannian "reminisce"), which results in 372 emotion words.
h A reviewer raised a question regarding the rationale of choosing a 7-point scale (e.g., as opposed to a 5-point scale). Previous studies have found that the number of points on a rating scale had no significant effect on rating results (see Krosnick and Presser (2010: 272) for a survey of related studies). Therefore, this study adopted a 7-point scale following Strauss and Allen's (2008) experimental design.
i Given the high intra- and inter-participant rating consistency, for repeated test words, we report only the rating results of one occurrence of each item, which is either the first occurrence in the list (for items repeated within a list) or the occurrence in a numerically earlier list (for items repeated across lists; e.g., items appearing in Lists 1 and 2 will be reported based on the rating results in List 1).
j If a participant classified a test word as UNFAMILIAR, his or her rating of valence (if any) was excluded from analysis.
k BCC (http://bcc.blcu.edu.cn/) was released in September 2014. The assorted subcorpus of BCC has about one billion words. The frequency information of the emotion words in the subcorpus was collected on April 8, 2016.

Additional file
Additional file 1: Emotion type and intensity rating for Chinese emotion words (by region). (DOCX 107 kb)