
Social media emotions annotation guide (SMEmo): Development and initial validity

Abstract

The proper measurement of emotion is vital to understanding the relationship between emotional expression in social media and other factors, such as online information sharing. This work develops a standardized annotation scheme for quantifying emotions in social media using recent emotion theory and research. Human annotators assessed both social media posts and their own reactions to the posts’ content on scales of 0 to 100 for each of 20 (Study 1) and 23 (Study 2) emotions. For Study 1, we analyzed English-language posts from Twitter (N = 244) and YouTube (N = 50). Associations between emotion ratings and text-based measures (LIWC, VADER, EmoLex, NRC-EIL, Emotionality) demonstrated convergent and discriminant validity. In Study 2, we tested an expanded version of the scheme in-country, in-language, on Polish (N = 3648) and Lithuanian (N = 1934) multimedia Facebook posts. While the correlations were lower than with English, patterns of convergent and discriminant validity with EmoLex and NRC-EIL still held. Coder reliability was strong across samples, with intraclass correlations of .80 or higher for 10 different emotions in Study 1 and 16 different emotions in Study 2. This research improves the measurement of emotions in social media to include more dimensions, multimedia, and context compared to prior schemes.



Notes

  1. A full list of lexicons from Dr. Mohammad and colleagues is available here: http://saifmohammad.com/WebPages/lexicons.html.

  2. As part of a related, but separate effort, we developed an annotation scheme for kama muta/heartwarming and emotional responses to cute social media (Golonka et al., 2023). Paletz (2018) is available from the authors on request.

  3. Lemmatization is similar to stemming—it reduces each wordform to its citation form in the dictionary: e.g., hoping, hoped to hope; brought to bring. This is more important for morphologically rich languages like Polish and Lithuanian than for morphologically poor ones like English.

  4. For this and subsequent correlation tables, we included tables of p values in Appendix Table 11, 12, 13, 14, 15 and 16. However, note that we are using correlations descriptively rather than inferentially, so we are interested in direction and magnitude rather than significance, which can vary by sample size.

  5. The Polish bicultural social science researcher is fluent in English, Polish, and Russian; the Lithuanian researcher is fluent in English, Lithuanian, Russian, French, and German.

  6. Delfi.lt “Lietuvos itakingiausieji 2018” (“Lithuania’s Most Influential 2018”), https://www.delfi.lt/apps/itakingiausieji2018/bendras/balsas.

  7. https://nvoatlasas.lt/en/filtering/, Accessed on Aug 2, 2018.

  8. https://www.lrs.lt/sip/portal.show?p_r=8787&p_k=1.

  9. To our delight, he describes kama muta, which we had already included in this study, as an example of an emotion that does not rely on English vernacular but is a meaningful construct to study.

  10. Demszky et al. (2020) note, “We find that all Cohen’s kappa values are greater than 0, showing rater agreement” (p. 4051), also determine reliability using Spearman’s rho correlation, and refer to their interrater agreement as high.

  11. Some methods go beyond keywords to use sophisticated machine learning paradigms, such as transformers (see e.g., Acheampong et al., 2021, for a review), but we will not discuss them here as they are less interpretable than keyword approaches, and it is not clear how their performance adds to theory of emotion.

  12. One proprietary exception is Empath, not to be confused with the text-based Empath system described elsewhere. https://webempath.com

  13. A full list of lexicons from Dr. Mohammad and colleagues is available here: http://saifmohammad.com/WebPages/lexicons.html.

References


Author Note

Susannah Paletz is at the College of Information Studies, University of Maryland, College Park, affiliated with two University of Maryland centers: the Social Data Science Center (SoDa) and the Applied Research Laboratory for Intelligence and Security (ARLIS), formerly the Center for Advanced Study of Language (CASL). For Study 1, Drs. Paletz, Golonka, Adams, and Bradley were all at CASL. Ms. Stanton and Mr. Ryan were interns at CASL through the START internship program at the University of Maryland, College Park. Ewa Golonka, Nick B. Pandža, and C. Anton Rytting are currently at ARLIS. David Ryan is at Stanford University in the Computer Science Department and the Feminist, Gender and Sexuality Studies Department. Egle E. Murauskaite is at the ICONS project at the University of Maryland, College Park. Michael Johns was at ARLIS during this work but is now at the Institute for Systems Research, University of Maryland. Cody Buntain was at the New Jersey Institute of Technology during data collection but is now at the College of Information Studies, University of Maryland, College Park. We have no known conflict of interest to disclose. This material is based on work supported, in whole or in part, with funding from the United States Government Office of Naval Research (ONR) grant 12398640 and Minerva Research Initiative / ONR Grant #N00014-19-1-2506. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the University of Maryland, College Park and/or any agency or entity of the US Government. A presentation describing some of Study 2’s reliability information and the adaptation of the annotation guide to Polish and Lithuanian was given at the 26th International Congress of the International Association for Cross-Cultural Psychology (July, 2022).
The authors are grateful to Susan Campbell, Brooke Auxier, and two prior anonymous reviewers for their suggestions on an earlier version of this paper, as well as to Nataliya Stepanova for her assistance with sampling for Study 2. We are also deeply grateful to our Study 2 annotators: Agata Bieniek, Anna Kostrzewa, Gabrielė Kundrotaitė, Agata Kuzia, Klaudia Kuźnicka, Małgorzata Perczak-Partyka, Rafał Rosiak, Laura Russak, Austėja Serbentaitė, Ewa Szczepska, Karolina Tokarek, Aurelija Tylaitė, and Marta Urbańska-Łaba.

Author information


Corresponding author

Correspondence to Susannah B. F. Paletz.


Open Practices Statement

Some of the data or materials for the experiments reported here are available. We are uploading as an electronic supplement a data file that includes the Tweet IDs and links, YouTube IDs and hyperlinks, and our original coding: https://doi.org/10.3758/s13428-023-02195-1. However, due to the terms of service of the social media platforms, we cannot send (or download) the YouTube videos or the text or images of tweets. We packaged the annotated corpus of Polish and Lithuanian Facebook posts for access with permission on a site at our university, and it is available here: http://hdl.handle.net/1903/29776. We are also very interested in sharing our latest annotation guide upon request. These analyses, being by their nature explicitly exploratory and novel, were not preregistered, but are available upon request.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file 1 (SAV 1.01 MB)

Appendices

Appendix 1: Additional relevant literature

Prior annotation of emotion

Several researchers have annotated various corpora for emotion. In the social media problem space, none of the schemes do all of what we are attempting. Other annotation schemes exist outside of social media, some of which we review. Many of these schemes use a much smaller set of emotions based on older theories of emotion (e.g., Alm et al., 2005; Aman & Szpakowicz, 2007; Novielli et al., 2018; Schuff et al., 2017), do not measure intensity within/of emotions (e.g., Alm et al., 2005; Novielli et al., 2018), or use exclusive coding such that only one emotion is chosen per unit annotated (e.g., Alm et al., 2005; Volkova et al., 2010). Even those that measure mixtures of emotions simply tag the annotated unit as “mixed emotion” without measuring the intensity of each (e.g., Aman & Szpakowicz, 2007; see Table 9).

Table 9 Current scheme vs. selected prior annotation schemes

For example, the widely cited article by Wiebe et al. (2005) discusses in detail how to annotate language (e.g., speech events) for opinions and valence. While the title implies an emotion annotation scheme and several examples use emotion words (e.g., “The U.S. fears a spill-over”, p. 173), they simply propose polarity for attitudes or private states (positive, negative, both, and neither), and four levels of intensity of the statement itself (low, medium, high, extreme; e.g., the intensity of “The U.S. fears a spill-over” is listed as medium, p. 174). Wiebe et al. (2005) do cite other emotion taxonomies, but do not incorporate them into their own scheme.

Separately, Alm et al. (2005) created an affective text-to-speech system using stories written by Beatrix Potter, H. C. Andersen, and the Brothers Grimm. They annotated for emotions, expanding Ekman’s six emotions into eight labels (anger, disgust, fear, sadness, happiness, positive surprise, negative surprise, and neutral). Two annotators independently coded each sentence for an emotion or lack thereof. The authors found emotion annotation difficult, with inter-annotator agreement ranging between .24 and .51. Aman and Szpakowicz (2007), expanding on Alm et al. (2005), used a mixture of manual and automated (seed-word-based) methods to identify Ekman’s six basic emotions (plus mixed emotion and no emotion) sentence by sentence in blog posts. This framework incorporated intensity (low, medium, high) and the possibility of mixed emotion, but not which emotions were mixed. Their reliability statistics were superior to those of Alm et al. (2005), with average pairwise kappas generally ranging from .60 to .79 (the exception being mixed emotion, at .43). Happiness and fear had the highest reliabilities and surprise the lowest.

Volkova et al. (2010) further expanded the work of Alm et al. (2005) to develop automated tools to detect emotions in text. The authors selected a set of fifteen emotions made up of seven positive (relief, joy, hope, interest, compassion, surprise, and approval), seven negative (disturbance, sadness, despair, disgust, hatred, fear, and anger), and neutral. They also measured intensity on a scale from one (closest to neutral) to five (extreme polarization). Before the main study, the authors had the coders agree on eight clusters of related emotions drawn from the original fifteen (e.g., {joy, approval}; {disgust, anger, hatred}). The coders then annotated lemmatized word lists as to whether each word would change the polarity of a potential context. Only 4% of the total items from each word list were scored as having multiple polarities (negative/neutral or positive/neutral) by the same annotator (Volkova et al., 2010). For the main experiment, ten coders assessed eight Grimm fairy tales in Standard German. The Manual Emotion Annotation Tool (MEAT) allowed coders to highlight a string of text and select an emotion code for it, imagining how a participant reading the text out loud would add nonverbal signals (e.g., facial expressions). This annotation process involved labeling only one of the fifteen emotions per chosen text span. Within a single coder, simultaneous expression of different emotions (emotional complexity) could therefore not be detected using this method. However, given that different coders could choose different and overlapping text units in MEAT, it might be possible to use their method to detect simultaneous and overlapping emotions between coders. These researchers had to create a reliability metric to account for the text unit differing across coders. They also used the previously determined clusters for testing reliability.

More recently, Novielli et al. (2018) created a corpus of annotated emotions using 4800 posts between software developers on Stack Overflow. They examined six emotions overlapping with Ekman’s basic emotions (love, joy, surprise, anger, sadness, and fear). While intensity was not measured (only the presence/absence of the emotion), emotional complexity (i.e., multiple emotion labels) was allowed, and they found that 3% of their posts were labeled with two emotions. Each post was annotated by three coders, and observed raw agreement was high, from .86 to .98; however, the authors note that Fleiss’ kappa was much smaller, from .30 to .62, due to the low frequency of some emotions (e.g., surprise).
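The gap Novielli et al. observed between raw agreement and chance-corrected kappa is easy to reproduce: when one label dominates (as with rare emotions such as surprise), chance agreement is already very high. A minimal sketch, using the two-rater (Cohen's) form of kappa and invented ratings, illustrates the effect:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Two-rater Cohen's kappa: observed agreement corrected for the
    agreement expected by chance from each rater's label frequencies."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    expected = sum((counts_a[l] / n) * (counts_b[l] / n) for l in labels)
    return (observed - expected) / (1 - expected)

# Invented ratings for a rare emotion: 20 posts, "surprise" marked
# only once per rater, and never on the same post.
rater_a = ["none"] * 18 + ["surprise", "none"]
rater_b = ["none"] * 18 + ["none", "surprise"]

raw_agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)
kappa = cohens_kappa(rater_a, rater_b)
# raw_agreement is .90, yet kappa is slightly negative: nearly all of
# the observed agreement is what chance alone would predict.
```

This is the same phenomenon, in miniature, that makes raw agreement of .86 to .98 compatible with kappas of .30 to .62.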

Not all emotion annotation schemes rely solely on text. The broad and productive literature on affective computing often includes recognition of emotions in speech and video, although it is usually not geared toward social media posts and may still use limited sets of emotions (see, for reviews, Devillers et al., 2005; Poria et al., 2017). Devillers et al. (2005) attempted to overcome the limits of small sets of emotions and created a multimodal annotation scheme using a corpus of recorded and transcribed phone call exchanges from two different types of call centers (financial and medical). This Multi-level Emotion and Context Annotation Scheme (MECAS) was developed for both speech-only and multimodal data and allowed for fine-grained, as well as seven coarse-grained, emotions. Each turn of speech was annotated using a set of 24 labels. Their 21 fine-grained emotions were anxiety, stress, fear, panic, annoyance, impatience, cold anger, hot anger, disappointment, sadness, despair, hurt, embarrassment, relief, interest, amusement, surprise, neutral, dismay, resignation, and compassion. MECAS also includes positive, negative, and unknown labels. These emotions were rated for intensity on a five-point scale. Devillers et al. (2005) went beyond typical annotation to acknowledge potential mixtures of emotions, allowing both a major and a minor emotion label for each annotated unit, with major being the focal point and minor being a less obvious but still present element of emotion. These blended emotions were classified into three types: conflictual, when the major and minor emotions differ in valence (i.e., one negative and one positive); nonconflictual, when the major and minor emotions share the same valence (e.g., both positive); and ambiguous, when the major and minor emotions fall within the same coarse-grained emotion, which is more specific than valence.
While MECAS overcomes many of the issues we criticize in other schemes (it measures intensity, includes several emotions, and allows for more than one type of emotion), it separates into different categories some emotions that we would consider different levels of intensity of the same emotion (e.g., annoyance and anger), and it does not include other emotions that might be important for annotation in social media contexts (e.g., contempt, pride). It also only allows for the co-occurrence of at most two emotions. Further, as with the others reviewed here, because they were not created for social media, these annotation schemes do not distinguish between the personal reactions of the coders and the content of the text.

Some of these annotation guides were the basis of attempts to create keyword-based or automated measures for emotions (e.g., Aman & Szpakowicz, 2007), and additional corpora that are limited to sentiment and/or Ekman/Plutchik lists of emotion abound (see review in Schuff et al., 2017). Given the growing importance of and problems with automated ways of measuring emotion (Stark & Hoey, 2020), we discuss some of the more popular automated measures.

Popular automated assessments of sentiment and emotion

There exist many automated, text-based methods to detect sentiment and emotion (e.g., Stieglitz & Dang-Xuan, 2013; Strapparava & Valitutti, 2004; see Note 11). While this section describes some of the types mentioned in the main text (and additional ones), it provides more methodological detail. There has long been interest in automatically detecting emotion in text (e.g., Strapparava & Mihalcea, 2007; see Bostan & Klinger, 2018, for a review). Automated measures are often deployed because they can scale to a huge number of posts and are relatively easy to use, versus annotation, which is time consuming even when not using trained annotators who understand the language and culture of the social media. Despite these realistic constraints that lead many researchers to use automated methods, it is important to point out that many of these metrics rely on emotion lists that are outdated or brief, and they thus may be limited with regard to social media in particular (e.g., Plutchik’s list; see Mohammad et al., 2015; see also Bostan & Klinger, 2018, and Schuff et al., 2017, for reviews). This section briefly describes some of the popular automated assessments. By their nature, these metrics currently rely on text-based assessments, and so simply cannot cover as many modalities as the annotation of an entire post that includes multimedia content (for a review of text-based emotion mining, see Yadollahi et al., 2017; see Note 12).

One of the most popular, Linguistic Inquiry and Word Count (LIWC), is used to score sentiment and a small set of emotions. LIWC assesses text by counting words from a dictionary of specific word lists (Pennebaker et al., 2015). LIWC has a general affect word list, as well as separate positive and negative affect word lists (Tausczik & Pennebaker, 2010). The negative affect word list includes words from separate anger, sadness, and anxiety lists, but extends beyond them; the positive affect list includes words such as love, nice, and sweet (Pennebaker et al., 2015). LIWC is a relatively easy-to-use, inexpensive program that can be applied to a range of different document types. The different categories were validated by comparing expert judges’ ratings of each individual word as to its fit within the categories (Pennebaker et al., 2007), and internal reliability was tested based on co-occurrence of words (corrected internal consistency alpha = .64 for positive affect, .55 for negative affect, .73 for anxiety, .53 for anger, and .70 for sadness; Pennebaker et al., 2015). The creators of LIWC initially warned against its use on short texts, noting it was designed to be used with at least 50 words (Pennebaker Conglomerates, 2016). This limitation may have made it less appropriate for analyzing short social media texts such as tweets, which historically were limited to 140 characters and even now allow only 280. LIWC has been used widely to study social media, including examining emotional and other types of content in comments on disinformation versus true news in Facebook posts (e.g., Barfar, 2019; though see below). More recent versions (starting with LIWC2015) include ‘netspeak’ and are more geared toward social media analysis.
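LIWC's dictionaries are proprietary, but its core mechanic, counting dictionary hits (including wildcard prefixes) and reporting each category as a percentage of total words, can be sketched with toy word lists. The lists below are illustrative stand-ins, not actual LIWC entries:

```python
import re

# Toy stand-in word lists; LIWC's real dictionaries are proprietary
# and far larger. A trailing "*" is a wildcard matching any suffix.
CATEGORIES = {
    "posemo": ["love", "nice", "sweet", "happ*"],
    "negemo": ["hate", "sad*", "angr*"],
}

def liwc_style_scores(text):
    """Return each category's hit count as a percentage of total
    words, which is the unit LIWC reports."""
    words = re.findall(r"[a-z']+", text.lower())
    scores = {}
    for category, entries in CATEGORIES.items():
        hits = 0
        for word in words:
            if any(word.startswith(e[:-1]) if e.endswith("*") else word == e
                   for e in entries):
                hits += 1
        scores[category] = 100.0 * hits / len(words) if words else 0.0
    return scores

scores = liwc_style_scores("I love this, so happy, but a bit sad")
# 9 words total: 2 posemo hits (love, happy*), 1 negemo hit (sad*)
```

The percentage-of-words unit also makes concrete why very short posts are problematic: a single hit in a ten-word tweet swings the score by ten percentage points.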

As alluded to above, the National Research Council (NRC) provides several manually annotated sentiment and emotion lexicons (see Note 13), of which the oldest and best known is the NRC Emotion Lexicon or EmoLex (Mohammad & Turney, 2013). EmoLex consists of human annotations of positive and negative sentiment and Plutchik’s (1962, 2001) eight basic emotions—joy, sadness, anger, fear, disgust, surprise, trust, and anticipation—for 14,200 English word types, of which 4462 are annotated as being associated with at least one of the eight emotions. More recently, the NRC released a new lexicon building on EmoLex, the NRC Emotion/Affect Intensity Lexicon (NRC-EIL; Mohammad, 2018). Unlike the simple Boolean (binary) annotations in EmoLex, the NRC-EIL provides scalar intensities for emotional association with nearly 10,000 words either chosen from EmoLex or frequently co-occurring with other emotional words. Although both lexicons were annotated solely by English-speaking annotators rating English words, the NRC provides automatic translations of the original English lexicons in over a hundred languages, including Polish and Lithuanian. As with other lexicons, this method leaves out many emotions (Cowen & Keltner, 2017, 2021).
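The public EmoLex release is distributed as tab-separated word–category–flag triples; scoring a post then amounts to looking up each token. A minimal sketch with a few invented rows (the exact column layout should be verified against the downloaded lexicon file):

```python
from collections import defaultdict

# Invented rows in the tab-separated word<TAB>category<TAB>flag layout
# assumed for the public EmoLex release; these are not real entries.
ROWS = [
    "abandon\tfear\t1",
    "abandon\tsadness\t1",
    "abandon\tjoy\t0",
    "cherish\tjoy\t1",
    "cherish\ttrust\t1",
]

def load_emolex(rows):
    """Map each word to the set of categories flagged 1 for it."""
    lexicon = defaultdict(set)
    for row in rows:
        word, category, flag = row.split("\t")
        if flag == "1":
            lexicon[word].add(category)
    return lexicon

def emotion_counts(text, lexicon):
    """Count tokens in `text` associated with each category."""
    counts = defaultdict(int)
    for token in text.lower().split():
        for category in lexicon.get(token, ()):
            counts[category] += 1
    return dict(counts)

lexicon = load_emolex(ROWS)
```

Because the lookup is purely token-based, this style of scoring inherits the context problems discussed later (e.g., "kill" scoring as anger regardless of sense).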

More promising is VADER (Valence Aware Dictionary for Sentiment Reasoning; Hutto & Gilbert, 2014), which was created by constructing a list of over 9000 lexical feature candidates, including emoticons and slang, and then collecting intensity and polarity ratings of those features from ten independent raters. Several iterations of assessments, including with two human experts, were used to qualitatively identify properties of text that would affect perceived sentiment intensity (e.g., capitalization, certain punctuation). The final version of VADER, which measures positive and negative sentiment, thus went beyond a word list to incorporate rules and heuristics garnered by humans.

Hutto and Gilbert (2014) compared VADER’s coding performance to that of seven sentiment analysis lexicons, including LIWC. VADER performed as well as individual human raters at matching the aggregated mean from 20 coders for sentiment intensity for each tweet (text-only), which was their measure of ground truth (r = .88). When measured on a three-way sentiment classification task, VADER had a precision score of .99, a recall score of .94, and an overall F-score of .96 (Hutto & Gilbert, 2014). LIWC scored lower on both intensity matching and classification, with a correlation with the social media text gold standard of r = .62, precision of .94, recall of .48, and an overall F-score of .63. Thus, VADER may be superior to LIWC and other lexical measures in examining sentiment polarity in Twitter data. However, VADER does not measure specific emotions, but positivity to negativity.
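VADER's single "compound" score is produced by summing the valence of matched features (after its rules adjust for capitalization, punctuation, and so on) and squashing that sum into (−1, 1). A sketch of the normalization and classification steps, where the alpha = 15 constant and the ±0.05 thresholds are taken from the reference implementation and common practice, respectively, and should be verified against the released code:

```python
import math

def vader_compound(summed_valence, alpha=15):
    """Squash an unbounded sum of feature valences into (-1, 1).
    alpha = 15 is the constant used in VADER's reference
    implementation (an assumption here, not from the article)."""
    return summed_valence / math.sqrt(summed_valence ** 2 + alpha)

def classify(compound, pos_threshold=0.05, neg_threshold=-0.05):
    """Three-way sentiment call using the thresholds conventionally
    applied to VADER's compound score."""
    if compound >= pos_threshold:
        return "positive"
    if compound <= neg_threshold:
        return "negative"
    return "neutral"

# The mapping is monotonic and saturating: stronger summed valence
# pushes the compound score toward +/-1 without ever reaching it.
```

The saturating shape is a design choice: additional strongly valenced words keep increasing the score, but with diminishing effect, so one extra exclamation cannot flip a long post from mildly to maximally positive.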

A separate lexicon relevant here is the Lexical Suite (http://www.lexicalsuite.com/), which is the second version of the Evaluative Lexicon (Rocklage et al., 2018). In addition to measures of valence that are strongly correlated with LIWC valence, this suite has a measure of emotionality. Emotionality here is the degree to which a person’s evaluation or attitude, as expressed in text, is emotional rather than cognitive. Rather than simply expressing how much a person likes or dislikes something, the words capture amounts of emotion generally speaking (e.g., “fantastic” has a higher emotionality score than “valuable”). The authors created these word lists through a lengthy, iterative process that included judges’ ratings of candidate words. They found that Emotionality was essentially uncorrelated with LIWC valence (−.11) but significantly positively associated with arousal (.43), suggesting discriminant and convergent validity, respectively (Rocklage et al., 2018). Of note, this measure does not distinguish among different emotions (e.g., love from contempt). However, for our convergent/discriminant validity tests, it offers an additional dimension to compare with our annotation scheme.

New lexicons developed using machine learning techniques such as unsupervised clustering overcome some of the issues of context by associating words in vector space. A context problem, as it might occur in a typical lexicon, is when a word’s meaning changes depending on its context. For instance, “kill” is in the list of words for “anger” in LIWC2015; however, there are different emotional connotations for “she killed her,” “she made a killing in the stock market,” and “that meme just killed me.” Jiang and Wilson (2018) created ComLex in part because they wanted a context-specific lexicon, and so developed 300 categories based on over 2 million user comments on social media (Facebook, Twitter, and YouTube). However, although the authors frequently refer to the clusters as linguistic signals of emotion, these categories are neither based on top-down theoretical constructs of emotion nor intuitive lay categories of emotion. Instead, the dimensions are clusters of words and/or symbols that are then given a category label. Some of the emotion lists are mainly just emoji: the list for ‘funny’ includes a frog, an alien, a pizza slice, and a beer; the one for ‘doubt’ has not only a question mark and a shrug, but also an eye, a female symbol, a male symbol, and emoji of a female face, a male face, and a baby/child face (see the ComLex at https://shanjiang.me/resources/ComLex.csv). One of the less validated topic word lists that includes more emotion terms has the words happy, glad, and excite, but also includes sure, sorry, sick, proud, tire, afraid, fearful, ashamed, lucky, confuse, jealous, and hop. Their lexicon was developed to examine fact-checked social media posts in particular and should not be taken as a set of psychologically validated emotion lists.

Appendix 2: Emotion Categories (Version 3.32, January 2020)

Table 10 Emotion categories and brief descriptions

Appendix 3: Example annotation

Fig. 6

TOP: “Shame! Parliament members are building women's hell!” About today's demonstrations and reading lists of disgrace – members of parliament who voted against women's right to health and dignity – in today's Fakty TVN. BOTTOM: “Black protests” of the Razem Party. Thousands of people dressed in black against the abortion ban. TEXT ON SIGNS: Hands off women; Women’s hell continues; Freedom of choice instead of terror. EMOTION ANNOTATION: Anger (80), Excitement (50), Contempt (20), Hate (10), Fear (10).

Fig. 7

TOP: It’s difficult to follow the recommendation “stay home” when you are homeless. BOTTOM: “Medics on the street”: meet people who help the homeless from Warsaw during pandemic. EMOTION ANNOTATION: Sadness (65), Empathic Pain (30), Admiration (20).

Fig. 8

TOP: Hungarians told immigrants NO! They are finishing building another border wall. The government has labeled illegal immigrants as criminals and is deporting them from the country. And yesterday, Orban announced changes to the constitution. Keep it up! BOTTOM: Orban: immigrants have one week to leave Hungary. TEXT ON IMAGE: You know what matters? Border protection! (rhyming) Keep it up! EMOTION ANNOTATION: Admiration (50), Hate (48), Anger (23), Amusement (25), Excitement (16).

Appendix 4: Correlation table p-values

Table 11 English annotated emotion correlations p-values
Table 12 English annotated and automated emotion correlations p-values
Table 13 Lithuanian annotated emotion correlations p-values
Table 14 Lithuanian annotated and automated emotion correlations p-values
Table 15 Polish annotated emotion correlations p-values
Table 16 Polish annotated and automated emotion correlations p-values



Cite this article

Paletz, S.B.F., Golonka, E.M., Pandža, N.B. et al. Social media emotions annotation guide (SMEmo): Development and initial validity. Behav Res (2023). https://doi.org/10.3758/s13428-023-02195-1
