Introduction

Human emotion recognition refers to the process of discovering a person’s state of mind and the reactions associated with a specific event or situation. It is performed with sophisticated algorithms that, thanks to technological progress, have become increasingly efficient in tracking emotional changes. The research interest is motivated above all by promising applications in the areas of ambient assistance, decision support, and the prevention of emotional hazards [1]. According to a “Research and Markets” report,Footnote 1 the global emotion recognition market is expected to grow to about 65 billion by 2023, with a 39% annual growth rate between 2017 and 2023.

Algorithms for detecting emotional changes use different context information and modalities: facial cues, speech variations, gestures, data from body or brain sensors, and more. Some approaches rely on the person’s self-assessment to measure the momentary emotion. Each approach exhibits advantages and disadvantages, depending on the context or setting in which it is used.

This paper aims to provide a comprehensive overview of the current state of important aspects of machine recognition of human emotions to assist the interested community in developing and improving methods, techniques, and systems. In doing so, we not only add recent findings to the body of knowledge established by previously published reviews and original papers. Rather, we bring together different perspectives: (1) Systems that use emotional data, (2) the datasets they use, (3) description languages, and (4) domain-specific models and ontologies as the basis for recognition algorithms.

To this end, we have analyzed a large number of papers from various relevant areas such as emotion interpretation theories, modeling languages, ontologies, datasets, and output interfaces from the perspective of the use of emotion recognition by intelligent systems. In addition, we have surveyed and evaluated the review articles already published in this area so that our paper can offer as comprehensive a compendium as possible.

The paper has the following structure: “Methodology” briefly describes the criteria we used to search literature sources and, in particular, previously published review articles. “Human Emotional Functioning” deals with human emotional functioning, the manifestations of human emotions, and approaches to their interpretation. “Languages and Ontologies” focuses on ontologies and domain-specific modeling approaches for describing emotions. “Datasets” gives an overview of the datasets used in the literature for emotion recognition, followed by a compilation of systems for emotion recognition (“Systems for Emotion Recognition”) and of systems that exploit emotion data (“Information Systems (IS) Exploiting Emotion Data”). The paper concludes with a summary of the results (“Summary”) and a brief outlook on open research questions (“Open Research Questions”).

Methodology

The comprehensive overview provided in this paper is based on findings from existing studies and literature available online (e.g., Google ScholarFootnote 2). Five aspects play a special role in the context of emotion recognition: models, languages, ontologies, datasets, and systems. In total, we selected 230 relevant literature sources on these aspects and referenced them in the bibliography. In contrast to our approach, which considers all five aspects, most published studies focus on one of these aspects or on one emotional modality like spoken language, video, image, etc. For example, study [2] focuses on speech datasets from recent years, and [3, 4] overview miscellaneous facial datasets based only on video, audio, or image input. [5, 6] review different approaches for detecting emotion from text only.

To validate this finding, we systematically searched Google Scholar for articles that had relevant terms in their title (e.g., allintitle: emotion language survey). In detail, the search terms used were combinations of the phrases “emotion model,” “emotion language,” “emotion ontology,” “emotion dataset,” and “emotion information system,” in conjunction with the keywords “survey” and “review,” respectively, as shown in Fig. 1. The keyword “review” was selected additionally because authors appear to use it more often than “survey”. The quantitative results of these searches are shown in Fig. 2, which also contains the result of a search with all emotion aspects combined in one query: as expected, no such study could be found.

Fig. 1 Google Scholar search result using term “InTitle”

Fig. 2 Number of Google Scholar results for each human emotion aspect combined with the keywords “Survey” and “Review”

Human Emotional Functioning

Emotions are specific reactions to an experienced event [7]. In the domain of cognitive psychology, the term emotion refers to “specific sets of physiological and mental dispositions triggered by the brain in response to the perceived significance of a situation or object” [8]. Theories of cognitive psychology suggest that the individual interpretation of a situation or event influences emotions and behaviors [9,10,11]: guided by their beliefs, different people may interpret the same events differently. Psychologists study how people interpret and understand their worlds and how emotions prepare people for acting and regulating their social behavior [12]. Emotional expression is also an important part of emotional function, as human emotions can influence a person’s physical reactions [13]. These emotional reactions are mediated by speech, facial expressions, body gestures, or physiological signals [14].

In this section, we present approaches to categorize, model, interpret, and understand emotions.

Categorizing the Manifestations of Emotion

The emotion felt by a person cannot be “grasped” physiologically in everyday life, so one must rely on the external manifestations through which emotions express themselves: emotion controls many modes of visible human behavior like gestures, facial expressions, posture, voice tone, respiration, and skin color. All this affects the way people interact with each other. Psychologists and engineers have conducted several studies to understand and categorize the manifestations of emotions.

Facial Expressions

Ekman and Friesen [15] conducted a study on the universality of facial expressions and classified them in relation to six basic emotions: anger, happiness, sadness, disgust, surprise, and fear. Based on this, they developed a taxonomy of facial muscle movements (Facial Action Coding System, FACS) that is general enough to describe a person’s basic emotional state by analyzing the relationship between points on the person’s face [16, 17]. Many studies use Ekman’s results as the basis for the recognition task. A similar approach [18] describes a facial expression as the result of so-called action units (AUs) that capture the possible movements of facial muscles. Such facial movements occur in most people and, in combination, can reflect certain emotions. Table 1 shows the connection between certain combinations of action units and the basic emotion they express. For instance, happiness is calculated as a combination of action unit 6 (cheek raiser) and action unit 12 (lip corner puller) [19].

Table 1 Action unit (AU) combinations that determine basic emotions [20, 21]
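To make such a lookup concrete, the following minimal Python sketch maps sets of detected AUs to basic emotions. Only the happiness entry (AU6 + AU12) comes from the text above; the remaining prototypes are commonly cited FACS-based approximations and should be checked against Table 1.

# Minimal sketch: looking up a basic emotion from a set of detected action units (AUs).
# Only the happiness entry (AU6 + AU12) is taken from the text; the other entries are
# commonly cited FACS-based prototypes and serve as illustrative assumptions.
AU_TO_EMOTION = {
    frozenset({6, 12}): "happiness",        # cheek raiser + lip corner puller (from the text)
    frozenset({1, 4, 15}): "sadness",       # illustrative
    frozenset({1, 2, 5, 26}): "surprise",   # illustrative
    frozenset({4, 5, 7, 23}): "anger",      # illustrative
}

def classify_aus(detected_aus):
    """Return the basic emotion whose AU prototype is contained in the detected AUs."""
    detected = frozenset(detected_aus)
    for prototype, emotion in AU_TO_EMOTION.items():
        if prototype <= detected:
            return emotion
    return "unknown"

print(classify_aus({6, 12, 25}))  # -> "happiness"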

Vocal Expressions

The speech signal contains both explicit (linguistic, i.e., the message presented) and implicit (paralinguistic) information such as references to the emotional state of the speaker. Acoustic speech signals are mainly generated by the vibration of the vocal cords, whereby the frequency determines the pitch of the tone. Further parameters of the speech signal are the intensity, the duration of spectral speech properties, the contour, mel-frequency cepstral coefficients (MFCCs), the fundamental tone, and voice quality [22]. The variation of the pitch and its intensity together form the prosody. Many features of speech signals are used to extract emotions. Table 2 relates different emotional states to these common vocal parameters.

Table 2 The properties of speech signal-based emotion analysis [22]
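As an illustration of how such vocal parameters are typically extracted in practice, the following sketch computes pitch, intensity, and MFCC features with the open-source librosa library (an assumption; the cited studies do not prescribe a particular toolkit, and "speech.wav" is a placeholder file name). The resulting utterance-level feature vector could then be fed into an emotion classifier.

# Sketch: extracting prosodic and spectral features often used for speech emotion recognition.
import numpy as np
import librosa

y, sr = librosa.load("speech.wav", sr=16000)

# Pitch (fundamental frequency) contour via probabilistic YIN.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr)

# Intensity proxy: short-time root-mean-square energy.
rms = librosa.feature.rms(y=y)[0]

# Spectral envelope: 13 mel-frequency cepstral coefficients (MFCCs).
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# Summarize frame-level features into one utterance-level vector (means and standard deviations).
features = np.concatenate([
    [np.nanmean(f0), np.nanstd(f0)],        # pitch statistics
    [rms.mean(), rms.std()],                # energy statistics
    mfcc.mean(axis=1), mfcc.std(axis=1),    # spectral statistics
])
print(features.shape)  # e.g., (30,)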

Physiological Signals

In general, physiological signals are divided into two categories: (1) signals derived from peripheral nervous system phenomena such as heart rate (ECG) and skin conductance (GSR) and (2) signals derived from the central nervous system such as brain signals (EEG) [23]. Physiological signals can be collected via wearable sensors and evaluated for the classification and identification of emotions [24]. For emotion classification, signal features like frequency, amplitude, minima, and maxima are analyzed. Popular approaches are Support Vector Machines (SVM) [25], Fisher linear discriminant projection [26], Canonical Correlation Analysis (CCA) [27], Artificial Neural Networks (ANN) [28], K-Nearest Neighbors (KNN) [29], the Adaptive Neuro-Fuzzy Inference System (ANFIS) [30], or the Bayesian network method [31]. There are several ways to derive emotions from physiological signals: (1) measure various parameters of the signal and compare the results to a Self-Assessment Manikin (SAM) questionnaire [32] (see Fig. 3); (2) estimate emotion based on facial expressions and overall impressions by psychological experts; (3) correlate the results to a gold standard, such as facial recognition or EEG; (4) compare the results with a well-known stimulus set for eliciting emotions, such as the International Affective Picture System (IAPS) [33], to generate comparable results [34].

Fig. 3 Self-assessment Manikin to quantify emotion, original see [67]
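A minimal sketch of approach (1): simple statistical features are computed from a physiological signal and classified with an SVM, with SAM self-assessment labels assumed as ground truth. The data below are random placeholders standing in for real sensor recordings.

# Sketch: classifying emotions from physiological signal features with an SVM.
# Random data stand in for real GSR/ECG recordings and SAM-based labels.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
signals = rng.normal(size=(200, 512))          # 200 windows of a (placeholder) physiological signal
labels = rng.integers(0, 2, size=200)          # 0 = low arousal, 1 = high arousal (from SAM ratings)

# Feature extraction: amplitude statistics and extrema, as mentioned in the text.
features = np.column_stack([
    signals.mean(axis=1), signals.std(axis=1),
    signals.min(axis=1), signals.max(axis=1),
])

X_train, X_test, y_train, y_test = train_test_split(features, labels, random_state=0)
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X_train, y_train)
print("accuracy:", clf.score(X_test, y_test))  # near chance here, since the data are random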

Body Gesture

Gestures (except sign language) are a form of non-verbal interaction in which a human moves a certain part of the body, especially the hands or the head. This movement is used to convey a message and additional information such as human emotions [35]. Figure 4 depicts emotion expressions that are associated with body pose and motion. In addition, one can deduce emotion parameters from measured movement values such as speed, amplitude, and time expenditure of the body parts involved in the various gesture phases (preparation, stroke, and relaxation). Table 3 lists frequent arm movements that express certain emotions.

Fig. 4 Emotion interpretation from body movements, original see [208]

Table 3 Gesture-based movement factors that express emotion [36]

Multimodal Emotion

The aforementioned methods of identifying emotions can also be combined, in which case we speak of a multimodal approach to emotion recognition. Fusing modalities may increase the performance of systems for emotion recognition [14]. For example, the integration of facial expressions and speech signals leads to a new “audiovisual” signal [37], and the combined evaluation of audio-visual and physiological signals can further reduce the error rate of emotion recognition.
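A common way to combine modalities is late (decision-level) fusion, in which per-modality classifiers output class probabilities that are then merged, e.g., by a weighted average. The sketch below illustrates this idea with made-up probability vectors; the weights are illustrative assumptions, not values from the cited studies.

# Sketch of decision-level (late) fusion: combine per-modality emotion probabilities.
import numpy as np

EMOTIONS = ["anger", "happiness", "sadness", "disgust", "surprise", "fear"]

# Placeholder outputs of a facial-expression model and a speech model for one sample.
p_face   = np.array([0.10, 0.55, 0.10, 0.05, 0.15, 0.05])
p_speech = np.array([0.20, 0.40, 0.15, 0.05, 0.10, 0.10])

def late_fusion(probs, weights):
    """Weighted average of per-modality probability vectors, renormalized to sum to 1."""
    fused = sum(w * p for w, p in zip(weights, probs))
    return fused / fused.sum()

fused = late_fusion([p_face, p_speech], weights=[0.6, 0.4])  # weights are illustrative
print(EMOTIONS[int(np.argmax(fused))])  # -> "happiness"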

Emotion Theories

In the theoretical discussion, different views on emotions are represented, which are reflected in several theories and descriptive models. A simple approach consists of relating “emotion” to phenomena like anger, fear, or happiness. More sophisticated approaches describe emotions in a multidimensional space with an unlimited number of categories. Brosch et al. [38] distinguish four main directions: the Ekman basic emotion theory [39], the appraisal theory [40], dimensional theories [41,42,43], and the constructivist theory of emotion [44]. Combining theories leads to further refinements. For instance, [45] combines appraisal and dimensional theory, and [46] claims that there are two main views for emotion classification, namely basic and dimensional.

Basic Emotion Theory

Ekman [39] assumes emotions to be discrete and related to a fixed number of neural and physically represented states. As already mentioned in Facial Expressions, he proposes a division into six basic emotion categories. Barrett [47] extends this view by associating each emotion with a specific and unmistakable set of bodily and facial responses. An emotion is thus represented by a characteristic syndrome of hormonal, muscular and autonomic responses that are coordinated in time and correlated in intensity [48]. E.g., each emotion comes with different, distinguishable facial movements.

Appraisal Theory

Appraisal theory defines emotions as processes and not as states [49]. It assumes that the interpretation (appraisal, assessment) of a given situation causes a specific emotional reaction in the interpreting person [40]. Consequently, not the situation per se, but the individual assessment of that situation causes the type and intensity of emotional reaction.

Two-Dimensional Theory

The central assumption of dimensional theories is that emotional states can be represented by variation across certain dimensions [42, 43, 50]. In the two-dimensional case, the dimensions valence and arousal are considered (see Fig. 5): Valence denotes the polarity of emotions (positive or negative), and arousal indicates the intensity (high or low). In this way, all emotions can be classified in the arousal-valence coordinate system.

Fig. 5 Two-dimensional arrangement of valence and arousal following [190]
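The dimensional view lends itself to a simple computational representation: an emotion label can be approximated from a (valence, arousal) pair by quadrant, as sketched below. The quadrant labels are illustrative choices consistent with the usual reading of Fig. 5, not a normative mapping.

# Sketch: mapping a (valence, arousal) pair in [-1, 1] x [-1, 1] to a quadrant label.
def quadrant_emotion(valence: float, arousal: float) -> str:
    if valence >= 0 and arousal >= 0:
        return "happy/excited"      # positive valence, high arousal
    if valence < 0 and arousal >= 0:
        return "angry/afraid"       # negative valence, high arousal
    if valence < 0 and arousal < 0:
        return "sad/bored"          # negative valence, low arousal
    return "calm/relaxed"           # positive valence, low arousal

print(quadrant_emotion(0.7, 0.6))    # -> "happy/excited"
print(quadrant_emotion(-0.4, -0.5))  # -> "sad/bored"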

PAD Three-Dimensional Theory

Reference [41] introduces a third dimension and focuses on “Pleasure”, “Arousal”, and “Dominance” (PAD three-dimensional model; see Fig. 6). Pleasure expresses the degree to which the person perceives the situation as enjoyable or not, arousal represents the extent to which the situation stimulates the person, and dominance describes the extent to which the person is able to control her/his emotional state in the given situation [51]. The PAD model has been used to represent emotions for non-verbal interaction such as body language [52], or to represent emotions of animated characters in virtual worlds [53].

Fig. 6 PAD three-dimensional representation of emotions following [189]
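Analogously, a PAD triple can be compared against reference points of known emotions, e.g., by Euclidean distance. The reference coordinates below are illustrative placeholders on a [-1, 1] scale, not values taken from [41] or [51].

# Sketch: nearest-neighbor lookup of an emotion label for a (pleasure, arousal, dominance) triple.
# The reference coordinates are illustrative assumptions, not values from the PAD literature.
import math

PAD_REFERENCES = {
    "joy":     ( 0.8,  0.5,  0.4),
    "anger":   (-0.6,  0.7,  0.3),
    "fear":    (-0.6,  0.6, -0.6),
    "sadness": (-0.6, -0.4, -0.4),
    "calm":    ( 0.5, -0.5,  0.2),
}

def closest_emotion(p: float, a: float, d: float) -> str:
    return min(PAD_REFERENCES,
               key=lambda e: math.dist((p, a, d), PAD_REFERENCES[e]))

print(closest_emotion(-0.5, 0.65, -0.5))  # -> "fear"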

Plutchik’s Wheel of Emotions

Robert Plutchik [54] assumes eight basic emotions, namely joy, trust, fear, surprise, sadness, disgust, anger, and anticipation, the combination of which results in more complex ones. In his “wheel” model (see Fig. 7), these basic emotions form the “spokes” of a wheel and are colored with different colors from the color spectrum. Related emotions are placed on the same spoke and are gradually shaded with the spoke color according to their intensity (arousal). E.g., joy is arranged with ecstasy (higher arousal) and serenity (lower arousal) on the same (yellow) spoke. Opposing emotions are arranged on opposite spokes, e.g., joy versus sadness, anger versus fear, trust versus disgust, and surprise versus anticipation [54]. More complex emotions can be represented as combinations of basic emotions, similar to the way primary colors are combined. The wheel rims indicate such combinations. For example, joy and trust combine to love, and boredom and annoyance combine to contempt.

Fig. 7 Plutchik’s Wheel of emotions [188], picture from wikimedia.org
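The combination rule of the wheel can be expressed as a simple lookup over unordered pairs of (basic or lower-intensity) emotions. The sketch encodes only the two dyads mentioned above; further dyads would be added analogously.

# Sketch: Plutchik-style dyads as a lookup over unordered pairs.
# Only the combinations mentioned in the text are encoded; others would be added the same way.
DYADS = {
    frozenset({"joy", "trust"}): "love",
    frozenset({"boredom", "annoyance"}): "contempt",
}

def combine(emotion_a: str, emotion_b: str) -> str:
    return DYADS.get(frozenset({emotion_a, emotion_b}), "no named dyad")

print(combine("trust", "joy"))          # -> "love"
print(combine("annoyance", "boredom"))  # -> "contempt"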

OCC Model of Emotion

Ortony, Clore, and Collins (OCC) [55] assume that emotions arise from assessing observations about events, agents, and objects. Events are manifestations that occur at a certain point in time; agents can be people, animals, machines, or abstract entities such as institutions. Several studies have employed the OCC model to reason about emotions or to generate emotions in artificial characters. The OCC model classifies emotions into 22 categories (see Fig. 8).

Fig. 8 Structure of the OCC model (the 22 categories are highlighted in red)
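The OCC structure is essentially a decision tree over the appraised stimulus (event, agent action, or object) and a few appraisal variables. The sketch below encodes a small, simplified subset of the 22 categories; the rules follow the commonly cited top-level branches of the model and are not a complete or exact implementation.

# Sketch: a simplified subset of OCC-style appraisal rules (illustrative, not the full model).
def occ_emotion(stimulus: str, **appraisal) -> str:
    if stimulus == "event":                      # consequences of events (for oneself)
        return "joy" if appraisal.get("desirable") else "distress"
    if stimulus == "agent_action":               # actions of agents
        praiseworthy = appraisal.get("praiseworthy", False)
        if appraisal.get("self", False):
            return "pride" if praiseworthy else "shame"
        return "admiration" if praiseworthy else "reproach"
    if stimulus == "object":                     # aspects of objects
        return "love" if appraisal.get("appealing") else "hate"
    return "undefined"

print(occ_emotion("event", desirable=True))                        # -> "joy"
print(occ_emotion("agent_action", self=False, praiseworthy=True))  # -> "admiration"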

Construction Theory

Humans build up their own knowledge of the world based on their experiences. The constructivist model therefore assumes that emotions are psychological constructs built from more basic psychological components [44].

Languages and Ontologies

For dealing with emotions in computerized applications, suitable descriptive languages are needed, i.e., languages that allow a representation of, e.g., the models described above. Although universal modeling languages could, of course, be used here as in all other areas, first efforts to develop dedicated domain-specific languages for the field of emotion modeling can be observed. However, these are mainly simply structured markup languages, although the emergence of powerful meta-modeling platforms [41, 43, 50, 51, 52] would allow for the definition and use of comprehensive domain-specific conceptual modeling languages [53].

Domain-specific concepts come with various rules, constraints, and semantics [56]. Ontologies are used for their semantic foundation and formal definition [57, 58]. References [59, 60] investigate the integration of ontologies with domain-specific languages at the meta-model level and the automation of the reasoning process. [61] discusses the use of the formal semantics of the Web Ontology Language (OWL)Footnote 3 together with reasoning services for addressing constraint definition, suggestions, and debugging. The ontology-based approach presented in [62] allows for integrating different domains and reasoning services. In [63], the authors propose a “User Story Mapping-Based Method” for extracting knowledge from the relevant domain by applying a formal guideline. The authors demonstrate how to establish a full semantic model of a particular domain using ontology concepts.

Languages

Standardizing emotion representations by a closed set of emotion denominators is perceived as being too restrictive. On the other hand, leaving the choice of emotion annotation completely open is not considered appropriate either [64]. Consequently, there seems to be no specific standard language that covers all aspects of emotions as they appear in the approaches and theories described above. However, markup languages have been presented that provide a set of syntactic and semantic description concepts and thus satisfy the demands of some researchers. This section describes popular markup languages and the purpose behind their development.


(EmotionML) Emotion Markup Language [65] has been presented by the W3C Consortium to allow for (1) manual annotation of material involving emotionality, (2) automatic recognition of emotions from sensor data, speech recordings, etc., and (3) generation of emotion-related system responses, which may involve reasoning about the emotional aspects, e.g., in interactive systems.Footnote 4 Due to the lack of generally agreed descriptors, EmotionML does not come with a fixed emotion vocabulary, but proposes possible structural elements and allows users to “plug in” their own favorite vocabulary. Concerning the basic structural elements, it is assumed that emotions can be described in terms of categories or a small number of dimensions, and that emotions involve triggers, appraisals, feelings, expressive behavior, and action tendencies [65]. Consequently, EmotionML is a quite general-purpose XML-based language that implicitly realizes Ekman’s basic emotion theory [39] and dimensional theories [41].
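To illustrate the “plug in your own vocabulary” idea, the following sketch generates a small EmotionML document with Python’s standard xml.etree.ElementTree. The namespace, element, and attribute names follow the W3C EmotionML recommendation as recalled here; the vocabulary URI and the concrete values are illustrative and should be checked against the specification.

# Sketch: building a minimal EmotionML document with the Python standard library.
import xml.etree.ElementTree as ET

NS = "http://www.w3.org/2009/10/emotionml"
ET.register_namespace("", NS)

root = ET.Element(f"{{{NS}}}emotionml")
emotion = ET.SubElement(root, f"{{{NS}}}emotion",
                        {"category-set": "http://www.w3.org/TR/emotion-voc/xml#big6"})
ET.SubElement(emotion, f"{{{NS}}}category", {"name": "happiness"})   # plugged-in vocabulary item
ET.SubElement(emotion, f"{{{NS}}}dimension", {"name": "arousal", "value": "0.3"})

print(ET.tostring(root, encoding="unicode"))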


(EARL) Emotion Annotation and Representation Language [66] is an XML-based language designed specifically for representing emotions in technological contexts with a focus on emotion annotation, recognition, and generation. EARL can represent emotions as categories, dimensions, or sets of appraisal scales [67]. EARL can be utilized for manual emotion annotation and for generating affective systems such as speech synthesizers or embodied conversational agents (ECAs) [68].


(EMMA) Extensible MultiModal Annotation [69] is a markup language intended for representing multimodal user inputs (e.g., speech, pen, keystroke, or gesture) in a standardized way for further processing. As such, it was designed for use in so-called Multimodal Interaction Frameworks (MMI) to resolve uncertainty in the interpretation of user input. EMMA distinguishes between instance data (contained within an EMMA interpretation for an input or an output) and the data model, which is optionally specified as an annotation of that instance. Multiple interpretations of a single input or output are possible. The EMMA structural syntax is quite simple and provides elements for the organization of interpretations and instances like the root (container with version and namespace information), the interpretation element, the container, or the literal element. EMMA markup is intended to be generated automatically by components such as speech recognizers, semantic interpreters, and interaction managers. It concentrates on a single input/output (e.g., a single natural language utterance). EMMA may be used as a “host language” for plugging in EmotionML to cover emotion interpretation, such that all EmotionML information is encoded in element and attribute structures. As an example,Footnote 5 see the following analysis of a non-verbal vocalization where emotion is described as a low-intensity state, maybe “boredom”:

figure a
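A minimal sketch of this hosting pattern, built with xml.etree.ElementTree: an EMMA interpretation element containing an EmotionML annotation for a low-intensity state such as “boredom”. The namespaces, attribute names, and numeric values are reproduced from the W3C EMMA 1.0 and EmotionML specifications as recalled here and should be treated as illustrative, not as the exact listing shown in figure a.

# Sketch: an EMMA interpretation hosting an EmotionML annotation for a low-intensity state.
# Namespaces and attribute names are given as recalled from the W3C specs; treat them,
# and the confidence/arousal values, as illustrative assumptions.
import xml.etree.ElementTree as ET

EMMA = "http://www.w3.org/2003/04/emma"
EMO = "http://www.w3.org/2009/10/emotionml"
ET.register_namespace("emma", EMMA)
ET.register_namespace("emo", EMO)

root = ET.Element(f"{{{EMMA}}}emma", {"version": "1.0"})
interp = ET.SubElement(root, f"{{{EMMA}}}interpretation",
                       {"id": "interp1", f"{{{EMMA}}}mode": "voice"})
emotion = ET.SubElement(interp, f"{{{EMO}}}emotion")
ET.SubElement(emotion, f"{{{EMO}}}category", {"name": "boredom", "confidence": "0.5"})
ET.SubElement(emotion, f"{{{EMO}}}dimension", {"name": "arousal", "value": "0.2"})  # low intensity

print(ET.tostring(root, encoding="unicode"))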

(VHML) Virtual Human Markup Language [70] is an XML-based markup language that is intended for use in computer animation of human bodies and facial expressions (e.g., the control of interactive “Talking Heads”). It was created to address various aspects of Human–Computer Interaction (HCI) such as facial or body animation, text-to-speech production, dialogue manager interaction, and emotional representation, plus hyper- and multimedia information.Footnote 6 For example, a “virtual human” who delivers some bad news to the user (“I’m sorry, I can’t find that file”) may speak with a sad voice, a sorry face, and a bowed body gesture. To produce the required vocal, facial, and emotional response, tags such as <smile>, <anger>, and <surprised> have been defined to make the Virtual Human believable and more realistic.


(SSML) Speech Synthesis Markup Language [71] has been produced by the Voice Browser Working Group to provide a rich, XML-based markup language for assisting the generation of synthetic speech in Web and other applications.Footnote 7 The essential role of the markup language is to provide authors of synthesizable content with a standard way to control aspects of speech such as pronunciation, volume, pitch, rate, etc. across different synthesis-capable platforms. Popular voice assistants (Google Assistant, Alexa, and Cortana) are known to use SSML. SSML adds markup elements (or tags) to input text to construct speech waveforms, to improve the quality of the synthesized content, and to make it sound more natural. For instance, the tag <amazon:emotion name="excited" intensity="medium"> tells Alexa to speak a string (e.g., “Three seconds till lift off”) in an “excited” voice. Speech synthesis can be beneficial in many text-to-speech applications, e.g., reading for the blind, supporting the handicapped, remote access to email, and proofreading.
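The following sketch assembles such an SSML request as a plain string; the <speak> root element and the <prosody> attributes are standard SSML, while <amazon:emotion> is the vendor-specific tag quoted above.

# Sketch: composing an SSML document that marks up emotional speech.
# <speak> and <prosody> are standard SSML; <amazon:emotion> is the vendor extension quoted in the text.
def excited_ssml(text: str) -> str:
    return (
        '<speak>'
        '<amazon:emotion name="excited" intensity="medium">'
        f'<prosody rate="fast" pitch="+10%">{text}</prosody>'
        '</amazon:emotion>'
        '</speak>'
    )

print(excited_ssml("Three seconds till lift off"))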

Emotion Ontologies

In the context of information systems, an ontology defines a set of conceptualizations and representational primitives with which to model a domain of knowledge or discourse [72]. In general, ontologies use primitives like categories, properties, and relationships between categories for organizing information into knowledge using machine readable statements to be computerized, shareable, and extensible. No universal ontology has been proposed for the domain of human emotions so far, but several approaches exist for conceptualizing particular important aspects of human emotions. In this section, we will classify these ontologies in terms of: name, goal, key concept, and underlying emotion theory (according to “Emotion Theories”). As there are many approaches in each class, we present these in table form.

Text Ontologies

Text is a powerful means to interact and transfer information as well as to express emotion. When dealing with text ontologies, it has to be kept in mind that in modern social software systems (like Facebook, Twitter, etc.) abbreviations of all kinds as well as so-called emoticons, emojis, etc. are used besides classical language-based text. Thus, establishing such an ontology is challenging. The real meaning of a text, possibly including such abbreviations, depends on the combination of all text components and the particular situation. A given text, therefore, may be ambiguous and lead to different ontological interpretations if the related situation (the context) is not considered. Table 4 summarizes ontologies that may be used for emotion inference from texts.

Table 4 Ontologies used for emotion inference from texts

Ontologies Conceptualizing Facial Cues, Speech, and Body Expressions

Facial expression synthesis has acquired much interest with the MPEG-4 standard [85]. The MPEG-4 framework is used for modeling facial animation based on facial expressions that implicitly reflect emotion. This can be beneficial in several domains such as psychology, animation control, or healthcare, particularly in the field of patient–computer interaction. An appropriate ontological conceptualization may also support the recognition of emotions from speech (e.g., study [86]). The same is true for body movements, which, however, are more context dependent and thus harder to synthesize in comparison. [87] employed a Virtual Human (VH) to increase the realism and reliability of body expressions. Table 5 summarizes the existing emotion ontologies conceptualizing facial cues, speech, and body expressions.

Table 5 Ontologies conceptualizing facial cues, speech, and body expressions

Context-Awareness Ontologies

In our social environment, the context has a significant impact on human emotions. The relationship between an emotion and its cause, i.e., the situation that triggers an emotional state, can be better understood when the observed contextual information is investigated. Therefore, appropriate contextual information is required, such as place, time, and things in the environment. Today, context-awareness has been introduced as a key feature in emotion recognition projects and has demonstrated its relevance and effectiveness. Table 6 summarizes emotion ontologies based on contextual information.

Table 6 Ontologies conceptualizing emotions and contextual information

High-Level Ontologies

High-level (general, upper, or top) ontologies are very generic and, in general, are defined to provide ontological foundations about the kinds of things that exist. They come with concepts like object, process, etc. The existing upper emotion ontologies supply the most significant shared concepts for representing human emotions. A developer can extend an upper-level ontology by defining lower-level concepts (as specializations of higher-level concepts) according to the development’s purpose. Table 7 summarizes the known high-level emotion ontologies.

Table 7 High-level emotion ontologies

Datasets

Datasets may help to accelerate work progress by providing a benchmark resource for analyzing and comparing system performance before tackling a system in real-life settings. This section lists and describes available datasets that are widely used for evaluating emotion recognition systems. The descriptions are summarized in Table 8.

Textual Datasets (Corpora)

Textual datasets, i.e., corpora, are widely used in computer linguistics for evaluating natural language processing systems like taggers, parsers, translators, etc. With the increasing number of social-media participants, emotion recognition from unstructured written text is of growing interest. There are several ways to represent emotions in texts depending on the particular emotion model used.


Stack Overflow (Q&A) dataset has been created by manually annotating 4800 posts (questions, answers, and comments) from the platform Stack Overflow (Q&A) [104]. Each post in the dataset was analyzed regarding the presence or absence of emotion. Of the total of 4800 posts, 1959 were labeled with a basic emotion.


EmoBank [105] is a large-scale corpus of 10 k English sentences. Data annotation was performed by applying the dimensional Valence-Arousal-Dominance (VAD) scheme [106]. A subset of the dataset was initially annotated according to Ekman’s six basic emotions, such that a mapping between both representation formats (dimensional and basic) becomes possible. Each sentence was rated with respect to both the writer’s and the reader’s perspective.


Emotion intensity dataset of tweets [107] was created to study the impact of word hashtags on emotion intensities in text. The annotation was performed on 1030 tweets containing a hashtag with a query term (#<query term>). The dataset annotates intensities for the emotions anger, joy, sadness, and fear. The study showed that “emotion-word hashtags” influence emotion intensity by conveying more emotion.


Social Media Posts-based dataset [108] has been developed for training prediction models for valence and arousal that achieve high predictive accuracy. The dataset consists of 2895 Facebook posts that have been rated by two psychologically trained experts on two separate ordinal nine-point scales regarding valence and arousal, thus defining each post’s position in the circumplex model of affect [43].


text_emotion is a CrowdFlower platform [109] dataset consisting of almost 40,000 annotated tweets like <1966197101,"hate","rachaelk_x","why is this english homework so hard i seem to be getting nowhere"> (csv format). 13 labels are used for annotation, for example “hate”, “sadness”, “worry”, “boredom”, “happiness”, “surprise”, etc.


News Headlines dataset consists of 1000 annotated news headlines extracted from news websites [110], such as the New York Times, CNN, and BBC News, as well as from news search engines. The reason for focusing on headlines is that they usually have a “high load of emotional content”. The annotation was performed by six independently working persons using a web-based interface that displayed one headline at a time, together with six slide bars for emotions and one slide bar for valence. For rating the emotional load, a fine-grained scale was used.


Twitter microblogs dataset [111] was established for exploring the impact and correlation of “external factors” like weather, news events, or time on a user’s emotional state. The dataset consists of 2557 tweets that were collected early in 2017. The tweets are self-tagged by their respective author with a #happy or #sad hashtag and come with metadata such as author, time, and location. They originate from 20 large US metropolitan areas.


SemEval-2018 Task 1 [112] provides a collection of datasets for English, Arabic, and Spanish tweets. In particular, an Affect in Tweets dataset of more than 22,000 annotated tweets has been established. For each emotion dimension (anger, fear, joy, and sadness), the data are annotated with fine-grained real-valued scores indicating the intensity of emotion.


GoEmotions [113] is the currently largest available manually annotated dataset of 58 k English Reddit comments, labeled for 27 emotion categories or Neutral.


EmoEvent [114] is a multilingual dataset collected from Twitter and based on different emotional events. A total of 8409 tweets in Spanish and 7303 in English were labeled with Ekman’s six basic emotions plus “neutral or other emotions”.

Emotional Speech Datasets

Speech datasets are utilized as a foundation for various studies in the field of emotional speech analysis. An overview is given in Table 8.


EMODB [115] contains German sentences of emotional utterances. Ten German actors simulate emotions and produce 10 utterances used in daily communication. The data were labeled by 20 annotators with a total of six emotion types plus neutral.


DES [116] Danish emotional voice data: Four Danish actors (two male and two female) convey emotional language as each of them considers realistic. The recordings of each actor consist of two isolated words, nine sentences, and two passages. The dataset contains emotional speech expressions in five emotional states (including neutral), annotated by 20 people.


MASC [117] Mandarin Affective Speech Corpus is a database of emotional speech consisting of 25,636 audio recordings of utterances and corresponding transcripts. It serves as a tool for linguistic and prosodic feature investigation of emotional expression in Mandarin Chinese, and for research in speaker recognition with affective speech. Five emotional states were recorded: Neutral, Anger, Elation, Panic, and Sadness. The recordings are from 68 speakers (23 females, 45 males); information about the speakers is available.


VERBO [118] is a database of speech in the Portuguese language of Brazil, called Voice Emotion Recognition dataBase. It consists of 14 sentences, each spoken by 12 professionals (6 actors and 6 actresses) for each of the six basic emotions plus neutral. This leads to a total of 1176 audio recordings, which were annotated by three people.


MSP-Podcast corpus [119] contains speech data from online audio-sharing websites (100 h by over 100 speakers), annotated with sentence-level emotional scores regarding four basic emotions (general, joy, anger, and sadness).


URDU-Dataset [120] is an Urdu-language speech emotion database that includes 400 utterances by 38 speakers (27 male and 11 female). Four basic emotions are annotated: anger, happiness, neutral, and sadness.


DEMoS [121] is an Italian emotional speech corpus. It contains 9365 emotional and 332 neutral samples produced by 68 native speakers (23 females, 45 males).


AESDD [122] The Acted Emotional Speech Dynamic Database contains around 500 Greek utterances by a diverse group of actors simulating five emotions.


Emov-DB [123] The Emotional Voices Database contains English recordings from male and female actors and French utterances from a male actor. The emotional states annotated are neutral, sleepiness, anger, disgust, and amused.


JL corpus [124] is a strictly guided simulated emotional speech corpus of four long vowels in New Zealand English. It contains 2400 recordings of 240 sentences by 4 actors (2 males and 2 females). Five primary emotions (angry, sad, neutral, happy, and excited) and five secondary emotions (anxious, apologetic, pensive, worried, and enthusiastic) are annotated.


EmoFilm [125] is a multilingual emotional speech corpus that consists of 1115 English, Spanish, and Italian emotional utterances extracted from 43 films and 207 speakers. Five emotions are recognized: anger, contempt, happiness, fear, and sadness.

Facial Expression Datasets

Recognizing emotions from facial expressions is an important area of research. Classifying facial expressions requires large amounts of data to reflect the diversity of conditions in the real world. Table 8 provides an overview.


UIBVFED [126] provides sequenced semi-blurry facial images with different head poses, orientations, and movement. Over 3000 facial images were extracted from the daily news and weather forecast of the public tv-station PHOENIX. Seven basic emotions are categorized, namely: sad, surprise, fear, angry, neutral, disgust, and happy as well as “None” if the facial expression could not be recognized.


FFHQ [127]: Flickr Faces HQ includes 70,000 high-quality .png images at a resolution of 1024 × 1024 and contains considerable variation in age, ethnicity, and image background. It also has good coverage of accessories like glasses, sunglasses, hats, etc. The images were crawled from Flickr.


Google Facial Expression Comparison Dataset [128] is a large-scale dataset consisting of facial image triplets along with human annotations. The latter indicate which two faces in each triplet form the most similar pair in terms of facial expression. The dataset was annotated by six or more human raters, which is quite different from existing expression datasets that focus mainly on basic emotion classification or action unit recognition.


Yale Face Database [129] is a dataset containing 165 GIF images of 15 different people in varying lighting conditions. The people in the images show distinct emotions and expressions (happy, normal, sad, sleepy, surprised, and winking).


CAS(ME)² Chinese Academy of Sciences Macro-Expressions and Micro-Expressions [130] contains both macro- and microexpressions in long videos, which the authors say facilitates the development of algorithms to detect microexpressions in long video streams. The database consists of two parts, one of which contains 87 long videos containing both macroexpressions and microexpressions from a total of 22 subjects, all filmed in the same setting. The other part includes 357 cropped expression samples containing 300 macroexpressions and 57 microexpressions. The facial expression samples were coded with facial action units marked and emotions labeled. In addition, participants were asked to review each recorded facial movement and indicate their emotional experience of it. Emotion is labeled using four types (negative, positive, surprise, and others). Happiness and sadness are classified as positive and negative, respectively. Surprise refers to an emotion that can be positive or negative. The “others” category represents ambiguous emotions that cannot be assigned to the mentioned categories.


AFEW Acted Facial Expressions in the Wild [131] was created in a semi-automatic process from 37 DVD movies: first, the subtitles were parsed and searched for keywords, and then the relevant clips found were assessed and annotated by a human observer. In total, AFEW version 4.0 includes 1268 clips annotated with the basic emotions (anger, disgust, fear, happiness, neutral, sadness, and surprise). The AFEW dataset was used several times as the basis for the “Emotion Recognition In The Wild Challenge”.


SFEW Static Facial Expressions in the Wild [132] has been created by selecting frames from the AFEW [131] dataset. It comprises 700 images labeled with the six basic emotions. It presents real-world images with a variety of properties of facial cues (e.g., head poses, age range, and illumination variation).


SPOS Spontaneous vs. Posed dataset [133] contains spontaneous and posed facial expressions from seven subjects who had been shown emotional movie clips to produce spontaneous facial expressions. Six categories of basic emotions were considered (happy, sad, anger, surprise, fear, and disgust). Subjects were also asked to pose these six types of facial expressions after watching the movie clips. Data were recorded with both the visual and near-infrared cameras. A total of 84 posed and 147 spontaneous facial expression clips were labeled.


CMU Multi-PIE [134] contains approximately 750 K images of 337 subjects, taken in up to four sessions over a 5-month period under 19 different illumination conditions. The images were captured from 15 different viewpoints to ensure data diversity. Six expression categories are used to label the images (neutral, smile, surprise, squint, disgust, and scream).


CAS-PEAL [135] includes about 99 K images taken from 1040 persons (595 males and 445 females), some of them with glasses and/or hats. Five posed expressions were captured in different head poses and under different lighting properties. The image backgrounds are varied with different colors to simulate the real world. The subjects were asked to display the emotional states neutral, smile, fear, and surprise.


GFT Facial Expression Database [136] comprises about 172K video frames taken from 96 subjects in 32 three-person groups. To analyze facial expression automatically, GFT includes expert annotations of Facial Action Coding System (FACS) occurrence and intensity, facial landmark tracking, and baseline results for linear Support Vector Machine (SVM), deep learning, active patch learning, and personalized classification.


ADFES Amsterdam Dynamic Facial Expression Set [137] is a rich stimulus set of 648 emotional expression movies taken from 22 persons (10 female, 12 male). It includes the six “basic emotions” as well as the emotion states of contempt, pride, and embarrassment. Active turning of the head is used to indicate the direction of the expressions.


Angled Posed Facial Expression Dataset [138] contains facial expressions videos shot from different angles and poses. The different observation angles are intended to facilitate emotion recognition.


Tsinghua facial expression dataset [139] contains 110 images of young and old Chinese showing eight facial expressions (Neutral, Happiness, Anger, Disgust, Surprise, Fear, Content, and Sadness). Each image in the dataset was labeled on the basis of perceived facial expressions, emotion intensity, and age by two different age groups.


MUG Facial Expression Database [140] contains image sequences taken from 86 people (35 women and 51 men, all of Caucasian origin) making facial expressions against a blue screen background. The database consists of two parts: the first contains images where subjects were asked to show the six basic expressions (anger, disgust, fear, joy, sadness, and surprise). The second part contains emotions generated in the laboratory: subjects were recorded while watching a video created to induce emotions. The dataset was created to address issues such as high resolution and uniform lighting.


RAF-DB Real-world Affective Faces Database [141] contains approximately 30 K different face images downloaded from the Internet, each labeled by approximately 40 people based on crowdsourcing annotation. The images vary in age, gender and ethnicity of the persons, head pose, lighting conditions, occlusions, post-processing procedures, etc. Each image is labeled with a seven-dimensional (basic) expression distribution vector, landmark locations, race, age and gender attributes, and other features.


FERG-DB [142] Facial Expression Research Group 2D Database. It contains 2D images of six stylized characters (3 males and 3 females) with annotated facial expressions. The database contains 55,767 annotated face images grouped in seven basic expressions—anger, disgust, fear, joy, neutral, sadness, and surprise.


FACES [143] is a set of images of naturalistic faces of 171 young (n = 58), middle-aged (n = 56), and older (n = 57) women and men, each showing one of six facial expressions: Neutrality, Sadness, Disgust, Fear, Anger, and Happiness. The database includes two sets of images per person and per facial expression, resulting in a total of 2052 images.


iCV-MEFED iCV-Multi-Emotion Facial Expression Dataset [144] was designed for multi-emotion recognition and includes about 31 K facial expressions with different emotions from 125 people with an almost uniform gender distribution. Each person shows 50 different emotions, and for each of these emotions, 5 samples were taken under uniform illumination conditions with relatively uniform backgrounds. All emotion expressions are labeled with seven basic emotional states (anger, contempt, disgust, fear, happiness, sadness, and surprise) plus neutral. The images were taken and labeled under the supervision of psychologists, and the subjects were trained on the emotions they posed.

Hybrid Emotion Datasets

Recognition results can be more accurate when data are collected from different modalities, such as text, audio, video, body, and physiological data. Table 8 lists the available datasets in the area of emotion recognition based on hybrid data sources. The table has been extended by a column titled “Type” to indicate the modality used to extract emotions.

Speech-Video Datasets

The datasets listed below combine voice and video. Table 8 provides an overview.


HUMAINE [145] data set contains naturalistic clip samples of emotional behavior in relation to the context (static, dynamic, indoor, outdoor, monologue, and dialogue). The emotional state is commented on in each case by a series of annotations associated with the clips: these include core signs in speech and language as well as gestures and facial features related to different genders and cultures. Six annotators have been used for a wide range of emotions (intensity, activation/sounds, valence, and power).


The Belfast [146] dataset contains clips extracted from television programs (chat shows and religious programs). These are recordings of people discussing emotional issues. 100 clips were annotated.


SEMAINE [147] dataset was created as part of an iterative approach to the creation of automatic agents called Sensitive Artificial Listener (SAL). SAL involves a person in an emotional conversation. The participants annotate each clip in five emotional dimensions (valence, activation, power, expectation/anticipation, and intensity).


IEMOCAP [148] (interactive emotional dyadic motion capture) is a multimodal, multispeaker dataset that includes video, speech, facial motion capture, and text transcriptions over a period of 12 h. Ten actors act out selected scenarios designed to evoke emotional expression. IEMOCAP was annotated by six annotators both with categorical (basic) labels and along dimensions.


GEMEP-FERA [149] comprises ten recordings of actors displaying expressions with different intensities. Five independent discrete emotions are labeled per video.


SAVEE [150] (Surrey Audio-Visual Expressed Emotion) dataset comes with six basic emotions plus neutral. It has been created as a pre-requisite for the development of an automatic emotion recognition system. SAVEE consists of recordings from four English actors in seven different emotions with a total of 480 British English utterances.


Biwi 3D-Audiovisual Corpus [151] contains 1109 dynamic 3D face scans taken while uttering an English sentence. The information was extracted by tracking the frames using a simple face template, by splitting the speech signal into phonemes, and by evaluating the emotions using an online survey. The data set can be used in areas such as audio-visual emotion recognition, emotion-independent lip reading, or angle-independent facial expression recognition.


OMGEmotion (One-Minute-Gradual Emotion) [152] dataset is composed of 567 emotion videos with an average length of 1 min, collected from a variety of YouTube channels using the search term “monologue”. The videos were separated into clips based on utterances, and each utterance was annotated by at least five independent subjects using an arousal/valence scale and a categorical emotion based on the universal emotions from Ekman.


RAVDESS [153] (Ryerson Audio-Visual Database of Emotional Speech and Song) is a set of multimodal, dynamic expressions of basic emotions. The dataset includes audio-visual recordings of 24 professional actors (12 female, 12 male) vocalizing two lexically matched statements in North American English.


CAER [154] (Context-Aware Emotion Recognition) dataset contains 13,201 video clips (with audio and visual tracks) and about 1.1 M frames that were extracted from 79 TV shows. Each clip is manually annotated with six emotion categories, including “anger”, “disgust”, “fear”, “happy”, “sad”, and “surprise”, as well as “neutral”. The clips range from short (around 30 frames) to longer ones (more than 120 frames) with an average length of 90 frames. A static image subset contains about 70,000 images. The dataset is randomly split into training, validation, and testing sets.


SEWA [155] is an audio-visual, multilingual dataset with recordings of facial, vocal, and speech behaviors made “in the wild”. It includes >2000 min of audio-visual data from 398 individuals (201 males and 197 females) and a total of six different languages. The recordings are annotated with face landmarks, facial action unit (FAU) intensities, different vocalizations, verbal cues, mirroring and rapport, continuously rated valence, arousal, and liking, and prototypical examples (templates) of (dis)liking and mood.


CMU-MOSEI Multimodal Opinion Sentiment and Emotion Intensity dataset [156] contains >23 K sentence utterances in >3300 video clips from >1000 online YouTube speakers. The dataset is gender balanced. The sentence utterances are randomly chosen from various topics and monologue videos. The videos are annotated with the six basic emotion categories (happiness, sadness, anger, fear, disgust, and surprise).


VAMGS The Vera Am Mittag German Audio-Visual Emotional Speech Database [157] consists of 12 h of recordings of the German TV talk show “Vera am Mittag” (Vera at Noon). They are divided into broadcasts, dialogue acts, and utterances, and contain spontaneous and highly emotional speech recorded from unscripted, authentic discussions between talk show guests. The video clips were annotated by a large number of human raters on a continuous scale for three emotion primitives: Valence (negative vs. positive), Activation (calm vs. excited), and Dominance (weak vs. strong). The video section contains 1421 segmented utterances from 104 different speakers, the audio section contains 1018 utterances, and the facial image section contains 1872 facial images labeled with emotions.

Hybrid Facial Expression Datasets

Several image and video datasets have been introduced for supporting the analysis and prediction of human emotional reactions based on facial expressions. An overview is given in Table 8.


Cohn-Kanade [158] CK dataset was made available to the research community in 2000. The image data consisted of about 500 image sequences of 100 subjects and were FACS (Facial Action Coding System, see above) annotated. An extended data set called CK+ was published in 2010. In CK+, the number of sequences is increased by 22% and the number of subjects by 27%. The target expression for each sequence is fully FACS coded; the emotion labels have been revised. In addition, non-positive sequences for different types of smiles and the associated metadata have been added.


JAFFE [159] Japanese Female Facial Expression dataset includes 213 annotated images of 7 facial expressions (6 basic expressions + neutral) that were posed by Japanese female models. Each image was rated by 60 Japanese annotators on 6 emotion classes. The images are in .tiff format with no compression (see also [159]). Semantic ratings on emotion adjectives, averaged over 60 subjects, are provided in a text file. The JAFFE images may be used for non-commercial scientific research.


BU-4DFE [160] is a 3D facial expression dataset comprising 606 3D facial expression sequences posed by 101 persons of different ethnic origins. For each person, there are six model sequences showing the six prototypic facial expressions (anger, disgust, happiness, fear, sadness, and surprise). Each sequence consists of about 100 frames, with a resolution of about 1040 × 1329 pixels per frame. BU-4DFE is an extension of the BU-3DFE [161] dataset (3D + time).


MMI [162] has been created as a resource for building and evaluating facial expression recognition algorithms. It comprises over 2900 videos and high-resolution images of 75 subjects. Action units (AUs) in the videos were fully annotated and partially coded at frame level.

NVIE [163] Natural Visible and Infrared Facial Expressions dataset contains both spontaneous and posed expressions of more than 100 subjects. The images were taken synchronously with a visible and an infrared thermal imaging camera, with illumination from three different angles. The dataset also allows a statistical analysis of the relationship between face temperature and emotions.


EMOTIC [164] EMOTions In Context is a database of images with people in real environments, annotated with their apparent emotions. The images are annotated with an extended list of 26 emotion categories combined with the three common dimensions (valence, arousal, and dominance). The dataset contains 23,571 images and 34,320 annotated people.


Affective-MIT [165] is a labeled dataset of spontaneous facial responses recorded in natural settings over the Internet: online viewers watched one of three intentionally amusing Super Bowl commercials and were simultaneously filmed using their webcam. They answered three self-report questions about their experience. The dataset consists of 242 facial videos (168,359 frames).


Affectiva [166] is described as the largest emotion dataset, growing to nearly 6 million faces analyzed in 75 countries and representing about 2 billion face frames analyzed. Affectiva includes spontaneous emotional responses of consumers while doing a variety of activities. The dataset consists of viewers watching media content (ads, movie clips, TV shows, and viral campaigns online). The dataset has been expanded to include other contexts such as videos of people driving cars, people in conversational interactions, and animated GIFs.


DISFA [167] Denver Intensity of Spontaneous Facial Action contains high-resolution stereo videos (1024 × 768) of 27 people (12 women and 15 men) that capture the spontaneous (non-posed) emotions of the persons while watching video clips. Each recorded frame was manually coded for the presence, absence, and intensity of facial action units according to the Facial Action Coding System (FACS). An extension, DISFA+, also comprises posed facial expression data, more detailed annotations, and metadata in the form of facial landmark points (in addition to the self-report of each individual regarding every posed facial expression).


LIRIS-ACCEDE [168] comprises 9800 video segments with a large content diversity. Affective annotations along the valence and arousal axes were achieved using crowdsourcing through a pair-wise video comparison protocol. The videos were selected from 160 diversified movies.


FABO [169] Bimodal Face and Body Gesture contains 1900 videos of face and body expressions recorded simultaneously by two cameras. The dataset combines facial cues and body in an organized bimodal manner.


Kinect FaceDB [170] includes facial images of 52 persons acquired by Kinect sensors. The data were captured in different time periods involving 9 different facial expressions under different conditions: neutral, smile, open mouth, left profile, right profile, occlusion eyes, occlusion mouth, occlusion paper, and light on.


YouTube emotion datasets [171] contain 1101 videos annotated with 8 basic emotions using Plutchik’s Wheel of Emotions [54]. The research effort was focused on recognizing emotion-related semantics.

Table 8 Emotion datasets

Multimodal Emotion Datasets

Multimodal datasets combine video data with synchronously recorded physiological signals. The following examples are also listed at the bottom of Table 8.


MAHNOB-HCI [172] consists of multimodal recordings of participants in their response to excerpts from movies, images, and videos. The modalities include multicamera video of face, head, speech, eye gaze, pupil size, ECG, GSR, respiration amplitude, and skin temperature. The recordings for all excerpts were annotated by the 27 participants immediately after each excerpt using a form asking five questions about their own emotive state through self-assessment manikins (SAMs) [32]. A precise synchronization permits researchers to study the simultaneous emotional responses using different channels.


DEAP [173] Database for Emotion Analysis using Physiological Signals is a multimodal dataset combining face videos with electroencephalogram (EEG) and peripheral physiological signals. 32 participants were recorded while watching 40 1-min-long excerpts of music videos. The participants rated each video in terms of the levels of arousal, valence, like/dislike, dominance, and familiarity.


RECOLA [174] Remote Collaborative and Affective Interaction dataset consists of audio, visual, and physiological signal (ECG and EDA) recordings. Video conference interactions between 46 French-speaking participants while solving a cooperative task were recorded synchronously. The emotions expressed by the participants were self-reported using the Self-Assessment Manikin (SAM) [32].


EMDB [175], the Emotional Movie Database, consists of 52 affective movie clips from different emotional categories without auditory content. Recorded signals are skin conductance level (SCL) and heart rate (HR). Subjective annotation scores given by the participants were arousal, valence, and dominance (all on a scale from 1 to 9).


MMSE [176] Multimodal Spontaneous Emotion Database. The data were captured from different sensors and include 3D models, 2D videos, thermal images, facial expressions, and FACS codes. Physiological signals such as heart rate, blood pressure, electrical skin conductivity, and respiration rate were also recorded. The dataset was captured from 140 individuals of various nationalities. Ten emotions are recorded per person, including surprise, sadness, fear, anger, disgust, happiness, embarrassment, startle, sceptical, and pain.


DECAF [177] Multimodal Dataset for Decoding Affective Physiological Responses combines brain signals collected with a magnetoencephalography (MEG) sensor with explicit and implicit emotional responses of 30 participants to 40 one-minute music-video segments (used in DEAP) and 36 movie clips. This allows comparisons between the EEG and MEG modalities as well as between movie and music stimuli for affect recognition. The recorded signals are MEG data, horizontal electrooculogram (hEOG), electrocardiogram (ECG), electromyogram of the trapezius muscle (tEMG), and near-infrared face video.


emoF-BVP [178] is a multimodal dataset of face, body gesture, voice, and physiological signal recordings. It consists of audio and video sequences of actors displaying three intensities of expression for 23 different emotions, together with the corresponding physiological data.


DREAMER [179] is a multimodal database consisting of electroencephalogram (EEG) and electrocardiogram (ECG) signals together with 23 participants’ self-assessments of their emotion in terms of valence, arousal, and dominance. The signals were captured using portable, wearable, and wireless equipment during affect elicitation by means of audio-visual stimuli.


MEISD [180] is a large-scale, balanced Multimodal Multi-label Emotion, Intensity, and Sentiment Dialogue dataset collected from different TV series and provides textual, audio, and visual features. The dataset is annotated with the six basic emotions, extended by two further labels, acceptance and neutral; “acceptance” is taken from Plutchik’s wheel of emotions [54].


ASCERTAIN [181] contains emotional self-assessments (Arousal, Valence, Engagement, Liking, and Familiarity) from 58 users along with synchronously recorded electroencephalogram (EEG), electrocardiogram (ECG), galvanic skin response (GSR), and facial activity data recorded with commercially available sensors while watching affective movie clips. This multimodal database can be used to detect personality traits and emotional states via physiological responses.

Systems for Emotion Recognition

This chapter sketches a number of systems and Application Programming Interfaces (APIs) for emotion recognition, which are used in numerous areas such as health care, education, and entertainment. These systems are based on the various methods discussed above, namely face analysis, speech processing, physiological signals, recognition and analysis of emotional phrases in social media, body language, and gesture expressions. An overview of these systems is given in Table 9.

Text-Based Interfaces for Emotion Detection

IBM WatsonFootnote 8 is an analyzer of emotions in written text. It detects emotional tones, social tendencies and writing styles from simple texts of any length. Currently, the tool can analyze online texts such as tweets, online reviews, email messages, product reviews, or user texts for emotional content.


ToneAPIFootnote 9 was created for marketing people to evaluate (and potentially improve) the emotional impact of their advertising texts quantitatively and qualitatively. For this purpose, input texts are analyzed, compared with other texts from a corpus, and emotions and their intensity are derived. A total of 8 emotions are identified and their intensity is evaluated with a value between 1 and 100.


ReceptivitiFootnote 10 is a computational language psychology platform that aims at helping to understand the emotions, drives, and traits that affect human behavior. A set of algorithms uncovers signals from everyday human language, e.g., stress, depression, etc. The analysis is performed in real time without needing self-reports or surveys.


Synesketch [182] is an open-source tool for analyzing the emotional content of text sentences and transforming the emotional tone into visualizations: the text is rendered dynamically as animated visual patterns that reveal the underlying emotion.


EmoTxt [183] is an open-source toolkit for emotion detection from text. It was trained and tested on two large gold-standard datasets mined from Stack Overflow and Jira. It supports both emotion recognition from text and the training of custom emotion classification models.
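To illustrate the kind of supervised pipeline such text-based tools train (not EmoTxt’s actual feature set or classifier), here is a minimal sketch using TF-IDF features and a linear classifier on a toy corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative corpus; real toolkits train on thousands of labeled posts.
texts = ["This crash is driving me crazy",
         "Thanks, that fix works perfectly",
         "I'm worried this breaks the release",
         "Great answer, very helpful"]
labels = ["anger", "joy", "fear", "joy"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    LogisticRegression(max_iter=1000))
clf.fit(texts, labels)
print(clf.predict(["this bug makes me furious"]))  # e.g., ['anger'] on this toy model
```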

Audio-Based Interfaces for Emotion Detection


EMOSpeechFootnote 11 is an interface based on an end-to-end psychological model. It is designed to help automated call agents analyze recorded customer calls and then send real-time feedback to supervisors. It uses a three-dimensional emotion representation model and recognizes ten emotions from acoustic features in the voice.


VokaturiFootnote 12 was developed to detect the emotions “happy,” “sad,” “scared,” “angry,” or “neutral” from a speaker’s voice. The open-source version of the software selects between these five emotions with high accuracy, even when hearing the speaker for the first time, according to the manufacturer. The “plus” version is said to reach the performance level of a dedicated human listener.

Video-Based Interfaces for Emotion Detection


FaceReaderFootnote 13, created by Noldus [184], is a professional tool for automatic analysis of facial expressions. More than 10,000 manually annotated images were used to train the recognition components, which detect emotion, gaze direction, head orientation, and personal characteristics such as gender and age.


RealEyesFootnote 14 is a platform for emotion recognition using webcams [185], used to measure people’s feelings when they watch video content online. Computer vision and machine learning techniques are used to analyze signals from physiological sensors, voice, and posture.


CrowdEmotionFootnote 15 explores facial points in real-time video to detect time series of Ekman’s six basic emotions. The webcam tracks the eyes and where the viewer’s attention is directed, and applies facial coding to interpret the emotion.

Image-Based Interfaces for Emotion Detection


Face++Footnote 16 detects faces within images and returns high-precision face location rectangles. Each detected face can be stored for future use and analysis. A detected face is compared with stored faces, returning a confidence score with thresholds to evaluate the similarity. It can also determine whether a subject is smiling or not.


SkyBiometryFootnote 17 was created by a biometrics company to detect faces and emotions in photos, with a percentage rate for each emotion. The application also determines gender, smile, presence of eyeglasses and sunglasses, age, roll and yaw, and eye, nose, and mouth positions, and checks whether the lips are parted or sealed and the eyes open or closed.

Multimodal Interfaces for Emotion Detection


Cloud VisionFootnote 18 is a tool created by Google that understands faces, signs, landmarks, objects, and text, as well as emotions, by detecting facial features within an image. The Cloud platform takes an image as input and returns a likelihood estimate for each emotion for each face in that image.
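A minimal sketch of querying face-level emotion signals with the public Python client is shown below; it assumes valid credentials and a local image file, and note that the API reports likelihood buckets (e.g., VERY_LIKELY) rather than numeric percentages.

```python
from google.cloud import vision

# Requires authentication (GOOGLE_APPLICATION_CREDENTIALS) and the Vision API enabled.
client = vision.ImageAnnotatorClient()

with open("portrait.jpg", "rb") as f:      # hypothetical local image
    image = vision.Image(content=f.read())

response = client.face_detection(image=image)
for face in response.face_annotations:
    # Emotion signals come back as likelihood buckets, one set per detected face.
    print("joy:", face.joy_likelihood,
          "sorrow:", face.sorrow_likelihood,
          "anger:", face.anger_likelihood,
          "surprise:", face.surprise_likelihood)
```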


Microsoft cognitive servicesFootnote 19 (formerly Microsoft Project Oxford) is a set of tools that make it possible for a computer to identify emotions in photographs using facial recognition technology. It detects the emotional depth of each face using the core seven emotional states plus “neutral”. Each scanned face is bounded with a box and assigned a score between zero and one per emotion, where zero corresponds to a complete absence of the emotion in question and one to a strong emotional response.
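Independent of the exact response format, per-face scores in [0, 1] are typically reduced to a dominant label in post-processing. The sketch below assumes a simple score dictionary and an illustrative neutral threshold; it is not tied to any particular service’s API.

```python
def dominant_emotion(scores, neutral_floor=0.5):
    """Pick the highest-scoring emotion for one face.

    `scores` is assumed to be a mapping like {'anger': 0.01, ..., 'neutral': 0.2}
    with values in [0, 1]; the exact response format depends on the service."""
    label, value = max(scores.items(), key=lambda kv: kv[1])
    # Treat weak responses as neutral rather than over-interpreting them.
    return label if value >= neutral_floor or label == "neutral" else "neutral"

face_scores = {"anger": 0.02, "contempt": 0.01, "disgust": 0.01, "fear": 0.03,
               "happiness": 0.81, "sadness": 0.02, "surprise": 0.05, "neutral": 0.05}
print(dominant_emotion(face_scores))  # 'happiness'
```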


CLMtrackrFootnote 20 is an open-source, free-to-use JavaScript library for fitting facial models to faces in videos or images and for precise tracking of facial features [186]. The library recognizes four emotional states: angry, sad, surprised, and happy.

KairosFootnote 21 integrates face detection with demographic attributes (age, gender, and ethnicity). It detects emotion from faces in real time as well as ethnicity, reflecting the diversity of human faces.


Amazon RekognitionFootnote 22 is used to identify objects, people, text, scenes, and activities, as well as emotions. The tool accepts two input sources: images and video. The confidence of each determination ranges from zero (minimum) to 100 (maximum). The application is based on the same deep-learning technology developed by Amazon’s computer vision teams to analyze billions of images and videos daily.


SightcorpFootnote 23 Platform provides face analysis and face recognition software using computer vision and deep learning techniques. It allows for emotion recognition, age detection, gender detection, attention time, and eye gaze tracking in images, videos, and real-life environments.


SHOREFootnote 24 is used to detect the emotion, age, and gender of a person from a standard webcam. The special feature of this tool is its ability to analyze and recognize the respective emotion from a video input with multiple faces simultaneously.


nVisoFootnote 25 analyzes real-time emotions from facial expressions in video using 3D facial imaging technology. nViso can monitor many different facial data points to produce likelihoods for main emotion categories.


The iMotionsFootnote 26 Facial Expression Analysis (FEA) module provides 20 facial expression measures, seven core emotions (joy, anger, fear, disgust, contempt, sadness, and surprise), facial landmarks, and behavioral indices such as head orientation and attention. These output measures are assigned probability values to represent the likelihood that the expected emotion will be expressed. Summary values for engagement and valence are also provided.


AffectivaFootnote 27 is designed to detect facial cues and physiological responses associated with emotions. It tracks a person’s heart rate from the face via webcam, without any worn sensors, by detecting the subtle color changes in the face that occur with every heartbeat.

Table 9 Current emotion interfaces

Information Systems (IS) Exploiting Emotion Data

Emotion recognition applications such as those mentioned in the previous chapter are used in various fields and embedded in domain-specific information systems. Examples of such domains include medicine, e-learning, human resources, marketing, entertainment, and automotive; these are briefly outlined below. Table 10 lists some such systems, organized by application domain.


Health/Medical: Stress and various psychological problems require proper psychometric analysis of the patient. A healthcare system that focuses on emotional aspects may improve the quality of life. Such systems automatically monitor both the environment and the person to provide help and services. Also, coupling emotion recognition systems with activity recognition systems [187, 188] can generate important insights for supporting older people in the AAL context.


E-learning: Student emotions play an important role in any learning environment, whether in a classroom or in e-learning. In face-to-face classes, it is easier to observe student behavior because the instructor interacts with students directly in the same environment. In the virtual classroom, this is more difficult, especially since course participants often turn off their cameras. This is where techniques based on multimodal recognition come in to assist the lecturer.


Hiring/interview: Companies have begun to integrate emotion recognition technology into hiring processes to capture candidates’ and employees’ emotions. Employers argue that the technology can help eliminate hiring biases and reject candidates with undesirable traits. During interviews, employers can track candidates’ emotional responses to each question to gauge how honest the applicant is. In addition, analyzing employee stress levels can provide insight into productivity and career success. The discussion of whether such an approach can be ethically justified has to take place elsewhere.


Entertainment: Modern computer games use information about the player’s emotions and emotional reactions to dynamically adjust the game’s difficulty level and audio-visual features. Monitoring is done with multimedia tools, for example in video games: this involves measuring player behavior and emotional states to improve player engagement, challenge, immersion, and excitement, in addition to adjusting game features. In addition, emotion recognition is also used to evaluate the success of a game according to the experience of the player interacting with the game.


HCI/Robotics: Human–computer interaction, or the interaction between information technologies and humans, has become similar to human–human interaction. Detecting the emotional state during human–computer interaction aims to make this interaction more comfortable, effective, and easy. Similarly, the integration of emotions in robots is being researched to make robots more “social” and “human-like”.


Marketing: Marketing has moved toward studying consumer attitudes and measuring the factors that influence consumer decisions. Emotions have become an important aspect of this: they help companies better understand the opinions expressed about a service or product.


Automotive industry: In the automotive industry, emotional information can be used to respond appropriately to recognized driver emotions to enhance safety and support the driving experience, e.g., by suggesting useful information or a conversation.

Table 10 Exploiting emotion recognition systems in different domains

Summary

As shown in the previous sections, there are a variety of methods and tools that can be used to model, analyze, and predict human emotions. We have reviewed the relevant literature on this as far as it was available to us. The extensive bibliography substantiates this.

Overall, we can summarize the contribution of our study as follows:

  • It combines emotion models, languages, ontologies, datasets, and systems.

  • It provides a comprehensive and systematic overview of the field of human emotion recognition, organized into tables to give the reader an easy and coherent way to find the data they need.

  • It provides a comprehensive benchmark for users who want to create human support in a specific environment by exploiting the available semantic and contextual information; there is no need to reinvent the wheel.

  • It can serve as a starting point for anyone studying human emotions, especially budding researchers.

  • It introduces a resource for many applications in various fields, for instance, computer science, psychology, machine learning, human–computer interaction, e-learning, information systems, and cognitive science.

It has been noted that there are numerous trends in emotion research with different goals and models. Facial expressions tend to be the most promising approach to emotion measurement, but they are easier to fake in different situations than other recognition methods (e.g., voice or biosignals).

Emotion recognition accuracy can be increased by combining multiple modalities with information about the context, preferences, and situation of the observed person. Multimodal emotion recognition has been shown to be superior to unimodal recognition, and several multimodal benchmark datasets have recently been made available. The next generation of recognition tools will capture emotions from multiple input devices, drawing on both recent technological advances and integration methods.
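As a simple example of such integration, decision-level (late) fusion averages per-modality class probabilities. The sketch below uses illustrative modalities, weights, and class labels; it is a minimal baseline, not a description of any specific system discussed above.

```python
import numpy as np

def late_fusion(prob_by_modality, weights=None):
    """Weighted average of per-modality class-probability vectors.

    `prob_by_modality` maps a modality name to an array of class probabilities
    (same class order for every modality). Returns the fused distribution."""
    names = list(prob_by_modality)
    probs = np.stack([prob_by_modality[n] for n in names])
    w = np.ones(len(names)) if weights is None else np.asarray([weights[n] for n in names])
    return (w[:, None] * probs).sum(axis=0) / w.sum()

classes = ["anger", "happiness", "neutral", "sadness"]
fused = late_fusion({
    "face":  np.array([0.10, 0.70, 0.15, 0.05]),
    "voice": np.array([0.20, 0.40, 0.30, 0.10]),
    "gsr":   np.array([0.25, 0.35, 0.25, 0.15]),
}, weights={"face": 0.5, "voice": 0.3, "gsr": 0.2})
print(classes[int(np.argmax(fused))])  # 'happiness'
```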

The context in which emotions are experienced is another important topic currently being discussed by researchers. Some work has implemented ontology-based methods to better interpret contextual human emotion manifestation [96–99].

Other developments include the ability to recognize and interpret not only current physical activities and reactions, but also relevant mental states beyond basic emotions, such as shame and pride.

Open Research Questions

As an additional outcome of this literature review, we summarize a list of limitations and open research questions that we would like to share with the research community:

  • Current emotion representation languages such as [65, 66, 68,69,70,71] are XML extensions that do not provide more advanced knowledge representation or automated reasoning capabilities. None of these languages is general enough to cover all emotion vocabularies.

  • Known recognition frameworks (listed in “Datasets”) have a limited concept space by ignoring some real-world details. This limits their ability to reflect the user’s intentions in unpredictable situations.

  • So far, several emotion ontologies have been proposed that share many similarities in terms of their concepts, classes, and underlying psychological theory. For example, Ekman’s basic emotion theory [39] is adopted in most ontologies. However, there does not yet seem to be a universally shared ontology.

  • The majority of emotion recognition interfaces are based on recognizing and analyzing facial data (i.e., image or video data) to measure emotions. However, studies [217,218,219,220,221,222,223] have shown that integrating different modalities provides better accuracy than a single modality. Therefore, it stands to reason that future research will focus on multimodal emotion recognition [172,173,174,175, 177,178,179].

  • A dataset that integrates emotion recognition with context is lacking; such data could be useful to answer theoretical questions about the causes of emotional responses.

  • We identify the need for a conceptual human emotion framework, in particular, a domain-specific modeling language [224] that can describe human emotions in a comprehensive, reliable, and flexible way. The main challenge in developing such a language is the diversity of theoretical models and the complexity of human emotions. In addition, there is the influence of temporal structure and human context in interpreting emotions. For example, a person may express the same emotion in different situations depending on the context, sometimes with different intensity. We have presented a first approach to such a language, our “Human Emotion Modeling Language”, in [225].

Finally, we take the opportunity to thank the anonymous reviewers of this paper for their valuable comments.