Human Emotion: A Survey focusing on Languages, Ontologies, Datasets, and Systems

Emotions are an essential part of a person’s mental state and influence her/his behavior accordingly. Consequently, emotion recognition and assessment can play an important role in supporting people with ambient assistance systems or clinical treatments. Automation of human emotion recognition and emotion-aware recommender systems are therefore increasingly being researched. In this paper, we first consider the essential aspects of human emotional functioning from the perspective of cognitive psychology and, based on this, we analyze the state of the art in the whole field of work and research to which automated emotion recognition belongs. In this way, we want to complement the already published surveys, which usually refer to only one aspect, with an overall overview of the languages, ontologies, datasets, and systems/interfaces to be found in this area. We briefly introduce each of these subsections and discuss related approaches regarding methodology, technology, and publicly accessible artefacts. This comes with an update to recent findings that could not yet be taken into account in previous surveys. The paper is based on an extensive literature search and analysis, in which we also made a particular effort to locate relevant surveys and reviews. The paper closes with a summary of the results and an outlook on open research questions.


Introduction
Human emotion recognition refers to the process of discovering a person's state of mind and reactions that are associated with a specific event or situation. It is performed with sophisticated algorithms that, thanks to technological progress, have become increasingly efficient in tracking emotional changes. The research interest is motivated above all by promising applications in the areas of ambient assistance, decision support, and the prevention of emotional hazards [1]. According to a "Research and Markets" report, 1 the global market for emotion recognition is expected to reach about 65 billion USD by 2023, with a 39% annual growth rate between 2017 and 2023.
Algorithms for detecting emotional changes use different context information and modalities: Facial cues, speech variations, gestures, data from body or brain sensors, and more. Some approaches rely on the person's self-assessment to measure instant emotion. Each approach exhibits advantages and disadvantages, depending on the contexts or settings in which they are used.
This paper aims to provide a comprehensive overview of the current state of important aspects of machine recognition of human emotions to assist the interested community in developing and improving methods, techniques, and systems. In doing so, we not only add recent findings to the body of knowledge established by previously published reviews and original papers. Rather, we bring together different perspectives: (1) Systems that use emotional data, (2) the datasets they use, (3) description languages, and (4) domain-specific models and ontologies as the basis for recognition algorithms.
To this end, we have analyzed a large number of papers from various relevant areas such as emotion interpretation theories, modeling languages, ontologies, data sets, and output interfaces from the perspective of the use of emotion recognition by intelligent systems. In addition, we have tried to get an overview of the survey articles already existing in this area and to evaluate them to be able to present as comprehensive a compendium as possible with our paper.
The paper has the following structure: "Methodology" briefly describes the criteria we used to search literature sources and, in particular, previously published review articles. "Human Emotional Functioning" deals with human emotional functioning, the manifestations of human emotions, and approaches to their interpretation. "Languages and Ontologies" focuses on ontologies and domain-specific modeling approaches to describe emotions. "Datasets" gives an overview of the datasets used in the literature for emotion recognition, followed by a compilation of systems for emotion recognition "Systems for Emotion Recognition" and of systems that exploit emotion data "Information Systems (IS) Exploiting Emotion Data". The paper concludes with a summary of the results "Summary" and a brief outlook on open research questions "Open Research Questions".

Methodology
The comprehensive overview provided in this paper is based on findings from existing studies and literature available online (e.g., Google Scholar 2 ). Five aspects play a special role in the context of emotion recognition: models, languages, ontologies, datasets, and systems. In total, we selected 230 relevant literature sources on these aspects and referenced them in the bibliography. In contrast to our approach, which considers all five aspects, most published studies focus on one of these aspects or on one emotional modality like spoken language, video, image, etc. For example, study [2] focuses on speech datasets from very recent years, and [3,4] overview miscellaneous facial datasets based only on video, audio, or image input. [5,6] review different approaches for detecting emotion from text only.
To validate this finding, we systematically searched Google Scholar for articles that had relevant terms in their title (e.g., allintitle: emotion language survey). In detail, the search terms used were combinations of the phrases "emotion model," "emotion language," "emotion ontology," "emotion dataset," and "emotion information system," in conjunction with the keywords "survey" and "review," respectively, as shown in Fig. 1. The keyword "review" was selected additionally as it seems that authors use it more often than "survey". The quantitative results of these searches are shown in Fig. 2, which also contains the result of a search with all emotion aspects combined in one query: as expected, no such study could be found.

Human Emotional Functioning
Emotions are specific reactions to an experienced event [7]. In the domain of cognitive psychology, the term emotion refers to "specific sets of physiological and mental dispositions triggered by the brain in response to the perceived significance of a situation or object" [8]. Theories of cognitive psychology suggest that the individual interpretation of a situation or event influences emotions and behaviors [9][10][11]: guided by their beliefs, different people may interpret the same events differently. Psychologists study how people interpret and understand their worlds and how emotions prepare people for acting and regulating their social behavior [12]. Emotional expression is also an important part of emotional function, as human emotions can influence a person's physical reactions [13]. These emotional reactions are mediated by speech, facial expressions, body gestures, or physiological signals [14].
In this section, we present approaches to categorize, model, interpret, and understand emotions.

Categorizing the Manifestations of Emotion
The emotion felt by a person cannot be "grasped" physiologically in everyday life, so that one is dependent on the external manifestations in which emotions express themselves: Emotion controls many modes of human visible behavior like gestures, facial expression, postures, voice tone, respiration, and skin color. All this affects the way people interact with each other. Psychologists and engineers have conducted several studies to understand and categorize the manifestations of emotions.

Facial Expressions
Ekman and Friesen [15] conducted a study on the universality of facial expressions and classified them in relation to six basic emotions: anger, happiness, sadness, disgust, surprise, and fear. Based on this, they developed a taxonomy of facial muscle movements (the Facial Action Coding System, FACS) that is general enough to describe a person's basic emotional state through analyzing the relationships between points on the person's face [16,17]. Many studies use Ekman's results as the basis for the recognition task. A similar approach [18] describes facial expression as the result of so-called action units (AUs) that capture the possible movements of facial muscles. Such facial movements occur in most people and, in combination, can reflect certain emotions. Table 1 shows the connection between certain combinations of action units and the basic emotion they express. For instance, happiness is calculated as a combination of action unit 6 (cheek raiser) and 12 (lip corner puller) [19].
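The idea of mapping action-unit combinations to basic emotions can be sketched as a minimal lookup. Only the happiness prototype (AU6 + AU12) is taken from the text above; the other AU sets are common EMFACS-style prototypes included as illustrative assumptions:

```python
# Minimal sketch: infer basic emotions from detected Action Units (AUs).
# Only happiness (AU6 + AU12) is stated in the text; the remaining
# prototypes are assumed, EMFACS-style, for illustration only.
AU_PROTOTYPES = {
    "happiness": {6, 12},        # cheek raiser + lip corner puller
    "sadness":   {1, 4, 15},     # assumed prototype
    "surprise":  {1, 2, 5, 26},  # assumed prototype
    "anger":     {4, 5, 7, 23},  # assumed prototype
}

def infer_emotion(detected_aus):
    """Return the basic emotions whose AU prototype is fully present."""
    return [emo for emo, proto in AU_PROTOTYPES.items()
            if proto <= set(detected_aus)]

print(infer_emotion([6, 12, 25]))  # AU6 + AU12 present -> ['happiness']
```

Real FACS-based recognizers additionally weight AU intensities and handle overlapping prototypes; the superset test above is the simplest possible decision rule.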

Vocal Expressions
The speech signal contains both explicit (linguistic) information, i.e., the message presented, and implicit (paralinguistic) information such as references to the emotional state of the speaker. Acoustic speech signals are mainly generated by the vibration of the vocal cords, whereby the frequency determines the pitch of the tone. Further parameters of the speech signal are the intensity, the duration of spectral speech properties, the pitch contour, mel-frequency cepstral coefficients (MFCCs), the fundamental frequency, and voice quality [22]. The variation of the pitch and its intensity together form the prosody. Many features of speech signals are used to extract emotions. Table 2 outlines different emotional behaviors in relation to the common vocal parameters.
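As a small illustration of one such vocal parameter, the following sketch computes the frame-wise RMS intensity of a signal; the frame length of 160 samples (10 ms at a hypothetical 16 kHz sampling rate) is our own choice, and real systems would extract pitch, MFCCs, and further features alongside it:

```python
import math

def frame_rms(samples, frame_len=160):
    """Split the signal into frames and return the RMS intensity of each."""
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples), frame_len)]
    return [math.sqrt(sum(x * x for x in f) / len(f)) for f in frames if f]

# A loud frame followed by a quiet frame:
signal = [0.5] * 160 + [0.05] * 160
print(frame_rms(signal))  # -> [0.5, 0.05] (up to float rounding)
```

The resulting intensity contour is one of the inputs from which prosodic emotion cues (e.g., raised intensity in anger) are derived.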

Physiological Signals
In general, physiological signals are divided into two categories: (1) signals derived from peripheral nervous system phenomena such as heart activity (ECG), muscle activity (EMG), and skin conductance, and (2) signals derived from the central nervous system such as brain signals (EEG) [23]. Physiological signals can be collected via wearable sensors and evaluated for the classification and identification of emotions [24]. For emotion classification, signal features like frequency, amplitudes, minima, and maxima are analyzed. Popular approaches are Support Vector Machines (SVM) [25], Fisher linear discriminant projection [26], Canonical Correlation Analysis (CCA) [27], Artificial Neural Networks (ANN) [28], K-Nearest Neighbor (KNN) [29], the Adaptive Neuro-Fuzzy Inference System (ANFIS) [30], or the Bayesian network method [31]. There are several ways to derive emotions from physiological signals: (1) measure various parameters of the signal and compare the results to a Self-Assessment Manikin (SAM) questionnaire [32] (see Fig. 3); (2) estimate emotion based on facial expressions and overall impressions by psychological experts; (3) correlate the results to a gold standard, such as facial recognition or EEG; (4) compare the results with a well-known emotion-eliciting dataset, such as the International Affective Picture System (IAPS) dataset [33], to generate comparable results [34].
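Of the classifiers listed above, KNN is the simplest to sketch. The toy example below classifies a hypothetical two-feature vector (mean heart rate, skin conductance amplitude) by majority vote among its nearest training samples; the training data and the "calm"/"aroused" labels are invented for illustration:

```python
def knn(train, query, k=3):
    """Classify `query` by majority vote among the k nearest training points."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count)

# Invented feature vectors: [mean heart rate, skin conductance amplitude]
train = [([72, 0.10], "calm"),    ([75, 0.20], "calm"),
         ([70, 0.15], "calm"),    ([95, 0.80], "aroused"),
         ([98, 0.90], "aroused"), ([92, 0.70], "aroused")]

print(knn(train, [96, 0.85]))  # -> aroused
```

In practice, features on different scales (beats per minute vs. microsiemens) would be normalized before computing distances.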

Body Gesture
Gestures (except sign language) are a form of non-verbal interaction in which a human moves a certain part of the body, especially the hands or head. This movement is used to convey a message and additional information such as human emotions [35]. Figure 4 depicts emotion expressions that are associated with body pose and motion. In addition, one can deduce emotion parameters from measured movement values such as the speed, amplitude, and time expenditure of body parts involved in the various gesture phases (preparation, stroke, and relaxation). Table 3 lists frequent arm movements that express certain emotions.

Multimodal Emotion
The aforementioned methods of identifying emotions can also be combined, in which case we speak of a multimodal approach to emotion recognition. Fusing modalities may increase the performance of systems for emotion recognition [14]. For example, the integration of facial expressions and speech signals leads to a new "audiovisual" signal [37], and the combined evaluation of audio-visual and physiological signals can further reduce the error rate of emotion recognition.

Emotion Theories
In the theoretical discussion, different views on emotions are represented, which are reflected in several theories and descriptive models. A simple approach consists in relating "emotion" to phenomena like anger, fear, or happiness. More sophisticated approaches describe emotions in a multidimensional space with an unlimited number of categories. Brosch et al. [38] distinguish four main directions: the Ekman basic emotion theory [39], the appraisal theory [40], dimensional theories [41][42][43], and the constructivist theory of emotion [44]. Combining theories leads to further refinements. For instance, [45] combines appraisal and dimensional theory, and [46] claim that there are two main views for emotion classification, namely, basic and dimensional.

Basic Emotion Theory
Ekman [39] assumes emotions to be discrete and related to a fixed number of neural and physically represented states. As already mentioned in "Facial Expressions", he proposes a division into six basic emotion categories. Barrett [47] extends this view by associating each emotion with a specific and unmistakable set of bodily and facial responses. An emotion is thus represented by a characteristic syndrome of hormonal, muscular, and autonomic responses that are coordinated in time and correlated in intensity [48]. For example, each emotion comes with different, distinguishable facial movements.

Appraisal Theory
Appraisal theory defines emotions as processes and not as states [49]. It assumes that the interpretation (appraisal, assessment) of a given situation causes a specific emotional reaction in the interpreting person [40]. Consequently, not the situation per se, but the individual assessment of that situation causes the type and intensity of emotional reaction.

Two-Dimensional Theory
The central assumption of dimensional theories is that emotional states can be represented by variation across certain dimensions [42,43,50]. In the two-dimensional case, the dimensions valence and arousal are considered (see Fig. 5). In this way, all emotions can be classified in the arousal-valence coordinate system.
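A minimal sketch of placing an emotion in the arousal-valence coordinate system follows; the example emotions named per quadrant are commonly used illustrations, not a fixed standard:

```python
def quadrant(valence, arousal):
    """Map a (valence, arousal) pair to its quadrant in the 2-D model.
    Quadrant labels and example emotions are illustrative assumptions."""
    if valence >= 0 and arousal >= 0:
        return "high-arousal positive (e.g., excited)"
    if valence < 0 and arousal >= 0:
        return "high-arousal negative (e.g., angry)"
    if valence < 0:
        return "low-arousal negative (e.g., sad)"
    return "low-arousal positive (e.g., relaxed)"

print(quadrant(0.7, 0.6))    # high-arousal positive
print(quadrant(-0.5, -0.4))  # low-arousal negative
```

Dimensional recognizers typically output continuous (valence, arousal) scores; the quadrant lookup is only the coarsest possible discretization of that space.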

PAD Three-Dimensional Theory
Reference [41] introduces a third dimension and focuses on "Pleasure", "Arousal", and "Dominance" (PAD three-dimensional model; see Fig. 6). Pleasure expresses the extent to which the person perceives the situation as enjoyable or not; arousal represents the extent to which the situation stimulates the person; dominance describes the extent to which the person is able to control her/his emotional state in the given situation [51]. The PAD model has been used to represent emotions for non-verbal interaction such as body language [52], or to represent the emotions of animated characters in virtual worlds [53].
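One common use of the PAD model is to label a measured (pleasure, arousal, dominance) triple with the nearest reference emotion. The sketch below does this with rough reference coordinates that we assume for illustration; published PAD tables differ in detail:

```python
# Illustrative PAD reference points (pleasure, arousal, dominance).
# The coordinates are rough assumptions, not values from a published table.
REFERENCE = {
    "happy":  ( 0.8,  0.5,  0.4),
    "angry":  (-0.5,  0.6,  0.3),
    "afraid": (-0.6,  0.6, -0.4),
    "bored":  (-0.4, -0.6, -0.2),
}

def nearest_emotion(p, a, d):
    """Return the reference emotion closest to the PAD triple (p, a, d)."""
    dist = lambda ref: (ref[0] - p) ** 2 + (ref[1] - a) ** 2 + (ref[2] - d) ** 2
    return min(REFERENCE, key=lambda name: dist(REFERENCE[name]))

print(nearest_emotion(0.7, 0.4, 0.5))     # -> happy
print(nearest_emotion(-0.5, -0.5, -0.3))  # -> bored
```

The dominance axis is what separates, e.g., "angry" (high dominance) from "afraid" (low dominance) despite similar pleasure and arousal values.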

Plutchik's Wheel of Emotions
Robert Plutchik [54] assumes eight basic emotions, namely joy, trust, fear, surprise, sadness, disgust, anger, and anticipation, the combination of which results in more complex ones. In his "wheel" model (see Fig. 7), these basic emotions form the "spokes" of a wheel and are dyed in different colors from the color spectrum. Related emotions are housed on the same spoke and are gradually colored with the spoke color according to their intensity (arousal). For example, joy is organized with ecstasy (higher arousal) and serenity (lower arousal) on the same (yellow) spoke. Opposing emotions are arranged on opposite spokes, e.g., joy versus sadness, anger versus fear, trust versus disgust, and surprise versus anticipation [54]. More complex emotions can be represented as combinations of basic emotions, similar to the way primary colors are combined. The wheel rims indicate such combinations. For example, joy and trust combine to love, and boredom and annoyance combine to contempt.
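Plutchik-style combination can be sketched as a symmetric lookup of emotion pairs ("dyads"). The joy + trust = love and boredom + annoyance = contempt pairs come from the text above; the remaining entries are commonly cited dyads of adjacent wheel emotions, included as assumptions:

```python
# Plutchik-style dyads: unordered pairs of emotions that combine into a
# more complex one. The first two pairs are from the text; the rest are
# commonly cited combinations, listed here for illustration.
DYADS = {
    frozenset({"joy", "trust"}):         "love",
    frozenset({"boredom", "annoyance"}): "contempt",
    frozenset({"joy", "anticipation"}):  "optimism",
    frozenset({"fear", "surprise"}):     "awe",
    frozenset({"sadness", "disgust"}):   "remorse",
}

def combine(e1, e2):
    """Look up the named dyad for an unordered emotion pair, if any."""
    return DYADS.get(frozenset({e1, e2}), "no named dyad")

print(combine("trust", "joy"))  # -> love
```

Using `frozenset` keys makes the lookup order-independent, mirroring the fact that the wheel defines combinations of pairs, not ordered sequences.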

OCC Model of Emotion
Ortony, Clore, & Collins (OCC) [55] assume that emotions arise from assessing observations about events, agents, and objects. Events are manifestations that occur at a certain point in time; agents can be people, animals, machines and similar, or abstract entities such as institutions. Several studies employed the OCC model to reason about emotion or generate emotions in artificial characters. The OCC model classifies emotions into 22 categories (see Fig. 8).
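The event/agent/object branching of the OCC model can be sketched as a toy appraisal rule. Only six of the 22 OCC categories appear below, and collapsing each appraisal to a single positive/negative flag is a deliberate simplification for illustration:

```python
def occ_appraise(kind, positive):
    """Toy OCC-style appraisal: the resulting emotion depends on whether
    an event, an agent's action, or an object is appraised, and on the
    valence of that appraisal. Simplified to 6 of the 22 OCC categories."""
    table = {
        ("event", True):   "joy",
        ("event", False):  "distress",
        ("agent", True):   "admiration",
        ("agent", False):  "reproach",
        ("object", True):  "love",
        ("object", False): "hate",
    }
    return table[(kind, positive)]

print(occ_appraise("event", True))   # -> joy
print(occ_appraise("agent", False))  # -> reproach
```

The full model refines these branches further (e.g., by whether an event affects oneself or another, and by prospect and confirmation), which is what yields the 22 categories of Fig. 8.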

Construction Theory
Humans build up their own knowledge of the world based on their experiences. The constructivist model therefore assumes that emotions are psychological constructions built from more basic psychological components [44].

Languages and Ontologies
For dealing with emotions in computerized applications, suitable descriptive languages are needed, i.e., languages that allow a representation of, e.g., the models described above. Although universal modeling languages could, of course, be used here, as in all areas, first efforts to develop special domain-specific languages for the field of emotion modeling can be observed. However, these are mainly simply structured markup languages, although the emergence of powerful meta-modeling platforms [41,43,[50][51][52] would allow for the definition and use of comprehensive domain-specific conceptual modeling languages [53]. Domain-specific concepts come with various rules, constraints, and semantics [56]. Ontologies are used for their semantic foundation and formal definition [57,58]. References [59,60] investigate the integration of ontologies with domain-specific languages at the meta-model level and automated reasoning processes. [61] discusses the use of the formal semantics of the Web Ontology Language (OWL) 3 together with reasoning services for addressing constraint definition, suggestions, and debugging. The ontology-based approach presented in [62] allows for integrating different domains and reasoning services. In [63], the authors propose a "User Story Mapping-Based Method" for extracting knowledge from the relevant domain by applying a formal guideline. The authors demonstrate how to establish a full semantic model of a particular domain using ontology concepts.

Languages
Standardizing emotion representations by a closed set of emotion denominators is perceived as being too restrictive. On the other hand, leaving the choice of emotion annotation completely unlimited is considered inappropriate [64]. Consequently, there seems to be no specific standard language that covers all aspects of emotions as they appear in the approaches and theories described above. However, markup languages have been presented that provide a set of syntactic and semantic description concepts and thus satisfy the demands of some researchers. This section describes popular markup languages and the purpose behind their development.

(EmotionML) Emotion Markup Language [65] has been presented by the W3C Consortium to allow for (1) manual annotation of material involving emotionality, (2) automatic recognition of emotions from sensor data, speech recordings, etc., and (3) generation of emotion-related system responses, which may involve reasoning about the emotional aspects, e.g., in interactive systems. 4 Due to the lack of generally agreed descriptors, EmotionML does not come with a fixed emotion vocabulary, but proposes possible structural elements and allows users to "plug in" their own favorite vocabulary. Concerning the basic structural elements, it is assumed that emotions can be described in terms of categories or a small number of dimensions, and that emotions involve triggers, appraisals, feelings, expressive behavior, and action tendencies [65]. Consequently, EmotionML is a quite general-purpose XML-based language that implicitly realizes Ekman's basic emotion theory [39] and dimensional theories [41].

(EARL) Emotion Annotation and Representation Language [66] is an XML-based language designed specifically for representing emotions in technological contexts, with a focus on emotion annotation, recognition, and generation. EARL can represent emotions as categories, dimensions, or sets of appraisal scales [67].
EARL can be utilized for manual emotion annotation and for generating affective systems such as speech synthesizers or embodied conversational agents (ECAs) [68].
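The pluggable-vocabulary structure of EmotionML can be sketched with the Python standard library. The namespace below is the one defined in the W3C recommendation, while the concrete category name and value are a hypothetical annotation:

```python
import xml.etree.ElementTree as ET

# Minimal EmotionML sketch: an <emotionml> root containing one <emotion>
# with a plugged-in category annotation. Category name/value are examples.
NS = "http://www.w3.org/2009/10/emotionml"
ET.register_namespace("", NS)

root = ET.Element(f"{{{NS}}}emotionml")
emotion = ET.SubElement(root, f"{{{NS}}}emotion")
ET.SubElement(emotion, f"{{{NS}}}category",
              {"name": "happiness", "value": "0.8"})

print(ET.tostring(root, encoding="unicode"))
```

A real document would additionally declare the vocabulary in use (e.g., via a `category-set` attribute referencing an emotion vocabulary), which is exactly the "plug in your own vocabulary" mechanism described above.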
(EMMA) Extensible MultiModal Annotation [69] is a markup language intended for representing multimodal user inputs (e.g., speech, pen, keystroke, or gesture) in a standardized way for further processing. As such, it was designed for use in so-called Multimodal Interaction Frameworks (MMI) to resolve uncertainty in the interpretation of user input. EMMA distinguishes between instance data (contained within an EMMA interpretation of an input or output) and the data model, which is optionally specified as an annotation of that instance. Multiple interpretations of a single input or output are possible. The EMMA structural syntax is quite simple and provides elements for the organization of interpretations and instances, like the root (a container with version and namespace information), the interpretation element, containers, and literal elements. EMMA markup is intended to be generated automatically by components such as speech recognizers, semantic interpreters, and interaction managers. It concentrates on single inputs/outputs (e.g., a single natural-language utterance). EMMA may be used as a "host language" for plugging in EmotionML to cover emotion interpretation, such that all EmotionML information is encoded in element and attribute structures; an example 5 is the analysis of a non-verbal vocalization where the emotion is described as a low-intensity state, possibly "boredom".

(VHML) Virtual Human Markup Language [70] is an XML-based markup language that is intended for use in computer animation of human bodies and facial expressions (e.g., the control of interactive "Talking Heads"). It was created to address various issues of Human-Computer Interaction (HCI) such as facial or body animation, text-to-speech production, dialogue manager interaction, emotional representation, plus hyper- and multimedia information.
6 For example, a "virtual human" who delivers bad news to the user ("I'm sorry, I can't find that file") may speak with a sad voice, a sorry face, and a bowed body gesture. To produce the required vocal, facial, and emotional response, tags such as <smile>, <anger>, and <surprised> have been defined to make the Virtual Human believable and more realistic.
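The use of EMMA as a "host language" for EmotionML, described above, can be sketched as follows. The namespaces follow the respective W3C specifications; the interpretation id and the low-value "boredom" category are our reconstruction of the kind of low-intensity example the EMMA/EmotionML documentation discusses:

```python
import xml.etree.ElementTree as ET

# EmotionML "plugged into" an EMMA interpretation: the emotion annotation
# travels as instance data inside <emma:interpretation>. The id and the
# category value (low value ~ low-intensity state) are illustrative.
EMMA = "http://www.w3.org/2003/04/emma"
EMO = "http://www.w3.org/2009/10/emotionml"
ET.register_namespace("emma", EMMA)
ET.register_namespace("emo", EMO)

root = ET.Element(f"{{{EMMA}}}emma", {"version": "1.0"})
interp = ET.SubElement(root, f"{{{EMMA}}}interpretation", {"id": "int1"})
emotion = ET.SubElement(interp, f"{{{EMO}}}emotion")
ET.SubElement(emotion, f"{{{EMO}}}category",
              {"name": "boredom", "value": "0.1"})

print(ET.tostring(root, encoding="unicode"))
```

Because both vocabularies are namespaced XML, an EMMA consumer that does not understand EmotionML can still process the interpretation and pass the embedded emotion markup through untouched.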
(SSML) Speech Synthesis Markup Language [71] has been produced by the W3C Voice Browser Working Group to provide a rich, XML-based markup language for assisting the generation of synthetic speech in Web and other applications. 7 The essential role of the markup language is to give authors of synthesizable content a standard way to control aspects of speech such as pronunciation, volume, pitch, rate, etc. across different synthesis-capable platforms. Popular voice assistants (Google Assistant, Alexa, and Cortana) are known to use SSML. SSML adds markup elements (or tags) to input text to control the construction of speech waveforms, improving the quality of synthesized content and making it sound more natural. For instance, the tag <amazon:emotion name="excited" intensity="medium"> tells Alexa to speak a string (e.g., "Three seconds till lift off") in an "excited" voice. Speech synthesis can be beneficial in many text-to-speech applications, e.g., reading for the blind, supporting the handicapped, remote email access, and proofreading.
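A minimal SSML document can be sketched directly as XML. The `<prosody>` element and its `rate`, `pitch`, and `volume` attributes are standard SSML; mapping "slow/low/soft" to a sad-sounding voice is our illustrative choice (vendor extensions like `<amazon:emotion>` would additionally require the vendor's namespace):

```python
import xml.etree.ElementTree as ET

# A small SSML fragment: prosody settings that approximate a sad voice
# for the VHML example sentence used earlier in this section.
ssml = """
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis">
  <prosody rate="slow" pitch="low" volume="soft">
    I'm sorry, I can't find that file.
  </prosody>
</speak>
"""

doc = ET.fromstring(ssml)  # well-formedness check
prosody = doc.find("{http://www.w3.org/2001/10/synthesis}prosody")
print(prosody.get("rate"))  # -> slow
```

Any synthesis-capable platform that accepts SSML should honor these prosody hints, which is precisely the cross-platform portability the standard aims at.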

Emotion Ontologies
In the context of information systems, an ontology defines a set of conceptualizations and representational primitives with which to model a domain of knowledge or discourse [72]. In general, ontologies use primitives like categories, properties, and relationships between categories for organizing information into knowledge, using machine-readable statements so as to be computerized, shareable, and extensible. No universal ontology has been proposed for the domain of human emotions so far, but several approaches exist for conceptualizing particular important aspects of human emotions. In this section, we will classify these ontologies in terms of: name, goal, key concept, and underlying emotion theory (according to "Emotion Theories"). As there are many approaches in each class, we present these in table form.

Text Ontologies
Text is a powerful means to interact and transfer information as well as to express emotion. When dealing with text ontologies, it has to be kept in mind that in modern social software systems (like Facebook, Twitter, etc.), abbreviations of all kinds as well as so-called emoticons, emojis, etc. are used besides classical language-based text. Thus, establishing such an ontology is challenging. The real meaning of a text, possibly including such abbreviations, depends on the combination of all text components and the particular situation. A given text, therefore, may be ambiguous and lead to different ontological interpretations if the related situation (the context) is not considered. Table 4 summarizes ontologies that may be used for emotion inference from texts.

Ontologies Conceptualizing Facial Cues, Speech, and Body Expressions
Facial expression synthesis has acquired much interest with the MPEG-4 standard [85]. The MPEG-4 framework is used for modeling facial animation based on facial expressions that implicitly reflect emotion. This can be beneficial in several domains such as psychology, animation control, or healthcare, particularly in the field of patient-computer interaction. An appropriate ontological conceptualization may also support the recognition of emotions from speech (e.g., study [86]). The same is true for body movements, which, however, are more context-dependent and thus harder to synthesize in comparison. [87] employed a Virtual Human (VH) to increase the realism and reliability of body expressions. Table 5 summarizes the existing emotion ontologies conceptualizing facial cues, speech, and body expressions.

Context-Awareness Ontologies
In our social environment, context has a significant impact on human emotions. The relationship between an emotion and its cause can be further understood when investigating the context. Based on the observed contextual information, the situations that trigger an emotional state can be better understood. Therefore, appropriate contextual information is required, such as place, time, things in the environment, etc. Today, context-awareness has been introduced as a key feature in emotion recognition projects and has demonstrated its relevance and effectiveness. Table 6 summarizes emotion ontologies based on contextual information.

High-Level Ontologies
High-level (also general, upper, or top) ontologies are very generic and, in general, are defined to provide ontological foundations about the kinds of things that exist. They come with concepts like object, process, etc. The existing upper emotion ontologies supply the most significant shared concepts for representing human emotions. Developers can extend an upper-level ontology by defining lower-level concepts (as specializations of higher-level concepts) according to the purpose of the development. Table 7 summarizes the known high-level emotion ontologies.

Datasets
Datasets may help to accelerate work progress by providing a benchmark resource for analyzing and comparing system performance before tackling a system in real-life settings. This section lists and describes available datasets that are widely used for evaluating emotion recognition systems. The descriptions are summarized in Table 8.

Textual Datasets (Corpora)
Textual datasets, i.e., corpora, are widely used in computer linguistics for evaluating natural language processing systems like taggers, parsers, translators, etc. With the increasing number of social-media participants, emotion recognition from unstructured written text is of growing interest. There are several ways to represent emotions in texts depending on the particular emotion model used.
Stack Overflow (Q&A) dataset has been created by manually annotating 4800 posts (questions, answers, and comments) from the platform Stack Overflow (Q&A) [104]. Each post in the dataset was analyzed regarding the presence or absence of emotion. Of the 4800 posts in total, 1959 were labeled with a basic emotion.

SN Computer Science (2022) 3:282

EmoBank [105] is a large-scale corpus of 10k English sentences. Data annotation was performed by applying the dimensional Valence-Arousal-Dominance (VAD) scheme [106]. A subset of the dataset has been annotated initially according to Ekman's six basic emotions, such that the mapping between both representation formats (dimensional and basic) becomes possible. Each sentence was rated from both the writer's and the reader's perspective.
Emotion intensity dataset of tweets [107] was created to study the impact of word hashtags on emotion intensities in the text. The annotation is performed on 1030 tweets in the form of a hashtag with a query term (#<query term>). The dataset annotates intensities for emotion: anger, joy, sadness, and fear, respectively. The study proved that "emotion-word hash-tags" influence emotion intensity by transferring more emotion.
Social Media Posts-based dataset [108] has been developed for the training of prediction models for valence and arousal that achieve high predictive accuracy. The dataset consists of 2895 Facebook posts that have been rated by two psychologically trained experts on two separate ordinal nine-point scales regarding valence and arousal, thus defining each post's position on the circumplex model of affect [43].
Twitter microblogs dataset [111] was established for exploring the impact and correlation of "external factors" like weather, news events, or time with a user's emotional state. The dataset consists of 2557 tweets that were collected early in 2017. The tweets are self-tagged by their respective author with a #happy or #sad hashtag and come with metadata such as author, time, and location. They originate from 20 large US metropolitan areas.
SemEval-2018 Task 1 [112] provides a collection of datasets for English, Arabic, and Spanish tweets. In particular, an Affect in Tweets dataset of more than 22,000 annotated tweets has been established. For each emotion dimension (anger, fear, joy, and sadness), the data are annotated with fine-grained real-valued scores indicating the intensity of emotion.
GoEmotions [113] is the currently largest available manually annotated dataset of 58k English Reddit comments, labeled for 27 emotion categories or Neutral.
EmoEvent [114] is a multilingual dataset collected from Twitter and based on different emotional events. A total of 8409 tweets in Spanish and 7303 in English were labeled with Ekman's six basic emotions plus "neutral or other emotions".

Emotional Speech Datasets
Speech datasets are utilized as a foundation for different researches in the field of emotional speech analysis. An overview is given in Table 8.
AESDD [122] The Acted Emotional Speech Dynamic Database contains around 500 Greek utterances by a diverse group of actors simulating five emotions.

Emov-DB [123] The Emotional Voices Database contains English recordings from male and female actors and French utterances from a male actor. The annotated emotional states are neutral, sleepiness, anger, disgust, and amused.
JL corpus [124] is a strictly guided, simulated emotional speech corpus of four long vowels in New Zealand English. It contains 2400 recordings of 240 sentences by 4 actors (2 males and 2 females). Five primary emotions (angry, sad, neutral, happy, and excited) and five secondary emotions (anxious, apologetic, pensive, worried, and enthusiastic) are annotated.
EmoFilm [125] is a multilingual emotional speech corpus that consists of 1115 English, Spanish, and Italian emotional utterances extracted from 43 films and 207 speakers. Five emotions are recognized: anger, contempt, happiness, fear, and sadness.
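Work on corpora like those above typically starts by extracting low-level acoustic descriptors from the waveforms. The sketch below computes two classic ones, short-time energy and zero-crossing rate, in plain Python; the frame and hop sizes are common defaults of our own choosing, not values prescribed by any particular dataset.

```python
import math

def short_time_features(signal, sr=16000, frame_ms=25, hop_ms=10):
    """Per-frame short-time energy and zero-crossing rate (ZCR), two
    classic low-level descriptors for emotional speech analysis.
    Frame/hop sizes are illustrative defaults."""
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    feats = []
    for start in range(0, len(signal) - frame + 1, hop):
        w = signal[start:start + frame]
        energy = sum(x * x for x in w) / frame
        # Count sign changes between consecutive samples.
        crossings = sum(1 for a, b in zip(w, w[1:]) if (a >= 0) != (b >= 0))
        feats.append((energy, crossings / (frame - 1)))
    return feats

# A pure 440 Hz tone: moderate energy, low zero-crossing rate.
tone = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
feats = short_time_features(tone)
```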

Facial Expression Datasets
Recognizing emotions from facial expressions is an important area of research. Classifying facial expressions requires large amounts of data to reflect the diversity of conditions in the real world. Table 8 provides an overview.
UIBVFED [126] provides sequenced semi-blurry facial images with different head poses, orientations, and movement. Over 3000 facial images were extracted from the daily news and weather forecast of the public TV station PHOENIX. Seven basic emotions are categorized, namely: sad, surprise, fear, angry, neutral, disgust, and happy, as well as "None" if the facial expression could not be recognized.
FFHQ [127]: Flickr-Faces-HQ includes 70,000 high-quality .png images in high resolution (1024×1024) and contains considerable variation in age, ethnicity, and image background. It also has good coverage of accessories such as glasses, sunglasses, and hats. The images were crawled from Flickr.

[128] is a large-scale dataset consisting of facial image triplets along with human annotations indicating which two faces in each triplet form the most similar pair in terms of facial expression. Each triplet was annotated by six or more human raters, which distinguishes the dataset from existing expression datasets that focus mainly on basic emotion classification or action unit recognition.

[129] is a dataset containing 165 GIF images of 15 different people in varying lighting conditions. The people in the images show distinct emotions and expressions (happy, normal, sad, sleepy, surprised, and winking).

CAS(ME)2 Chinese Academy of Sciences Macro-Expressions and Micro-Expressions [130] contains both macro- and microexpressions in long videos, which the authors say facilitates the development of algorithms that detect microexpressions in long video streams. The database consists of two parts: one contains 87 long videos with both macroexpressions and microexpressions from a total of 22 subjects, all filmed in the same setting; the other includes 357 cropped expression samples comprising 300 macroexpressions and 57 microexpressions. The facial expression samples were coded with facial action units marked and emotions labeled. In addition, participants were asked to review each recorded facial movement and indicate their emotional experience of it. Emotion is labeled using four types (negative, positive, surprise, and others): happiness and sadness are classified as positive and negative, respectively; surprise refers to an emotion that can be positive or negative; and the "others" category represents ambiguous emotions that fit none of these categories.

AFEW Acted Facial Expressions in the Wild [131] was created in a semi-automatic process from 37 DVD movies: first, the subtitles were parsed and searched for keywords; then the relevant clips were assessed and annotated by a human observer. In total, AFEW version 4.0 includes 1268 clips annotated with the basic emotions (anger, disgust, fear, happiness, neutral, sadness, and surprise). The AFEW dataset has served several times as the basis for the "Emotion Recognition In The Wild Challenge".

SFEW Static Facial Expressions in the Wild [132] was created by selecting frames from the AFEW [131] dataset. It comprises 700 images labeled with the six basic emotions and presents real-world images with a variety of facial-cue properties (e.g., head poses, age range, and illumination variation).
SPOS Spontaneous vs. Posed dataset [133] contains spontaneous and posed facial expressions from seven subjects who were shown emotional movie clips to elicit spontaneous facial expressions. Six basic emotion categories were considered (happy, sad, anger, surprise, fear, and disgust). The subjects were also asked to pose these six facial expressions after watching the clips. Data were recorded with both visual and near-infrared cameras. A total of 84 posed and 147 spontaneous facial expression clips were labeled.
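The datasets above use differing label vocabularies ("happy" vs. "happiness", plus extra states such as "winking"), so cross-dataset work typically normalizes labels to one taxonomy first. The alias table below is purely illustrative, our own assumption rather than a mapping from any cited paper; real mappings need per-dataset care.

```python
# Illustrative normalization of dataset-specific emotion labels to
# Ekman's six basic emotions plus "neutral". The alias table is an
# assumption for demonstration only.
BASIC = {"anger", "disgust", "fear", "happiness", "sadness", "surprise", "neutral"}
ALIASES = {
    "angry": "anger",
    "happy": "happiness",
    "joy": "happiness",
    "sad": "sadness",
    "surprised": "surprise",
}

def normalize_label(label):
    """Return the canonical label, or None if it cannot be mapped
    (e.g., non-emotional states like 'winking')."""
    key = label.strip().lower()
    if key in BASIC:
        return key
    return ALIASES.get(key)
```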

Hybrid Emotion Datasets
Recognition results can be more accurate when data are collected from different modalities, such as text, audio, video, body, and physiological data. Table 8 lists the available datasets in the area of emotion recognition based on hybrid data sources. The table has been extended by a column titled "Type" to indicate the modality used to extract emotions.
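Such hybrid datasets make it possible to train one classifier per modality and then fuse their outputs. A minimal late-fusion sketch is given below; the modality names, class names, and uniform weights are illustrative assumptions (in practice, weights are tuned on validation data).

```python
def late_fusion(modality_probs, weights=None):
    """Weighted average of per-modality class probabilities (late fusion).
    `modality_probs` maps modality name -> {class: probability}; weights
    default to uniform."""
    if weights is None:
        weights = {m: 1.0 for m in modality_probs}
    total = sum(weights[m] for m in modality_probs)
    classes = set().union(*(p.keys() for p in modality_probs.values()))
    return {
        c: sum(weights[m] * p.get(c, 0.0) for m, p in modality_probs.items()) / total
        for c in classes
    }

# Text is fairly sure of "joy"; audio leans toward "sadness"; the fused
# result keeps "joy" but with reduced confidence.
probs = {
    "text": {"joy": 0.8, "sadness": 0.2},
    "audio": {"joy": 0.4, "sadness": 0.6},
}
fused = late_fusion(probs)
```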

Speech-Video Datasets
The records listed below combine voice and video. Table 8 provides an overview.
HUMAINE [145] dataset contains naturalistic clip samples of emotional behavior in relation to the context (static, dynamic, indoor, outdoor, monologue, and dialogue). The emotional state is described in each case by a series of annotations associated with the clips: these include core signs in speech and language as well as gestures and facial features related to different genders and cultures. Six annotators rated a wide range of emotional dimensions (intensity, activation/arousal, valence, and power).
The Belfast [146] dataset contains clips extracted from television programs (chat shows and religious programs). These are recordings of people discussing emotional issues; 100 clips were annotated.
SEMAINE [147] dataset was created as part of an iterative approach to the creation of automatic agents called Sensitive Artificial Listener (SAL), which involves a person in an emotional conversation. The participants annotate each clip in five emotional dimensions (valence, activation, power, anticipation/expectation, and intensity).
IEMOCAP [148] (Interactive Emotional Dyadic Motion Capture) is a multimodal, multi-speaker dataset that includes video, speech, facial motion capture, and text transcriptions over a period of 12 h. Ten actors act out scenarios specially selected to evoke emotional expression. IEMOCAP was annotated by six annotators both in categorical and in dimensional terms.

GEMEP-FERA [149] comprises ten recordings of actors displaying expressions with different intensities. Five independent discrete emotions are labeled per video.
SAVEE [150] (Surrey Audio-Visual Expressed Emotion) dataset covers the six basic emotions plus neutral. It was created as a prerequisite for the development of an automatic emotion recognition system and consists of recordings of four English actors in seven emotion categories, a total of 480 British English utterances.
Biwi 3D-Audiovisual Corpus [151] contains 1109 dynamic 3D face scans taken while uttering an English sentence. The information was extracted by tracking the frames using a simple face template, by splitting the speech signal into phonemes, and by evaluating the emotions using an online survey. The data set can be used in areas such as audio-visual emotion recognition, emotion-independent lip reading, or angle-independent facial expression recognition.
OMG-Emotion (One-Minute-Gradual Emotion) [152] dataset is composed of 567 emotion videos with an average length of 1 min, collected from a variety of YouTube channels using the search term "monologue". The videos were separated into clips based on utterances, and each utterance was annotated by at least five independent subjects using an arousal/valence scale and a categorical emotion based on Ekman's universal emotions.

CAER [154] (Context-Aware Emotion Recognition) dataset contains 13,201 video clips (with audio and visual tracks) and about 1.1 M frames extracted from 79 TV shows. Each clip is manually annotated with one of seven emotion categories: "anger", "disgust", "fear", "happy", "sad", "surprise", and "neutral". The clips range from short (around 30 frames) to longer ones (more than 120 frames), with an average length of 90 frames. A static image subset contains about 70,000 images. The dataset is randomly split into training, validation, and testing sets.
SEWA [155] is an audio-visual, multilingual dataset with recordings of facial, vocal, and speech behaviors made "in the wild". It includes more than 2000 min of audio-visual data from 398 individuals (201 males and 197 females) across a total of six different languages. The recordings are annotated with face landmarks, facial action unit (FAU) intensities, different vocalizations, verbal cues, mirroring and rapport, continuously rated valence, arousal, and liking, and prototypical examples (templates) of (dis)liking and mood.

CMU-MOSEI [156] The Multimodal Opinion Sentiment and Emotion Intensity dataset contains more than 23,000 sentence utterances in more than 3300 video clips from more than 1000 online YouTube speakers. The dataset is gender balanced, and the utterances are randomly drawn from monologue videos on various topics. The videos are annotated with the six basic emotion categories (happiness, sadness, anger, fear, disgust, and surprise).

VAM [157] The Vera am Mittag German Audio-Visual Emotional Speech Database consists of 12 h of recordings of the German TV talk show "Vera am Mittag" ("Vera at Noon"). They are divided into broadcasts, dialogue acts, and utterances, and contain spontaneous, highly emotional speech recorded from unscripted, authentic discussions between talk show guests. The video clips were annotated by a large number of human raters on a continuous scale for three emotion primitives: valence (negative vs. positive), activation (calm vs. excited), and dominance (weak vs. strong). The video section contains 1421 segmented utterances from 104 different speakers, the audio section contains 1018 utterances, and the facial image section contains 1872 facial images labeled with emotions.

Hybrid Facial Expression Datasets
Several image and video datasets have been introduced to support the analysis and prediction of human emotional reactions based on facial expressions. An overview is given in Table 8.

Cohn-Kanade (CK) [158] The CK dataset was made available to the research community in 2000. The image data consist of about 500 image sequences of 100 subjects and were FACS (Facial Action Coding System, see above) annotated. An extended dataset called CK+ was published in 2010: the number of sequences is increased by 22% and the number of subjects by 27%, the target expression for each sequence is fully FACS coded, and the emotion labels have been revised. In addition, non-posed sequences for different types of smiles and the associated metadata have been added.
JAFFE [159] The Japanese Female Facial Expression dataset includes 213 annotated images of 7 facial expressions (6 basic expressions + neutral) posed by Japanese female models. Each image was rated by 60 Japanese annotators on 6 emotion classes. The images are in uncompressed .tiff format (see also [159]). Semantic ratings on emotion adjectives, averaged over the 60 subjects, are provided in a text file. The JAFFE images may be used for non-commercial scientific research.

MMI [162] was created as a resource for building and evaluating facial expression recognition algorithms. It comprises over 2900 videos and high-resolution images of 75 subjects. Action Units (AU) in the videos were fully annotated and partially coded on frame level.

NVIE [163] The Natural Visible and Infrared Facial Expressions dataset contains both spontaneous and posed expressions of more than 100 subjects. The images were taken synchronously with a visible and an infrared thermal imaging camera, with illumination from three different angles. The dataset also allows a statistical analysis of the relationship between face temperature and emotions.

BU-4DFE (Binghamton University 3D Dynamic Facial Expression Database)
EMOTIC [164] EMOTions In Context is a database of images of people in real environments, annotated with their apparent emotions. The images are annotated with an extended list of 26 emotion categories combined with the three common dimensions (valence, arousal, and dominance). The dataset contains 23,571 images and 34,320 annotated people.

Affectiva-MIT [165] is a labeled dataset of spontaneous facial responses recorded in natural settings over the Internet: online viewers watched one of three intentionally amusing Super Bowl commercials while being filmed with their webcams and answered three self-report questions about their experience. The dataset consists of 242 facial videos (168,359 frames).
Affectiva [166] is described as the largest emotion dataset, growing to nearly 6 million faces analyzed in 75 countries and representing about 2 billion face frames analyzed. Affectiva includes spontaneous emotional responses of consumers while engaged in a variety of activities. The dataset consists of viewers watching media content (ads, movie clips, TV shows, and viral online campaigns) and has been expanded to include other contexts, such as videos of people driving cars, people in conversational interactions, and animated GIFs.

DISFA [167] The Denver Intensity of Spontaneous Facial Action dataset contains high-resolution stereo videos (1024×768) of 27 people (12 women and 15 men) that capture the spontaneous (non-posed) emotions of the persons while watching video clips. Each video frame was manually coded for presence, absence, and intensity of facial action units according to the Facial Action Coding System (FACS). An extension, DISFA+, also comprises posed facial expression data, more detailed annotations, and metadata in the form of facial landmark points (in addition to each individual's self-report on every posed facial expression).

LIRIS-ACCEDE [168] comprises 9800 video segments with large content diversity, selected from 160 diversified movies. Affective annotations along the valence and arousal axes were obtained via crowdsourcing using a pairwise video comparison protocol.
FABO [169] The Bimodal Face and Body Gesture database contains 1900 videos of face and body expressions recorded simultaneously by two cameras. The dataset combines facial and bodily cues in an organized bimodal manner.
Kinect FaceDB [170] includes facial images of 52 persons acquired by Kinect sensors. The data were captured in different time periods involving 9 different facial expressions under different conditions: neutral, smile, open mouth, left profile, right profile, occlusion eyes, occlusion mouth, occlusion paper, and light on.
YouTube emotion datasets [171] contain 1101 videos annotated with 8 basic emotions using Plutchik's Wheel of Emotions [54]. The research effort focused on recognizing emotion-related semantics.

Multimodal Emotion Datasets
Multimodal datasets combine video data with synchronously recorded physiological signals.
The following examples are also listed at the bottom of Table 8.

MAHNOB-HCI [172] consists of multimodal recordings of participants responding to excerpts from movies, images, and videos. The modalities include multi-camera video of the face and head, speech, eye gaze, pupil size, ECG, GSR, respiration amplitude, and skin temperature. The recordings for all excerpts were annotated by the 27 participants immediately after each excerpt, using a form with five questions about their own emotive state based on Self-Assessment Manikins (SAMs) [32]. Precise synchronization permits researchers to study simultaneous emotional responses across the different channels.
DEAP [173] The Database for Emotion Analysis using Physiological Signals is a multimodal dataset combining face videos with electroencephalogram (EEG) and peripheral physiological signals. 32 participants were recorded while watching 40 one-minute excerpts of music videos. The participants rated each video in terms of arousal, valence, like/dislike, dominance, and familiarity.
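DEAP's continuous 1-9 ratings are often converted into binary high/low classes for classification experiments. The sketch below uses a midpoint threshold of 5, which is a common convention in follow-up work rather than something the dataset itself mandates; the sample ratings are illustrative.

```python
def binarize_ratings(ratings, threshold=5.0):
    """Map continuous 1-9 self-assessment ratings to 'high'/'low' classes.
    The midpoint threshold is a common convention in the literature,
    not part of the DEAP dataset specification."""
    return {dim: ("high" if value > threshold else "low")
            for dim, value in ratings.items()}

# One participant's ratings for a single video (values illustrative):
labels = binarize_ratings({"valence": 7.2, "arousal": 3.1, "dominance": 5.0})
```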
RECOLA [174] The Remote Collaborative and Affective Interaction dataset consists of audio, visual, and physiological signal (ECG and EDA) recordings. Video-conference interactions between 46 French-speaking participants solving a cooperation task were recorded synchronously. The emotions expressed by the participants were self-reported using the Self-Assessment Manikin (SAM) [32].
EMDB [175], the Emotional Movie Database, consists of 52 affective movie clips from different emotional categories without auditory content. The recorded signals are skin conductance level (SCL) and heart rate (HR). Participants annotated each clip with subjective scores for arousal, valence, and dominance (each on a scale from 1 to 9).

DREAMER [179] is a multimodal database consisting of electroencephalogram (EEG) and electrocardiogram (ECG) signals together with 23 participants' self-assessments of their emotions in terms of valence, arousal, and dominance. The signals were captured using portable, wearable, and wireless equipment during affect elicitation by means of audio-visual stimuli.

MMSE (Multimodal Spontaneous Emotion)
MEISD [180] is a large-scale, balanced Multimodal Multi-label Emotion, Intensity, and Sentiment Dialogue dataset collected from different TV series, with textual, audio, and visual features. The dataset is annotated with the six basic emotions, extended by two further labels, acceptance and neutral; the "acceptance" emotion is taken from Plutchik's wheel of emotions [54].

Systems for Emotion Recognition
This chapter sketches a number of systems and Application Programming Interfaces (APIs) for emotion recognition, which are used in numerous areas such as health care, education, and entertainment. These systems are based on the various methods discussed above, namely face analysis, speech processing, physiological signals, recognition and analysis of emotional phrases in social media, and body language and gesture expressions. An overview of these systems is given in Table 9.

Text-Based Interfaces for Emotion Detection
IBM Watson 8 is an analyzer of emotions in written text. It detects emotional tones, social tendencies, and writing styles in texts of any length. Currently, the tool can analyze online texts such as tweets, online reviews, email messages, product reviews, or user texts for emotional content.
ToneAPI 9 was created for marketing people to evaluate (and potentially improve) the emotional impact of their advertising texts quantitatively and qualitatively. For this purpose, input texts are analyzed, compared with other texts from a corpus, and emotions and their intensity are derived. A total of 8 emotions are identified and their intensity is evaluated with a value between 1 and 100.
Receptiviti 10 is a computational language psychology platform that aims at helping to understand the emotions, drives, and traits that affect human behavior. A set of algorithms uncovers signals from everyday human language, e.g., stress, depression, etc. The analysis is performed in real time without needing self-reports or surveys.
Synesketch [182] is an open source tool for analyzing the emotional content of text sentences and transforming the emotional tone into some visualizations. It is a dynamic text representation in animated visual patterns to reveal the underlying emotion.
EmoTxt [183] is an open-source toolkit for emotion detection from text. It was trained and tested on two large gold-standard datasets mined from Stack Overflow and Jira. It supports both emotion recognition from text and the training of custom emotion classification models.

Audio-Based Interfaces for Emotion Detection
EMOSpeech 11 is an interface based on an end-to-end psychological model. It is designed to help automated call agents analyze recorded customer calls and then send real-time feedback to supervisors. It uses a three-dimensional emotion representation model and recognizes ten emotions from acoustic features in the voice.
The open-source version of the software selects between these five emotions with high accuracy, even when hearing the speaker for the first time, according to the manufacturer. The "plus" version is said to reach the performance level of a dedicated human listener.

Video-Based Interfaces for Emotion Detection
FaceReader 13, created by Noldus [184], is a professional tool for the automatic analysis of facial expressions. More than 10,000 manually annotated images were used to train the recognition components: emotion, gaze direction, head orientation, and personal characteristics such as gender and age.
RealEyes 14 is a platform for emotion recognition using webcams [185], used to measure people's feelings while they watch video content online. Computer vision and machine learning techniques are used to analyze signals from physiological sensors, voice, and posture.

Multimodal Interfaces for Emotion Detection
Cloud Vision 18 is a tool created by Google that recognizes faces, signs, landmarks, objects, and text, as well as emotions, by detecting facial features within images or video. The Cloud platform takes an image as input and returns the expected likelihood of each emotion for each face in that image.

Microsoft Cognitive Services 19, formerly Microsoft Project Oxford, is a set of tools that makes it possible for a computer to identify emotions in photographs using facial recognition technology. It detects emotional depth for each face using the seven core emotional states as well as "neutral". Each face in a scanned image is bounded with a box and assigned a score between zero and one, where zero corresponds to a complete absence of the emotion in question and one to a strong emotional response.
CLMtrackr 20, created by MIT, is a JavaScript library for fitting facial models to faces in videos or images [186]. It is open and free to use, performs precise tracking of facial features, and recognizes four emotional states: angry, sad, surprised, and happy.

Kairos 21 integrates face detection with demographic data (age, gender, and ethnicity). It detects emotion from faces in real time, as well as ethnicity, to reflect the diversity of human faces.

Amazon Rekognition 22 identifies objects, people, text, scenes, and activities, as well as emotions. The tool accepts two input sources: image and video. The confidence level of each determination ranges from 0 (minimum) to 100 (maximum). The application is based on the same learning technology developed by Amazon's computer vision teams to analyze billions of images and videos daily.
Sightcorp 23 provides face analysis and face recognition software using computer vision and deep learning techniques. It supports emotion recognition, age detection, gender detection, attention time, and eye-gaze tracking in images, videos, and real-life environments.

SHORE 24 detects the emotion, age, and gender of a person from a standard webcam. Its special feature is the ability to analyze and recognize emotions from a video input with multiple faces simultaneously.

nViso 25 analyzes real-time emotions from facial expressions in video using 3D facial imaging technology. nViso can monitor many different facial data points to produce likelihoods for the main emotion categories.
The iMotions 26 Facial Expression Analysis (FEA) module provides 20 facial expression measures, seven core emotions (joy, anger, fear, disgust, contempt, sadness, and surprise), facial landmarks, and behavioral indices such as head orientation and attention. These output measures are assigned probability values representing the likelihood that the expected emotion is being expressed. Summary values for engagement and valence are also provided.

Affectiva 27 is designed to detect facial cues or physiological responses associated with emotions. It tracks a person's heart rate from the face using a webcam, without any worn sensors, based on the color change in the person's face that pulses with every heartbeat.

Information Systems (IS) Exploiting Emotion Data
Emotion recognition applications such as those mentioned in the previous chapter are used in various fields and embedded in domain-specific information systems. Examples of such domains include medicine, e-learning, human resources, marketing, entertainment, and automotive; these are briefly outlined below. Table 10 lists some such systems, organized by application domain.
Health/Medical: Stress and various psychological problems require proper psychometric analysis of the patient. A healthcare system that focuses on emotional aspects may improve quality of life. Such systems automatically monitor both the environment and the person to provide help and services. Coupling emotion recognition systems with activity recognition systems [187,188] can also generate important insights for supporting older people in the AAL context.

E-learning: Student emotions play an important role in any learning environment, whether in a classroom or in e-learning. In face-to-face classes, it is easier to observe student behavior, because the instructor interacts with students face-to-face in the same environment. In the virtual classroom, this is more difficult, especially since course participants often turn off their cameras. This is where techniques based on multimodal recognition come in to assist the lecturer.
Hiring/interview: Companies have begun to integrate emotion recognition technology into hiring processes to capture employees' emotions. Employers argue that the technology can help eliminate hiring biases and reject candidates with undesirable traits. During interviews, employers can track candidates' emotional responses to each question in the interview to determine how honest the applicant is. In addition, analyzing employee stress levels can impact productivity and career success. The discussion of whether one can ethically justify such an approach has to take place elsewhere.
Entertainment: Modern computer games use information about the player's emotions and emotional reactions to dynamically adjust the game's difficulty level and audiovisual features. Monitoring is done with multimedia tools, for example in video games: player behavior and emotional states are measured to improve engagement, challenge, immersion, and excitement, in addition to adjusting game features. Emotion recognition is also used to evaluate the success of a game according to the experience of the player interacting with it.

HCI/Robotics: Human-computer interaction, i.e., the interaction between information technologies and humans, has become similar to human-human interaction. Detecting the emotional state during human-computer interaction aims to make this interaction more comfortable, effective, and easy. Similarly, the integration of emotions in robots is being researched to make robots more "social" and "human-like".

Marketing:
Marketing has moved to the study of consumer attitudes and measures the factors that influence consumer decisions. Emotions have become an important aspect of this: they help companies better understand the opinions expressed about a consultation or product.

Automotive industry:
In the automotive industry, emotional information can be used to respond appropriately to recognized driver emotions to enhance safety and support the driving experience, e.g., by suggesting useful information or a conversation.

Table 10 Example systems by application domain:
- Health/Medical [189,190]: help to decide when patients need medication [191]; support monitoring in rehabilitation/therapy [192]; monitor autism spectrum disorders (ASD) [193,194]
- E-learning: emotion feedback from students [195]; monitor the level of students' concentration [196]; detect the emotion of the learner [197]
- Hiring/interview: track candidates' emotions during interviews [198]; analyze the stress level of employees [199,200]
- Entertainment: adapt games according to the player's mood [201,202]; test game success according to the user's experience [203,204]
- HCI/Robotics: improve Human-Computer Interaction (HCI) [205,206]; effective human-to-robot communication [207,208]
- Marketing: monitor the impact of advertisements [209,210]; monitor emotions in purchasing decisions to improve sales [211,212]
- Automotive industry: alert the driver when he/she looks sleepy or drowsy [213,214]; improve the driving experience [215,216]

Summary
As shown in the previous sections, there are a variety of methods and tools that can be used to model, analyze, and predict human emotions. We have reviewed the relevant literature on this as far as it was available to us; the extensive bibliography substantiates this. Overall, we can summarize the contribution of our study as follows:
- It combines emotion models, languages, ontologies, datasets, and systems.
- It provides a comprehensive and systematic overview of the field of human emotion recognition, organized into tables that give the reader an easy and coherent way to find the data they need.
- It provides a comprehensive benchmark for users building human-support applications in a specific environment by exploiting the available semantic and contextual information; there is no need to reinvent the wheel.
- It can serve as a starting point for anyone studying human emotions, especially budding researchers.
- It introduces a resource for applications in various fields, for instance computer science, psychology, machine learning, human-computer interaction, e-learning, information systems, and cognitive science.
It has been noted that there are numerous trends in emotion research with different goals and models. Facial expressions tend to be the most promising approach to emotion measurement, but they are easier to fake than other recognition methods (e.g., voice or biosignals). Emotion recognition accuracy can be increased by combining multiple modalities with information about the context, preferences, and situation of the observed person. Multimodal emotion recognition has been shown to be superior to unimodal recognition, and several datasets that serve as benchmarks for multiple emotion modalities have recently been made available. The next generation of recognition tools will capture emotions from multiple input devices, building on both recent technological advances and integration methods.
The context in which emotions are experienced is another important topic currently being discussed by researchers. Some work has implemented ontology-based methods to better interpret contextual human emotion manifestation [96][97][98][99].
Other developments include the ability to recognize and interpret not only current physical activities and reactions, but also relevant mental states beyond basic emotions, such as shame and pride.

Open Research Questions
As an additional outcome of this literature review, we summarize a list of limitations and open research questions that we would like to share with the research community:
- Current emotion representation languages such as [65,66,68-71] are XML extensions that do not provide more advanced knowledge representation or automated reasoning capabilities. None of these languages is general enough to cover all emotion vocabularies.
- Known recognition frameworks (listed in "Datasets") have a limited concept space, ignoring some real-world details. This limits their ability to reflect the user's intentions in unpredictable situations.
- Several emotion ontologies have been proposed that share many similarities in terms of their concepts, classes, and underlying psychological theory; for example, Ekman's basic emotion theory [39] is adopted in most ontologies. However, there does not yet seem to be a universally shared ontology.
- The majority of emotion recognition interfaces are based on recognizing and analyzing facial data (i.e., image or video data) to measure emotions. However, studies [217-223] have shown that integrating different modalities provides better accuracy than a single modality. It therefore stands to reason that future research will focus on multimodal emotion recognition [172-175,177-179].
- A dataset that integrates emotion recognition with context is lacking; such data could be useful to answer theoretical questions about the causes of emotional responses.
- We identify the need for a conceptual human emotion framework, in particular a domain-specific modeling language [224] that can describe human emotions in a comprehensive, reliable, and flexible way. The main challenges in developing such a language are the diversity of theoretical models and the complexity of human emotions, as well as the influence of temporal structure and human context on the interpretation of emotions.
For example, a person may express the same emotion in different situations depending on the context, sometimes with different intensity. We have presented a first approach to such a language, our "Human Emotion Modeling Language", in [225].
Finally, we take the opportunity to thank the anonymous reviewers of this paper for their valuable comments.
Funding Open access funding provided by University of Klagenfurt. We did not apply for external funding for this research.

Conflict of Interest
The authors declare that they have no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.