1 Introduction

In recent years, there has been growing interest in the development of socially intelligent robots, which are envisioned to interact with humans in a variety of social and emotional roles, such as household assistants, companions for children and the elderly, partners in industry, guides in public spaces, educational tutors in schools, and so on [1]. There is accumulating evidence that expressive robots, equipped with the ability to show human-like emotions, are rated as more likable and humanlike, and lead to higher engagement and more pleasurable interactions [2,3,4,5]. Additionally, trust, acceptance, and cooperation with a robot depend on the match between the social context of the situation and the emotional behaviour of the robot [6, 7]. Understanding how people perceive and interact with emotional robots is therefore crucial, given the growing deployment of these robots in social settings.

Humans are experts in social interaction. During face-to-face social interactions, the human sensory system uses multimodal analysis of multiple communication channels to recognize another party’s affective and emotional states [8]. A channel is a communication medium; for example, the auditory channel carries speech and vocal intonation signals, and the visual channel carries facial expressions and body language signals. A modality is a sense used to perceive signals from the outside world (e.g., sight, hearing) [9]. Engaging in a “routine” conversation is a rather complex multimodal task; a human must carefully attend to and decipher cues arriving through several sensory modalities and communication channels at once. Multimodality makes the analysis of affective information highly flexible and robust: failure of one channel is recovered by another channel, and information in one channel can be explained by information in another channel (e.g., a facial expression that might be interpreted as a smile will be interpreted as a display of sadness if at the same time we see tears and hear weeping) [8].

To be effective social interaction partners, robots must also exploit several channels (i.e., auditory, visual) and mechanisms (e.g., body posture, facial expressions, vocal prosody, touch, gaze) to communicate their internal emotional states and intentions in an authentic and clear way [10, 11]. Many researchers have explored the design space of anthropomorphic or zoomorphic robots equipped with expressive faces (e.g., [5, 12,13,14,15,16]), emotional voices (see [17] for a survey), body language (e.g., [18,19,20,21]), and other features and capacities that make human–robot social interactions more human-like. While Human–Robot Interaction (HRI) research on the emotional expressions of robots initially focused largely on single modalities in isolation [22], researchers have more recently begun to integrate multiple channels in order to approach the richness of human emotional communication. For instance, HRI studies have examined the perception of emotional expressions involving faces and voices [23], faces and gestures [24], gestures and voices [25], and face-voice-gesture [23] combinations.

The results of these studies show that recognition accuracy, as well as attitudes towards robots such as expressiveness [23], likability [25] and trust [6], increase when multiple channels are used to convey congruent emotional information. Although it may not be possible to incorporate all features of human communication into robots (due to the complexity of the phenomenon), the affect-expression capability of humans can serve as the “gold standard” and a guide for defining design recommendations for multimodal expression of human-like affective states. However, the importance of congruence in the multimodal emotional expressions of robots remains largely under-explored. Two of the primary channels robots use to express emotion multimodally are the auditory and the visual channel, yet little is known in HRI about how appropriate (or inappropriate) combinations of audio-visual stimuli shape the perception of a robot’s emotions. Both the psychological and Human–Computer Interaction (HCI) literature suggest that people favour congruence (also known as consistency) over incongruence (also known as inconsistency). For instance, studies with Embodied Conversational Agents (ECAs) [26,27,28,29,30] that presented congruent and incongruent auditory and visual stimuli at the same time showed that emotional information conveyed in one modality (i.e., vocal prosody, facial expressions) influences the processing of emotional information in the other modality, and that congruent emotional information across auditory and visual channels tends to facilitate emotion recognition. Conversely, incongruent emotional responses can adversely affect user ratings (i.e., trust, likability, expressiveness) of ECAs [30]. Does the same apply during interactions with robots? For example, what do people perceive if they observe a robot with a happy body posture combined with a concerned voice? Do people base their perceptions of the emotion on one channel more than another? That is, does either the visual or the audio channel dominate perceptions of the emotional expression, or are both essential? Additionally, what is the impact of incongruent emotional expressions on people’s attitudes towards robots?

In this article, we aim to investigate the multimodal perception of emotions of humanoid robots in the context of social interactions with humans. We draw insights and perspectives about the multisensory integration of congruent and incongruent emotional information from the fields of HRI, HCI, psychology, and neuroscience. We investigate how people recognize emotions expressed by a humanoid robot multimodally, via two different modalities: the body (i.e., head, arms and torso position and movement) and the voice (i.e., pitch, timing, loudness and non-verbal utterances). We consider two distinct cases of incongruence, namely, contextual incongruence and cross-modal incongruence. The first case refers to the conflict situation where the robot’s reaction is incongruous with the socio-emotional context of the interaction (e.g., a robot expresses happiness in response to a sad situation). The second case refers to the conflict situation where an observer receives incongruous emotional information across the auditory (robot’s vocal prosody) and visual (robot’s whole-body expressions) modalities (e.g., a robot combines a sad voice with happy body postures). We investigate the effects of contextual incongruence and cross-modal incongruence on people’s ability to recognize the emotional expressions of a robot, as well as on people’s attitudes towards a robot (i.e., believability, perceived intelligence and likability). Specifically, we address the following research questions:

  1. How are voice (i.e., pitch, timing, loudness, and non-verbal utterances) and body (i.e., head, arms and torso position and movement) expressions of a humanoid robot perceived when presented simultaneously with a congruent or incongruous socio-emotional context?

  2. How are voice and body expressions of a humanoid robot perceived when presented simultaneously in congruent and incongruent multimodal combinations?

  3. What impact does incongruence have on people’s perceptions of the robot, in terms of believability, perceived intelligence, and likability?

The rest of this article is organized as follows: we start by discussing the importance of multisensory interaction in human social interactions, and highlight research which has examined multisensory integration effects using behavioural experiments, functional neuroimaging (fMRI) and electroencephalography (EEG) measurements in psychology. We then discuss the importance of congruence in the context of HRI and detail relevant research that has already investigated this space. Following this, we describe a social HRI laboratory experiment which we conducted to investigate our research questions. A discussion of the results is then provided, followed by a set of guiding principles for future research towards the design of multimodal emotional expressions for humanoid robots.

2 Background and Related Work

2.1 Multisensory Interaction (MI) Research in Psychology and Neuroscience

MI refers to the processes by which information arriving from one sensory modality interacts with, and sometimes biases, the perception of cues presented in another modality, including how these sensory inputs are combined to yield a unified percept [31,32,33]. MI effects have been studied using behavioural experiments, functional neuroimaging (fMRI) and electroencephalography (EEG) measurements with faces and voices [34,35,36], faces and bodies [37, 38], body expression and voices [38, 39], and body and sound stimuli [40]. The results suggest strong bidirectional links between emotion detection processes in vision and audition. Additionally, there is accumulating evidence that integration of different modalities, when they are congruent and synchronous, leads to a significant increase in emotion recognition accuracy [41]. However, when information is incongruent across different sensory modalities, integration may lead to a biased percept, and emotion recognition accuracy is impaired [41].

Perception of Emotion From Face and Voice Previous MI research has mainly investigated the perception of emotional face-voice combinations [34,35,36]. For example, de Gelder and Vroomen [34] presented participants with static images of facial expressions that were morphed on a continuum between happy and sad, combined with a short spoken sentence. This sentence had a neutral meaning but was spoken in either a happy or sad emotional tone of voice. Participants were instructed to attend to and categorize the face, and to ignore the voice, in a two-alternative forced-choice task. The results showed a clear influence of the task-irrelevant auditory modality on the target visual modality. When asked to identify the facial expression, while ignoring the simultaneous voice, participants’ judgments were nevertheless influenced by the tone of the voice and vice versa.

Perception of Emotion From Body and Voice More recently, researchers examining the integration of emotional signals from different modalities have started to pay attention to bodily expressions. A handful of studies have examined body expression and face combinations [37, 38], body and sound stimuli (e.g., [40]), as well as body and voice combinations (e.g., [38, 39]). The results follow a similar pattern to studies of emotional faces and voices. For example, in [38], the authors used a paradigm similar to that of de Gelder and Vroomen [34] but tested for the effect of body expressions. Participants were presented with static images of whole-body expressions combined with short vocal verbalizations. The results indicate that the perceived whole-body expression influenced the recognition of vocal prosody: when observers made judgments about the emotion conveyed in the voice, recognition was biased toward the simultaneously perceived body expression.

Perception of Emotion From Face and Contextual Information A few studies have examined interactions between emotional faces paired with contextual information (e.g., [42,43,44]). Such experimental paradigms reveal that emotion perception is not driven by information in the face, body or voice alone but also derives from contextual information in the environment, such as the emotion-eliciting situation in which the perceived emotion occurs, or the observer’s current emotional state (e.g., [43, 45]). Using fMRI, Mobbs et al. [42] demonstrated that pairing identical faces with either neutral or emotionally salient contextual movies results in altered attributions of both facial expression and mental state. In this study, evaluators were presented with 4 s of a movie (positive, negative, or neutral) and were then shown an image of an emotional face (happy, fearful, or neutral). Evaluators rated the combined presentations. Faces presented with a positive or negative context were rated significantly differently than faces presented in a neutral context. Furthermore, fMRI data showed that pairings between faces and emotional movies resulted in enhanced BOLD responses in several brain regions which may act to guide appropriate choices across altering contexts. In another study [44], situational cues in the form of short vignettes were found to influence the labelling of subsequently presented facial expressions (e.g., a sad face was labelled as “sad” when presented in isolation but was labelled as “disgust” when preceded by a disgust-related vignette). Niedenthal et al. [43] investigated congruent and incongruent pairings of facial expressions and surrounding context. When the surrounding emotional context did not match the facial expression, observers often reinterpreted either the facial expression (i.e., the face does not reveal the person’s real feelings) or altered their interpretation of the contextual situation.

2.2 Multisensory Interaction Research in HCI

Research on the integration of multiple emotional signals is a relatively new topic in the areas of HCI and HRI. However, accumulating evidence from recent studies with anthropomorphic ECAs (e.g., [26,27,28,29]) and robots (e.g., [6, 23,24,25, 46, 47]) shows that MI effects are also highly pronounced in the perception of synthetic emotional expressions. A number of HCI studies have used ECAs to investigate the experimental conflict situation where an observer receives incongruent information from two different sensory modalities (i.e., vocal prosody, facial expressions or body expressions). For example, Clavel et al. [26] studied the role of face and body in the recognition of emotional expressions of an ECA, using congruent emotional expressions, where the emotions expressed by the ECA’s face and body matched, and incongruent expressions, where the emotions expressed across the two modalities were mismatched. Their results showed that emotion recognition improves when the facial expression and body posture are congruent. The authors also reported that emotional judgments were primarily based on the information displayed by the face, although recognition accuracy improved when congruent postures were presented. Mower et al. [28] investigated the interaction between facial and vocal expressions and their role in the recognition of an ECA’s emotional expressions. In this study, the authors combined human emotional voices with synthetic facial expressions of an ECA, to create ambiguous and conflicting audio-visual pairs.

The results indicated that observers integrate natural audio cues and synthetic video cues only when the emotional information is congruent across the two channels. Due to the unequal level of expressivity, the audio was shown to bias the perception of the evaluators. However, even in the presence of a strong audio bias, the video data were shown to affect human perception. Taken together, the abovementioned results from ECA studies that have presented congruent and incongruent auditory and visual stimuli at the same time, suggest that emotional information conveyed in one modality influences the processing of emotional information in the other modality and that congruent emotional information tends to facilitate emotion recognition.

Studies with ECAs also report that congruence is associated with increased expressiveness (e.g., [23]), likability (e.g., [25, 27]) and trust towards the agents (e.g., [6, 23]). Creed et al. [27] investigated the psychological impact of a virtual agent’s mismatched face and vocal expressions (e.g., a happy face with a concerned voice). The mismatched expressions were perceived as more engaging, warm, concerned and happy in the presence of a happy or warm face (as opposed to a neutral or concerned face) and in the presence of a happy or warm voice (as opposed to a neutral or concerned voice). Gong and Nass [29] tested participants’ responses to combinations of human versus humanoid (human-like but artificial) faces and voices using a talking-face agent. The pairing of a human face with a humanoid voice, or vice versa, led to less trust than the pairing of a face and a voice from the same category, either human or humanoid.

2.3 Multisensory Interaction Research in HRI

With the exception of a handful of studies (e.g., [6, 46, 47]) that examine the perception of robotic facial expressions in the presence of incongruent contextual information (i.e., movie clips or pictures), research on the perception of multimodal emotional signals from robots has mainly compared responses to unimodal versus congruent bimodal emotional stimuli (e.g., [23,24,25]). These studies have examined the perception of emotional expressions involving faces and voices [23], faces and gestures [24], gestures and voices [25] and face-voice-gesture [23] combinations. The results follow a similar pattern to ECA studies: recognition accuracy is higher, and attitudes towards robots are more favourable, when participants observe congruent bimodal expressions rather than unimodal expressions. For instance, Costa et al. [24] showed that congruent gestures are a valuable addition to the recognition of robot facial expressions. Another study [23], using speech, head-arm gestures, and facial expressions, showed that participants rated bimodal expressions consisting of head-arm gestures and speech as more clearly observable than unimodal expressions consisting only of speech. Salem et al. [25] showed that a robot is evaluated more positively when hand and arm gestures are displayed alongside speech.

A small number of HRI studies (e.g., [6, 46, 47]) have examined the perception of robotic facial expressions in the presence of incongruent contextual information. The context was manipulated by having participants watch emotion-eliciting pictures or movie clips or listen to news clips with positive or negative emotional valence. Participants were then asked to rate the facial expressions of a robot (congruent vs. incongruent with the contextual valence). Overall, results showed that the recognition of robotic facial expressions is significantly better in the presence of congruent context, as opposed to no context or incongruent context. Providing incongruent context can even be worse than providing no context at all [47]. Furthermore, [46] showed that when the expressions of a robot are not appropriate given the context, subjects’ judgments are more biased by the context than the expressions themselves. Finally, results also suggest that trust towards the robot is decreased when the robot’s emotional response is incongruent with the affective state of the user [6]. The findings from the above-mentioned studies suggest that the recognition of robot emotional expressions and attitudes towards robots can be affected by a surrounding context, including the emotion-eliciting situation in which the expression occurs and the observer’s emotional state [43].

Far less is known about the effects of mismatched or incongruous multimodal emotional information, especially when a robot conveys incongruous information through different channels. To the best of our knowledge, there are no HRI studies that investigate the experimental conflict situation where an observer receives incongruous information from a robot’s body and voice within the context of a social interaction scenario. Given that the voices and whole-body expressions of humanoid robots (such as NAO or Pepper) are increasingly used together to convey emotions, it is natural to ask how incongruous emotional cues from these two modalities interact with each other, as well as with the contextual situation in which the multimodal emotion occurs. Emotional expressions of humanoid robots are especially vulnerable to such conflicts, and artificial experimental conflicts produced in the laboratory can be seen as simulations of natural ones. Conflicts result from two main types of factors. One factor is that synthetic modalities, such as the face and body of humanoids, typically have only a few degrees of freedom, and synthetic speech cannot yet efficiently portray human-like emotions. Consequently, a conflict or mismatch may be created if, for example, a sad and empathic body expression is coupled with a monotone synthetic voice. Another source of conflict is noise, which usually affects one modality at a time (e.g., vision or audition). In these situations, the presented information may not adequately express an intended emotion, and it is the role of the observer to decide how to integrate incomplete or incongruous audio-visual information. As these examples highlight, conflict situations, where two sensory modalities receive incongruous information, can easily occur in the context of social HRI. It is therefore important to investigate how human observers, or robot interaction partners, integrate different and incongruous emotional channels to arrive at emotional judgments about robots.

Table 1 The \(3\times 3\) experimental design with the two independent variables—Socio-emotional context of the interaction and Emotional congruence of the robot’s reaction—resulting in 9 experimental conditions

3 Materials and Methods

3.1 Experimental Design

We conducted a laboratory human-robot interaction experiment where participants were invited to watch movie clips, together with the humanoid robot Pepper. We manipulated the socio-emotional context of the interaction by asking participants to watch three emotion-eliciting (happiness, sadness, surprise) movie clips alongside the robot. Emotion elicitation using movie clips is a common experimental manipulation used in psychology studies of emotions [48], and has been successfully used in HRI studies to elicit emotional responses in healthy individuals in the laboratory (e.g. [47, 49]). We also manipulated the emotional congruence of the multimodal reactions of the robot (consisting of vocal expressions and body postures) to each movie clip as follows (see Table 1):

  • In the congruent condition, the emotional valence of the multimodal reaction of the robot was congruent with the valence of the socio-emotional context of the interaction (elicited by the movie clip). For example, the robot expresses sadness in response to a sad movie clip.

  • In the contextually incongruous condition, the emotional valence of the multimodal reaction of the robot was incongruous with the valence of the socio-emotional context of the interaction. For example, the robot expresses happiness in response to a sad movie clip.

  • In the cross-modally incongruous condition, the multimodal reaction of the robot contained both congruent and incongruous cues with respect to the valence of the socio-emotional context of the interaction. For example, the robot expresses happy vocal expressions and sad body postures in response to a happy movie clip.

Fig. 1 The 2D valence-arousal model of emotion proposed by Russell [50]

In the context of this study, (in)congruence is defined based on emotional valence (i.e., the positivity/negativity of the emotion), according to the two-dimensional valence-arousal model of emotion proposed by Russell [50] (see Fig. 1). A number of previous studies have also suggested that the effects of congruence may vary by valence (e.g., [46, 47, 51]).

We chose to investigate the emotions of happiness, sadness, and surprise for a number of reasons. Firstly, happiness, sadness, and surprise are all “social emotions” [52], namely emotions that serve a social and interpersonal function in human interactions. This category of emotions is especially useful for social robots. Second, the expression of happiness, sadness, and surprise through body motion and vocal prosody has often been studied; thus, by choosing these emotions, we were able to find reliable sources for the design of the audio-visual stimuli. Finally, we chose emotions which belong to different quadrants of the valence-arousal space [50]. As shown in Fig. 1, happiness and surprise are both arousing emotions, which vary only on the valence dimension. Happiness has positive valence, while surprise can have any valence from positive to negative. Both of these emotions contain clear action components in the body expression (in contrast to a sad body expression) [53]. In fact, body expressions of happiness and surprise share physical characteristics (i.e., large, fast movements and vertical extension of the arms above the shoulders) [53, 54]. Sadness and happiness, on the other hand, differ in both valence and arousal; happiness has high arousal and positive valence, while sadness has low arousal and negative valence. Happiness and sadness share minimal body and vocal characteristics. To create prominent incongruous stimuli, we combined happiness with sadness, and sadness with surprise.

We asked participants to label the emotional expressions of the robot and to rate the robot in terms of believability, perceived intelligence, and likability. In addition to these quantitative measures, we collected dispositional factors, namely the dispositional empathy of the participants. Empathy is defined as an affective response stemming from the understanding of another’s emotional state or what the other person is feeling or would be expected to feel in a given situation [55]. It was included in the study since evidence suggests that individuals with a low level of dispositional empathy achieve lower accuracy in decoding facial expressions of humans [56] as well as emotional expressions of robots [12, 57].

3.2 Participants

Participants were recruited through online and university advertisements. In total, 30 participants (mean age = 29.5 years, SD = 4.82; 47% female, 53% male) who met the inclusion criteria (at least 18 years of age, basic English skills) were invited to the lab and completed the study. Participants gave informed consent and received monetary compensation for their participation (15 Euros).

3.3 Setting and Apparatus

The robot used in the study was Pepper by Softbank Robotics; a human-like robot with a full-motion body with 20 degrees of freedom. The experiment was carried out in a lab, furnished as a living-room environment, with a sofa, a small table with a laptop computer, and a large TV screen (see Fig. 2). The participants sat on the sofa, facing the TV screen, and the robot was placed between the participant and the TV, slightly to the right of the TV screen. Throughout the experimental session, the participant was observed via the built-in camera of the robot. A trained experimenter, in an adjacent room, utilized the video-feed to trigger the robot’s emotional behaviour promptly.

Fig. 2 (Left) Overview of the experimental setting. The robot is facing the participant to express an emotional reaction after a movie clip. (Right) Examples of body expressions for (a) the happiness sequence (straight head, straight trunk, vertical and lateral extension of arms), (b) the sadness sequence (forward head bend, forward chest bend, arms at side of trunk), and (c) the surprise sequence (backward head bend, backward chest bend, vertical extension of arms)

3.4 Stimulus Material

3.4.1 Emotion Elicitation Movie Clips

Each participant watched three short emotion-eliciting movie clips, extracted from the following commercially available movies: An Officer and a Gentleman (happiness), The Champ (sadness), and Capricorn One (surprise). Target emotions and details about the movies are listed in Table 2. The procedure for validating the effectiveness of these clips in eliciting the target emotions is described by Rottenberg et al. [57]. For a specific description of the scenes, see the Appendix of [57]. The order of presentation of the clips was randomized for each participant.

Table 2 Target emotions and corresponding emotion-eliciting movies used in the experiment

3.4.2 Robot Emotional Expressions

In response to each movie clip, the robot expressed a multimodal emotional expression consisting of two modalities: auditory (vocal prosody), and visual (whole-body expression). The facial expression of the robot remained unchanged across all the stimuli (Pepper has a static face).

Table 3 Motion dynamics (velocity, amplitude), body animations (head, torso, arms) and vocal prosody characteristics (pitch, timing, loudness, non-linguistic utterances) used to generate audio-visual expressions for the target emotions

Pitch, timing, and loudness are the features of speech that are typically found to correlate with the expression of emotion through vocal prosody [17]. We manipulated these three features using the Acapela Text-to-Speech (TTS) engine (English language) to generate a set of vocal expressions. Our implementation was based on the phonetic descriptions of happiness, sadness, and surprise proposed by Crumpton et al. [17]. Given our interest in how vocal prosody (and not semantic information) influences emotion perception, the expressions were emotionally inflected sentences with factual descriptions of the scenes shown in the movie clips, without any meaningful lexical-semantic cues suggesting the emotions of the robot (e.g., “A boxer is lying injured on the table and asks to see his son. A young boy approaches and starts talking to him”). No information regarding the age or gender of the robot could be derived from the speech. HRI research suggests that there is potential for non-linguistic utterances (NLUs) to be used in combination with language to mitigate any damage to the interaction should TTS-generated language fail to perform at the desired level (e.g., [58]). In light of these findings, we decided to combine the sentences with a set of NLUs that emphasize the target emotion. NLUs were selected from an existing database of exemplars created in previous work [57] where evaluators rated each NLU using a forced-choice evaluation framework (Sadness, Happiness, Anger, Surprise, Neutral, I don’t Know and Other). All of the chosen NLUs were correctly recognized above chance level [57]. Table 3 summarizes the vocal prosody characteristics and NLUs we used for each target emotion. The resulting set was composed of three distinct sentence blocks (one for each video), each one recorded with different prosody features to portray two different emotions (congruent or incongruent with the situational valence). For example, for the “sadness” movie clip, the same sentence block was generated with two different prosody characteristics (pitch, timing, and loudness), and was combined with two different NLUs, for happiness and sadness respectively.
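As a rough illustration of this parameterization (not the implementation used in the study), the per-emotion prosody settings and NLUs can be thought of as a simple lookup table applied when rendering each sentence block. The numeric values, file names, and the `synthesize` interface in the sketch below are hypothetical placeholders rather than the settings reported in Table 3:

```python
# Hypothetical sketch (not the study's implementation): a lookup table of
# per-emotion prosody settings and non-linguistic utterances (NLUs).
# The numeric values and file names are placeholders; the real settings
# are reported in Table 3.
PROSODY = {
    "happiness": {"pitch": 1.15, "rate": 1.10, "volume": 1.0, "nlu": "laugh.wav"},
    "sadness":   {"pitch": 0.85, "rate": 0.80, "volume": 0.7, "nlu": "sigh.wav"},
    "surprise":  {"pitch": 1.25, "rate": 1.20, "volume": 1.0, "nlu": "gasp.wav"},
}

def render_utterance(sentence, emotion, tts_engine):
    """Render one sentence block with the prosody of the given emotion.

    `tts_engine` stands in for whichever TTS back end is used (the study
    used the Acapela engine); a `synthesize` call taking pitch, rate and
    volume multipliers is assumed here purely for illustration.
    """
    p = PROSODY[emotion]
    return tts_engine.synthesize(
        sentence, pitch=p["pitch"], rate=p["rate"], volume=p["volume"]
    )
```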

The vocal prosody stimuli were synchronized with (congruent and incongruent) body movements to create the audio-visual emotional expressions of the robot. The body expressions of the robot were modelled after the way humans move their head, torso, and arms to express emotions. Sources for our implementation were studies investigating the relevance of body posture and body movement features (i.e., velocity and amplitude) in conveying and discriminating between basic emotions in humans ([53, 54, 59, 60]). De Silva and Bianchi-Berthouze [59] found that vertical features and features indicating the lateral opening of the body are informative for separating happiness from sadness. For instance, hands are raised and significantly more extended to indicate happiness and remain low along the body for sadness [59]. In a study by De Meijer [53], trunk movement (ranging from stretching to bowing) was used to distinguish between positive and negative emotions. Based on these findings, in our design, happiness is characterized by a straight robot trunk, head bent back, a vertical and lateral extension of the arms and large, fast movements. Surprise is characterized by a straight trunk, backward stepping, and fast movements, whereas sadness is characterized by a bowed trunk and head, downward and slow body movements. In a pre-evaluation online survey, we validated that people correctly perceived the emotion that each isolated body part is intended to convey [57]. We selected the animations that received the highest overall recognition score and used them to generate more complex animations for this study. Table 3 summarizes the whole-body expressions we used for each target emotion. Examples of body expressions can be seen in Fig. 2.
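The following is a minimal sketch of how such a synchronized audio-visual expression could be triggered on Pepper, assuming the NAOqi Python SDK; the joint angles, timings and IP address are illustrative placeholders, not the animations actually used in the study (those are summarized in Table 3):

```python
# Illustrative sketch (not the study's actual implementation): plays a
# whole-body "sadness" animation while speaking an emotionally inflected
# sentence, assuming the NAOqi Python SDK available on Pepper.
from naoqi import ALProxy

ROBOT_IP, ROBOT_PORT = "192.168.1.10", 9559  # placeholder address

motion = ALProxy("ALMotion", ROBOT_IP, ROBOT_PORT)
tts = ALProxy("ALTextToSpeech", ROBOT_IP, ROBOT_PORT)

# Placeholder key frames: forward head/chest bend, arms along the trunk,
# slow movements (cf. the sadness description in Table 3).
names  = ["HeadPitch", "HipPitch", "LShoulderPitch", "RShoulderPitch"]
angles = [[0.35], [-0.20], [1.40], [1.40]]   # radians, illustrative
times  = [[2.5], [2.5], [2.5], [2.5]]        # slow onset for sadness

sentence = ("A boxer is lying injured on the table and asks to see his son. "
            "A young boy approaches and starts talking to him.")

# Start speech asynchronously so the body animation runs in parallel,
# keeping the auditory and visual channels synchronized.
tts.post.say(sentence)
motion.angleInterpolation(names, angles, times, True)
```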

3.5 Procedure

In order to avoid expectation effects, participants were told that the experiment focused on the ability of the robot to recognize emotions from audio-visual cues in the movie clips (“In this study, we test whether our robot can detect the emotional cues in the movie clips and can react accordingly”), rather than its actual aim. After reading a description of the experiment and signing a consent form, the participant was escorted to the lab where the experiment took place. Upon entering the room, the robot looked at the participant, waved and introduced itself (“Hello! I am Pepper. Welcome to the lab.”). The experimenter then left the room, and the robot uttered, “We are going to watch some movies together! Start the first clip when you are ready”, and turned towards the TV. While the participant watched a clip, the robot also looked at the TV. At the end of the clip, the robot turned towards the participant and expressed its emotional reaction to the clip (congruent, contextually incongruous or cross-modally incongruous). The duration of the robot’s reactions varied between 20 and 50 s. During the rest of the time, the robot displayed idle movements (i.e., gaze/face tracking, breathing). We decided not to include any other type of verbal interaction between the robot and the participant, in order to minimize possible biasing effects on the participant’s perception of the robot.

After the emotional reaction of the robot, an on-screen message on the laptop prompted the participant to answer an online questionnaire (built using the online tool Limesurvey) with questions about their experience of the movie clip and their perception of the robot’s reaction (see Sect. 3.6). To limit carryover effects from one movie to the next, a 1-min rest period was enforced after completing the questionnaire. The participant was told to use this time to “clear your mind of all thoughts, feelings, and memories”, before watching the next clip. This approach was originally used by Gross et al. [61] in their experiments on emotion elicitation using movie clips.

At the end of the third emotional expression of the robot, an on-screen message prompted the participant to answer a series of questions about demographics and personality traits (see Sect. 3.6). Afterwards, the robot thanked the participant and said goodbye (“Thank you for participating in this experiment. Goodbye!”). The experimenter then entered the room, answered any potential questions, debriefed the participant about the real purpose of the experiment and gave the monetary compensation. The experiment took about 60 min on average.

Table 4 The seven items of the believability questionnaire, adapted from [62]

3.6 Measures

Manipulation Check—Experience of the Movie Clip To ascertain whether the desired emotion (happiness, sadness, surprise) had been properly elicited by the movie clip, we asked participants to report the most prominent emotion they experienced while watching the clip. Participants chose one option from a list of 11 emotions (amusement, anger, disgust, despair, embarrassment, fear, happiness/joy, neutral, sadness, shame, surprise) plus the option other.

Emotion Recognition We asked participants to label the most prominent emotion expressed by the robot in response to each movie clip. Participants chose one option from the same list of 11 emotions (amusement, anger, disgust, despair, embarrassment, fear, happiness/joy, neutral, sadness, shame, surprise) plus the option other.

Attitudes Towards the Robot—Believability We asked participants to rate their perceptions of the believability of the robot. Participants rated seven conceptually distinct dimensions of believability (awareness, emotion understandability, behaviour understandability, personality, visual impact, predictability, behaviour appropriateness), as defined by Gomes et al. [62]. Table 4 contains the assertions used for each dimension. All items were rated on a 5-point Likert scale ranging from 1 “Strongly disagree” to 5 “Strongly agree”, with an additional “I don’t know” option.

Attitudes Towards the Robot—Perceived Intelligence and Likability Participants rated the robot on the Perceived Intelligence and Likability dimensions of the Godspeed questionnaire [63]. All items were presented on a 5-point semantic differential scale.

Demographics and Personality Traits Participants reported basic socio-demographic information (age, gender, profession and previous experience with robots). Participants were also asked to fill in the Toronto Empathy Questionnaire [64], a 16-item self-assessment questionnaire assessing dispositional empathy.

3.7 Data Analysis

Manipulation Check—Experience of the Movie Clip Of the 90 movie clip ratings we obtained (30 participants \(\times \) 3 clips per participant), 12 ratings were inconsistent with the intended situational valence manipulation (i.e., the movie clip failed to elicit the targeted emotion in the participant) and were thus excluded from further analyses. Consequently, the statistical analysis reported below was performed on the basis of a final sample of 78 ratings (Happy clip n = 25, Surprise clip n = 25, Sad clip n = 28).

Exploratory Regression Analysis As discussed in the previous sections, various factors seem to influence the perception of robotic emotional expressions (i.e., the socio-emotional context of the interaction, incongruence between modalities, the rater’s gender and dispositional empathy). Therefore, the first step of the data analysis was an exploratory regression analysis, performed to identify which factors best account for whether or not participants correctly recognized the emotional expressions of the robot in this study. We coded the dependent variable (emotion recognition) as a binary value indicating whether or not the participant accurately recognized the expression of the robot, and ran a logistic regression (a method typically used for such exploratory analyses [65]) to ascertain the effects of (in)congruence (congruent, contextually incongruous and cross-modally incongruous conditions), emotion being expressed (happiness, sadness and surprise), gender and dispositional empathy score on the likelihood that participants accurately recognize the emotional expression of the robot.
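A sketch of how such an exploratory model can be fitted in Python with statsmodels is given below, assuming a data frame with one row per rating; the column names are hypothetical, and odds ratios are obtained by exponentiating the fitted coefficients:

```python
# Illustrative sketch of the exploratory logistic regression, assuming a
# pandas DataFrame with one row per rating and hypothetical columns:
# recognized (0/1), condition, emotion, gender, empathy.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("ratings.csv")  # placeholder path

model = smf.logit(
    "recognized ~ C(condition, Treatment('congruent'))"
    " + C(emotion, Treatment('happiness')) + C(gender) + empathy",
    data=df,
).fit()

print(model.summary())
# Odds ratios are the exponentiated coefficients (e.g., roughly 0.68 and
# 0.13 for the two incongruous conditions, per the values behind Table 5).
print(np.exp(model.params))
```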

Emotion Recognition—Effects of Incongruence In the second step of the analysis, the hit rate and unbiased hit rate [66] were analysed. These measures were chosen because we were interested in whether incongruence decreased the target emotion detection rate. The hit rate (\(Hs\)) is the proportion of trials in which a particular emotion is shown that is correctly labelled. Although \(Hs\) is one of the most frequently used measures of accuracy, this metric does not take into account false alarms (i.e., the number of times in which a particular emotion label is incorrectly used) or personal response biases (e.g., a bias to respond “happy” for all expressions). The unbiased hit rate (\(Hu\)), proposed by Wagner [66], takes this problem into account and yields more precise accuracy rates. The computation of \(Hu\) scores involves “the joint probability that a stimulus is correctly identified (given that it is presented) and that a response is correctly used (given that it is used)” (Wagner [66], p. 16). In other words, in order to measure recognition accuracy for a given emotion, the number of misses (e.g., the number of times in which a particular emotion was present and the participant responded that it was absent) as well as the number of false alarms (e.g., the number of times in which the participant responded that the target stimulus was present when in reality it was not) are taken into account. \(Hu\) scores were computed for each emotional expression of the robot as follows:

$$\begin{aligned} Hu = \frac{A_i}{B_i} \times \frac{A_i}{C_i} \end{aligned}$$

where \(A_i\) is the frequency of hits for emotion \(i\), \(B_i\) is the number of trials where \(i\) is the target, and \(C_i\) is the frequency of \(i\) responses (hits and false alarms).
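As a concrete illustration, both accuracy measures can be computed from a confusion matrix of raw counts (rows are target emotions, columns are responses); the counts below are made-up placeholders, not the study’s data:

```python
# Minimal sketch: stimulus hit rate (Hs) and Wagner's unbiased hit rate (Hu)
# from a confusion matrix whose rows are target emotions and whose columns
# are the responses given by participants (counts are illustrative).
import numpy as np

labels = ["happiness", "sadness", "surprise"]
confusion = np.array([
    [8, 0, 0],   # target: happiness
    [1, 9, 0],   # target: sadness
    [2, 0, 6],   # target: surprise
], dtype=float)

for i, label in enumerate(labels):
    A_i = confusion[i, i]          # hits for emotion i
    B_i = confusion[i, :].sum()    # trials where i was the target
    C_i = confusion[:, i].sum()    # times the response "i" was used
    Hs = A_i / B_i
    Hu = (A_i / B_i) * (A_i / C_i) if C_i > 0 else 0.0
    print(f"{label}: Hs = {Hs:.2f}, Hu = {Hu:.2f}")
```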

To investigate if people recognized the robot’s emotional expressions correctly, we compared the emotion recognition ratings of the congruent and contextually incongruous conditions against the ideal distribution using a Chi-Square test. This statistical analysis approach was previously used in [11]. For instance, if ten people had to choose the right expression for the robot, out of a list of three different expressions (e.g., happy, sad, neutral), and the robot expressed “sadness”, then the ideal distribution would be 0, 10, 0.

Next, to investigate the effects of contextual incongruence (i.e., the conflict situation where the robot’s reaction is incongruous with the socio-emotional context of the interaction), we compared the emotion recognition ratings of the congruent conditions (e.g., happy expression in response to a happy movie clip) against the emotion recognition ratings of the contextually incongruous conditions (e.g., happy expression in response to a sad movie clip) by means of Chi-Square tests.

Thirdly, to investigate the effects of cross-modal incongruence (i.e., how incongruous auditory (vocal prosody) and visual (whole-body expression) cues are processed when presented simultaneously), we compared the emotion recognition ratings of the congruent condition (e.g., happy body and happy voice) against the emotion recognition ratings of the cross-modally incongruous condition (e.g., happy body and sad voice, or sad body and surprised voice) by means of Chi-Square tests.
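A minimal sketch of these two kinds of Chi-Square comparisons using scipy is given below; the counts are invented placeholders, and the ideal distribution is smoothed slightly so that no expected count is zero, since the goodness-of-fit test requires non-zero expected frequencies:

```python
# Illustrative sketch of the Chi-Square tests described above (scipy);
# all counts are placeholders, not the study's data.
import numpy as np
from scipy.stats import chisquare, chi2_contingency

# (1) Goodness of fit against the "ideal" distribution: e.g., 10 raters,
# three response options, robot expressing sadness -> ideal counts (0, 10, 0).
observed = np.array([1, 8, 1])
ideal = np.array([0.5, 9.0, 0.5])  # slightly smoothed so no expected count is zero
print(chisquare(observed, f_exp=ideal))

# (2) Comparing the response distributions of two conditions
# (e.g., congruent vs. contextually incongruous) as a contingency table.
table = np.array([
    [8, 0, 0],  # congruent condition: counts per response option
    [6, 2, 1],  # contextually incongruous condition
])
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p:.3f}, dof = {dof}")
```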

Table 5 Results of the regression analysis of congruence, emotion being expressed, gender and dispositional empathy score on the likelihood that participants accurately recognize the emotional expression of the robot

Attitudes Towards the Robot—Effects of Incongruence In the last part of the analysis, we investigated the effects of contextual incongruence and cross-modal incongruence on participants’ attitudes towards the robot. The Believability, Perceived Intelligence, and Likability scores were calculated by summing the items of each scale. Cronbach’s alpha was calculated to assess the internal consistency of the scales (all scales achieved a value higher than 0.7 and can thus be considered reliable). Since our data were not normally distributed, we used non-parametric tests suitable for ordinal numerical data. Specifically, the Kruskal–Wallis H test and subsequent Mann–Whitney U tests were used to determine whether there were statistically significant differences in the scores between the three experimental conditions (congruent, contextually incongruous, cross-modally incongruous).
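The corresponding non-parametric tests are available in scipy; a minimal sketch with placeholder score vectors (one summed questionnaire score per rating) follows:

```python
# Illustrative sketch of the non-parametric comparisons between the three
# conditions, using scipy; the score arrays are placeholder data, not the
# study's measurements.
from scipy.stats import kruskal, mannwhitneyu

scores_congruent  = [32, 30, 35, 33, 31]   # placeholder summed scale scores
scores_contextual = [18, 20, 17, 16, 19]
scores_crossmodal = [21, 19, 22, 18, 20]

# Omnibus test across the three conditions.
H, p = kruskal(scores_congruent, scores_contextual, scores_crossmodal)
print(f"Kruskal-Wallis: H = {H:.2f}, p = {p:.4f}")

# Post-hoc pairwise comparison (two-sided Mann-Whitney U test).
U, p = mannwhitneyu(scores_congruent, scores_contextual, alternative="two-sided")
print(f"Congruent vs. contextually incongruous: U = {U:.1f}, p = {p:.4f}")
```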

Table 6 Raw emotion recognition ratings, for each movie clip in the congruent (Cong.), cross-modally incongruous (Mod. Incong.) and contextually incongruous (Cont. Incong.) conditions

Toronto Empathy Questionnaire The Toronto Empathy Questionnaire score was calculated by reversing the inverted items and computing the summative score over all 16 items.
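A minimal scoring sketch, assuming the standard 0-4 response format of the 16-item questionnaire; the reverse-keyed item indices below are placeholders and should be taken from the published scoring key [64]:

```python
# Minimal sketch of the Toronto Empathy Questionnaire scoring, assuming
# 16 items rated 0-4; the reverse-keyed indices are placeholders - use
# the scoring key published with the questionnaire [64].
REVERSED_ITEMS = {1, 3, 6, 9, 10, 11, 13, 14}  # 0-based placeholder indices

def teq_score(responses, max_value=4):
    """Sum the 16 item responses after reversing the reverse-keyed items."""
    assert len(responses) == 16
    return sum(
        (max_value - r) if i in REVERSED_ITEMS else r
        for i, r in enumerate(responses)
    )
```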

4 Results

4.1 Exploratory Regression Analysis

A logistic regression was performed to ascertain the effects of congruence (congruence, cross-modal incongruence, contextual incongruence), emotion being expressed (happiness, sadness, surprise), gender and dispositional empathy score on the likelihood that participants accurately recognize the emotional expression of the robot. The logistic regression model was statistically significant (\(\chi ^2(6) = 16.64\), p = 0.01). The model explained 28.9% (Nagelkerke \(R^2\)) of the variance in emotion recognition accuracy and correctly classified 80.5% of cases. The results of the analysis are presented in Table 5 and discussed below.

Congruence There was a significant association (p = .01) between the congruent condition and the likelihood of correctly recognizing the emotional expression of the robot. Additionally, the cross-modally incongruous condition was significantly associated with a decreased likelihood of correctly recognizing the emotional expression of the robot (p = .01). There was no significant association between the contextually incongruous condition and the likelihood of correctly recognizing the emotional expression of the robot (p = .65). Nevertheless, the contextually incongruous condition was associated with a decreased likelihood of correctly recognizing the robot’s emotional expression, compared to the congruent condition. An indication of the size of these effects can be seen in the odds ratios reported in Table 5 (the percentages below correspond to 1 minus the odds ratio; a worked conversion is given after the list):

  • The odds of people recognizing the emotional expression of the robot were 32% (or 1.47 times) lower in the contextually incongruous condition than in the congruent condition (baseline condition).

  • The odds of people recognizing the emotional expression of the robot were 87% (or 7.69 times) lower in the cross-modally incongruous condition than in the congruent condition (baseline condition).
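To make the conversion explicit, taking the cross-modally incongruous condition as an example, the reported percentage and the “times lower” figure both follow from the same odds ratio:

$$\begin{aligned} \text {OR} \approx 0.13 \quad \Rightarrow \quad 1 - \text {OR} \approx 0.87 \; (87\%), \qquad \frac{1}{\text {OR}} \approx 7.69 \text { times lower odds.} \end{aligned}$$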

The effects of contextual congruence and cross-modal congruence on the likelihood that participants accurately recognized the emotional expression of the robot can also be seen in the hit rate (\(Hs\)) and unbiased hit rate (\(Hu\)) scores across the three conditions (see Table 7 and Sect. 4.2).

Emotion Being Communicated The results of the regression analysis showed no significant association between the type of emotion being communicated by the robot (i.e., happiness, sadness, surprise) and emotion recognition accuracy (p = .48). However, as indicated by the odds ratios (Table 5), the expressions of sadness were 57.8% (or 2.36 times) less likely to be recognized than expressions of happiness. Likewise, expressions of surprise were 55.9% (or 2.26 times) less likely to be recognized than expressions of happiness. As expected, due to its ambivalent nature, the emotion of surprise was the least well-recognized emotion in the congruent condition. However, the overall lowest recognition accuracy score was reported for the emotion of sadness, in the cross-modally incongruous condition (i.e., when the robot expressed sadness and happiness simultaneously).

Gender The results of the regression analysis showed a borderline-significant association between gender and emotion recognition accuracy (p = .05). Female participants were 3.43 times (or 70.9%) less likely to recognize the robot’s emotion than male participants.

Dispositional Empathy The total score of the Toronto Empathy Questionnaire can range from 0 (no empathy at all) to 64 (total empathy) [64]. Our analysis resulted in a mean value of 47.13 (SD = 6.41). There were no significant differences between female (mean = 49.42, SD = 5.44) and male (mean = 45.12, SD = 6.51) participants. Our results for female participants were slightly higher than the range given in the source of the questionnaire [64] (between 44 and 49 points). Likewise, our results for male participants were slightly higher than the source (between 43 and 45 points), indicating that, overall, our participants had slightly higher dispositional empathy than average. The results of the regression analysis showed no significant association between empathy score and emotion recognition accuracy (\(p=.59\)). However, this analysis was performed after controlling for our manipulation check (i.e., participants who did not report the emotion targeted by the movie clip were excluded). It is possible that these individuals had particularly low dispositional empathy scores, and that a regression analysis including their responses would reveal different results.

4.2 Emotion Recognition Accuracy: Effects of Incongruence

To provide all the relevant information about false alarms and potential biases in the emotion recognition ratings of participants, Table 6 provides an overview of the detailed confusion matrices for the data. Table 7 summarizes the derived \(Hs\) and \(Hu\) scores across all experimental conditions and for the three target emotional expressions. \(Hu\) scores range from zero to one, with one indicating that all stimuli of an emotion were correctly identified and that the respective emotion was never falsely chosen for a different emotion. We report the \(Hu\) scores because of the popularity of this metric in the literature. However, we feel that it is not ideal for this study, because some of our stimuli are not easily recognizable expression prototypes but rather complicated expressions, often based on the combination of mismatched emotional information across modalities. Therefore, in most cases, and especially in the cross-modally incongruous condition, \(Hu\) severely reduces the hit rate for those responses that do not fit the target category, assuming that only a single answer can be correct, namely the one corresponding to the intended emotion category. For this reason, in the following subsections, we discuss the emotion recognition accuracy based on the \(Hs\) scores.

In the congruent condition, the mean \(Hs\) was 88% (SD = 0.13), with scores ranging from 75% to 100% for the three different emotions. In the contextually incongruous condition, the mean \(Hs\) was 61% (SD = 0.31), with scores ranging from 29% to 89%. In the cross-modally incongruous condition, the mean \(Hs\) was 77% (SD = 0.26). In this particular condition, we consider that there is no single correct target emotion, since the robot simultaneously expresses two different emotions (i.e., happy body and sad voice). Therefore, separate \(Hs\) and \(Hu\) scores were calculated for each of the two target emotions expressed by the robot (see Table 7). For example, in the cross-modally incongruous condition where the robot simultaneously expressed happiness and sadness in response to a happy clip, 50% of the participants rated the expression as happiness, while 38% rated the expression as sadness (see Table 7). The remaining 12% corresponds to other emotion labels (see confusions in Table 6).

When participants watched a happy movie clip, and the robot expressed happiness (congruent condition), all the participants rated the expression of the robot as happy. However, when the robot expressed happiness in response to a sad clip (contextually incongruous condition), only 67% of the participants rated the expression of the robot as happy. The remaining 33% said that the robot was “amused” by the movie. When the robot expressed an audio-visual behaviour consisting both of happiness and sadness in response to a happy clip (cross-modally incongruous condition), 50% of the participants said that the robot was happy, while 37% found the robot to be sad.

When participants watched a sad movie clip, and the robot expressed sadness (congruent condition), 90% rated the expression of the robot as sadness. When the robot expressed sadness in response to a happy clip (contextually incongruous condition), 89% rated the expression of the robot as sadness. In other words, sadness was well recognized both in a congruent context (i.e., a context with similar emotional valence) and in an incongruous context (i.e., a context with opposing emotional valence).

Table 7 Measures of accuracy across the three experimental conditions. Stimulus hit rate (\(Hs\)) and unbiased hit rate (\(Hu\))
Table 8 Chi-square tests on emotion recognition ratings for the emotional expressions (EE) of happiness, sadness and surprise. Figures in bold show non-significant results (\({p}>.01\)), meaning that the ratings did not differ significantly between the two conditions being compared

To add more depth to our analysis, we also investigated what would happen if the robot expressed sadness in response to a clip eliciting the ambivalent emotion of surprise (contextually incongruous condition). In this case, 57% of the participants rated the robot’s expression as sadness, while 43% rated the robot’s expression as surprise. When the robot expressed an audio-visual behaviour consisting of both happiness and sadness in response to a sad clip (cross-modally incongruous condition), 22% of the participants said that the robot was sad, while 44% found the robot to be happy. Finally, when participants watched a movie clip eliciting the feeling of surprise, and the robot expressed a surprised reaction (congruent condition), 75% of the participants rated the robot’s expression as surprise. Interestingly, when the robot expressed an audio-visual behaviour consisting of both surprise and sadness (cross-modally incongruous condition), 78% of the participants said that the robot was surprised, while no one rated the expression of the robot as sad.

Table 8 summarizes the results of the Chi-Square tests. The tests comparing the emotion recognition ratings of the congruent condition against the ideal distribution were non-significant for all three emotions. In other words, there was no significant difference between the emotion recognition ratings of the participants and the ideal distribution (which does not mean that the result equals the ideal distribution, only that it did not differ significantly from it). Likewise, the Chi-Square tests comparing the emotion recognition ratings of the contextually incongruous condition against the ideal distribution were non-significant for all three emotions. These results indicate that the participants were able to recognize the emotional expressions of the robot, both in the baseline congruent condition and when the expressions were incongruous with the emotional valence of the interaction context. These findings are in line with the results of the regression analysis (i.e., there was no significant association between the contextually incongruous condition and the likelihood of correctly recognizing the emotional expression of the robot).

To investigate further the effects of contextual incongruence on the likelihood of correctly recognizing the emotional expression of the robot, we compared the emotion recognition ratings of the participants in the congruent condition (i.e., happy emotional expression in response to a happy movie clip) and the contextually incongruous condition (i.e., happy emotional expression in response to a sad movie clip) using Chi-Square tests. The result was not significant (\(\chi ^2\) (11, N = 8/N = 9) = 3.66, \(p = .98\)). In other words, there was no significant difference between the emotion recognition ratings of the participants in these two conditions. However, there was a significant difference in the ratings for the emotional expression of sadness in the congruent condition, compared to the contextually incongruous condition where the robot expressed sadness in response to a movie clip eliciting happiness (\(\chi ^2\) (11, N = 10/N = 9) = NA, \(p = .00\)), as well as the contextually incongruous condition where the robot expressed sadness in response to a movie clip eliciting surprise (\(\chi ^2\) (11, N = 10/N = 7) = NA, \(p = .00\)).

The regression analysis revealed a negative association (\(p =.01\)) between the cross-modally incongruous condition and the likelihood of correctly recognizing the emotional expression of the robot. To investigate this finding further, we compared the emotion recognition ratings of the congruent condition and the cross-modally incongruous condition using Chi-Square tests. There was no significant effect of cross-modal congruence on the ratings of happiness (\(\chi ^2\) (11, N = 8) = 8.00, \(p = .71\)). In other words, the ratings of the participants for the expression of happiness did not differ significantly between the congruent and cross-modally incongruous conditions. However, there was a significant effect on the ratings of sadness (\(\chi ^2\) (11, N = 10/N = 9) = 30.00, \(p = .0016\)) and surprise (\(\chi ^2\) (11, N = 9) = NA, \(p = .00\)). In other words, the emotion ratings of the participants for these expressions differed significantly between the congruent and cross-modally incongruous conditions.

Fig. 3 Mean ratings for Believability (7 items) and the Godspeed questionnaire dimensions Perceived Intelligence (5 items) and Likability (5 items)

4.3 Attitudes Towards the Robot: Effects of Incongruence

Believability A Kruskal–Wallis H test showed a statistically significant difference in the believability scores (mean of the seven distinct dimensions) between the three experimental conditions (congruent, contextually incongruous, cross-modally incongruous), H (2) = 21.16, p < 0.001, with a mean rank of 54.33 for the congruent condition, 33.04 for the cross-modally incongruous condition and 27.50 for the contextually incongruous condition (Fig. 3). Subsequent Mann–Whitney U tests were used to make post-hoc comparisons between conditions. The believability scores for the congruent condition were significantly different from those for the contextually incongruous condition (U = 91.00, p < .001), with significantly higher scores for the congruent condition (Mdn = 35.00) than for the contextually incongruous condition (Mdn = 16.64). The believability scores for the congruent condition were also significantly different from those for the cross-modally incongruous condition (U = 145.50, p < .001); more specifically, the scores were significantly higher for the congruent condition (Mdn = 32.83) than for the cross-modally incongruous condition (Mdn = 18.90). There was no significant difference between the contextually incongruous and the cross-modally incongruous conditions (U = 271.50, p = .42).

Perceived Intelligence A Kruskal–Wallis H test showed a statistically significant difference in the perceived intelligence scores between the three conditions, H(2) = 20.46, p < .001, with a mean rank of 54.23 for the congruent condition, 35.75 for the cross-modally incongruous condition and 26.54 for the contextually incongruous condition (Fig. 3). Subsequent Mann–Whitney tests showed that the perceived intelligence scores for the congruent condition were significantly different from the contextually incongruous condition (U = 103.50, p < .001) and the cross-modally incongruous condition (U = 163.50, p < .005). There was no significant difference between the scores for the contextually incongruous and cross-modally incongruous conditions (U = 235.00, p = .08). The perceived intelligence scores were significantly higher for the congruent condition (Mdn = 34.52) than for the contextually incongruous condition (Mdn = 17.14), and significantly higher for the congruent condition (Mdn = 33.21) than for the cross-modally incongruous condition (Mdn = 19.79).

Likability A Kruskal–Wallis H test showed a statistically significant difference in the likability scores between the three conditions, H(2) = 8.25, p < .05, with a mean rank of 48.69 for the congruent condition, 31.30 for the cross-modally incongruous condition and 36.71 for the contextually incongruous condition (Fig. 3). Subsequent Mann–Whitney tests showed that the congruent condition was significantly different from the contextually incongruous condition (U = 180.50, p < .01) and the cross-modally incongruous condition (U = 230.50, p < .005). There was no significant difference between the scores for the contextually incongruous and the cross-modally incongruous conditions (U = 277.00, p = .36). The likability scores were significantly higher for the congruent condition (Mdn = 31.56) than for the contextually incongruous condition (Mdn = 20.22), and significantly higher for the congruent condition (Mdn = 30.63) than for the cross-modally incongruous condition (Mdn = 22.37).
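As an illustrative sketch of this analysis pattern (an omnibus Kruskal–Wallis test followed by pairwise Mann–Whitney post-hoc comparisons), the following Python snippet uses SciPy on hypothetical questionnaire scores; the values, group sizes, and the omission of a multiple-comparison correction are simplifications for illustration only, not our data or exact procedure.

```python
# Illustrative sketch only: hypothetical questionnaire scores, not the study data.
from itertools import combinations
from scipy.stats import kruskal, mannwhitneyu

# Hypothetical believability scores (mean of the seven items) per condition.
scores = {
    "congruent":                 [5.1, 4.8, 5.4, 4.9, 5.6, 5.0],
    "contextually_incongruous":  [3.2, 2.9, 3.5, 3.1, 2.8, 3.4],
    "cross_modally_incongruous": [3.6, 3.3, 3.9, 3.0, 3.7, 3.2],
}

# Omnibus test across the three independent groups.
h_stat, p = kruskal(*scores.values())
print(f"Kruskal-Wallis: H = {h_stat:.2f}, p = {p:.4f}")

# Post-hoc pairwise Mann-Whitney U tests (in practice a multiple-comparison
# correction such as Bonferroni would normally be applied).
for (name_a, a), (name_b, b) in combinations(scores.items(), 2):
    u_stat, p = mannwhitneyu(a, b, alternative="two-sided")
    print(f"{name_a} vs. {name_b}: U = {u_stat:.1f}, p = {p:.4f}")
```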

5 Discussion

5.1 Overview and Significance of Results

An exploratory regression analysis showed that cross-modal incongruence (i.e., the conflict situation where an observer receives incongruous emotional information across the auditory (vocal prosody) and visual (whole-body expressions) modalities) is associated with a decreased likelihood of correctly recognizing the emotional expression of the robot. In addition, both cross-modal incongruence and contextual incongruence (i.e., the conflict situation where the robot's reaction is incongruous with the socio-emotional context of the interaction) negatively influence attitudes towards the robot in terms of believability, perceived intelligence and likability. The significance of these findings is discussed in more detail below, and in Sect. 6 we provide a number of recommendations regarding design choices for humanoid robots that use several channels to communicate their emotional states in a clear and effective way.
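As a sketch of how such an exploratory analysis could be set up, assuming (for illustration only) a binary "correctly recognized" outcome and dummy-coded conditions with the congruent condition as the reference level, a logistic regression in statsmodels might look as follows; the data frame and its values are invented and do not correspond to our dataset.

```python
# Illustrative sketch only: hypothetical data layout, not the study dataset.
import pandas as pd
import statsmodels.formula.api as smf

# One row per observed expression: whether it was recognized correctly (0/1)
# and which experimental condition it was shown in.
df = pd.DataFrame({
    "recognized": [1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1],
    "condition": ["congruent", "congruent", "cross_modal", "contextual",
                  "cross_modal", "cross_modal", "congruent", "contextual",
                  "contextual", "congruent", "cross_modal", "contextual"],
})

# Binary logistic regression with the congruent condition as the reference level.
model = smf.logit(
    "recognized ~ C(condition, Treatment(reference='congruent'))", data=df
).fit(disp=False)
print(model.summary())
```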

Effects of Cross-modal Incongruence Interesting findings were obtained regarding the effects of cross-modal incongruence. Statistically significant results indicate that when emotional information is incongruous across the auditory and visual modalities, the likelihood that people accurately recognize the emotional expression of the robot is significantly decreased, compared to the situation where congruent information is presented across the two channels. Our findings are in line with previous work in the psychology, neuroscience and HCI literature and suggest that theories about MI in Human–Human interactions (e.g., [34,35,36,37,38,39,40,41,42,43,44,45]) and Human–Agent interactions (e.g., [26,27,28,29,30]) also extend to Human–Robot interactions. The descriptive analysis of the emotion recognition scores revealed that emotional expressions that combined a happy body with a sad voice (or vice versa) resulted in a confused perception, where the emotional expression of the robot was perceived by some people as happiness and by others as sadness. A few people also labeled one of the incongruous expressions as neutral (see confusions in Table 6). This suggests that neither the visual nor the auditory channel dominated the participants' perception, and therefore the responses of the participants were split between the two emotions expressed by the robot. Regarding those who rated the expression neither as happiness nor as sadness, it is possible that they tried to make the robot's emotional expression consistent across both channels, by rating it as something in between happy and sad (i.e., neutral or amused). On the other hand, when participants watched the robot expressing an incongruous multimodal combination of surprise and sadness, the highly expressive body and voice cues of surprise (i.e., large, fast movements, extension of the arms, a gasping utterance) dominated the more subtle cues of sadness (i.e., small, slow movements, arms at the side of the trunk), and as a result no one rated the expression of the robot as sadness (see Table 7).

Effects of Contextual Incongruence Previous work indicates that context affects the recognition of robot [46, 47] and human [45] emotional expressions. For instance, Niedenthal et al. [43] found that observers may be influenced by their own emotional state when asked to attribute emotional states to human facial expressions. In the first phase of our analysis, we found no significant association between incongruous context and the likelihood of correctly recognizing the emotional expression of the robot. Nevertheless, a more in-depth analysis of the participants' emotion recognition ratings showed that the recognition accuracy scores for all emotions were lower in the presence of an incongruous socio-emotional context than in a congruent context. The effects of contextual incongruence on the likelihood that participants accurately recognize the emotional expression of the robot were most prominent in the case of surprise, an ambivalent emotion which can potentially result in ambiguous emotion recognition ratings. Specifically, when the robot expressed sadness in response to a movie clip eliciting the emotion of surprise, we observed confused assessments of the robot's emotional expression and a significant drop in the recognition accuracy scores. When we took our analysis a step further and compared the participants' ratings for each emotion in the congruent and contextually incongruous conditions, statistically significant results showed that the recognition accuracy scores for the sadness and surprise expressions differed between these two conditions. For the case of happiness, we did not find a significant difference between the ratings of the participants in the two conditions. This finding does not undermine the importance of the context but instead suggests that, in our study, expressions of happiness had a more dominant effect than the socio-emotional context (elicited by the movie clip) on the emotion recognition ratings of the participants. This can be explained by the fact that happiness is a basic emotion which is, in general, easy to recognize.

Impact on Perceived Believability, Intelligence, and Likability With regard to the effects of incongruence on people's attitudes towards the robot, we found that both contextual incongruence and cross-modal incongruence significantly reduced participants' ratings of the believability, likability and perceived intelligence of the robot. For instance, in these two conditions, the robot was rated as less intelligent, less responsible and less kind than in the congruent condition. Furthermore, since there was no statistically significant difference between the believability and intelligence ratings in the contextually incongruous and the cross-modally incongruous conditions, we can conclude that the conflict situation where information conveyed in one of the communication modalities (auditory or visual) is incongruous with the socio-emotional context of the interaction is almost as harmful as the conflict situation where the robot's overall multimodal reaction is incongruous with the socio-emotional context of the interaction.

5.2 Limitations

The findings reported above should be considered in light of possible limitations and constraints of our laboratory experiment. There are obviously numerous robot-related and person-related factors that could influence the perception of the robot and its emotional expressions. However, it is practically and methodologically impossible to investigate and control them all within a single study.

In terms of robot-related factors, we limited our research to audio-visual expressions based on the body and voice of the robot. The results partially indicate that the choice of a robot without facial expressions, together with the selected emotions, made it hard for participants to distinguish between certain emotions (e.g., the ambivalent emotion of surprise). A further limitation is that our robot had a static face, which to some extent reflected "happiness", due to the arrangement of the eyes and mouth. It is undeniable that the face of the robot was yet another affective signal, which participants integrated with the body and voice signals to arrive at emotional judgments. Although we tried to avoid communicating emotions through the linguistic content of the robot's speech, it is likely that this was another factor that influenced participants' judgments, especially since the robot was communicating via natural language [17]. In terms of person-related factors that influence the perception of a robot and its emotional expressions, the design of this study did not consider cultural specificity. There are numerous theories about the role of culture in human–human [67, 68] and human–robot [69] affective interaction. However, most of the participants in our study were Austrian, meaning that the results may not be directly transferable to participants with a different cultural background. Future studies should consider a larger sample size and participants with more diverse characteristics. Moreover, this study focused only on a small set of three emotions (happiness, sadness, and surprise). Future studies should include more emotions, such as Ekman's basic emotions [70], or more complex emotions (e.g., embarrassment) which are more difficult to simulate in a robot. Lastly, although we used previously validated movies (see [61] for details) for our emotion elicitation manipulation, it remains unclear whether the different lengths of the chosen movie clips (see Table 2) had an impact on the elicitation of the target socio-emotional context of the human–robot interaction.

These limitations do not necessarily mean that the results of this work cannot be generalized. The integration of incongruent voice and body signals had not been empirically investigated with humanoid robots prior to this research. Thus, our work provides important insights for researchers and designers of humanoid social robots. The primary findings of our experiment suggest that incongruent body and voice emotional cues result in confused perceptions of emotion and negatively influence attitudes towards a robot. It is likely that investigating different incongruent visual and auditory emotional cues, with other robots and in different interaction contexts, will yield further insights into the perception of multimodal emotional expressions of social robots.

6 Conclusion

This article discussed how incongruous emotional signals from the body and voice of a humanoid robot influence people's ability to identify and interpret the robot's emotional state, and how incongruence affects attitudes towards the robot during socio-emotional interactions. Our laboratory HRI study showed that incongruous audio-visual expressions not only impair emotion recognition accuracy, but also decrease the perceived intelligence of the robot, making its behaviour seem less believable and less likable. These findings highlight the importance of properly designing and evaluating multimodal emotional expressions for robots intended to interact with people in real socio-emotional HRI environments.

Based on our findings, we provide some recommendations regarding design choices for robots that use several channels to communicate their emotional states in a clear and effective way. When emotional information is incongruous across the auditory and visual channels, the likelihood that people accurately recognize the emotional expression of the robot is significantly decreased. Therefore, great attention to detail is required when attempting to simulate multimodal emotional expressions: designers must ensure that the different channels used to express emotions are appropriate and congruent with each other. Conflict situations, where two sensory modalities receive incongruous information, can easily occur in the context of social HRI in real-world environments. For example, humanlike robots use highly expressive body postures or facial expressions to convey emotion, but often combine these with synthetic voices which lack the naturalness of the human voice. If designers are able to anticipate the channel upon which a user will rely when making emotional assessments about a robot, then they can tailor the information presented in that channel to maximize the recognizability of the robot's expression. Conflict situations can also occur due to noise, which usually affects only one modality (i.e., sight or hearing). For instance, if the environment is very loud (e.g., a school or hospital), the voice of the robot may be masked, and the resulting audio-visual expression may not adequately convey the intended emotion. Hence, when choosing communication modalities for a robot, a designer should take the characteristics of the environment into account.

Given that contextual incongruence (i.e., a robot expresses happiness in response to a sad situation) can have a detrimental effect on the believability, likability and perceived intelligence of the robot, as a general guideline, designers should not only assess whether the multimodal emotional expressions of a robot are accurately recognized, but also what effects they have on interaction partners' attitudes towards the robot. Furthermore, in certain social contexts, if a robot's perception capabilities are not precise enough, designers may need to reconsider the use of emotional expressions. In other words, if a robot is likely to make a mistake in assessing the affective state of the user or the socio-emotional context of the interaction, then it might be better to opt for neutrality instead of showing an inappropriate emotional reaction, which can make the robot appear unintelligent, irresponsible or even unkind towards its human interaction partner.
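As a hypothetical illustration of this last guideline (the confidence threshold and the select_expression helper are invented for the sketch and do not correspond to any system described in this article), a robot controller could fall back to a neutral display whenever its affect-recognition confidence is low:

```python
# Illustrative sketch only: the threshold and the interface are hypothetical.
NEUTRAL_FALLBACK_THRESHOLD = 0.7

def select_expression(predicted_emotion: str, confidence: float) -> str:
    """Return the emotion the robot should express, falling back to neutral
    when the affect-recognition confidence is too low to risk an
    inappropriate (contextually incongruous) reaction."""
    if confidence < NEUTRAL_FALLBACK_THRESHOLD:
        return "neutral"
    return predicted_emotion

# Example: a low-confidence "sadness" prediction results in a neutral display.
print(select_expression("sadness", confidence=0.55))  # -> "neutral"
```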