Experimental Design
We conducted a laboratory human-robot interaction experiment where participants were invited to watch movie clips, together with the humanoid robot Pepper. We manipulated the socio-emotional context of the interaction by asking participants to watch three emotion-eliciting (happiness, sadness, surprise) movie clips alongside the robot. Emotion elicitation using movie clips is a common experimental manipulation used in psychology studies of emotions [48], and has been successfully used in HRI studies to elicit emotional responses in healthy individuals in the laboratory (e.g. [47, 49]). We also manipulated the emotional congruence of the multimodal reactions of the robot (consisting of vocal expressions and body postures) to each movie clip as follows (see Table 1):
- In the congruent condition, the emotional valence of the multimodal reaction of the robot was congruent with the valence of the socio-emotional context of the interaction (elicited by the movie clip). For example, the robot expresses sadness in response to a sad movie clip.
- In the contextually incongruous condition, the emotional valence of the multimodal reaction of the robot was incongruous with the valence of the socio-emotional context of the interaction. For example, the robot expresses happiness in response to a sad movie clip.
- In the cross-modally incongruous condition, the multimodal reaction of the robot contained both congruent and incongruous cues with respect to the valence of the socio-emotional context of the interaction. For example, the robot expresses happy vocal expressions and sad body postures in response to a happy movie clip.
In the context of this study, (in)congruence is defined based on emotional valence (i.e., the positivity/negativity of the emotion), according to the two-dimensional model of emotion proposed by Russell [50] (see Fig. 1). A number of previous studies have also suggested that the effects of congruence may vary by valence (e.g., [46, 47, 51]).
We chose to investigate the emotions of happiness, sadness, and surprise for a number of reasons. First, happiness, sadness, and surprise are all “social emotions” [52], namely emotions that serve a social and interpersonal function in human interactions. This category of emotions is especially relevant for social robots. Second, the expression of happiness, sadness, and surprise through body motion and vocal prosody has often been studied; thus, by choosing these emotions, we were able to find reliable sources for the design of the audio-visual stimuli. Finally, we chose emotions which belong to different quadrants of the valence-arousal space [50]. As shown in Fig. 1, happiness and surprise are both arousing emotions, which vary only on the valence dimension. Happiness has positive valence, while surprise can have any valence from positive to negative. Both emotions involve clear action components in the body expression (in contrast to a sad body expression) [53]. In fact, body expressions of happiness and surprise share physical characteristics (i.e., large, fast movements and vertical extension of the arms above the shoulders) [53, 54]. On the other hand, sadness and happiness differ in both valence and arousal; happiness has high arousal and positive valence, while sadness has low arousal and negative valence. Happiness and sadness share minimal body and vocal characteristics. To create prominent incongruous stimuli, we combined happiness with sadness and sadness with surprise.
We asked participants to label the emotional expressions of the robot and to rate the robot in terms of believability, perceived intelligence, and likability. In addition to these quantitative measures, we also assessed a dispositional factor, namely the dispositional empathy of the participants. Empathy is defined as an affective response stemming from the understanding of another’s emotional state, or what the other person is feeling or would be expected to feel in a given situation [55]. It was included in the study because evidence suggests that individuals with a low level of dispositional empathy achieve lower accuracy in decoding facial expressions of humans [56] as well as emotional expressions of robots [12, 57].
Participants
Participants were recruited through online and university advertisements. In total, 30 participants (mean age = 29.5 years, SD = 4.82; 47% female, 53% male) who met the inclusion criteria (at least 18 years of age, basic English skills) were invited to the lab and completed the study. Participants gave informed consent and received monetary compensation for their participation (15 Euros).
Setting and Apparatus
The robot used in the study was Pepper by SoftBank Robotics, a human-like robot with a full-motion body with 20 degrees of freedom. The experiment was carried out in a lab, furnished as a living-room environment, with a sofa, a small table with a laptop computer, and a large TV screen (see Fig. 2). The participants sat on the sofa, facing the TV screen, and the robot was placed between the participant and the TV, slightly to the right of the TV screen. Throughout the experimental session, the participant was observed via the built-in camera of the robot. A trained experimenter in an adjacent room used the video feed to trigger the robot’s emotional behaviour at the appropriate moments.
Stimulus Material
Emotion Elicitation Movie Clips
Each participant watched three short emotion-eliciting movie clips, extracted from the following commercially available movies: An Officer and a Gentleman (happiness), The Champ (sadness), and Capricorn One (surprise). Target emotions and details about the movies are listed in Table 2. The procedure for validating the effectiveness of these clips in eliciting the target emotions is described in Rottenberg et al. [57]. For a specific description of the scenes, see the Appendix of [57]. The order of presentation of the clips was randomized for each participant.
Table 2 Target emotions and corresponding emotion-eliciting movies used in the experiment
Robot Emotional Expressions
In response to each movie clip, the robot expressed a multimodal emotional expression consisting of two modalities: auditory (vocal prosody), and visual (whole-body expression). The facial expression of the robot remained unchanged across all the stimuli (Pepper has a static face).
Table 3 Motion dynamics (velocity, amplitude), body animations (head, torso, arms) and vocal prosody characteristics (pitch, timing, loudness, non-linguistic utterances) used to generate audio-visual expressions for the target emotions
Pitch, timing, and loudness are the features of speech that are typically found to correlate with the expression of emotion through vocal prosody [17]. We manipulated these three features using the Acapela Text-to-Speech (TTS) engine (English language) to generate a set of vocal expressions. Our implementation was based on the phonetic descriptions of happiness, sadness, and surprise proposed by Crumpton et al. [17]. Given our interest in how vocal prosody (and not semantic information) influences emotion perception, the expressions were emotionally inflected sentences giving factual descriptions of the scenes shown in the movie clips, without any meaningful lexical-semantic cues suggesting the emotions of the robot (e.g., “A boxer is laying injured on the table and asks to see his son. A young boy approaches and starts talking to him”). No information regarding the age or gender of the robot could be derived from the speech. HRI research suggests that non-linguistic utterances (NLUs) can be used in combination with language to mitigate any damage to the interaction should TTS-generated speech fail to perform at the desired level (e.g., [58]). In light of these findings, we decided to combine the sentences with a set of NLUs that emphasize the target emotion. NLUs were selected from an existing database of exemplars created in previous work [57], where evaluators rated each NLU using a forced-choice evaluation framework (Sadness, Happiness, Anger, Surprise, Neutral, I don’t Know and Other). All of the chosen NLUs were correctly recognized above chance level [57]. Table 3 summarizes the vocal prosody characteristics and NLUs we used for each target emotion. The resulting set was composed of three distinct sentence blocks (one for each video), each one recorded with different prosody features to portray two different emotions (one congruent and one incongruous with the situational valence). For example, for the “sadness” movie clip, the same sentence block was generated with two different prosody characteristics (pitch, timing, and loudness) and was combined with two different NLUs, for happiness and sadness respectively.
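For illustration, the following sketch shows one way the auditory stimuli could be organized in software, pairing each factual sentence block with emotion-specific prosody settings and an NLU. All parameter labels and file names are hypothetical placeholders rather than the settings listed in Table 3.

```python
# Illustrative sketch: pairing each factual sentence block with emotion-
# specific prosody settings and a non-linguistic utterance (NLU). All
# parameter values and file names are hypothetical placeholders; the
# study's actual settings are summarized in Table 3.
PROSODY = {
    "happiness": {"pitch": "high", "timing": "fast", "loudness": "loud"},
    "sadness":   {"pitch": "low",  "timing": "slow", "loudness": "soft"},
    "surprise":  {"pitch": "high", "timing": "fast", "loudness": "loud"},
}

NLU_FILES = {
    "happiness": "nlu_happy.wav",
    "sadness":   "nlu_sad.wav",
    "surprise":  "nlu_surprise.wav",
}

# The same factual sentence block is rendered with different prosody, so
# only acoustic (not lexical) cues carry the emotion.
SENTENCE_BLOCKS = {
    "sad_clip": ("A boxer is laying injured on the table and asks to see "
                 "his son. A young boy approaches and starts talking to him."),
}

def build_vocal_stimulus(clip: str, emotion: str) -> dict:
    """Return the TTS text, prosody settings and NLU for one condition."""
    return {
        "text": SENTENCE_BLOCKS[clip],
        "prosody": PROSODY[emotion],
        "nlu": NLU_FILES[emotion],
    }

# Congruent vs. contextually incongruous vocal reaction to the sad clip:
congruent = build_vocal_stimulus("sad_clip", "sadness")
incongruous = build_vocal_stimulus("sad_clip", "happiness")
```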
The vocal prosody stimuli were synchronized with (congruent and incongruent) body movements to create the audio-visual emotional expressions of the robot. The body expressions of the robot were modelled after the way humans move their head, torso, and arms to express emotions. Sources for our implementation were studies investigating the relevance of body posture and body movement features (i.e., velocity and amplitude) in conveying and discriminating between basic emotions in humans [53, 54, 59, 60]. De Silva and Bianchi-Berthouze [59] found that vertical features and features indicating the lateral opening of the body are informative for separating happiness from sadness. For instance, hands are raised and significantly more extended to indicate happiness and remain low along the body for sadness [59]. In a study by De Meijer [53], trunk movement (ranging from stretching to bowing) was used to distinguish between positive and negative emotions. Based on these findings, in our design, happiness is characterized by a straight robot trunk, the head bent back, a vertical and lateral extension of the arms, and large, fast movements. Surprise is characterized by a straight trunk, backward stepping, and fast movements, whereas sadness is characterized by a bowed trunk and head and downward, slow body movements. In a pre-evaluation online survey, we validated that people correctly perceived the emotion that each isolated body-part animation was intended to convey [57]. We selected the animations that received the highest overall recognition scores and used them to generate the more complex animations for this study. Table 3 summarizes the whole-body expressions we used for each target emotion. Examples of body expressions can be seen in Fig. 2.
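For illustration, a minimal sketch of how such a keyframe animation might be scripted on Pepper with the NAOqi Python SDK is shown below; the joint angles and timings are hypothetical placeholders, not the validated animations used in the study.

```python
# Illustrative sketch (NAOqi Python SDK): playing a "sadness"-style keyframe
# animation on Pepper. Joint angles (radians) and timings are hypothetical
# placeholders, not the validated animations from the pre-evaluation survey.
import math
import qi

session = qi.Session()
session.connect("tcp://<robot-ip>:9559")   # replace <robot-ip> with Pepper's address
motion = session.service("ALMotion")

# Sadness: bowed trunk and head, arms low along the body, slow movements.
names = ["HeadPitch", "HipPitch", "LShoulderPitch", "RShoulderPitch"]
angles = [
    [0.5],                   # head bent forward (placeholder value)
    [-0.3],                  # trunk bowed (placeholder value)
    [math.radians(85)],      # left arm hanging low
    [math.radians(85)],      # right arm hanging low
]
times = [[3.0], [3.0], [3.0], [3.0]]   # slow transitions (seconds)

# angleInterpolation moves each joint through its keyframes; True = absolute angles.
motion.angleInterpolation(names, angles, times, True)
```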
Procedure
In order to avoid expectation effects, participants were told that the experiment focused on the ability of the robot to recognize emotions from audio-visual cues in the movie clips (“In this study, we test whether our robot can detect the emotional cues in the movie clips and can react accordingly”), rather than the actual aim. After reading a description of the experiment and signing a consent form, the participant was escorted to the lab where the experiment took place. Upon entering the room, the robot looked at the participant, waved and introduced itself (“Hello! I am Pepper. Welcome to the lab.”). The experimenter then left the room, and the robot uttered, “We are going to watch some movies together! Start the first clip when you are ready”, and turned towards the TV. While the participant watched a clip, the robot also looked at the TV. At the end of the clip, the robot turned towards the participant and expressed its emotional reaction to the clip (congruent, contextually incongruous or cross-modally incongruous). The duration of the robot’s reactions varied between 20 and 50 s. During the rest of the session, the robot displayed idle behaviours (i.e., gaze/face tracking, breathing). We decided not to include any other type of verbal interaction between the robot and the participant, in order to minimize possible biasing effects on the participant’s perception of the robot.
After the emotional reaction of the robot, an on-screen message on the laptop prompted the participant to answer an online questionnaire (built using the online tool Limesurvey) with questions about their experience of the movie clip and their perception of the robot’s reaction (see Sect. 3.6). To limit carryover effects from one movie to the next, a 1-min rest period was enforced after completing the questionnaire. The participant was told to use this time to “clear your mind of all thoughts, feelings, and memories”, before watching the next clip. This approach was originally used by Gross et al. [61] in their experiments on emotion elicitation using movie clips.
At the end of the third emotional expression of the robot, an on-screen message prompted the participant to answer a series of questions about demographics and personality traits (see Sect. 3.6). Afterwards, the robot thanked the participant and said goodbye (“Thank you for participating in this experiment. Goodbye!”). The experimenter then entered the room, answered any potential questions, debriefed the participant about the real purpose of the experiment and gave the monetary compensation. The experiment took about 60 min on average.
Table 4 The seven items of the believability questionnaire, adopted from [62]
Measures
Manipulation Check—Experience of the Movie Clip To ascertain whether the desired emotion (happiness, sadness, surprise) had been properly elicited by the movie clip, we asked participants to report the most prominent emotion they experienced while watching the clip. Participants chose one option from a list of 11 emotion labels (amusement, anger, disgust, despair, embarrassment, fear, happiness/joy, neutral, sadness, shame, surprise) or the option other.
Emotion Recognition We asked participants to label the most prominent emotion expressed by the robot in response to each movie clip. Participants chose one option from the same list of 11 emotion labels (amusement, anger, disgust, despair, embarrassment, fear, happiness/joy, neutral, sadness, shame, surprise) or the option other.
Attitudes Towards the Robot—Believability We asked participants to rate their perceptions of the believability of the robot. Participants rated seven conceptually distinct dimensions of believability (awareness, emotion understandability, behaviour understandability, personality, visual impact, predictability, behaviour appropriateness), as defined by Gomes et al. [62]. Table 4 contains the assertions used for each dimension. All items were rated on a 5-point Likert scale ranging from 1 “Strongly disagree” to 5 “Strongly agree”, with an additional “I don’t know” option.
Attitudes Towards the Robot—Perceived Intelligence and Likability Participants rated the robot on the Perceived Intelligence and Likability dimensions of the Godspeed questionnaire [63]. All items were presented on 5-point semantic differential scales.
Demographics and Personality Traits Participants reported basic socio-demographic information (age, gender, profession and previous experience with robots). Participants were also asked to fill in the Toronto Empathy Questionnaire [64], a 16-item self-assessment questionnaire assessing dispositional empathy.
Data Analysis
Manipulation Check—Experience of the Movie Clip Of the 90 movie clip ratings we obtained (30 participants \(\times \) 3 clips per participant), 12 ratings were inconsistent with the intended situational valence manipulation (i.e., the movie clip failed to elicit the targeted emotion in the participant) and were thus excluded from further analyses. Consequently, the statistical analysis reported below was performed on the basis of a final sample of 78 ratings (Happy clip n = 25, Surprise clip n = 25, Sad clip n = 28).
Exploratory Regression Analysis As discussed in the previous sections, there are various factors that seem to influence the perception of robotic emotional expressions (i.e., the socio-emotional context of the interaction, incongruence between modalities, the rater’s gender and dispositional empathy). Therefore, the first step of the data analysis was an exploratory regression analysis, performed to identify which factors would best account for whether or not participants correctly recognized the emotional expressions of the robot in this study. We coded the dependent variable (emotion recognition) as a binary value indicating whether or not the participant accurately recognized the expression of the robot, and ran a logistic regression (a method typically used for such exploratory analyses [65]) to ascertain the effects of (in)congruence (congruent, contextually incongruous and cross-modally incongruous conditions), the emotion being expressed (happiness, sadness and surprise), gender, and dispositional empathy score on the likelihood that participants accurately recognized the emotional expression of the robot.
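For illustration, such an exploratory logistic regression could be run in Python with statsmodels along the following lines, assuming a long-format data set with one row per robot reaction and hypothetical column names.

```python
# Sketch: exploratory logistic regression on recognition accuracy.
# Assumes a long-format DataFrame with one row per robot reaction and
# hypothetical column names: recognized (0/1), condition, emotion, gender,
# and empathy (TEQ sum score).
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("trials.csv")   # placeholder file name

model = smf.logit(
    "recognized ~ C(condition) + C(emotion) + C(gender) + empathy",
    data=df,
).fit()
print(model.summary())           # coefficients; exponentiate for odds ratios
```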
Emotion Recognition—Effects of Incongruence In the second step of the analysis, hit rate and unbiased hit rate [66] were analysed. These measures were chosen because we were interested in whether incongruence decreased the detection rate of the target emotion. The hit rate (\(H_s\)) is the proportion of trials in which a particular emotion is shown that are correctly labelled. Although \(H_s\) is one of the most frequently used measures of accuracy, it does not take into account false alarms (i.e., the number of times a particular emotion label is incorrectly used) or response biases (e.g., a bias to respond “happy” for all expressions). The unbiased hit rate (\(H_u\)), proposed by Wagner [66], addresses this problem and yields more precise accuracy estimates. The computation of \(H_u\) scores involves “the joint probability that a stimulus is correctly identified (given that it is presented) and that a response is correctly used (given that it is used)” (Wagner [66], p. 16). In other words, in order to measure recognition accuracy for a given emotion, both the number of misses (i.e., the number of times a particular emotion was presented and the participant responded with a different label) and the number of false alarms (i.e., the number of times the participant used the target label when the target emotion was not presented) are taken into account. \(H_u\) scores were computed for each emotional expression of the robot as follows:
$$\begin{aligned} H_u = \frac{A_i}{B_i} \times \frac{A_i}{C_i} \end{aligned}$$
where \(A_i\) is the frequency of hits for emotion \(i\), \(B_i\) is the number of trials in which \(i\) is the target, and \(C_i\) is the frequency of \(i\) responses (hits plus false alarms).
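For illustration, \(H_s\) and \(H_u\) can be computed from a stimulus-by-response confusion matrix as sketched below; the counts are hypothetical.

```python
# Sketch: hit rate (Hs) and Wagner's unbiased hit rate (Hu) computed from a
# stimulus-by-response confusion matrix. Counts are hypothetical; the real
# response set also included labels other than the three target emotions.
import numpy as np

emotions = ["happiness", "sadness", "surprise"]
# confusion[i, j] = number of trials where emotion i was expressed by the
# robot and the participant responded with label j.
confusion = np.array([
    [20, 3, 2],
    [4, 18, 3],
    [5, 2, 18],
])

for i, emo in enumerate(emotions):
    A = confusion[i, i]          # hits for emotion i
    B = confusion[i, :].sum()    # trials in which i was the target
    C = confusion[:, i].sum()    # times label i was used (hits + false alarms)
    Hs = A / B
    Hu = (A / B) * (A / C)
    print(f"{emo}: Hs = {Hs:.2f}, Hu = {Hu:.2f}")
```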
To investigate if people recognized the robot’s emotional expressions correctly, we compared the emotion recognition ratings of the congruent and contextually incongruous conditions against the ideal distribution using a Chi-Square test. This statistical analysis approach was previously used in [11]. For instance, if ten people had to choose the right expression for the robot, out of a list of three different expressions (e.g., happy, sad, neutral), and the robot expressed “sadness”, then the ideal distribution would be 0, 10, 0.
Next, to investigate the effects of contextual incongruence (i.e., the conflict situation where the robot’s reaction is incongruous with the socio-emotional context of the interaction), we compared emotion recognition ratings of the congruent conditions (e.g., happy expression in response to a happy movie clip) against emotion recognition ratings of the contextually incongruous conditions (e.g., happy expression in response to a sad movie clip) by means of Chi-Square tests.
Thirdly, to investigate the effects of cross-modal incongruence (i.e., how incongruous auditory (vocal prosody), and visual (whole-body expression) cues are processed when presented simultaneously), we compared the emotion recognition ratings of the congruent condition (e.g., happy body and happy voice) against the emotion recognition ratings of the cross-modally incongruous condition (e.g., happy body and sad voice, sad body and surprised voice) by means of Chi-Square tests.
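For illustration, the Chi-Square analyses described above can be sketched as follows with scipy.stats; all counts are hypothetical placeholders, and the handling of zero expected frequencies shown here is an implementation choice rather than the paper’s procedure.

```python
# Sketch: chi-square tests for the emotion recognition analysis.
# All counts below are hypothetical placeholders.
from scipy.stats import chisquare, chi2_contingency

# (1) Goodness-of-fit against the "ideal" distribution. If ten raters chose
# among three labels (happy, sad, neutral) for a sad expression, the ideal
# counts would be [0, 10, 0]; because the test does not allow zero expected
# frequencies, the ideal distribution is approximated with small nonzero
# values here (an implementation choice, not the paper's procedure).
observed = [2, 7, 1]
ideal = [0.5, 9.0, 0.5]
chi2_fit, p_fit = chisquare(f_obs=observed, f_exp=ideal)

# (2) One possible way to compare recognition ratings between conditions
# (e.g., congruent vs. contextually incongruous): a condition-by-response
# contingency table.
table = [
    [12, 2, 1],   # congruent condition: counts per response label
    [7, 5, 3],    # contextually incongruous condition
]
chi2_cmp, p_cmp, dof, expected = chi2_contingency(table)
print(p_fit, p_cmp)
```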
Table 5 Results of the regression analysis of congruence, emotion being expressed, gender and dispositional empathy score on the likelihood that participants accurately recognize the emotional expression of the robot
Attitudes Towards the Robot—Effects of Incongruence In the last part of the analysis, we investigated the effects of contextual incongruence and cross-modal incongruence on participants’ attitudes towards the robot. Scores for the Believability, Perceived Intelligence, and Likability questionnaires were calculated by summing the respective scale items. Cronbach’s alpha was calculated to assess the internal reliability of the scales (all scales achieved a value higher than 0.7 and can thus be considered reliable). Since our data were not normally distributed, we used non-parametric tests suitable for ordinal data. Specifically, the Kruskal–Wallis H test and subsequent Mann–Whitney U tests were used to determine whether there were statistically significant differences in the scores between the three experimental conditions (congruent, contextually incongruous, cross-modally incongruous).
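For illustration, the scale scoring, reliability check, and non-parametric comparisons can be sketched as follows, assuming a hypothetical data frame and column names.

```python
# Sketch: scale scoring, reliability check and non-parametric comparisons
# for the attitude questionnaires. Data frame and column names are
# hypothetical (belief_1 ... belief_7 = believability items; condition =
# congruent / contextual_incongruous / cross_modal_incongruous).
import pandas as pd
from scipy.stats import kruskal, mannwhitneyu

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Cronbach's alpha for a DataFrame of item scores (rows = raters)."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars / total_var)

df = pd.read_csv("ratings.csv")                      # placeholder file name
belief_items = [f"belief_{i}" for i in range(1, 8)]
alpha = cronbach_alpha(df[belief_items])             # internal reliability
df["believability"] = df[belief_items].sum(axis=1)   # summative scale score

# Omnibus test across the three experimental conditions
groups = [g["believability"].values for _, g in df.groupby("condition")]
H, p = kruskal(*groups)

# Follow-up pairwise comparison, e.g. congruent vs. cross-modally incongruous
a = df.loc[df.condition == "congruent", "believability"]
b = df.loc[df.condition == "cross_modal_incongruous", "believability"]
U, p_pair = mannwhitneyu(a, b, alternative="two-sided")
```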
Table 6 Raw emotion recognition ratings for each movie clip in the congruent (Cong.), cross-modally incongruous (Mod. Incong.) and contextually incongruous (Cont. Incong.) conditions
Toronto Empathy Questionnaire The Toronto Empathy Questionnaire score was calculated by reverse-scoring the inverted items and computing the sum over all 16 items.
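For illustration, the questionnaire scoring can be sketched as follows; the response coding and the set of reverse-scored items are assumptions to be checked against the published scoring key [64].

```python
# Sketch: scoring the Toronto Empathy Questionnaire (TEQ).
# Column names and the 0-4 response coding are assumptions; the set of
# reverse-scored items should be checked against the published key [64].
import pandas as pd

REVERSED = [2, 4, 7, 10, 11, 12, 14, 15]    # assumed reverse-scored items

def teq_score(row: pd.Series) -> int:
    """Sum all 16 items after reverse-scoring the inverted ones."""
    total = 0
    for i in range(1, 17):                  # items teq_1 ... teq_16
        value = row[f"teq_{i}"]
        total += (4 - value) if i in REVERSED else value
    return total

df = pd.read_csv("questionnaires.csv")      # placeholder file name
df["empathy"] = df.apply(teq_score, axis=1)
```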