The research presented in this paper builds on our previous work on the perception of sounds produced by expressive movements of a humanoid NAO robot [15]. Results from that work suggested that the mechanical sounds inherent to expressive movements of the NAO robot were not clearly coupled to the emotional reactions associated with the respective movements. For example, sounds produced by a joyful gesture conveyed a sensation of frustration. We also observed that certain mechanical sounds did convey emotional characteristics when presented in an auditory-only condition. Moreover, sounds generally communicated arousal more effectively than valence.
In the current work, we expand on the previous study, taking a mixed-methods approach that combines quantitative and qualitative methods. While the previous study focused on quantitative ratings of emotions for robot sounds, the first experiment presented in the current paper focuses on descriptions of the sounds in free-form text. The purpose of this approach was to explore and characterize robot sounds beyond a set of predefined scales. Following up on this experiment, we conducted a second experiment to explore how blended sonification can be used to enhance certain acoustic characteristics of robot sounds in order to improve the clarity of non-verbal robot communication. In other words, the study consisted of two separate parts: Experiment 1, focusing on descriptions of the original sounds produced by expressive movements of a NAO robot, and Experiment 2, focusing on perceptual ratings of these sounds and blended sonifications thereof. A schematic representation of the procedure is presented in Fig. 2.
The experiments were organized as follows: participants were first welcomed by the instructor (author 1) and then received instructions on the first page of an online form designed for data collection. They were informed that the SONAO project focused on HRI, but were not told about the origin of the sounds used in the experiments, i.e. that the sounds had been produced by expressive robot movements. After the initial instructions, participants were asked to fill out a demographics and musical experience form. They then proceeded with the experiments. Experiment 1 focused on labelling sounds using free-form text annotations. Experiment 2 focused on rating emotions conveyed by sonifications on a set of predefined emotional scales. The same participants took part in both experiments, completing Experiment 1 before Experiment 2. The experiments were carried out in a lab setting at KTH Royal Institute of Technology and KMH Royal College of Music in Stockholm. Participants listened to the sounds in an online web interface, and the experiments were purely acoustic: no video representation of the robot's gestures was shown.
Participants
A total of 31 participants (14 F, mean age = 36.26 years) took part in the experiments. However, one participant did not complete the second experiment, reducing the number of participants included in the data analysis to 30 (14 F, mean age = 36.23 years) for Experiment 2. In our previous study [15], we observed that some participants found it hard to put into words the sonic qualities of the mechanical sounds produced by the NAO robot. We hypothesized that the overall experience of listening to these sounds might be affected by the level of musical experience; musicians might be more accustomed to listening to (as well as describing) abstract sounds, since they are more familiar with e.g. contemporary music. Since our current experiment made use of the same original recordings of robot sounds that were used in our previous study [15], a prerequisite for taking part in the current study was to have some musical experience. In other words, the decision to recruit participants with a certain level of musical expertise was guided by the hypothesis that this would result in richer and more informed free-form text descriptions of the evaluated sounds. Moreover, several investigations (see for example [3, 43, 47]) have shown that people with musical skills acquire an auditory expertise that is not found in laypeople. This expertise includes, for example, more precise detection of pitch height, discrimination of specific audio streams in noisy situations, and high sensitivity in the identification of frequency deviations (such as in mistuned sounds). Therefore, musical experts can provide more stable and reliable data for our investigation and, as a consequence, help us in the design of reliable and robust sonic feedback from a robot.
Participants were recruited from the staff and students of KTH Royal Institute of Technology, staff of KMH Royal College of Music, as well as amateur musicians from the KTH Symphony Orchestra (KTHs Akademiska Kapell, KTHAK) and the KTH brass band (Promenadorquestern, PQ). Some students from the Interactive Media Technology master's programme who had previously completed courses in Sound and Music Computing also took part. In total, 17 of the participants were students at KTH, and 14 were not. The level of musical expertise ranged from expert/full-professional activity as a musician or singer (9 participants) to little experience as an amateur musician/singer (7 participants). A total of 7 participants reported semi-professional activity as a musician or singer with several years of practice, and 7 participants reported being advanced amateur musicians/singers with some years of practice. A mixed-model analysis revealed no significant between-subjects effect of musical expertise on ratings, nor any significant interactions with musical expertise.
Compliance with Ethics Standards
The authors declare that this paper complies with the ethical standards of this journal. All subjects gave informed consent for inclusion before they participated in the study. The study was conducted in accordance with the Declaration of Helsinki. At the time the experiments were conducted, no ethics approval was required from KTH for behavioral studies such as the one reported in this paper. For the management of participants' personal data, we followed regulations according to the KTH Royal Institute of Technology's Ethics Officer. Participants did not receive any monetary compensation; however, subjects from the KTH Symphony Orchestra and the KTH brass band received a cinema ticket for their participation.
Technical Setup
The two experiments were conducted one after the other in a lab at KTH Royal Institute of Technology, using a laptop connected to a pair of Genelec 8030B speakers. For the participants from KMH Royal College of Music, the experiments were conducted in their studios, with Genelec 8250A speakers.
Experiment 1
In our previous work, we observed that some participants described the mechanical sounds produced by the NAO robot as unpleasant and disturbing [15]. We wanted to know whether participants had overall positive or negative associations with these mechanical robot sounds. Moreover, we wanted to investigate how participants would describe these sounds, as well as the emotions that they conveyed, when allowed to answer in free-form text, i.e. when they were not restricted to predefined scales. One advantage of using free-form text descriptions is that coherence in listeners' responses can be of more significance than if they were merely items checked from a pre-defined list [16].
Stimuli
The same audio recordings that were used in [15] were used in the current study (Footnote 4). These recordings were sounds of a NAO robot performing expressive gestures (frustration, relaxation and joy), i.e. sounds produced by the mechanical movements and motors of the NAO robot. There was a total of four stimuli: one for frustration, one for relaxation, and two for joy. The frustrated sound file was 9 seconds long, the relaxed sound file was 6.5 seconds, and the two joyful sound files were 10 and 9 seconds long, respectively. The two joyful sounds were produced by two gestures that differed in the degree of non-verbal expression along a joyful axis, as described in [1]; the two versions were included for comparative purposes. The gesture that produced the first sound file was rated as more joyful than the gesture producing the second sound (see [1]).
Procedure
The presentation order of the stimuli was randomized for each participant. The participants could listen to the sounds as many times as they wanted. For each stimulus, they were asked the following questions:
1. Please describe the emotion(s) that you think that this sound conveys.
2. Do you have any other comments about this particular sound?
Participants had to answer question 1 but were not required to answer question 2. There was no limit on the number of words that could be used for each question.
Analysis
It has been shown in previous research on emotions in film music that a two-dimensional model (valence, arousal) of emotions gives results comparable to those of a three-dimensional model (valence, arousal, tension) [11]. Therefore, for the sake of simplicity, we adopted a two-dimensional model for the classification of emotional descriptions in our experiment. Since the majority of the text entries were in the form of a single word or a few synonyms, we decided to take a high-level approach focusing on aspects related to motion (activity) and emotion (valence) expressed by these words, based on a categorization into the two-dimensional circumplex model of affect [38]. The model provides a dimensional approach, in which all affective states arise from two fundamental neurophysiological systems, one related to valence (a pleasure-displeasure continuum) and the other to arousal, or alertness [42]. As such, we categorized the free-form text words and sentences into two dimensions, based on their arousal (activity) and valence. A schematic representation of the two-dimensional circumplex model of affect is displayed in Fig. 3. The categorization based on the circumplex model enabled evaluation of the descriptions for each sound file along a limited set of emotional dimensions.
In the first step of this analysis, lists of words for each sound stimulus were given to the authors. The authors were not aware of which list originated from which sound file. Separate analyses were then conducted for each word list. Both authors independently classified keywords along the activity dimension, categorizing words as activation (high arousal), deactivation (low arousal), or in between. In the next step, the same words were further categorized as having positive or negative valence. Finally, the encodings of the two authors were compared. An inclusion criterion was defined so that words were included in the results only if both authors agreed in their categorization along both dimensions. For example, if a word was described in terms of activation and positive valence by both authors, it was included in the final results. This approach was used to remove words with ambiguous meanings. A schematic representation of the analysis procedure is shown in Fig. 4.
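As an illustration, a minimal Python sketch of this agreement-based inclusion criterion is given below. The word lists and labels are hypothetical; the actual coding was done manually by the authors.

```python
# Sketch of the inclusion criterion used in Experiment 1 (hypothetical data).
# Each rater assigns every free-form keyword an (arousal, valence) label;
# a keyword is kept only if both raters agree on both dimensions.

rater_a = {
    "excited": ("activation",   "positive"),
    "tired":   ("deactivation", "negative"),
    "machine": ("in-between",   "negative"),
}
rater_b = {
    "excited": ("activation",   "positive"),
    "tired":   ("deactivation", "positive"),   # disagreement on valence
    "machine": ("in-between",   "negative"),
}

included = {
    word: rater_a[word]
    for word in rater_a
    if word in rater_b and rater_a[word] == rater_b[word]
}
print(included)  # 'tired' is excluded because the raters disagree on valence
```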
Experiment 2
One of the conclusions presented in our previous work was that the emotional expression of a NAO robot's gestures is not necessarily clearly coupled with how the sounds produced by these gestures are perceived [15]. In other words, sounds inherent to robot movements can influence how a robot's gesture is perceived. This can be problematic, especially if the sound communicates something that contradicts the gesture. To address this issue, we wanted to investigate whether mechanical robot sounds could be processed in a way that enhances, rather than disturbs, the emotion conveyed through robot gestures. This blended sonification strategy is described in detail below.
Stimuli
Two different sonification models were implemented: one "rhythmic sonification", producing shorter sounds with regular or irregular Inter-Onset-Intervals (IOIs), and one "continuous sonification", producing a continuous stream of sound, without interruptions. The sonification models were designed based on previous research on emotions in speech and music. A detailed description of the sonification for each emotion (i.e. expressive gesture) is presented below. The sound design is also summarized in Table 1. Sound files for each model are available as supplementary material.
Since the mechanical sounds of the NAO robot are always present, unless they are masked by other sounds, we decided to investigate how the sonifications would be perceived when presented in combination with the original recordings of the expressive gestures. Sonifications were therefore mixed with the original recordings. Since we also wanted to investigate how the sonifications were perceived when presented alone, the final stimulus collection consisted of three emotions (Footnote 5), presented in five different sound model conditions: original sound only, rhythmic sonification only, continuous sonification only, original sound mixed with the rhythmic sonification, and original sound mixed with the continuous sonification.
Since there were three emotions and five conditions, a total of 15 stimuli were obtained. The output level of all rhythmic sonifications was obtained by scaling the magnitude of the original input sound file using an exponential scale. For the continuous sonifications, the output level of the sound was mapped to the peak amplitude of the original signal.
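The exact scaling functions are not reported in further detail; the following Python/NumPy sketch illustrates one possible interpretation of these two level mappings. The curvature constant and frame size are assumptions, not the values used in the actual implementation.

```python
import numpy as np

def rhythmic_output_level(magnitude, k=4.0):
    """Exponential scaling of the normalized input magnitude to an output
    level in [0, 1]; the curvature k is an assumption."""
    m = np.abs(magnitude) / (np.abs(magnitude).max() + 1e-9)
    return np.expm1(k * m) / np.expm1(k)

def continuous_output_level(original, frame=1024):
    """Frame-wise peak amplitude of the original signal, used to drive the
    output level of the continuous sonifications (frame size assumed)."""
    n_frames = max(1, len(original) // frame)
    return np.array([np.abs(f).max() for f in np.array_split(original, n_frames)])
```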
Rhythmic Sonification
The speed at which sonic events are produced (events/second) is one of the most important cues for the communication of emotions in both speech (speech rate) and music (tempo). The speed can be perceived if there is a rhythmic regularity in the display of the sonic events; otherwise it is difficult to identify perceptually [12]. In a previous study, we found that experienced musicians chose clearly defined speed values for communicating different emotional intentions in music performances [5]. Therefore, we decided to test whether introducing a rhythm in the sonification of the robot movements, for communicating their speed, could aid the perception of their emotional intentions.
While little work has focused specifically on the term frustration in the context of emotion in music and speech research, substantial work has been conducted on the term anger. In the review on emotion in vocal expression and music performance conducted by Juslin and Laukka [25], anger is said to be associated with fast speech rate/tempo, high voice intensity/sound level, much voice intensity/sound level variability, much high-frequency energy, high F0/pitch level, much F0/pitch variability, rising F0/pitch contour, fast voice onsets/tone attacks, and microstructural irregularity. In a study focusing on modeling the communication of emotions by means of interactive manipulation of continuous musical features, changes in loudness and tempo were associated positively with changes in arousal, but loudness was dominant [5].
Based on the findings described above, the frustrated sonification was designed to be characterized by fast and loud sounds, with a lot of intensity variability and irregular rhythms. This was achieved by triggering a simple sample-based (granular) synthesis engine at the first peak of the original sound file. The synth was then triggered based on the magnitude of the original frustrated audio recording. The synthesis engine produced grains of random size between 200 and 300 ms (Footnote 6). The grains were randomly sampled from the original frustrated sound recording. The pitch of each grain was shifted based on the magnitude of the original audio signal. Finally, distortion was added to the output sound.
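As an illustration, a minimal offline sketch of such a granular, magnitude-triggered design is given below in Python/NumPy. The trigger threshold, hop size, pitch-shift mapping and distortion amount are assumptions made for the sake of the example; they are not the values used in the actual implementation.

```python
import numpy as np

def frustrated_rhythmic(original, sr=44100, rng=np.random.default_rng(0)):
    """Sketch of the frustrated rhythmic sonification (assumed implementation).
    Grains of 200-300 ms are sampled from the original recording, triggered and
    pitch-shifted according to its magnitude, and lightly distorted."""
    env = np.abs(original)
    out = np.zeros(len(original) + sr)            # headroom for grain tails
    hop = int(0.05 * sr)                          # assumed trigger resolution
    for start in range(0, len(original), hop):
        mag = env[start:start + hop].max(initial=0.0)
        if mag < 0.1:                             # assumed trigger threshold
            continue
        dur = int(rng.uniform(0.2, 0.3) * sr)     # grain size 200-300 ms
        src = rng.integers(0, max(1, len(original) - dur))
        grain = np.asarray(original[src:src + dur], dtype=float)
        shift = 1.0 + mag                         # magnitude -> pitch shift (assumed)
        idx = (np.arange(int(len(grain) / shift)) * shift).astype(int)
        grain = grain[idx] * np.hanning(len(idx)) * mag
        out[start:start + len(grain)] += grain
    return np.tanh(4.0 * out)                     # soft-clipping "distortion"
```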
In contrast to frustration, which is characterized by negative valence and high arousal, relaxation is characterized by positive valence and low arousal. In general, positive emotions appear to be more regular than negative emotions; irregularities in frequency, intensity and duration seem to be a sign of negative emotion [25]. In music, differences in arousal are mainly associated with differences between fast and slow tempi [16]. Relaxed speech has been found to be more whispery and breathy than stressed speech [17]. We designed the relaxed sonification model so that it presented soft sounds with little frequency and amplitude variability. These sounds were presented at a regular tempo. This was achieved by generating pulses of noise, with a 700 ms time difference, that were filtered through a resonant band-pass filter with variable center frequency (800–1000 Hz) and Q-value (0.3–0.4), depending on the absolute magnitude of the input sound. The output was then fed through a reverb.
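A corresponding sketch of the relaxed rhythmic model is shown below, again only as an illustration of the mapping. The pulse length, magnitude normalization and output gain are assumptions, and the final reverb stage is omitted.

```python
import numpy as np
from scipy.signal import iirpeak, lfilter

def relaxed_rhythmic(original, sr=44100, rng=np.random.default_rng(0)):
    """Sketch of the relaxed rhythmic sonification (assumed implementation).
    Noise pulses spaced 700 ms apart are filtered through a resonant band-pass
    whose centre frequency (800-1000 Hz) and Q (0.3-0.4) follow the absolute
    magnitude of the original recording."""
    out = np.zeros(len(original))
    step = int(0.7 * sr)                          # regular 700 ms spacing
    pulse_len = int(0.1 * sr)                     # assumed pulse length
    env = np.abs(original)
    for start in range(0, len(original) - pulse_len, step):
        mag = env[start:start + step].mean() / (env.max() + 1e-9)
        f0 = 800.0 + 200.0 * mag                  # 800-1000 Hz
        q = 0.3 + 0.1 * mag                       # Q 0.3-0.4
        b, a = iirpeak(f0, q, fs=sr)
        noise = rng.standard_normal(pulse_len) * np.hanning(pulse_len)
        out[start:start + pulse_len] += 0.2 * lfilter(b, a, noise)
    return out
```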
In Western music, differences in valence are mainly associated with major versus minor mode [16]. For the purpose of our study, the emotion joy could be considered to be similar to the sensation of happiness. According to Juslin and Laukka [25], happiness is associated with the following cues: fast speech rate/tempo, medium to high voice intensity/sound level, medium high-frequency energy, high F0/pitch level, much F0/pitch variability, rising F0/pitch contour, fast voice onsets/tone attacks, and very little microstructural irregularity. In [4, 5], authors conclude that happy music performances are characterized by a relatively fast tempo and loud sound with staccato articulation and clear phrasing (i.e. changes in tempo and sound level organized in accelerando/crescendo and rallentando/decrescendo couples for the communication of a musical phrase). Moreover, happiness is said to best be expressed with major mode, high pitch and high tempo, flowing rhythm and simple harmony [5, 20]. For the joyful sonification, we used a rectangular (pulse) oscillator with a rather short envelope duration, mapping magnitude of the input signal to pitches in a C major scale, with an Inter-Onset-Interval (IOI) of 180 ms between notes (about 5.5 notes/second).
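In the same spirit, a sketch of the joyful rhythmic model is shown below; the note duration, envelope decay and the exact magnitude-to-pitch mapping are assumptions, while the 180 ms IOI and C major scale follow the description above.

```python
import numpy as np

def joyful_rhythmic(original, sr=44100):
    """Sketch of the joyful rhythmic sonification (assumed implementation).
    A rectangular (pulse) oscillator plays a short note every 180 ms; the
    magnitude of the original recording selects the pitch from a C major scale."""
    c_major = 261.63 * 2.0 ** (np.array([0, 2, 4, 5, 7, 9, 11, 12]) / 12.0)
    ioi = int(0.18 * sr)                           # 180 ms between onsets
    note_len = int(0.12 * sr)                      # assumed note duration
    env = np.abs(original)
    out = np.zeros(len(original))
    t = np.arange(note_len) / sr
    decay = np.exp(-t / 0.04)                      # short assumed envelope
    for start in range(0, len(original) - note_len, ioi):
        mag = env[start:start + ioi].max() / (env.max() + 1e-9)
        freq = c_major[int(mag * (len(c_major) - 1))]  # magnitude -> scale degree
        square = np.sign(np.sin(2 * np.pi * freq * t))  # rectangular oscillator
        out[start:start + note_len] += 0.2 * square * decay
    return out
```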
Continuous Sonification
The continuous sonifications were, as the name suggests, not characterized by any particular rhythm. Similar to the rhythmic case, the continuous sonification of frustration made use of sample-based (granular) synthesis, but without a rhythmic component and added distortion. Grains were also generated based on peak amplitude. The sonification model for relaxation was similar to the one described for the rhythmic relaxed condition, with the differences that the sound was continuous, the resonant band-pass filter was set to a variable range of 300–500 Hz, and the Q-value to 0.2–1.0. The scaling of the absolute magnitude was also slightly different. For the joyful sonification, a simple FM synthesizer was used to generate a tonic and major third in a C major scale, with increasing pitch depending on the magnitude of the original sound file.
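As an example of the continuous case, the following sketch approximates the joyful FM model; the smoothing, modulation depth and pitch-rise mapping are assumptions, and the other continuous models follow analogously.

```python
import numpy as np

def joyful_continuous(original, sr=44100):
    """Sketch of the continuous joyful sonification (assumed implementation).
    A simple FM pair sustains the tonic and major third of C major; the
    magnitude envelope of the original recording raises the pitch."""
    env = np.abs(np.asarray(original, dtype=float))
    # Smooth the magnitude so the pitch glides rather than jumps (assumed).
    kernel = np.ones(2048) / 2048.0
    smooth = np.convolve(env / (env.max() + 1e-9), kernel, mode="same")
    t = np.arange(len(original)) / sr
    out = np.zeros(len(original))
    for base in (261.63, 329.63):                  # C4 (tonic) and E4 (major third)
        carrier = base * (1.0 + 0.5 * smooth)      # assumed pitch-rise mapping
        phase = 2 * np.pi * np.cumsum(carrier) / sr
        modulator = np.sin(2 * np.pi * 2 * base * t)   # simple 2:1 FM ratio
        out += 0.2 * np.sin(phase + 1.5 * modulator * smooth)
    return out
```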
Procedure
After completing Experiment 1, participants proceeded with Experiment 2. They were presented with sound stimuli that they could listen to as many times as they wanted. Participants were asked to rate perceived emotions on a set of five-step scales (sad, joyful, frustrated, relaxed), ranging from not at all (0) to very much (4), with an annotated step size of 1. These are the same scales that were used in our previous study [15]. The sad scale was included, despite the fact that no "sad" stimulus was used, in order to obtain results comparable to those presented in [15]. The participants were given the following instructions: "This sound represents an emotional reaction. Rate how much of the following emotions you perceive in the sound." The presentation order of the stimuli was randomized for each participant.
Analysis
Since the data were collected on scales that displayed equally spaced numeric values, we proceeded with parametric statistical analysis. With 30 observations per category, the data could be assumed to be approximately normally distributed according to the Central Limit Theorem. For the purpose of this study, we analyzed the ratings within each stimulus category (frustrated, relaxed and joyful) through separate two-way repeated-measures ANOVAs with the following within-subjects factors: emotional scale (frustrated, joyful, relaxed and sad) and condition, i.e. sound model (original sound, rhythmic sonification, continuous sonification, original + rhythmic sonification and original + continuous sonification). The purpose of these tests was to investigate whether there was an interaction effect between emotional scale and condition (sound model). Greenhouse-Geisser estimates were used when the assumption of sphericity was not met. When the omnibus test for the interaction was significant, it was followed by a post hoc procedure to explore which pairs of cell means were significantly different, i.e. which conditions resulted in significantly different ratings on each scale compared to the original sound. Paired t-tests with Bonferroni corrections were used to account for multiple comparisons.
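The analysis software is not specified in the paper; as an illustration, an equivalent pipeline could be run in Python with the pingouin package, assuming a long-format table with hypothetical column names (participant, stimulus_category, scale, condition, rating):

```python
# Sketch of the Experiment 2 analysis (assumed file and column names).
import pandas as pd
import pingouin as pg

ratings = pd.read_csv("experiment2_ratings.csv")  # hypothetical long-format data

for category, df in ratings.groupby("stimulus_category"):
    # Two-way repeated-measures ANOVA: emotional scale x sound-model condition.
    # In the paper, Greenhouse-Geisser estimates are applied when sphericity
    # is violated.
    aov = pg.rm_anova(data=df, dv="rating",
                      within=["scale", "condition"],
                      subject="participant", detailed=True)
    print(category)
    print(aov)

    # Post hoc: Bonferroni-corrected paired t-tests, to be interpreted only
    # when the scale x condition interaction in `aov` is significant.
    posthoc = pg.pairwise_tests(data=df, dv="rating",
                                within=["scale", "condition"],
                                subject="participant", padjust="bonf")
    print(posthoc)
```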