In this study, we report the validation results of the EU-Emotion Voice Database, an emotional voice database available for scientific use, containing a total of 2,159 validated emotional voice stimuli. The EU-Emotion voice stimuli consist of audio-recordings of 54 actors, each uttering sentences with the intention of conveying 20 different emotional states (plus neutral). The database is organized in three separate emotional voice stimulus sets in three different languages (British English, Swedish, and Hebrew). These three sets were independently validated by large pools of participants in the UK, Sweden, and Israel. Participants’ validation of the stimuli included emotion categorization accuracy and ratings of emotional valence, intensity, and arousal. Here we report the validation results for the emotional voice stimuli from each site and provide validation data to download as a supplement, so as to make these data available to the scientific community. The EU-Emotion Voice Database is part of the EU-Emotion Stimulus Set, which in addition contains stimuli of emotions expressed in the visual modality (by facial expression, body language, and social scene) and is freely available to use for academic research purposes.
The tone of voice, or prosody, of others is an important cue to understand their affective states. During a social interaction, prosody is key to accurately determining the emotion that others are experiencing (Banse & Scherer, 1996). Even very young infants are capable of recognizing the different intonations of their mothers (Fernald, 1989; Fernald & Morikawa, 1993). Clinical conditions, however, can alter both the production and the recognition of intonation. For instance, atypical emotional prosody has long been considered a hallmark of autism, a neurodevelopmental condition marked by deficits in social communication and interaction (Asperger, 1944; Kanner, 1943). Abnormal tone of voice was noted as the primary contributor to the perceived oddness of people with autism during social interaction (Paul et al., 2005; Van Bourgondien & Woods, 1992), and thus puts them at risk of social exclusion. In addition, many studies have reported a deficit in emotional prosody perception in people with autism (Globerson, Amir, Kishon-Rabin, & Golan, 2015; Golan, Baron-Cohen, Hill, & Rutherford, 2007; Rutherford, Baron-Cohen, & Wheelwright, 2002), which could explain some of their social difficulties. Learning how to recognize and apply emotional prosody is hence a common challenge for people with autism, and an important skill to train in order to increase their chance of social inclusion.
Training recognition and production of prosody in autism
This study is part of a larger project in which the authors (and additional collaborators) developed and evaluated an educational online game to train autistic children 5–10 years old to recognize and produce emotional prosody.
In this article, we report the production and validation of the EU-Emotion Voice Database, a unique and large set of emotional vocal stimuli that were used as training material for the activities aimed at helping children with autism recognize and express emotions in the vocal modality. The EU-Emotion Voice Database also served in the development of a voice analyzer (Marchi et al., 2015). This voice analyzer was trained with the EU-Emotion voice stimuli—via machine learning—to become able to discern the properties of an emotional voice necessary for a particular emotion to be identified by a human listener. Finally, we used the EU-Emotion Voice Database to provide a pool of validated emotional voice stimuli for a psychology experiment investigating differences in emotional prosody recognition between children with and without autism in the UK, Sweden, and Israel (Fridenson-Hayo et al., 2016).
The EU-Emotion Voice Database
The features of the EU-Emotion Voice Database differ from those of other existing emotional voice databases (see Table 1 for an overview of published emotional voice databases) in several ways. First, it includes emotional vocal stimuli for 20 different emotions (plus neutral), whereas most databases are limited to fewer emotions (Bänziger, Mortillaro, & Scherer, 2012; Hawk, Van Kleef, Fischer, & Van Der Schalk, 2009). Most previous emotional voice databases have included at least some of the six basic emotions (fear, anger, surprise, sadness, happiness, and disgust), because those emotions are thought to reflect innate and culturally universal emotion categories (Ekman & Friesen, 1971). However, complex/subtle emotions, which may be culturally dependent and mastered later in life, have rarely been included to any substantial extent in previous emotional voice databases (see Table 1). The presence of several subtle/complex emotions in this database, in addition to the basic six, is hence novel, unique, and important to the study of emotions, as it permits investigation of emotion recognition abilities for complex/subtle emotions expressed through the voice alone.
Second, the database contains emotional voice stimuli, portrayed by a total of 54 actors across a wide age span and across three languages, resulting in a total of more than 2,000 validated stimuli. The inclusion of emotional voice stimuli in three different languages (British English, Swedish, and Hebrew) is another novel aspect of the EU-Emotion Voice Database, as most previous emotion voice databases had stimuli in one language only (see Table 1). Thus, the EU-Emotion Voice Stimuli could be useful for studying emotion perception from vocal cues cross-culturally as a way to shed light on the aspects of emotional prosody that are culture specific, and those that are universal.
Naturally, when creating a large database including a wide range of emotional expressions with the purpose of training children with autism, it is crucial to assess the degree with which each stimulus conveys the intended emotion. Therefore, all the stimuli in the EU-Emotion Voice Database were validated by typically developing adults for emotion recognition accuracy and perceived emotional valence, intensity, and arousal. (For similar validation approaches on vocal emotional stimuli, see Schröder, 2003, and Belin, Fillion-Bilodeau, & Gosselin, 2008.) Here we report the validation results of those stimuli, so as to make them available to the wider scientific community.
Finally, although the EU-Emotion Voice Database is a large and unique pool of stimuli in its own right, it is part of a larger emotional stimulus database, the EU-Emotion Stimulus Set.Footnote 1 This includes emotional stimuli expressed in the visual modality (facial expressions, body language, and social scenesFootnote 2; see O’Reilly et al., 2016). Only one previous database has contained both audio and visual emotional stimuli (Bänziger et al., 2012), but it does not include social-scene stimuli that provide the contextual cues that are potentially important to recognize certain complex emotions.
Voice stimuli creation
Three sets of healthy actors (N = 18 per site, nine females) were recruited to express the different emotions.Footnote 3 The actors in each set were either native speakers of British English, Swedish, or Hebrew. Their age ranged from 10 to 70 years in the UK, and 9 to 67 years in Sweden and 11 to 72 in Israel (see Table 2). The actors were recruited from professional acting agencies or drama schools within the three countries. Trained researchers from each site guided the actors through their performance.
The emotional voices in the EU-Emotion Stimulus Set include 20 emotional states (afraid, angry, ashamed, bored, disappointed, disgusted, excited, frustrated, happy, hurt, interested, jealous,Footnote 4joking, kind, proud, sad, sneaky,Footnote 5surprised, unfriendly, worried) and the neutral state. These 20 emotional states were selected from originally 27 by autism experts (n = 47) and parents of children with autism (n = 88), who perceived them as the most important states for social interaction (see Lundqvist et al., 2013).
There were two different scripts for the sentences to be read by the actors (see the supplementary materials, Table A). These voice scripts were first written in English and then translated to Swedish and Hebrew, using back-translation. Each of those two scripts contained both semantically neutral and semantically emotional sentences for the ten emotional states. The sentences contained two to ten words apiece (mean = 4.64, SD = 1.14). Each actor was assigned to one script (i.e., one set of emotional state), and thus produced both semantically neutral and semantically emotional sentences. The semantically neutral sentences were produced for all different emotions, but the semantically emotional sentences only in the compatible emotion. For each sentence, the actors produced three items or exemplars. The same protocol was used across the three sites.
In the UK and in Sweden, the six basic emotions (anger, disgust, fear, happiness, sadness, and surprise) were portrayed at high and low intensity. The other 14 complex emotional states were expressed at a high intensity only. The following example instruction was given to help guide the intensity of expression across all modalities: “High Intensity—In this situation, you are quite angry, not a little angry, not very angry, but quite and unmistakably angry.” In Israel, the actors were asked to express the emotions naturally, with no separate expressions of high and low intensity (and no explicit instruction). Each actor portrayed only a subset of the ten emotional states (three basic and seven complex). The ten emotional states portrayed by each actor depended on the script they had been assigned (see the supplementary materials, Table A), and the two scripts were enacted by equal numbers of actors with comparable distributions of gender and age (Table 2). The members of the research team provided feedback throughout to guide the actors’ performances. A total of 4,781 voice stimuli (British English k = 1,569, Swedish k = 1,574, Hebrew k = 1,638) were recorded in a soundproof studio at each site.
Voice stimuli validation process
The actors recorded each script three times, and the best portrayal the actor made of that script was selected by the recording company, under the supervision of a trained researcher, to go through for validation. This procedure resulted in the discarding of 56% of the originally recorded stimuli and in the selection of 695 stimuli.
A large proportion of the stimuli (36%) were also discarded in the Swedish sample, due to low acting quality. Similar to the UK, the best portrayals that an actor made of each script was kept, and poor-quality portrayals were discarded. The selection of the stimuli was conducted by the two experienced psychologists involved in the recording of all voice stimuli. This resulted in the selection of 1,011 stimuli.
All recorded stimuli were judged by three members of the Israeli research team. Only those unanimously judged as clearly depicting the target emotion were kept for the validation procedure (72%). This resulted in the selection of 453 stimuli.
The survey structure was first developed in British English and then translated into Swedish and Hebrew (using back-translation) by two native speakers for each language who were also fluent in British English. A total of 84 surveys were distributed (20 in the UK, 30 in Sweden, and 34 in Israel). The online surveys were constructed in such a way that the emotional states were evenly distributed across surveys (to ensure that each emotion category was represented in each survey) and included 34–35 voice stimuli in Sweden and the UK, and 16–18 stimuli in Israel. Each stimulus appeared in one survey only. Each survey took approximately 30 min to complete, and each survey responder responded to only one survey. For each stimulus, survey responders were asked (1) to discriminate the emotion expressed by the voice among six possible choices and (2) to assess the expressed emotion on arousal, valence, and intensity (in the UK and Sweden).
The six possible choices in the discrimination task included the target emotion and five distractors. Among those five distractors were four emotions and a “none-of-the-above” option. The “none-of-the-above” option was proposed in accordance with Frank and Stennett (2001) and O’Reilly et al. (2016), to avoid the possibility of agreement artifacts.Footnote 6 The four emotions operating as distractors were carefully selected among the 20 possible emotional states to make the task equally difficult for all target emotions. Lundqvist et al. (2013) were able to create a similarity/dissimilarity matrix for those 20 emotional states. This matrix was established from over 700 participants rating the similarity/dissimilarity of each of the 20 emotions/mental states involved here against all of the other 20 emotions/mental states. Using this matrix, we classified different ranges of similarity (corresponding to very similar, quite similar, quite dissimilar, and very dissimilar) for each target emotion and selected one distractor emotion in each of those ranges (see emotion similarity/dissimilarity matrix in the supplementary material for details). Importantly, each emotion had an equal chance to be selected as a distractor.
In the UK and Sweden, the dimension analysis of each emotional recording included (a) a question about valence (“how positive or negative is this emotional expression?”) that was rated between 1 (very negative) and 5 (very positive), (b) a question about arousal (“how strongly does this emotion make you feel?”) that was rated between 1 (not at all) and 5 (very strongly) and (c) a question about intensity (“how intense is this emotional expression?”) that was rated between 1 (calm) and 5 (high intensity).
Altogether, a total of 1,739 complete responses were recorded from the three data collection sites (UK: n = 427 [283 females]; Sweden: n = 632 [405 females]; Israel: n = 461 [309 females]). A minimum of 20 survey-responders/participants completed each survey (per data collection site). The average age of the participants was 38 years (range: 18–90) in the UK, 46 years (range: 17–80) in Sweden, and 32 years in Israel (range: 18–79). Participants were recruited using existing research participant databases and university mailing lists, as well as through online resources.
Data treatment and analysis
A raw recognition rate was computed separately for each individual emotional recording. Given that there were six response options, this score was then adjusted for the chance rate using Cohen’s kappa [chance-corrected recognition rate = (proportion of raw correct – (1/6)/(5/6)], as had been the case in previous work of similar nature (Tottenham et al., 2009, and O’Reilly et al., 2016). When the chance-corrected emotion recognition rates (CCRs) were below 0, they were adjusted to 0. We also calculated whether the target emotion was selected above chance with a binomial test for each stimulus. We report the raw emotion recognition rates, the p values for the binomial tests, the CCRs, and the measures of emotional valence, intensity, and arousal (when available) in stimulus item level tables (Table B for the UK, Table C for Sweden, and Table D for Israel) in the supplementary materials. We also report averaged recognition rates (and dimension ratings, when available) across stimuli and respondents for each emotion (and each intensity level, when applicable) and each site in emotion level tables (Table 3 for the UK, Table 4 for Sweden, and Table 5 for Israel; a graphical summary can be found in Fig. 1).Footnote 7 In addition, using Pearson’s correlations, we calculated the intercorrelations between CCRs and ratings of intensity, valence, and arousal for Sweden and the UK overall (Table 6) and for each emotion (Table 7). The data could not be compared across sites due to variation in the experimental conditions, and particularly in the number, age, and sex of the respondents. Finally, we calculated for each site and each emotion category the duration of the emotional voice stimuli (supplementary materials, Table F).
Result and discussion
An overview of the CCRs and emotional rating scores can be found in Table 3 for each emotion and each intensity level (for basic emotions only). The individual data for each voice stimulus that underwent validation in the UK is available in the supplementary material (Table B). The overall CCR for all emotion categories combined was 39% (SD: 31%). This indicates that recognizing an emotion from another’s voice is relatively difficult (as a point of comparison O’Reilly et al., 2016, found a CCR of 63% for recognizing emotions from faces) and variable across stimuli. This variability was apparent both across emotion categories and within an emotion category. Indeed, among the emotions expressed at normal intensity, some were particularly well recognized. This was the case for negative emotions such as worried (mean = 67%, SD = 19%, median = 71%), frustrated (mean = 60%, SD = 26%, median = 60%), and disappointed (mean = 53%, SD = 33%, median = 58%). On the contrary, kind (mean = 23%, SD = 29%, median = 7%), ashamed (mean = 22%, SD = 31%, median = 3%), and jealous (mean = 17%, SD = 24%, median = 4%) had notably low CCRs. Many of the actors who portrayed the UK emotional voices also portrayed (separately) emotions in the face, body, and social context modalities for the EU-Emotion Stimulus Set (O’Reilly et al., 2016). Interestingly, O’Reilly and colleagues also found low CCRs for kind and jealous expressed through facial emotions (9% and 13%, respectively) and for jealous expressed through body language (3%), which was in contrast to the relatively high CCRs when those emotions were represented in social context (61% for kind and 44% for jealous). This suggests that those emotions are difficult to recognize when simply considering the expressive channels of others and are best recognized in context.
In addition, the CCRs of voice stimuli ranged from 0% to above 80% for most emotions, which indicates that the recognizability of the EU-Emotion voice stimuli within an emotion category was highly variable. As is apparent in Table 3, the effect of expressed intensity is not entirely clear here. Some emotions were recognized better when expressed at low intensity (e.g., disgusted and surprised), whereas others were recognized better when expressed at high intensity (e.g., angry and happy). This may be because some emotions (e.g., sadness) are naturally expressed with low-intensity voices, whereas other emotions (e.g., anger) are typically associated with high-intensity voices (e.g., Gopinath, Sheeba, & Nair, 2007), and by producing two levels of intensity per emotions, we may have created congruent and incongruent conditions. The results for the voice stimuli contrast with the results for the EU-Emotion face stimuli, in that those were recognized better at high intensity (O’Reilly et al., 2016). However, these results should be interpreted cautiously, given the smaller number of emotions expressed at low than at high intensity. Nevertheless, as is shown in Table 6a, when all emotions were taken together, CCRs were strongly correlated with ratings of intensity and arousal (themselves correlated) but not with ratings of valence in the UK. This suggests that high perceived intensity and arousal are associated with increased accuracy in recognizing emotions in the vocal modality. However, no correlations between the CCRs and emotional ratings were found for ashamed, bored, disappointed, jealous, sneaky, and joking emotional voices (see Table 7a).
An overview of the CCRs and emotional rating scores can be found in Table 4 for each emotion and each intensity (for basic emotions only). The individual data for each voice stimulus that underwent validation in Sweden are available in the supplementary material (Table C). The overall mean CCR for the Swedish voice stimuli (all emotional categories confounded) was 37% (SD = 31%). This approaches the UK overall CCR very closely and confirms the difficulty of recognizing emotions from the voice of others as well as the variation in recognizability of the EU-Emotion voice stimuli. Further exemplifying this variability, Table 4 shows that the CCRs of voice stimuli ranged from 0% to more than 80% within most emotional categories in Sweden. Nevertheless, among emotions expressed at high intensity, frustrated, disappointed, and bored were particularly well recognized (frustrated: mean = 63%, SD = 25%, median = 73%; disappointed: mean = 61%, SD = 22%, median = 64%; bored: mean = 60%, SD = 20%, median = 66%) whereas jealous, kind, and neutral were particularly poorly recognized (jealous: mean = 18%, SD = 26%, median = 0%; kind = mean = 17%, SD = 21%, median = 8%; neutral: mean = 13%, SD = 13%, median = 13%). This pattern of results is similar to that from the UK, in which frustrated and disappointed were also among the three best-recognized emotions, whereas jealous and kind among the three worst-recognized emotions.
However, unlike in the UK, the CCRs of basic emotions were not dramatically influenced by their levels of expression, except for angry voices, which were recognized better at high than at low intensity (see Table 4). Finally, the correlation analyses showed strong positive correlations between the CCRs and arousal/intensity. This indicates that the higher the perceived intensity/arousal in the voice stimulus, the better the recognition of the expressed emotion in Sweden (see Table 6b). However, no correlations between the CCRs and emotional ratings were found for bored, disappointed, and joking emotional voices (see Table 7b).
An overview of the CCRs and emotional rating scores can be found in Table 5 for each emotion expressed at normal intensity. The individual data for each voice stimulus that underwent validation in Israel are available in the supplementary material (Table D). Across all Israeli voice stimuli, the CCR was 53% (SD = 33%), which is better than the overall CCRs for the UK and Sweden (means = 39% and 37%, respectively). This might be partly due to the absence of low-intensity emotional recordings in Israel, given that low levels of intensity were associated with lower recognition rates for certain emotions in Sweden and the UK, or due to the much higher initial rejection rate of recordings that were accepted for validation in Israel. It could also be due to the fact that the Hebrew actors portrayed a more spontaneous emotion than the British and Swedish actors, since they were not given the instruction to differentiate two levels of speech intensity. There was as much variability as in the UK and Sweden, though, as is shown by CCRs ranging from 0% to 100% for most emotion categories. However, some emotions were recognized particularly well as categories. This was the case for angry (mean = 71%, SD = 25%, median = 78%), frustrated (mean = 65%, SD = 31%, median = 78%), and worried (mean = 72%, SD = 26%, median = 75%). On the contrary, kind (mean = 30%, SD = 30%, median = 24%), hurt (mean = 32%, SD = 36%, median = 19%), and happy (mean = 34%, SD = 33%, median = 23%) were particularly poorly recognized. This is in accordance with the other sites, where kind had a particularly low CCR and frustrated had a particularly high CCR, which suggests that the ability to convey those emotions through the voice is stable across sites.
The EU-Emotion Voice Database is a validated collection of 2,159 emotional voice stimuli in three different languages (695 in British English, 1,011 in Swedish, and 453 in Hebrew), which makes it the largest emotional voice database available for scientific use to date. The overall recognition scores for the emotional voice stimuli sets we found (mean CCR: 39% in the UK, 37% in Sweden, and 53% in Israel) were lower than the overall recognition scores reported by some previous emotional voice databases using sentences (means = 70%, 65.4%, and 72.25% in Polzin & Waibel, 1998, when accuracy was determined, respectively by human performance, an emotion acoustic dependent model, and an emotion-dependent suprasegmental model). This is particularly the case for the auditory stimuli depicting the happy and afraid emotional states (happy: 60 or 94% in Polzin & Waibel, 1998 [depending on the computerized model used] and 19%–34% in our study [depending on the site]; afraid: 73% or 60% in Polzin & Waibel, 1998 [depending on the computerized model used] and 18%–47% in our study [depending on the site]).Footnote 8 However, Polzin and Waibel (1998) only had four basic emotional states (happiness, sadness, anger, and fear) and the neutral state. In contrast, we had 20 emotional states (plus neutral), including many complex ones. Our use of numerous complex emotional states could explain our overall lower recognition scores, as those emotional states are typically harder to recognize from a single perceptual channel than are the basic emotional states. For instance, we found that the complex emotions kind and jealous were among the three most difficult emotions to recognize across all three sites (kind: mean CCRs of 22% in the UK, 17% in Sweden, and 30% in Israel; jealous: mean CCRs of 17% in the UK and 18% in SwedenFootnote 9). In addition, Banse and Scherer (1996) included 14 emotional states (basic and complex) in their database and found an overall mean recognition score of 48%, which is intermediate between the one we observed (with 20 emotions and neutral) and the one reported by Polzin and Waibel (with only four basic emotions and neutral). This further supports the idea that the decreased mean recognition score observed here in comparison to some emotional voice databases was due to our inclusion of many complex emotions, which are more difficult to recognize than basic ones.
Nevertheless, the overall recognition score we observed here for emotions expressed in the auditory modality was lower than the overall emotion recognition score reported by O’Reilly et al. (2016) for emotions expressed in the visual modality, even though those authors used the same 20 emotional states (mean = 63% for the facial modality, 77% in the bodily modality, and 72% for social scenes), as well as a similar age range and gender distribution for their actors and survey responders. This might suggest that it is harder to recognize emotions from others’ voices than from visual cues, a theory that future studies might investigate further. However, it is noteworthy that the lower emotion recognition scores for stimuli in the auditory modality that we found in the present validation study could also simply be due to our inclusion in the database of voice stimuli with low recognition scores and voice stimuli featuring both semantically emotional sentences and semantically neutral sentences (the latter being relatively hard to produce and recognize). Finally, the emotion stimuli expressed in the visual modality were also longer in duration than the ones expressed in the auditory modality (2–52 s, as opposed to 0.5–4.5 s in the auditory modality).
There is some evidence that vocal bursts convey emotion better than do sentences (mean recognition score of 81% in Scherer, 2000 [ten emotions]; see also Hawk et al., 2009). As a result, and also to avoid the linguistic barriers typically associated with the use of sentences, most contemporary databases have applied emotional intonations to vocal bursts rather than to sentences (see Table 1). In this database, however, sentences were used. This atypical choice was constrained by our need to use our emotional voice stimuli for training purposes (i.e., as part of the educational online game). We believe, though, that it contributes to the high ecological validity of our database. Indeed, prosody is in fact the melody of speech, and thus most often is associated with organized speech (sentences) in real life. In addition, the recognition of complex emotions may require stimuli of longer duration than those usually provided through emotional bursts, which we provided through spoken sentences.
Emotion recognition accuracy was variable not only across emotion categories in the EU-Emotion Voice Database, but also within emotion categories. Indeed, within most emotion categories at all three sites, the recognition scores for voice stimuli ranged from 0% to over 80%, which reflects heterogeneity in the recognizability of the voice stimuli included in this database. The stimuli recognized with very little accuracy might, however, be useful for the purpose of machine learning, since, to become more efficient and precise, digital devices need to be trained with emotional stimuli recognized with a good accuracy (as examples that should be recognized by the system), but also with emotional stimuli recognized with poor accuracy (as examples that should not be recognized by the system).
Correlations between the recognition accuracy scores and ratings of valence, arousal, and intensity obtained for each stimulus (in the UK and Sweden) revealed that intensity and arousal were positively associated between themselves and with recognition accuracy, at both sites. Our finding of a positive correlation between ratings of intensity and arousal in the auditory modality are in line with what was found by O’Reilly et al. (2016) in the visual modality, and with the results of Bänziger et al. (2012) across both the visual and auditory modalities, confirming that the arousal and intensity dimensions are strongly associated across modalities. The finding that intensity and arousal correlated with the recognition of emotional voice stimuli is novel, since correlations between recognition scores and ratings of intensity and arousal were not performed as part of previous validation studies in which ratings of both arousal and intensity were collected (Belin et al., 2008; Liu & Pell, 2012), and it suggests that perceived arousal and intensity in the voice could be important cues to the affective state of another. However, it is noteworthy that for certain emotions (e.g., bored, disappointed, joking), the correlations between recognition scores and dimensional ratings (arousal, intensity, valence) were not significant, which suggests that arousal, intensity, and valence are only a crucial factor in emotion recognition accuracy for certain emotions.
To enhance the ecological validity of the emotional vocal stimuli collected for the EU-Emotion Voice Database, we carefully selected professional actors capable of plausibly enacting emotions. Although there might be some differences between the vocal utterances of acted versus experienced emotions (Douglas-Cowie, Campbell, Cowie, & Roach, 2003), we believe that our careful selection of skilled actors (i.e., actors capable of acting in a naturalistic way) minimized those differences. Importantly, to produce the voice stimuli of the EU-Emotion Voice Database, we recruited the largest number of actors ever employed in an emotional voice database (see Table 1), including children and adult actors of both genders. This allowed for increased individual variability in the created emotional voice stimuli, which will be a useful feature for a database training digital devices through machine learning. In addition, this feature might also be important for the experimental study of emotional prosody perception across development, since children could be better at recognizing emotions in peer-aged voice stimuli than in adult voice stimuli. Indeed, although this hypothesis has not yet been tested, in the visual modality, Easter et al. (2005) showed that adolescents were better at recognizing the facial expressions of adolescents than the facial expressions of adults (see also Somerville, Fani, & McClure-Tone, 2011).
The EU-Emotion Voice Database has a number of limitations. First, although emotional voice stimuli were collected in three different languages, a statistical comparison of the validated stimuli across cultures was not possible, due to variation in the experimental parameters across collection sites (i.e., the number of emotions expressed and the intensity levels of expression, number of emotional voice stimuli per emotion category, and number and demographic properties of the participants who participated in the validation). Second, the numbers of stimuli obtained for each of the emotion categories varied within and across sites (N stimuli per emotion category expressed at high/normal intensity: 23 to 42 in the UK, 23 to 52 in Sweden, 6 to 42 in Israel), reducing the interpretability of the differences between emotion categories. Finally, to be validated, the emotional voice stimuli were split among a number of surveys in each country (20 in the UK, 30 in Sweden, and 34 in Israel). Each survey thus included only a selection of the emotional voice stimuli. As a result, survey respondents judged only a subset of the emotional stimuli, which means that there was a degree of heterogeneity in respondents between surveys. In addition, in each country the actors did not enact all emotion categories, but only a subset of those emotion categories that depended on which script they received (see the supplementary material, Table A). This was necessary in order to collect the vast amount of data included in the EU-Emotion Voice Database, but it constrained the type and number of possible data analyses. Finally, the emotional voice stimuli were recorded from actors reading scripts, and not from natural emotional speech. As was outlined by Douglas-Cowie et al. (2003), this is a limitation. Indeed, read speech is distinct from spoken speech (Johns-Lewis, 1986), and lacking context can lead actors to express emotion in a caricatured way.
Nevertheless, the EU-Emotional Voice Database will be particularly useful for future studies investigating the perception of emotional prosody, in that it extends previous emotional databases in its number of emotional voice stimuli (2,159), the number and type of emotion categories (20+ neutral), and the number of actors (18); see Table 1. It is also the only emotional voice database to date that has included emotional voice stimuli in three different languages (British English, Swedish, and Hebrew; see Table 1) from both child and adult actors.
Because the EU-Emotion Voice Database matches the number of emotions and expression intensities that are also part of the EU-Emotion Stimulus Set (O’Reilly et al., 2016), the EU-Emotion materials provide a pool of stimuli portraying emotion in the auditory modality that can be used in conjunction with the visual emotional stimuli (i.e., facial expressions, body language, social scenes). This matching of materials will allow for research into cross-modal deficits of emotion recognition that are present in certain clinical conditions (e.g., autism, but also anorexia, depression, or schizophrenia: Golan, Sinai-Gavrilov, & Baron-Cohen, 2015; Hoekert, Kahn, Pijnenborg, & Aleman, 2007; Kan, Mimura, Kamijima, & Kawamura, 2004; Kucharska-Pietura, Nikolaou, Masiak, & Treasure, 2004) and provide the unique possibility to move recognition assessment and expression training beyond basic emotions toward complex and difficult states.
The research leading to these results received funding from the European Community’s Seventh Framework Programme (FP7/2007-2013) under Grant Agreement No. 289021 (www.asc-inclusion.eu). S.B. was supported by the Swedish Research Council (Grant No. 523-2009-7054), and S.B.-C. was supported by the Autism Research Trust, the MRC, the Wellcome Trust, and the National Institute for Health Research (NIHR) Collaboration for Leadership in Applied Health Research and Care East of England, at the Cambridgeshire and Peterborough NHS Foundation Trust. The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR, or the Department of Health.
Freely available to download: www.autismresearchcentre.com/arc_tests.
Social scenes refer to short videos extracted from popular television series, featuring an emotional state that can only be identified from contextual visual cues (the speech is blurred).
Jealous and neutral were not recorded in Israel.
Sneaky has two different possible translations in Swedish (busig and lömsk), and those translations have very different meanings (busig approximately corresponding to rowdy, lömsk corresponding to sinister), neither of which is a straight or unambiguous translation of sneaky. As a result, the sneaky stimuli were not recorded in Sweden.
The “none-of-the-above” option was selected at rates of 10% in UK, 15% in Sweden, and 4% in Israel.
Table E (tabulation 1–3, supplementary materials) presents the emotion level tables for each site, with stimuli further subdivided on the basis of whether they consisted of semantically neutral or semantically meaningful sentences. Note that, as expected, semantically meaningful sentences are often recognized better than semantically neutral sentences. This is particularly true for certain emotions (e.g., jealousy), indicating that recognizing those emotions may be highly reliant on semantic information.
Unfortunately, emotion recognition accuracy scores determined by human performance were not available in Polzin and Waibel (1998) for each individual emotion (happy, afraid, angry, and sad). This slightly undermines the comparison, since our recognition scores, in contrast, were obtained on the basis of human performance.
Jealous was not recorded in Israel.
Abelin, Å., & Allwood, J. (2000, September). Cross linguistic interpretation of emotional prosody. Paper presented at the ISCA Tutorial and Research Workshop (ITRW) on Speech and Emotion, Newcastle, Northern Ireland.
Asperger, H. (1944). Die “Autistischen Psychopathen” im Kindesalter. European Archives of Psychiatry and Clinical Neuroscience, 117, 76–136.
Banse, R., & Scherer, K. R. (1996). Acoustic profiles in vocal emotion expression. Journal of Personality and Social Psychology, 70, 614–636.
Bänziger, T., Mortillaro, M., & Scherer, K. R. (2012). Introducing the Geneva Multimodal expression corpus for experimental research on emotion perception. Emotion, 12, 1161–1179.
Belin, P., Fillion-Bilodeau, S., & Gosselin, F. (2008). The Montreal Affective Voices: A validated set of nonverbal affect bursts for research on auditory affective processing. Behavior Research Methods, 40, 531–539. https://doi.org/10.3758/BRM.40.2.531
Douglas-Cowie, E., Campbell, N., Cowie, R., & Roach, P. (2003). Emotional speech: Toward a new generation of databases. Speech Communication, 40, 33–60.
Easter, J., McClure, E. B., Monk, C. S., Dhanani, M., Hodgdon, H., Leibenluft, E., ... Ernst, M. (2005). Emotion recognition deficits in pediatric anxiety disorders: Implications for amygdala research. Journal of Child & Adolescent Psychopharmacology, 15, 563–570.
Ekman, P., & Friesen, W. V. (1971). Constants across cultures in the face and emotion. Journal of Personality and Social Psychology, 17, 124–129.
Fernald, A. (1989). Intonation and communicative intent in mothers’ speech to infants: Is the melody the message? Child Development, 1497–1510.
Fernald, A., & Morikawa, H. (1993). Common themes and cultural variations in Japanese and American mothers’ speech to infants. Child Development, 64, 637–656.
Frank, M. G., & Stennett, J. (2001). The forced-choice paradigm and the perception of facial expressions of emotion. Journal of Personality and Social Psychology, 80, 75–85.
Fridenson-Hayo, S., Berggren, S., Lassalle, A., Tal, S., Pigat, D., Bölte, S., ... Golan, O. (2016). Basic and complex emotion recognition in children with autism: Cross-cultural findings. Molecular Autism, 7, 52.
Globerson, E., Amir, N., Kishon-Rabin, L., & Golan, O. (2015). Prosody recognition in adults with high-functioning autism spectrum disorders: From psychoacoustics to cognition. Autism Research, 8, 153–163.
Golan, O., Baron-Cohen, S., Hill, J. J., & Rutherford, M. D. (2007). The “reading the mind in the voice” test—revised: A study of complex emotion recognition in adults with and without autism spectrum conditions. Journal of Autism and Developmental Disorders; 37, 1096–1106.
Golan, O., Sinai-Gavrilov, Y., & Baron-Cohen, S. (2015). The Cambridge Mindreading Face–Voice Battery for Children (CAM-C): Complex emotion recognition in children with and without autism spectrum conditions. Molecular Autism, 6, 22:1–9. https://doi.org/10.1186/s13229-015-0018-z
Gopinath, D. P., Sheeba, P. S., & Nair, A. S. (2007, March). Emotional analysis for Malayalam text to speech synthesis systems. Paper presented at the International Conference on Electronic Science, Information Technology and Telecommunication-SETIT 2007, Tunisia.
Hawk, S. T., Van Kleef, G. A., Fischer, A. H., & Van Der Schalk, J. (2009). “Worth a thousand words”: Absolute and relative decoding of nonlinguistic affect vocalizations. Emotion, 9, 293–305.
Hoekert, M., Kahn, R. S., Pijnenborg, M., & Aleman, A. (2007). Impaired recognition and expression of emotional prosody in schizophrenia: Review and meta-analysis. Schizophrenia Research, 96, 135–145.
Johns-Lewis, C. (1986). Intonation in discourse. San Diego, CA: College Hill Press.
Kan, Y., Mimura, M., Kamijima, K., & Kawamura, M. (2004). Recognition of emotion from moving facial and prosodic stimuli in depressed patients. Journal of Neurology, Neurosurgery & Psychiatry, 75, 1667–1671.
Kanner, L. (1943). Autistic disturbances of affective contact. Nervous Child, 2, 217–250.
Kucharska-Pietura, K., Nikolaou, V., Masiak, M., & Treasure, J. (2004). The recognition of emotion in the faces and voice of anorexia nervosa. International Journal of Eating Disorders, 35, 42–47.
Liu, P., & Pell, M. D. (2012). Recognizing vocal emotions in Mandarin Chinese: A validated database of Chinese vocal emotional stimuli. Behavior Research Methods, 44, 1042–1051. https://doi.org/10.3758/s13428-012-0203-3
Lundqvist, D., Berggren, S., O’Reilly, H., Tal, S., Fridenson, S., Golan, S., … Bölte, S. (2013, May). Recognition and expression of emotions in autism: Clinical significance and hierarchy of difficulties perceived by parents and experts. Paper presented at the 12th Annual International Meeting for Autism Research (IMFAR 2013), International Society for Autism Research (INSAR), San Sebastián, Spain.
Marchi, E., Schuller, B., Baron-Cohen, S., Lassalle, A., O’Reilly, H., Pigat, D., … Berggren, S. (2015, March). Voice emotion games: Language and emotion in the voice of children with autism spectrum condition. Paper presented at the 3rd International Workshop on Intelligent Digital Games for Empowerment and Inclusion (IDGEI 2015), part of the 20th ACM International Conference on Intelligent User Interfaces, IUI, Atlanta, GA.
Niimi, Y., Kasamatsu, M., Nishimoto, T., & Araki, M. (2001, August). Synthesis of emotional speech using prosodically balanced VCV segments. Paper presented at the 4th ISCA Tutorial and Research Workshop (ITRW) on Speech Synthesis, Perthshire, Scotland.
O’Reilly, H., Pigat, D., Fridenson, S., Berggren, S., Tal, S., Golan, O., … Lundqvist, D. (2016). The EU-Emotion Stimulus Set: A validation study. Behavior Research Methods, 48, 567–576. https://doi.org/10.3758/s13428-015-0601-4
Paul, R., Shriberg, L. D., McSweeny, J., Cicchetti, D., Klin, A., & Volkmar, F. (2005). Relations between prosodic performance and communication and socialization ratings in high functioning speakers with autism spectrum disorders. Journal of Autism and Developmental Disorders, 35, 861–869.
Pell, M. D., Paulmann, S., Dara, C., Alasseri, A., & Kotz, S. A. (2009). Factors in the recognition of vocally expressed emotions: A comparison of four languages. Journal of Phonetics, 37, 417–435.
Pereira, C. (2000, September). Dimensions of emotional meaning in speech. Paper presented at the ISCA Tutorial and Research Workshop (ITRW) on Speech and Emotion, Newcastle, Northern Ireland.
Polzin, T. S., & Waibel, A. (1998, January). Detecting emotions in speech. Paper presented at the Second International Conference on Cooperative Multimodal Communication, CMC 98, Tilburg, The Netherlands.
Rutherford, M. D., Baron-Cohen, S., & Wheelwright, S. (2002). Reading the mind in the voice: A study with normal adults and adults with Asperger syndrome and high functioning autism. Journal of Autism and Developmental Disorders, 32, 189–194.
Scherer, K. R. (2000). A cross-cultural investigation of emotion inferences from voice and speech: Implications for speech technology. In Proceedings of INTERSPEECH 2000 (Vol. 2, pp. 379–382). Beijing, China: ISCA. Retrieved from http://dblp.uni-trier.de/db/conf/interspeech/interspeech2000.html
Scherer, K. R., & Ellgring, H. (2007). Are facial expressions of emotion produced by categorical affect programs or dynamically driven by appraisal? Emotion, 7, 113–130. https://doi.org/10.1037/1528-35126.96.36.199
Schröder, M. (2003). Experimental study of affect bursts. Speech Communication, 40, 99–116. https://doi.org/10.1016/S0167-6393(02)00078-X
Somerville, L. H., Fani, N., & McClure-Tone, E. B. (2011). Behavioral and neural representation of emotional facial expressions across the lifespan. Developmental Neuropsychology, 36, 408–428.
Tottenham, N., Tanaka, J.W., Leon, A.C., McCarry, T., Nurse, M., Hare, T.A., … Nelson, C. (2009). The NimStim set of facial expressions: judgments from untrained research participants. Psychiatry Research, 168, 242–249.
Van Bourgondien, M. E., & Woods, A. V. (1992). Vocational possibilities for high-functioning adults with autism. In E. Schopler & G. B. Mesibov (Eds.), High-functioning individuals with autism (pp. 227–239). New York, NY: Plenum Press.
Electronic supplementary material
(DOCX 15 kb)
Script A and B listing the sentences said by the actors for each emotional category. In yellow, we indicate the sentences that were semantically congruent with the emotion of interest. (DOCX 34 kb)
Item level summary table of validation data in the UK. Stimuli are ordered from the least to the best recognized for each emotional category. In yellow, we indicate the sentences that were semantically congruent with the emotion of interest. (XLSX 157 kb)
Item level summary table of validation data in Sweden. Stimuli are ordered from the least to the best recognized for each emotional category. In yellow, we indicate the sentences that were semantically congruent with the emotion of interest. (XLSX 215 kb)
Item level summary table of validation data in Israel. Stimuli are ordered from the least to the best recognized for each emotional category. In yellow, we indicate the sentences that were semantically congruent with the emotion of interest. (XLSX 105 kb)
Summary of the validation data including mean, range and median chancecorrected recognition rates (CCR) as well as mean valence, arousal and intensity (when applicable) for all emotions (per intensity of expression when applicable, and separately depending on whether the sentence was semantically meaningful or semantically neutral) for each site (tabulation 1: UK, tabulation 2: Sweden, tabulation 3: Israel). SD refers to standard deviation. (XLSX 65 kb)
Average duration of the emotional voice stimuli per emotion category in the UK (left) and in Sweden (right). SD refers to standard deviation. (XLSX 36 kb)
About this article
Cite this article
Lassalle, A., Pigat, D., O’Reilly, H. et al. The EU-Emotion Voice Database. Behav Res 51, 493–506 (2019). https://doi.org/10.3758/s13428-018-1048-1
- Voice stimuli set
- Multisite validation
- Emotion perception