Introduction

Humans can usually infer each other’s emotions and intentions reliably from speech cues alone, in the absence of visual information, as routine telephone conversations demonstrate daily (Paulmann & Pell, 2011). In such cases, the vocal parameters of speech, including fluctuations in pitch, loudness, speech rate, and so forth, which are collectively referred to as emotional prosody, play a crucial role in conveying the speaker’s affective disposition and emotional state to listeners (Banse & Scherer, 1996).

In the growing literature on emotional prosody, many researchers are conducting neuropsychological (e.g., Adolphs, Damasio, & Tranel, 2002; Adolphs, Tranel, & Damasio, 2001; Pell & Leonard, 2003; Ross & Monnot, 2008) and neuroimaging (e.g., Grandjean et al., 2005; Paulmann & Kotz, 2008; Paulmann, Pell, & Kotz, 2008) experiments to investigate the neurocognitive mechanisms underlying emotional prosody. Other studies are exploring how speakers use prosody to encode discrete emotions (e.g., happiness, anger, disgust; see Ekman, 1992a, b) and how listeners effectively decode these cues in speech, to advance knowledge of the particular perceptual-acoustic characteristics of vocal emotion expressions (e.g., Castro & Lima, 2010; Pell, Monetta, Paulmann, & Kotz, 2009; Pell, Paulmann, Dara, Alasseri, & Kotz, 2009; Scherer, Banse, & Wallbott, 2001; Thompson & Balkwill, 2006; for a review, see Juslin & Laukka, 2003). Given that emotional prosody is fully embedded in language and can be influenced by the linguistic properties of a specific language (Pell, 2001), many researchers have needed to construct valid emotional stimuli appropriate to the linguistic background of the participants under study. This time-consuming step is essential if emotional prosody research is to be conducted in different language contexts; validated recordings allow researchers to control the linguistic content of their stimuli and thereby “isolate” the effects of prosody from concurrent language features (e.g., lexico-semantic cues pertaining to emotion) in each language. While valid vocal emotional stimuli have been established mostly in Indo-European languages (e.g., Banse & Scherer, 1996; Baum & Nowicki, 1998; Burkhardt, Paeschke, Rolfes, Sendlmeier, & Weiss, 2005; Castro & Lima, 2010; Juslin & Laukka, 2001; Nowicki & Duke, 1994; Pell, Monetta, et al., 2009; Pell, Paulmann, et al., 2009; Scherer et al., 2001; for a review, see Juslin & Laukka, 2003), this study establishes a validated database of vocal emotional stimuli in Mandarin Chinese, a major Sino-Tibetan language spoken by more than a billion people around the world.

Emotional prosody in Mandarin has been studied in various ways by researchers, although there have been few attempts to develop a well-controlled database of vocal emotional stimuli for future work in Mandarin Chinese (cf. You, Chen, & Bu, 2005). Many previous studies used emotional utterances taken from Chinese movies or TV shows/broadcasts as stimuli (Tao, Kang, & Li, 2006; You et al., 2005; Yu, Chang, Xu, & Shum, 2001; Zhang, 2008), which limits control of the linguistic content of speech for studying vocal emotions. Other researchers recruited speakers to produce emotional utterances that had a concurrent emotional semantic context (Anolli, Wang, Mantovani, & De Toni, 2008; You et al., 2005) or to produce “semantically neutral” but emotionally inflected sentences as vocal stimuli (e.g., Li, Shao, & Dang, 2009; Pao, Yeh, & Chen, 2008; You et al., 2005; Zhang, Ching, & Kong, 2006). A potential problem of the latter approaches is that prosody cannot be studied independently of corresponding semantic cues and, also, that “semantically neutral” sentences can sometimes promote unexpected interpretations by listeners when combined with different emotional prosodic meanings (see Pell, 2006, for a discussion). For these reasons, many researchers have constructed language-like pseudoutterances that can be produced by speakers to vocally encode emotions in a relatively naturalistic manner (e.g., Castro & Lima, 2010; Pell, Paulmann, et al., 2009; Scherer, Banse, Wallbott, & Goldbeck, 1991), an approach that was also adopted here.

Another key methodological consideration is the process for perceptually validating vocal emotional stimuli for future use. Many current studies of Mandarin provide only sparse details about this process (You et al., 2005; Zhang, 2008; Zhang et al., 2006) or refer to subjective assessments of their emotional speech corpus (Pao et al., 2008), often involving a very small group of listeners (e.g., two or four; Yu et al., 2001; Zhang, 2008). In some cases, valid emotional stimuli were selected on the basis of whether they sounded “typical” or “effective” to listeners (Thompson & Balkwill, 2006; Zhang, 2008). To establish a database that is suitable for different investigative purposes, it will be important to avoid subjective assessments and to mitigate the influence of individual biases in emotion recognition by collecting a more robust set of perceptual data that define vocal emotional stimuli in Mandarin. As well, since most previous studies investigated a limited number of discrete emotions (e.g., three or four; Pao et al., 2008; Tao et al., 2006; Yu et al., 2001; Zhang, 2008; Zhang et al., 2006), a new database should seek to provide validated stimuli representing a broader set of emotional meanings currently being studied in the literature.

The primary aim of this study was to establish a validated database of vocal emotional stimuli in Mandarin Chinese using a well-controlled validation procedure, involving seven emotion categories: anger, happiness, sadness, fear, disgust, pleasant surprise, and a nonemotional category (neutrality). Within a discrete emotions framework, these six emotions (excluding neutrality) are typically considered basic human emotions, each with a distinct biological basis and expressive qualities that are universally shared across cultures and languages (Ekman, 1992a, b; Ekman, Sorenson, & Friesen, 1969). While an inventory of vocal expressions of the basic emotions has been established in several languages (e.g., Castro & Lima, 2010; Pell, Paulmann, et al., 2009), this has not been fully achieved in Mandarin. Our inventory should therefore be a useful tool for conducting basic research on a range of vocal emotions in Mandarin and for conducting cross-cultural/cross-linguistic studies of vocal emotion communication, and these stimuli could be incorporated into assessments of emotional and social functions involving the Mandarin-speaking population.

To facilitate the comparability of our data with those in the published literature (e.g., Castro & Lima, 2010; Pell, Paulmann, et al., 2009), the present study adopted the procedures of Pell, Paulmann, et al. (2009) in their comparative study of vocal emotion expressions in English, Arabic, German, and Hindi. As was noted earlier, an important methodological issue in the literature concerns how to control the linguistic/semantic content of speech stimuli that carry vocal cues about emotion; following Pell, Paulmann, et al. (2009), we required speakers to produce emotionally inflected pseudosentences (e.g., in English: The fector jabbored the tozz) specially constructed for Mandarin Chinese. These stimuli were composed of pseudo content words conjoined by (real) function words, rendering them semantically meaningless but ensuring that the phonetic/segmental and suprasegmental properties were appropriate to native Mandarin speakers/listeners. Similar pseudoutterances have been constructed for a range of languages (e.g., English, German, Hindi, Arabic, and Portuguese; see Castro & Lima, 2010; Pell, Paulmann, et al., 2009; Scherer et al., 1991). Here, Chinese pseudosentences were elicited from four native Mandarin speakers to convey the seven target emotional meanings (anger, happiness, sadness, fear, disgust, pleasant surprise, and neutrality). The recordings were then entered into a perceptual rating study in which a group of 24 native Mandarin listeners judged the emotion being expressed by each item in a seven-alternative forced-choice task. On the basis of the results of the perceptual study, items whose intended emotional meaning was recognized at a critical consensus rate (i.e., three times chance performance, or 42.86 %) were included in the validated database, and acoustic analyses were conducted on these valid items to specify the acoustic characteristics of vocal emotions in Mandarin Chinese.
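For reference, the recognition criterion follows directly from the number of response alternatives in the forced-choice task; expressed as a simple formula,

    \[ p_{\text{chance}} = \tfrac{1}{7} \approx 14.29\,\%, \qquad p_{\text{criterion}} = 3 \times p_{\text{chance}} = \tfrac{3}{7} \approx 42.86\,\%. \]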

Method

There were four parts to the study. First, a set of pseudosentences in Mandarin Chinese was constructed and validated to ensure that they were perceived as “language-like” by native Mandarin-speaking listeners (sentence construction). Second, native Mandarin speakers were recruited to produce the pseudoutterances to express seven emotional meanings (stimulus recording). Third, these recordings were validated by a second group of Mandarin-speaking listeners who perceptually identified the emotion expressed by each item (perceptual validation and selection). Finally, the perceptually validated subset of stimuli was subjected to acoustic analysis to better understand the link between perceptual and acoustic features of the stimuli (acoustic study). As described below, all testing took place in Montréal, Canada, but was conducted entirely in Mandarin Chinese at each stage of the investigation.

Sentence construction

Forty-five pseudosentences were created by the author, P.L., who is a native Mandarin speaker from North China. They were constructed by replacing content words with random Chinese characters that were semantically meaningless within the sentence context, while maintaining function words to convey grammatical information (see the Appendix for examples). To ensure that pseudosentences were both semantically meaningless and relatively plausible as Chinese sentences, a pilot study was conducted.

Participants

Ten native Mandarin speakers (5 female/5 male; mean age = 25.2 ± 2.6 years) were recruited in the pilot study. These participants were university students from China who had learned Mandarin from birth, had lived in China until at least 18 years of age, had been away from China for less than 2 years, and spoke English as a second language. Each participant was compensated $10 CAD per hour for their participation.

Procedure

The 45 pseudosentences were presented to each participant in random order on a computer screen. Participants were asked to rate the degree of “language-likeness” (i.e., the extent to which the pseudosentence resembled a real Chinese sentence) on a 5-point scale from −2 to +2, where −2 indicated very unlike and +2 indicated very like a real Chinese sentence. All instructions were given in Mandarin.

Results

A rating score was calculated for each pseudosentence by averaging the ratings across the 10 participants. A subset of 35 pseudosentences with rating scores above 0 (mean rating, .53; standard deviation [SD], .43) was selected as the most language-like of the original 45 items constructed. The selected pseudosentences had a mean length of 9.09 characters/syllables (range: 7–12 characters/syllables).
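For illustration, this selection step amounts to a simple aggregation over the pilot ratings. The following Python sketch assumes a hypothetical long-format file (pilot_ratings.csv, with columns sentence_id and rating) rather than the actual data files used in the study.

    import pandas as pd

    # Hypothetical long-format pilot data: one row per (pseudosentence, participant).
    ratings = pd.read_csv("pilot_ratings.csv")        # columns: sentence_id, rating (-2..+2)

    mean_ratings = ratings.groupby("sentence_id")["rating"].mean()
    selected = mean_ratings[mean_ratings > 0]         # keep items rated above the scale midpoint

    print(f"{selected.size} of {mean_ratings.size} pseudosentences retained "
          f"(M = {selected.mean():.2f}, SD = {selected.std():.2f})")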

Stimulus recording

Participants

Four native Mandarin speakers (2 female, 2 male) with a mean age of 24.3 (±4.6) years were recruited as encoders to produce vocal emotion expressions in Mandarin. Participants responded to an advertisement posted at McGill University, Montréal (Canada) and were selected for having lay experience in broadcasting or public speaking in Mandarin Chinese while in China (e.g., as a member of a campus radio station at a Chinese university). All encoders were university students from China, had learned Mandarin from birth, and had lived in China until at least 18 years of age; none had been away from China for more than 2 years. They all spoke standard Mandarin without a noticeable regional accent and spoke English as a second language. Participants were compensated $10 CAD per hour for their participation.

Materials

The 35 pseudosentences selected in the pilot study were used as materials to elicit emotional expressions from the 4 encoders in seven emotion categories: anger, disgust, fear, sadness, happiness, pleasant surprise, and neutrality. In addition to the pseudosentences, a separate list of 45 semantically meaningful Chinese sentences was constructed for each emotion category and was also produced by the encoders in each target emotion. Following Pell, Paulmann, et al. (2009), the semantically meaningful sentences were employed as practice to help encoders produce the pseudoutterances more effectively and naturally. Data pertaining to the semantic utterances were not the object of this study, which focused on how emotional prosody operates independently of semantic context.

Elicitation and recording procedure

Each encoder was recorded separately in a sound-attenuated recording booth. Utterances conveying the seven emotional meanings were recorded in separate blocks, whose order varied across encoders. For each category, the encoder first produced the semantic utterances to express the target emotion as practice, followed by the pseudoutterances. Each of the 35 pseudoutterances was presented one at a time on a computer screen. Encoders were instructed to produce the pseudoutterance in the target emotion as if talking to the experimenter, in a way that was as natural as possible. During recording, the main author (P.L.) and a research assistant, both native Mandarin speakers, monitored the recording process and provided prompts (e.g., verbal scenarios) to help the encoder produce the target emotion effectively; however, they never provided vocal examples of the target emotion to the encoders. All instructions and communications during testing were in Mandarin. Breaks were inserted between blocks to ease the transition between different emotions. All pseudoutterances were recorded using a Tascam digital recorder and a high-quality head-mounted microphone; the digital recordings were then transferred to a computer and edited into individual .wav sound files for each utterance, using Praat speech analysis software (Boersma & Weenink, 2001). The average duration of the edited pseudoutterances was 1.65 ± 0.41 s, although this varied considerably by emotion type, as was expected (happiness, 1.62 s; anger, 1.40 s; sadness, 1.86 s; disgust, 2.14 s; fear, 1.47 s; surprise, 1.58 s; neutrality, 1.48 s).

Perceptual validation and selection

The edited pseudoutterances were entered into a perceptual validation study to evaluate how they were perceived by a group of native listeners. On the basis of the perceptual data, a valid subset of the utterances that reliably conveyed each target emotion could be identified.

Participants

Twenty-four native listeners, or decoders (12 female, 12 male), with a mean age of 25.5 (±3.3) years were recruited for the perception study. Again, they were students from China who had learned Mandarin from birth, had lived in China until at least 18 years of age, had been away from China for less than 2 years, and spoke English as a second language. Each participant was compensated $10 CAD per hour for their participation.

Materials and procedure

The total number of pseudoutterances produced by the speakers was 980 (35 pseudosentences × 7 emotions × 4 speakers). However, one item from the sadness category of a male speaker was removed due to a recording artifact, leaving 979 items to be entered into the validation study. Following Pell, Paulmann, et al. (2009), the peak amplitude of all utterances was normalized to 75 dB to mitigate gross differences in perceived loudness among utterances recorded during different testing sessions. Using Superlab presentation software (Cedrus, U.S.), the 979 utterances were randomly ordered and divided into four blocks, which were presented in two testing sessions, two blocks per session (sessions were usually separated by a day). During testing, each utterance was played once over headphones; stimulus presentation was controlled by the Superlab program, which recorded mouse-click responses. Decoders made two judgments after each item: First, they identified which emotion was being expressed by the speaker from a list of the seven categories presented on the computer screen; then, with the exception of items identified as “neutral,” participants were immediately presented with a 5-point rating scale on the screen to rate the intensity of the emotion that had been recognized (where 1 referred to very weak and 5 referred to very intense). All participants received practice trials prior to the first block of each testing session and frequent breaks during each session. All instructions and communication during the perceptual testing were conducted entirely in Mandarin.
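As a rough illustration of the peak-amplitude normalization described above, the Python sketch below rescales a waveform so that its peak corresponds to 75 dB SPL. It assumes the samples are expressed on Praat’s Pascal scale (reference pressure 2 × 10⁻⁵ Pa) and uses hypothetical file names; it sketches the general idea rather than the exact Praat procedure used in the study.

    import numpy as np
    import soundfile as sf

    def normalize_peak_to_db(in_path, out_path, target_db=75.0, ref_pressure=2e-5):
        """Scale a recording so that its peak amplitude corresponds to target_db
        (dB SPL), assuming samples are expressed in Pascal (Praat's convention)."""
        samples, sample_rate = sf.read(in_path)
        target_peak = ref_pressure * 10 ** (target_db / 20.0)   # ~0.11 Pa for 75 dB
        samples = samples * (target_peak / np.max(np.abs(samples)))
        sf.write(out_path, samples, sample_rate)

    normalize_peak_to_db("utterance_001.wav", "utterance_001_norm.wav")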

Selection

The accuracy of each decoder in identifying utterances conveying each emotion was first calculated; data for one male decoder were subsequently removed from the following analyses, since he selected “neutral” for the vast majority of items. On the basis of the accuracy data, a subset of perceptually robust or valid items was selected from the original 979 recordings (Pell, Paulmann, et al., 2009). To eliminate items that were poorly encoded in the recording sessions, two criteria were adopted for selecting valid items: (1) a recognition rate of at least 42.86 % (i.e., three times chance performance in the seven-choice emotion recognition task) for the target emotion (Castro & Lima, 2010; Pell, Paulmann, et al., 2009) and (2) recognition rates below 42.86 % for every other emotion category for that item (Castro & Lima, 2010). For each emotion category, items that failed to meet these two criteria were removed from the set of “valid” utterances. Among the removed items, those with a recognition rate lower than 42.86 % for the target emotion, but with a recognition rate of at least 42.86 % for another, nontarget emotion, were regrouped into that nontarget emotion category.
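The selection and regrouping logic can be sketched in Python as follows, assuming a hypothetical long-format response table (validation_responses.csv, with columns item_id, target, and choice). Where more than one nontarget category reaches the criterion, the sketch assigns the item to the best-recognized category, which is an assumption rather than a rule stated in the text.

    import pandas as pd

    responses = pd.read_csv("validation_responses.csv")   # columns: item_id, target, choice
    CRITERION = 3 / 7                                      # three times chance (~42.86 %)

    valid, regrouped, discarded = [], [], []
    for item_id, item in responses.groupby("item_id"):
        rates = item["choice"].value_counts(normalize=True)     # recognition rate per category
        target = item["target"].iloc[0]
        target_rate = rates.get(target, 0.0)
        others = rates.drop(labels=[target], errors="ignore")
        if target_rate >= CRITERION and (others < CRITERION).all():
            valid.append((item_id, target))                      # criteria (1) and (2) both met
        elif target_rate < CRITERION and (others >= CRITERION).any():
            regrouped.append((item_id, others.idxmax()))         # reassigned to the nontarget category
        else:
            discarded.append(item_id)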

The frequency and proportion of valid items for each emotion and speaker are summarized in Table 1. The selection and regrouping of items led to the inclusion of 89 % (874/979) of the original items in the seven emotion categories (speaker C.C., 83 %; G.R.W., 90 %; N.Z., 90 %; T.F.G., 94 %). As is shown in Table 1, pleasant surprise was associated with the lowest proportion of valid items (34 %) among the seven categories and was frequently identified as sounding happy for all four speakers. As is also evident in Table 1, individual differences in the ability to express particular emotions were also observed among the 4 encoders.

Table 1 Frequency of perceptually valid items observed for each emotion category for each of the four speakers

Acoustic study

The selected subset of 874 perceptually validated items was subjected to acoustic analyses to evaluate basic acoustic patterns that differentiate vocally expressed emotions in Mandarin. On the basis of the previous literature, these analyses focused on six major acoustic parameters of vocal emotion expressions that are widely studied: mean fundamental frequency (mean f0, in Hertz), fundamental frequency variation (f0 range, in Hertz), amplitude variation (amplitude range, in decibels), mean harmonics-to-noise ratio (mean HNR, in decibels), HNR variation (SD of HNR, in decibels), and speech rate (in syllables per second). The observed values of mean f0, maximum f0, minimum f0, maximum amplitude, minimum amplitude, mean HNR, SD of HNR, and utterance duration for each item were obtained in Praat, allowing the six parameters of interest to be calculated prior to statistical analyses. Following Pell, Paulmann, et al. (2009), in order to correct for individual differences in a speaker’s mean voice pitch, all f0 measures (mean, maximum, and minimum f0) were normalized in relation to the individual “resting frequency” of each speaker (i.e., the average minimum f0 value of all neutral utterances produced by that speaker; see Pell, Paulmann, et al., 2009, for details). Measures of f0 range were then calculated by subtracting the normalized minimum f0 values from the normalized maximum f0 values. For both normalized values of mean f0 and f0 range, a value of 1 for an utterance represents a 100 % increase in the speaker’s resting frequency, which, as a proportional value, could be compared across speakers. Similarly, in order to correct for individual differences in a speaker’s intensity, the amplitude values (maximum amplitude, minimum amplitude) of each speaker were normalized in relation to the average minimum amplitude value of all neutral utterances produced by that speaker. Then the measures of amplitude range were calculated by subtracting the normalized minimum amplitude values from the normalized maximum amplitude values. Measures of speech rate were calculated by dividing the number of syllables of each utterance by the corresponding utterance duration, in syllables per second.
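As a rough sketch of these normalization steps, the Python code below assumes the raw Praat measures have been exported to a hypothetical table (praat_measures.csv) with one row per valid item and columns speaker, emotion, mean_f0, max_f0, min_f0, max_amp, min_amp, n_syllables, and duration; the column names, and the treatment of amplitude as strictly analogous to f0, are assumptions for illustration.

    import pandas as pd

    items = pd.read_csv("praat_measures.csv")

    # Speaker-specific reference values: averages over that speaker's neutral utterances.
    neutral = items[items["emotion"] == "neutrality"]
    rest_f0 = neutral.groupby("speaker")["min_f0"].mean()     # "resting frequency" (Hz)
    ref_amp = neutral.groupby("speaker")["min_amp"].mean()    # reference amplitude (dB)

    f0_ref = items["speaker"].map(rest_f0)
    amp_ref = items["speaker"].map(ref_amp)

    # Proportional normalization: a value of 1 = a 100 % increase over the reference.
    items["norm_mean_f0"] = (items["mean_f0"] - f0_ref) / f0_ref
    # Range = normalized maximum minus normalized minimum, which reduces to (max - min) / reference.
    items["norm_f0_range"] = (items["max_f0"] - items["min_f0"]) / f0_ref
    items["norm_amp_range"] = (items["max_amp"] - items["min_amp"]) / amp_ref
    items["speech_rate"] = items["n_syllables"] / items["duration"]   # syllables per second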

Results and discussion

Perceptual results

Recognition rates

On the basis of the selected subset of 874 items, emotion recognition rates were calculated for each emotion category across speakers. The hit rate was calculated as an uncorrected measure of target category recognition (percent correct), according to which neutrality (86 %) was recognized best, followed by anger (82 %), sadness (81 %), fear (80 %), happiness (70 %), disgust (67 %), and finally, pleasant surprise (56 %). These data were then converted to Hu scores (Wagner, 1993) to correct for differences in item frequency among categories and individual-participant response biases (i.e., relative use of specific response alternatives). Hu scores were calculated for each participant as the joint probability that a presented item was correctly recognized and that the corresponding target category was correctly used.
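A minimal sketch of the Hu-score computation (Wagner, 1993) is given below in Python; it assumes one confusion matrix per decoder, with presented categories as rows and chosen categories as columns in the same order, and leaves the score undefined when a response category was never used.

    import numpy as np
    import pandas as pd

    def unbiased_hit_rates(confusion: pd.DataFrame) -> pd.Series:
        """Hu scores from a single decoder's confusion matrix
        (rows = presented category, columns = chosen category, cells = counts)."""
        counts = confusion.values.astype(float)
        correct = np.diag(counts)                 # hits per target category
        presented = counts.sum(axis=1)            # items presented per category
        chosen = counts.sum(axis=0)               # times each response alternative was used
        # Joint probability: item correctly recognized AND target category correctly used.
        hu = (correct / presented) * (correct / np.where(chosen > 0, chosen, np.nan))
        return pd.Series(hu, index=confusion.index)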

To evaluate whether the seven emotion categories could be differentiated perceptually, a one-way ANOVA performed on the Hu scores as a function of emotion category showed a significant effect of emotion category, F(6, 154) = 22.95, p < .0001. Post hoc (Tukey’s) comparisons revealed that fear (.673) had the highest recognition among the seven categories, followed by neutrality (.667), sadness (.643), anger (.643), and disgust (.573); this was followed by happiness (.541), which was significantly lower than fear (.673; p < .05). Pleasant surprise (.253) was recognized significantly less accurately than the other six categories (ps < .01; see Fig. 1 for an illustration).
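An analysis of this kind can be reproduced with standard Python tooling; the sketch below assumes a hypothetical long-format table of Hu scores (hu_scores.csv, one row per decoder and emotion category). The same pattern applies to the intensity-rating ANOVA reported further below.

    import pandas as pd
    from scipy.stats import f_oneway
    from statsmodels.stats.multicomp import pairwise_tukeyhsd

    hu = pd.read_csv("hu_scores.csv")                     # columns: decoder, emotion, hu

    # One-way ANOVA with emotion category as the grouping factor.
    groups = [g["hu"].values for _, g in hu.groupby("emotion")]
    F, p = f_oneway(*groups)
    print(f"F = {F:.2f}, p = {p:.4g}")

    # Tukey's HSD post hoc comparisons between the emotion categories.
    print(pairwise_tukeyhsd(hu["hu"], hu["emotion"]))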

Fig. 1 Mean recognition rate (Hu score) for seven vocal expressions of emotion in Mandarin Chinese (across speakers; error bars indicate the standard deviations)

On the basis of the analysis of unbiased accuracy rates (i.e., Hu scores; Wagner, 1993), there was expected variation in the recognition of discrete emotions from vocal cues in Mandarin. Specifically, recognition was most accurate for fear, neutrality, sadness, and anger, compatible with previous data in other languages (Banse & Scherer, 1996; Castro & Lima, 2010; Pell, Paulmann, et al., 2009; Thompson & Balkwill, 2006). A general advantage for recognizing negative emotions from vocal speech cues, independently of language, is compatible with evolutionary views that vocal signals associated with threat must be highly salient to ensure human survival (Ohman, Flykt, & Esteves, 2001; Tooby & Cosmides, 1990). Disgust and happiness were recognized relatively poorly in Mandarin, which is also compatible with previous findings (e.g., Banse & Scherer, 1996; Castro & Lima, 2010; Pell, 2002; Pell, Paulmann, et al., 2009; Scherer et al., 1991). It seems likely that happiness and disgust are expressed more saliently in other communication channels, such as through facial cues (Ekman, 1994; Elfenbein & Ambady, 2002; Russell, 1994; Wallbott, 1988), yielding lower accuracy rates when these two emotions must be recognized from isolated vocal cues (Elfenbein & Ambady, 2002; Pell & Kotz, 2011; Pell, Paulmann, et al., 2009). In the case of disgust, this emotion may be conveyed predominantly by nonverbal vocalizations, rather than by vocal inflections of the whole utterance (e.g., Castro & Lima, 2010; Paulmann & Pell, 2011; Scherer et al., 1991). As was expected, pleasant surprise turned out to be the most difficult emotion to identify in Mandarin, with the lowest recognition rate (see also Pell, Paulmann, et al., 2009). It is widely held that surprise is especially difficult to produce and to recognize naturally in a simulated experimental context such as the one employed here; moreover, low accuracy rates for pleasant surprise observed here and by Pell, Paulmann, et al. (2009) may be due to the positive valence of these expressions, which were frequently identified in the perceptual study as intense forms of “happiness,” rather than surprise (see Abelin & Allwood, 2000; Castro & Lima, 2010; Montero et al., 1999; Navas, Hernáez, Castelruiz, & Luengo, 2004).

Intensity ratings

The intensity ratings assigned to the 763 utterances recognized as conveying an emotion (rather than neutrality) were calculated for each emotion category across speakers and were submitted to a one-way ANOVA with six levels of emotion (anger, disgust, fear, sadness, happiness, pleasant surprise). The ANOVA revealed a significant effect of emotion category on intensity ratings, F(5, 132) = 3.72, p < .005. Post hoc (Tukey’s) elaboration of the emotion effect indicated that pleasant surprise (3.57 ± 0.61) and anger (3.41 ± 0.52) were rated relatively high on this scale, followed by fear (3.28 ± 0.53), disgust (3.26 ± 0.67), and sadness (3.20 ± 0.51); however, none of these differences reached statistical significance. Happiness (2.86 ± 0.66) was rated as the least intense emotion, differing significantly from both surprise (3.57 ± 0.61) and anger (3.41 ± 0.52; ps < .05).

Acoustic results

The six acoustic measures of the 874 valid expressions are presented in Table 2 for each emotion category, averaged across speakers. To explore how vocal expressions of the seven emotion categories differed along these dimensions in Mandarin, a one-way MANOVA was performed on the acoustic data as a function of emotion category, with the six acoustic parameters (normalized mean f0, normalized f0 range, normalized amplitude range, mean HNR, SD of HNR, and speech rate) serving as the dependent variables. The MANOVA indicated that the effect of emotion category on the six acoustic parameters was significant, Wilks’s Λ = 0.205, F(36, 3788) = 45.71, p < .0001. Subsequent univariate analyses revealed that the effect of emotion category was significant for normalized mean f0, F(6, 867) = 106.22, p < .0001, normalized f0 range, F(6, 867) = 62.78, p < .0001, normalized amplitude range, F(6, 867) = 8.79, p < .0001, mean HNR, F(6, 867) = 44.33, p < .0001, SD of HNR, F(6, 867) = 30.61, p < .0001, and speech rate, F(6, 867) = 108.27, p < .0001.
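A one-way MANOVA of this form can be sketched with statsmodels in Python, assuming the normalized acoustic measures are available in a hypothetical table (acoustic_measures.csv) with one row per valid item and the column names used below; univariate follow-ups can reuse the f_oneway pattern shown earlier.

    import pandas as pd
    from statsmodels.multivariate.manova import MANOVA

    acoustics = pd.read_csv("acoustic_measures.csv")

    # Six acoustic parameters as dependent variables, emotion category as the factor.
    manova = MANOVA.from_formula(
        "norm_mean_f0 + norm_f0_range + norm_amp_range + mean_hnr + sd_hnr + speech_rate"
        " ~ emotion",
        data=acoustics,
    )
    print(manova.mv_test())   # reports Wilks's lambda, Pillai's trace, etc.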

Table 2 Acoustic measures (normalized) for each emotion category averaged across speakers

Post hoc (Tukey’s) comparisons were carried out on each acoustic parameter to examine the differences among emotion categories. The observed acoustic differences demonstrated a number of consistencies with those reported for other languages (e.g., Castro & Lima, 2010; Jaywant & Pell, 2012; Juslin & Laukka, 2001; Pell, Paulmann, et al., 2009; Scherer, London, & Wolf, 1973; Thompson & Balkwill, 2006). For instance, sadness was expressed with a low mean f0, small f0 and amplitude variation, a high mean HNR, and a slow speech rate. Disgust exhibited a low mean f0, a low mean HNR, a large HNR variation, and the slowest speech rate. Pleasant surprise exhibited the highest mean f0, the largest f0 variation, and the largest amplitude variation. Neutrality was conveyed by a relatively low mean f0, a small f0 variation, the smallest amplitude variation, and a moderate speech rate. These cross-language consistencies are compatible with the idea that emotional communication is constrained to a large extent by biological factors and is shared across cultures (Scherer, 1986). On the other hand, greater cross-language variability was observed in the expression of fear, anger, and happiness, which might be due to methodological variability in this literature (e.g., the length of the emotional stimuli that were used, different subtypes of the target emotion that were elicited in different languages, etc.). As well, given the small number of speakers who produced the emotional stimuli, it is likely that individual variability in emotion expression is partly responsible for many of the observed acoustic differences between speakers and across languages.

To briefly evaluate how well the six acoustic parameters predicted the perceptual classification of items conveying each of the seven emotional meanings, a stepwise discriminant analysis was performed. The analysis revealed six significant canonical functions. Function 1, F(36, 3788) = 45.71, p < .0001, accounted for 47.3 % of the variance and correlated positively with mean f0 (r = .81), speech rate (r = .59), and f0 variation (r = .52). Function 2, F(30, 3454) = 54.23, p < .0001, explained 33.8 % of the remaining variance and correlated positively with speech rate (r = .73) and negatively with f0 variation (r = −.47). Function 3, F(24, 3015) = 64.95, p < .0001, accounted for 14.4 % of the remaining variance and correlated positively with mean HNR (r = .82) and mean f0 (r = .50). Finally, function 4, F(18, 2447) = 80.59, p < .0001, function 5, F(12, 1732) = 99.18, p < .0001, and function 6, F(6, 867) = 108.27, p < .0001, accounted for relatively lower percentages of the remaining variance (4.0 %, 0.4 %, and 0.1 %, respectively) and correlated positively with HNR variation (r = .73), amplitude variation (r = .83), and f0 variation (r = .67), respectively.
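A comparable (though non-stepwise) analysis can be sketched with scikit-learn’s linear discriminant analysis; as above, the table and column names are hypothetical, and the classification rates obtained this way are resubstitution estimates on the same items rather than cross-validated accuracy.

    import pandas as pd
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    acoustics = pd.read_csv("acoustic_measures.csv")
    predictors = ["norm_mean_f0", "norm_f0_range", "norm_amp_range",
                  "mean_hnr", "sd_hnr", "speech_rate"]

    lda = LinearDiscriminantAnalysis()
    lda.fit(acoustics[predictors], acoustics["emotion"])

    print("Variance explained per discriminant function:", lda.explained_variance_ratio_)

    # Per-emotion classification rates based on the six acoustic predictors.
    predicted = lda.predict(acoustics[predictors])
    print(pd.crosstab(acoustics["emotion"], predicted, normalize="index"))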

This model of six acoustic measures correctly predicted the classification of validated items into the seven emotion categories at an overall rate of 59.4 % (519/874). These results underscore that the six selected acoustic measures are essential cues in communicating vocal emotions (Bachorowski & Owren, 1995; Castro & Lima, 2010; Mozziconacci, 2001; Pell, Paulmann, et al., 2009; Thompson & Balkwill, 2006; Williams & Stevens, 1972). However, there were major differences among the specific emotion categories in how well they were classified on the basis of the acoustic data: disgust = 76 % (92/121), neutrality = 75.7 % (84/111), sadness = 65.1 % (97/149), pleasant surprise = 54.2 % (26/48), anger = 53.6 % (75/140), fear = 51 % (73/143), and happiness = 44.4 % (72/162). Furthermore, since approximately 40 % of the items could not be correctly classified on the basis of the six acoustic measures, future work will need to include additional parameters to fully capture how listeners use acoustic cues to recognize emotions from speech prosody.

Conclusion

The present study sought to establish a well-controlled, validated database of vocal emotional stimuli in Mandarin Chinese for use in future research. Four native Mandarin speakers produced Chinese pseudoutterances in seven emotional categories (anger, happiness, sadness, fear, disgust, pleasant surprise, and neutrality). Their expressions were validated by a new group of 24 native listeners in an emotion recognition task, on the basis of which a valid subset of items was selected and subjected to acoustic analysis. Expected variations were observed among the seven emotion categories in both perceptual and acoustic patterns.

There are several methodological limitations of this study. First, only a small number of encoders (i.e., four) were recruited to produce the emotional expressions; these participants were not professional actors but had lay experience in broadcasting and/or public speaking. The fact that our encoders did not have professional training, as is true of most related studies, could have led to greater variability in the ability of individual speakers to produce vocal emotion expressions, influencing the perceptual and acoustic measures to a certain degree. Another potential shortcoming is that only 24 decoders were recruited for the perceptual validation task; while this sample size is typical of comparable studies in the emotion perception literature (e.g., Burkhardt et al., 2005; Pell, Paulmann, et al., 2009; Thompson & Balkwill, 2006; for a review, see Juslin & Laukka, 2003) and likely robust, a larger sample of decoders could potentially improve the reliability of our data. Finally, only a small number of acoustic measures were employed in the acoustic analyses presented here, and many additional parameters would be needed to characterize all relevant features of emotional speech in Mandarin Chinese (or any other language). Future studies that include a larger set of acoustic measures will be important for elaborating current evidence of the acoustic characteristics of Chinese emotional speech.

Despite these limitations, the selected vocal emotion expressions in this study were perceptually validated and exhibited systematic acoustic patterns that were, to a certain extent, similar to those found in other languages (e.g., Castro & Lima, 2010; Pell & Kotz, 2011; Pell, Paulmann, et al., 2009; Scherer et al., 2001). Therefore, this database, which contains 874 items conveying the seven emotional meanings, could be a valid and useful tool for future research and is currently available to the research community. It will contribute to future behavioral, neuropsychological, and neuroimaging studies of vocal emotions in Mandarin and to future cross-cultural/cross-linguistic studies of vocal emotion communication that shed light on both the “universal” and unique features of this communication subsystem. In addition, our new stimuli could be incorporated into assessments of emotional functions in the Mandarin-speaking population (e.g., to evaluate emotion communication functions in Mandarin-speaking brain-damaged patients). To access the database and the relevant information, please contact pan.liu@mail.mcgill.ca.