Experiment 1 was conducted in a Dutch cultural context. To evaluate the robustness of the findings, we sought to repeat Experiment 1 in a culturally distant context. Languages are characterized by prosodic conventions, which might shape the communication of emotions via speech prosody. Choosing distant cultures with different prosodic conventions makes it unlikely that the same prosodic conventions shape the communication of positive emotions in our study, allowing us to interrogate the robustness of the findings. In Experiment 2, we test (1) whether Chinese listeners can recognize 22 positive emotions from nonverbal expressions and speech prosody in stimuli produced by Chinese individuals, and (2) whether positive emotions would be better recognized from nonverbal vocalizations than from prosodic expressions in a Chinese cultural context as well.
Sample size was determined in the same way as in Experiment 1. Two hundred native Chinese Mandarin speakers (109 women, 90 men, 1 preferred not to say; Mage = 27.51, SDage = 4.50, range = 19–35 years old) with no (self-reported) hearing impairments were recruited via a Chinese online data collection platform, https://www.wjx.cn. Participation in the study was compensated with a monetary reward.
Materials and Procedure
Posed vocal expressions of positive emotions in Chinese Mandarin were recorded at the University of Amsterdam’s psychology laboratory, using the same procedure as the recordings of the Dutch vocalizations (see Experiment 1, Stimuli). Eligibility criteria for participating in the recordings were: (1) being a native Chinese Mandarin speaker, (2) having been in the Netherlands for no more than 3 months at the time of the recording, (3) having lived in China until the age of 18, and (4) never having lived outside of China for more than 2 years. Based on these criteria, twenty participants (10 women, 10 men; Mage = 23, SDage = 2.63, range = 19–31 years old) were invited to the laboratory to record vocalizations. Participants reported never having been diagnosed with or treated for any voice, speech, hearing, or language disorder.
The experimenter was a native Chinese Mandarin speaker, and the entire recording procedure was conducted in Chinese Mandarin. The target emotions, accompanying definitions, and situational examples (given in Table 1), as well as the neutral phrase used for the recordings of speech prosody (“六百四十七”, Chinese Mandarin for “six hundred forty-seven”), were provided in Chinese Mandarin. All 880 recorded vocalizations were used as stimuli in Experiment 2. Average duration was 1.25 s (SD = 0.64) for nonverbal vocalizations and 1.64 s (SD = 0.45) for speech prosody. An example vocalization for each positive emotion and vocalization type is available from https://emotionwaves.github.io/chinese22/.
The experimental procedure was the same as in Experiment 1, except that the stimuli were from the Chinese Mandarin recordings.
For data analysis and outlier detection, the preregistered plan was followed. Before data analysis, the data were checked for participants scoring 3 SD or more below the mean on overall recognition performance. Based on this criterion, one participant’s data were excluded from the analysis. The statistical analyses were identical to those employed in Experiment 1.
Confusion matrices for average recognition percentages for nonverbal vocalizations and speech prosody are shown in Fig. 1. Comparisons of recognition performance to chance level per positive emotion for nonverbal vocalizations and speech prosody can be found in Table 2.
Sixteen positive emotions were recognized at better-than-chance levels from nonverbal vocalizations. In order of coefficient size on the log-odds scale, these emotions were amusement (Est. = 3.587, SE = 0.324), relief (Est. = 2.924, SE = 0.338), schadenfreude (Est. = 2.494, SE = 0.282), amae (Est. = 2.319, SE = 0.410), determination (Est. = 2.123, SE = 0.322), interest (Est. = 1.931, SE = 0.264), surprise (Est. = 1.635, SE = 0.300), triumph (Est. = 1.613, SE = 0.330), sensory pleasure (Est. = 1.408, SE = 0.177), admiration (Est. = 0.821, SE = 0.275), elation (Est. = 0.762, SE = 0.259), inspiration (Est. = 0.751, SE = 0.238), elevation (Est. = 0.692, SE = 0.248), pride (Est. = 0.656, SE = 0.222), lust (Est. = 0.643, SE = 0.274), and excitement (Est. = 0.604, SE = 0.269). These findings show that nonverbal vocalizations are a highly effective means of conveying many positive emotions.
In contrast, only seven positive emotions were recognized better than would be expected by chance from speech prosody. In order of coefficient size on the log-odds scale, these emotions were amusement (Est. = 1.453, SE = 0.284), relief (Est. = 1.227, SE = 0.263), determination (Est. = 1.165, SE = 0.244), interest (Est. = 0.662, SE = 0.200), pride (Est. = 0.503, SE = 0.180), triumph (Est. = 0.479, SE = 0.200), and awe (Est. = 0.465, SE = 0.205). These results suggest that prosodic expressions are not very effective in conveying positive emotions, with recognizability highly dependent on the emotion expressed. Estimates from the GLMM models are visualised in Fig. 2. Full details of the GLMMs are provided in the Supplementary Materials, Tables S1 and S2.
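For readers less familiar with the log-odds scale, the reported GLMM estimates can be translated into more intuitive quantities. The sketch below is illustrative only (it is not the authors' analysis code) and uses the amusement estimate reported for speech prosody:

```python
import math

# Illustrative only: convert a GLMM log-odds estimate (vs. chance) into
# more intuitive quantities. The values are the amusement estimates
# reported above for speech prosody (Est. = 1.453, SE = 0.284).
est, se = 1.453, 0.284

# Odds of correct recognition relative to chance-level odds.
odds_ratio = math.exp(est)

# Wald z statistic for the comparison against chance.
wald_z = est / se

print(f"odds ratio vs. chance: {odds_ratio:.2f}")
print(f"Wald z: {wald_z:.2f}")
```

An estimate of 1.453 thus corresponds to roughly fourfold greater odds of correct recognition than chance, with a clearly significant Wald z.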
As in the Dutch cultural context, we sought to test the hypothesis that positive emotions would be more accurately recognized from nonverbal vocalizations than from speech prosody. As predicted, participants categorized nonverbal vocalizations of positive emotions better than speech prosody overall (GLMM: z = −10.69, p < 0.001). Next, we compared performance accuracy across vocalization types for each emotion, showing that 16 positive emotions were recognized with better accuracy from nonverbal vocalizations. None of the emotions was more accurately recognized from speech prosody (see Table 3). It is worth noting that not all of the 16 emotions that were recognized better from nonverbal vocalizations than from speech prosody were recognized above chance levels for both kinds of expressions (see Fig. 3b). Admiration, amae, elation, elevation, excitement, inspiration, lust, surprise, sensory pleasure, and triumph were recognized at better-than-chance levels only when expressed as nonverbal vocalizations. These emotions might thus be expressed with unique nonverbal vocalizations, while not being clearly communicated via speech prosody cues. These results suggest that the recognizability of some positive emotions depends on the vocalization type through which the emotion is expressed. A summary of the random effects in the GLMM models can be found in the Supplementary Materials, Table S3.
Experiment 2 revealed that naïve Chinese listeners recognized 17 out of 22 positive emotions better than expected by chance from vocal expressions of native Chinese Mandarin speakers. Moreover, 16 positive emotions were recognized with higher accuracy from nonverbal vocalizations than from speech prosody, suggesting a communicative advantage for nonverbal vocalizations. Compared to nonverbal vocalizations, a relative lack of distinctive acoustic cues in prosodic expressions of positive emotions may lead to poorer recognizability.
Acoustic Classification Experiments
Machine learning approaches were employed to attempt to automatically classify the nonverbal vocalizations and speech prosody of 22 positive emotions based on their acoustic features. All stimuli collected from the Dutch speakers in Experiment 1 and the Chinese Mandarin speakers in Experiment 2 were used. We first extracted a large number of acoustic features for each audio clip and then performed discriminative classification experiments with machine learning algorithms to try to classify the 22 positive emotions based on the extracted acoustic features. If acoustic classification accuracy is higher for nonverbal vocalizations than for speech prosody, this might be one of the mechanisms contributing to the better recognition of positive emotions from nonverbal vocalizations in Experiments 1 and 2. The acoustic characteristics of the vocalizations used in this study (duration, RMS amplitude, pitch mean, pitch standard deviation, spectral centre of gravity, and spectral standard deviation values, extracted using Praat: Boersma & Weenink, 2011) are presented in Fig. 4.
Using openSMILE software (Eyben et al., 2013), we extracted acoustic features from the extended version of the Geneva Minimalistic Acoustic Parameter Set (eGeMAPS, see Eyben et al., 2016). GeMAPS is a standardized, open-source method for the measurement of acoustic features in emotional voice analysis. The acoustic features spanned the frequency, energy/amplitude, spectral balance, and temporal domains. Features of the frequency domain include aspects of fundamental frequency (correlated with the perceived pitch), as well as formant frequencies and bandwidths. Energy/amplitude features refer to the air pressure in the sound wave, and are perceived as loudness. Spectral balance parameters are influenced by laryngeal and supralaryngeal movements and are related to perceived voice quality. Lastly, features from the temporal domain reflect the duration and rate of voiced and unvoiced speech segments. We extracted 88 acoustic features in total from these four domains. For each stimulus, the feature vector was the mean of the whole audio clip.
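The per-stimulus pooling step can be sketched as follows. This is a minimal illustration with synthetic data standing in for openSMILE's framewise descriptors, not the actual extraction pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for framewise low-level descriptors of one audio clip; in the
# study these came from openSMILE's eGeMAPS set (88 features in total).
n_frames, n_features = 120, 88
framewise = rng.normal(size=(n_frames, n_features))

# Per-stimulus feature vector: the mean of each feature over the whole clip.
feature_vector = framewise.mean(axis=0)
```

Each stimulus is thus represented by a single 88-dimensional vector, regardless of clip duration.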
We conducted acoustic classification experiments with four machine learning algorithms: support vector machines with linear (Linear SVM), radial basis function (RBF SVM), and polynomial (Poly SVM) kernels, and random forest. These are among the most commonly used models for classification (Poria et al., 2017). Scikit-learn, a Python-based machine learning library, was used for the machine learning evaluation (Pedregosa et al., 2011). For all of the machine learning models, we performed tenfold cross-validation and grid search to select the hyperparameters that produced the best results.
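The model-selection procedure can be sketched as below. The data are synthetic and the hyperparameter grids are hypothetical (the study does not report its exact grids); only the overall scheme of a tenfold cross-validated grid search over the four model families follows the text:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for 88-dimensional eGeMAPS feature vectors.
rng = np.random.default_rng(1)
X = rng.normal(size=(160, 88))
y = np.arange(160) % 8  # 8 emotion labels per classification run

# Hypothetical hyperparameter grids for illustration only.
models = {
    "linear_svm": (SVC(kernel="linear"), {"C": [0.1, 1, 10]}),
    "rbf_svm": (SVC(kernel="rbf"), {"C": [1, 10], "gamma": ["scale"]}),
    "poly_svm": (SVC(kernel="poly"), {"C": [1], "degree": [2, 3]}),
    "random_forest": (RandomForestClassifier(random_state=0),
                      {"n_estimators": [100, 200]}),
}

best = {}
for name, (clf, grid) in models.items():
    # Tenfold cross-validated grid search, as described in the text.
    search = GridSearchCV(clf, grid, cv=10)
    search.fit(X, y)
    best[name] = search.best_score_
```

`best` then holds the best cross-validated accuracy per model family, from which the winning hyperparameters would be carried forward.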
We tested classification of 8 positive emotions in each run in order to mirror the findings on human recognition performance in Experiments 1 and 2, in which participants had to select one of 8 emotion options in a forced-choice task. We performed three separate classification runs for the set of all stimuli sharing a specific emotion category, henceforth called an “emotion category group”. There were 22 emotion category groups corresponding to the 22 emotion categories. First, we used each emotion category group’s actual category plus seven randomly selected emotion categories from the other 21 categories (i.e., excluding the target category). Next, we selected another seven random categories from the remaining 14 categories in addition to the emotion category group’s actual category. Finally, we used the last seven categories and the emotion category group’s actual category. Hence, eight categories were used for each classification run, and all 22 categories were included by the end of the third run.
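The three-run partitioning scheme can be expressed compactly. This sketch uses placeholder emotion labels; for one target category, the remaining 21 categories are shuffled once and split into three disjoint sets of seven, each combined with the target to form an 8-way classification run:

```python
import random

# Hypothetical placeholder labels for the 22 emotion categories.
emotions = [f"emotion_{i}" for i in range(22)]
target = emotions[0]  # the emotion category group's actual category

# Shuffle the other 21 categories, then take disjoint blocks of seven.
others = [e for e in emotions if e != target]
random.seed(0)
random.shuffle(others)

# Three runs of eight categories each: target + one block of seven.
runs = [sorted([target] + others[i * 7:(i + 1) * 7]) for i in range(3)]
```

By construction, each run has exactly eight categories, the target appears in all three runs, and all 22 categories are covered after the third run.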
To perform the classification during each run, we split the data into training and test sets using a 60:40 ratio. We optimized our machine learning models on the training set using a hyperparameter grid search. Next, we performed classification on the test set. We then combined the predictions for each of the 22 emotion label groups into one confusion matrix.
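One such run can be sketched as follows, again with synthetic data; the split ratio follows the text, while the choice of a linear SVM and the stratification are illustrative assumptions:

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the acoustic feature vectors of one run.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 88))
y = np.arange(200) % 8  # 8 emotion labels

# 60:40 train-test split, as described in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0)

# Fit on the training set, predict on the held-out test set.
clf = SVC(kernel="linear").fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Per-run predictions can later be pooled into one confusion matrix.
cm = confusion_matrix(y_test, y_pred, labels=range(8))
```

Summing such per-run matrices over all emotion category groups yields the pooled confusion matrix described above.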
Classification accuracy for each machine learning model is summarized in Table 4; confusion matrices for the most accurate machine learning models for each group are shown in Fig. 5.
For both Dutch and Chinese stimuli, nonverbal vocalizations of all positive emotions except hope and inspiration were classified with above-chance accuracy (i.e., above 12.5%, or 1/8, given that there were 8 emotion labels). The results revealed that the best-classified positive emotions mapped onto the emotions well recognized from nonverbal vocalizations. For speech prosody, only eight positive emotions (admiration, amae, awe, excitement, gratitude, schadenfreude, and tenderness) were classified at above-chance levels. Across the machine learning models, nonverbal vocalizations were classified more accurately than speech prosody. When vocalization types were compared for each positive emotion, acoustic classification accuracy was higher for nonverbal vocalizations for 18 positive emotions, while none of the emotions was classified with better accuracy from speech prosody. These results illustrate the lower distinctiveness of the acoustic patterns of positive emotions expressed through prosodic expressions as compared to nonverbal vocalizations, providing a likely explanation for the better recognition of positive emotions from nonverbal vocalizations found in Experiments 1 and 2.
Ancillary Acoustic Analyses
To better understand the acoustic distinctiveness of nonverbal expressions and speech prosody, we first visualised the acoustic similarity structure of positive emotions across the two vocalization types using t-distributed stochastic neighbor embedding (t-SNE; https://lvdmaaten.github.io/tsne/). In the resulting low-dimensional projection, proximity between the elements (i.e., the acoustic structure of each vocalization) denotes their similarity (see Fig. 6). The similarity space for vocalizations across the two vocalization types derived by t-SNE revealed that nonverbal vocalizations and speech prosody form distinctive clusters.
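A t-SNE projection of this kind can be produced as sketched below. The data are synthetic stand-ins for the 88-dimensional acoustic feature vectors, and the perplexity setting is an assumption (the study does not report its t-SNE parameters):

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic stand-ins for the acoustic feature vectors of the two
# vocalization types (50 clips each, 88 features per clip).
rng = np.random.default_rng(3)
nonverbal = rng.normal(loc=1.0, size=(50, 88))
prosody = rng.normal(loc=-1.0, size=(50, 88))
X = np.vstack([nonverbal, prosody])

# Two-dimensional embedding; proximity reflects acoustic similarity.
embedding = TSNE(n_components=2, perplexity=30,
                 random_state=0).fit_transform(X)
```

Plotting `embedding` with one color per vocalization type would reproduce the kind of cluster visualisation shown in Fig. 6.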
To better understand the acoustic characteristics of nonverbal expressions and speech prosody, we identified the five most important acoustic features based on feature weights. Feature weights represent how much each acoustic feature is used by the machine learning model in classifying emotions for nonverbal vocalizations and speech prosody produced by Dutch and Chinese Mandarin speakers. Table 5 lists these parameters together with their definitions, feature weights, and standard deviations for nonverbal vocalizations and speech prosody separately. These calculations highlight that feature weights were, in general, higher for nonverbal vocalizations than for speech prosody: the most important acoustic features were more influential in the classification of nonverbal vocalizations than in the classification of speech prosody. Moreover, pitch cues were among the most important cues for speech prosody but not for nonverbal vocalizations, while loudness and spectral-balance cues were among the most important features for both vocalization types. Temporal cues were important in vocal expressions produced by Chinese Mandarin speakers, but not in those produced by Dutch individuals. This might reflect differences in linguistic structure across the two languages: Chinese Mandarin is a syllable-timed language (spacing syllables equally across an utterance), whereas Dutch is a stress-timed language (emphasizing particular stressed syllables at regular intervals) (e.g., Benton et al., 2007). Most, but not all, of the acoustic features showed more variation in nonverbal vocalizations than in speech prosody. This could be due to linguistic constraints on the production of speech prosody: producing nonverbal vocalizations, unlike speech, does not require precise movements of the articulators, because they are not constrained by linguistic codes (Scott et al., 2010; see General Discussion).
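Ranking features by their weight in a trained classifier can be sketched as follows. This example uses a random forest's impurity-based importances on synthetic data with placeholder feature names; it illustrates the general procedure rather than the study's exact weight computation:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the acoustic data; names are placeholders for
# the 88 eGeMAPS parameters.
rng = np.random.default_rng(4)
X = rng.normal(size=(160, 88))
y = np.arange(160) % 8
feature_names = [f"feature_{i}" for i in range(88)]

# Fit a classifier; its feature importances serve as feature weights.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Five most important features, largest weight first.
order = np.argsort(forest.feature_importances_)[::-1][:5]
top5 = [(feature_names[i], forest.feature_importances_[i]) for i in order]
```

With real acoustic features, the resulting top-five list corresponds to the parameters reported in Table 5.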
We further performed cross-classification analyses in order to test whether producers’ emotion encoding strategies overlap between nonverbal vocalizations and speech prosody, and whether the encoding strategies are shared across Dutch and Chinese Mandarin speaking participants. We thus conducted two types of cross-classification analyses: (1) models trained on nonverbal vocalizations and tested on speech prosody, and vice versa; and (2) models trained on vocalizations produced by Dutch-speaking participants and tested on vocalizations produced by Chinese Mandarin-speaking participants, and vice versa. The accuracies of all models are shown in Table 6.
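The cross-classification logic amounts to fitting on one stimulus set and scoring on the other. The sketch below illustrates the vocalization-type direction with synthetic data and an assumed linear SVM; the cultural direction is analogous, swapping in the Dutch and Chinese stimulus sets:

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic stand-ins for the two stimulus sets (features and labels).
rng = np.random.default_rng(5)
X_nonverbal = rng.normal(size=(120, 88))
y_nonverbal = np.arange(120) % 8
X_prosody = rng.normal(size=(120, 88))
y_prosody = np.arange(120) % 8

# Train on nonverbal vocalizations, test on speech prosody...
clf = SVC(kernel="linear").fit(X_nonverbal, y_nonverbal)
acc_nv_to_sp = clf.score(X_prosody, y_prosody)

# ...and the reverse direction.
clf_rev = SVC(kernel="linear").fit(X_prosody, y_prosody)
acc_sp_to_nv = clf_rev.score(X_nonverbal, y_nonverbal)
```

Above-chance accuracy in both directions would indicate overlapping encoding strategies across the two sets.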
The results show that the classification models in each of the cross-classification types performed statistically better than chance, indicating shared encoding strategies in the production of emotional vocalizations. In cross-vocalization-type evaluations, performance was nearly equivalent for training and testing in both directions for the Dutch vocal expressions. However, for the vocalizations produced by Chinese Mandarin speakers, accuracies were slightly higher for training on speech prosody and testing on nonverbal vocalizations as compared to the reverse. In cross-cultural evaluations, training on the Dutch vocalizations and testing on the Chinese vocalizations performed similarly to training on the Chinese vocalizations and testing on the Dutch vocalizations. Cross-cultural classification performance was better for nonverbal vocalizations than for speech prosody, suggesting more robust differentiation of positive emotions based on acoustic configurations across cultures when expressed via nonverbal vocalizations. The cross-classification evaluations demonstrate that encoding strategies used in the production of emotional vocalizations are shared across vocalization types as well as across speakers from the two cultures.