Experiment 1 was conducted in a Dutch cultural context. To evaluate the robustness of the findings, we sought to repeat Experiment 1 in a culturally distant context. Languages are characterized by prosodic conventions, which might shape the communication of emotions via speech prosody. Choosing distant cultures with different prosodic conventions makes it unlikely that the same prosodic conventions shape the communication of positive emotions in our study, allowing us to interrogate the robustness of the findings. In Experiment 2, we test (1) whether Chinese listeners can recognize 22 positive emotions from nonverbal expressions and speech prosody in stimuli produced by Chinese individuals, and (2) whether positive emotions would be better recognized from nonverbal vocalizations than from prosodic expressions in a Chinese cultural context as well.
Sample size was determined in the same way as in Experiment 1. Two hundred native Chinese Mandarin speakers (109 women, 90 men, 1 preferred not to say; Mage = 27.51, SDage = 4.50, range = 19–35 years old) with no (self-reported) hearing impairments were recruited via a Chinese online data collection platform, https://www.wjx.cn. Participation in the study was compensated with a monetary reward.
Materials and Procedure
Posed vocal expressions of positive emotions in Chinese Mandarin were recorded at the University of Amsterdam’s psychology laboratory, using the same procedure as the recordings of the Dutch vocalizations (see Experiment 1, Stimuli). Eligibility criteria for participating in the recordings were: (1) being a native Chinese Mandarin speaker, (2) having been in the Netherlands for no more than 3 months at the time of the recording, (3) having lived in China until the age of 18, and (4) never having lived outside of China for more than 2 years. Based on these criteria, twenty participants (10 women, 10 men; Mage = 23, SDage = 2.63, range = 19–31 years old) were invited to the laboratory to record vocalizations. Participants reported never having been diagnosed with or treated for any voice, speech, hearing, or language disorder.
The experimenter was a native Chinese Mandarin speaker, and the entire recording procedure was conducted in Chinese Mandarin. The target emotions, accompanying definitions, and situational examples (given in Table 1), as well as the neutral phrase used for the recordings of speech prosody (“六百四十七”, Chinese Mandarin for “six hundred forty-seven”), were provided in Chinese Mandarin. All 880 recorded vocalizations were used as stimuli in Experiment 2. Average duration was 1.25 s (SD = 0.64) for nonverbal vocalizations and 1.64 s (SD = 0.45) for speech prosody. An example vocalization for each positive emotion and vocalization type is available from https://emotionwaves.github.io/chinese22/.
The experimental procedure was the same as in Experiment 1, except that the stimuli were from the Chinese Mandarin recordings.
For data analysis and outlier detection, the preregistered plan was followed. Before data analysis, the data were checked for participants scoring 3 SD or more below the mean on overall recognition performance. Based on this criterion, one participant’s data were excluded from the analysis. The statistical analyses were identical to those employed in Experiment 1.
Confusion matrices for average recognition percentages for nonverbal vocalizations and speech prosody are shown in Fig. 1. Comparisons of recognition performance to chance level per positive emotion for nonverbal vocalizations and speech prosody can be found in Table 2.
Sixteen positive emotions were recognized at better-than-chance levels from nonverbal vocalizations. In order of coefficient size on the log-odds scale, these emotions were amusement (Est. = 3.587, SE = 0.324), relief (Est. = 2.924, SE = 0.338), schadenfreude (Est. = 2.494, SE = 0.282), amae (Est. = 2.319, SE = 0.410), determination (Est. = 2.123, SE = 0.322), interest (Est. = 1.931, SE = 0.264), surprise (Est. = 1.635, SE = 0.300), triumph (Est. = 1.613, SE = 0.330), sensory pleasure (Est. = 1.408, SE = 0.177), admiration (Est. = 0.821, SE = 0.275), elation (Est. = 0.762, SE = 0.259), inspiration (Est. = 0.751, SE = 0.238), elevation (Est. = 0.692, SE = 0.248), pride (Est. = 0.656, SE = 0.222), lust (Est. = 0.643, SE = 0.274), and excitement (Est. = 0.604, SE = 0.269). These findings show that nonverbal vocalizations are a highly effective means of conveying many positive emotions.
In contrast, only seven positive emotions were recognized better than would be expected by chance from speech prosody. In order of coefficient size on the log-odds scale, these emotions were amusement (Est. = 1.453, SE = 0.284), relief (Est. = 1.227, SE = 0.263), determination (Est. = 1.165, SE = 0.244), interest (Est. = 0.662, SE = 0.200), pride (Est. = 0.503, SE = 0.180), triumph (Est. = 0.479, SE = 0.200), and awe (Est. = 0.465, SE = 0.205). These results suggest that prosodic expressions are not very effective in conveying positive emotions, with recognizability highly dependent on the emotion expressed. Estimates from the GLMM models are visualised in Fig. 2. Full details of the GLMMs are provided in the Supplementary Materials, Tables S1 and S2.
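For readers less familiar with the log-odds scale, the reported GLMM estimates can be translated into more intuitive quantities. The sketch below is illustrative only (it is not the authors' analysis code) and uses the amusement estimate reported for speech prosody:

```python
import math

# Illustrative only: convert a GLMM log-odds estimate (vs. chance) into
# more intuitive quantities. The values are the amusement estimates
# reported above for speech prosody (Est. = 1.453, SE = 0.284).
est, se = 1.453, 0.284

# Odds of correct recognition relative to chance-level odds.
odds_ratio = math.exp(est)

# Wald z statistic for the comparison against chance.
wald_z = est / se

print(f"odds ratio vs. chance: {odds_ratio:.2f}")
print(f"Wald z: {wald_z:.2f}")
```

An estimate of 1.453 thus corresponds to roughly fourfold greater odds of correct recognition than chance, with a clearly significant Wald z.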
As in the Dutch cultural context, we sought to test the hypothesis that positive emotions would be more accurately recognized from nonverbal vocalizations than from speech prosody. As predicted, participants categorized nonverbal vocalizations of positive emotions better than speech prosody overall (GLMM: z = −10.69, p < 0.001). Next, we compared performance accuracy across vocalization types for each emotion, showing that 16 positive emotions were recognized with better accuracy from nonverbal vocalizations. None of the emotions was more accurately recognized from speech prosody (see Table 3). It is worth noting that not all of the 16 emotions that were recognized better from nonverbal vocalizations than from speech prosody were recognized above chance levels for both kinds of expressions (see Fig. 3b). Admiration, amae, elation, elevation, excitement, inspiration, lust, surprise, sensory pleasure, and triumph were recognized at better-than-chance levels only when expressed as nonverbal vocalizations. These emotions might thus be expressed with unique nonverbal vocalizations, while not being clearly communicated via speech prosody cues. These results suggest that the recognizability of some positive emotions depends on the vocalization type through which the emotion is expressed. A summary of the random effects in the GLMM models can be found in the Supplementary Materials, Table S3.
Experiment 2 revealed that naïve Chinese listeners recognized 17 out of 22 positive emotions better than expected by chance from vocal expressions of native Chinese Mandarin speakers. Moreover, 16 positive emotions were recognized with higher accuracy from nonverbal vocalizations than from speech prosody, suggesting a communicative advantage for nonverbal vocalizations. Compared to nonverbal vocalizations, a relative lack of distinctive acoustic cues in prosodic expressions of positive emotions may lead to poorer recognizability.
Acoustic Classification Experiments
Machine learning approaches were employed to attempt to automatically classify the nonverbal vocalizations and speech prosody of 22 positive emotions based on their acoustic features. All stimuli collected from the Dutch speakers in Experiment 1 and the Chinese Mandarin speakers in Experiment 2 were used. We first extracted a large number of acoustic features for each audio clip and then performed discriminative classification experiments with machine learning algorithms to try to classify the 22 positive emotions based on the extracted acoustic features. If acoustic classification accuracy is higher for nonverbal vocalizations than for speech prosody, this might be one of the mechanisms contributing to the better recognition of positive emotions from nonverbal vocalizations in Experiments 1 and 2. The acoustic characteristics of the vocalizations used in this study (duration, RMS amplitude, pitch mean, pitch standard deviation, spectral centre of gravity, and spectral standard deviation values, extracted using Praat: Boersma & Weenink, 2011) are presented in Fig. 4.
Using openSMILE software (Eyben et al., 2013), we extracted acoustic features from the extended version of the Geneva Minimalistic Acoustic Parameter Set (eGeMAPS, see Eyben et al., 2016). GeMAPS is a standardized, open-source method for the measurement of acoustic features in emotional voice analysis. The acoustic features spanned the frequency, energy/amplitude, spectral balance, and temporal domains. Features of the frequency domain include aspects of fundamental frequency (correlated with the perceived pitch), as well as formant frequencies and bandwidths. Energy/amplitude features refer to the air pressure in the sound wave, and are perceived as loudness. Spectral balance parameters are influenced by laryngeal and supralaryngeal movements and are related to perceived voice quality. Lastly, features from the temporal domain reflect the duration and rate of voiced and unvoiced speech segments. We extracted 88 acoustic features in total from these four domains. For each stimulus, the feature vector was the mean of the whole audio clip.
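The per-stimulus pooling step can be sketched as follows. This is a minimal illustration with synthetic data standing in for openSMILE's framewise descriptors, not the actual extraction pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for framewise low-level descriptors of one audio clip; in the
# study these came from openSMILE's eGeMAPS set (88 features in total).
n_frames, n_features = 120, 88
framewise = rng.normal(size=(n_frames, n_features))

# Per-stimulus feature vector: the mean of each feature over the whole clip.
feature_vector = framewise.mean(axis=0)
```

Each stimulus is thus represented by a single 88-dimensional vector, regardless of clip duration.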
We conducted acoustic classification experiments with four machine learning algorithms: support vector machines with linear (Linear SVM), radial basis function (RBF SVM), and polynomial (Poly SVM) kernels, and random forest. These are among the most commonly used models for classification (Poria et al., 2017). Scikit-learn, a Python-based machine learning library, was used for the machine learning evaluation (Pedregosa et al., 2011). For all of the machine learning models, we performed tenfold cross-validation and grid search to select the hyperparameters that produced the best results.
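The model-selection procedure can be sketched as below. The data are synthetic and the hyperparameter grids are hypothetical (the study does not report its exact grids); only the overall scheme of a tenfold cross-validated grid search over the four model families follows the text:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for 88-dimensional eGeMAPS feature vectors.
rng = np.random.default_rng(1)
X = rng.normal(size=(160, 88))
y = np.arange(160) % 8  # 8 emotion labels per classification run

# Hypothetical hyperparameter grids for illustration only.
models = {
    "linear_svm": (SVC(kernel="linear"), {"C": [0.1, 1, 10]}),
    "rbf_svm": (SVC(kernel="rbf"), {"C": [1, 10], "gamma": ["scale"]}),
    "poly_svm": (SVC(kernel="poly"), {"C": [1], "degree": [2, 3]}),
    "random_forest": (RandomForestClassifier(random_state=0),
                      {"n_estimators": [100, 200]}),
}

best = {}
for name, (clf, grid) in models.items():
    # Tenfold cross-validated grid search, as described in the text.
    search = GridSearchCV(clf, grid, cv=10)
    search.fit(X, y)
    best[name] = search.best_score_
```

`best` then holds the best cross-validated accuracy per model family, from which the winning hyperparameters would be carried forward.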
We tested classification of 8 positive emotions in each run in order to mirror the findings on human recognition performance in Experiments 1 and 2, in which participants had to select one of 8 emotion options in a forced-choice task. We performed three separate classification runs for the set of all stimuli sharing a specific emotion category, henceforth called an “emotion category group”. There were 22 emotion category groups corresponding to the 22 emotion categories. First, we used each emotion category group’s actual category plus seven randomly selected emotion categories from the other 21 categories (i.e., excluding the target category). Next, we selected another seven random categories from the remaining 14 categories in addition to the emotion category group’s actual category. Finally, we used the last seven categories and the emotion category group’s actual category. Hence, eight categories were used for each classification run, and all 22 categories were included by the end of the third run.
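The three-run partitioning scheme can be expressed compactly. This sketch uses placeholder emotion labels; for one target category, the remaining 21 categories are shuffled once and split into three disjoint sets of seven, each combined with the target to form an 8-way classification run:

```python
import random

# Hypothetical placeholder labels for the 22 emotion categories.
emotions = [f"emotion_{i}" for i in range(22)]
target = emotions[0]  # the emotion category group's actual category

# Shuffle the other 21 categories, then take disjoint blocks of seven.
others = [e for e in emotions if e != target]
random.seed(0)
random.shuffle(others)

# Three runs of eight categories each: target + one block of seven.
runs = [sorted([target] + others[i * 7:(i + 1) * 7]) for i in range(3)]
```

By construction, each run has exactly eight categories, the target appears in all three runs, and all 22 categories are covered after the third run.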
To perform the classification during each run, we split the data into training and test sets using a 60:40 ratio. We optimized our machine learning models on the training set using a hyperparameter grid search. Next, we performed classification on the test set. We then combined the predictions for each of the 22 emotion label groups into one confusion matrix.
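One such run can be sketched as follows, again with synthetic data; the split ratio follows the text, while the choice of a linear SVM and the stratification are illustrative assumptions:

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the acoustic feature vectors of one run.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 88))
y = np.arange(200) % 8  # 8 emotion labels

# 60:40 train-test split, as described in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0)

# Fit on the training set, predict on the held-out test set.
clf = SVC(kernel="linear").fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Per-run predictions can later be pooled into one confusion matrix.
cm = confusion_matrix(y_test, y_pred, labels=range(8))
```

Summing such per-run matrices over all emotion category groups yields the pooled confusion matrix described above.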
Classification accuracy for each machine learning model is summarized in Table 4; confusion matrices for the most accurate machine learning models for each group are shown in Fig. 5.
For both Dutch and Chinese stimuli, nonverbal vocalizations of all positive emotions except hope and inspiration were classified with above-chance accuracy (i.e., above 12.5%, or 1/8, given that there were 8 emotion labels). The results revealed that the best-classified positive emotions mapped onto the emotions well recognized from nonverbal vocalizations. For speech prosody, only eight positive emotions (admiration, amae, awe, excitement, gratitude, schadenfreude, and tenderness) were classified at above-chance levels. Across the machine learning models, nonverbal vocalizations were classified more accurately than speech prosody. When vocalization types were compared for each positive emotion, acoustic classification accuracy was higher for nonverbal vocalizations for 18 positive emotions, while none of the emotions was classified with better accuracy from speech prosody. These results illustrate the lower distinctiveness of the acoustic patterns of positive emotions expressed through prosodic expressions as compared to nonverbal vocalizations, providing a likely explanation for the better recognition of positive emotions from nonverbal vocalizations found in Experiments 1 and 2.
Ancillary Acoustic Analyses
To better understand the acoustic distinctiveness of nonverbal expressions and speech prosody, we first visualised the acoustic similarity structure of positive emotions across the two vocalization types using t-distributed stochastic neighbor embedding (t-SNE; https://lvdmaaten.github.io/tsne/). In the resulting low-dimensional projection, proximity between the elements (i.e., the acoustic structure of each vocalization) denotes their similarity (see Fig. 6). The similarity space for vocalizations across the two vocalization types derived by t-SNE revealed that nonverbal vocalizations and speech prosody form distinctive clusters.
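A t-SNE projection of this kind can be produced as sketched below. The data are synthetic stand-ins for the 88-dimensional acoustic feature vectors, and the perplexity setting is an assumption (the study does not report its t-SNE parameters):

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic stand-ins for the acoustic feature vectors of the two
# vocalization types (50 clips each, 88 features per clip).
rng = np.random.default_rng(3)
nonverbal = rng.normal(loc=1.0, size=(50, 88))
prosody = rng.normal(loc=-1.0, size=(50, 88))
X = np.vstack([nonverbal, prosody])

# Two-dimensional embedding; proximity reflects acoustic similarity.
embedding = TSNE(n_components=2, perplexity=30,
                 random_state=0).fit_transform(X)
```

Plotting `embedding` with one color per vocalization type would reproduce the kind of cluster visualisation shown in Fig. 6.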
To better understand the acoustic characteristics of nonverbal expressions and speech prosody, we identified the five most important acoustic features based on feature weights. Feature weights represent how much each acoustic feature is used by the machine learning model in classifying emotions for nonverbal vocalizations and speech prosody produced by Dutch and Chinese Mandarin speakers. Table 5 lists these parameters together with their definitions, feature weights, and standard deviations for nonverbal vocalizations and speech prosody separately. These calculations highlight that feature weights were, in general, higher for nonverbal vocalizations than for speech prosody: the most important acoustic features were more influential in the classification of nonverbal vocalizations than in the classification of speech prosody. Moreover, pitch cues were among the most important cues for speech prosody but not for nonverbal vocalizations, while loudness and spectral-balance cues were among the most important features for both vocalization types. Temporal cues were important in vocal expressions produced by Chinese Mandarin speakers, but not in those produced by Dutch individuals. This might reflect differences in linguistic structure across the two languages: Chinese Mandarin is a syllable-timed language (spacing syllables equally across an utterance), whereas Dutch is a stress-timed language (emphasizing particular stressed syllables at regular intervals) (e.g., Benton et al., 2007). Most, but not all, of the acoustic features showed more variation in nonverbal vocalizations than in speech prosody. This could be due to linguistic constraints on the production of speech prosody: producing nonverbal vocalizations, unlike speech, does not require precise movements of the articulators, because they are not constrained by linguistic codes (Scott et al., 2010; see General Discussion).
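Ranking features by their weight in a trained classifier can be sketched as follows. This example uses a random forest's impurity-based importances on synthetic data with placeholder feature names; it illustrates the general procedure rather than the study's exact weight computation:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the acoustic data; names are placeholders for
# the 88 eGeMAPS parameters.
rng = np.random.default_rng(4)
X = rng.normal(size=(160, 88))
y = np.arange(160) % 8
feature_names = [f"feature_{i}" for i in range(88)]

# Fit a classifier; its feature importances serve as feature weights.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Five most important features, largest weight first.
order = np.argsort(forest.feature_importances_)[::-1][:5]
top5 = [(feature_names[i], forest.feature_importances_[i]) for i in order]
```

With real acoustic features, the resulting top-five list corresponds to the parameters reported in Table 5.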
We further performed cross-classification analyses in order to test whether producers’ emotion encoding strategies overlap between nonverbal vocalizations and speech prosody, and whether the encoding strategies are shared across Dutch and Chinese Mandarin speaking participants. We thus conducted two types of cross-classification analyses: (1) models trained on nonverbal vocalizations and tested on speech prosody, and vice versa; and (2) models trained on vocalizations produced by Dutch-speaking participants and tested on vocalizations produced by Chinese Mandarin-speaking participants, and vice versa. The accuracies of all models are shown in Table 6.
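The cross-classification logic amounts to fitting on one stimulus set and scoring on the other. The sketch below illustrates the vocalization-type direction with synthetic data and an assumed linear SVM; the cultural direction is analogous, swapping in the Dutch and Chinese stimulus sets:

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic stand-ins for the two stimulus sets (features and labels).
rng = np.random.default_rng(5)
X_nonverbal = rng.normal(size=(120, 88))
y_nonverbal = np.arange(120) % 8
X_prosody = rng.normal(size=(120, 88))
y_prosody = np.arange(120) % 8

# Train on nonverbal vocalizations, test on speech prosody...
clf = SVC(kernel="linear").fit(X_nonverbal, y_nonverbal)
acc_nv_to_sp = clf.score(X_prosody, y_prosody)

# ...and the reverse direction.
clf_rev = SVC(kernel="linear").fit(X_prosody, y_prosody)
acc_sp_to_nv = clf_rev.score(X_nonverbal, y_nonverbal)
```

Above-chance accuracy in both directions would indicate overlapping encoding strategies across the two sets.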
The results show that the classification models in each of the cross-classification types performed statistically better than chance, indicating shared encoding strategies in the production of emotional vocalizations. In cross-vocalization-type evaluations, performance was nearly equivalent for training and testing in both directions for the Dutch vocal expressions. However, for the vocalizations produced by Chinese Mandarin speakers, accuracies were slightly higher for training on speech prosody and testing on nonverbal vocalizations as compared to the reverse. In cross-cultural evaluations, training on the Dutch vocalizations and testing on the Chinese vocalizations performed similarly to training on the Chinese vocalizations and testing on the Dutch vocalizations. Cross-cultural classification performance was better for nonverbal vocalizations than for speech prosody, suggesting more robust differentiation of positive emotions based on acoustic configurations across cultures when expressed via nonverbal vocalizations. The cross-classification evaluations demonstrate that encoding strategies used in the production of emotional vocalizations are shared across vocalization types as well as across speakers from the two cultures.