Central theoretical questions in emotion research concern the issue of learnability. Emotion cues vary, to some extent, across individuals, situations, groups, and cultures. This variability may be construed as evidence that emotions would be too difficult to learn without some stable emotion categories shared across individuals or, conversely, as evidence that the variability itself argues against any core set of emotion categories. Both positions have been articulated in the literature (Barrett et al., 2019; Keltner et al., 2019; Scarantino, 2014). Yet despite this variability, children and adults clearly continue to refine and use perceptual categories to systematically distinguish between emotional states. The present study explores whether the perceptual variability in emotion cues is itself an important source of emotion learning. To do so, we examine whether individuals can track and use variability in the distribution of the cues they encounter in their perceptual categorization of emotion.

One path to progress on this question about the role of variability in emotion learning is to draw from advances in the field of language learning. A central question in the field of speech perception has been how to understand the efficiency of speech perception given the “lack of invariance” in the input people receive (Liberman, 1957; Liberman et al., 1967). A phoneme—the smallest unit of sound that makes up language, like the /ɛ/ in “pen”—can sound very different depending on who is saying it, the context or co-articulation of the phoneme with other aspects of an utterance, speech conditions, and random errors in speech production (Miller & Eimas, 1995). As an example, the /ɛ/ in “pen” will sound slightly different depending on whether it is produced by an adult or child, how fast or slow someone is speaking, the dialect of the speaker, and the sounds produced right before or after the word (e.g., Kleinschmidt & Jaeger, 2011). Yet despite this variation, humans learn to perceive speech sounds categorically, quickly, and with high accuracy.

One mechanism that supports perceptual learning of speech cues is humans’ ability to track and adapt to the distributional properties of acoustic input (Kleinschmidt & Jaeger, 2015; Samuel & Kraljic, 2009). Categorization of phonemes is surprisingly relative as well as context- and speaker-dependent. For example, individuals are able to quickly adjust to different rates of speaking (Newman & Sawusch, 1996), differences in speech caused by vocal tract size (Johnson, 2005), foreign-accented speech (Clarke & Garrett, 2004; Xie et al., 2018), dialects (Dahan et al., 2008), and variation in vowel pronunciation (Weatherholtz, 2015). These rapid adaptations are not always temporary; rather, they can update perceivers’ representations over time and transfer to novel situations and speakers (Clarke-Davidson et al., 2008; Kleinschmidt, 2019; Weatherholtz & Jaeger, 2016; Xie et al., 2018).

Here, we examine whether learning that is based upon the distributional properties of perceptual input also applies to vocal emotion cues. This type of learning has already been implicated in a range of developmental processes that includes children’s category learning in language (Saffran, 2020), faces (Dotsch et al., 2017), color, and action sequences (see Frost et al., 2019 for review). For instance, distributional statistics can aid children’s language learning by allowing them to detect phoneme categories (Maye et al., 2002) and can influence adults’ color perceptions based on the amounts of each color in their current environment (Levari et al., 2018). Recent findings suggest that distributional information also influences the learning of visual facial cues for emotion categories (Levari et al., 2018; Plate et al., 2019; Plate et al., in press). Brief exposure to images of a person who was either facially expressive or unexpressive caused children and adults to shift their threshold for categorizing a face as emotional.

The vocal expression of affect parallels these other domains in many respects. Specifically, there are statistical consistencies in how some affective states are conveyed (Banse & Scherer, 1996; Juslin & Laukka, 2001; Sauter et al., 2010). For example, anger is often conveyed with high pitch, high intensity, and, if using spoken word, rapid speech rate (Johnstone & Scherer, 2000; Scherer, 2019). However, vocal affect also reflects a “lack of invariance”: similar vocal properties (such as high pitch) can predict different emotions, speaker differences that affect speech perception can also impact vocal emotion, and there are currently no one-to-one mappings between any combinations of acoustic features and a specific emotion (Ito, 2018; Sauter et al., 2010).

The present work examines whether similar distributional learning processes also operate on affective information conveyed through the voice. Testing this idea is important because some reports suggest that this type of learning may be specific to certain modalities, such as auditory versus visual, or to certain kinds of stimuli, as evidenced by the lack of transfer of learning to novel stimuli (Frost et al., 2015). Moreover, learning performance across different modalities is often weakly correlated (Siegelman & Frost, 2015), making it important to test assumptions of generalizability. We also examined this learning process in children, who are still acquiring vocal emotion categories, based upon data indicating that prior knowledge and experience with stimuli affect statistical learning (Siegelman et al., 2018). To do so, we tested 8- to 10-year-olds because children at this age can rely on either lexical or prosodic information to interpret auditory expressions of emotion (e.g., Friend & Bryant, 2000; Morton & Trehub, 2001), but have lower accuracy than adults when identifying auditory emotion categories (Aguert et al., 2013; Morningstar, Ly, Feldman, & Dirks, 2018; Morningstar, Nelson, & Dirks, 2018).

We tested how perceivers categorized nonverbal (i.e., a yell) and verbal (i.e., a spoken sentence with a hostile tone) auditory stimuli of different emotional intensities as either “calm” or “upset.” We included both verbal and nonverbal stimuli because both adults and children tend to recognize nonverbal vocalizations more accurately (Hawk et al., 2009; Sauter et al., 2013), and this difference could affect how individuals adjust to these vocalizations. In Experiment 1, children and adults were trained to a baseline and then exposed to different distributions of vocal stimuli—that is, vocal stimuli with different ranges of intensity. Thus, after training, some participants were exposed to a greater proportion of vocal cues at higher intensities (upset shifted), some were exposed to a greater proportion of vocal cues at lower intensities (calm shifted), and some were exposed to the same proportion of stimuli throughout the entire study (unshifted).

We predicted that adults and children exposed to these different ranges of vocal stimuli would adjust how they categorized vocalizations as “upset” or not. For instance, participants exposed to more intense ranges might categorize a given vocalization as “calm,” whereas participants exposed to less intense ranges might categorize the same vocalization as “upset.” We predicted these changes in categorization because the different distributions of stimuli provide different information about how expressive the individual is. Comparing the performance of adults with that of children, who are still acquiring emotion categories, afforded the opportunity to examine developmental differences in how representations of affective vocal cues are updated. If children, in addition to adults, exhibit such sensitivity to statistical distributions of vocalizations, then statistical learning might support the initial acquisition of emotion cue categories just as it does learning in other domains. In Experiment 2, we used the same paradigm to test the replicability of this perceptual mechanism and to determine whether the effects hold when individuals also need to track other negatively valenced emotions and multiple speakers.

Experiment 1

Method

Participants

Eighty-four children (41 female; age range = 8–10 years, Mage = 9.70 years, SDage = 0.88 years) and 87 adults (58 female; Mage = 19.10 years, SDage = 0.73 years) participated in this experiment. We had three between-subject conditions and aimed for 30 participants in each condition (90 per age group) based on the sample sizes of previous research (Experiment 1 in Plate et al., 2019); however, we fell slightly short of this recruitment goal because of the COVID-19 outbreak. Because we ended data collection early, we report post hoc power analyses with the results; these should be interpreted with the understanding that they were conducted after data collection (see Zhang et al., 2019). Two children completed only one condition (verbal). Children were recruited from the community in Madison, Wisconsin (8.33% African American, 7.14% Asian American, 4.76% Hispanic, 2.38% more than one race, 77.38% White). All children received a prize, and parents received $25 for their participation. Adults were undergraduate students at the University of Wisconsin-Madison and received course credit (2.30% African American, 19.54% Asian American, 11.49% Hispanic, 5.75% more than one race, 60.92% White). The Institutional Review Board approved the research.

Stimuli

We presented participants with both nonverbal and verbal auditory stimuli. Nonverbal stimuli, created using Soundgen, were based on a male vocalization of a “yell” (roar_059; Anikin & Persson, 2017). Validation for the stimuli is reported in Anikin (2019), and the R scripts generating the stimuli, as well as the stimuli themselves, are available at https://osf.io/749xq/?view_only=ef7f9d9509284ed2927948509c596db5. Twenty-one nonverbal morphs were generated. These stimuli were morphs from a neutral “ahh” (0% “upset”) to a hostile “ahh” (100% “upset”) that varied in 5% increments in features including pitch, amplitude/loudness, and other cues of vocal quality (Banse & Scherer, 1996; Juslin & Laukka, 2001; Sauter et al., 2010). Verbal stimuli were morphs of recordings of a male actor saying a statement (“I can’t believe you just did that”) in both a neutral voice and an angry voice (see Morningstar et al., 2017 for details about the recording procedure). Morphs were created by linearly manipulating the waveform of the actor’s original portrayals from neutral to hostile in 10% increments, using STRAIGHT acoustic manipulation tools (Kawahara et al., 2008) in Matlab. The STRAIGHT tool manipulates F0, amplitude, spectral envelope, and periodicity simultaneously at the spectrogram level (Kawahara & Morise, 2011). This procedure yielded 10 recordings ranging from 10% emotional intensity (i.e., 10% anger, 90% neutral) to 100% emotional intensity (i.e., 100% anger, 0% neutral). Validation of these stimuli in a forced-choice emotion recognition task suggests that listeners’ (n = 190) capacity to identify these recordings as “angry” increased with the emotional intensity of the morphs (Morningstar et al., under review), going from 11% accuracy for the 10% intensity recording to 86% accuracy for the 100% intensity recording (where chance was 14%). The verbal and nonverbal conditions used different morphing increments because of the stimuli available for each condition. Additional details about the creation and validation of stimuli are available in the Supplemental Material.
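To make concrete what morphing in fixed increments entails, the sketch below linearly interpolates a few hypothetical acoustic parameters between a neutral and a hostile endpoint. This is illustrative only, under our own assumptions: the parameter names and values are placeholders, and the actual stimuli were generated with Soundgen and STRAIGHT, which operate on full sets of acoustic parameters at the spectrogram level.

```r
# Illustrative sketch only: linear interpolation between hypothetical "neutral"
# and "hostile" parameter settings, in 10% steps (the verbal stimuli used 10%
# increments; the nonverbal stimuli used 5% increments from 0% to 100%).
neutral <- c(pitch_hz = 120, loudness_db = 60, rate_syll_per_s = 4.0)  # placeholder values
hostile <- c(pitch_hz = 220, loudness_db = 75, rate_syll_per_s = 6.5)  # placeholder values

morph_levels <- seq(0.10, 1.00, by = 0.10)  # 10% to 100% "upset"
morphs <- t(sapply(morph_levels, function(w) (1 - w) * neutral + w * hostile))
rownames(morphs) <- paste0(morph_levels * 100, "% upset")
round(morphs, 1)
```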

Procedure

The present task tested how perceivers categorized auditory cues of varying intensity as either “calm” or “upset.” The experiment included three phases: (1) a practice phase, (2) a training phase, and (3) a testing phase. The training phase gave participants explicit feedback on whether each cue should be categorized as “upset” or “calm” in order to create a baseline category boundary. The testing phase examined whether the category boundary established in the training phase would shift in response to different statistical distributions of stimuli (e.g., in response to hearing more or less upset vocalizations) in one of three conditions: calm shifted, unshifted, or upset shifted. Participants completed the entire procedure (practice, training, and testing) for both verbal and nonverbal stimuli, with order counterbalanced across participants such that half participated first in the verbal condition and half participated first in the nonverbal condition. Stimuli were presented with PsychoPy (v1.83.04).

Practice Phase

During the practice phase, participants were introduced to “John” (neutral image of Actor 24 from the MacArthur Network Face Stimuli Set; Tottenham et al., 2009) and told that, “Just like everyone, sometimes John feels upset and sometimes he feels calm. Today we need your help figuring out if he is upset or calm.” Participants were then taught that when John is feeling upset, he likes to “go to the red room and practice boxing,” and when he is feeling calm, he likes to “go to the blue room and read a book.” The goal of this design was to task participants with predicting the next action of the speaker based on their vocalization. On each trial, participants saw an image of headphones, and had to click on the headphones when they were ready to hear John make a sound. After hearing the sound, participants selected either a red room with an image of a punching bag (indicating they think he feels “upset”) or a blue room with an image of an easy chair and book (indicating they think he feels “calm”) using a computer mouse (see Supplemental Material, Figure S1). The side of the screen where each room appeared was counterbalanced between participants. Participants completed 6 practice trials with feedback (“Correct!” or “Incorrect! Please try again”) and repeated incorrect trials until they responded correctly. These practice trials included three calm trials (0%, 10%, 20% upset morphs were labeled as “calm”) and three upset trials (80%, 90%, and 100% upset morphs were labeled as “upset”). The order of morphs was randomized.

Training Phase

During the training phase, participants completed 24 trials with feedback in random order. Stimuli consisted of morphs ranging from 20% upset to 80% upset. The 50% morph was omitted in order to emphasize the category boundary at the midpoint. Participants received feedback (“Correct!” or “Incorrect!”) after each trial, with “upset” being the correct response for morphs greater than 50% upset, and “calm” being the correct response for morphs less than 50% upset.

Testing Phase

During the testing phase, participants completed 72 trials in random order. Participants were randomly assigned to one of three conditions: calm shifted, unshifted, or upset shifted. In the unshifted condition, participants heard the same stimuli as in the training phase (20% upset to 80% upset with the 50% morph omitted to create a category boundary). In the upset shifted condition, participants heard stimuli with a higher average intensity (40% upset to 100% upset with the 70% morph omitted to create a category boundary). In the calm shifted condition, participants heard stimuli with a lower average intensity (0% upset to 60% upset with the 30% morph omitted to create a category boundary). No feedback was given to participants during this phase.
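For concreteness, the morph intensities presented during training and in each testing condition can be written out as follows. This is a descriptive sketch of the design in 10% steps (the nonverbal continuum was available in 5% increments), not the task code, which was implemented in PsychoPy.

```r
# Morph intensities (% upset) presented in each phase/condition; the morph at
# the intended category boundary is always omitted.
training      <- setdiff(seq(20, 80, by = 10), 50)   # boundary at 50%
unshifted     <- training                            # same range as training
upset_shifted <- setdiff(seq(40, 100, by = 10), 70)  # boundary at 70%
calm_shifted  <- setdiff(seq(0, 60, by = 10), 30)    # boundary at 30%

# The three testing distributions span the same width but have shifted means:
sapply(list(calm_shifted  = calm_shifted,
            unshifted     = unshifted,
            upset_shifted = upset_shifted), mean)
```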

Results

We sought to analyze whether adults and children flexibly shifted their category boundaries—the point on a morph continuum where they switched from categorizing the stimuli as “calm” to “upset”—for both verbal and nonverbal vocalizations based upon the distributional sampling of the stimuli they encountered. First, we evaluated whether participants were able to learn the category boundary during the training phase, and whether adults and children learned this boundary similarly. Characterizing participants’ behavior during training ensures that differences observed at testing resulted from the distributions of the stimuli, rather than from some feature of the stimuli or from response biases that participants had prior to the experiment. Next, we evaluated whether participants’ category boundaries changed based upon their exposure to the distribution of stimuli during the testing phase. Figures depicting training phase performance are available in the Supplemental Material. We analyzed verbal and nonverbal conditions separately because our hypotheses were formulated around learning based upon probabilistic sampling of perceptual input, and we did not have a priori hypotheses about differences across stimuli; however, we present a post hoc comparison of the verbal and nonverbal conditions in the Supplemental Material. Analyses were completed in R version 3.6.2 (R Core Team, 2019) using the tidyverse package (Wickham et al., 2019), the lme4 package for our mixed-effects models (Bates et al., 2015), the ggplot2 (Wickham, 2016) and sjPlot (Lüdecke, 2020) packages for our graphs and tables, and the simr package for power analyses (Green & MacLeod, 2016). Stimuli, data, task scripts, and R scripts are available online at https://osf.io/749xq/?view_only=ef7f9d9509284ed2927948509c596db5. Although the individuals who produced these stimuli did not consent to sharing the stimuli publicly online, they did give us permission to share the stimuli with other researchers upon request.

Do Perceivers Shift Their Categorization of Verbal Vocalizations Based upon the Distribution of Stimuli They Encounter?

Training Phase

Both adults and children had high accuracy (children mean accuracy = 97.4%; adult mean accuracy = 96.6%), where accurate responding means labeling a sound more than 50% upset as “upset” (by clicking the red room) and a sound less than 50% upset as “calm” (by clicking the blue room). To test for group differences in accuracy, we ran a logistic generalized linear mixed-effects model predicting accuracy from Age (children coded as − 0.5 and adults coded as 0.5) with a by-participant random intercept. We found no difference in accuracy between adults and children, b = − 0.24, z = − 1.13, p = .26, OR = 0.78. We also tested for group differences in how likely adults and children were to categorize each morph as “upset” (i.e., whether one age group identified morphs as upset earlier or later in the continuum). To examine this, we used a logistic generalized linear mixed-effects model on participants’ categorization of the vocal expressions (“calm” = 0, “upset” = 1) with a main effect of the Percent Upset of the stimuli (0% to 100% upset, mean-centered, in increments of 10%), a main effect of Age (Children vs. Adults), the interaction between Percent Upset and Age, a by-participant random slope for Percent Upset, and a by-participant random intercept. Overall, no age differences in performance emerged during the training phase. Both adults and children similarly learned the category boundary, with vocal stimuli being more likely to be categorized as “upset” with each 10% increase in intensity, b = 2.79, z = 17.68, p < .001, OR = 16.28. There were no age-related differences in learning the category boundary, b = − 0.14, z = − 0.65, p = .52, OR = 0.87, and no interaction between Age and Percent Upset, b = − 0.21, z = − 1.02, p = .31, OR = 0.81.
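In lme4 syntax, the two training-phase models described above might look like the following sketch. The variable names (acc, resp_upset, age_c, pct_upset_c, id, training_data) are placeholders of our own, not the authors’ actual column names.

```r
library(lme4)

# Accuracy model: Age effect with a by-participant random intercept.
# acc: 1 = correct, 0 = incorrect; age_c: children = -0.5, adults = 0.5.
m_acc <- glmer(acc ~ age_c + (1 | id),
               data = training_data, family = binomial)

# Categorization model: "upset" responses (0 = calm, 1 = upset) as a function of
# mean-centered Percent Upset, Age, and their interaction, with a by-participant
# random slope for Percent Upset and a by-participant random intercept.
m_cat <- glmer(resp_upset ~ pct_upset_c * age_c + (1 + pct_upset_c | id),
               data = training_data, family = binomial)
summary(m_cat)  # coefficients (b) and z values; ORs are exp(fixef(m_cat))
```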

Testing Phase

Next, we examined whether participants would shift their emotion category boundaries after unsupervised exposure to a new statistical distribution of vocal input without feedback. We again used a logistic generalized linear mixed-effects model. The full model regressed participant responses on the three-way interaction between Percent Upset (mean-centered), dummy-coded Shift Type (calm shifted, unshifted, upset shifted), and Age (Children vs. Adults), plus all lower-order fixed effects, with a by-participant random slope for Percent Upset and a by-participant random intercept.
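A sketch of this testing-phase model in lme4 syntax is shown below, again with placeholder variable names; the specific function used to obtain the chi-square statistics reported in the next paragraph is not stated in the text, so the call shown here is an assumption.

```r
library(lme4)
library(car)  # Anova(); whether the authors used this function is an assumption

# Three-way interaction of mean-centered Percent Upset, Shift Type (factor:
# calm shifted / unshifted / upset shifted, dummy-coded), and Age, plus all
# lower-order terms, with a by-participant random slope for Percent Upset and
# a by-participant random intercept.
m_test <- glmer(resp_upset ~ pct_upset_c * shift_type * age_c +
                  (1 + pct_upset_c | id),
                data = testing_data, family = binomial)

summary(m_test)  # per-term coefficients (b), z values; ORs via exp(fixef(m_test))
Anova(m_test)    # Wald chi-square tests for each term (one plausible source of
                 # the chi-square statistics reported in the text)
```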

As in the training phase, there were no differences in performance between adults and children, b = 0.07, χ2(1) = 0.03, p = .86, OR = 1.07. As predicted, exposure to shifted distributions of vocal stimuli affected participants’ categorization, χ2(2) = 489.07, p < .001 (Fig. 1). Those in the calm shifted condition identified vocal stimuli as upset earlier in the morph continuum, b = 3.09, z = 10.86, p < .001, OR = 21.97, while those in the upset shifted condition identified vocal stimuli as upset later in the morph continuum, b = − 3.19, z = − 10.84, p < .001, OR = 0.04. Those in the unshifted condition also had a steeper category boundary between identifying stimuli as “upset” versus “calm,” which was not unexpected, as individuals in this condition did not have to learn a category boundary different from the one in training, χ2(2) = 21.66, p < .001 (interaction between Percent Upset and Shift Type; both dummy-coded terms were also significant in the expected direction). These results indicate that participants adapted their categorization of which auditory cues constituted “upset” based on the distribution of auditory morphs encountered in the shifted experimental conditions.

Fig. 1
figure 1

Verbal testing phase: exposure to varying distributions of verbal stimuli affected participants’ categorization

We conducted a post hoc power analysis by running 100 simulations in the SIMR package (Green & MacLeod, 2016) and found that we had essentially 100% power (95% CI: 96.36–100%) to detect our effect of Shift Type and 99% power (95% CI: 94.55–99.97%) to detect the Shift Type * Percent Upset interaction.
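The kind of simulation involved might look like the sketch below, assuming the fitted testing-phase model from the previous sketch; the exact test specification the authors used is not reported, so the arguments shown are assumptions.

```r
library(simr)  # Monte Carlo power simulation for mixed-effects models

# Simulate new responses from the fitted model 100 times and count how often the
# Shift Type effect is detected with a likelihood-ratio test. Model and variable
# names carry over from the earlier placeholder sketch.
power_shift <- powerSim(m_test, test = fixed("shift_type", "lr"), nsim = 100)
power_shift  # prints the power estimate with a 95% confidence interval

# The Shift Type x Percent Upset interaction could be examined analogously, for
# example by comparing models with and without the interaction term.
```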

Do Perceivers Shift Their Categorization of Nonverbal Vocalizations Based upon the Distribution of Stimuli They Encounter?

Training Phase

We examined participants’ categorization of nonverbal vocalizations using the same analytic models. Adults and children learned the emotion category boundary during training (children mean accuracy = 87.3%; adult mean accuracy = 90.3%); adults were slightly more accurate than children, b = 0.31, z = 2.81, p < .01, OR = 1.36. Next, we examined whether there were differences in how likely adults and children were to categorize each morph as “upset.” We again found that adults and children were able to learn the category boundary, with auditory stimuli being more likely to be categorized as “upset” with each 10% increase in intensity, b = 1.66, z = 25.74, p < .001, OR = 5.24. There was no main effect of Age, b = 0.11, z = 0.96, p = .34, OR = 1.12, indicating that children were not identifying morphs as upset earlier or later in the morph continuum than adults. However, there was an interaction between Age and Percent Upset, b = 0.29, z = 2.66, p < .01, OR = 1.33, indicating that adults had a steeper category boundary between “calm” and “upset” than children (reflecting children’s slightly higher error rate). These results indicate that both adults and children successfully learned the 50% category boundary during the training phase, but that children made more errors and had less precise category boundaries.

Testing Phase

Next, we examined whether exposure to a new statistical distribution of auditory input would shift participants’ categorization. We used the same logistic generalized linear mixed-effects model as above. Unsupervised exposure to shifted distributions of nonverbal auditory stimuli again impacted participants’ categorization behaviors, χ2(2) = 707.84, p < .001 (Fig. 2). Those in the calm shifted condition identified vocal stimuli as “upset” earlier in the morph continuum, b = 3.01, z = 12.98, p < .001, OR = 20.27, while those in the upset shifted condition identified vocal stimuli as “upset” later in the morph continuum, b = − 3.93, z = − 16.65, p < .001, OR = 0.02. There was no main effect of Age, b = 0.36, χ2(1) = 1.53, p = .22, OR = 1.44, and no age-related or other significant interactions in the model. These data indicate that participants adapted their categorization of which auditory cues constituted “upset” based on the distribution of auditory morphs they encountered in the experimental conditions.

Fig. 2
figure 2

Nonverbal testing phase: exposure to varying distributions of nonverbal auditory stimuli affected participants’ categorization

We conducted a post hoc power analysis by running 100 simulations in the SIMR package (Green & MacLeod, 2016) and found that we had essentially 100% power (95% CI: 96.38–100.00%) to detect our effect of Shift Type.

Discussion of Experiment 1

Participants shifted their categorization of both verbal and nonverbal auditory stimuli based on the distributions of input they encountered. Adults and children learned and adapted to the variation in input with similar flexibility and speed, except that children found the training phase of the nonverbal task slightly more difficult (see Supplemental Material for more detail).

Experiment 2

Experiment 2 tests whether the learning effects observed in Experiment 1 continue to emerge beyond the presentation of a single prototypical cue. To extend the findings of Experiment 1, we used the same general procedure with a few key changes. First, we used multiple emotion categories in Experiment 2, which allowed us to examine the effect of Shift Type (calm shifted, unshifted, and upset shifted) as a within-participant manipulation, with each emotion assigned to a different Shift Type (see Procedure for more details). Second, the inclusion of prototypes of sadness and fear—which are typically harder to identify accurately in the voice than anger—made the task more difficult (Johnstone & Scherer, 2000; Morningstar, Ly, Feldman, & Dirks, 2018; Scherer, 2019). This added complexity provided a rigorous test of whether the shifting effects would continue to be observed beyond the limited, single-stimulus condition of Experiment 1, providing both a replication and an extension of those data. Thus, Experiment 2 allowed us to test whether participants continue to track the distributions of vocal cues under more complex conditions involving multiple speakers and vocalization categories.

Method

Participants

Forty adults (32 female; age range = 18–21 years, Mage = 18.84 years, SDage = 0.74 years) participated in this experiment. We aimed for 40 participants based on sample sizes of previous research that also examined these effects (Experiments 2 and 3 in Plate et al., 2019). As we found no age-related differences in how adults and children used distributional information in Experiment 1, we tested only adults in Experiment 2. All participants were undergraduate students at the University of Wisconsin-Madison and received course credit (2.50% African American, 30% Asian American, 10% Hispanic, 57.5% White). The Institutional Review Board approved the research.

Stimuli

Stimuli were created in the same way as those in Experiment 1, but included actor portrayals of neutral morphed into the categories of anger, sadness, and fear. Original recordings were produced by one male actor (anger) and two female actors (sadness, fear) saying the sentence “Why did you do that?” in both a neutral and an emotional tone of voice. Morphs were created using the same procedure outlined above, yielding 11 recordings ranging from 10% emotional intensity to 100% emotional intensity for each speaker/emotion. Validation data from 190 listeners suggest that listeners were increasingly able to identify the intended emotion as the emotional intensity of the morphs increased (Morningstar et al., under review), with accuracy ranging from 18% for the 10% intensity recordings to 59% for the 100% intensity recordings (in a task where chance was 14% accuracy). Nonverbal stimuli were not included in Experiment 2 because we were unable to create a realistic-sounding morph for sadness.

Procedure

The procedure was identical to Experiment 1 except that Shift Type became a within-participant manipulation. Participants were introduced to three different actors (“John,” “Jane,” and “Anna”—Actors 24, 10, and 6 from the MacArthur Network Face Stimuli Set; Tottenham et al., 2009), and the number of trials in each phase was adjusted. The task included 72 trials during the training phase (24 per actor) and 144 trials during the testing phase (48 per actor); during testing, participants were exposed to all three emotions and all three shift conditions (calm shifted, unshifted, and upset shifted). Each shift condition was applied to one of the emotions/actors: for instance, a participant could hear calm shifted expressions of sadness by “Anna,” unshifted expressions of anger by “John,” and upset shifted expressions of fear by “Jane.” The way in which each emotion was shifted was counterbalanced across participants. As in Experiment 1, participants categorized each morph as upset (by clicking the red room) or calm (by clicking the blue room). We kept the rooms and instructions the same as in Experiment 1, as all three emotions are negatively valenced and fall into the “upset” category (even though they may differ in other ways, such as arousal levels).
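One way to picture the within-participant counterbalancing is as a rotation of shift conditions over the three actor/emotion pairings, as in the illustrative sketch below. It uses the example pairings mentioned above and a simple three-group rotation; the authors’ actual assignment scheme (e.g., all six permutations) is not specified, so this is an assumption.

```r
# Illustrative counterbalancing sketch (not the authors' assignment code): each
# actor/emotion pairing is fixed, and the shift condition assigned to each
# pairing rotates across counterbalancing groups.
pairings <- c("John/anger", "Anna/sadness", "Jane/fear")
shifts   <- c("calm shifted", "unshifted", "upset shifted")

assignments <- t(sapply(0:2, function(k) shifts[((0:2 + k) %% 3) + 1]))
dimnames(assignments) <- list(paste("group", 1:3), pairings)
assignments
```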

Results: Do Perceivers Shift Their Categorization of Verbal Vocalizations Based upon Different Distributions of Stimuli They Encounter for Multiple Emotions?

Training Phase

We analyzed whether participants continued to learn the category boundary during training when tracking three different actors displaying three different emotions. Participants were successful, with an average accuracy of 85.97% during training. We then regressed participants’ responses (0 = “calm,” 1 = “upset”) on Percent Upset using a logistic generalized linear mixed-effects model and again found that for each 10% increase in emotional intensity, participants were more likely to categorize an auditory cue as upset across all emotion types, b = 1.12, χ2(1) = 235.46, p < .001, OR = 3.06.

Testing Phase

Next, we analyzed whether participants updated their category boundaries for the three different emotions based upon exposure to different distributions. Recall that in Experiment 1 we manipulated shift condition between participants, whereas in Experiment 2 we manipulated it within participants. Still, as in Experiment 1 and in support of our hypothesis, we found that participants shifted their category boundary for each voice identity based on the distributions encountered, χ2(2) = 25.92, p < .001 (see Fig. 3). Calm shifted emotions were identified as “upset” marginally earlier in the morph continuum, b = 0.83, z = 1.76, p = .078, OR = 2.29, and upset shifted emotions were identified as “upset” later in the morph continuum, b = − 1.29, z = − 3.47, p < .001, OR = 0.27. Post hoc analyses for each of the different emotions are presented in the Supplemental Material.

Fig. 3
figure 3

Multiple emotion testing phase: exposure to varying distributions of verbal auditory stimuli for multiple emotions affected participants’ categorization

Discussion of Experiment 2

We replicated and extended the findings of Experiment 1: Participants shifted their categorization of different auditory cues for multiple speakers and emotions after exposure to different distributions. Crucially, participants were able to track this information for multiple categories (and individuals) at once. These findings suggest that perceivers are able to account for individual differences in expressivity in their judgments, and that the general patterns of learning that we observed in Experiment 1 also emerged with a new set of stimuli and emotion cues. However, the current data cannot determine how much individuals were adjusting to different speakers versus different emotions. Overall, Experiment 2 is consistent with the view that these shifts generally impact perceptual learning of vocalizations of emotions.

General Discussion

The present experiments examined whether individuals utilized the distributional properties of perceptual stimuli to flexibly adjust vocal emotion categories. We tasked adults and children with categorizing the affective states communicated by verbal and nonverbal vocalizations that varied continuously from “calm” to “upset” while we varied the distribution of the intensity of the stimuli they encountered. We found that participants rapidly adjusted their categorization of auditory emotion cues based upon the statistical distribution of the input to which they were exposed. When a speaker’s vocal intensity was limited such that they never expressed “maximal” negative arousal (as in the calm shifted condition), vocalizations that were previously categorized as calm came to be categorized as upset. In other words, when listening to less expressive speakers, people adopted lower thresholds for detecting emotion in the voice. Likewise, when a speaker’s expressive range was more intense (as in the upset shifted condition), vocalizations that were previously categorized as upset came to be categorized as calm; listeners adapted to highly expressive speakers by raising their threshold for detecting emotion in the voice. This adjustment occurred across a range of vocal stimuli (both verbal and nonverbal) and for multiple speakers and negatively valenced emotions. Categorization of auditory cues of emotion thus appears to be flexible and sensitive to the expressivity of the speaker.

In combination with prior research on the categorization of facial cues meant to represent anger (Plate et al., 2019), these results provide evidence for a general learning mechanism that allows children and adults to adjust to the ways that different people communicate their emotions. This mechanism may be what allows individuals to learn to respond appropriately to social cues despite individual differences and cultural variation in overall expressivity (Laukka & Elfenbein, 2021; Rychlowska et al., 2015; Wood et al., 2016), and it may even play a role in helping children learn emotion categories. However, the short-term manipulations of vocal expressivity in the present experiments are not expected to have long-term effects on people’s category knowledge. Future research could investigate whether repeated exposure to different distributions, for instance being socialized in families or cultures with different expressive norms, creates stable individual differences in how vocalizations are interpreted—and how these distributions interact with more instantaneous summary statistics (Whitney & Yamanashi Leib, 2018). Such data could also reveal how differences in intensity might influence ratings of speaker characteristics, or how a participant’s adjustment to speaker expressivity contributes to empathic accuracy (Zaki et al., 2008). Here, we examined categorical ratings, but recent research suggests that continuous ratings are also likely to change with exposure to different distributions of information (Leitzke et al., 2020).

The similarities between our findings and those in other domains, such as speech perception (Samuel & Kraljic, 2009; Weatherholtz & Jaeger, 2016), present an opportunity to draw on models and research from these areas. For instance, models of speech perception suggest that perceivers track and use variation across speaker groups when that information is informative and useful. In speech perception, variables like age, gender, or dialect may aid speech categorization in situations where these variables reliably predict patterns of speech variability (Kleinschmidt, 2019; Kleinschmidt & Jaeger, 2015; Kleinschmidt & Jaeger, 2011). It is not feasible (and likely unhelpful) for perceivers to track all possible sources of variability in how different individuals convey emotion, but it is possible that perceivers, as in these speech models, use social groupings such as age, gender, and perceived regional/cultural background as potentially salient cues when tracking variation in emotional expressivity.

Does the variability in how emotions are conveyed diminish the role of such surface features in emotion learning? Perceptual features related to emotion are likely so variable that children may need to rely on language and other converging cues, in addition to facial and vocal information, to learn these categories (Hoemann et al., 2019). However, the role of perceptual features commonly associated with emotion categories is also a critical piece of this learning puzzle (Keltner et al., 2019). In our view, delineating boundaries between conceptual versus perceptual effects in emotion fails to account for the ways in which perceptions and concepts overlap and influence each other (Goldstone & Barsalou, 1998). For instance, labels help to guide infant learning, but only if those labels correlate with perceptual features (Plunkett, 2011; Plunkett et al., 2008). Similarly, there are many examples of the ways in which variability of perceptual input allows children to meaningfully separate tokens as the basis for formulating relevant categories (Adriaans & Swingley, 2017). The present data suggest that theories of emotion need to adequately consider the role that early perceptual input and experience play in the formation of emotion concepts and categories. In these ways, the very natural variation in how emotions are communicated may itself be an important source of children’s emotion category learning.