Introduction

Emotion is contagious, whether it be a smile, a laugh, or a song. When we see others expressing emotion, it inspires us to do the same, often without conscious thought. But how do we do this—what mechanisms allow us to understand and mimic the emotions of others? In the current study, we used electroencephalography (EEG) to investigate the neural basis of emotion recognition. In particular, the current study is the first to assess whether activity in sensorimotor cortices (i.e., engagement of the human mirror neuron system [hMNS]) may support the recognition of emotion that is expressed nonverbally through speech.

Speech is an audiovisual signal, including speech sounds and facial speech movements (Sumby & Pollack, 1954). Both of these sensory modalities are important to understanding the emotion of a speaker, which can be conveyed via auditory cues (i.e., their vocal expression) or visual cues (i.e., their facial expression). The overwhelming majority of research on emotion recognition has focused on visual cues that are revealed by facial expressions. One of the brain mechanisms thought to underpin facial emotion recognition involves activity in sensorimotor cortices. When we perceive facial expressions of emotions, we unconsciously mimic them (Dimberg, 1982; Dimberg et al., 2000; Larsen et al., 2003). Furthermore, when such facial mimicry is prevented (e.g., by holding a pen in the teeth) or otherwise impaired (e.g., in Autism Spectrum Disorder), our ability to recognize facial expressions of emotion involving those same muscles is reduced (Borgomaneri et al., 2020; Oberman et al., 2007; Ponari et al., 2012). In some studies, the degree of mimicry also has been associated with our ability to recognize the expressed emotions (Ipser & Cook, 2016; Künecke et al., 2014; Livingstone et al., 2016). For instance, Livingstone et al. (2016) found that patients living with Parkinson’s Disease varied in the extent to which they mimicked faces producing emotional speech and song. In addition, the greater the response of a patient’s zygomaticus major muscle (required for smiling) to the perception of happy faces, the faster that patient identified the emotion.

Wood et al. (2016) offer a useful conceptual explanation for these phenomena. They argue that when perceiving facial emotions, the face regions of the somatosensory and motor cortices are activated in response. Wood et al. refer to this response as “sensorimotor simulation” (Gallese, 2007; Jeannerod, 2001), because the neural systems required for the production of emotional facial expressions also are activated during the perception of those same emotions. Peripheral facial mimicry also may emerge as a downstream consequence of this sensorimotor simulation (Oberman et al., 2009; Prochazkova & Kret, 2017; Russo, 2020; Shamay-Tsoory, 2011). This simulation (both neural and peripheral) leads to activation of brain areas associated with the experience of those specific emotions, which in turn allows the perceiver to recognize them. This embodied account of emotion perception harkens back to the James-Lange theory of emotion (James, 1884), which proposed that we understand our emotional states by interpreting our bodies’ responses to the environment. More recent formulations of embodied accounts would not suggest that simulation is necessary, nor that it is the sole mechanism of emotion perception (Goldman, 2006); however, it may be especially helpful when judgments are rushed or when emotions are challenging to recognize (Wood et al., 2016; Karakale et al., 2019).

As pointed out by Wood et al. (2016), the precise neural underpinnings of sensorimotor simulation are a matter of debate. However, a strong candidate is the hMNS, a collection of brain areas that responds to both the execution of an action (e.g., a facial expression) as well as the observation of that same action (Cross et al., 2009; Rizzolatti & Craighero, 2004). Although the precise nature of this network is debated (Cook et al., 2014; Hickok, 2009), the existence of a network with these properties is generally accepted (for meta-analyses, see Caspers et al., 2010; Molenberghs et al., 2009). Audio and visual information are proposed to enter the hMNS via the superior temporal sulcus (STS), which provides sensory input to the inferior parietal lobule (IPL) and in turn the premotor cortex (PMC) and inferior frontal gyrus (IFG; Chong et al., 2008; Press et al., 2012). In addition to these “classical” hMNS areas, there is an “extended” hMNS—including the somatosensory cortex (S1/S2), presupplementary motor area (pre-SMA), middle temporal gyrus (MTG), and insula—which supports the classical hMNS, despite perhaps not possessing the same mirror-like properties (Bonini, 2017; Pineda, 2008).

Brain areas in this network appear to work together to simulate perceived action, including emotions, although the precise roles of each area are not entirely clear. One account argues that the parietal node of the hMNS (IPL, S1/S2) simulates the afferent (somatosensory) aspects of an action, whereas the frontal node (PMC/IFG, pre-SMA) simulates the efferent (motor) aspects of an action (Avenanti et al., 2007; Bastiaansen et al., 2009; Russo, 2020). In the case of emotion perception, the parietal node may process the somatic sensations associated with the expression of an emotion, whereas the frontal node may process the associated motor plans. In addition, other areas of the extended hMNS involved in the processing of emotions, including the insula, may enable us to empathize with those emotions (Bastiaansen et al., 2009; Iacoboni, 2009). These three components of the hMNS (parietal, frontal, and insular) may serve the functions deemed necessary by Wood et al.’s (2016) model, particularly the somatosensory, motor, and affective aspects of simulation, respectively.

There is considerable neuroimaging evidence for the view that the hMNS is involved in the recognition of facial emotions (Oberman & Ramachandran, 2007; Wood et al., 2016). Functional magnetic resonance imaging (fMRI) studies have regularly found greater activation of the classical hMNS (i.e., IPL and PMC/IFG) and extended hMNS (e.g., insula) in response to facial emotion (or, more often, emotion-related judgements of faces) than their neutral counterparts (Carr et al., 2003; Van der Gaag et al., 2007; Sarkheil et al., 2013; Wicker et al., 2003; Zaki et al., 2009). For instance, Van der Gaag et al. (2007) found that viewing facial emotion (happy, fearful, and disgusted) led to greater activation of the STS, IFG, and pre-SMA than viewing neutral faces. In addition to fMRI, another method that has been widely used to study facial emotion processing in the hMNS, often with the same results (Moore & Franz, 2017), is EEG (see also Karakale et al., 2019; Moore et al., 2012; Perry et al., 2010; Pineda & Hecht, 2009). The signal of interest in these studies is generally the alpha frequency band (8–13 Hz) of the mu rhythm, generated over sensorimotor brain areas. Power reduction in the mu rhythm, also called mu event-related desynchronization (mu-ERD), may reflect decoupling of oscillations across the two nodes of the hMNS (frontal and parietal). Thus, mu-ERD often is considered a measure of activity in the hMNS, with greater mu-ERD indicating greater activity (Fox et al., 2016; Lepage & Théoret, 2006; Muthukumaraswamy et al., 2004).
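
To make the measure concrete, the sketch below illustrates how mu-ERD is typically quantified: power in the mu band during stimulus presentation is expressed as a percent change from a pre-stimulus baseline, with negative values indicating desynchronization. This is a minimal MATLAB illustration only, not the pipeline used in the current study; the segment lengths and placeholder signals are assumptions.

```matlab
% Minimal sketch of how mu-ERD is typically quantified: percent change in
% 8-13 Hz power relative to a pre-stimulus baseline. Segment lengths and the
% placeholder signals are assumptions; requires bandpower() from the Signal
% Processing Toolbox.
fs     = 1024;                               % sampling rate (Hz)
muBand = [8 13];                             % mu band (Hz)
x_base = randn(4 * fs, 1);                   % placeholder 4-s baseline segment
x_stim = randn(round(3.1 * fs), 1);          % placeholder 3.1-s stimulus segment

P_base = bandpower(x_base, fs, muBand);      % mean mu power during baseline
P_stim = bandpower(x_stim, fs, muBand);      % mean mu power during the stimulus
muERD  = 100 * (P_stim - P_base) / P_base;   % negative values indicate desynchronization
```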

Although the above studies support the role of the hMNS in the recognition of facial emotions, little research has assessed the involvement of the hMNS in the recognition of vocal emotion, particularly as expressed in speech sounds. This is despite some evidence that the hMNS is involved in the perception of speech sounds generally (Jenson et al., 2014; Wilson et al., 2004; but see Hickok, 2010). Nonetheless, a few studies have considered the hMNS in the context of vocal emotion outside of the speech domain. For example, Warren et al. (2006) found that the left pre-SMA had a greater response to higher-intensity nonverbal vocalizations (i.e., triumph vs. amusement) and that the left IFG had a greater response to more positively valenced nonverbal vocalizations (i.e., triumph vs. disgust). McGarry et al. (2015) found that audiovisual presentations of sung intervals led to greater activation of the extended hMNS than did nonbiological control stimuli matched for mean pitch and gross facial movement.

In sum, although a considerable body of evidence suggests that the hMNS plays a role in perceiving facial emotions, and perhaps also emotional nonverbal vocalizations, no study has considered whether the hMNS plays a role in the perception of emotion as it is expressed in speech. Speech stimuli are also likely to be more ecologically valid than static faces or nonverbal vocalizations. After all, dynamic vocal-facial expressions of emotion in speech are far more common in the real world than static displays of facial emotion captured at the apex of an expression, or nonverbal vocalizations such as a shout of triumph. To explore this topic, the current study asked 24 healthy adults to observe and classify emotional speech stimuli while brain activity was measured and source-localized using EEG. The stimuli were drawn from the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) and encompassed happy, sad, and neutral expressions of emotion. The stimuli were presented in each of three sensory modalities: audio-only, visual-only, and audiovisual.

Our primary prediction, based on the literature summarized above, was that the hMNS would show greater responsiveness to emotional speech (happy and sad) than to neutral speech. A strong candidate brain area for this effect would be the pre-SMA, which previous studies of facial and vocal speech emotion have found to be sensitive to emotional intensity (Van der Gaag et al., 2007; Warren et al., 2006). We also had two secondary predictions. First, we predicted that the hMNS would be activated most strongly in response to modalities containing visual information (i.e., visual-only and audiovisual). This was based on the findings of Crawcour et al. (2009), an EEG study that reported greater mu-ERD over motor areas in response to facial speech movements than to speech sounds. Second, we predicted that the hMNS response to emotion (e.g., in the pre-SMA) would be positively correlated with participants’ accuracy when classifying emotions. This result has been found in some prior fMRI studies of emotion processing in the hMNS (Bestelmeyer et al., 2014; McGettigan et al., 2013), and a similar result here would suggest that the hMNS confers an active benefit to emotion recognition.

Methods

Participants

Twenty-four healthy adults (20 females, 4 males) participated in the study, recruited from undergraduate psychology courses and from the community. Participants either received course credit for taking part or participated as volunteers. Ages ranged from 20 to 37 years (M = 20.88, SD = 4.93). All participants reported having normal hearing, and all but three reported being right-handed. The sample size was chosen to resemble other EEG-hMNS studies that have considered the effect of emotion (McGarry et al., 2015; Moore et al., 2012). A power analysis using the R package “Superpower” (Lakens & Caldwell, 2021) indicated that this sample size allowed us to detect effects of emotion and modality on mu-ERD as small as d = 0.33 with 80% power, assuming that mu-ERD is moderately correlated across conditions (r = 0.50). The study was approved by the Research Ethics Board at Ryerson University (protocol number 2018-241).

Design

The study was based on a two-factor, within-subject design. The independent variables were the emotion (happy, sad, or neutral) and modality (audio-only, visual-only, or audiovisual) of the presented speech stimuli. The dependent variables were mu-ERD, emotion classification accuracy, and emotion classification reaction time. The study consisted of three blocks, each corresponding to one modality. These blocks were presented in a fully counterbalanced order. There were six counterbalancing orders, each completed by four participants.

Stimuli

Stimuli were taken from the RAVDESS (Livingstone & Russo, 2018), specifically from actors 1, 4, 5, 8, 9, and 24 (3 males, 3 females). Stimuli ranged from 3.10 to 4.40 s in duration (M = 3.59, SD = 0.26); these durations did not differ between the three emotions (ps > 0.40). To minimize the possibility that differences in mu-ERD between conditions were due to low-level stimulus features, all stimuli were root-mean-square normalized in amplitude using Praat version 6.0.40. Based on the RAVDESS validation data, the specific actors and stimuli were chosen to ensure that there were no outlier stimuli whose emotions were especially difficult to identify or that were rated as being especially low in genuineness (more than two SD below the mean). For each participant, two actors (1 male, 1 female) were randomly assigned to each modality block. Within each modality block, each actor contributed 24 stimuli, 8 per emotion. For happy and sad stimuli, there were two neutral statements (“The kids are talking by the door” or “The dogs are sitting by the door”), two emotional intensities (low or high), and two repetitions of each stimulus. Neutral stimuli used the same two statements, two iterations (a second recording of the same stimulus), and two repetitions of each stimulus. For each participant, stimuli were randomized within each modality block.
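
Amplitude normalization was carried out in Praat; purely as an illustration of the operation itself, the MATLAB sketch below scales a waveform so that its root-mean-square level matches a common target. The file names and the target level are hypothetical.

```matlab
% Illustrative root-mean-square (RMS) normalization (the study used Praat 6.0.40
% for this step). File names and the target RMS level are hypothetical.
[x, fs]   = audioread('stimulus_in.wav');    % load one stimulus waveform
targetRMS = 0.05;                            % common target level (assumption)
x_norm    = x * (targetRMS / rms(x(:)));     % scale so the RMS matches the target
x_norm    = max(min(x_norm, 1), -1);         % guard against clipping
audiowrite('stimulus_out.wav', x_norm, fs);  % write the normalized stimulus
```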

EEG Recording

EEG data were collected using a 64-channel BioSemi ActiveTwo EEG system at a sampling rate of 1,024 Hz. Source localization of EEG signals was performed using independent component analysis (ICA; Delorme et al., 2007), a technique that identifies unique sources of spectrotemporal variability in each participant’s data. Our research group has previously used EEG and ICA to replicate hMNS results found using fMRI (McGarry et al., 2012). The number of channels (64), the sampling rate (1,024 Hz), and the total duration of stimulus presentation per participant (7.2 min) ensured that enough EEG data were collected to perform an ICA, for which at least (number of channels)² × 30 data points are recommended.
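
As a rough check of that rule of thumb against the parameters reported above, the short sketch below works through the arithmetic (assuming the heuristic applies to the data entering the ICA).

```matlab
% Rough check of the ICA data-quantity rule of thumb against the recording
% parameters reported above (assuming the heuristic applies to the epoched data).
nChannels   = 64;
fs          = 1024;                          % sampling rate (Hz)
stimMinutes = 7.2;                           % total stimulus presentation per participant

samplesNeeded    = nChannels^2 * 30;         % 122,880 data points recommended
samplesCollected = stimMinutes * 60 * fs;    % 442,368 data points available
fprintf('Collected %.0f samples; %.0f recommended (%.1fx the minimum).\n', ...
        samplesCollected, samplesNeeded, samplesCollected / samplesNeeded);
```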

Procedure

Participants entered the lab and gave informed, written consent. They were then fitted with a 64-electrode EEG cap arranged according to the international 10–20 system (Homan et al., 1987), with two electrodes attached to the mastoids (right and left). One horizontal eye electrode and one vertical eye electrode (both on the right) were used to facilitate measurement of eye blinks and eye movements. In addition, two electrodes were placed on the zygomaticus major (right), required for smiling, and two were placed on the corrugator supercilii (right), required for frowning. Data from the facial-muscle electrodes were not analyzed because their quality was poor. Participants were then seated in a sound-attenuated chamber, approximately 60 cm in front of a Windows computer monitor (61-cm screen, measured diagonally) on which the experiment was presented using Psychtoolbox-3 (Pelli, 1997) in MATLAB version 9.2. EEG data were monitored and saved on a separate Windows computer.

Participants were then taken through general task instructions. Once they understood the requirements of the task, they completed the first modality block. Specific task instructions and the visual components of stimuli (i.e., faces) were presented on the computer monitor, and audio components of stimuli (i.e., speech sounds) were presented via two speakers on either side of the computer monitor at a peak sound level of approximately 60 dB SPL, which is typical of everyday conversation. Each trial consisted of a fixation cross presented for a randomly varying duration of 4.00 to 7.00 s, followed by a stimulus (average duration 3.59 s), and finally a blank screen presented for a randomly varying duration of 2.30 to 3.40 s. Using a keyboard, participants then classified the emotion that they perceived as quickly as they could while still being accurate. There were 48 stimuli presented per block. After the first and second blocks, participants could take a self-paced rest before continuing to the next block. After completing all three blocks, participants completed a short demographic questionnaire before being debriefed.
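
For concreteness, the sketch below outlines the timing of a single trial using Psychtoolbox-style calls. It is an illustration only, not the experiment script used in the study; stimulus playback is omitted and the response prompt is a hypothetical placeholder.

```matlab
% Sketch of one trial's timing using Psychtoolbox-3. This is an illustration,
% not the experiment script; stimulus playback is omitted and the response
% prompt is a hypothetical placeholder.
win = Screen('OpenWindow', max(Screen('Screens')), 0);    % full-screen black window

% 1. Fixation cross, jittered 4.00-7.00 s
DrawFormattedText(win, '+', 'center', 'center', 255);
Screen('Flip', win);
WaitSecs(4 + 3 * rand);

% 2. Stimulus presentation (audio-only, visual-only, or audiovisual)
%    ... present the RAVDESS stimulus here (mean duration ~3.59 s) ...

% 3. Blank screen, jittered 2.30-3.40 s
Screen('FillRect', win, 0);
Screen('Flip', win);
WaitSecs(2.3 + 1.1 * rand);

% 4. Speeded emotion classification via keyboard
DrawFormattedText(win, 'Happy, sad, or neutral?', 'center', 'center', 255);
Screen('Flip', win);
[~, keyCode] = KbStrokeWait;                              % wait for a keypress
Screen('CloseAll');
```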

Data Processing

The EEG data were processed in MATLAB R2019b using EEGLAB version 19.1 (Delorme & Makeig, 2004). The processing pipeline was the same as that of Copelli et al. (in press). Raw EEG data were band-pass filtered from 1–60 Hz using “pop_eegfiltnew.” Next, noisy scalp electrodes were identified using “pop_rejchan” and interpolated on the basis of surrounding channels using “pop_interp,” and external electrodes were removed. For each participant, between 0 and 8 of the 64 scalp channels were interpolated (with an average of 2.83). Data were re-referenced to the average using “pop_reref” and epoched from −4.00 s (baseline) to 3.10 s (the duration of the shortest stimulus) relative to stimulus onset. For each trial, the average over the 4.00-s baseline was subtracted from the values from 0 to 3.10 s.
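
A condensed EEGLAB sketch of this preprocessing sequence is given below. The rejection threshold, the event code 'stim', and the spherical interpolation method are assumptions intended to illustrate the functions named above rather than to reproduce the exact parameters of the pipeline.

```matlab
% Condensed sketch of the preprocessing steps described above (EEGLAB).
% The threshold, the event code 'stim', and the interpolation method are assumptions.
origLocs = EEG.chanlocs(1:64);                            % scalp channel locations only

EEG = pop_eegfiltnew(EEG, 1, 60);                         % 1-60 Hz band-pass filter
EEG = pop_select(EEG, 'channel', 1:64);                   % remove external electrodes
[EEG, badIdx] = pop_rejchan(EEG, 'elec', 1:EEG.nbchan, ...% identify and remove noisy channels
    'threshold', 5, 'norm', 'on', 'measure', 'kurt');
EEG = pop_interp(EEG, origLocs, 'spherical');             % interpolate the removed channels
EEG = pop_reref(EEG, []);                                 % re-reference to the average
EEG = pop_epoch(EEG, {'stim'}, [-4 3.1]);                 % 4-s baseline, 3.1-s stimulus window
EEG = pop_rmbase(EEG, [-4000 0]);                         % subtract the baseline mean
```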

ICAs were then conducted using “runica” and the Infomax algorithm. To avoid rank-deficient data, each participant’s ICA yielded n components, where n is the number of electrodes minus the number of interpolated channels. An initial ICA was used to identify and reject noisy epochs using default EEGLAB epoch rejection parameters. For each condition averaged across participants, the percentage of noisy epochs rejected ranged from 0% to 8.16% (with an average of 4.46%). A second ICA captured stereotyped neuroelectric activity, including brain sources (Miyakoshi, 2020), and yielded the final ICA weight matrices used for dipole fitting. Dipoles were located according to the boundary element model using “DIPFIT,” and automated classification of independent components (ICs) was performed using “ICLabel” (Pion-Tonachini et al., 2019). Selection of ICs was then done using a custom-written EEGLAB script (Miyakoshi, 2020), which also rejected components with more than 15% residual variance; these represented an average of 18% of components per participant.
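
The sketch below outlines this ICA and dipole-fitting stage using the EEGLAB functions named above. The head-model file path, the epoch-rejection call, and the example channel count are assumptions, and the custom selection script (Miyakoshi, 2020) is not reproduced.

```matlab
% Sketch of the ICA and dipole-fitting stage (EEGLAB). The head-model path, the
% epoch-rejection call, and the example channel count are assumptions; the
% custom IC-selection script (Miyakoshi, 2020) is not reproduced here.
nInterpolated = 3;                                        % e.g., channels interpolated earlier
rankN = EEG.nbchan - nInterpolated;                       % avoid rank-deficient data

EEG = pop_runica(EEG, 'icatype', 'runica', 'pca', rankN); % initial Infomax ICA
EEG = pop_autorej(EEG, 'nogui', 'on');                    % reject noisy epochs (default parameters)
EEG = pop_runica(EEG, 'icatype', 'runica', 'pca', rankN); % second ICA on the cleaned data

EEG = pop_dipfit_settings(EEG, ...                        % boundary element head model
    'hdmfile', 'standard_BEM/standard_vol.mat', 'coordformat', 'MNI');
EEG = pop_multifit(EEG, 1:rankN, 'threshold', 100);       % fit one equivalent dipole per IC
EEG = iclabel(EEG);                                       % automated IC classification (ICLabel)

rv  = [EEG.dipfit.model.rv];                              % residual variance of each dipole
EEG = pop_subcomp(EEG, find(rv > 0.15), 0);               % drop ICs with >15% residual variance
```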

ICs were grouped into clusters using the k-means algorithm (MacQueen, 1967), with the number of clusters chosen to yield approximately one component per participant per cluster (EEGLAB Wiki: Chapter 05, 2014). Because we were only interested in changes in spectral power arising from the mu rhythm in hMNS areas, clusters were formed using event-related spectral perturbations (ERSPs) and dipole location as the determining factors. Components lying more than 3 standard deviations from the cluster mean were classified as outliers and separated out. The resulting data were fit into 18 source clusters (EEGLAB Wiki: Chapter 05, 2014). Data from all clusters within the hMNS were exported to Excel using a custom MATLAB script. For visualizations that considered the effect of time, the data were averaged across only the frequency band of interest (8–13 Hz). For analyses and visualizations that did not consider time, the data were also averaged across time (0–3.10 s) in addition to frequency.
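
Clustering of this kind is typically performed within an EEGLAB STUDY structure; a minimal sketch is given below. The measure weights, the number of PCA dimensions, and the cluster index in the final line are illustrative assumptions rather than the exact settings used.

```matlab
% Sketch of IC clustering across participants within an EEGLAB STUDY structure.
% The measure weights, PCA dimensions, and cluster index are illustrative assumptions.
[STUDY, ALLEEG] = std_preclust(STUDY, ALLEEG, 1, ...          % precompute clustering measures
    {'ersp',    'npca', 10, 'weight', 1}, ...                 % event-related spectral perturbations
    {'dipoles',             'weight', 10});                   % equivalent dipole locations

[STUDY] = pop_clust(STUDY, ALLEEG, 'algorithm', 'kmeans', ... % k-means clustering into 18 clusters,
    'clus_num', 18, 'outliers', 3);                           % with >3-SD components set aside

STUDY = std_erspplot(STUDY, ALLEEG, 'clusters', 3);           % inspect the ERSP of one cluster
```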

Statistical Analyses

All statistical analyses used R version 4.0.3 (R Core Team, 2020). Analysis of variance (ANOVA), conducted using the “ez” package, was used to assess the effects of emotion and modality (and their interaction) on hMNS activity, and the effect of modality on emotion classification accuracy and reaction time. Reaction time data were log-transformed because they were right-skewed (Whelan, 2008). Greenhouse-Geisser correction was applied to effects that failed Mauchly’s test of sphericity. In cases where main effects were significant, follow-up pairwise comparisons used Holm-Bonferroni correction (Holm, 1979). For all ANOVAs, the unit of analysis (and thus the source of error) was the component rather than the participant.

Correlational analyses were also used to assess the relationships between emotion classification accuracy, reaction time, and the hMNS response to emotion. For the latter, we isolated the hMNS response to emotion by subtracting mu-ERD in response to the neutral condition from mu-ERD in response to emotional conditions (happy and sad). For correlations, in cases where participants contributed more than one component to a cluster, only the component with the lowest residual variance was selected (Denis et al., 2017). Bivariate outliers were identified with a bagplot (Rousseeuw et al., 1999) using the “aplpack” package. A bagplot is a bivariate generalization of the one-dimensional boxplot, which draws a “bag” around 50% of all data points and draws a fence by inflating the bag by a factor of three. Data points outside of the fence were deemed bivariate outliers and excluded from the correlation.

Results

Neural Data

Of the 18 brain source clusters, we analyzed four in brain areas consistent with the hMNS: the left pre-SMA (Brodmann area [BA] 8), left posterior IFG (BA 44), and left and right PMC (BA 6). See Table 1 for all effects and interactions on mu-ERD in each cluster.

Table 1 Effects of emotion and modality on mu power in all clusters of interest

Left pre-SMA cluster

This cluster was located at Talairach coordinates (−3, 26, 38), which is within the boundaries of the left pre-SMA as described by Mayka et al. (2006). It was composed of 36 components contributed by 19 participants (Fig. 1). The timecourse of mu-ERD by emotion, averaged across modalities, shows that mu-ERD reached a peak approximately 0.70 s after stimulus onset (Fig. 2). At this point, mu-ERD became greater in the happy and sad conditions than in the neutral condition, with this difference persisting until the 2.00-s mark. Averaging across time revealed results consistent with these observations: there was a significant main effect of emotion on mu-ERD, which was significantly greater in the happy condition than in the neutral condition (p = 0.037) and marginally greater in the sad condition than in the neutral condition (p = 0.071; Fig. 3). There was no effect of modality and no interaction between emotion and modality on mu-ERD. There also was no effect of participant sex on mean mu-ERD, although the small number of male participants (only three males contributed components to this cluster) resulted in low power for this analysis.

Fig. 1

Components constituting the left pre-SMA cluster. This cluster is located at (TAL −3, 26, 38).

Fig. 2

Effects of time and emotion on mu power in the left pre-SMA. Mu-ERD appeared greater in the happy and sad conditions than in the neutral condition over the period of approximately 0.70–2.00 s after stimulus onset.

Fig. 3

Effect of emotion on mu power in the left pre-SMA. Mu-ERD was significantly greater in the happy condition, and marginally greater in the sad condition, than in the neutral condition. Lower mu power corresponds to greater mu-ERD. Error bars represent the standard error of the mean.

Left posterior IFG cluster

The left posterior IFG cluster was located at Talairach coordinates (−41, 16, 8), which is within the boundaries of the left posterior IFG as described by Hammers et al. (2006). It was composed of 26 components contributed by 18 participants. There was a significant main effect of modality, with greater mu-ERD in the audiovisual condition than in the audio-only condition (p < 0.0001), as well as greater mu-ERD in the visual-only condition than in the audio-only (p < 0.0001) and audiovisual conditions (p = 0.022; Fig. 4A). Nonetheless, mu-ERD was still significantly greater than zero in the audio-only condition (p < 0.0001). There was no effect of emotion and no interaction between emotion and modality.

Fig. 4

Effect of modality on mu power in all clusters responding to modality. A. Left posterior IFG. B. Left PMC. C. Right PMC. Across these three clusters, the most consistent trend is that conditions containing visual information (visual-only and audiovisual) always led to lower mu power (greater mu-ERD) than the audio-only condition. Error bars represent the standard error of the mean.

Bilateral PMC clusters

The left PMC cluster (TAL −35, −11, 37) was composed of 20 components contributed by 14 participants, and the right PMC cluster (TAL 34, 7, 46) was composed of 15 components contributed by 11 participants (Mayka et al., 2006). In both of these clusters, there was a significant main effect of modality on mu-ERD, although its pattern differed slightly between hemispheres (Fig. 4B, C). In the left PMC, mu-ERD was greater in the visual-only condition than in the audio-only condition (p = 0.0006) and greater in the audiovisual condition than in the audio-only condition (p = 0.009). In the right PMC, mu-ERD was greater in the visual-only condition than in the audio-only condition (p = 0.0007) and marginally greater in the visual-only condition than in the audiovisual condition (p = 0.051). In the left PMC, but not the right PMC, mu-ERD was significantly greater than zero in the audio-only condition (p < 0.0001). In both the left and right PMC, there was no effect of emotion and no interaction between emotion and modality.

Behavioural Data

There was a significant main effect of modality on emotion classification accuracy (F[2,46] = 24.9, p < 0.0001, ηG2 = 0.41; Fig. 5). Accuracy in the audio-only condition was significantly lower than in the visual-only (p = 0.0002) and audiovisual conditions (p < 0.0001), and accuracy in the visual-only condition was marginally lower than in the audiovisual condition (p = 0.064). Thus, recognizing emotions was most challenging in the audio-only condition, whereas in the other two modalities, accuracy was near ceiling and exhibited little variability across participants. There was no effect of modality on reaction time (F[2,46] = 0.42, p = 0.66, ηG2 = 0.004).

Fig. 5

Effect of modality on emotion classification accuracy. Note. Accuracy was higher in the visual-only and audiovisual conditions than in the audio-only condition, and it also was marginally higher in the audiovisual condition than in the visual-only condition. Error bars represent the standard error of the mean.

Correlations

After removing one bivariate outlier, there was a significant negative correlation between emotion classification accuracy and reaction time in the audio-only condition (r[21] = −0.48, p = 0.021). This correlation may suggest that some participants were better at classifying emotions than others, with those who were more accurate also responding faster. The same correlation was not significant in the visual-only condition (two outliers removed) or the audiovisual condition (no outliers removed), likely due to range restriction caused by a ceiling effect in the accuracy data.

We also analyzed the correlation between emotion classification accuracy and the mu-ERD response to emotion in the pre-SMA cluster. Our expectation was that participants with higher accuracy would have greater mu-ERD (i.e., lower mu power), which could help to explain why some participants are better at emotion identification than others. However, we found no evidence for such a relationship. In all three modalities, there was no significant correlation between accuracy and the pre-SMA response to happy (happy−neutral), sad (sad−neutral), or emotion overall (mean[happy, sad]−neutral). These correlations remained nonsignificant when considering emotion classification reaction time rather than accuracy. Across these correlations involving mu-ERD, the number of outliers removed ranged from zero to three.

Discussion

In the current study, 24 healthy adults listened to speech that was happy, sad, or neutral, as revealed nonverbally through facial and/or vocal expression rather than the semantic content of the speech (i.e., the words being used). At the same time, we measured participants’ brain activity with EEG and ultimately source-localized this activity using ICA. We predicted that the classical or extended hMNS would demonstrate more mu-ERD in response to emotional (happy and sad) speech than to neutral speech; that the hMNS response would be greater to visual-containing than audio-only stimuli; and that the hMNS response to emotion would be positively correlated with participants’ emotion classification accuracy.

We used ICA to identify four areas of the hMNS that were activated: the left pre-SMA, the left posterior IFG, and the right and left PMC. Interestingly, all of these areas were motor (frontal) components of the hMNS, and no somatosensory or affective sites were identified (e.g., IPL or insula). This pattern has been found in other studies of emotion perception (e.g., vocal; Warren et al., 2006) and may suggest that motor aspects of simulation are more important than somatosensory or affective aspects, at least under the conditions of the current study. All four of the identified hMNS areas have previously been implicated in the perception of speech, particularly under challenging conditions (Du et al., 2014; Meister et al., 2007; Scott et al., 2004). It has been proposed that when speech is perceived under optimal conditions, the motor system is not needed, because the ventral stream of speech processing is sufficient on its own (Peelle, 2018). The dorsal stream and its motor termini are only recruited when the perception of speech—particularly speech sounds—is challenging (Hickok & Poeppel, 2000, 2004, 2007, 2012). This includes any conditions under which speech sounds are less stereotyped than usual, such as when speech is distorted or when vocal emotions are present (Russo, 2020).

Of the four areas mentioned above, activation of only one of them—the left pre-SMA—differed by emotion. In particular, the left pre-SMA was more active during the perception of happy and sad speech than neutral speech. Thus, our prediction that the hMNS would demonstrate more activation to emotional speech was supported. The pre-SMA is a frontal component of the extended hMNS that lies at the intersection of prefrontal and motor cortices and is associated with higher-order aspects of complex motor control (Bonini, 2017; Lima et al., 2016; Pineda, 2008; Rizzolatti & Luppino, 2001). While the IFG and SMA-proper are likely responsible for the sequencing and execution of specific actions, the pre-SMA may be involved in the preparation and selection of appropriate actions as well as the inhibition of inappropriate actions (Cunnington et al., 2005; Lima et al., 2016; Pineda, 2008; Rochas et al., 2012), including speech production (Alario et al., 2006). The pre-SMA also contributes to the perception of emotion (Zhang et al., 2019), and it may play a similar, regulatory role in that context, including by inhibiting inappropriate emotional responses (Etkin et al., 2015).

The current study is one of many to implicate the pre-SMA in emotion perception. For instance, Van der Gaag et al. (2007) found greater pre-SMA activation when viewing emotional faces—both positively and negatively valenced—than neutral faces, as did Carr et al. (2003) and Seitz et al. (2008). In the context of vocal emotion, Warren et al. (2006) found that pre-SMA activation was greater for emotions of higher intensity (e.g., triumph vs. amusement). More recently, Aziz-Zadeh et al. (2010) also reported pre-SMA activation in response to emotional prosody, whereas McGettigan et al. (2013) reported pre-SMA activation in response to both spontaneous and voluntary social laughter. Furthermore, Kreifelts et al. (2013) found that four weeks of nonverbal emotion communication training produced changes in a network that included the bilateral pre-SMA. The observed timecourse of pre-SMA activation in the current study also was consistent with this prior research, with peak activation reached approximately 0.70 s after stimulus onset (Rochas et al., 2012; Seitz et al., 2008).

The evidence that the pre-SMA responded to emotion was also stronger for happy speech than for sad speech, because mu-ERD to sad speech was only marginally greater than mu-ERD to neutral speech. Other studies have previously found a stronger association between the pre-SMA and happiness than between the pre-SMA and other emotions. For instance, studies of patients with epilepsy have demonstrated that electrical stimulation of the left pre-SMA produces laughter and merriment (Fried et al., 1998; Krolak-Salmon et al., 2006) and that the pre-SMA is selectively activated during the observation and production of happy facial expressions rather than a range of other emotions or neutral expressions (Krolak-Salmon et al., 2006). More recently, in a healthy sample, TMS over the left pre-SMA was found to impair the recognition of happiness, but not of fear or anger (Rochas et al., 2012). One possible explanation for the greater responsivity of the pre-SMA to expressions of happiness is that head movements tend to be greater for happy than for sad expressions (Livingstone & Palmer, 2016). Similarly, the frequency range and intensity of vocal emotions tend to be larger for happy expressions (Livingstone & Russo, 2018; Livingstone et al., 2013). Thus, while both happy and sad speech are more likely than neutral speech to recruit the pre-SMA, happy speech may elicit a stronger response.

Why might the pre-SMA be recruited during emotion perception? A review by Lima et al. (2016) argues that it is an active contributor to emotion recognition rather than a mere by-product of action observation, as supported by several studies finding greater pre-SMA activation associated with more accurate emotional judgements (Bestelmeyer et al., 2014; McGettigan et al., 2013). The authors further proposed that the pre-SMA may generate sensory expectations about incoming stimuli, which in turn inform our perceptions and increase their accuracy under challenging conditions. This is consistent with two observations about the pre-SMA: it is more active when we have prior experience with the stimuli (e.g., biological vs. nonbiological action, or familiar vs. random music; Peretz et al., 2009) and when the stimuli are challenging to perceive accurately (e.g., speech emotion). For instance, Shahin et al. (2009) found that the pre-SMA is activated when gaps in speech—a stimulus that we have extensive experience producing—are “filled in” and perceived as continuous by the listener. This explanation by Lima et al. is consistent with the hypothesis that sensorimotor simulation in the hMNS provides a form of predictive coding. This account suggests that a more abstract, semantic pathway (which includes the MTG) makes predictions about the intentions of an observed action, and that the somatosensory and motor consequences of that action are then processed more concretely (i.e., simulated) in the parietal and frontal nodes (Kilner, 2011; Kilner et al., 2007). These consequences are then compared to the action being observed, and our perception and interpretation of that action are updated accordingly.

In addition to emotion, we also assessed whether activation of hMNS areas differed by modality. In the pre-SMA, modality did not affect activation, nor did the effect of emotion differ by modality. Thus, the pre-SMA may simulate speech emotion in a similar manner whether it is seen on the face or heard in the voice. In contrast, the left posterior IFG and bilateral PMC exhibited greater activation in response to stimuli that included visual information (visual-only and audiovisual) than those that did not (audio-only), although the left posterior IFG and left PMC nonetheless responded to audio-only stimuli. This suggests that while audio-only biological stimuli (i.e., speech sounds) may be simulated in the hMNS (Jenson et al., 2014; Wilson et al., 2004), visual biological stimuli (i.e., facial speech movements) are the dominant form of input. These results are generally consistent with our prediction and also agree with Crawcour et al. (2009), who found greater mu-ERD over motor areas in response to facial speech movements than to speech sounds. Visual information also appears to drive hMNS simulation of other human actions, such as a pair of hands ripping a piece of paper (Copelli et al., in press).

Emotion classification accuracy was lowest in the audio-only condition (82%) and much higher in the visual-only (93%) and audiovisual (96%) conditions. This suggests that visual information not only drives the hMNS but also is most important for the recognition of speech emotions, at least for the stimuli used in the current study. Previous research also has concluded that facial emotions are more easily recognized than vocal emotions (Elfenbein & Ambady, 2002; Scherer, 2003; see also Livingstone & Russo, 2018, for the RAVDESS validation). Accuracy in the audiovisual condition was marginally higher than in the visual-only condition, although it seems likely that the true advantage of the audiovisual condition was constrained by a ceiling effect. If audiovisual speech stimuli are indeed more easily recognized than visual-only speech stimuli, this would suggest that speech sounds enhance our ability to recognize emotions, despite their smaller role relative to facial speech movements. Because the pre-SMA response did not differ between the audiovisual and visual-only conditions, this result also serves as a reminder that the hMNS is not the only mechanism involved in emotion recognition (Yu & Chou, 2018).

We found that, across participants, accuracy was negatively correlated with reaction time in the audio-only condition. This suggests that some participants were better at recognizing emotions than others, with these participants showing both faster reaction times and greater accuracy. (If, instead, these variables were positively correlated, it would suggest a speed-accuracy trade-off.) As a result, we investigated whether accuracy also was correlated with the left pre-SMA response to emotion and thus whether this brain area actively supports emotion recognition. However, there was no evidence of a correlation between accuracy (or reaction time) and the pre-SMA response to emotion in any modality. This contrasts with previous studies that did find this relationship, including Bestelmeyer et al. (2014) and McGettigan et al. (2013; see also Lima et al., 2016; Rochas et al., 2012). Thus, although the pre-SMA appears to be recruited during the recognition of speech emotion in audio and visual modalities, we did not find any strong evidence that it plays an active role in this process. This null result may instead reflect the emotion classification task used: the facial and vocal cues were not obscured in any way, and there were only three dissimilar emotions to choose from (happy, sad, and neutral). A more challenging task may well have produced greater variability in accuracy, as well as greater pre-SMA activation that correlated with accuracy to compensate for the added difficulty.

The current study had a number of strengths. First, unlike most EEG studies of the hMNS, we used source-localization to ensure that differences in mu-ERD (particularly in response to emotion) were driven by frontal sensorimotor activity (i.e., in the pre-SMA) rather than attention-related alpha (Klimesch, 2012; Hobson & Bishop, 2017). Indeed, we found that in two occipital clusters (left cuneus and right middle occipital gyrus) as well as one temporal cluster involved in face perception (left fusiform gyrus), mu-ERD did not differ by emotion. Source-localization also allowed us to ensure that our left pre-SMA cluster did not fall within the frontal eye fields (Vernet et al., 2014), in which case eye movement may have confounded our measurement of mu-ERD. We also selected the task and stimuli carefully. Having participants classify emotions not only allowed us to measure their emotion classification accuracy, but also ensured that they stayed engaged with the task. In addition, to avoid semantic confounds, the stimuli were emotional only in their expression and not in their linguistic content (Warren et al., 2006); and to avoid low-level acoustic confounds, all stimuli across the three emotions were normalized to the same average amplitude (McGarry et al., 2015).

This study also had some limitations. One confound that we were unable to control for was the magnitude of facial motion in the visual-only and audiovisual conditions. To investigate this, we conducted an optical flow analysis (Horn & Schunck, 1981) of the first 3.10 s of the visual-only stimuli—the section during which mu-ERD was analyzed—using FlowAnalyzer (Barbosa et al., 2008). This revealed that there was more motion in the happy stimuli (M = 0.59, SD = 0.14) than the sad stimuli (M = 0.38, SD = 0.072), and more in the sad stimuli than the neutral stimuli (M = 0.26, SD = 0.054). Thus, one might argue that the effect of emotion on mu-ERD in the left pre-SMA was driven by differences in the magnitude of motion (see Copelli et al., in press). However, at least two facts cast doubt on this explanation of our left pre-SMA findings. First, there was no interaction between emotion and modality on mu-ERD in this cluster, which suggests that the effect of emotion was similar in the audio-only condition (in which no facial motion was present) and in the two visual-containing conditions. Second, we found greater mu-ERD in response to happy and sad stimuli than to neutral stimuli, whereas for visual motion, happy had by far the most motion, followed by sad and then neutral. If visual motion entirely accounted for the differences in mu-ERD by emotion, the mu-ERD differences would have mirrored the visual motion differences. The limited spatial resolution of EEG is also a limitation. Although ICA techniques can achieve a spatial resolution of 1 cm (Makeig et al., 2004a, b), this is inferior to many other neuroimaging methods, including fMRI.
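
For reference, the sketch below shows an analogous motion-magnitude computation using MATLAB's implementation of the Horn-Schunck (1981) algorithm (Computer Vision Toolbox). The analysis reported above was performed with FlowAnalyzer (Barbosa et al., 2008), and the file name used here is hypothetical.

```matlab
% Analogous motion-magnitude computation using MATLAB's Horn-Schunck optical
% flow estimator (Computer Vision Toolbox). The reported analysis used
% FlowAnalyzer; the file name below is hypothetical.
vr      = VideoReader('happy_stimulus.mp4');              % one visual-only stimulus
hs      = opticalFlowHS;                                  % Horn-Schunck (1981) estimator
maxTime = 3.10;                                           % analyze the first 3.10 s only
frameMagnitudes = [];

while hasFrame(vr) && vr.CurrentTime <= maxTime
    frame = rgb2gray(readFrame(vr));                      % grayscale frame
    flow  = estimateFlow(hs, frame);                      % flow relative to the previous frame
    frameMagnitudes(end + 1) = mean(flow.Magnitude(:));   % mean motion for this frame
end
meanMotion = mean(frameMagnitudes(2:end));                % skip the first (reference) frame
```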

In sum, the current study was the first to consider whether the hMNS is involved in the recognition of emotion expressed nonverbally through speech. We found that the left pre-SMA, a frontal component of the extended hMNS, was more active in response to happy and sad stimuli than to neutral stimuli. This was true whether emotions were perceived on the face, in speech sounds, or both. Activity levels in other areas of the hMNS (the left posterior IFG and bilateral PMC) did not differ by emotion, although they were driven primarily by visual speech information. As proposed by others (Lima et al., 2016), the pre-SMA may actively support speech emotion recognition by using our extensive experience expressing emotion to generate sensory predictions that in turn guide our perception. However, the lack of a correlation between emotion classification accuracy (or reaction time) and the pre-SMA response to emotion leaves this possibility in need of more evidence. More research is needed to pinpoint the precise role of the pre-SMA in speech perception, particularly when speech is emotional.