Automatic imitation of speech is enhanced for non-native sounds

Wilt, Hannah; Wu, Yuchunzi; Evans, Bronwen G.; Adank, Patti

doi:10.3758/s13423-023-02394-z

Brief Report
Open access
Published: 17 October 2023

Volume 31, pages 1114–1130, (2024)
Psychonomic Bulletin & Review

Abstract

Simulation accounts of speech perception posit that speech is covertly imitated to support perception in a top-down manner. Behaviourally, covert imitation is measured through the stimulus-response compatibility (SRC) task. In each trial of a speech SRC task, participants produce a target speech sound whilst perceiving a speech distractor that either matches the target (compatible condition) or does not (incompatible condition). The degree to which the distractor is covertly imitated is captured by the automatic imitation effect, computed as the difference in response times (RTs) between compatible and incompatible trials. Simulation accounts disagree on whether covert imitation is enhanced when speech perception is challenging or instead when the speech signal is most familiar to the speaker. To test these accounts, we conducted three experiments in which participants completed SRC tasks with native and non-native sounds. Experiment 1 uncovered larger automatic imitation effects in an SRC task with non-native sounds than with native sounds. Experiment 2 replicated the finding online, demonstrating its robustness and the applicability of speech SRC tasks online. Experiment 3 intermixed native and non-native sounds within a single SRC task to disentangle effects of perceiving non-native sounds from confounding effects of producing non-native speech actions. This last experiment confirmed that automatic imitation is enhanced for non-native speech distractors, supporting a compensatory function of covert imitation in speech perception. The experiment also uncovered a separate effect of producing non-native speech actions on enhancing automatic imitation effects.

Introduction

Action observation engages neural mechanisms of action execution (Buccino et al., 2001; Fadiga et al., 1995; Nishitani & Hari, 2002). For vocal actions, the engagement of speech-production mechanisms in speech perception has been demonstrated using functional magnetic resonance imaging (fMRI) (Park, 2020; Pulvermüller et al., 2006; Wilson et al., 2004), transcranial magnetic stimulation (TMS) (Fadiga et al., 2002; Murakami et al., 2011; Watkins et al., 2003), and electroencephalography (EEG) (Michaelis et al., 2021; Oliveira et al., 2021; Pastore et al., 2022). Simulation accounts of speech perception (Pickering & Garrod, 2013; Wilson & Knoblich, 2005) propose that speech actions are automatically and covertly imitated by listeners. This covert imitative process informs forward models of the perceived speech, conducting real-time simulations to generate top-down predictions of the speech signal to support perception.

Evidence for a causal role of covert imitation in speech perception comes from experiments using TMS to temporarily disrupt speech motor areas. D’Ausilio et al. (2009) found that inhibitory stimulation of the lip area of the primary motor cortex (M1) specifically hindered discrimination of lip-articulated contrasts, while stimulation of tongue M1 obstructed discrimination of tongue sounds. Möttönen and Watkins (2009) showed that inhibitory TMS to lip M1 disrupted participants’ phonemic categorisation of lip-articulated speech sounds. The articulator-specific disruption of phonetic perception through inhibitory stimulation of motor areas supports a role for covert imitation in speech perception.

Behaviourally, covert imitation is measured through stimulus-response compatibility (SRC) paradigms. In manual SRC tasks (e.g., Brass et al., 2000), participants perform an action prompted by a visual cue (e.g., index-finger movement prompted by a “1”) while a distractor is presented. The distractor is compatible (e.g., video clip of the same index-finger movement) or incompatible with the target response (e.g., video clip of a middle-finger movement). Slower response times (RTs) for incompatible target-distractor pairs compared to compatible pairs are thought to reflect the automatic activation of motor processes elicited by the distractor, facilitating responses for compatible trials and inhibiting responses for incompatible trials (Heyes, 2011). The automatic imitation effect, computed as the difference in RTs between incompatible and compatible trials, indexes covert imitation of the distractor stimulus. In speech SRC tasks, participants produce speech sounds in response to prompts superimposed over a distractor (e.g., a video of a speaker saying [ba]). Using auditory-only, visual-only or audiovisual distractors, speech SRC tasks have demonstrated significant automatic imitation effects for consonants (Galantucci et al., 2009; Ghaffarvand Mokari et al., 2020; Jarick & Jones, 2009; Kerzel & Bekkering, 2000; Roon & Gafos, 2015; Trotter et al., 2023; Wilt et al., 2022; Wu et al., 2019) and vowels (Adank et al., 2018; Ghaffarvand Mokari et al., 2020, 2021).
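As a minimal sketch of how the automatic imitation effect defined above can be computed, the Python snippet below takes the mean RT of incompatible trials and subtracts the mean RT of compatible trials. The function name, the `(condition, rt_ms)` trial format and the RT values are assumptions made for illustration; they are not the study's data or analysis code.

```python
from statistics import mean


def automatic_imitation_effect(trials):
    """Return mean incompatible RT minus mean compatible RT, in ms.

    `trials` is a list of (condition, rt_ms) tuples; this format is a
    hypothetical one chosen for the example.
    """
    compatible = [rt for cond, rt in trials if cond == "compatible"]
    incompatible = [rt for cond, rt in trials if cond == "incompatible"]
    return mean(incompatible) - mean(compatible)


# Invented RTs for illustration only (not data from the study).
example_trials = [
    ("compatible", 650), ("compatible", 670), ("compatible", 690),
    ("incompatible", 700), ("incompatible", 720), ("incompatible", 740),
]

effect = automatic_imitation_effect(example_trials)
print(effect)
```

A positive value means that responses were slower when the distractor mismatched the prompted response, which is the pattern interpreted as covert imitation of the distractor.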

Motor activation during speech perception has been demonstrated extensively for sounds in the perceiver’s native repertoire, yet the involvement of covert imitation in processing unfamiliar speech sounds is less well established. Simulation theories disagree on the conditions under which covert imitation occurs, leading to distinct predictions on the involvement of covert imitation in non-native speech perception. Wilson and Knoblich (2005) propose that imitative motor activation serves as a compensatory mechanism when speech perception is challenging, as is the case when processing non-native speech sounds (Adank et al., 2009; Floccia et al., 2009; van Wijngaarden, 2001). Hence, this account predicts that perceiving non-native sounds elicits more covert imitation than native sounds. Alternatively, Pickering and Garrod’s integrated theory of language production and comprehension (Pickering & Garrod, 2013) posits that speech perception preferably relies on covert imitation when the signal is familiar to the listener, utilising the “simulation route” for action perception. When the speech is unfamiliar, speech perception relies more on auditory mechanisms (the “association route”). Under this account, covert imitation is therefore expected to be enhanced when listening to native speech sounds compared to non-native speech sounds (Pickering & Gambi, 2018). Predictions of the integrated theory of language are consistent with theories of action perception claiming that action-perception associations are learned through sensorimotor experience, for example, the Theory of Event Coding (Hommel, 2009, 2019) and Associative Sequence Learning (Heyes, 2005, 2011).

Wilson and Knoblich’s proposal of a compensatory role of covert imitation in speech perception is supported by evidence of enhanced motor activity during the perception of motor-distorted and noise-distorted speech compared to clear speech (Alain et al., 2018; Du et al., 2016; Nuttall et al., 2016, 2017). In a transcranial direct current stimulation (tDCS) study (Sehm et al., 2013), facilitatory stimulation of the left inferior frontal gyrus (IFG) enhanced perceptual learning of degraded speech with low intelligibility, suggesting that speech production areas support perception under challenging listening conditions. Enhanced motor recruitment during non-native versus native speech processing has been reported in several fMRI studies (Callan et al., 2003, 2004, 2014; Golestani, 2016; Wilson & Iacoboni, 2006) and TMS experiments (Schmitz et al., 2019), though the opposite effect has also been observed for visual-only speech videos (Swaminathan et al., 2013). Further, infant studies have highlighted a role for production processes in perceiving novel speech sounds. An MEG study by Kuhl et al. (2014) found that while 7-month-old infants displayed comparable activation of auditory and motor cortices when listening to native and non-native speech, by 11–12 months activation was greater in motor regions for non-native speech. Bruderer et al. (2015) demonstrated that pre-verbal infants’ auditory discrimination of the Hindi [d̪]–[ɖ] contrast was hindered by teethers restraining tongue movements, but not by teethers that did not restrict tongue mobility. Together, these studies suggest that speech-production mechanisms may be preferentially activated for unfamiliar sounds.

In contrast, Pickering and Garrod’s proposition of enhanced covert imitation during perception of familiar speech actions aligns with the literature on the covert imitation of manual and bodily actions. Neuroimaging studies have reported enhanced motor activation with increasing familiarity with perceived movements (Calvo-Merino et al., 2005; Haslinger et al., 2005; Margulis et al., 2009), though the opposite effect has also been reported (Liew et al., 2011). Behaviourally, in manual SRC tasks, automatic imitation effects increase following mirror training (e.g., participants close their hand when seeing a video of a hand closing) and disappear following counter-mirror training (e.g., participants open their hand when seeing a video of a hand closing) (Cook et al., 2010; Gillmeister et al., 2008; Heyes et al., 2005). In a similar study using speech stimuli (Wu et al., 2019), automatic imitation significantly increased following mirror training (participants produced the same syllable as that presented in audiovisual stimuli) and decreased non-significantly following counter-mirror training (participants produced the alternative syllable to that presented in audiovisual stimuli). Taken together, this line of evidence suggests that covert imitation is enhanced by familiarity and experience.

We aimed to test predictions from Wilson and Knoblich and from Pickering and Garrod in three speech SRC experiments. These experiments aimed to establish whether automatic imitation effects evoked by unfamiliar, non-native speech sounds are greater (as predicted by Wilson and Knoblich) or smaller (as predicted by Pickering and Garrod) than automatic imitation elicited by familiar, native speech sounds. In Experiment 1, participants completed an SRC task with native sounds and an SRC task with non-native sounds. Experiment 2 replicated Experiment 1 online, to strengthen our findings through replication (Schmidt, 2009) and to validate that speech SRC tasks can successfully be conducted online (Wilt et al., 2022). In Experiment 3, participants produced and perceived native and non-native sounds within a single SRC task, allowing us to disentangle effects of perceiving non-native distractors from potential effects of producing unfamiliar speech actions on automatic imitation.

Experiment 1

Methods

Participants

Sixty-five participants were recruited. All self-reported being native British-English speakers with normal hearing, normal or corrected-to-normal vision and no speech disorders or neurological disorders. Participants received £20 compensation or course credit for this experiment, which constituted the pre-training session of a two-part study cut short by COVID-19. Sixteen participants were excluded: seven participants did not follow task instructions; one did not complete the full two-part study; one spoke Welsh and hence was familiar with the non-native sound [ɮɑ]; five participants had error rates (ERs) of > 50% in one or more of the SRC tasks; one had an overall ER more than three standard deviations (SDs) from the group mean; one was excluded due to a software error. The final sample comprised 49 participants (33 female, M_age = 23.41 years). The full list of languages spoken by the participants is available in Online Supplemental Material (OSM) Appendix A.

Stimuli

Videos showed a phonetically trained female native British-English speaker from the neckline upward over a blue background. The videos were filmed using a Canon Legria video camera and edited in iMovie and MATLAB. Each video lasted 2,400 ms, beginning and ending with the speaker in resting configuration. The auditory stimuli consisted of productions of [bɑ] (voiced bilabial plosive), [lɑ] (voiced alveolar lateral approximant), [ʙɑ] (voiced bilabial trill) and [ɮɑ] (voiced alveolar lateral fricative) by the same female speaker, recorded using a RØDE NT1-A Condenser Microphone and root-mean-square normalised on Praat (Boersma & Weenink, 2018). The non-native sounds [ʙɑ] and [ɮɑ] were selected as these were both visually and auditorily distinct from one another and from any British English sounds, and hence recognisable to British English speakers without perceptual training. Video and auditory stimuli were aligned on Presentation to create the distractor videos. Key articulatory event timings are displayed in OSM Appendix B. Response prompts comprised the symbols £, %, &, # in white Helvetica font size 36 pt on a black background, superimposed over the distractor videos using Presentation. These appeared over the speaker’s lips at one of three stimulus onset asynchronies (SOAs): 600 ms, 800 ms or 1,000 ms post articulation onset. The utilisation of multiple SOAs is standard practice in SRC studies to examine the time course of effects. Distractor videos were preceded by a 1,100-ms black screen, followed by a 500-Hz tone for 200 ms after which the screen remained black for an additional jittered duration of 250, 375, 500, 625 or 750 ms (Fig. 1).

Production instruction videos were recorded for each of the four speech sounds. The same female speaker was presented from the neckline upward in front of the blue background. In each video, the speaker first produced the speech sound, followed by an oral description of how to produce the sound, and finally two more productions of the sound. For the [bɑ] sound, instructions were “To produce this sound, bring your lips together to block airflow, let the air out in one burst, and say /a/”. For the [lɑ] sound, instructions were “To produce this sound, move your lips apart slightly, place the tip of your tongue behind your upper teeth to block airflow, release the air slowly, letting it pass by the sides of your tongue, and say /a/”. For the [ʙɑ] sound, instructions were “To produce this sound, bring your lips together to block airflow, release the air slowly, letting it pass between your lips, as if to blow raspberries, and say /a/”. For the [ɮɑ] sound, instructions were “To produce this sound, move your lips apart slightly, place the tip of your tongue behind your upper teeth to block airflow, raise the sides of your tongue, release the air slowly, causing turbulence and letting it pass over the sides of your tongue, and say /a/”.

Procedure

The experiment was conducted in a soundproofed, light-controlled booth. Participants wore a Beyerdynamic DT 297 PV MK II headset as they completed two SRC tasks in Presentation on a Dell PC. In the native SRC task, responses and distractors were [bɑ] and [lɑ]. In the non-native SRC task, responses and distractors were [ʙɑ] and [ɮɑ]. Before each SRC task, speech production instruction videos were displayed on the screen for the two relevant sounds. Participants could play the videos as many times as they wanted and were asked to produce each sound at least five times and/or until the researcher was satisfied with their production. Next, participants learned the prompt-response pairings for each speech sound in the task. Symbols were displayed on the screen above videos of the speaker producing the associated sounds. Twenty-four possible prompt-response pairings were created, to which participants were randomly assigned.
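The count of 24 pairings matches the number of ways to map four symbols onto four sounds (4! = 24). The Python sketch below enumerates them under the assumption that each of the four symbols was paired with exactly one of the four speech sounds; the variable names and the random seed are illustrative and do not come from the study materials.

```python
from itertools import permutations
import random

symbols = ["£", "%", "&", "#"]
sounds = ["bɑ", "lɑ", "ʙɑ", "ɮɑ"]

# Every one-to-one mapping from the four symbols to the four sounds.
pairings = [dict(zip(symbols, order)) for order in permutations(sounds)]
n_pairings = len(pairings)
print(n_pairings)

# Randomly assign one pairing to a participant (arbitrary seed, assumption).
rng = random.Random(1)
assigned = rng.choice(pairings)
print(assigned)
```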

For the SRC tasks, participants were instructed to produce the sound prompted by the symbol cue as quickly as possible and to ignore the distractor video. For each task (native and non-native), participants first completed 20 randomly selected practice trials, followed by six blocks of 30 trials each (180 trials total per task). The order in which the native and non-native SRC tasks were performed was randomised and counterbalanced across participants. Altogether, the testing session lasted approximately 50 min.

Data processing and analysis

Participants’ vocal responses were recorded using a Beyerdynamic DT 297 PV MK II headset microphone. Recordings started at video onset for 3,000 ms. Response annotations and RT measurements were manually determined on Praat. Errors were defined as productions of the wrong or of multiple responses, missing answers or anticipatory responses with RTs < 200 ms. For the non-native sounds, productions were considered erroneous if they could not be clearly auditorily identified as attempts to produce either [ʙɑ] or [ɮɑ], and if the spectrogram did not show clear turbulence (for [ʙɑ] and [ɮɑ]) and/or at least one vocal tract resonance portion (for [ʙɑ] only) (see Kavitskaya et al., 2009).

For the 49 participants, 18,000 observations were collected. Erroneous trials were removed from the analyses (914, 5.08%): 829 productions of the wrong or of multiple prompts; 41 missing answers; and 44 anticipatory responses. Error rates (ERs) averaged 3.96% (SD = 5.02%) in the native task and 6.20% (SD = 7.11%) in the non-native task. A further 1,380 trials were excluded in which RTs surpassed three median absolute deviations (MADs) from a participant’s mean RT for each experimental condition. The remaining 15,706 trials were included in the analyses.
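The RT trimming rule described above can be sketched in Python as follows. This is an illustration under stated assumptions: the MAD is left unscaled (no 1.4826 consistency constant), the rule is applied to a single participant-by-condition cell, and the RT values are invented. The article does not specify these implementation details.

```python
from statistics import mean, median


def mad(values):
    """Median absolute deviation around the median (unscaled; assumption)."""
    med = median(values)
    return median([abs(v - med) for v in values])


def exclude_outliers(rts, n_mads=3.0):
    """Keep RTs within n_mads MADs of the cell's mean RT."""
    centre = mean(rts)
    limit = n_mads * mad(rts)
    return [rt for rt in rts if abs(rt - centre) <= limit]


# Invented RTs for one participant in one condition (ms).
rts = [600, 620, 640, 660, 680, 700, 720, 900]
kept = exclude_outliers(rts)
print(kept)
```

In this example only the 900-ms trial falls outside the three-MAD band around the mean and is dropped.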

Raw RTs for correct trials were analysed with generalised linear mixed-effects models using the lme4 package in R (Bates et al., 2014). Fixed factors were Nativeness (native vs. non-native), Compatibility (compatible vs. incompatible), SOA (SOA1 (600 ms), SOA2 (800 ms), SOA3 (1,000 ms)) and their interactions. Nativeness was coded as -0.5 and 0.5 for the native and non-native conditions, respectively, and Compatibility was coded as -0.5 and 0.5 for compatible and incompatible trials, respectively. This coding scheme is considered preferable to treatment coding in modelling interactions (Singmann & Kellen, 2019). As we were interested in the successive effects of SOA, backward difference coding was used for this factor, allowing for sequential comparisons between each level and its immediate preceding level (i.e., SOA2 vs. SOA1, SOA3 vs. SOA2).
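To illustrate what backward difference coding does for a three-level factor, the Python sketch below uses the standard contrast matrix for this scheme. With this matrix the two SOA coefficients equal the successive differences between level means, and the intercept equals the mean of the level means. The means used are the descriptive SOA means reported in the Results of Experiment 1; they serve only to demonstrate the coding and are not the model estimates in Table 1. All names are illustrative.

```python
from fractions import Fraction as F

# Standard backward difference contrasts for three ordered levels.
# Column 1 compares SOA2 with SOA1; column 2 compares SOA3 with SOA2.
SOA_CONTRASTS = {
    "SOA1": (F(-2, 3), F(-1, 3)),
    "SOA2": (F(1, 3), F(-1, 3)),
    "SOA3": (F(1, 3), F(2, 3)),
}

# Descriptive mean RTs (ms) per SOA, used purely as an illustration.
cell_means = {"SOA1": 745, "SOA2": 698, "SOA3": 646}

intercept = F(sum(cell_means.values()), 3)                    # mean of level means
b_soa2_vs_soa1 = cell_means["SOA2"] - cell_means["SOA1"]      # -47 ms
b_soa3_vs_soa2 = cell_means["SOA3"] - cell_means["SOA2"]      # -52 ms


def predicted(level):
    """Rebuild a level mean from the intercept and the two contrasts."""
    c1, c2 = SOA_CONTRASTS[level]
    return intercept + c1 * b_soa2_vs_soa1 + c2 * b_soa3_vs_soa2


for level, observed in cell_means.items():
    print(level, predicted(level), observed)
```

The same logic applies to the -0.5/0.5 coding of Nativeness and Compatibility: the coefficient is the difference between the two levels, and the intercept sits midway between them.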

We assumed a gamma distribution and identity link function following Lo and Andrews (2015). This type of link function is considered preferable to transformation for RT data (Balota et al., 2013; Lo & Andrews, 2015; Schramm & Rouder, 2019) and allowed us to avoid potential issues reported with log-transforming and subsequently back transforming RT data (Feng et al., 2013; Lo & Andrews, 2015; Manandhar & Nandram, 2021; Molina & Martín, 2018). Following Barr et al. (2013), the maximal random effect structure to converge and pass singularity checks was used. This included by-participant random intercepts and slopes for Nativeness. Backward selection was then used to identify the model that best fit the dataset. Starting with higher order interactions, predictors were removed systematically and chi-squared tests performed using anova(). Fixed factors were removed from the final model if they did not significantly benefit model fit (p > .05) and were not included in any higher order interactions. At each step, the factor for which there was least evidence of inclusion (i.e., the highest p-value in the chi-squared test) was removed first and the remaining factors reassessed. We stopped when there were no more fixed factors to remove, i.e., when all remaining factors either significantly improved model fit or were included in significant higher-order interactions.
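One way to write down the model family described above is sketched below. The notation is introduced here for illustration and is not taken from the article: i indexes trials, j indexes participants, x_ij collects the contrast-coded fixed effects (Nativeness, Compatibility, SOA and their interactions), and the mean-and-shape parametrisation of the gamma distribution is an assumption.

```latex
\begin{aligned}
\mathrm{RT}_{ij} &\sim \mathrm{Gamma}(\mu_{ij}, \nu), \\
\mu_{ij} &= \mathbf{x}_{ij}^{\top}\boldsymbol{\beta} + u_{0j} + u_{1j}\,\mathrm{Nativeness}_{ij}, \\
(u_{0j}, u_{1j})^{\top} &\sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma}).
\end{aligned}
```

Because the link is the identity, the linear predictor models the mean RT directly, so fixed-effect coefficients can be read in milliseconds without back-transformation.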

Results

In all, 15,706 trials were analysed. Mean RTs for each experimental condition are displayed in Fig. 2 and OSM Appendix C. All main effects and interactions were included in the final model (Table 1), as the three-way interaction Compatibility x Nativeness x SOA significantly improved model fit (χ²(2) = 11.855, p = .003).
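The reported p-value can be checked by hand, because the upper-tail probability of a chi-squared variable with 2 degrees of freedom has the closed form exp(-x/2). The Python sketch below applies this to the reported statistic; the function name is illustrative.

```python
import math


def chi2_sf_df2(x):
    """Upper-tail probability of a chi-squared variable with 2 df."""
    return math.exp(-x / 2.0)


p_value = chi2_sf_df2(11.855)
print(round(p_value, 3))
```

The result rounds to .003, in agreement with the value reported for the three-way interaction.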

Table 1 Final model of raw reaction times (RTs) in milliseconds (ms) using a gamma distribution and identity link function for Experiment 1

There was a significant main effect of Compatibility, with slower RTs in incompatible (M = 721 ms, SD = 170 ms) than compatible trials (M = 672 ms, SD = 181 ms). The overall automatic imitation effect averaged 49 ms (SD = 63 ms), computed from aggregated RTs per participant and experimental condition. The main effect of Nativeness was significant with slower RTs in the non-native (M = 733 ms, SD = 177 ms) than the native SRC task (M = 660 ms, SD = 167 ms). The significant main effect of SOA demonstrated that RTs decreased from SOA1 (M = 745 ms, SD = 196 ms) to SOA2 (M = 698 ms, SD = 176 ms) to SOA3 (M = 646 ms, SD = 141 ms). The interaction Nativeness x Compatibility was significant, with larger automatic imitation effects for non-native (M = 63 ms, SD = 72 ms) than for native sounds (M = 36 ms, SD = 49 ms). This interaction was modulated by SOA, as the difference in automatic imitation effects between native and non-native tasks was smaller at SOA1 (6 ms) than SOA2 (45 ms) and SOA3 (33 ms).

Experiment 1 uncovered enhanced automatic imitation for non-native sounds, in line with Wilson and Knoblich’s account of a compensatory role of covert imitation in speech perception.