Introduction

Developing adequate second language (L2) speech has increasingly become an essential goal in today’s globalized world. Although there is a consensus that the provision of corrective feedback (CF) facilitates L2 speech learning (Zhang et al., 2021), the existing literature has demonstrated that not all L2 learners respond to these techniques in the same way: some reach an advanced level while others show remarkable difficulties. The effectiveness of CF techniques in L2 speech learning appears to be subject to differences in general cognitive abilities, including working memory, processing speed, attention, and aptitude (Li and Zhao, 2021; Yilmaz and Granena, 2021). Among these factors, auditory processing, defined as one’s capacity to encode and proceduralize the spectral and temporal characteristics of sounds, has recently been shown to help determine individual differences in L2 speech learning outcomes (Saito and Tierney, 2022). Building on the previous literature, the current study aims to examine (a) how the provision of recasts can facilitate the learning of the English vowels /i/-/ɪ/ by Chinese native speakers; and (b) how such learning gains from recasts can be tied to individual differences in auditory processing. Taken together, the study takes an initial step toward disentangling the complex relationship between auditory processing and the effectiveness of recasts in L2 vowel learning.

In the following sections, we first review recasts in instructed L2 speech learning, individual differences and the effectiveness of recasts in L2 speech learning, and auditory processing in L2 speech learning. Next, we elaborate on the research method, including participants, target stimuli, experimental design, outcome measures, and scoring. Then, we synthesize the data from our empirical study to elucidate the relationship between auditory processing and the effectiveness of recasts in L2 vowel learning. Subsequently, we discuss the findings and compare our results with the previous studies reviewed in the background literature section. Finally, we suggest several possible directions for future investigation.

Background literature

Recasts in instructed L2 speech learning

Over the past 50 years, scholars have explored how instruction can facilitate L2 speech learning most efficiently and effectively (Derwing and Munro, 2022). A substantial body of research has shown clear benefits of integrating instruction into L2 speech learning, including auditory exposure (Neufeld, 1978), auditory discrimination training (Rosenman, 1987), High Variability Phonetic Training (Lively et al., 1994; Logan et al., 1991), explicit articulatory instruction (Castino, 1996), awareness training (Pennington and Ellis, 2000), and combinations of methods (Couper, 2006; Derwing and Rossiter, 2003; Lord, 2010; Wiener et al., 2020). Whereas these language-focused, decontextualized, and explicit forms of instruction have shown potential gains and dominated classroom instruction worldwide, the maintenance and generalizability of such instructional gains to authentic contexts remain to be determined (Trofimovich and Gatbonton, 2006). Scholars have agreed on the importance of shifting learners’ attention to language form, particularly when learners are engaged in a communicative language classroom where conveying meaning is the priority (Ellis, 2016). Theoretically, maintaining learners’ attention simultaneously on meaning and form is considered to help L2 learners proceduralize their declarative knowledge in the long run (DeKeyser, 2020) and, in turn, to assist them in transferring what they have learned from classroom instruction to future communicative contexts (Spada and Tomita, 2010).

Recasts, one of the most commonly occurring focus-on-form techniques in natural communicative settings (Lyster and Ranta, 1997), have been found to positively affect L2 phonological learning (Saito, 2021). According to Lyster and Ranta’s (1997) classification of CF types, recasts are defined as “the teacher’s reformulation of all or part of a student’s utterance minus the error” (p. 46). A series of descriptive studies across English as a second language (ESL) and English as a foreign language (EFL) contexts have found that learners tend to generate more accurate perceptions and modified output in response to recasts on phonological errors (Ellis et al., 2001; Lyster, 1998; Mackey et al., 2000; Sheen, 2006). This finding suggests that recasts may be relatively salient to L2 learners and thus facilitative of their phonological development. More recently, the acquisitional value of recasts has been investigated in empirical studies with pre- and post-test designs. Overall, the results have shown that recasts significantly impact various dimensions of L2 speech learning (Lee and Lyster, 2016 for perception; Gooch et al., 2016; Saito, 2013 for controlled and spontaneous production). Although the existing literature has generally shown the facilitative role of recasts in L2 speech learning, the effectiveness of such techniques is subject to considerable individual variability.

Individual differences and the effectiveness of recasts in L2 speech learning

As previously mentioned, the effectiveness of recasts can vary significantly depending on certain independent variables, including learner readiness, explicit phonetic knowledge, and the type of target structure (Saito, 2021). Among them, increasing attention has been given to cognitive individual differences. Research in this area has shown that achievement in L2 speech learning is influenced not only by external factors (e.g., the quantity and quality of target-language practice) but also by internal factors (e.g., learners’ cognitive predispositions for L2 speech learning). For example, phonemic coding abilities play a role in segmental accuracy in naturalistic (Granena and Long, 2013) and classroom (Saito, 2017) contexts. There is evidence that musical aptitude impacts L2 suprasegmental learning (Li and DeKeyser, 2017; Qin et al., 2021). Furthermore, L2 learners with higher motivation have been shown to attain more successful L2 speech learning (Nagle, 2018).

These studies point to a tentative conclusion that specific cognitive individual differences may mediate the effectiveness of recasts. So far, this conclusion has been confirmed by lexicogrammar-focused recast studies examining the influential roles of attention control (Isaacs and Trofimovich, 2011), explicit aptitude (Yilmaz et al., 2016), language analytic ability (Li, 2013), working memory (Goo, 2012), implicit sequential learning ability (Granena and Yilmaz, 2019), and cognitive styles (Guo and Yang, 2018). For future studies, it would be intriguing to examine how L2 learners with different cognitive profiles benefit differentially from recasts in L2 speech learning.

Auditory processing in L2 speech learning

More recently, researchers have begun to propose auditory processing as a potential source of perceptual-cognitive individual differences that modulates language learning across the lifespan (Mueller et al., 2012). Auditory processing is the ability to encode and proceduralize the spectral and temporal characteristics of sounds (Tierney and Kraus, 2014). This domain-general ability has been described as an anchor at various levels of language learning, such as phonemic, phonological, and prosodic categorization (Werker, 2018); word, phrase, and sentence segmentation (Cutler and Butterfield, 1992); the processing of suffixes, inflections, and articles (Joanisse and Seidenberg, 1998); and word order (Penner et al., 2001).

From a broad perspective, auditory processing comprises two specific constructs: perceptual acuity and audio-motor integration. Perceptual acuity is the ability to encode the spectral and temporal details of sounds. This ability can be assessed using adaptive discrimination tests, in which participants discriminate sounds on the basis of a series of auditory cues, including pitch, duration, formant, and amplitude. Audio-motor integration refers to the ability to proceduralize spectral and temporal patterns. This ability can be measured using melodic and rhythmic sequences. The specific constructs of auditory processing and the corresponding measurements operationalized in the current study are summarized in Table 1 (Saito et al., 2021).

Table 1 Constructs of auditory processing model and corresponding measurement.

In first language (L1) research, the auditory processing deficit hypothesis (Goswami, 2015) states that individual differences in auditory processing may affect the outcome of L1 acquisition. Any impairment in auditory processing can slow down or hinder the learning of speech, morphology, syntax, and vocabulary, ultimately leading to language problems. Based on this hypothesis, scholars have suggested that auditory processing could serve as a diagnostic tool for specific language impairments, such as dyslexia (Hornickel and Kraus, 2013) and autism spectrum disorders (Russo et al., 2008).

Extending the L1 acquisition literature, scholars have begun to explore whether and how the auditory precision hypothesis generalizes to L2 speech learning in adulthood (i.e., the Auditory Precision Hypothesis-L2; Mueller et al., 2012). A moderate-to-strong connection has been found between auditory processing and L2 speech learning. Both cross-sectional and longitudinal research (Kachlicka et al., 2019; Sun et al., 2021) have suggested that the outcomes of L2 speech learning are affected by auditory processing factors as much as by background factors. In addition, L2 learners with more precise auditory processing probably take greater advantage of input opportunities, making for more significant long-term gains (Saito and Tierney, 2022). Although substantial evidence implicates a significant association between auditory processing and L2 speech learning in adulthood, these initial findings raise several issues that need further exploration to provide a more fine-grained picture of the mechanisms underlying individual differences in auditory processing for instructed L2 speech learning. First, in the abovementioned studies, L2 speech proficiency was assessed at either the perception or the production level. However, adult L2 speech proficiency is well acknowledged to be a multi-dimensional phenomenon, and such proficiency needs to be comprehensively measured from both perception and production perspectives via various task modalities (Nagle, 2021). Second, when L2 learners receive different types of auditory input, the sustainability of the link between auditory processing and their L2 speech learning should be measured via immediate and delayed posttests (Sun et al., 2021). Third, to confirm the robustness of the findings, it is necessary to include novel items to test generalization (Saito, 2013). Lastly, to our knowledge, Chandrasekaran et al. (2010) and Shao et al. (2023) are the only two studies that have examined the role of auditory processing in L2 speech instruction. Although auditory processing was found to mediate the influence of L2 speech instruction, the mode of instruction in these two studies was still language-oriented. No study has yet explored how auditory processing facilitates L2 speech learning when learners receive more communicatively authentic, meaning-oriented instruction. In light of these considerations, the current study addresses these gaps through the following research questions:

1. How can the provision of recasts facilitate the learning of the English vowels /i/-/ɪ/ by Chinese native speakers?

2. How can such learning gains from recasts be tied to individual differences in auditory processing?

As for RQ1, following Lee and Lyster (2016), we predicted that recasts would significantly facilitate the learning of the L2 vowels /i/-/ɪ/ regardless of time and lexical items. As for RQ2, following the aptitude-treatment interaction identified by Shao et al. (2023), we hypothesized that learners with more precise perceptual acuity would encode the spectral and temporal information in recasts more accurately, which could, in turn, help them reduce perceptual confusion. In contrast, learners with more robust audio-motor integration might rapidly capture broad acoustic information in recasts, leading to better motor action and subsequent self-modified output.

Methods

Participants

We initially recruited two intact parallel classes at Hainan University in China, comprising 68 freshman students majoring in computer science and technology. The participants were required to fill out a background questionnaire. Data from 8 participants were excluded due to exposure to English in childhood (n = 5), professional music experience (n = 2), or failure to complete the training (n = 1), leaving data from 60 participants for statistical analysis. Participants were screened with pure-tone audiometry at octave frequencies between 250 and 8000 Hz at 20 dB, and all reported no hearing impairment. We randomly assigned the remaining 60 participants to an experimental group (15 females and 15 males; mean age = 18.7 years, range: 18–20) and a control group (15 females and 15 males; mean age = 18.4 years, range: 18–20).

All L2 learners self-reported as native speakers of Mandarin Chinese who began receiving English instruction in middle school. In addition, they were born and raised in Haikou, Hainan Province, and had never lived in any other country. They were enrolled in a required English course that met twice a week for 90 min. The course targeted language skills in daily conversation and practice, including speaking, listening, reading, and writing; related cultural backgrounds, events, and group activities were also introduced as an indispensable part of the teaching. We used the standardized International English Language Testing System (IELTS) to control for heterogeneity in English proficiency. The IELTS scores showed no significant differences between the two groups in overall proficiency (M = 5.5, SD = 0.7), speaking skills (M = 5.3, SD = 0.9), listening skills (M = 5.5, SD = 1.2), reading skills (M = 5.7, SD = 1.4), or writing skills (M = 5.1, SD = 0.6). The study protocol was approved by the Institutional Review Board at Hainan University, and participants were compensated with $10 each.

Instructor

The instructor, who taught the participants’ English oral classes, took part in the study. She was a female native speaker of American English and an experienced English teacher in China (more than 5 years of teaching experience) with an MA degree in education. Her teaching approach emphasized developing communicative skills in English. In her oral English classes, she got along well with the participants, so they were neither nervous nor shy in front of her and were willing to cooperate with her in the treatment sessions.

English native judges

The listeners were 6 native speakers of English (3 males and 3 females) with a mean age of 38.2 years (SD = 0.6). All passed a pure-tone screening test at octave frequencies between 250 and 4000 Hz. All the listeners were born and raised in Ottawa, Canada, and were language teachers at Carleton University. They had minimal exposure to Chinese learners of English and no explicit knowledge of Chinese-accented English before the experiment. All reported normal hearing and speech ability, and all were right-handed. Two of them (1 male and 1 female) participated in the stimulus preparation.

Target stimuli

The target stimuli of the current study were the English vowels /i/-/ɪ/. They were selected because of the well-known difficulty this contrast presents for Chinese learners.

From a theoretical standpoint, major L2 speech theories maintain that the learning of the English vowels /i/-/ɪ/ by Chinese native speakers is one of the most difficult specific instances of L2 speech learning. First, in accordance with the Perceptual Assimilation Model (PAM) (Best et al., 1995) and PAM-L2 (Best and Tyler, 2007), Chinese learners’ perception of English /i/-/ɪ/ should fall into either the “single category” (two L2 sounds are perceived as one L1 sound) or the “category-goodness” (one of the two L2 sounds is more similar to the L1 sound) pattern, indicating that Chinese learners would face perceptual difficulties with these sounds. Second, the Speech Learning Model (SLM) (Flege, 1995) predicts that learning will be much more difficult for L2 sounds that are similar to, rather than new relative to, L1 sounds. Under the framework of the SLM, English /i/ and /ɪ/ are both very similar to Chinese /i/ in the vowel inventory (see Figs. 1 and 2), and thus Chinese learners of English are known to have great trouble learning these sounds even if they have lived in Anglophone countries for many years. Finally, the Natural Referent Vowel (NRV) framework (Polka and Bohn, 2011) argues that detecting a change from a vowel category that is acoustically and articulatorily more peripheral in the vowel space is more difficult than detecting a change from a less peripheral one. This asymmetry in the discrimination of the nonnative English vowels /i/-/ɪ/ persists into adulthood for Chinese L2 learners.

Fig. 1: English monophthongs.
figure 1

In terms of articulatory place and manner, the English /ɪ/ is a high front unrounded vowel.

Fig. 2: Mandarin monophthongs (adapted from Handbook of the IPA, 1999, p. 42).
figure 2

In terms of articulatory place and manner, the Chinese /i/ is a high front unrounded vowel, similar to the English /ɪ/.

From an acoustic point of view, the English vowels /i/-/ɪ/ have traditionally been described in terms of different patterns of cue reliance (Escudero and Boersma, 2004). It is generally acknowledged that, all else being equal, there are at least two cues distinguishing English /i/-/ɪ/: spectral properties and duration. In terms of spectral properties, /i/ has a lower F1 (approximately 342–437 Hz) than /ɪ/ (approximately 427–483 Hz); as to duration, /i/ (approximately 243–306 ms) is longer than /ɪ/ (192–237 ms) (Hillenbrand et al., 1995). Accordingly, Chinese learners of English are susceptible to relying on durational information (Bohn, 2017; Wong, 2013). Regardless of the spectral information, if the duration of a vowel is long, they tend to perceive it as /i/; if it is short, they perceive it as /ɪ/. This overuse of durational cues may delay their learning of the nonnative phonemic contrast.
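To make the two cue-weighting strategies concrete, the following Python sketch (ours, not part of the study) contrasts a duration-based and a spectral (F1-based) decision rule, using the approximate ranges from Hillenbrand et al. (1995) cited above; the boundary values are illustrative midpoints between the reported ranges, not empirically fitted category boundaries.

```python
def classify_by_duration(duration_ms, boundary_ms=240):
    """Duration-reliant listener: a long vowel is heard as /i/, a short one as /ɪ/."""
    return "i" if duration_ms >= boundary_ms else "ɪ"

def classify_by_spectrum(f1_hz, boundary_hz=432):
    """Spectral-reliant listener: a lower F1 is heard as /i/, a higher F1 as /ɪ/."""
    return "i" if f1_hz <= boundary_hz else "ɪ"

# A shortened /i/ token (e.g., in fast speech): spectrally /i/-like (low F1)
# but temporally /ɪ/-like (short duration).
token = {"f1_hz": 360, "duration_ms": 180}

print(classify_by_spectrum(token["f1_hz"]))     # -> "i"
print(classify_by_duration(token["duration_ms"]))  # -> "ɪ": the duration-only rule fails
```

The mismatch on the last line illustrates why overreliance on duration can produce perceptual confusion whenever speaking rate compresses or stretches vowel duration while the spectral cue stays stable.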

Design

To collect data from the same sample of participants at multiple time points, the current study adopted a longitudinal pretest-posttest design. Data were collected over 10 sessions in 7 weeks. A control group was included to rule out test-retest effects arising from the use of the same materials in the pre- and posttests. All participants completed the background questionnaire and the auditory processing tests in Week 1 (on the last working day). In many CF experiments, a standard approach is to conduct 10 treatment sessions, which provides sufficient exposure to the intervention while minimizing the risk of participant fatigue or dropout (Li, 2010). Following this line of thought, in Weeks 2 and 3, participants took the pretests, and the experimental group participated in 10 sessions of recasts within 10 consecutive days, whereas the control group participated in the same sessions without receiving recasts. Immediate posttests followed the final treatment session. As Li (2010) pointed out, a delayed posttest should be conducted after at least 4 weeks have elapsed to allow any changes resulting from the CF treatment to stabilize. Therefore, 4 weeks later, all participants completed the delayed posttests in Week 7 (see Fig. 3).

Fig. 3: Summary of research design.
figure 3

The study followed a longitudinal pre- and posttest design. The auditory processing tests were conducted prior to the pretests, and the perception, controlled production, and spontaneous production tests were conducted on three occasions.

Auditory processing tests

Two specific constructs of auditory processing were tested via the online behavioral experiment builder GORILLA (Anwyl-Irvine et al., 2020): perceptual acuity and audio-motor integration.

Perceptual acuity

Following the design of a previous study (Kachlicka et al., 2019), perceptual acuity was assessed with four subtests designed to measure the ability to encode the spectral and temporal details of sounds: pitch, formant, duration, and amplitude rise time. In each subtest, a total of 100 target stimuli varying along the continuum of the target acoustic feature were created via customized MATLAB scripts.

Based on an adaptive three-alternative forced-choice procedure (Levitt, 1971), three complex tone stimuli were presented with an inter-stimulus interval of 0.5 s. Participants were asked to identify which sound was different by pressing the number “1” or “3” on a keyboard. The tests began at level 50 (i.e., the target stimulus was 50 steps away from the baseline stimulus), and the difficulty level changed according to the participants’ performance. After three consecutive incorrect answers, the difference between stimuli was expanded by 10 steps in the subsequent trial, making the discrimination easier; after three consecutive correct answers, the difference was reduced by 10 steps, making the task more difficult. When an increase in the acoustic difference was followed by a decrease, or vice versa, a reversal was recorded. Each time the direction changed, the step size became smaller: first 5 steps, then 2, and finally 1, where it remained until the end of the test (e.g., 50 → 40 → 30 → 35 → 35 → 33 → 33 → 34 → 34 → 33 → 33 → 32 → 32, etc.). The test stopped after 70 trials or 8 reversals, and the discrimination threshold score was calculated by averaging the stimulus levels after the third reversal.
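The staircase logic above can be sketched as follows. This is our illustrative Python simulation, not the original GORILLA/MATLAB implementation; the simulated listener (`p_correct_at`) is a hypothetical stand-in for a real participant, and averaging the reversal levels after the third reversal is our reading of the threshold rule.

```python
import random

STEPS = [10, 5, 2, 1]  # step size shrinks after each direction reversal

def run_staircase(p_correct_at, start=50, max_trials=70, max_reversals=8):
    """Simulate the adaptive discrimination staircase described above.

    p_correct_at(level) -> probability of a correct response at that level.
    """
    level = start
    step_idx = 0
    direction = 0            # -1 = getting harder, +1 = getting easier, 0 = not yet moved
    run_correct = run_wrong = 0
    reversals = []           # stimulus levels at which the direction reversed

    for _ in range(max_trials):
        correct = random.random() < p_correct_at(level)
        if correct:
            run_correct += 1
            run_wrong = 0
        else:
            run_wrong += 1
            run_correct = 0

        move = 0
        if run_correct == 3:
            move, run_correct = -1, 0    # three correct -> smaller difference (harder)
        elif run_wrong == 3:
            move, run_wrong = +1, 0      # three wrong -> larger difference (easier)

        if move:
            if direction and move != direction:
                reversals.append(level)  # direction changed: record a reversal
                if step_idx < len(STEPS) - 1:
                    step_idx += 1        # shrink the step size
                if len(reversals) >= max_reversals:
                    break
            direction = move
            level = max(1, level + move * STEPS[step_idx])

    # Threshold: mean of the reversal levels after the third reversal.
    tail = reversals[3:] if len(reversals) > 3 else reversals
    return sum(tail) / len(tail) if tail else level
```

Lower thresholds from this procedure correspond to finer discrimination, matching the scoring direction stated above.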

The sound stimuli, available on Saito’s team website (www.sla-speech-tools.com), were used in the current study. All baseline sounds were complex tones composed of four harmonics, with a duration of 500 ms, a fundamental frequency (F0) of 330 Hz, and 15-ms linear ramps at onset and offset. The target stimuli changed in steps of 0.3 Hz in F0 (330.3–360 Hz), 2.5 ms in duration (252.5–500 ms), and 2.85 ms in amplitude rise time (178–300 ms). For formant discrimination, the stimuli were complex tones with a fundamental frequency of 100 Hz and harmonics up to 3000 Hz. Three formants were applied to these tones using a parallel formant filter bank (Smith, 2007). The first formant (F1) was held constant at 500 Hz and the third formant (F3) at 2500 Hz. The baseline second formant (F2) was 1500 Hz, and the target F2 ranged from 1502 to 1700 Hz in steps of 2 Hz. Lower threshold scores indicate more precise perceptual acuity.
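As an illustration of the baseline stimulus specification (four harmonics of F0 = 330 Hz, 500 ms, 15-ms linear ramps), here is a minimal pure-Python synthesis sketch; the sample rate and the equal harmonic amplitudes are our assumptions, as these parameters are not reported in the text.

```python
import math

def complex_tone(f0=330.0, n_harmonics=4, dur_s=0.5, ramp_s=0.015, sr=44100):
    """Synthesize a complex tone of n_harmonics harmonics of f0, dur_s long,
    with linear onset/offset amplitude ramps of ramp_s.
    Returns a list of float samples in [-1, 1]."""
    n = int(dur_s * sr)
    ramp_n = int(ramp_s * sr)
    samples = []
    for i in range(n):
        t = i / sr
        # Sum of equal-amplitude harmonics, normalized by their count.
        s = sum(math.sin(2 * math.pi * f0 * h * t) for h in range(1, n_harmonics + 1))
        s /= n_harmonics
        # Linear amplitude ramp at onset and offset.
        if i < ramp_n:
            s *= i / ramp_n
        elif i >= n - ramp_n:
            s *= (n - 1 - i) / ramp_n
        samples.append(s)
    return samples
```

A duration-continuum stimulus, for instance, would simply pass a different `dur_s` (e.g., `complex_tone(dur_s=0.2525)` for the 252.5-ms endpoint).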

Audio-motor integration

Following the design of Tierney et al. (2017), audio-motor integration was assessed by measuring how accurately participants reproduced melodies and rhythms.

Following the melodic patterns used in Povel and Essens (1985), 10 melodies provided a manageable number of stimuli for systematically testing the participants’ melody reproduction, and this number was therefore used in the current test. Melodies were constructed from a scale of five notes with fundamental frequencies of 220, 246.9, 277.2, 311.1, and 329.6 Hz. The duration of each note was 300 ms, with 50-ms cosine ramps at the beginning and end of the note. The first note of a melody was always the third pitch of the scale, and each following note was adjacent to the previous note on the scale, either higher or lower. This process was repeated until all seven notes were selected. The melody could not fall below 220 Hz or rise above 329.6 Hz; once it reached these limits, the following note was either closer to the register center or the same as the previous one. Each melody was repeated three times, with an interval of 1 s. After each melody was played, five boxes numbered 5-4-3-2-1 were shown in a line from top to bottom. Participants were asked to reproduce the melody by clicking one box at a time (starting from Box 3). When a box was clicked, the corresponding note was played. Before the test, participants could listen to an example and practice clicking the boxes to become familiar with the notes. The notes selected by each participant were compared with the notes in the target melody, and a percentage score was calculated to determine response accuracy.
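The melody construction and scoring can be sketched as below. This is our reading of the procedure: the behavior at the register limits (repeat or move back toward the center) and the exact-match percentage scoring are assumptions where the text allows interpretation.

```python
import random

SCALE_HZ = [220.0, 246.9, 277.2, 311.1, 329.6]  # the five-note scale used in the test

def generate_melody(length=7):
    """Build a 7-note melody: start on the third scale note, then move to an
    adjacent scale note (up or down); at the register limits, repeat the
    note or move back toward the center."""
    idx = 2                                  # third pitch of the scale
    melody = [SCALE_HZ[idx]]
    while len(melody) < length:
        if idx == 0:
            step = random.choice([0, 1])     # lower limit: repeat or move up
        elif idx == len(SCALE_HZ) - 1:
            step = random.choice([-1, 0])    # upper limit: repeat or move down
        else:
            step = random.choice([-1, 1])    # otherwise: adjacent note, up or down
        idx += step
        melody.append(SCALE_HZ[idx])
    return melody

def melody_score(target, response):
    """Percentage of positions where the reproduced note matches the target."""
    hits = sum(t == r for t, r in zip(target, response))
    return 100.0 * hits / len(target)
```

For example, a participant who matches 5 of 7 target notes would score 100 × 5/7 ≈ 71.4.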

Similarly, 10 rhythmic patterns provided a manageable number of stimuli for systematically testing the participants’ rhythm reproduction (Povel and Essens, 1985) and were therefore used in the current test. Each rhythmic pattern consisted of 16 200-ms segments, nine of which contained a drum hit and the remainder a rest. Each rhythm was presented three times, with a 600-ms interval. A 150-ms conga drum strike, downloaded from freesound.org, was used for the drum hits. Participants were instructed to reproduce the beat by pressing the spacebar after hearing the stimuli. The time of each press was recorded and compared with the drum hits in the target stimulus. Practice trials were included to ensure that participants understood the procedure and were able to perform the task. After the test, the intervals between responses were first quantized to the nearest multiple of 200 ms. The content of each segment in the participant’s rhythm (i.e., the presence of a hit or a rest) was then compared with the matching segment in the target rhythm, and a percentage score was calculated to determine response accuracy. The audio-motor integration score was obtained by averaging the scores of the melodic and rhythmic reproduction tests; higher scores indicate better audio-motor integration.
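The rhythm scoring can be sketched as follows; this is our interpretation of the quantization step, in which each spacebar press is snapped to the nearest 200-ms segment and the resulting hit/rest pattern is compared with the target, segment by segment.

```python
def rhythm_score(target_hits, press_times_ms, segment_ms=200, n_segments=16):
    """Quantize response times to the 200-ms grid and score the percentage
    of segments whose content (hit vs. rest) matches the target rhythm.

    target_hits: list of n_segments booleans (True = drum hit in that segment).
    press_times_ms: spacebar press times relative to the start of the rhythm.
    """
    response = [False] * n_segments
    for t in press_times_ms:
        seg = round(t / segment_ms)          # snap to the nearest multiple of 200 ms
        if 0 <= seg < n_segments:
            response[seg] = True
    matches = sum(a == b for a, b in zip(target_hits, response))
    return 100.0 * matches / n_segments
```

For example, with a 4-segment target of hit-rest-hit-rest, presses at 5 ms and 395 ms both snap to the correct segments (indices 0 and 2), yielding a perfect score.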

Treatment sessions

The participants received 10 treatment sessions, each lasting 90 min (the duration of two 45-min classes in Chinese English classrooms), in which they noticed and practiced the target stimuli in the context of recasts. The tasks were designed based on Ellis’ (2003) definition of tasks: they contained a gap that required learners to focus primarily on meaning and use their own linguistic resources to achieve a clear outcome. That is, they aimed to encourage the use of the target stimuli by providing participants with specific linguistic cues. A picture description task is often used to assess an individual’s ability to communicate visual information effectively: it requires the individual to analyze and interpret visual information and then use language to describe it accurately, and it can be instrumental in assessing L2 speech proficiency (Derwing and Munro, 2015). A debating task, on the other hand, is often used to assess an individual’s ability to present arguments, express opinions, and engage in critical thinking, and it can be instrumental in assessing communication skills (Saito, 2013). The picture description and debating tasks therefore served different purposes and were used together in the current study.

Picture description task

The participants were required to describe a series of pictures, each depicting the routines of a group of people. The instructor explained that participants were to describe each character’s routines in turn, which created obligatory contexts for using the target sounds. Each participant was asked to start his or her sentence with “Every week…”. For example, one of the pictures depicted a scene designed to elicit a sentence such as “Every Monday, Jessica sits on a sofa, and Howard cleans the seats of his car.” The instructor provided immediate recasts in response to each participant’s untargetlike sounds.

The instructor was required to respond to the participants’ nontarget sounds through a partial (one-word) recast with falling intonation and pause for a short time to wait for the participants’ self-modified output. If the participants could not provide self-modified output, the instructor would continue the conversation without pushing them to produce output. The following is an example of the provision of recasts.

Participant: Every Monday, …Howard cleans the sits of his car.

Instructor: Seats. (recast)

Participant: Umm, seats. (self-modified output)

Debating task

The participants were induced to use a set of words containing /i/-/ɪ/ in various phonetic contexts. For example, they were asked to deliver a public speech in front of their classmates to support or oppose given topics (e.g., “Is it cheap to eat fried chips outside?”) with reasonable control of eye contact and a clear voice. Given that recasts can be delivered without stopping the communicative flow, the instructor did her best to provide recasts as naturally as possible, while encouraging learners to treat meaningful communication as their first priority and accuracy as their second.

Outcome measures

There was a pretest, an immediate posttest, and a delayed posttest, each consisting of a perception test, a controlled production test, and a spontaneous production test to assess the impact of recasts on the corresponding domains (perception, controlled production, spontaneous production) of English /i/-/ɪ/ learning by Chinese native speakers. The effectiveness of recasts in English /i/-/ɪ/ learning was also assessed across two training contexts (trained and untrained items) to determine whether the participants could generalize their improved knowledge of the target sounds to novel lexical contexts. The tests were carried out in the following order to minimize the influence of focusing on form: (1) the spontaneous production test, (2) the controlled production test, and (3) the perception test.

Perception test

A forced-choice identification task was used to measure participants’ perceptual accuracy for the English vowels /i/-/ɪ/. Following Lee and Lyster’s (2017) perception test, a relatively large number of minimal pairs was used to obtain a more accurate measure of participants’ ability to distinguish between /i/ and /ɪ/. Ultimately, the choice of 44 minimally paired words depended on the specific target stimuli, the research questions, the characteristics of the participant population, and the available resources and constraints. In the perception test, participants listened to a total of 44 minimally paired words (e.g., “peak” and “pick”) and identified each exemplar as containing the English tense vowel /i/ or lax vowel /ɪ/. The test consisted of 22 sets of /i/-/ɪ/ minimal pairs (Nos. 1–14 served as trained items and Nos. 15–22 as untrained items) together with 22 sets of distractor pairs (50% changed the onset and 50% changed the nucleus, e.g., “gun” and “sun,” “would” and “could”) (see Table 2). The minimally paired target words were consonant-vowel-consonant singletons with various initial consonants and finals to avoid bias from the consonantal context. According to the College English Syllabus (latest edition, Ministry of Education, 2018), all of the words belong to the basic high-frequency vocabulary except “heed” and “chink.”

Table 2 Target items in outcome measures.

One male and one female native speaker of English (selected from the 6 native listeners) were asked to produce each stimulus several times in the carrier phrase “I said …” (44,100-Hz sampling rate, 16-bit resolution). For each speaker, the token of each target stimulus judged correct and most natural by a third native speaker was selected. The final stimuli were acoustically analyzed to confirm that their F1, F2, and duration values were within the normal range (Hillenbrand et al., 1995; Yang, 1996).

Participants were tested individually in a quiet test space. Stimuli were presented on a Dell XPS14-L421X laptop running E-Prime Professional 2, via a Dell UC 350 headset at 70 dB SPL. The interval between the two stimuli of each trial was held constant at 500 ms, and trials were self-paced. Before the identification test, participants completed practice trials to familiarize themselves with the procedure; none of them failed to meet the criterion. The immediate and delayed posttests used the same tokens but in a different order.

Controlled production test

A carrier-sentence reading task was used to measure participants' controlled production of the target items without much communicative pressure (Saito, 2013). Following Lee and Lyster (2017), the number of words in the controlled production test was kept small enough that the task remained manageable and participants could focus on the specific sounds being tested. Therefore, 7 sets of minimally paired words, with the same number of distractors, were selected: 4 sets served as trained items and 3 sets as untrained items (see Table 2). Each participant was asked to read the testing words aloud, one after another, in the carrier sentence "I said __" at a normal speed. If a participant produced a word more than once, only the first production was included in the data analysis.

Spontaneous production test

A picture narrative task was used to measure participants' spontaneous production of the target items in a more naturalistic, communicative setting (Ellis, 2015) (see Table 2). It was operationalized in an instructor-participant dyadic setting as follows:

  (a) the participant was given 20 s to memorize a written list of 4 critical words on a sheet of paper, associated with the 4 pictures they were to describe;

  (b) when the participant indicated that he/she had memorized the 4 critical words, the instructor took the list away and distributed the 4 pictures;

  (c) the participant was required to narrate the 4 pictures one after another with the words they had just memorized, without any planning time;

  (d) once the narration was completed, the instructor provided another 4 critical words for another set of 4 pictures.

In total, there were 16 pictures and 16 words in each version: 8 target words, creating contexts for the production of the English vowels /i/-/ɪ/, and 8 distractor words. Among the 8 target words, /i/ and /ɪ/ were evenly distributed; 4 were used as trained items and 4 as untrained items. If a participant produced a word more than once, only the first production was included in the data analysis.

All speech samples in both the controlled and spontaneous production tests were digitally recorded with a DELL UC 350 headset microphone placed at 30 degrees off-axis and about 3 cm from the participant's mouth in a sound-treated room. A DELL XPS14-L421X laptop was connected at the other end to monitor the continuous speech and record a mono track at a 44.1 kHz sampling rate and 16-bit resolution. As in the perception tests, there were also two practice trials.

Scoring method

For the perception test, the 30 Chinese learners completed a total of 7920 trials: 3960 target trials (30 participants × 44 tokens [14 × 2 trained tokens + 8 × 2 untrained tokens] × 3 sessions [pretest, immediate posttest, and delayed posttest]) and 3960 distractor trials (30 participants × 44 tokens × 3 sessions). In the perception pretest and the subsequent posttests, there were 44 target tokens, with possible scores ranging from 0 to 44: participants received 1 point for each target token perceived accurately and 0 points for each target token misperceived.
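The trial counts and the per-session scoring rule above can be sketched as follows (a minimal illustration of the arithmetic, with hypothetical response labels "i"/"I"):

```python
# Bookkeeping sketch for the perception test: 30 participants x 44 tokens x
# 3 sessions for targets and distractors alike, and a session score equal to
# the number of target tokens identified correctly (1 per correct, 0 per
# error, so scores range from 0 to 44).

PARTICIPANTS, SESSIONS = 30, 3
TARGET_TOKENS = 14 * 2 + 8 * 2      # trained + untrained minimal-pair words
DISTRACTOR_TOKENS = 22 * 2          # 22 distractor pairs

def total_trials(tokens):
    """Total trials across all participants and test sessions."""
    return PARTICIPANTS * tokens * SESSIONS

def perception_score(responses, answers):
    """responses/answers: lists of 'i' or 'I', one entry per target token."""
    return sum(r == a for r, a in zip(responses, answers))
```

With these constants, the target trials total 3960 and the grand total is 7920, matching the figures reported above.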

For the production tests, the 30 Chinese learners completed a total of 3960 trials: 1980 target trials (30 participants × 22 tokens [6 × 2 trained tokens + 5 × 2 untrained tokens] × 3 sessions [pretest, immediate posttest, and delayed posttest]) and 1980 distractor trials (30 participants × 22 tokens × 3 sessions). A Praat script (Boersma and Weenink, 2019) was used to segment single tokens from the continuous speech and save each as a separate .wav file at a sampling rate of 44.1 kHz and a resolution of 16 bits. For target tokens embedded in continuous speech streams, the transcriber listened to the speech sample repeatedly, placed the cursor at the beginning of the word (where any component of the target token could be heard), and moved toward its offset in 5-ms steps. Tokens with inflected endings or reduced/extra syllables (e.g., picked, picking) were not removed from the dataset, to avoid significant distortion of the segmented tokens. All tokens were randomly subdivided into 5 blocks (792 tokens per block), and the six native speakers judged each block.

The scoring sessions took place individually with each monolingual native English-speaking listener in a quiet room. The first author sat next to the native listeners during the sessions so that she could answer any questions that arose during the scoring process. Following Flege (1995) and Lee and Lyster (2017), all tokens were presented in a randomized order using Experiment MFC 7 in Praat (Boersma and Weenink, 2019), binaurally via a DELL UC 350 headset connected to a DELL XPS14-L421X laptop. The native listeners were instructed to identify a given token and then to rate its goodness. For example, when a token intended as the word "seat" was presented, the native listeners chose one of three options on the computer screen (seat, sit, and neither). Once they had chosen, a 9-point scale appeared on the screen with the instruction: "Please judge how good the pronunciation is from 1 (hard to understand) to 9 (easy to understand) by clicking the corresponding boxes". The native listeners could click the "replay" box to listen to the token as often as they wished before making a final choice. There were also two practice trials. When the intended token matched the token a listener chose, that listener's goodness rating was recorded as the token's score; when they did not match, a score of 0 was recorded. To obtain the mean score for each target token in each word context for each participant in both the controlled and spontaneous production tests, the current study used the following formula:

$${\rm{Production}}\, {\rm{accuracy}} = {\rm{the}}\, {\rm{total}}\, {\rm{of}}\, {\rm{native}}\, {\rm{listeners}}'\, {\rm{scores}}\, {\rm{for}}\, {\rm{each}}\, {\rm{token}}/6$$
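The formula above can be sketched directly. Each of the 6 native listeners contributes the 1–9 goodness rating when the identification matches the intended word and 0 otherwise, and the token's accuracy is the mean across listeners:

```python
# Minimal sketch of the production scoring rule described above.

def production_accuracy(judgements):
    """judgements: one (matched: bool, goodness: int 1-9) pair per listener.

    Returns the mean score across listeners: the goodness rating counts only
    when the listener's identification matched the intended word.
    """
    return sum(goodness if matched else 0
               for matched, goodness in judgements) / len(judgements)
```

For instance, a token identified correctly by 3 of 6 listeners, each rating it 6 for goodness, would score (6 + 6 + 6 + 0 + 0 + 0) / 6 = 3.0.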

Reliability analysis

Cronbach's α was computed to verify the interrater agreement among the 6 native judges. The coefficient was 0.74 for the entire data set (n = 1980), 0.76 for the controlled production tokens (n = 1260), and 0.73 for the spontaneous production tokens (n = 720).
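As a sketch of this reliability check, Cronbach's α can be computed over a listeners-by-tokens score matrix, treating the k = 6 listeners as the "items" (pure Python with sample variances; the study's actual computation may have used a statistics package):

```python
# Cronbach's alpha for interrater agreement: alpha = (k/(k-1)) *
# (1 - sum of per-rater variances / variance of per-token totals).
from statistics import variance

def cronbach_alpha(scores_by_rater):
    """scores_by_rater: list of k lists, each holding one rater's scores
    over the same sequence of tokens."""
    k = len(scores_by_rater)
    item_vars = sum(variance(r) for r in scores_by_rater)
    totals = [sum(col) for col in zip(*scores_by_rater)]
    return (k / (k - 1)) * (1 - item_vars / variance(totals))
```

A sanity check on the formula: if all 6 raters assign identical scores, α equals 1.0; disagreement among raters pulls α downward.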

Results

Results are divided into two sections. The first section provides the overall results over time and presents the results of data analyses. The second section looks at the potential role of auditory processing in L2 vowel learning following recasts.

Pre- and posttest results

In this section, we examined how the provision of recasts facilitated the learning of the English vowels /i/-/ɪ/ by Chinese native speakers. Raw test scores are summarized in Tables 3–5. Before each analysis of variance (ANOVA), statistical assumptions were verified with normality checks, Levene's tests, and Mauchly's test.

Table 3 Descriptive results of the perception tests.
Table 4 Descriptive results of the controlled production tests.
Table 5 Descriptive results of the spontaneous production tests.

To determine whether there was any significant difference between groups, participants' pretest scores on trained and untrained items were entered into a one-way ANOVA with one between-group factor (recast group vs. control group). No significant group effects were found (p > 0.05), indicating the comparability of learners at the time of the pretests.

To examine the improvement in perception following recasts, a three-way mixed ANOVA was conducted with group as a between-group factor (recast group vs. control group) and lexical context (trained and untrained items) and time (pretest, immediate posttest, and delayed posttest) as within-group factors. Although neither the three-way Lexis × Group × Time interaction nor the two-way Lexis × Time interaction was significant (p > 0.05), there were significant effects for Group × Time, F(2, 57) = 80.865, p < 0.001, Lexis × Group, F(1, 58) = 16.117, p < 0.001, Lexis, F(1, 58) = 4.412, p = 0.040, Group, F(1, 58) = 247.203, p < 0.001, and Time, F(2, 57) = 102.961, p < 0.001. According to Bonferroni multiple comparisons, the recast group significantly improved their perceptual abilities on the immediate posttest (trained items, M = 28.71 → 65.38, p < 0.001, d = 3.218; untrained items, M = 26.14 → 58.69, p < 0.001, d = 3.549) and the delayed posttest (trained items, M = 28.71 → 61.89, p < 0.001, d = 2.975; untrained items, M = 26.14 → 54.55, p < 0.001, d = 2.734). In contrast, the control group did not significantly increase their identification scores on the immediate posttest (trained items, M = 26.59 → 27.12, p = 0.853, d = 0.045; untrained items, M = 26.74 → 30.83, p = 0.179, d = 0.420) or the delayed posttest (trained items, M = 26.59 → 27.27, p = 0.824, d = 0.057; untrained items, M = 26.74 → 28.79, p = 0.478, d = 0.204).

As for the improvement in controlled production following recasts, although the three-way ANOVA did not yield significant effects for the Lexis × Group × Time interaction or the two-way Lexis × Time interaction (p > 0.05), there were significant effects for Group × Time, F(2, 57) = 214.540, p < 0.001, Lexis × Group, F(1, 58) = 16.303, p < 0.001, Group, F(1, 58) = 569.685, p < 0.001, and Time, F(2, 57) = 247.115, p < 0.001. In the same vein, Bonferroni post hoc comparisons revealed that the recast group significantly improved their controlled production over time on both the immediate posttest (trained items, M = 1.73 → 5.00, p < 0.001, d = 4.890; untrained items, M = 1.50 → 4.23, p < 0.001, d = 4.351) and the delayed posttest (trained items, M = 1.73 → 4.33, p < 0.001, d = 4.376; untrained items, M = 1.50 → 3.80, p < 0.001, d = 3.896). By contrast, there was no significant improvement in the control group's performance on the immediate posttest (trained items, M = 1.63 → 1.77, p = 0.380, d = 0.217; untrained items, M = 1.67 → 1.77, p = 0.501, d = 0.169) or the delayed posttest (trained items, M = 1.63 → 1.73, p = 0.586, d = 0.147; untrained items, M = 1.67 → 1.70, p = 0.787, d = 0.056).

Concerning the improvement in spontaneous production following recasts, although the three-way ANOVA did not yield a significant main effect for Lexis (p > 0.05), there were significant effects for Lexis × Group × Time, F(2, 57) = 3.536, p = 0.036, Lexis × Time, F(2, 57) = 4.972, p = 0.010, Group × Time, F(2, 57) = 333.128, p < 0.001, Lexis × Group, F(1, 58) = 10.496, p = 0.002, Group, F(1, 58) = 680.223, p < 0.001, and Time, F(2, 57) = 333.128, p < 0.001. Once again, Bonferroni post hoc comparisons revealed that the recast group's gains were significant not only on the immediate posttest (trained items, M = 1.13 → 3.77, p < 0.001, d = 6.733; untrained items, M = 1.27 → 3.27, p < 0.001, d = 4.113) but also on the delayed posttest (trained items, M = 1.13 → 3.70, p < 0.001, d = 5.722; untrained items, M = 1.27 → 3.23, p < 0.001, d = 3.580). However, the control group's gains were not significant on either the immediate posttest (trained items, M = 1.20 → 1.30, p = 0.375, d = 0.227; untrained items, M = 1.33 → 1.37, p = 0.787, d = 0.082) or the delayed posttest (trained items, M = 1.20 → 1.27, p = 0.536, d = 0.163; untrained items, M = 1.33 → 1.37, p = 0.823, d = 0.077). The pre- and posttest results are plotted in Figs. 4–6.
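The effect sizes reported in the comparisons above can be sketched with a standard Cohen's d over two score distributions and a pooled standard deviation (the exact variant the study used is not specified, so this is an illustrative implementation only):

```python
# Hedged sketch of the effect-size computation behind the reported d values:
# Cohen's d = (mean(post) - mean(pre)) / pooled standard deviation.
from statistics import mean, variance

def cohens_d(pre, post):
    """Cohen's d for two independent score lists, pooled-SD variant."""
    n1, n2 = len(pre), len(post)
    pooled = (((n1 - 1) * variance(pre) + (n2 - 1) * variance(post))
              / (n1 + n2 - 2)) ** 0.5
    return (mean(post) - mean(pre)) / pooled
```

Identical pre and post distributions give d = 0, and a two-unit mean shift against a pooled SD of 1 gives d = 2.0.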

Fig. 4: Mean values of perception test scores.

There is no significant difference between the recast group and the control group at the pretest. However, the recast group significantly outperforms the control group at both posttests regardless of lexical items.

Fig. 5: Mean values of controlled production test scores.

There is no significant difference between the recast group and the control group at the controlled production pretest. However, the recast group significantly outperforms the control group at both controlled production posttests regardless of lexical items.

Fig. 6: Mean values of spontaneous production test scores.

There is no significant difference between the recast group and the control group at the spontaneous production pretest. However, the recast group significantly outperforms the control group at both spontaneous production posttests regardless of lexical items.

Individual differences in auditory processing and the effectiveness of recasts

First, a Pearson correlation analysis was conducted to investigate the interrelationship between the two specific constructs of auditory processing. The results demonstrated no significant correlation between the two constructs (r = −0.215, p = 0.131), suggesting that perceptual acuity and audio-motor integration tap into different dimensions of auditory processing profiles. The raw scores of participants' auditory processing profiles are presented in Table 6.

Table 6 Descriptive statistics of participants’ auditory processing profiles.

Next, a set of simple correlation analyses was conducted to see how the learning gains from recasts were tied to individual differences in auditory processing. Individual differences in perceptual acuity/audio-motor integration were entered as the predictor variable and the improvement in vowel perception and production as the outcome variable. Here, the improvement in vowel perception was calculated by averaging two improvement scores: one obtained by subtracting the pretest scores from the immediate posttest scores, and the other by subtracting the pretest scores from the delayed posttest scores. The improvement in controlled production and spontaneous production was computed in the same way. Before each correlation analysis, the assumptions of linearity, homoscedasticity, normality, and independence were verified. Table 7 summarizes the results of the simple correlation analyses of auditory processing and L2 vowel gain scores.
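The gain-score and correlation computations described above can be sketched as follows (pure Python; a real analysis would use a statistics package):

```python
# Sketch of the analysis pipeline: a gain score averages the pre-to-immediate
# and pre-to-delayed improvements, and a Pearson r relates gains to an
# auditory-processing measure across participants.
from statistics import mean

def gain(pre, post1, post2):
    """Average of the two improvement scores (posttest minus pretest)."""
    return ((post1 - pre) + (post2 - pre)) / 2

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)
```

For example, with the perception means reported earlier for trained items (pretest 28.71, immediate posttest 65.38, delayed posttest 61.89), the average gain is (36.67 + 33.18) / 2 = 34.925.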

Table 7 The partial correlation analyses of auditory processing and L2 vowel gain scores.

As for trained items, perceptual acuity was significantly related to improvement in vowel perception (r = −0.364, p = 0.048). Referring to Plonsky and Oswald's (2014) criteria (r = 0.25 for small, 0.40 for medium, and 0.60 for large), the correlation strength indicated that the role of perceptual acuity in vowel perception could be considered small to medium. Audio-motor integration was also significantly related to improvement in controlled production (r = 0.385, p = 0.036) and spontaneous production (r = 0.374, p = 0.042). Both correlations were small to medium in strength.

Regarding untrained items, perceptual acuity was similarly related to improvement in vowel perception (r = −0.416, p = 0.022), a medium-sized correlation. Audio-motor integration was also significantly related to improvement in controlled production (r = 0.438, p = 0.015) and spontaneous production (r = 0.405, p = 0.026); both correlations were medium in strength. Scatterplots in Figs. 7–12 display the relationships between auditory processing and vowel gain scores.

Fig. 7: Relationship between perceptual acuity and perceptual gains in trained items.

As for trained items, perceptual acuity is significantly correlated with improvement in perception.

Fig. 8: Relationship between perceptual acuity and perceptual gains in untrained items.

As for untrained items, perceptual acuity is significantly correlated with improvement in perception.

Fig. 9: Relationship between audio-motor integration and controlled production gains in trained items.

As for trained items, audio-motor integration is significantly correlated with improvement in controlled production.

Fig. 10: Relationship between audio-motor integration and controlled production gains in untrained items.

As for untrained items, audio-motor integration is significantly correlated with improvement in controlled production.

Fig. 11: Relationship between audio-motor integration and spontaneous production gains in trained items.

As for trained items, audio-motor integration is significantly correlated with improvement in spontaneous production.

Fig. 12: Relationship between audio-motor integration and spontaneous production gains in untrained items.

As for untrained items, audio-motor integration is significantly correlated with improvement in spontaneous production.

Discussion

In the context of Chinese native speakers learning the English vowels /i/-/ɪ/, the current study examined how the provision of recasts can facilitate L2 vowel learning and how the resulting learning gains can be tied to individual differences in auditory processing. Our data revealed two significant findings. First, recasts significantly facilitated Chinese native speakers' learning of the English vowels /i/-/ɪ/ regardless of time and lexical context. Second, the learning gains resulting from recasts were significantly correlated with specific constructs of auditory processing: perceptual acuity was significantly correlated with gains from recasts in identifying L2 vowels, and audio-motor integration with gains from recasts in the controlled/spontaneous production of L2 vowels.

As for the role of recasts in L2 vowel learning, the findings here echoed the previous literature (Lee and Lyster, 2016), showing that recasts are pivotal in developing accuracy in L2 vowel perception. They also provide additional support for the instructional potential of recasts in Chinese classrooms, where exposure to L2 is quantitatively and qualitatively limited (Zhang et al., 2021).

Overall, the statistical analyses showed that the recast group significantly outperformed the control group on the immediate and delayed posttests, whereas exposing Chinese adult L2 learners to task-based interaction without CF did not significantly facilitate their vowel learning between the pretest and posttests. These results are consistent with previous research showing that task-based interaction alone benefits mainly those L2 learners who lack adequate knowledge of the target sounds; increasing output enhancement by providing CF may be necessary, especially for relatively advanced learners, such as the Chinese learners in the current study (Saito, 2021).

Furthermore, whereas the recast group improved significantly in both trained and untrained items, the gains from recasts were more evident in trained than in untrained items. Even so, recasts appear to have led learners to internalize, consolidate, and generalize their new phonological knowledge of the English vowels /i/-/ɪ/ beyond the lexical items they had practiced during the treatment. This argument aligns with the results of previous studies demonstrating that CF facilitates learners' attentional transition from vocabulary to sound learning (Gooch et al., 2016; Saito, 2013). Another explanation for the difference in scores between trained and untrained items could be that L2 learners had somewhat more opportunities to notice, memorize, and incorporate L2 input in the former lexical contexts than in the latter.

That being said, the most noteworthy finding for discussion is that the instructional effectiveness of recasts was significantly correlated with specific constructs of auditory processing. Namely, learners with more precise perceptual acuity may better capitalize on their ability to encode spectral and temporal information in recasts to promote L2 vowel perception. In contrast, learners with more robust audio-motor integration may better use their ability to remember and reproduce auditory information in recasts to facilitate L2 vowel production.

In line with the existing evidence for the Auditory Precision Hypothesis-L2 (Mueller et al., 2012), the current study suggests that auditory precision anchors language learning, since it helps learners encode and memorize sound characteristics, allowing them to make full use of the pedagogical opportunities of each language input. The findings here add that auditory processing also mediates the effectiveness of more meaning-oriented instruction (e.g., recasts); to date, aptitude-treatment interaction effects have been observed only in form-oriented instruction (Chandrasekaran et al., 2010; Shao et al., 2023). Our findings thus shed light on the idea that individual differences in auditory processing can explain why L2 learners achieve varying degrees of success under different instructional conditions (form-oriented and meaning-oriented).

Regarding the relative weights of perceptual acuity and audio-motor integration identified in L2 vowel learning gains, we ascribe them to the three acquisitional functions of recasts. That is, phonological recasts can provide learners with negative evidence (i.e., a clear signal of errors), positive evidence (i.e., instructor speech models), and output practice (i.e., self-modified output). First and foremost, recasts can provide learners with a great deal of positive evidence, which may, in turn, greatly influence L2 speech perception (Mackey et al., 2000). During the treatment sessions, L2 learners were encouraged to pay selective attention to the perceptual difference between their nontarget-like sounds and the instructor's target models. In this way, L2 learners were induced to notice and attend to the perceptual details of auditory models, such as the formant, pitch, duration, and amplitude dimensions of acoustic sounds. This is precisely the skill tested by the perceptual acuity test, which presents learners with several stimuli that differ in only one acoustic dimension (pitch, formant, duration, or amplitude) and asks them to discriminate between them. Learners who can make better use of the positive evidence in recasts to notice, restructure, update, and refine the temporal and spectral features of L2 speech may achieve more fine-grained speech perception. Second, the importance of the negative evidence in recasts cannot be neglected. The clear signal of error in a recast may help learners (a) double-check whether their production of the target sounds is nativelike and (b) modify their output in reaction to the instructor's target models (Ellis et al., 2001). Upon receiving recasts, L2 learners were prompted to follow the tracks of auditory patterns within seconds, stride over multiple speech segments, and quickly develop the appropriate motor sequence to reproduce these patterns.
This is exactly the skill tested by the audio-motor integration task, which presents L2 learners with several seconds of frequency/temporal patterns and asks them to reproduce them. In this regard, individuals endowed with more precise, accurate audio-motor integration will capture the broad acoustic information in recasts more rapidly and subsequently formulate more stable and efficient motor actions. Third, recasts may provide many opportunities for self-modified output (Sheen, 2006). In the recast-modified output sequence, L2 learners with a greater talent for linking auditory input with motor output may benefit more from recasts and improve their production.

It is important to mention that the relative effects of auditory processing (perceptual acuity and audio-motor integration) on learning gains were demonstrated in both trained and untrained items, with untrained items showing stronger effects than trained ones. This pattern may be explained by the different depths of auditory processing the items triggered. Learners exposed to trained items might need less time to encode, represent, and integrate the basic acoustic cues available in the auditory input, which they had practiced during the treatment sessions. In other words, auditory processing might not necessarily be in active operation for trained items. Conversely, when learners were exposed to untrained items, they needed to encode the new auditory input accurately and refine their perception strategies to identify and classify the relative weights of spectral and temporal cues across different L2 vowel learning instances. This might impose additional demands on auditory processing, requiring learners to pay attention to dimensions, or specific values within dimensions, that they were accustomed to ignoring.

Finally, the findings here suggest that the source of individual differences in instructed L2 vowel learning originates at least partly in auditory processing. From a research perspective, auditory processing can serve as an additional informative measure of individual differences in L2 vowel learning. From a practical perspective, individuals' auditory processing profiles can suggest how they can benefit the most from recasts (perceptual acuity for perception, audio-motor integration for production).

Conclusions and future directions

To conclude, the current study provides further evidence for the facilitative role of recasts in L2 vowel learning demonstrated in prior studies. Results suggested that recasts significantly facilitated Chinese native speakers' learning of the English vowels /i/-/ɪ/ regardless of time and lexical context. In addition, the current study makes a novel contribution in that we tested how the learning gains resulting from recasts can be tied to individual differences in auditory processing. Results indicated that learners with more precise perceptual acuity may better capitalize on their ability to encode the spectral and temporal information in recasts to promote L2 vowel perception, while those with more robust audio-motor integration may better use their ability to remember and reproduce auditory information in recasts to enhance L2 vowel production. These findings deepen our understanding of the relationship between auditory processing and instructed L2 vowel learning.

Considering the exploratory nature of the current study, several future directions can be pursued.

First and foremost, only behavioral protocols of auditory processing were used in the current study. Future investigations should include electrophysiological measurements to assess learners' brainwave representations of sounds.

Second, the target features were the English vowels /i/-/ɪ/ in monosyllabic minimally paired words. Studies of more target features (e.g., lexical tones vs. word stress vs. intonation) are called for to better understand how various target features influence distinct aspects of auditory processing and instructed L2 speech learning.

Third, the current investigation examined only one type of CF (i.e., recasts). In real ESL/EFL classrooms, instructors may use various CF types to correct L2 learners' sound errors. Therefore, additional types (e.g., prompts) should be considered for a more comprehensive analysis of the effectiveness of CF and its relationship with auditory processing.

Finally, for the sake of a fully-fledged understanding of the auditory processing mechanism underlying instructed L2 vowel learning, future research can examine the complex relationship between auditory processing, experience factors (e.g., age, learning contexts, proficiency levels), and cognitive factors (e.g., working memory, inhibitory control, and attention control) and their composite impacts on the effectiveness of recasts in the context of L2 vowel learning.