In many species, the young are particularly sensitive to environmental inputs at certain periods during development. The barn owl’s ability to localize prey is calibrated by auditory-visual input during an early sensitive period in development; wearing prisms (or ear plugs) alters the mapping during this period (Knudsen 2002). Binocular fusion is dependent on binocular visual input during a critical period early in development; rearing cats with one occluded eye irreversibly alters binocular representation in the visual centers of the cortex (Hubel and Wiesel 1977; Shatz and Stryker 1978). In songbirds, learning their species-typical song depends on experience during a critical temporal window; presentation of conspecific song during that time is essential for normal development (Konishi 1985; Marler 1970). A recent theoretical paper (Werker and Hensch 2015) discusses the nature of the “critical” periods, especially the biological factors that “open” and “close” them. Here, we review work from our laboratory that focuses on one specific time period for human infants’ learning; namely, the “sensitive period” for phonetic learning and the experiential factors that may influence this learning process. We first discuss the developmental trajectory of infants’ abilities to discriminate native and nonnative phonetic contrasts between 6 and 12 months of age, and then several experiential factors we have observed in laboratory studies that influence infants’ ability to discriminate speech sounds during this sensitive period. Lastly, we discuss future directions for research that will help elucidate the mechanisms through which these experiential factors exert their influences.

Early phonetic learning

Infants’ language learning starts early in development. Infants’ speech perception skills show a dual change toward the end of the first year of life (Figure 1). Not only does nonnative speech perception decline (Best and McRoberts 2003; Werker and Tees 1984), but, also, native-language speech perception skills show improvement, reflecting a facilitative effect of experience with native language (Kuhl et al. 2006; Tsao, Liu, and Kuhl 2006). The mechanism underlying change during this sensitive period in development, and the relationship between the change in native and nonnative speech perception, is of theoretical interest. Data show that at the cusp of this developmental change, infants’ native and nonnative phonetic perception skills predict later language ability, but in opposite directions (Figure 2) (Kuhl et al. 2008). Better native phonetic discrimination at 7.5 months predicts faster native-language advancement; whereas better nonnative phonetic discrimination predicts slower native-language advancement. We have argued that this pattern of results is indicative of “neural commitment” to the native language and reflects infant attention to the acoustic cues made available by language input, especially language input in the form of “motherese” (more recently termed “parentese” because both mothers and fathers use it) in one-on-one social contexts (Ramirez-Esparza, Garcia-Sierra, and Kuhl 2014, 2016). That is, better skills in native phonetic discrimination support neural network development, which allows more efficient processing of native speech sounds. Alternatively, better skills in nonnative phonetic perception reveal uncommitted neural circuitry that is less efficient for processing native speech sounds (see Kuhl 2004, Kuhl et al. 2008, for elaboration).

Figure 1

Effects of age on discrimination of the American English /ra/-/la/ phonetic contrast by American and Japanese infants at 6–8 and 10–12 months of age

Note: Mean percent correct scores are shown with standard errors indicated.

Source: Kuhl (2004)

Figure 2

A median split of infants whose MMNs indicate better versus poorer discrimination of (a) native and (b) nonnative phonetic contrasts is shown, along with their corresponding longitudinal growth curve functions for the number of words produced between 14 and 30 months of age

Source: Kuhl et al. (2008)

Social influences on phonetic learning during the sensitive period

During infants’ sensitive period for phonetic learning, between 6 and 12 months of age, studies show that infant perception of speech is highly malleable. During this time, laboratory experiments indicate that distributional and statistical learning can occur with just two minutes’ exposure to novel speech material (e.g., Maye, Werker, and Gerken 2002; Saffran, Aslin, and Newport 1996). However, studies investigating whether infants are capable of phonetic learning at 9 months of age from natural first-time exposure to a foreign language have also shown strong social influences (Conboy and Kuhl 2011; Kuhl, Tsao, and Liu 2003). In a foreign-language intervention experiment, Kuhl and colleagues (Kuhl, Tsao, and Liu 2003) exposed 9-month-old infants to Mandarin Chinese, a language with prosodic and phonetic structure very different from English. Infants heard 4 native speakers of Mandarin (2 male, 2 female) during twelve 25-minute sessions of book reading and play over a 4–6 week period. A control group of infants also came into the laboratory for the same number and variety of reading and play sessions, but heard only English. On average, infants heard about 33,000 Mandarin syllables during the course of the 12 language-exposure sessions. Researchers tested two additional groups of infants; they exposed one group to Mandarin language material on a video screen, and presented the second group with the exact same Mandarin material in the same room and on the same timetable but in an audio-only condition (Figure 3).
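As a rough check on the scale of the live exposure described above, the reported figures can be converted into per-session and per-minute rates. The sketch below uses only the numbers stated in the text (12 sessions, 25 minutes each, about 33,000 syllables); the variable names are illustrative, and because the syllable count is an approximate average, the derived rates are approximate as well.

```python
# Exposure figures as reported in the text for the Mandarin intervention.
SESSIONS = 12
MINUTES_PER_SESSION = 25
TOTAL_SYLLABLES = 33_000  # approximate average reported in the text

# Derived quantities (illustrative arithmetic only).
total_minutes = SESSIONS * MINUTES_PER_SESSION          # total exposure in minutes
total_hours = total_minutes / 60                        # total exposure in hours
syllables_per_session = TOTAL_SYLLABLES / SESSIONS      # average syllables per session
syllables_per_minute = TOTAL_SYLLABLES / total_minutes  # average syllables per minute

if __name__ == "__main__":
    print(f"Total exposure: {total_minutes} min ({total_hours:.1f} h)")
    print(f"About {syllables_per_session:.0f} syllables per session")
    print(f"About {syllables_per_minute:.0f} syllables per minute")
```

Under these figures, the total exposure amounts to roughly five hours spread over several weeks, at an average of about 110 syllables per minute.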

Figure 3

Foreign-language learning experiments show the need for social interaction in language acquisition

Notes: a Nine-month-old infants experienced 12 sessions of Mandarin Chinese through a natural interaction with a Chinese speaker (left) or the identical linguistic information delivered via television (right) or audiotape (not shown). b Natural interaction resulted in significant learning of Mandarin phonemes when compared with a control group who participated in interaction using English (left). No learning occurred from television or audiotaped presentations (middle). Data for age-matched Chinese and American infants learning their native languages are shown for comparison (right).

Source: Adapted from Kuhl, Tsao, and Liu (2003)

After exposure, researchers tested all 4 groups on Mandarin phonetic discrimination. The results from behavioral tests (conditioned head-turn, see Kuhl et al. 2006) on infants after exposure demonstrated that only the group exposed to Mandarin in a social context by live humans learned the Mandarin contrast. The data demonstrated two things: (a) phonetic learning from first-time exposure can occur at 9 months of age, and (b) phonetic learning from natural language exposure during the sensitive period requires social interaction. Similar second-language exposure experiments using Spanish explored both phonetic and word learning, as well as the degree to which social factors, such as visual attention, during the exposure sessions predict individuals’ learning. Using brain measures (event-related potential, ERP, measures; see Kuhl et al. 2008), the results with Spanish replicated previous findings using Mandarin; additionally, they show that English phonetic discrimination does not decline—in fact, it increases, as expected, as Spanish contrast learning increases (Conboy and Kuhl 2011). Moreover, analyses of the video records revealed a significant positive relationship between infants’ social skills—which allowed them to shift gaze between the foreign-language tutor and the toys as the tutor held new toys and named them in the foreign language—and increased neural responsiveness to the Spanish contrast (Conboy, Brooks, Meltzoff, and Kuhl 2015). These correlations between social responses and brain measures of learning buttress the argument that infants’ social skills are coupled to language learning.

The data on infant speech-perception reviewed above suggest that infants are very sensitive to social language input during the period between 6 and 12 months. Infants’ sensitivity is so high that even a foreign language introduced for the first time at 9 months causes robust phonetic learning when it is delivered in a social context. This leads to the hypothesis that the mechanisms underlying infant speech-perception are somehow “tuned” to language input, delivered socially, during this time. The corollary hypothesis is that only language input can influence these mechanisms at this time.

A recent experiment suggests that the corollary hypothesis must be altered. In the next section, we review the results of an experiment that exposes infants to music in a way that is similar to previous experiments using foreign-language interventions during the sensitive period (Conboy and Kuhl 2011; Kuhl, Tsao, and Liu 2003). In the music intervention, researchers exposed infants to a particular rhythmical structure in music, the triple meter (the waltz), for 12 sessions in a social context, using a randomized control design. The control group experienced similar activities in a social setting, but no music. After 12 sessions, the research team tested both intervention and control infants with violations of rhythmic structure in both music and speech. The results show effects on both music and speech, and reveal activation in the infants’ auditory-sensory and prefrontal cortices. In the remaining sections, we detail these findings and discuss their implications.

Effects of music intervention on infants’ phonetic learning

During the last decade, music training that starts early in development has received increasing attention in the science community as an important early experience, given the growing amount of evidence suggesting the robust and extensive training-related benefits in auditory, language, and cognitive abilities (Kraus and Chandrasekaran 2010; Shahin 2011; Zatorre 2013). Previous studies—using various methodologies, including behavioral, electrophysiological, and neural imaging methods—have demonstrated repeatedly that musically trained adults and children exhibit enhanced processing of musical information (e.g., musical pitch and meter) in comparison to nontrained groups (Fujioka, Ross, Kakigi, Pantev, and Trainor 2006; Geiser, Sandmann, Jancke, and Meyer 2010; Habibi, Cahn, Damasio, and Damasio 2016; Koelsch, Schroger, and Tervaniemi 1999; Pantev et al. 1998; Vuust et al. 2005; Zhao and Kuhl 2015a, b).

More importantly, prior studies have also demonstrated generalization effects in trained individuals from their early musical experience to other domains, one of the most studied being speech processing. The ability to process complex speech sounds accurately and efficiently is critical in language development, as speech processing in infants can robustly predict language abilities in early childhood (see “Early phonetic learning” section); at the same time, studies have shown that developmental language disorders (e.g., dyslexia, specific language impairment) have origins in auditory processing deficits (Goswami 2011; Tallal and Gaab 2006). So far, researchers have found that musically trained adults and children can better encode the acoustic details in speech at the level of the brainstem, especially when speech is embedded in noise (Bidelman, Weiss, Moreno, and Alain 2014; Parbery-Clark, Skoe, Lam, and Kraus 2009; Parbery-Clark, Tierney, Strait, and Kraus 2012; Strait, Parbery-Clark, O’Connell, and Kraus 2013). At the cortical level, researchers have observed that musically trained individuals process pitch information in both native and foreign speech better than nonmusicians do; one study focusing on the temporal information in speech demonstrated that adult musicians could also track syllable structures in words better (Magne, Schon, and Besson 2006; Marie, Magne, and Besson 2011; Marques, Moreno, Castro, and Besson 2007; Wong, Skoe, Russo, Dees, and Kraus 2007). These cross-domain effects from early music training to speech perception raise theoretically interesting and important questions about the different levels of processing (e.g., lower-level acoustic processing vs. higher-level cognitive skills) affected by early experience and how they can support the observed generalization effects (Kraus and Chandrasekaran 2010).

Following this growing literature, we examined the rich experience of music training in an even earlier developmental stage (9 months of age) for both theoretical and methodological reasons (Zhao and Kuhl 2016). Theoretically, this approach allowed us to compare the effects of music experience during the sensitive period of phonetic learning to other previously studied experiences, such as experience of a foreign language (Kuhl, Tsao, and Liu 2003). Methodologically, (1) we were able to randomly assign infants at this age to complete either a structured laboratory-controlled music intervention (Intervention) or control activities (Control). This approach allowed controlling for effects related to predispositions (e.g., genetics), prior music experience, and the variability in individuals’ music training (e.g., onset, nature, and duration of the music experience); (2) we focused on temporal information processing, which has less experimental data regarding effects derived from early music training. In this study, the Intervention targeted infants’ learning of a specific meter (triple meter—e.g., waltz) and we tested the effects of the Intervention on both music (metrical structure) and speech (syllable structure); (3) we used neural responses, measured by magnetoencephalography (MEG), as outcome measures to compare Intervention and Control infants in the spatial and temporal aspects of their cortical responses.

We predicted enhancement in both music and speech domains, following the rationale that the Intervention—targeting infants’ learning of a specific meter—exerts influence at a higher level of processing. We argued that the Intervention infants would become better at extracting the temporal pattern of complex sounds over time, leading to their ability to make more robust predictions about the timing of future stimuli based on the extracted temporal structure—an ability that would affect both music and speech processing.

The design of the Intervention/Control sessions paralleled our prior studies in the laboratory on infant speech learning at 8–10 months of age (see “Social influences”). Specifically, we recruited 9-month-old infants raised in monolingual English-speaking environments with comparable prior and concurrent music listening experiences at home, whose parents were not performing musicians. We randomly assigned infants to the Intervention or Control group; each group completed 12 sessions (15 minutes each) of the corresponding activity in the laboratory over a 4-week period.
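The random assignment described above can be illustrated with a short sketch. This is a minimal example of balanced random assignment and not the procedure actually used in the study, which the text does not specify; the function name `assign_groups`, the participant labels, and the cohort size of 40 are hypothetical.

```python
import random


def assign_groups(participant_ids, groups=("Intervention", "Control"), seed=None):
    """Shuffle participants, then deal them into groups in turn.

    Dealing from a shuffled list keeps group sizes equal (or within one
    of each other) while leaving each participant's group to chance.
    """
    rng = random.Random(seed)
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)
    assignment = {}
    for index, participant in enumerate(shuffled):
        assignment[participant] = groups[index % len(groups)]
    return assignment


if __name__ == "__main__":
    # Hypothetical cohort; the sample size is an assumption for illustration.
    infants = [f"infant_{i:02d}" for i in range(40)]
    assignment = assign_groups(infants, seed=2016)
    for group in ("Intervention", "Control"):
        size = sum(1 for g in assignment.values() if g == group)
        print(group, size)
```

Because assignment is left to chance, factors such as predispositions and prior music experience are expected to be distributed evenly across the two groups on average, which is the rationale the text gives for randomizing.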

In the Intervention/Control sessions, we incorporated several key components to maximize infants’ learning specific to the Intervention while reflecting naturalistic infant music classes: (1) Intervention infants experienced various infant tunes and songs only in triple meter (e.g., waltz). We selected triple meter as the target temporal structure because studies have shown that it is a more difficult temporal structure in Western music for infants to process at this age than duple meter (e.g., marching music) (Bergeson and Trehub 2006), yet infants can rapidly learn temporal patterns in the music of their culture (Gerry, Faux, and Trainor 2010; Hannon and Trehub 2005a, b); (2) Intervention infants, with the aid of caregivers, tapped out the musical beats with maracas or their feet, and their caregivers often bounced them in synchronization to the musical beats—activities that are common in infant music classes and effective in infants’ learning of temporal structure (Phillips-Silver and Trainor 2005); (3) the Control sessions offered comparable visits to a laboratory, familiarity with the laboratory environment, levels of social interaction with other infants and caregivers, and levels of motor activity and engagement, but without music. For example, infants, aided by their parents, played with toy cars, blocks, and other objects that required coordinated movements, such as moving and stacking; (4) in both the Intervention and Control sessions, researchers engaged infants in a social setting with 1–2 other infants and their caregivers, a setting demonstrated in previous work to be effective when infants are exposed to a foreign language (Kuhl, Tsao, and Liu 2003). Experimenters facilitated each session by engaging the infants and their caregivers in the activities to a comparable degree.

To examine whether the Intervention enhanced infants’ general ability to extract temporal structure and generate more robust predictions about future stimuli in complex auditory sounds, we compared Intervention infants’ neural responses to temporal structure violations in both music and speech, in temporal (auditory) and prefrontal cortical regions, with those of their Control group counterparts. We quantified these responses using the mismatch response (MMR), traditionally measured with an oddball paradigm. In this paradigm, a standard stimulus is presented on approximately 85% of the trials to establish a temporal structure; on the remaining 15% of the trials, a deviant stimulus that violates this temporal structure is presented at random (Figures 4a, 5a). The magnitude of the MMR, which peaks around 150–350 ms after the violation onset, reflects neural sensitivity to the violation of temporal structure, and thus the tracking and learning of that temporal structure (Bekinschtein et al. 2009; Schwartze and Kotz 2013; Winkler, Denham, and Nelken 2009). We recorded neural responses to all stimuli using magnetoencephalography (MEG), which has excellent temporal resolution and good spatial resolution, allowing examination of the MMR in the specific time window of interest (i.e., around 150–350 ms post-violation) and in the target cortical regions (i.e., temporal and prefrontal regions).
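The logic of the oddball paradigm and the MMR measure can be sketched in a few lines. The example below is a simplified illustration under stated assumptions, not the analysis pipeline used in the study: it builds a trial sequence with about 15% deviants in random order, then takes the MMR as the mean of the deviant-minus-standard difference wave within a 150–350 ms window. All function names and the toy data are hypothetical, and real MEG analyses involve preprocessing and source estimation that are omitted here.

```python
import random
from statistics import mean


def make_oddball_sequence(n_trials, deviant_prob=0.15, seed=0):
    """Return trial labels with the requested share of deviants, shuffled."""
    rng = random.Random(seed)
    n_deviants = round(n_trials * deviant_prob)
    labels = ["deviant"] * n_deviants + ["standard"] * (n_trials - n_deviants)
    rng.shuffle(labels)
    return labels


def average_waveform(epochs):
    """Average equal-length epochs sample by sample."""
    return [mean(samples) for samples in zip(*epochs)]


def mismatch_response(standard_epochs, deviant_epochs, times_ms, window=(150, 350)):
    """Mean of the deviant-minus-standard difference wave inside the window."""
    standard_avg = average_waveform(standard_epochs)
    deviant_avg = average_waveform(deviant_epochs)
    difference = [d - s for d, s in zip(deviant_avg, standard_avg)]
    in_window = [v for t, v in zip(times_ms, difference)
                 if window[0] <= t <= window[1]]
    return mean(in_window)


if __name__ == "__main__":
    sequence = make_oddball_sequence(200)
    print(sequence.count("standard"), "standards,",
          sequence.count("deviant"), "deviants")

    # Toy epochs: flat responses to standards, a unit deflection to deviants
    # inside the 150-350 ms window (invented values for illustration).
    times = list(range(0, 500, 50))
    standards = [[0.0] * len(times) for _ in range(3)]
    deviants = [[1.0 if 150 <= t <= 350 else 0.0 for t in times] for _ in range(3)]
    print("MMR:", mismatch_response(standards, deviants, times))
```

A larger value from `mismatch_response` for one group than for another would correspond to the group difference in MMR magnitude that the study compares. Actual oddball designs often add constraints that this sketch leaves out, such as a minimum number of standards between deviants.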

Figure 4

Music condition (MEG)

Notes: a Schematics of stimuli; standard and deviant sounds are acoustically identical, and deviants violate the standard temporal structure. b Top: The group average of the difference waves for the temporal regions of the cortex for the Intervention group and the Control group. Shaded region indicates the selected time window for the MMR. Time 0 marks the onset of the strong beat. Bottom: The group average of the difference waves for the prefrontal regions of the cortex for the Intervention group and the Control group. c Mean MMR values within the target time window by region (temporal region vs. prefrontal region) and group (Intervention vs. Control).

Source: Adapted from Zhao and Kuhl (2016)

Figure 5

Speech condition (MEG)

Notes: a Schematics of stimuli; deviants /bibi/ violate the syllable structure of /bibbi/. In a separate recording (lower panel), /bibi/ served as standards in a constant stream. b Top: The group average of the difference waves for the temporal regions of the cortex for the Intervention group and the Control group. Shaded region indicates the selected time window for the MMR, shifted accordingly with the onset of violation (210ms after the onset of the nonword /bibi/, marked by Time 0). Bottom: The group average of the difference waves for the prefrontal regions of the cortex for the Intervention group and the Control group. c Mean MMR values within the target time window by region (temporal region vs. prefrontal region) and group (Intervention vs. Control).

Source: Adapted from Zhao and Kuhl (2016)

Our results supported our hypotheses, demonstrating that: (1) the Intervention group exhibited a larger MMR to violations in temporal structure for music (i.e., triple meter) than the Control group did; (2) the effects were observed in both temporal (auditory) and prefrontal regions of the cortex (Figure 4b, c); and (3) the enhancement in temporal structure processing generalized to the speech domain, reflected by a larger MMR in temporal and prefrontal cortical regions in response to violations of a foreign temporal structure in the Intervention group (Figure 5b, c).

We therefore demonstrated that a short-term laboratory-controlled music intervention at 9 months of age that reflects naturalistic infant music classes affects not only infants’ functional processing of temporal structure in music but also—more importantly—infants’ processing of syllable structure in speech. We based our prediction of the generalization effects from the Intervention to speech on the rationale that infants would learn to better attend to and extract auditory patterns in the temporal domain, allowing them to generate—from learned patterns—more robust predictions about the timing of future events. Our results thus strongly supported the idea that such enriched music intervention experience may support the development of a broader set of perceptual skills.

The design of the Intervention and the use of a foreign syllable structure in the MEG testing allow us to compare the current results with those of our previous experiments examining the effects of foreign-language intervention during this sensitive period of phonetic learning. In the next section, we discuss in more detail the implications of the result showing enhanced sensitivity to foreign syllable structure contrasts.

Summary and discussion

In this article, we have introduced the concept of what we term a “sensitive period” for infants’ phonetic learning between 6 and 12 months of age (Kuhl 2004). Decades of research have demonstrated that, during this period, infants’ ability to discriminate native speech contrasts improves, whereas their ability to discriminate nonnative speech contrasts declines (Kuhl et al. 2006; Werker and Tees 1984). Further, we discussed evidence that infants’ phonetic learning during this sensitive period is highly malleable, depending on the auditory input infants receive at that time. The ability to discriminate nonnative speech contrasts provides a window through which to study how inputs during the sensitive period can affect infants’ phonetic learning. In a series of studies, we demonstrated that experience with a foreign language could enhance infants’ ability to discriminate the nonnative speech contrasts in that language. More importantly, language experience during this time needs to be social in nature: the same input delivered through a TV screen did not result in learning (Conboy and Kuhl 2011; Kuhl, Tsao, and Liu 2003). Yet, in our most recent study, we show that a music intervention targeting rhythm learning during this sensitive period also enhanced infants’ ability to discriminate a nonnative speech contrast that is based on syllable structure differences.

How does the enriched auditory experience of foreign language and music exert its influence on infants’ phonetic learning during the sensitive period? Previous research has demonstrated the influence of cognitive skills on speech perception in this period; 11-month-old monolingual infants show a strong negative correlation between a specific cognitive control skill (inhibitory control) and nonnative speech discrimination (Conboy, Sommerville, and Kuhl 2008; Diamond, Werker, and Lalonde 1994; Lalonde and Werker 1995). The authors’ interpretation is that infants with good inhibitory control skills are better able to ignore speech sounds that are irrelevant to their native language and therefore exhibit lower nonnative speech discrimination skills, which have been shown to correlate with faster native-language growth (Figure 2; Kuhl et al. 2008). On the other hand, the literature on infants and children raised in bilingual language environments demonstrates that they have enhanced cognitive flexibility compared to their monolingual counterparts (Bialystok and Craik 2010; Kovács and Mehler 2009a, b). We therefore speculate that an enriched auditory experience (i.e., foreign language and music) provides complex yet patterned auditory input; when delivered in a social setting, it allows infants to develop enhanced cognitive abilities to switch between inputs and attune their attentional resources to the relevant and important auditory information.

One specific mechanism by which infants can learn to effectively allocate attentional resources is predictive coding. Dynamic attending theory posits that, once the temporal pattern of the input has been extracted, attentional resources are allocated to the time windows during which the brain predicts that important information will occur (e.g., musical beats, syllables) (Jones and Boltz 1989). Investigators have demonstrated that infants as young as 3 months of age are able to extract temporal patterns and predict future stimuli based on the extracted information (Basirat, Dehaene, and Dehaene-Lambertz 2014; Emberson, Richards, and Aslin 2015). Our recent data using complex auditory stimuli suggest that a music intervention focusing on temporal information learning may have increased infants’ ability to extract high-level temporal patterns and generate stronger predictions about future stimuli, a skill that they can apply both in music and in speech processing. Future research is warranted to, first, establish the relationships between different general cognitive skills (e.g., inhibition, flexibly switching attention) and infants’ ability to discriminate native and nonnative speech sounds. Then, it will be critical to directly test whether short-term language or music experience, in comparison to no exposure, affects these cognitive skills, which can, in turn, affect phonetic learning during the “sensitive period”. In the longer term, researchers should dissect and systematically examine the various components of these enriched auditory experiences (e.g., social elements, multimodal elements) in order to evaluate the effectiveness of each element and the interactions among them. This will not only enhance our theoretical understanding of infant phonetic learning but will also inform the design of early-education interventions, especially for infants at risk for communication disorders.