For thousands of years and across numerous cultures, human infants have mastered oral or signed language in only a few years. No other machine, be it silicon or carbon based, reaches the same level of expertise. Infants acquire their native language more easily than adults learn a second language, or even a first, as in the rare cases of isolated deaf people who receive instruction late in development (Grimshaw, Adelstein, Bryden, & MacKinnon, 1998; Newport, 1990). We propose to capitalize on this observation and explore the neural architecture that favors this successful learning.

Thanks to the development of noninvasive brain imaging techniques, it is now possible to study the infant’s brain without disturbing the participant. Our approach is thus to study infants’ capacities to process speech, describe their neural implementation, and analyze how close, or far, these neural circuits/computations are from those in other animals. This research axis should not be misunderstood. It denies neither that other animals may share some of these capacities nor that some brain areas used in linguistic tasks are shared with other cognitive capacities. Nor does it dismiss the idea that general principles, such as statistical learning, are important in language learning; rather, it emphasizes that language acquisition relies on a particular neural architecture that implements the proper combination of all these mechanisms and that has been selected through evolution to improve communication. This architecture is defined by its structural connectivity between distinctive brain areas whose local properties are appropriate for encoding particular representations of the environment (e.g., the encoding of fine-grained temporal features might be related to the columnar organization of the left human auditory cortex; Buxhoeveden, Switala, Roy, Litaker, & Casanova, 2001), but also by the dynamics of information propagation within the networks, which is adjusted through development by a complex calendar of maturation. Many components of the human language network probably have precursors in other animals, recycled to subserve other goals. Others might have emerged in the Homo lineage, but only a careful comparison between animals’ and human infants’ capacities, and their underlying neural circuits, can clarify these questions.

As brain imaging studies in young children are sparse, many key elements explaining language acquisition are still missing, but this avenue is promising and some results can already be considered. We present results revealing an early organization of the perisylvian areas in several parallel and hierarchical auditory pathways, with functional asymmetries, from the onset of the thalamocortical circuitry at 6 months of gestation. These pathways have different maturational calendars that affect the dynamics within the network, and thus probably the stages of infant learning, which concern both the acquisition of native-language features and of social communication skills.

Early parallel and hierarchical organization of the perisylvian areas

In primates, auditory regions are organized in several parallel and hierarchical streams that process different aspects of a sound (e.g., its source, intensity, timbre, movement, familiarity; Kaas & Hackett, 2000; Tian, Reser, Durham, Kustov, & Rauschecker, 2001). In the case of speech, the main information to be extracted can be separated along two main lines: the message (how meaning is conveyed through a combination of arbitrary sounds) and the context of the message (the speaker, their emotion, their location in the surrounding space, etc.). These representations are progressively elaborated along different streams located in the superior temporal regions, beyond the primary auditory cortex and reaching the inferior frontal lobe (Dehaene-Lambertz, Dehaene, et al., 2006; Wessinger et al., 2001).

This hierarchical and parallel functional organization is already observed in infants’ perisylvian regions (see Fig. 1). When listening to speech, 3-month-old infants (Dehaene-Lambertz, Hertz-Pannier, et al., 2006; Homae, Watanabe, Nakano, & Taga, 2011; Shultz, Vouloumanos, Bennett, & Pelphrey, 2014) and neonates (Pena et al., 2003; Sato et al., 2012) activate a network of temporal and frontal regions roughly similar to that of adults. This is also the case for preterm neonates at 6 months of gestation (Mahmoudzadeh et al., 2013). At this age, the sensory system begins to react to external sounds and the thalamocortical connections reach the cortical plate, starting to feed the first cortical circuits with external information (Kostovic & Judas, 2010). Although the local microcircuitry differs from that of later ages, because most neurons are still migrating toward their final location and the dendritic trees are sparse, the brain’s general connectivity plan is already visible (Doria et al., 2010; Fransson et al., 2007; Smyser, Snyder, & Neil, 2011). Superior temporal and inferior frontal regions are already functionally connected, and react to a change of consonant (/ba/ vs. /ga/) and to a change of voice (male vs. female) randomly occurring in series of repeated syllables (see Fig. 1b). These two syllable features are channeled along different neural circuits, as revealed by the distinct temporal and spatial responses generated by the two types of changes in electroencephalography (EEG) and near-infrared spectroscopy (NIRS) recordings (Mahmoudzadeh et al., 2013; Mahmoudzadeh, Wallois, Kongolo, Goudjil, & Dehaene-Lambertz, 2016). These results demonstrate a functional architecture of parallel processing streams devoted to different sound features, including subtle speech features, from the onset of the thalamocortical circuitry.

Fig. 1

A. Hierarchical organization of the perisylvian regions in 3-month-old infants and adults, illustrated by the gradient of phase of the BOLD response to a single sentence. The mean phase is presented on axial slices placed at similar locations in adult (upper row) and infant (lower row) standard brains and on a sagittal slice in the infant’s right hemisphere. Colors encode the circular mean of the phase of the BOLD response, expressed in seconds relative to sentence onset. The same gradient is observed in both groups along the superior temporal region, extending to Broca’s area. Blue regions are in counterphase with the stimulation (Dehaene-Lambertz, Dehaene, et al., 2006; Dehaene-Lambertz, Hertz-Pannier, et al., 2006). B. Parallel pathways in preterms. Oxyhemoglobin responses to a change of phoneme (ba vs. ga) and a change of voice (male vs. female), measured with NIRS in preterm neonates at 29 wGA. A significant increase of the response to a change of phoneme (DP) relative to the control condition (ST) was observed in both temporal and frontal regions, whereas the response to a change of voice (DV) was limited to the right inferior frontal region. The left inferior frontal region responded only to a change of phoneme, whereas the right responded to both changes. The colored rectangles represent the periods of significant differences between the deviant and the standard conditions in the left and right inferior frontal regions (black arrows) (Mahmoudzadeh et al., 2013). (Color figure online)
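The logic of such an oddball design can be illustrated with a short sketch (our illustration, not the authors’ exact protocol; the syllable labels, number of repetitions, trial count, and deviant probability are assumptions chosen for clarity):

```python
# Sketch of an oddball design: trials of repeated standard syllables
# ending either with the same syllable (ST), a phoneme deviant (DP),
# or a voice deviant (DV). All parameters are illustrative assumptions.
import random

STANDARD = ("ba", "female")
FINALS = {
    "ST": ("ba", "female"),   # control: no change
    "DP": ("ga", "female"),   # phoneme change, same voice
    "DV": ("ba", "male"),     # voice change, same phoneme
}

def make_trial(kind, n_repeats=4):
    """One trial: n_repeats standards followed by the final syllable."""
    return [STANDARD] * n_repeats + [FINALS[kind]]

def make_block(n_trials=60, p_deviant=0.2, seed=1):
    """Randomly intermix control trials with the two deviant types."""
    rng = random.Random(seed)
    return [
        make_trial(rng.choice(["DP", "DV"]) if rng.random() < p_deviant
                   else "ST")
        for _ in range(n_trials)
    ]
```

Because the two deviants differ from the standard along orthogonal dimensions, any difference between the DP and DV responses can be attributed to the processing pathway rather than to mere acoustic novelty.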

A proxy of the hierarchical organization of the perisylvian regions can be observed in the phase of the BOLD response to short sentences presented to infant and adult subjects in a slow event-related fMRI design (Dehaene-Lambertz, Dehaene, et al., 2006; Dehaene-Lambertz, Hertz-Pannier, et al., 2006). In both populations, the BOLD response peaks earlier in primary auditory areas and becomes progressively slower along the dorsal-ventral and posterior-anterior axes of the superior temporal regions (see Fig. 1a). The slowest region is the left inferior frontal region, which is better synchronized with the end than with the start of the sentence. The gradient pattern is similar in adults and infants, although the time interval between the fastest and the slowest region is shorter in adults than in infants (Dehaene-Lambertz, Dehaene, et al., 2006; Dehaene-Lambertz, Hertz-Pannier, et al., 2006). We hypothesized that the BOLD phase might be related to a temporal window of integration that lengthens progressively across these regions and is thus possibly sensitive to larger chunks in the speech stream (Overath, McDermott, Zarate, & Poeppel, 2015). This hierarchical organization may explain infants’ early sensitivity to sentence organization: For example, they prefer listening to sentences with pauses located at prosodic boundaries rather than within prosodic units (Hirsh-Pasek et al., 1987). With its embedded units, the prosodic hierarchy is a natural input for these regions, helping infants to segment the speech stream into coherent chunks. Analyses can then be restricted to each prosodic unit, explaining why the computation of transitional probabilities between syllables, the main mechanism proposed for infants to extract words from the speech stream (Saffran, Aslin, & Newport, 1996), cannot occur across a prosodic boundary (Shukla, White, & Aslin, 2011). Finally, as prosody and syntax are tied, this hierarchical organization might secondarily facilitate the learning of native syntax (Christophe, Millotte, Bernal, & Lidz, 2008).
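For readers interested in how such phase maps are obtained, here is a minimal sketch under assumed acquisition parameters (the TR, trial period, and synthetic time series are placeholders, not the values of the cited studies): the phase of the BOLD response is estimated from the Fourier component of each voxel’s time series at the stimulation frequency, then averaged circularly.

```python
# Minimal sketch: phase of the BOLD response in a slow event-related
# design, and its circular mean. TR and trial period are assumptions.
import numpy as np

TR = 2.4        # repetition time in seconds (assumed)
PERIOD = 28.8   # one sentence trial every PERIOD seconds (assumed)

def bold_delay(ts, tr=TR, period=PERIOD):
    """Delay (in s after trial onset) of the Fourier component of a
    voxel time series at the stimulation frequency 1/period."""
    t = np.arange(len(ts)) * tr
    c = np.sum(ts * np.exp(-2j * np.pi * t / period))
    return (-np.angle(c)) % (2 * np.pi) / (2 * np.pi) * period

def circular_mean_delay(delays, period=PERIOD):
    """Circular mean of delays living on a cycle of length `period`."""
    angles = 2 * np.pi * np.asarray(delays) / period
    m = np.angle(np.mean(np.exp(1j * angles)))
    return m % (2 * np.pi) / (2 * np.pi) * period

# Synthetic check: a response delayed by 6 s is recovered.
t = np.arange(120) * TR
ts = np.cos(2 * np.pi * (t - 6.0) / PERIOD)
print(round(bold_delay(ts), 1))   # ~6.0
```

Mapping this delay voxel by voxel yields the kind of gradient shown in Fig. 1a, with earlier phases in primary auditory cortex and later phases toward the inferior frontal region.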

Infants are skilled in fine-grained temporal coding

The phonetic code relies heavily on fast temporal coding to recover the succession of phonemes in a speech stream. Since the 1970s, it has been known that human infants are particularly good at processing phonemes. Infants display categorical perception and identify phonemes across speakers (Kuhl, 1983) and despite coarticulation (Mersad & Dehaene-Lambertz, 2015). In addition to the message, infants listening to speech should also encode the messenger. Whereas the phonetic code necessitates a fast sampling of the auditory signal to recover all phonemes, voice recognition is based on slower variations. Humans are commonly better able to normalize the linguistic dimension across voices than to recognize the same voice across different linguistic content (Dehaene-Lambertz, Dehaene, et al., 2006). Adults are better at discriminating voices speaking their native language than a foreign one, and the same phenomenon is seen in infants (Johnson, Westrek, Nazzi, & Cutler, 2011). Neonates lose the capacity to recognize their mother’s voice when she reads a text backward, from the last word to the first (Mehler, Bertoncini, Barrière, & Jassik-Gerschenfeld, 1978). When the linguistic and voice dimensions are orthogonally contrasted, the linguistic contrast is generally more salient than the voice contrast. In other words, infants are more adept at discriminating between phonemes or languages, even when these are produced by different voices, than at discriminating between voices when phonemes or languages vary (Kuhl & Miller, 1982; Nazzi, Bertoncini, & Mehler, 1998).

An advantage for phoneme discrimination over voice discrimination is also observed in preterm infants at 6 months of gestation, an age at which the first thalamocortical fibers start to bring exogenous information to the developing cortical plate (Kostovic & Judas, 2010). At this age, preterm infants discriminate the change from /ba/ to /ga/, whereas the response to a change from a female to a male voice is less mature (Mahmoudzadeh, Wallois, et al., 2016). Using the same stimuli and experimental paradigm in rats, Mahmoudzadeh, Dehaene-Lambertz, and Wallois (2016) described a different sensitivity: The animals were sensitive to the spectral changes and reacted more strongly to a change of voice than to a change of phoneme, a characteristic already reported by Toro, Trobalon, and Sebastian-Galles (2005), who observed that rats trained to discriminate between two languages perform at chance when the voices vary across sentences.

These results suggest that humans may benefit from a genetically driven ability for fine temporal coding of the auditory world, which may contribute to their facility with speech stimuli. Several experiments have illustrated the relation between the precision of temporal encoding and performance in tasks using speech stimuli in typical subjects. For example, Kabdebon, Pena, Buiatti, and Dehaene-Lambertz (2015) recorded high-density EEG in 8-month-old infants while they listened to a stream of syllables concatenated according to an A × C structure (i.e., the first syllable predicted the third syllable); the infants were then tested with isolated trisyllabic words that respected, or did not respect, the hidden structure of the training stream. The discrimination responses in the test were significantly correlated with the temporal locking of the EEG to the syllable frequency during the training stream. Similarly, the temporal similarity between auditory cortical activity and the speech envelope predicts speech comprehension in adults (Ahissar et al., 2001). A deficit in temporal encoding has been proposed as one of the mechanisms at the origin of oral and written language impairments (Abrams, Nicol, Zecker, & Kraus, 2009; Lehongre, Ramus, Villiermet, Schwartz, & Giraud, 2011).
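The “temporal locking” measure can be illustrated with a generic inter-trial phase coherence computation (a minimal sketch under assumed parameters; the sampling rate, syllable rate, and data below are placeholders, not those of the cited study):

```python
# Sketch: inter-trial phase coherence of EEG epochs at the syllable
# presentation frequency. Values near 1 indicate strong temporal
# locking; values near 0 indicate none. Parameters are assumed.
import numpy as np

FS = 250.0        # sampling rate in Hz (assumed)
SYLL_RATE = 4.0   # syllables per second (assumed)

def itc(epochs, freq=SYLL_RATE, fs=FS):
    """epochs: array (n_trials, n_samples) from one EEG channel."""
    t = np.arange(epochs.shape[1]) / fs
    coeffs = epochs @ np.exp(-2j * np.pi * freq * t)  # per-trial phase
    return np.abs(np.mean(coeffs / np.abs(coeffs)))

# Toy demonstration: phase-consistent trials yield high coherence.
rng = np.random.default_rng(0)
t = np.arange(int(2 * FS)) / FS
locked = np.array([np.sin(2 * np.pi * SYLL_RATE * t)
                   + 0.5 * rng.standard_normal(t.size)
                   for _ in range(40)])
print(round(itc(locked), 2))   # close to 1
```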

Another example of the biological constraints on phoneme perception is the effect of preterm birth on the decline in discrimination of nonnative phonetic contrasts, which usually occurs at the end of the first year of life in full-term infants (Werker & Tees, 1984). As preterm infants are exposed to aerial speech 3 months earlier than full-term infants, do they show an acceleration of this process? Using electroencephalography (EEG) and a mismatch paradigm in which a change of consonant in CV syllables was introduced after several repetitions of the same syllable, Pena, Werker, and Dehaene-Lambertz (2012) showed that the mismatch response (MMR) to a consonant change crossing a nonnative phonetic boundary (dental vs. retroflex /da/) disappeared, as expected, in 12-month-old full-term infants from Spanish-speaking families (this contrast is not used in Spanish, contrary to Hindi). However, in 12-month-old preterm infants, who were in fact only 9 months old in postconceptional age, the MMR persisted. To observe the extinction of the MMR when the nonnative boundary was crossed, it was necessary to wait until a postconceptional age of 12 months (i.e., 15 months after birth). It is interesting to note that Gonzalez-Gomez and Nazzi (2012) showed that, by contrast, learning the phonotactic rules of the native language depends on the duration of exposure to aerial speech: In this study, French preterm infants were sensitive to whether word onsets respected the most frequent French phonotactic associations. Although the first study used event-related potentials (ERPs) and the second a behavioral measure (looking time), and thus the two may not have the same sensitivity, these results may uncover a critical distinction between a learning mechanism (here, statistical learning, proposed as the main mechanism by which infants converge on the phonetic repertoire of their native language) and the biological factors that facilitate or hinder this learning. For example, it has been proposed that the opening and closure of “critical” windows in the mouse visual cortex rely on two thresholds in the accumulation of the homeoprotein Otx2 in GABAergic parvalbumin interneurons (Hensch, 2004). When the Otx2 level reaches a first threshold, learning starts; it then stops, or at least becomes more difficult, when Otx2 reaches a second threshold. A similar mechanism (Werker & Hensch, 2015) might explain why the computation of the statistics of the native phonetic environment can only begin after a certain maturational age (probably after 35 wGA, when the migration and maturation of interneurons are sufficiently advanced, although no study has yet examined this point) and stops at another maturational age, around the end of the first year.
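This two-threshold logic can be made concrete with a toy simulation (our illustration of the idea, not Hensch’s or Werker and Hensch’s actual model; the accumulation rates and thresholds are arbitrary):

```python
# Toy model: a maturational factor (e.g., Otx2) accumulates with age;
# learning is possible only between an opening and a closing threshold.
# All numbers are arbitrary illustrations.
def critical_window(rate, t_open=1.0, t_close=3.0, max_age=200):
    """Ages (in arbitrary steps) at which the accumulating factor
    crosses the opening and closing thresholds."""
    level, start, end = 0.0, None, None
    for age in range(max_age):
        level += rate
        if start is None and level >= t_open:
            start = age
        if end is None and level >= t_close:
            end = age
            break
    return start, end

# A slower accumulation (a less mature network) shifts the whole window
# later, mirroring the preterm infants of Pena et al. (2012), whose
# window tracked postconceptional rather than postnatal age.
print(critical_window(rate=0.10))  # (9, 29)
print(critical_window(rate=0.05))  # (19, 59)
```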

Left and right hemispheric differences

Hemispheric differences are now clearly demonstrated in infants from the fetal period on (Bristow et al., 2009; Dehaene-Lambertz, Dehaene, & Hertz-Pannier, 2002; Dehaene-Lambertz, Hertz-Pannier, et al., 2006; Dehaene-Lambertz et al., 2010; Homae, Watanabe, Nakano, Asakawa, & Taga, 2006; Homae et al., 2011; Mahmoudzadeh et al., 2013; Minagawa-Kawai et al., 2011; Pena et al., 2003; Perani et al., 2011; Perani et al., 2010; Sato et al., 2012; Shultz et al., 2014; Vannasing et al., 2016). Sentences in the native language elicit bilateral activations, but with a stronger left-hemispheric response in most of these studies: at the level of the planum temporale in fMRI studies (Dehaene-Lambertz et al., 2002; Dehaene-Lambertz, Hertz-Pannier, et al., 2006; Dehaene-Lambertz et al., 2010) and, less precisely, over the superior temporal region in NIRS studies (Homae et al., 2011; Minagawa-Kawai et al., 2011; Pena et al., 2003; Sato et al., 2012; Vannasing et al., 2016). Responses to the native language are more left lateralized than the activations induced by other vocal sounds produced by humans and monkeys (Minagawa-Kawai et al., 2011; Shultz et al., 2014). By contrast, a foreign language and backward speech usually elicit a left-hemispheric advantage similar to that of the native language (Dehaene-Lambertz et al., 2002; Dehaene-Lambertz, Hertz-Pannier, et al., 2006; Dehaene-Lambertz et al., 2010; Minagawa-Kawai et al., 2011). This similarity suggests that it may be the fast temporal transitions contained in these stimuli that drive this lateralization.

Voice appears to be recognized in the right hemisphere (Blasi et al., 2011; Bristow et al., 2009), but Dehaene-Lambertz et al. (2010) reported a difference between the mother’s and an unknown woman’s voice both in the posterior left temporal region, attributed to better phonological representations for a known voice, and in the right anterior temporal region, which corresponds to the adult voice area proposed by Belin, Zatorre, Lafaille, Ahad, and Pike (2000). Differences in the temporal sensitivity of the left and right hemispheres have been proposed to be at the origin of the left-hemispheric advantage for linguistic processing and of the right-hemispheric advantage for voice and emotion processing in adults (Boemio, Fromm, Braun, & Poeppel, 2005; Giraud et al., 2007; Zatorre & Belin, 2001). Telkemeyer et al. (2009) addressed this hypothesis using NIRS in full-term neonates and showed that the auditory cortex presents different sensitivities to different temporal modulations of a complex acoustic stimulus. Over bilateral auditory cortices, the greatest response amplitude was recorded for the sound modulated at 25 ms, whereas deoxy-Hb responses to slower modulations (165 and 300 ms) were very focal, recorded over the right superior temporal location only. In older infants, the deoxy-Hb response was right lateralized at 6 months for fast and slow modulations, with no difference at 3 months. The oxy-Hb response was left lateralized for fast temporal modulations in both 3- and 6-month-olds, whereas for the slow modulations it was left lateralized in 3-month-olds but bilateral in 6-month-olds. Further studies are needed to better understand this complex pattern of responses.
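To make the stimulus manipulation concrete, here is a sketch of generic amplitude-modulated noise (the sampling rate, duration, and modulation shape are our assumptions, not the exact stimuli of Telkemeyer et al.):

```python
# Sketch: white noise with sinusoidal amplitude modulation at a given
# modulation period (25 ms = fast, 165/300 ms = slow). Parameters are
# illustrative assumptions.
import numpy as np

def am_noise(mod_period_ms, dur_s=1.0, fs=44100, seed=0):
    """Amplitude-modulated white noise."""
    rng = np.random.default_rng(seed)
    t = np.arange(int(dur_s * fs)) / fs
    f_mod = 1000.0 / mod_period_ms           # modulation frequency (Hz)
    envelope = 0.5 * (1.0 + np.sin(2 * np.pi * f_mod * t))
    return envelope * rng.standard_normal(t.size)

fast = am_noise(25)    # 40-Hz modulation: fine temporal structure
slow = am_noise(300)   # ~3.3-Hz modulation: slow variations
```

The carrier is identical across conditions; only the envelope’s time scale changes, which is what makes such stimuli a clean probe of hemispheric sensitivity to temporal structure.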

Thus, the exact speech features that drive these lateralized processes are not well understood. Because speech is encoded at multiple levels in different parallel pathways, which serve not only to interpret the message but also to recognize the speaker, interpret her emotions, and localize her spatial position, activations to speech are mainly bilateral; only comparisons along specific linguistic and voice dimensions can shed light on the lateralized processes. Furthermore, attention can affect the measured activations and the balance between hemispheres: A left- or right-hemispheric bias is reported depending on whether adults are instructed to pay attention to vowel identity or to voice pitch in the same syllables (Zatorre & Belin, 2001), and even in the same discrimination task with the same stimuli, depending on whether the contrast is linguistically pertinent in the subject’s native language (Gandour et al., 2002). The listening strategy determined by the experimental context can also bias infants toward one or the other hemisphere. For example, the right bias for music listening reported by Perani et al. (2010) was not observed by Dehaene-Lambertz et al. (2010): Perani et al.’s study comprised only musical stimuli, whereas in Dehaene-Lambertz et al.’s study, two thirds of the stimuli were speech.

As in other primates, the human infant’s auditory cortex is organized into parallel processing streams, which filter the incoming acoustic information at different time scales and with particular accuracy when encoding fast temporal variations. We propose that phonetic analyses might be channeled early on toward the left hemisphere because of an earlier maturation of a fine-grained temporal encoding network in this hemisphere. This early bias may subsequently favor this hemisphere for other linguistic processes because of time constraints on information transfer between nodes within a network. By contrast, information about the messenger, which relies on slower spectral variations, appears to be better processed in the right hemisphere. Learning one’s native language and recognizing one’s parents are both important for human communication. Assigning each communication channel to one hemisphere is a clever solution: Both channels benefit from a similar hierarchical architecture in the perisylvian areas, while the hemispheric heterochrony of the maturational calendar preserves different windows of environmental opportunity for each channel.

Immature but functional frontal areas

Historically, frontal areas in infants were assumed to be poorly functional because they were considered immature. However, brain imaging studies have revealed that they are involved in infant cognition from very early on. As early as 6 months of gestation, the inferior frontal regions react to a change in auditory series: on the left for a change of phoneme, on the right for both a change of voice and a change of phoneme. At 3 months postterm, an increase of activation (repetition enhancement) is observed when a short sentence is repeated and when delayed cross-modal matching of a vowel is required. At the same age, recognition of the prosodic contours of the native language activates the right dorsolateral prefrontal region in vigilant infants, whereas voice familiarity modulates the balance between the medial prefrontal regions and the orbitofrontal limbic circuit. The frontal lobe is thus not only active in infants but is also parceled into different regions that are distinctively engaged depending on the task, just as in older subjects. However, frontal regions react at a slower pace than later in life. ERP studies have shown that late responses, which depend on higher levels of processing, are disproportionately slower in infants relative to adults than is the infant-adult difference in early sensory responses. Electrical components proposed to be the equivalent of the adult P300 are recorded after 700 ms, and even around 1 second, until at least the end of the first year (Kouider et al., 2013). By contrast, the latency of the visual P1 reaches adult values at around 3 months of age. How this distortion of the dynamics within networks impacts learning should be further studied.

Inferior frontal areas are connected with the temporal areas through dorsal and ventral pathways. We can use the maturational heterogeneity of gray and white matter to reveal functional networks (Leroy et al., 2011). Because the T2 signal is sensitive to free water in the voxels, it changes with age, following the proliferation of membranes (dendrites, myelin, etc.) that accompanies maturation. Similarly, diffusion tensor imaging (DTI) provides measures of water molecule displacement (diffusivity) and of its directionality (fractional anisotropy), which are affected by the direction of the fibers and their myelination. We used these markers to follow the maturation of the perisylvian regions and observed that structures belonging to the dorsal pathway had a delayed maturation relative to the ventral pathway, a disparity that begins to disappear after 3 months of age (Dubois et al., 2015; Leroy et al., 2011). We propose that this catch-up is related to the increase in vocalizations and to infants’ progress in analyzing the segmental part of speech observed at the same age. Because maturation improves both local computations and the speed of connections between regions, the balance between networks may change with development, and particular patterns of maturation may thus reveal the crucial role of some circuits at a given moment in acquiring new competencies. Weighting the different pathways, and thus how they learn, through maturational lags at precise nodes of the perisylvian cortex might be a way to genetically control language development.
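As a reminder of what these DTI markers quantify (the standard textbook definitions, not an analysis pipeline from the cited studies; the example eigenvalues are illustrative), mean diffusivity (MD) and fractional anisotropy (FA) derive directly from the eigenvalues of the diffusion tensor:

```python
# Standard DTI markers from the diffusion tensor eigenvalues (l1, l2, l3):
# MD = mean eigenvalue; FA = sqrt(3/2) * ||l - MD|| / ||l||.
# With myelination, MD typically decreases and FA increases.
import numpy as np

def md_and_fa(eigenvalues):
    l = np.asarray(eigenvalues, dtype=float)
    md = l.mean()
    fa = np.sqrt(1.5 * np.sum((l - md) ** 2) / np.sum(l ** 2))
    return md, fa

# Example values (in mm^2/s) for a fiber-like vs. near-isotropic voxel.
print(md_and_fa([1.7e-3, 0.3e-3, 0.3e-3]))  # high FA: coherent fibers
print(md_and_fa([0.8e-3, 0.7e-3, 0.7e-3]))  # low FA: little directionality
```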

Although more studies are needed to determine how frontal regions contribute to language acquisition, several hypotheses, which are not mutually exclusive, can be proposed. The involvement of the inferior frontal regions and the dorsal pathway might provide infants with a long auditory buffer, which seems to be lacking in macaques (Fritz, Mishkin, & Saunders, 2005). A long buffer may favor the discovery of second-order rules by keeping track of segmental elements (Basirat, Dehaene, & Dehaene-Lambertz, 2014; Kovacs & Endress, 2014). Coupled with hierarchical coding along the superior temporal regions, this may favor the computation of chunks and increase sensitivity to deeper hierarchical structures and to algebraic rules, as was demonstrated in 8-month-olds (Marcus, Vijayan, Bandi Rao, & Vishton, 1999). Finally, the involvement of frontal areas outside the linguistic system may improve the infant’s focus on speech as a pertinent stimulus. Motivation and pleasure, as well as understanding the referential aspect of speech through social cues, have been shown to be important for speech learning (Kuhl, Tsao, & Liu, 2003). The activation of the dorsolateral prefrontal region in awake infants recognizing their native language, and of the medial prefrontal region when the voice is familiar, may well explain these behavioral observations.

To conclude, we have highlighted a few results to illustrate how brain imaging in infants might bring new elements to the discussion of the origins of language. These studies are still too scarce to back up strong theoretical models, and they have concerned only oral languages. Signed languages share many particularities with oral languages in regard to their neural bases in adults and their calendar of acquisition. Some peculiarities we have described, such as the capacity for fine temporal coding, might be useful only for oral language and thus perhaps constitute only an accessory element in language acquisition. Further studies are needed to understand how the hierarchical structure uncovered in the temporal regions when infants listen to speech might be recycled to process linguistic signs, and what exact code these regions compute. To discover the principles of the organization of the human brain and how it favors language learning, a better description of cerebral development is needed, as is the continued use of paradigms that exploit the possibilities of brain imaging. This approach is the only way to grasp what is shared with other animals and what is unique to the human lineage before exposure to the complex and educated human social world.