Introduction

The goal of the present article is (1) to chronicle briefly the discovery, organization, and response properties of the cortical dorsal stream in the Dual Stream model of speech processing (Hickok & Poeppel, 2000; Hickok & Poeppel, 2004; Hickok & Poeppel, 2007; Rauschecker & Scott, 2009), (2) to survey evidence suggesting that this stream is selective for laryngeal control rather than for speech gesture control more generally, and (3) to discuss the implications of this finding for understanding aspects of the evolution of language. We start with a brief overview of the Dual Stream model and then discuss each of the three issues listed.

The dual stream model

The Dual Stream model for speech processing holds that speech is processed along two cortical pathways: a ventral auditory-conceptual stream and a dorsal auditory-motor stream (Hickok & Poeppel, 2000; Hickok & Poeppel, 2004; Hickok & Poeppel, 2007; see also Rauschecker & Scott, 2009) (Fig. 1). The model has its roots in the classical aphasia models of the 19th century, in particular that of Wernicke (1874/1977) and later Lichtheim (1885), who both proposed an auditory-conceptual pathway and an auditory-motor pathway. Dual pathway models generally have been proposed to explain a wide range of facts in nonspeech sensory systems, including audition (Deutsch & Roll, 1976; Poljak, 1926; Rauschecker, 1998; Rauschecker et al., 1995), vision (Milner & Goodale, 1995; Ungerleider & Mishkin, 1982), and somatosensation (Dijkerman & de Haan, 2007). The prevalence of these models and the empirical coverage that they provide suggest that a cortical division of labor in sensory systems is a general organizational principle, likely driven by the need for distinct computational mappings onto conceptual versus motor systems (Hickok & Poeppel, 2007).

Fig. 1

(a) Schematic diagram of the Dual-Stream model. Phonological network diverges into two streams: a dorsal sensorimotor stream supporting speech motor control and a ventral sensory-conceptual stream supporting comprehension. The relation between the auditory-phonological and the articulatory network and the auditory-phonological and the conceptual network is mediated by distinct interface systems, the “sensorimotor interface” and the “lexical interface.” (b) Approximate anatomical locations of the Dual-Stream model components. Regions shaded pink represent the more bilaterally organized ventral stream. Regions shaded blue represent the dorsal stream, which is strongly left-dominant. Functional area Spt (Sylvian parietal temporal) is the posterior Sylvian blue shaded region.

Discovery of area Spt and the dorsal stream circuit

Hickok & Poeppel (2000) predicted the existence of a cortical auditory-motor interface system: “a reasonable hypothesis is that the inferior parietal lobe contains an important (but not necessarily the only) interface system mediating between auditory and articulatory representations of speech” (p. 135).

Subsequent fMRI experiments in my lab identified a region compatible with this hypothesis, which we termed area Spt to reflect its location at the back of the Sylvian fissure at the parietal-temporal boundary (Buchsbaum et al., 2001; Hickok et al., 2003; Hickok et al., 2009). Spt is part of a functionally identified larger network that includes the posterior superior temporal sulcus, the pars opercularis of Broca’s area, and a more dorsal lateral premotor site. Spt was found to respond both during the perception and (covert) production of speech; both sensory and motor response properties are characteristic features of sensory-motor integration areas in the dorsal visual stream (Andersen, 1997; Gallese et al., 1997; Grefkes & Fink, 2005; Milner & Goodale, 1995). Spt also was found to be relatively selective for vocally compared with manually mediated auditory-motor interactions (Pa & Hickok, 2008), to be functionally (Buchsbaum et al., 2001; Buchsbaum et al., 2005) and anatomically (Isenberg et al., 2012) connected to premotor regions involved in speech production, and to exhibit distinct activation patterns during auditory and motor phases of the task (Hickok et al., 2009), which argues against a purely auditory or purely motor explanation for its activation pattern. Damage to this region is associated with conduction aphasia, which represents an auditory-motor integration disorder (Buchsbaum et al., 2011), and more specifically with deficits in verbatim repetition of speech, including nonwords, which arguably places particularly strong demands on auditory-motor integration at the phonological level (Rogalsky et al., 2015).

On the basis of these observations as well as others not reviewed here, we proposed that area Spt functions as an interface network transforming auditory-based representations of speech in the posterior superior temporal sulcus into articulatory-based representations of speech and vice versa in the service of state feedback control of speech production (Hickok, 2012; Hickok et al., 2011; Houde & Nagarajan, 2011). This hypothesis and Spt’s anatomical location in/near the posterior parietal lobe fit well with the role of the posterior parietal lobe region in the visual dorsal stream and visuomotor integration (Andersen, 1997; Gallese et al., 1997; Grefkes & Fink, 2005; Milner & Goodale, 1995). Detailed hypotheses regarding the computations performed by the Spt network and the evidence behind the hypotheses are beyond the scope of the present discussion but are provided in several recent papers (Hickok, 2012; Hickok, 2014a; Hickok et al., 2011). Importantly, we conceptualized the object of this speech motor control to include both laryngeal and supralaryngeal articulators. Given that the vocal tract is composed of a number of independent articulators, some of which are under the (partial) control of somatosensory systems (Hickok, 2012; Tremblay et al., 2003), the claim that Spt is involved in the control of all of these action subcomponents may have been an overgeneralization. In what follows, we consider the possibility that Spt may be involved more specifically in laryngeal control.

Reconsidering the function of the Spt circuit

In retrospect, a clue that Spt may not be involved in control of the entire speech production system was evident in the 2003 paper that coined the term for the region, in which we reported that Spt activity was equally robust during covert speech production and covert humming (Hickok et al., 2003). This clearly showed that Spt was not speech specific. What is surprising is that the increased motor control complexity for speech compared with humming simple melodies did not result in any difference in fMRI signal amplitude in Spt. In fact, if anything, Spt activity was slightly greater for humming than for speech (Hickok et al., 2003), which is consistent with an earlier study of the more general planum temporale region showing greater activity to tone stimuli compared with speech (Binder et al., 1996).

A more recent study directly addressed the question of which aspects of the vocal apparatus are controlled by Spt (Isenberg, 2012). We measured neural activity in healthy participants while they tracked an externalized moving sound source either with their tongue (pointing the tip of the tongue continuously in the direction of the source) or with the imagined pitch of their voice. For the latter task, subjects were trained to produce a pitch overtly that corresponded continuously to the horizontal position of the sound source (low pitch = left, high pitch = right); in the scanner subjects performed the task without voicing to avoid contamination from auditory feedback. The results in 19 subjects were robust and clear. The pitch-tracking task (red in Fig. 2) activated Spt (circled) more than the tongue-tracking task (blue), which activated a somatosensory circuit more robustly. Even though the sound source that guided action was the same and both tasks involved the vocal tract, Spt activated differentially depending on whether the action involved the larynx versus the tongue. A similar finding was reported by Brown et al. (2008) who localized laryngeal primary motor cortex by comparing fMRI activation during glottal stop-like movements vs. lip protrusion vs. vertical tongue movements. When the authors looked at “incidental” activations in “non-motor areas” they found that presumed Spt (it was not independently functionally localized) activated during the glottal task but not during the lip or tongue task. In that same study, presumed Spt also was robustly activated in a singing-like task that minimized lip, jaw, and tongue movements, i.e., a task that required a varied sequence of laryngeal movements with auditory-motor feedback.

Fig. 2

fMRI activation contrast for “tracking” a moving sound source with the larynx versus the tongue. Red: larynx > tongue; Blue: tongue > larynx. From (Isenberg, 2012)

Implications for the evolution of language

Importance of laryngeal control in the evolution of speech

Speech production is dependent on the voluntary control of respiratory, laryngeal, and vocal tract musculature. Such control is present in humans but only partially present in nonhuman primates who appear to be able to control only supralaryngeal articulators voluntarily (Fitch & Zuberbuhler, 2013). For example, rudimentary forms of vocal learning—social group-driven acoustic modifications to natural vocalizations—have been documented in chimpanzees for sounds produced under supralaryngeal control (e.g., “raspberry” sounds) but not for sounds produced under laryngeal control (Marshall et al., 1999). Explicit training in the articulation of simple spoken words in a home-raised chimp showed that it is possible to learn to produce a small handful of intelligible words but only in a whispered form (Hayes & Hayes, 1951). Thus, the development of voluntary laryngeal control has been argued to be the “key innovation” in the evolution of speech (Jurgens, 2002; Kuypers, 1958a; Kuypers, 1958b).

Anatomical evidence is consistent with the “Kuypers/Jurgens laryngeal hypothesis.” Both humans and nonhuman primates have direct cortical innervation of motor neurons controlling the supralaryngeal vocal tract (Jurgens, 2002), whereas only humans appear to have direct cortical innervation of motor neurons controlling the larynx (Iwatsubo et al., 1990; Kuypers, 1958a; Kuypers, 1958b; Simonyan, 2014). Moreover, a recent diffusion tractography study in healthy humans and macaques reported a sevenfold stronger connectivity of laryngeal motor cortex (LMC) with inferior parietal and somatosensory regions in humans compared with macaques (Kumar et al., 2016). Interestingly, the center of mass for the inferior parietal target of LMC in this study is approximately 1 cm from the center of mass of area Spt calculated on the basis of an aggregate analysis of more than 100 fMRI scans using variants of our Spt localizer paradigm (Buchsbaum et al., 2011). Equally intriguing is that the same aggregate analysis of Spt localization identified another area with similar auditory-motor response properties in the precentral sulcus, which also is located approximately 1 cm from the location of LMC as indicated by a meta-analysis of 19 fMRI studies (Kumar et al., 2016). The rough correspondence between (1) area Spt and the “parietal” target of LMC from Kumar et al. and (2) LMC itself and the precentral sulcus activation found in the broader auditory-motor Spt network raises the possibility that the independently identified Spt and LMC circuits are one and the same. This further implicates the Spt circuit in cortical laryngeal control.

The Kuypers/Jurgens laryngeal hypothesis emphasizes the importance of direct cortical control of laryngeal motor neurons in the evolution of speech. However, research on cortical motor control circuits has shown that the frontal lobe cortical motor system does not work alone; it is dependent on sensory feedback control circuits (Hickok, 2014b; Shadmehr & Krakauer, 2008; Wolpert, 1997; Wolpert et al., 1995). Thus, the human brain must have evolved not only the required efferent motor pathway but also the cortical circuit for controlling those efferent signals. The hypothesis that I am advancing is that the Spt circuit evolved in step with the direct cortico-laryngeal control pathway and that together they represented a key advance in the evolution of speech.

Comparative analysis of the Spt region in human and non-human primates

A comparative analysis of the Spt region and its nonhuman primate homologues may provide important additional clues to the evolution of this circuit. For example, one might expect to see significant anatomical differences in humans compared with nonhuman primates in this region. Functionally, there are no doubt obvious differences in that the nonhuman homologue of the Spt region does not control speech, but one would expect to see some broad functional parallels nonetheless if the current organization of the region has a common evolutionary precursor. I point out below that indeed there are both some significant anatomical differences as well as some broad functional parallels between the Spt region and its presumed nonhuman primate homologues.

Spt is located within the Sylvian (lateral) fissure, typically at its posterior-most extent. There is individual variation in the location of the region, which is defined functionally, such that it can involve the posterior portions of the planum temporale and/or the parietal operculum, typically deep in the fissure, but can also extend laterally toward the crown of the superior temporal and/or supramarginal gyri. Cytoarchitectonically, this location corresponds to area Tpt (Galaburda & Sanides, 1980) (Fig. 3). Importantly, this region is not characteristic of auditory cortex. Galaburda and Sanides emphasize that Tpt “lacks specialty features of sensory cortex” (p. 609) and so should not be considered part of auditory cortex. The homologous area in monkey, also called Tpt, has similar cytoarchitectonic features and is accordingly not considered part of auditory cortex (Smiley et al., 2007; Sweet et al., 2005).

Fig. 3

Location and cytoarchitectonic organization of the human posterior Sylvian region. The location of the planum temporale on the posterior supratemporal plane is indicated in red outline on an inflated representation of the brain, which shows structures buried in sulci and fissures. The inset shows a close-up of the planum temporale region. Colors indicate approximate location of different cytoarchitectonic fields as delineated by Galaburda & Sanides (1980). Note that there are four different fields within the planum temporale, suggesting functional differentiation, and that these fields extend beyond the planum temporale. The area in yellow corresponds to cytoarchitectonic area Tpt, which is not considered part of auditory cortex. Functional area Spt likely falls within cytoarchitectonic area Tpt, although this has never been directly demonstrated. PaAi, internal parakoniocortex; PaAe, external parakoniocortex; Tpt, temporoparietal; PaA c/d, caudodorsal parakoniocortex. (Reprinted with permission from Hickok & Saberi, 2012)

The anatomical left > right asymmetry of the planum temporale in humans has long been a topic of discussion in the evolution of language (Geschwind & Levitsky, 1968) and much of this effect appears to be driven by Tpt asymmetry in particular (Galaburda et al., 1978). Similar asymmetries are not present in macaques (Lyn et al., 2011), although they are evident in chimpanzees (Gannon et al., 1998; Spocter et al., 2010), which significantly tempered enthusiasm about the relevance of the planum temporale asymmetry in the evolution of language. A more charitable view is that the comparative asymmetry data suggest that something indeed changed in the planum temporale region in the ape lineage sometime after its split with Old World monkeys, and this may have laid the groundwork for the evolution of speech. What this change might have been is currently unclear. There has been some discussion regarding the relation between planum asymmetries and handedness in both humans and apes (Hopkins & Nir, 2010), which may be part of the story, although this is controversial (Fitch & Braccini, 2013). In this context, it is worth noting that while Spt/Tpt appears to be auditory- and vocal-weighted in its sensorimotor function, it is not exclusively auditory or vocal in humans (Okada & Hickok, 2009; Pa & Hickok, 2008; Pa et al., 2008) or nonhuman primates (for Tpt, see below). This fact means that any observed anatomical differentiation in the planum temporale region, such as the emergence of left-right asymmetries, does not necessarily have to be attributable at each evolutionary step to vocal communication, although presumably anatomical features present in humans but not in nonhuman primates would be candidates for vocal communication-related differentiation. One such candidate is the finer-grained, left-right asymmetry in planum temporale minicolumn organization (wider column spacing in left than right), which has been found in humans but is absent in both macaques and chimpanzees (Buxhoeveden et al., 2001). The authors conclude that this “strongly infers a rewiring of the human PT between hemispheres” (p. 356), which is consistent with the functional asymmetry observed for Spt (Hickok et al., 2003).

From a functional standpoint, relatively little is known about Tpt in nonhuman primates, but a few studies exist. Unit recordings in macaque Tpt have shown the region to be multisensory with auditory inputs dominating (Leinonen et al., 1980). Interestingly, a fraction of the cells responded during the monkey’s own movements (mostly head rotation) and a majority of the auditory responsive cells were modulated by sound location in head-based coordinates. Although much current theorizing on the role of the dorso-caudal auditory network, including the planum temporale region broadly, has focused on its spatial location sensitivity (Rauschecker, 1998; Rauschecker & Scott, 2009), the functional picture of Tpt is consistent with the region serving an auditory-weighted sensorimotor function, specifically in controlling head movements and in processing head-related sound source location. Indeed, human research has reported that head movements induce fMRI-measured activation in this same general location, which is nonetheless largely distinct from auditory-responsive areas (Petit & Beauchamp, 2003). The important point here is that monkey Tpt and human Spt (assuming there is some anatomical alignment) appear to be performing similar computational functions but for different motor effector systems. Perhaps that region contains a number of auditory-motor integration subregions organized around different motor effector systems (head, larynx, etc.), similar to the organization of the intraparietal sulcus for visuomotor integration (Grefkes & Fink, 2005). Even less is known about the function of chimpanzee Tpt, although based on a PET study it appears to be responsive to conspecific vocalization (Taglialatela et al., 2009).

All of this is consistent with the existence of a cortical system already present in the primate brain that is well-suited from a computational-anatomic standpoint for a role in auditorily mediated laryngeal control. Specifically, there is a class of circuits in the posterior parietal cortex (visuomotor) and temporal-parietal junction (auditory-motor) that serve a sensory feedback control function for motor coordination (Desmurget & Grafton, 2000; Diedrichsen et al., 2010; Fogassi et al., 2001; Golfinopoulos et al., 2011; Hickok, 2012; Houde & Nagarajan, 2011; Perkell, 2012; Rizzolatti et al., 1997; Shadmehr & Krakauer, 2008). This provides a computational foundation for sensorimotor control in the primate brain. Evidence suggests that during primate evolution there has been a gradual differentiation of one functional-anatomical portion of the sensorimotor system specifically involving auditory-motor integration in the planum temporale/Tpt region with anatomical asymmetries emerging in great apes and with further microcircuit (Buxhoeveden et al., 2001) and anatomical connectivity (Kumar et al., 2016) differentiation emerging in humans. My suggestion is that human area Spt is the functional region that emerged from this differentiation and that it evolved to support voluntary laryngeal control.

Functional analysis of the utility of voluntary laryngeal control

The ability to control vocalization opens a number of fairly obvious communicative possibilities, which I won’t spend much time on. Briefly, signals can be generated or withheld as appropriate to the social and environmental situation; the ability to time the onset and offset of voicing adds a parameter to articulatory space and therefore expands the inventory of distinctive sounds that can be produced; and control of vocal pitch intensity increases the flexibility of emotional and pragmatic communication. It also may serve a higher-order computational function in speech production by providing scaffolding for phrasal speech planning, which is worth some discussion here.

It is well established that speech is planned not syllable-by-syllable or even word-by-word but in multiword chunks (Dell, 1986; Dell, 1995; Garrett, 1975). One open question concerns the nature of the planning frame (e.g., is it syntactic or something else?) and one interesting hypothesis in the context of the present discussion is that the frame is prosodic in nature (Shattuck-Hufnagel, 2015). A review of the behavioral evidence for this hypothesis is beyond the scope of this article, but it is worth noting that data on the effects of damage to the Spt region are broadly consistent with the claim. One complication with the previous idea that Spt serves as an auditory-motor interface for vocal tract control is that damage to the region doesn’t produce catastrophic deficits in speech production. Rather, the effects are limited to an increase in the phonological error rate, which can be quite mild, and a decline in the ability to repeat speech verbatim. In general, these deficits are worse when the planning unit is longer (multiple syllables or multiword) (Buchsbaum et al., 2011; Goodglass, 1992; Rogalsky et al., 2015). These effects (relatively mild impairment, planning load dependency) might be explained if Spt supports the planning scaffolding rather than control of within-word or within-syllable articulation. Specifically, disruption of the planning frame will decrease efficiency, forcing planning to occur more locally (e.g., word by word), but not obliterating the ability to generate connected speech. Or, in the context of state feedback control models, disruption of the auditory-motor interface component of a prosodic planning frame would make it difficult to detect and correct errors in the planning frame, thus increasing the likelihood of phonemic errors as syllabic and segmental information is inserted into the frame.

Summary

Evidence has been mounting for laryngeal control as a key advance in the evolution of speech and language. I have argued here that the much-studied human auditory-motor dorsal stream (Hickok, 2012; Hickok & Poeppel, 2007; Rauschecker & Scott, 2009), and Spt in particular, comprises the cortical circuit for laryngeal control, that the evolution of this circuit is the functional basis for the evolution of anatomical asymmetries in the planum temporale region, and that the circuit plays an important role in language processes by providing a prosodic frame for speech planning. Given the important role of laryngeal control in vocal music, this circuit also may have played a central role in the evolution of music (Fitch, 2006).