Introduction

Interest in language evolution has surged in the past two decades (Berwick, Friederici, Chomsky, & Bolhuis, 2013; Carstairs-McCarthy, 1999; Christiansen & Kirby, 2003; Dunbar, 1996; Fitch, 2010; Hauser, Chomsky, & Fitch, 2002; Pinker & Bloom, 1990; Pinker & Jackendoff, 2005; Számadó & Szathmáry, 2006; Tallerman & Gibson, 2011; Yang, 2013). This surge has been accompanied by fundamental progress in our understanding of this difficult multidisciplinary problem. While there have been naysayers who deny this progress, or the very possibility of progress in understanding cognitive evolution (Hauser et al., 2014; Lewontin, 1998), the purpose of this review and the special issue of which it is part is to demonstrate that such pessimism is unjustified. The reasons are simple: We have whole new classes of data that provide new insights into key issues and problems (e.g., paleo-DNA). The field also profits from a productive new interdisciplinary community that is constructively engaging with these problems (centered around the biennial EvoLang conference series), and from a flood of more traditional sorts of data (e.g., regarding animal cognition and communication, genetics, and neuroscience). This combination has led to increasingly sophisticated models of language evolution that make multiple testable predictions, and to improved evaluation criteria for assessing such models. The result, I will argue here, is an ongoing transition of scientific research on language evolution from a field dominated by speculation and pet hypotheses to “normal” science, marked by attempts to empirically evaluate multiple plausible hypotheses.

Despite the nonexistence of time machines, and the oft-mentioned fact that language does not fossilize, there is no reason in principle that models of language evolution need be any more speculative or untestable than those of other scientific disciplines that deal with singular chains of events in the distant past. Cosmologists interested in the Big Bang, or geologists studying continental drift and plate tectonics, are quite familiar with such difficulties. They proceed in their study of the past by examining present-day phenomena empirically, and using these data to evaluate explicit competing models of what might have happened when, and why. Assuming that certain principles of physics, chemistry, or geology remain unchanged, this enables practitioners in these fields to triangulate on adequate models and refine them by generating further testable predictions. This process has led, for example, to plate tectonics going from a speculative hypothesis, often ridiculed, to something universally accepted in modern geology (Gohau, 1990). There is little preventing the same general scientific process from being effective in the study of language evolution. We have a relatively clear endpoint of the process in the present, and can reconstruct the starting point (our last common ancestor with chimpanzees) in detail using the comparative method with existing species. Making the reasonable assumption that many of the biological principles underlying genetic, neural, cognitive, and behavioral traits have remained constant during the intervening 6 million years, and fed by the fragmentary but important data of the fossil record (and now “fossil” DNA), we enjoy essentially the same preconditions for progress as 20th-century geologists evaluating plate tectonics. This leads to the real possibility of fundamental advances in understanding language evolution in the coming years, building on the progress of the last few decades.

The present article will attempt to concisely summarize this progress and to provide a snapshot of language evolution research as it stood in late 2016. It will also serve as an overview of the current special issue. The article has five main parts. First, I provide a theoretical overview of the conceptual playing field, stressing the importance of a multicomponent approach to language, of strong inference over multiple plausible hypotheses, and of a comparative approach using behavioral, neural, and genetic data from a broad range of living species to inform our understanding of the mechanisms underlying language. These are general points that apply to any problem in cognitive evolution. I then turn in “What evolved?” to language specifically, focusing on three derived components of linguistic cognition that are not shared with chimpanzees, our nearest living cousins (vocal control, hierarchical syntax, and complex semantics/pragmatics). Part 3, “Models of language evolution,” gives a brief overview of some of the debates and models currently dominating the conceptual landscape.

Part 4, “Empirical data,” provides a more comprehensive overview of the data that are relevant to testing models of language evolution. I subdivide these into four broad classes: comparative biological data, fossil/archaeological data, neural data, and genetic data. The sheer abundance and diversity of these data are problematic, because few if any scientists are fully competent to evaluate them all. This means that many “facts” that are accepted and repeated frequently in the secondary literature do not stand up to serious scrutiny by the standards of their specific fields, so that the outsider may be left with the feeling that everything is contested (and therefore nothing can be taken seriously). While skepticism is certainly necessary when evaluating all data, I try to separate the wheat from the chaff and focus on results that seem most solid. Of these four classes of data, the most exciting are genetic data, and particularly paleo-DNA from extinct hominins, which offer the tantalizing hope of explicitly testing and rejecting predictions of current models of language evolution.

Finally, the fifth “Synthesis” part of the article attempts to do something I have resisted previously: It offers a comprehensive, testable model of language evolution, from our last common ancestor with chimpanzees to modern humans. I offer this model as an example that is both consistent with existing data and makes predictions about data yet to be gathered, and I compare it in detail with other contemporary models.

Theoretical framework and general overview: Studying cognitive evolution

I will first outline some general principles for studying cognitive evolution, including the need to subdivide any complex trait into component parts, the need to adopt a broad comparative approach to understand the evolution of these components, and the need to adopt a “strong inference,” hypothesis-testing framework to evaluate and test such models.

The multicomponent approach

The first and most obvious theoretical move in understanding the biology and evolution of a complex cognitive ability is to fully acknowledge the multiplicity of mechanisms that underlie it. This is no different in language than in, for example, vision (Hubel, 1988; Marr, 1982), music (Peretz & Coltheart, 2003), aesthetics (Leder, Belke, Oeberst, & Augustin, 2004), or social cognition (Fitch et al., 2010). While this perspective—the multicomponent approach—seems obvious in the case of vision, it has been oddly absent from many discussions of language evolution. Perhaps because of its unique nature, human language seems intuitively to invite “single cause” thinking, where some particular trait is singled out as “the key” to language and by extension to human uniqueness. Depending on the scholar, this favored trait may be speech (Lieberman, 1984, 2006), syntax (Berwick & Chomsky, 2016; Chomsky, 2010), or shared intentionality (Tomasello, Carpenter, Call, Behne, & Moll, 2005), but in each case one factor is emphasized and other relevant factors are downplayed. I believe that this widespread tendency toward monolithic thinking about language is one of the root causes of dissent in the study of language evolution, since once a particular factor has been chosen, other factors (and other scholars’ thinking) appear to be irrelevant.

The antidote to this persistent problem is to acknowledge that language is made up of multiple separable (but interacting) components, and undertake to analyze them. This set of components—the so-called faculty of language in a broad sense—is large, but divides naturally into two categories: those that are shared (sometimes widely) with other species and those that are recent acquisitions of the human lineage since our evolutionary divergence from chimpanzees. Obviously, since chimpanzees lack language, this subset bears a disproportionately important explanatory burden in understanding language evolution. But a trait may be novel in humans in this sense without being unique to our species. The human capacity for complex vocal learning is a case in point: though basically absent in chimpanzees or other primates (see below), it is shared with a diverse if scattered collection of other bird and mammal species (Brainard & Fitch, 2014; Fitch & Jarvis, 2013; Janik & Slater, 1997). This derived subset of language mechanisms is not synonymous with the “faculty of language in a narrow sense”—those traits that are unique to humans and unique, within humans, to language itself (Fitch et al., 2005; Hauser et al., 2002). I will refer to these core traits as derived components of language (relative to our last common ancestor with chimpanzees), or “DCLs.” By our current understanding of ape cognition and communication, the set of DCLs contains at least three separable components (see The derived components section): complex vocal learning, hierarchical syntax, and complex semantics/pragmatics (cf. the “three Ss” of speech, syntax and semantics in Fitch, 2010).

The broad comparative approach

Another key component of the framework advocated here is the use of a broad comparative method in studying the evolution of cognitive traits. In this section I will illustrate the comparative approach using the evolution of vision, which clearly demonstrates its value in a domain less controversial than language evolution.

The multiplicity of DCLs implies that each biological mechanism may have different genetic and neural substrates, and often different evolutionary histories. For example, in vision we can clearly separate color vision from the perception of form or movement, both in terms of the retinal and cortical mechanisms involved (Hubel, 1988; Livingstone, 2002) and their evolutionary history and timing (Jacobs & Rowe, 2004). A very rich source of data in understanding this evolutionary history is provided by a broad comparative approach, studying vision in a wide variety of species to develop and test hypotheses about the evolution of particular abilities.

For example, humans and closely related primates (e.g., chimpanzees and macaques) have trichromatic color vision (involving three different cone photo-pigments), in contrast to most other mammals, which have only dichromatic vision. One might thus infer that color vision is an “advanced trait” found only in relatively sophisticated species. But the broad comparative dataset, including many species from insects to fish to birds to New World monkeys, clearly demonstrates that this inference would be spectacularly incorrect. Indeed, it turns out that mammals are the outliers and that, in vertebrates, trichromacy, or even tetrachromacy, was the primitive initial state, and still typifies fish, lizards, and birds. During the Mesozoic, due to a primarily nocturnal existence, this rich color vision was lost in the ancestor of modern mammals, only to be regained by some primates in the last 10–20 million years (Jacobs, 1993; Jacobs & Deegan, 1999). What’s more, the repeated evolution of complex color vision provides important clues to the function of this trait (Kremers, Silveira, Yamada, & Lee, 2000; Vorobyev, 2004), and sometimes even reveals “deep homology,” where the same mutations in the same genes have occurred independently in clades as widely separated as primates and butterflies (Frentiu et al., 2007).

When comparing traits among species, biologists typically recognize two different classes of shared traits: homologies and analogies (technically, analogy is just a subtype of a grab-bag class termed “homoplasy,” which is essentially everything that is not a homology; Lankester, 1870; Sanderson & Hufford, 1996). Homologies are derived from a trait present in the common ancestor of the species in question; thus, homologies provide evidence for inferences about the existence of that trait in that common ancestor. Homologies are crucial for interpreting cognitive, neural, and genetic history, since such data typically leave no fossil traces. When some cognitive mechanism, neural circuit, or genetic sequence is observed in multiple close relatives, we can confidently infer its presence in their last common ancestor. Analogies, in contrast, are convergently evolved traits—they were not present in the common ancestor but arose independently in multiple lineages. This independence means that only analogies represent statistically independent data points for testing evolutionary hypotheses (Harvey & Pagel, 1991). We can thus legitimately use analogies to test both mechanistic hypotheses and functional hypotheses concerning adaptation and natural selection.
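This inferential logic can be made computationally explicit. The following is a minimal sketch, in Python, of ancestral-state inference using a standard small-parsimony algorithm from computational phylogenetics; the toy tree and trait scores are invented placeholders for illustration, not real comparative data.

```python
# A minimal sketch (not from the article) of ancestral-state inference by
# parsimony: the formal counterpart of "a trait shared by close relatives
# was probably present in their last common ancestor." The tree topology
# and trait scores below are illustrative placeholders only.

def parsimony_states(tree, states):
    """Bottom-up pass of a standard small-parsimony algorithm.

    tree:   a leaf name (str) or a tuple of subtrees.
    states: dict mapping each leaf name to a trait state (1 = present).
    Returns the set of most-parsimonious states at the root of `tree`.
    """
    if isinstance(tree, str):                      # leaf: state was observed
        return {states[tree]}
    child_sets = [parsimony_states(sub, states) for sub in tree]
    common = set.intersection(*child_sets)         # children agree: keep it
    return common if common else set.union(*child_sets)

# Toy example: trait scored as present (1) in all three living species.
tree = (("human", "chimpanzee"), "macaque")
observed = {"human": 1, "chimpanzee": 1, "macaque": 1}
print(parsimony_states(tree, observed))  # {1}: infer ancestral presence
```

Note that this inference is only licensed for homologies: a convergent trait such as vocal learning in humans and songbirds, absent in the intervening lineages, is more parsimoniously explained as analogy, which is exactly what makes it useful as an independent data point.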

In addition to these two traditional categories, the genetic revolution in the last decades has revealed a third type of similarity: deep homology (Shubin et al., 1997, 2009). Deep homology exists when two traits have evolved independently at a phenotypic level (i.e., the trait in question was not present in the common ancestor), but where the genetic and developmental mechanisms underlying the trait are nonetheless shared and homologous. The now classic case concerns complex eyes in insects and vertebrates, which were not present in the common ancestor (and are phenotypically analogies) but nonetheless rely upon many of the same regulatory genes (e.g., Pax-6) for their development (genetic homologies). Another nice example of deep homology is the importance of FoxP2 in vocal learning in humans and birds (Fitch, 2009b, 2011a; Scharff & Petri, 2011)—the same gene plays an important role in regulating vocal learning in these clades, but the ability itself was not present in the common ancestor of birds and mammals. Deep homology is important because there is increasing evidence that it is common and relevant in the evolution of cognition (Arendt, 2005; Parker et al., 2013; Salas, Broglio, & Rodríguez, 2003).

As these examples make clear, our understanding of cognitive evolution would be seriously incomplete if we focused exclusively on comparisons of humans with other primates (a narrow comparative approach). It is unfortunate that such limited comparisons were the primary source of comparative data concerning language evolution for most of the 20th century, despite a few dissenting voices (e.g., Nottebohm, 1972, 1975). Fortunately, the genomic revolution has led to a widespread recognition of the fundamental conservatism of gene function in very disparate species (e.g., sponges, flies, and humans; Coutinho, Fonseca, Mansure, & Borojevic, 2003), and there is a rising awareness that distant relatives like birds may have as much to tell us about the biology and evolution of human traits as other primates, or more (Emery & Clayton, 2004).

Strong inference and multiple hypotheses

The final general point concerns the need for simultaneous testing of multiple, plausible hypotheses. There is a long tradition in empirical research, stemming from null-hypothesis testing statistical approaches, that pits a single plausible hypothesis against a null statistical hypothesis that often has little a priori scientific plausibility. Null hypotheses include variants on “the data have no pattern,” “the data are randomly distributed,” “there is no relationship between two variables,” or “two categories do not differ.” Although statisticians have long criticized this approach (Anderson, 2008; Cohen, 1994), and newer approaches, including model-comparison and Bayesian methods, are rising in popularity, old habits die hard, and this approach often leads to a trivial “test” of some favorite “pet” hypothesis against an a priori implausible alternative. This “one hypothesis” approach has, unfortunately, been typical in writings on language evolution, which often simply ignore previous work, or stoop to disparaging alternative hypotheses with derogatory nicknames (e.g., the “bow-wow” or “ding-dong” theories), in a tradition initiated by Max Müller’s 19th-century attacks on Darwin (Müller, 1861).
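To make the contrast concrete, here is a minimal sketch (my own illustration, not from the article) of the model-comparison alternative: several plausible models are fit to the same data and ranked by an information criterion, rather than one favored hypothesis being tested against a null. The data and candidate models are fabricated purely for illustration.

```python
# Model comparison in miniature: fit several plausible models to the same
# data and rank them by AIC, instead of testing one hypothesis against a
# null. The "observations" below are invented for illustration only.
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)
x = np.linspace(1, 10, 40)
y = 2.0 * np.log(x) + rng.normal(0, 0.3, x.size)   # fabricated data

candidates = {                                      # multiple plausible models
    "linear":      lambda x, a, b: a * x + b,
    "logarithmic": lambda x, a, b: a * np.log(x) + b,
    "quadratic":   lambda x, a, b: a * x**2 + b,
}

for name, f in candidates.items():
    params, _ = curve_fit(f, x, y)                  # least-squares fit
    rss = np.sum((y - f(x, *params)) ** 2)
    n, k = x.size, len(params)
    aic = n * np.log(rss / n) + 2 * k               # AIC for least-squares fits
    print(f"{name:12s} AIC = {aic:6.1f}")           # lowest AIC is preferred
```

The point is the shape of the exercise: every candidate is taken seriously, penalized for its free parameters, and ranked against the others, so the outcome is a comparison among rivals rather than a verdict on a single favorite.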

Fortunately, an alternative approach has long been available: the method of empirically testing the predictions of multiple scientifically plausible hypotheses simultaneously. Especially when the set of hypotheses tested aims to be exhaustive, this method, building on Chamberlin’s (1890) “method of multiple working hypotheses,” has been dubbed “strong inference” (Platt, 1964), and when thoughtfully implemented it offers a much more efficient path to the resolution of scientific debates and apparently discrepant data. It is precisely this approach, and the steady accrual of consistent data, that transformed “continental drift” from a crazy old idea to accepted scientific theory by 1970 (Gohau, 1990).

Language evolution offers an ideal arena for strong inference because decades of speculation have led to many plausible hypotheses about how specific DCLs evolved, and in some cases detailed arguments about the order in which they appeared in human phylogeny. Similarly abundant models exist when we consider the cognitive and neural bases of language and their relationship to traits found in other species. This plethora of existing models (each of which at least one scholar deemed plausible enough to publish) means that we have quite a full roster of explanations and predictions concerning incoming data. Many of these models can be falsified by new data, especially when their predictions contrast with those from other hypotheses. And, as I will document in detail below, there are plenty of relevant data, with more coming in every day. The main problem for this approach is not with data or hypotheses, but sociological: There is no well-developed tradition of scholars in language evolution taking each other’s models seriously. Instead, the tradition has been one in which others’ models are ridiculed or (worse) ignored. Many of the articles in this issue illustrate that scientists are increasingly taking into account each other’s models, and a wide variety of data from many disciplines, when proffering their own hypotheses. And that represents real progress.

The role of simplicity

The role of simplicity and parsimony in considering alternative hypotheses presents a challenge. Obviously, the basic principle of parsimony, Occam’s razor (do not create unnecessarily complex hypotheses when simpler ones suffice), plays a role in all scientific discourse. But in biology, and evolution in particular, parsimony certainly never has the final say in adjudicating among hypotheses: Because evolution has no foresight, it often tinkers together solutions that are far from simple or elegant (Jacob, 1977). At best, parsimony provides a default preference for simpler hypotheses in the absence of further information, and this preference should be temporary (Fitzpatrick, 2008). Simplicity considerations can and should be trumped by actual data concerning the biological reality (cf. de Boer, 2016).

A deeper question, considered by several authors in this issue, is “what counts as simple?” (Chomsky, 2016; Johnson, 2016; Perfors, 2017). The core idea of the Minimalist paradigm in linguistics is that linguistic syntax can be reduced to a very simple but powerful core operation, Merge, which serves to combine lexical elements (Chomsky, 1995); this conception opened the door to inquiry into the evolution of such an operator (Chomsky, 2010, 2016). But simplicity at the computational level of description does not necessarily translate into implementational simplicity at the neural level (or vice versa), and simplicity of neural implementation is arguably the level most relevant to evolutionary discussions of rapid adaptive change (Johnson, 2016; Perfors, 2017). These are key open issues in contemporary discussions of language evolution (Berwick, 1998; Berwick & Chomsky, 2016), and they are not likely to be resolved until we know more about how genes, brains, and language are interrelated (Fisher, 2016; Ramus, 2006). For now, it seems prudent not to rely too heavily on parsimony in adjudicating between competing hypotheses about language evolution.

What evolved? Shared and derived components of language

The shared foundations

Language is a complex faculty that allows us to encode, elaborate and communicate our thoughts and experiences via words combined into hierarchical structures called sentences. Words are learned, and thus shared by communities, but differ across languages, and their form is mostly arbitrary (Saussure, 1916) despite a nontrivial amount of onomatopoeia and sound symbolism (Blasi, Wichmann, Hammarström, Stadler, & Christiansen, 2016; Sapir, 1929). Humans are born with a capacity to acquire the words and grammars of their local language(s)—an “instinct to learn language” (Fitch, 2011b; Marler, 1991). It is this capacity—sometimes termed “the language faculty”—whose evolutionary history or phylogeny we seek to explain when studying language evolution (rather than historical change or “glossogeny”; cf. below and Hurford, 1990).

The human faculty of language in this broad sense includes all of the various cognitive and physiological mechanisms that support the human capacity to acquire language; most of these mechanisms are shared with other species (the FLB of Hauser et al., 2002). Thus, despite the fact that language in toto is unique to our species, most components underlying it are shared, sometimes very broadly and sometimes only with a few other species. Because I have already discussed these many shared capacities in detail in other places (Fitch, 2005, 2010), I will only mention the highlights here.

There are several areas of very significant (nearly complete) overlap, and these form the backdrop for any discussion of the few remaining differences (the “shared foundations” of the language system illustrated in Fig. 1). These include basic physiological mechanisms involved in perception and motor control, along with various cognitive mechanisms involving learning, problem-solving, and memory.

Fig. 1 An architectural metaphor indicating the many different mechanisms underlying the human capacity to acquire language, illustrating that the vast majority are shared with other species (the “Shared Foundations”), and only a few represent derived characters of our species (relative to our common ancestor with chimpanzees: “Unusual Human Abilities”)

Starting with physiology, human auditory capacities are shared with most other vertebrates, including fish. The essential architecture and function of the auditory system, from the middle ear through the cochlea and up to cortex via multiple brainstem waypoints, is shared with other mammals, and there is little evidence of any fundamental differences between human hearing and that of most other familiar mammals, except that adult human hearing occupies a relatively low frequency range (roughly 20 Hz to 15 kHz), and many species go well beyond our nominal upper limit of 20 kHz. Although much has been made recently of differences in the shape of the chimpanzee and human audiogram (Martinez et al., 2013), in an attempt to use the middle ear bones of extinct hominins to reconstruct the evolution of speech perception, I am skeptical of the relevance of these data for two reasons. The first is that the primary determinant of hearing range and acuity is the cochlea, not the outer or middle ears (Ruggero & Temchin, 2002). The second is that the supposed differences between human and chimpanzee audiograms are based on a tiny sample of chimpanzees, which showed considerable differences between individuals (Elder, 1934; Kojima, 1990), including a distinctive so-called W-shaped audiogram (Kojima, 1990) that was not seen in the earlier study. Captive-housed animals often suffer noise-induced hearing loss, caused by exposure to loud vocalizations in a reverberant concrete environment, which may explain this “divot” in adult sensitivity. More data will thus be necessary to conclude that this is a real phenomenon in the chimpanzee species as a whole, rather than an isolated problem of one individual.

More important, behavioral tests indicate that both chimpanzees and other species (e.g., dogs) have excellent central abilities to process human speech at the phonetic and lexical levels (Savage-Rumbaugh et al., 1993; Kaminski, Call, & Fischer, 2004), and chimpanzees can even understand bizarre signals such as sinewave speech or noise-vocoded speech (Heimbauer, Beran, & Owren, 2011). A host of other auditory phenomena such as categorical perception or vowel “prototype magnets” have also been documented in other species from birds to chinchillas (Kluender, Diehl, & Killeen, 1987; Kuhl & Miller, 1978). Thus, there is little reason today to accept the old assertion that “speech is special” from a perceptual point of view (Liberman, Cooper, Shankweiler, & Studdert-Kennedy, 1967), or to think that new mechanisms of speech perception needed to evolve in the hominin lineage to support language evolution.

Turning to signal output, there are no clear differences between humans and other apes in terms of either vision or manual/facial control that would prevent an ape from learning signed language if the proper semantic and syntactic “software” were in place—one reason for the popularity of “gestural protolanguage” hypotheses (see below and Corballis, 2002; Hewes, 1973). Thus, discussions of possible differences have focused on speech output. The essential functioning of the human lungs, larynx, and tongue is again shared very broadly with other mammals, from bats to elephants, both in terms of anatomy and regarding the physics and physiology of vocal production (Fitch, 2000b; Herbst et al., 2012; Taylor & Reby, 2010). The human tongue is similar in anatomy to that of other apes (Takemoto, 2008), and we now know that a mild descent of the larynx and hyoid bone occurs in chimpanzees (Nishimura, Mikami, Suzuki, & Matsuzawa, 2006). Given that all mammalian species examined have a flexible capacity to lower the larynx during vocalization (Fitch, 2000a), and many species possess a descended larynx much more extreme than that in humans (Fitch & Reby, 2001; Weissengruber, Forstenpointner, Peters, Kübber-Heiss, & Fitch, 2002), it is clear that the importance of the much-discussed descended larynx for human speech has been greatly overestimated (Fitch et al., 2016). This means that most attempts to estimate the onset of speech abilities based on fossils are dead ends (reviewed in Fitch, 2009a), with the possible exception of thoracic canal size, indicating an increase in breath control (MacLarnon & Hewitt, 1999). As with hearing, the anatomy of the primate vocal tract was essentially “speech ready” whenever the neural control and cognitive capabilities evolved, as concluded long ago by both Darwin and Negus (Darwin, 1871; Negus, 1938). Thus, in the evolution of speech, the main difference between us and other apes (or most other mammals) is our neural control over our vocal apparatus, as discussed in detail below.

Regarding basic cognition, the list of cognitive differences between humans and other animals has grown steadily smaller (Vauclair, 1996; Tomasello & Call, 1997; Shettleworth, 2009). Those most relevant to human language are reviewed by Hurford (2007) and Fitch (2010). Shared systems include different forms of memory, from working memory to episodic memory (Emery, 2006; Inoue & Matsuzawa, 2007), a capacity for approximate number (Dehaene, 1992; Dehaene et al., 2015), and a host of more basic capacities including categorization, transitive inference, navigation, and planning. All of these capacities constitute a basic “cognitive toolkit” that we share with most other mammalian and avian species. More specialized capacities including tool use are also shared, not only with other primates but with a variety of avian and mammalian species (McGrew, 2004; Pruetz & Bertolani, 2007; Tebbich, Taborsky, Fessl, & Blomqvist, 2001; van Lawick-Goodall & van Lawick-Goodall, 1967; Weir, Chappell, & Kacelnik, 2004; Whiten, Horner, & de Waal, 2005). Finally, one of the classic supposed differences between humans and other apes was our possession of a “theory of mind”—a capacity to represent the beliefs and desires of other individuals (Povinelli & Eddy, 1996; Premack & Woodruff, 1978). Such a capacity, at a basic level, is now well-documented in chimpanzees and ravens (Bugnyar, Reber, & Buckner, 2016; Hare & Tomasello, 2004). Thus, despite an undoubted increase in intelligence during human evolution, the specific, empirically demonstrated cognitive differences that underlie this quantitative difference have grown increasingly scarce and seem to focus on syntax and semantics/pragmatics, discussed below (cf. Herrmann, Call, Hernández-Lloreda, Hare, & Tomasello, 2007).

Again, the short overview above is by no means an exhaustive list of language-related traits that we share with other species. But I hope it clarifies the essential point that most of the mechanisms involved in language acquisition evolved long before humans split with chimpanzees. These mechanisms were already in place when the few specific changes required for language evolved. As discussed below, all of these characteristics were preadaptations to language: mechanisms that are used in language, but did not specifically evolve for language (Fitch, 2012).

Derived components of language

We will now turn, in more detail, to those few changes that can be considered to be key innovations: derived components of language (DCLs) that were both required for language and evolved in a hominin context. These are derived relative to our last common ancestor with chimpanzees and bonobos (hereafter the LCAc).

Speech and vocal production learning

As discussed in detail elsewhere (Fitch, 2009a, 2010), there is little evidence that any major changes in the vocal apparatus itself were required for our ancestors to gain the capacity for speech. Nonetheless, the belief remains distressingly widespread that major changes in vocal anatomy, and particularly the descended human larynx, were prerequisites for speech. For example, many authors (Barney, Martelli, Serrurier, & Steele, 2012; Crystal, 2003; Lieberman & Blumstein, 1988; Raphael, Borden, & Harris, 2007; Yule, 2006) state that differences in vocal anatomy are responsible for the complete failure of chimpanzees raised in a human home to acquire even a few clearly understandable words (Hayes, 1951). As one example, “early experiments to teach chimpanzees to communicate with their voices failed because of the insufficiencies of the animals’ vocal organs” (Crystal, 2003, p. 402). This idea is refuted by virtually all modern data on mammalian vocal production (most recently Fitch et al., 2016), so its persistence in modern discussions is perplexing.

The origin of this idea apparently goes back to early studies by Philip Lieberman and his colleagues suggesting limitations in the range of vowels that could be produced by a macaque or chimpanzee (Lieberman et al., 1972; Lieberman, Klatt, & Wilson, 1969). These studies never claimed that the total lack of vocal control and vocal learning in these species could be explained only by vocal anatomy, and Lieberman’s most recent publications suggest that “a nonhuman SVT [supra-laryngeal vocal tract] can produce all vowels excepting quantal vowels” (Lieberman, 2012, p. 613). Thus, even the strongest adherents suggest that these changes represent adaptive “tweaks” to increase the intelligibility of speech, rather than being prerequisites for any form of speech. Other commentators find even this idea dubious (Boë, Heim, Honda, & Maeda, 2002; de Boer, 2010; Fitch, 2010; Nishimura et al., 2006), in part because nonlinguistic functions for changes such as a descended larynx are now known from nonhuman species (Fitch & Reby, 2001).

Thus, from a modern perspective, it seems clear that changes in complex vocal control, rather than vocal anatomy, were the key innovations required for the evolution of speech. This is fortunately an area where the comparative database is rich, because a level of vocal control enabling the production of novel sounds from the environment has evolved multiple times in birds and mammals (Fitch, 2000b; Janik & Slater, 1997; Jarvis, 2007; Reichmuth & Casey, 2014). The current list of gifted vocal learners includes some species of songbirds, hummingbirds, parrots, cetaceans, seals, bats, and elephants. The conclusion that vocal anatomy is not central to speech is supported by the well-documented cases of speech imitation by nonhuman species such as birds and elephants (Klatt & Stefanski, 1974; Pepperberg, 2005; Stoeger et al., 2012), whose vocal anatomy differs greatly from ours but who can still produce easily understandable words and phrases. In the recently documented case of an elephant imitating multiple Korean words, the animal appears to use movements of its trunk, inserted into the oral cavity, to vary its formant frequencies (Stoeger et al., 2012). In such cases it seems obvious that it is the species’ capacity to control its vocal tract, not its vocal anatomy, that is key. In fact, the only case of speech imitation by an animal whose anatomy is largely the same as in humans is the harbor seal Hoover, who apparently used his normal mammalian larynx and vocal tract to produce readily intelligible English words and phrases (Ralls, Fiorelli, & Gish, 1985).

Despite this relatively broad set of species or clades in which complex vocal production learning is known, humans remain the only primate known to be capable of vocal production learning. It is important to note that this is not to say that other primates have no vocal control, or that their vocalizations are produced robot-like given the proper stimulus—such a caricature does not apply to any bird or mammal. Virtually all species tested are able to bring one or more of their innate vocalizations under control (e.g., to produce a call upon a cue), including many primates (reviewed in Adret, 1992; Fitch & Zuberbühler, 2013). For example, Larson, Sutton, Taylor, and Lindeman (1973) were able to train four macaques to produce various species-typical calls (barks, grunts, and coos) on command. When reward was made contingent on producing longer calls, all the monkeys succeeded in doing so by switching to the longer coo calls. This is evidence of “call usage learning” but not vocal production learning. Recent data from chimpanzees, where the food calling behavior of newcomers to a zoo population gradually became more similar to that of previous residents (Watson et al., 2015), show nothing more than this type of call usage learning (and are not particularly convincing evidence of that; Fischer, Wheeler, & Higham, 2015).

Other recent evidence of intentional (voluntary) vocalization in chimpanzees showed that alarm calls fulfill several criteria for intentional communication developed in the ape gesture literature (Schel, Townsend, Machanda, Zuberbühler, & Slocombe, 2013). Again, however, this is evidence for intentional control over calls within the innate repertoire shared by all chimpanzees, and not evidence of vocal production learning. Indeed, given the strong and long-standing experimental evidence for similar audience effects in chickens (Evans & Marler, 1994; Gyger & Marler, 1988) and many other species, it is difficult to see why this finding received so much media attention. None of these new studies change the basic fact that chimpanzees and other apes cannot learn to imitate novel vocalizations from their environment; rather, they demonstrate a basic capacity for control over innate vocalizations that is found in most birds and mammals. This form of control is believed to result from “gating connections” from cingulate cortex onto the basic brainstem chassis that controls innate vocalizations (Jürgens, 2002).

The current leading hypothesis for the mechanisms underlying the increased vocal control necessary for vocal production learning is that such control requires direct synaptic connections from motor cortical regions (or their equivalent, area RA, in songbirds) onto the motor neurons controlling the larynx (or syrinx in birds; Deacon, 1992; Jarvis, 2004a; Jürgens, 2002; Kuypers, 1958; Ploog, 1988; Simonyan, 2014; Striedter, 2004). Decades of research show that such connections are lacking in most primates (Jürgens, 1998, 2002) but are present in both songbirds and humans (Striedter, 2004; Wild, 1997). These data show that in several groups in which vocal production learning has evolved independently, corresponding direct connections are present, and such connections are not present in relatives that are incapable of vocal learning. This exemplifies a mechanistic hypothesis derived from comparisons of humans and other primates, later tested in clades that have independently evolved an equivalent ability. Although the current data concern only parrots and songbirds, other clades of vocal learners (including both bats and seals) offer similar potential tests of this hypothesis.
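The logic of this prediction can be laid out explicitly. Below is a toy sketch (my own illustration, not a published analysis) that encodes the direct-connections hypothesis as a simple table paraphrasing the text: vocal production learning should co-occur with direct cortical projections onto laryngeal or syringeal motor neurons, and untested clades such as bats and seals yield forward predictions.

```python
# A toy encoding of the direct-connections hypothesis described above.
# Entries paraphrase the comparative evidence summarized in the text;
# None marks clades where the connectivity has not yet been established.
clades = {
    # clade:        (direct connections, vocal production learner)
    "humans":       (True,  True),
    "songbirds":    (True,  True),
    "parrots":      (True,  True),
    "macaques":     (False, False),
    "chimpanzees":  (False, False),
    "bats":         (None,  True),    # a predicted test case
    "seals":        (None,  True),    # a predicted test case
}

for clade, (direct, learner) in clades.items():
    if direct is None:
        print(f"{clade}: prediction -> direct connections should be present")
    elif direct == learner:
        print(f"{clade}: consistent with the hypothesis")
    else:
        print(f"{clade}: FALSIFIES the hypothesis")
```

The value of framing the hypothesis this way is that any one clade can, in principle, falsify it, which is precisely the strong-inference structure advocated above.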

The most intriguing new data suggesting sound learning in apes are fully consistent with the direct connections hypothesis (Marshall, Wrangham, & Arcadi, 1999; Reynolds Losin, Russell, Freeman, Meguerditchian, & Hopkins, 2008; Wich et al., 2009). These are data showing that novel nonphonated (nonlaryngeal) sounds can be controlled and learned by great apes, including such attention-getting sounds as a lip buzz (or “raspberry”) in chimpanzees and whistling in orangutans. Because great apes do have strong direct cortical connections with the brainstem motor neurons controlling the jaws and lips (Kuypers, 1973), this supports the hypothesis that vocal production learning requires such direct connections (Jürgens, Kirzinger, & von Cramon, 1982). Thus, it seems that it is specifically control over the larynx (the motor neurons for which lie in the nucleus ambiguus within the medulla) that is lacking in most primates and that therefore needed to evolve de novo during recent human evolution.

Hierarchical syntax

As often remarked, an unusual aspect of language is that it can be expressed via multiple modalities: both speech (audio-vocal) and written communication (visuo-motor) are typical of educated humans, and a smaller community communicates via signed languages (also visuo-motor). Although written language is to some extent “parasitic” on spoken language, scholars now universally accept that signed languages are full linguistic systems, as communicatively adequate as spoken language when acquired from birth (Emmorey, 2002; Klima & Bellugi, 1979; Stokoe, 1960). Clearly, language as a system for expressing thought is not limited to the audio-vocal channel (for further implications of this fact, see de Boer, 2016; Goldin-Meadow, 2016; Kendon, 2016).

What does seem to be both universal in human languages, regardless of modality, and required for their expressive power, is hierarchical syntax (Chomsky, 1957, 1965, 2016). Without this fundamental characteristic, our open-ended ability to map novel thoughts onto understandable signals would be impossible. We might have a somewhat useful language nonetheless (like beginners in a foreign language making their way without grammar), but we would be unable to express complex concepts precisely.

Because “syntax” and “grammar” are terms used in many ways, it is useful to characterize what, precisely, is the derived aspect of human syntactic abilities. It is certainly not the capacity to produce rule-governed behavior, nor a capacity to produce or interpret signals—virtually any vertebrate does these things. Many organisms can also make “infinite use of finite means” in the restricted sense that they can produce an unlimited variety of signal strings (think of an incessantly barking dog—no day’s barking will be precisely the same as another’s). But of course, to the extent that they “mean” anything, these bark strings all mean the same thing. Some birds do more than this, acquiring quite complex and structured song repertoires via vocal learning, with basic elements numbering in the thousands (Kroodsma & Parker, 1977) and an unlimited variety of orderings (Hultsch & Todt, 1989; Weiss, Hultsch, Adam, Scharff, & Kipper, 2014), and listeners show clear awareness of different song types (Naguib & Kipper, 2006; Naguib & Todt, 1997). While such complex repertoires are far from trivial, and their strings differ discriminably from one another, they do not communicate equally complex semantic messages. So the capacity to combine learned elements in a rule-governed manner—a basic phonological “syntax” of birdsong—is not sufficient to reach the level of human phrasal syntax and semantics (Marler, 2000).

Recent research using artificial grammar learning to probe the specific types of rules that organisms are capable of learning has offered a clearer picture (see ten Cate, 2016), and the use of formal language theory provides a useful metric and common description language for analyzing these rule types in terms of computational complexity (Jäger & Rogers, 2012). The most important distinction at present appears to be between the levels of regular or “finite state” grammars, some types of which are accessible to multiple animal species, and supra-regular grammars that go beyond this (including both context-sensitive and context-free grammars, sometimes termed “phrase structure grammars”). At present, there are no convincing studies showing the successful acquisition of a supra-regular grammar in a nonhuman animal: for every claimed success (Abe & Watanabe, 2011; Gentner et al., 2006; Rey, Perruchet, & Fagot, 2012), there is a convincing critique (van Heijningen et al., 2009; Beckers et al., 2012; Poletiek et al., 2016). The results of such research have been summarized by Fitch and Friederici (2012), and further explored in important recent work by Dehaene et al. (2015) and Wang, Uhrig, Jarraya, and Dehaene (2015).

Thus, the shared component of syntax includes capabilities within the regular or finite state domain (also adequate to account for phonological phenomena in language; Heinz & Idsardi, 2013). In contrast, the DCL component of syntax involves supra-regular computational capabilities like those underlying hierarchical linguistic structure (Fitch, 2014). From this perspective we could say loosely that animals have phonology but lack hierarchical syntax. Because both syntactic and semantic structures depend on hierarchical structure (rather than linear ordering), this is a crucial derived feature.
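The computational distinction at issue can be illustrated with two tiny recognizers, sketched below under the usual formal-language-theory idealization (the symbols and strings are illustrative only, not stimuli from any actual experiment). A pattern like (AB)^n can be accepted by a finite-state machine with no memory beyond its current state, whereas A^n B^n, a minimal stand-in for hierarchically nested structure, requires memory that grows with the string (here a counter, standing in for a stack).

```python
# (AB)^n is regular: a two-state machine with no memory suffices.
def accepts_ABn(s):
    state = 0                        # 0 = expecting A, 1 = expecting B
    for symbol in s:
        if state == 0 and symbol == "A":
            state = 1
        elif state == 1 and symbol == "B":
            state = 0
        else:
            return False
    return state == 0 and len(s) > 0

# A^n B^n is supra-regular: no fixed set of states can track n, so the
# recognizer needs unbounded memory (a counter, i.e., minimal stack depth).
def accepts_AnBn(s):
    count, seen_b = 0, False
    for symbol in s:
        if symbol == "A":
            if seen_b:               # an A after any B breaks the pattern
                return False
            count += 1
        elif symbol == "B":
            seen_b = True
            count -= 1
            if count < 0:            # more Bs than As so far
                return False
        else:
            return False
    return seen_b and count == 0

print(accepts_ABn("ABAB"), accepts_AnBn("AABB"))   # True True
print(accepts_ABn("AABB"), accepts_AnBn("ABAB"))   # False False
```

Artificial grammar learning experiments of the sort reviewed by ten Cate (2016) probe exactly this boundary: which side of it a species' learning abilities fall on is an empirical question, and the human capacity for the supra-regular side is what the text identifies as derived.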

Because many of the pieces in this special issue discuss syntax at length, I will not further belabor this: Without hierarchical syntax, we would not have modern language. Explaining how and why we attained this more sophisticated level of cognitive computation must be a central explanandum in understanding the biology of language.

Semantic and pragmatic components of language

The last, and least understood, DCL involves a poorly defined complex of abilities involving the context-dependent semantic interpretation of signals—semantics/pragmatics. This capacity is based on a complex and highly developed set of cognitive abilities for social cognition (Herrmann et al., 2007).

Again, there is a shared component of semantics: an ability to learn to interpret novel signals (words or gestures) is widespread. Chimpanzees, parrots, and domestic dogs are all capable of rapidly learning human words and mapping them onto real-world referents in a context-dependent manner (Kaminski et al., 2004; Pepperberg, 1981, 1991; Pilley & Reid, 2011; Savage-Rumbaugh et al., 1993). In the case of Alex the grey parrot, the words learned include quite abstract qualities such as number, shape, and color, not simple objects responded to reflexively (Pepperberg, 1999). In the case of Kanzi the bonobo, differing word orders can be reliably mapped onto different meanings (cf. Lyn, 2017; Pepperberg, 2016). Thus the capacity to acquire a basic lexicon, mapping signals to concepts, is an ability we share not just with chimpanzees but also with dogs and (some) parrots. Extensive fieldwork with primates (mostly with baboons) also indicates that the capacity to form quite complex and context-dependent inferences is present in nonhuman primates (see Cheney & Seyfarth, 2007; Seyfarth & Cheney, 2016).

The derived aspects of semantics/pragmatics concern more complex meanings than single word meanings. A precise delineation of what differentiates humans from other animals in this domain remains elusive and debated. One key area where there are both similarities and differences concerns Gricean inference—the use of context, principles of communicative intent, and theory of mind to derive inferences about the implicit meaning of what is said (Grice, 1975). There is broad agreement about the importance of such pragmatic interpretation in language (cf. Moore, 2016; Scott-Phillips, 2016; Sperber & Wilson, 1986), but debate about the degree to which the logical principles laid out by Grice are in fact cognitively represented by ordinary language users (Bar-On, 2013).

A long-standing tradition held that “theory of mind” (ToM) is uniquely human (Penn & Povinelli, 2007; Povinelli et al., 1990; Premack & Woodruff, 1978). However, a series of recent experiments with chimpanzees shows that at least a basic capacity to represent the knowledge of others is present in apes (Bräuer, Call, & Tomasello, 2007; Call & Tomasello, 2008; Hare et al., 2000), and very recent data based on eye-tracking suggest a capacity to represent false beliefs as well (Krupenye et al., 2016). These results indicate that a capacity for first-order ToM was already present in the LCAc. Of course, humans are able to entertain much more complex representations of others’ thoughts (e.g., second-order ToM—knowing what one agent thinks about another agent’s thoughts), and such higher order ToM is fundamental to most pragmatic inference. Thus, higher order ToM still appears to be a DCL that is key to semantic interpretation.

Another trait that appears to differentiate humans from other apes concerns not our ability to communicate, but rather our proclivity to do so, a proclivity for which I have borrowed the German word Mitteilungsbedürfnis (the drive to share thoughts). Apes given training with gestural or keyboard-based communication systems learn to use them to provide answers to questions in return for rewards (food, tickling, etc.; see Pepperberg, 2016; Savage-Rumbaugh, 1986; Savage-Rumbaugh et al., 1993). However, they very rarely use these systems to volunteer information themselves, except for requests (typically for food or tickling!). In contrast, children by age 4 are founts of information, pointing to objects and naming them, observing and commenting on the world around them, and in general using language to share information with others. It is easy to overlook the importance of this trait, but without it the free flow of information that provides a prime benefit of language would slow to a trickle. While not uniquely human (honeybees certainly have a strong desire to share information about food locations with one another), this trait does seem to differentiate us from other apes.

Finally, a fundamental aspect of human communication (not just language) is ostension—the signaling of signalhood—and there is an ongoing debate about the degree to which ostensive signaling is unique to humans or shared with apes (Moore, 2016; Scott-Phillips, 2014, 2016). At present, it seems likely that our human propensity to generate communicative acts, and explicitly mark them as such, is quantitatively more highly developed, and perhaps a true qualitative difference, but more research will be required to demonstrate this conclusively.

Thus, it seems clear that the complex of pragmatic/semantic “tools” available to humans is uniquely well-developed (with ToM, Gricean inference, Mitteilungsbedürfnis, and ostension all hypertrophied relative to other apes). But for none of these subcomponents do we have clear evidence of a definitive qualitative difference: it seems to be the whole package that evolved. This is why I treat this as a single DCL, although future research may cleave this complex into biologically separate components. There can be no doubt that these cognitive abilities are critical for language, but they also play a role in nonlinguistic aspects of human cognition (Scott-Phillips, 2014). They mark the third crucial domain of difference between humans and our nearest primate relatives (Herrmann et al., 2007).

Cultural aspects of language evolution

There is a growing consensus, well reflected in this issue, that cultural change has an important part to play in understanding language (Adger, 2017; Bowling, 2017; Kirby, 2017; Pagel, 2016; Steels, 2016). The study of cultural change has advanced rapidly in recent years, as part of a more general scientific focus on cultural evolution (Boyd & Richerson, 1996; Fitch, 2011c; Laland et al., 2010; Mesoudi et al., 2004). This research has recently taken a strong empirical turn, both in humans (Kirby et al., 2008; Morgan et al., 2015; Smith & Kirby, 2008) and in animals (Fehér, 2016; Fehér, Wang, Saar, Mitra, & Tchernichovski, 2009; Whiten et al., 1999). Because the term “language evolution” can be interpreted either in terms of the biological evolution of the language faculty or the cultural evolution of specific languages, Jim Hurford introduced the useful term “glossogeny” to denote the latter specifically (Hurford, 1990). Although cultural and biological evolution are sometimes considered as mutually exclusive competing explanations (e.g., Christiansen & Chater, 2008), they are increasingly seen as complementary: glossogeny can explain much of the odd, language-specific variability we see among different languages, and thus “takes the pressure off” biological explanations, which need not then account for the intricate details of particular languages (Chomsky, 2010; Deacon, 1997; Fitch, 2008, 2011d). The major review article in this issue by Simon Kirby gives an overview of the rapid advance of empirical and modeling approaches for understanding the nature and consequences of glossogeny (Kirby, 2017); Mark Pagel’s article provides a further overview of how the study of glossogeny has progressed in the last decade (Pagel, 2016).
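As one concrete illustration of what such modeling looks like, here is a toy iterated-learning chain in the general spirit of the work reviewed by Kirby (2017); every detail (the meaning space, the learning bias, the bottleneck size) is invented for brevity and makes no claim to match any published model.

```python
# A toy iterated-learning chain: an initially random (holistic) lexicon is
# repeatedly relearned through a transmission bottleneck, and a learning
# bias toward reusing part-signals for part-meanings makes the language
# increasingly compositional across generations.
import random

random.seed(1)
shapes, colors = ["square", "circle", "triangle"], ["red", "green", "blue"]
meanings = [(s, c) for s in shapes for c in colors]
syllables = ["ka", "po", "ti", "mu", "re", "zo"]

# Generation 0: holistic language -- each meaning gets an arbitrary,
# unanalyzable two-syllable signal.
language = {m: random.choice(syllables) + random.choice(syllables)
            for m in meanings}

def learn(observed):
    """Induce one syllable per shape and per color from observed signals,
    then produce signals for ALL meanings compositionally."""
    part = {}
    for (shape, color), signal in observed.items():
        part.setdefault(shape, signal[:2])   # first syllable read as shape marker
        part.setdefault(color, signal[2:])   # second syllable read as color marker
    for feature in shapes + colors:          # unseen feature: invent a form
        part.setdefault(feature, random.choice(syllables))
    return {(s, c): part[s] + part[c] for s in shapes for c in colors}

# Transmission chain: each generation learns from a bottlenecked sample
# (6 of the 9 meanings) of the previous generation's output.
for generation in range(5):
    sample = dict(random.sample(sorted(language.items()), 6))
    language = learn(sample)

print(language)  # shape and color are now marked by stable, reusable syllables
```

The point of such toys, and of their far more careful published counterparts, is that structure can emerge from repeated learning through a bottleneck without any change in the learners' biology, which is exactly why glossogeny can "take the pressure off" biological explanations.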

Models of language evolution: Hypothetical protolanguages

I now turn to a brief overview of proposed models of language evolution, and some of the core debates in its study. Although models like these are often termed “theories” of language evolution, I prefer the term “model” because “theory” in science typically connotes a far better tested and more widely accepted model than any existing model of language evolution.

Previous dichotomies and debates

There are three divisive distinctions that, in my opinion, have been more controversial than they deserve (cf. Fitch, 2010). The first concerns the internal (thought) versus external (communication) uses of language. As stated earlier, language is a complex faculty that allows us to encode, elaborate and communicate our thoughts and experiences via learned words combined into hierarchical structures (sentences). One of the persistent debates in the field has been whether the use of language for encoding and elaborating thought is primary (Chomsky, 2010; Newmeyer, 2005)—as an “inner tool”—or whether its use for communication was the consistent core function (e.g., Pinker & Jackendoff, 2005). I think that this is a misleading dichotomy. First, contemporary language clearly functions both in our inner mental lives and for communication, and it seems Procrustean to denote one as “primary.” Second, even our inner thoughts occur with a phonology (we can reasonably ask a bilingual whether they are presently thinking in English or German), and so the word forms learned via externalization still play a role internally (Jackendoff, 2011). Finally, we know that both animals and global aphasics can think (engage in complex cognitive processes) without language, so language is not necessary for thought to occur (see below). Nonetheless, we clearly use language for thinking, and most people probably spend more time using it internally than externally. I can see no major grounds for enforcing a distinction between these in modern human language; the only genuine issue is whether the original evolutionary function of some DCL was for thought or communication.

This leads to a second persistent debate, regarding continuity versus change in function. Darwin famously solved the problem raised by the evolution of complex organs with his idea that an organ of some complexity could evolve for one function, developing a certain degree of complexity, and then later change its function (Darwin, 1859; Gould, 1985). A trait reused in this way was long termed a “preadaptation,” but in recent years it has been dubbed an “exaptation” to avoid implications of evolutionary foresight (Gould & Vrba, 1982). This notion of change of function is a central, although often overlooked, plank of evolutionary thought (Pievani & Serrelli, 2011) because it can refute arguments of “irreducible complexity” (Denton, 1985)—that complex organs like wings or eyes could not have evolved from simple beginnings because they need to be complex to have any functionality at all. Applied to language evolution, it seems clear that syntactic combinatoriality and semantic compositionality might both have their roots in prelinguistic conceptual structures (Berwick & Chomsky, 2016; Bickerton, 1990, 2000a). This idea solves various other long-standing problems, including what I call the “lone mutant” problem (Fitch, 2010; Orr & Cappannari, 1964): Who would some hominin lucky enough to have more advanced syntactic competence talk to? The answer initially is “no one” (at least until offspring were born who shared the mutation), but the ability would nonetheless be useful in private thought. To accept this idea is not to claim that language isn’t (or wasn’t) used in communication—only to say that advances do not always need to serve immediate communicative functions (Fitch, 2011e). The idea that conceptual structure led the way is plausible enough that Newmeyer described it as the “consensus view” of the 1990s (Newmeyer, 2005), but no dichotomous either/or decisions are necessary about this issue.

A final persistent issue worth a brief mention is the saltation versus gradualism distinction (Berwick, 1997), which exemplifies a larger debate in evolutionary theory. Darwin was an extreme gradualist, believing that only the accumulation of small successive variants could lead to adaptation, but he was criticized for this viewpoint by many contemporaries (Eldredge & Gould, 1972; Gould, 1982; Theissen, 2009), and the belief that “Darwinism = gradualism” was one reason that many early geneticists opposed Darwinism (e.g., Bateson, 1894). But by the time of the modern synthesis of evolutionary theory and genetics, it became clear that there is a continuum of both tempo (rate of change) and mode (type of change) in evolution (Gould & Eldredge, 1977; Simpson, 1944), a viewpoint that has become ever clearer as genetic mechanisms have become better understood (Fitch & Ayala, 1994). At the level of DNA, all evolutionary change is discrete, because there are only four discrete bases in the genetic material. The fossil record often exhibits apparently discontinuous bursts of rapid change after long periods of stasis, but discontinuity on a geological time scale does not imply saltation over generations. Furthermore, the fact that a trait like language is discontinuous in terms of extant species carries no implication of phylogenetic saltation (it may have evolved very gradually over millions of years, but no fossils or extant species are left as records of that gradual transition). The only substantive issue in this debate concerns the size of phenotypic change associated with a mutation, and whether large changes can be (or are likely to be) improvements (cf. Fitch, 2010). It is certainly not implausible that small genetic changes or single mutations could lead to rapid and important rearrangements of neural circuitry (Ramus & Fisher, 2009). There is again no reason to assume that evolution always works in one way or the other: we should consider each trait and its genetic/neural underpinnings individually.

Conceptions of protolanguage

I now turn to issues that I believe are central to discussions of language evolution. Any modern model should take for granted the comparative data, and take as its starting point the LCAc, reconstructed via inferences based on comparisons with chimpanzees and bonobos. Although there is much we still don’t know about apes, this research is just “normal” comparative biology, and allows rather confident specification of the starting point of the six-million-year journey to modern human language. The other fixed point is the end stage: a current understanding of language and cognition, again normal science, but this time linguistics and cognitive science.

The challenging component is the intermediate period: here we are reliant on the sparse data of the fossil record, which is notoriously incomplete and controversial (see Paleontological data section), and which provides rather few restrictions (Tattersall, 2016). But based on brain size, ecology, and toolkits, it seems clear that the earliest hominins (e.g., australopithecines) did not have modern language, although they might have had some DCL precursors. Although things become less clear within the genus Homo, most experts accept that Neanderthals lacked at least some component of modern language (see Paleontological data section and Mellars, 1989; Tattersall, 2009). It is also clear that by the time modern humans dispersed out of Africa (by 60,000 years ago), we had the full package of modern language DCLs, since all humans around the world have the same essential capacity to acquire any language. Accepting these assumptions as a reasonable starting point, there is a period of roughly 2 million years during which most of the action must have occurred, with only a few anatomically distinct stages between Homo habilis and Homo sapiens. A complete model needs to offer explanations for how all of the empirically deduced derived components of language evolved during this period. Most existing models attempt to explain only some of the DCLs (e.g., speech or syntax, but not both), and few grapple with the entire package.

The existence of multiple DCLs leads logically to a notion of “protolanguage”—some hypothesized system of thought and/or communication that had some DCL(s) but not the full suite. The only way out of this conclusion is to state that some particular component of language was the key, and that “language evolution” amounts solely to the appearance of that chosen component (e.g., Berwick, 1997; Herrmann et al., 2007). Many of the previous debates in the field can be dissolved by recognizing that the models being debated attempt to explain different parts of the problem (syntax, speech, social cognition, etc.), and thus are in fact complementary (Fitch, 2010). Remaining debate should concern these differing conceptions of protolanguage, and explanations of when they existed, how they evolved, and why they were adaptive at the time. A key issue in these models becomes the ordering of acquisition of different DCLs.

Lexical protolanguage

This model has many variants; prominent exponents include Derek Bickerton (Bickerton, 1990, 2000b) and Ray Jackendoff (Jackendoff, 1999, 2010). The essential idea is that hominins first developed a word-based protolanguage that was learned, symbolic, and useful, and only later evolved a capacity for hierarchical syntax. Typically the birth of modern hierarchical syntax is seen as the final stage of language evolution, so this is a “syntax last” model. However, such models leave open the origins of the other DCLs (speech and semantic/pragmatic capabilities), except for predicting that vocal learning should have evolved before syntax. Although Bickerton’s version of lexical protolanguage is often considered definitive (or even simply assumed to be “protolanguage”), there are numerous variants on this basic idea (reviewed in Chapter 12 of Fitch, 2010). All of these models imply that syntax was a very late acquisition in language evolution.

Gestural protolanguage

The term “protolanguage” was first used in an evolutionary context by Gordon Hewes in the context of gestural protolanguage (Hewes, 1973)—the idea that during the first stages of language evolution, communication was via gesture, mime, and sign. Current proponents of this view include Arbib (2002), Corballis (2002, 2016), and Tomasello (2008). A key virtue of this model is that it takes the well-attested superiority of apes in gestural versus vocal communication as its starting point. Its prime flaw is that it has difficulties explaining why the transition to speech, which came later in the evolution of language, was so complete (Emmorey, 2005; MacNeilage & Davis, 2005; Seyfarth, 2005), as further discussed by Arbib (2016) and Kendon (2016) in this issue. The clear prediction of this model is that syntax and semantics preceded speech during evolution.

Musical protolanguage

An alternative model for the earliest stages of language evolution, due to Darwin (1871), is that the first DCL to be acquired in phylogeny was the capacity for complex vocal learning. On the model of birdsong, Darwin suggested that this vocal learning capacity was initially used not for communicating complex propositions, but rather to produce complex vocal performances (cf. Fitch, 2013); a prominent current champion is Mithen (2005). Despite its Darwinian origins, this model has a checkered history, apparently having been forgotten and then independently rediscovered multiple times (Brown, 2000; Livingstone, 1973; Richman, 1993). In many ways, Robin Dunbar’s “vocal grooming” hypothesis is consistent with the musical protolanguage hypothesis (Dunbar, 1996, 2016), although it extends beyond song to include more primitive vocalizations such as laughter (see also Dunbar, 2016; Locke, 2016; Provine, 2016). Although Darwin was quite vague about the later stage of semantics (and said nothing about syntax), his ideas were fleshed out by the linguist Otto Jespersen (1922), who suggested plausible roots for both semantics and syntax (via analysis of previously holistic utterances; cf. Botha, 2009; Tallerman, 2008; Wray, 1998). For critiques of the musical protolanguage hypothesis see Steklis and Raleigh (1973) and Tallerman (2013).

Mimetic protolanguage

A model focused on the intermediate stages of language evolution, and which incorporates aspects of the previous models, is Merlin Donald’s mimetic protolanguage (Donald, 1991, 2016). As in musical protolanguage models, Donald envisions a crucial role for performative, group-defining rituals, initially devoid of propositional meaning, as an initial stage of language evolution, one he ties to Homo erectus. But mimetic protolanguage was an all-inclusive affair, including both gestural and vocal components, and it thus elides the distinction between gestural and vocal protolanguages (cf. Mithen, 2005). This is appealing in that gesture is present in both apes and modern humans, so we can assume it was present and playing a communicative role throughout hominin evolution. Many commentators side with the idea that pitting vocal and gestural models against each other creates a false dichotomy (de Boer, 2016; Goldin-Meadow, 2016; Kendon, 2016). Donald’s piece in the current issue lays out this model and its predictions concisely.

Summary

The key observation about these models is that they make testable predictions; in particular, each model makes contrasting predictions about the order of acquisition of DCLs. As emphasized in the next section, paleo-DNA can play a central role in testing these predictions. Thus, for example, the musical protolanguage model predicts spoken language as the first DCL to evolve, while gestural protolanguage suggests it was the last. Musical protolanguage models also suggest a close link between speech phonology and song, a prediction that can be tested using both brain imaging and genetic data. All of these models posit a prolonged period during which the hypothesized protolanguage was the main form of communication, often during the reign of Homo erectus/ergaster, suggesting that the DCL(s) involved should be more robust to brain damage, genetic abnormalities, and/or developmental delay (cf. Fitch, 2010). Precisely these sorts of testable contrasting predictions offer our best hope of moving this discipline beyond “story telling” and into real science, and provide grounds for believing that major progress in understanding language evolution can be made in the coming decades. I now turn to the sorts of data available for distinguishing between them.

Empirical data relevant to testing models of language evolution

A wide variety of data are directly relevant to understanding language, most obviously those stemming from cognitive science and linguistics, including developmental, comparative, and historical linguistics. Such data have provided a reasonably clear picture of what language is and how it is acquired during infancy and childhood. The core findings, as taught in introductory courses in linguistics or psychology of language, have been reviewed in detail in many places (e.g., Crystal, 2003; Yule, 2006). Although the correct theoretical and philosophical framework for understanding this picture remains a topic of discussion (see, e.g., the articles by Arbib, 2016; Chomsky, 2016; Jackendoff & Wittenberg, 2016; Scott-Phillips, 2016), it all concerns modern humans and thus defines the “end target” of any evolutionary model, rather than the steps required to get to this point. It is only rather recently that detailed theoretical models of modern language have been used to fuel hypotheses about how language evolved (e.g., Berwick & Chomsky, 2016; Givón, 2002; Hurford, 2011; Jackendoff, 2002, 2010; Scott-Phillips, 2014), though earlier efforts include Bickerton (1990) and Pinker and Bloom (1990).

Comparative cognition and cognitive biology

In my opinion, the data that still remain most under-utilized in analyzing the biology and evolution of language are comparative data from nonhuman animals (“animals” hereafter), particularly those from nonprimate species such as birds, bats, or dogs. There is of course a long tradition of using comparisons between humans and other primates, although even these have tended to be superficial, or to assume that because some trait is seen in some monkey species (e.g., vervet alarm calls), it was present in our lineage before the evolution of language. In fact, a key role of primate comparisons is to reconstruct the LCAc, our last common ancestor with chimpanzees and bonobos, in detail. This common ancestor was not a chimpanzee, but an extinct species for which we have no fossil evidence. It is important to note that both humans and chimpanzees have been evolving since this split, and we thus find both behavioral/physiological traits (e.g., sexual swellings) and genetic differences that are due to chimpanzee evolution, for which humans retain the ancestral or “primitive” state (cf. Pääbo, 2014). Thus, reconstructing the LCAc requires broader primate comparisons (e.g., with orangutans and gorillas) to determine the “base state” from which we started. A more detailed characterization of the LCAc is found in Fitch (2010).

Beyond defining this starting point, we can also use comparisons with an ever widening group of related species to determine homologous traits shared by larger groups (see The broad comparative approach section). This homology-based approach allows us to rebuild our earlier and earlier common ancestors (e.g., with primates, mammals, tetrapods, vertebrates); the comparative method used in this way provides the biologist’s equivalent of a time machine, and (particularly when combined with genetic data) allows us to say with certainty when and how particular cognitive capacities arose during evolution (see section “The Long Time Scale”). This basic approach has been nicely captured in Richard Dawkins’ very accessible The Ancestor’s Tale (Dawkins, 2004), and is further discussed or illustrated by several articles in this issue (Byrne & Cochet, 2016; Fischer, 2016; Lyn, 2017; Seyfarth & Cheney, 2016).
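To make this homology-based reconstruction concrete, here is a minimal sketch of the standard parsimony logic used to infer ancestral character states from the distribution of a trait across living species. The tree topology is the uncontroversial great-ape phylogeny, but the trait scoring, names, and function are purely illustrative assumptions, not data from any particular study:

```python
def parsimony_states(tree, leaf_states):
    """Bottom-up parsimony pass: infer the set of equally parsimonious
    character states at each node of a binary tree.
    `tree` is a nested 2-tuple of leaf names; `leaf_states` maps each
    leaf name to its observed trait state."""
    if isinstance(tree, str):  # a leaf: its state is simply what we observe
        return {leaf_states[tree]}
    left, right = (parsimony_states(child, leaf_states) for child in tree)
    # Keep the intersection where the children agree, else the union
    return (left & right) or (left | right)

# Toy example with an invented trait scoring: if a trait is present in all
# living apes but absent in humans, parsimony infers it was present in the
# shared ancestor, and that the human state is derived.
tree = ((("human", ("chimp", "bonobo")), "gorilla"), "orangutan")
states = {"human": "absent", "chimp": "present", "bonobo": "present",
          "gorilla": "present", "orangutan": "present"}
print(parsimony_states(tree, states))  # {'present'}: inferred ancestral state
```

Real comparative reconstructions add branch lengths, probabilistic models of character change, and uncertainty estimates, but the underlying inference follows this same triangulation logic.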

A key point sometimes overlooked when considering animal data is that both animal communication and cognition are relevant. We cannot assume that, because a species does not show some capability in its communication system, it lacks it (cf. ten Cate, 2016)—obviously a cognitive capability could evolve for and be used in other systems, and the species in question may simply have no need to use it in its communication system. Also, if we accept that some capacities key to language may have evolved in contexts other than communication (and were later exapted during language evolution), then we need to consider data from animal cognition on an equal footing with animal communication.

The other major role of cross-species comparisons, as already discussed in “The broad comparative approach” section, is to determine analogies—traits that evolved independently in humans and animals. The evolution of vocal learning is perhaps the most obviously relevant to language, already highlighted by Darwin (1871) and further discussed by Hickok (2016), Locke (2016), and Vernes (2016) and detailed in articles in a recent special issue on animal communication and language (Brainard & Fitch, 2014). But equally interesting examples come from syntax (which can be broadly thought of as perception of patterns of various levels of complexity), where great recent progress has been made in defining what birds can do (cf. Fitch, 2014; Fitch & Friederici, 2012; ten Cate, 2016), or social cognition (cf. Bugnyar et al., 2016; Call & Tomasello, 2008; Fitch et al., 2010; Scott-Phillips, 2016; Seyfarth & Cheney, 2016), where the cognitive parallels of pragmatics and theory of mind can clearly be found in ravens. These examples of analogy provide ways to test adaptive hypotheses (e.g., the social intelligence hypothesis of Byrne & Whiten, 1988 and Humphrey, 1976, or the cooperative breeding hypothesis of Burkart, Hrdy, & van Schaik, 2009 and Lukas & Clutton-Brock, 2012)—both introduced with reference to primates—and to explore the mechanistic basis of these abilities in an independently evolved brain.

This special issue provides a very rich selection of comparative data, including work on nonhuman primates (Arbib, 2016; Byrne & Cochet, 2016; Fischer, 2016; Lyn, 2017; Seyfarth & Cheney, 2016), birds (Fehér, 2016; Okanoya, 2017; ten Cate, 2016), bats (Vernes, 2016), and a mixture of species (Pepperberg, 2016). I will thus delve no further in this introduction into the value and virtues of comparative data.

Neuroscientific data

It is sometimes said that we know almost nothing about how the brain computes language (e.g., Berwick & Chomsky, 2016). While this was true in the 1960s, when aphasia and other language disorders were the main source of data, it is far from true today. The last decade has witnessed massive advances in our understanding of neural computation in general (e.g., single neuron computation, predictive coding) and in the specializations of the human brain (e.g., the great enlargement of Broca’s area and massive increase in its connectivity via the arcuate fasciculus). Particularly welcome developments have led human brain imaging far beyond its first neo-phrenological stage (where “the area for x” is sought, x being language, syntax, social intelligence, love, etc.) to the clear recognition of the importance of brain networks (often spread widely throughout cortex) and connectivity in neural computation.

Prominent among these advances is the use of diffusion tensor imaging (DTI) and related techniques to map and measure the white-matter tracts connecting distant brain regions (Anwander, Tittgemeyer, von Cramon, Friederici, & Knösche, 2007; Catani & Mesulam, 2008; Friederici, 2009; Melhem et al., 2002). This allows in vivo analysis of the anatomical connectivity of different regions, and how they vary between species (e.g., Rilling et al., 2008). Such analyses are complemented by functional connectivity analysis (which uses cross-correlation in time across multiple brain areas) to evaluate which regions are coactivated in particular tasks (Sporns et al., 2004; Xiang et al., 2010; Hamilton et al., 2013), an approach also potentially applicable to inter-species comparisons (Rilling et al., 2007). Such approaches have led to important advances in our understanding of the neural computations that underlie language, including the still ongoing debate about whether any are specific to language or not (cf. Fedorenko & Thompson-Schill, 2014).
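As a toy illustration of what functional connectivity analysis computes, the sketch below correlates simulated regional activity time courses; the region count, simulated data, and function name are invented for illustration, and real analyses involve extensive preprocessing and statistical controls far beyond this:

```python
import numpy as np

def functional_connectivity(timeseries):
    """Toy functional-connectivity matrix: Pearson correlation between
    the activity time courses of every pair of brain regions.
    `timeseries` has shape (n_regions, n_timepoints)."""
    return np.corrcoef(timeseries)

# Simulated data: 4 regions, 200 time points; regions 0 and 1 share a
# common signal, mimicking two coactivated nodes of a task network.
rng = np.random.default_rng(0)
shared = rng.standard_normal(200)
data = rng.standard_normal((4, 200))
data[0] += shared
data[1] += shared
fc = functional_connectivity(data)
print(fc.round(2))  # entry [0, 1] is clearly elevated relative to the rest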

Another important development is the use of transcranial magnetic stimulation (TMS) and related techniques to noninvasively and temporarily inhibit or enhance brain function in specific, selected brain regions (Pascual-Leone, Walsh, & Rothwell, 2000; Rödel et al., 2009; Udden et al., 2008). These methods allow neuroscientists to actually test the causal hypotheses derived from brain imaging studies, which are typically correlational in nature. Finally, a surge in infant brain imaging has led to unprecedented insights into the surprising degree to which human neural specializations for language are already present and left-hemisphere biased at birth (or before, in premature babies), as ably reviewed by Ghislaine Dehaene-Lambertz in this issue.

Specific major recent advances in neuro-linguistics include the following findings:

  1. Data from severe global aphasics clearly demonstrate that language is not necessary for sophisticated thought (Donald, 1991; Fedorenko & Varley, 2016; Varley & Siegal, 2000).

  2. The neural basis of syntactic structure-building relies heavily upon an extended network centered on the inferior frontal gyrus, particularly Broca’s area (BA 44/45) (Dehaene et al., 2015; Friederici, 2016; Pallier, Devauchelle, & Dehaene, 2011; Wang, Uhrig, et al., 2015).

  3. This area is greatly expanded in humans relative to chimpanzees (Schenker et al., 2010).

  4. Its dorsal interconnections to parietal and temporal cortices via the arcuate fasciculus have been massively expanded in humans relative to other primates (Rilling et al., 2008).

  5. Much of this network is already present, and left-lateralized, before birth, as shown by brain imaging of premature and newborn infants (Dehaene-Lambertz & Spelke, 2015), demonstrating that prolonged exposure to speech is not necessary for the development of these human specializations (Dehaene-Lambertz, 2017).

  6. The neural basis of semantic composition relies on multiple temporal and frontal areas, with the anterior temporal lobe playing an important role in early semantic composition (Bemis & Pylkkänen, 2012; Pylkkänen & Marantz, 2003).

  7. Higher-order semantic combinations are built later, and the ventro-medial prefrontal cortex plays an important role in sustaining such temporary semantic structures for use in other brain regions (Bemis & Pylkkänen, 2013).

The brain imaging studies referred to above have been repeatedly replicated across many languages; they apply to spoken, written, and signed languages alike, and in many cases to both perception and production (Poeppel, Emmorey, Hickok, & Pylkkänen, 2012). While the dataset for cross-species comparisons is sparser, we have a rather clear conception of what precisely changed during the evolution of the human brain: In addition to a general size expansion, there were particular expansions (both in raw size and in terms of connectivity) of brain regions long known to play an important role in linguistic syntax.

Again, I need not belabor this topic in the introduction: it is covered by many experts in this issue (Arbib, 2016; Boeckx, 2016; Friederici, 2016; Hickok, 2016). Together, this body of research clearly refutes the misconception that we know little or nothing about the brain mechanisms underlying language.

Paleontological data

Another important, if sometimes over-estimated, source of data relevant to language evolution is the hominin fossil and paleontological record (the traditional term “hominid” for human ancestors was contrasted with “pongid”—a false, paraphyletic grouping containing all of the other great apes; “hominid” is now often used to refer to humans together with the other great apes, and the modern term for humans and our extinct relatives postdating our split with chimpanzees is “hominin”). While the fossil record provides important clues to such things as body size, brain size, and technological abilities, the inferences these allow about language abilities are tenuous, and remain controversial. All commentators agree that the hominin fossil record reveals a quite “bushy” phylogenetic tree, with multiple different species existing at any given time point; there was no simple linear march from Australopithecus to us. Nonetheless, I will only mention those extinct hominins believed to be within, or close to, our direct ancestral lineage.

The clearest evidence from the fossil record concerns the timing of bipedalism, large body size, and large brains. It is now clear that one of the first events in the hominin lineage was the habitual assumption of bipedal posture, and that this occurred before any major increase in brain size (Stringer & Andrews, 2005). Thus, members of the genus Australopithecus (australopithecines hereafter) had brains roughly the size of chimpanzees, but habitually walked upright. We can infer from tool use in living chimpanzees that the LCAc used simple tools of stone and plant materials, and it is not until the Oldowan period (starting about 2.6 million years ago, MYA hereafter) that we see more sophisticated stone tools with cutting edges, presumably produced by late australopithecines and certainly by early Homo. These tools, while certainly useful, are not far beyond the cognitive capabilities of existing chimpanzees (Toth, Schick, Savage-Rumbaugh, & Sevcik, 1993) and so provide no evidence of language in their makers.

The greatest changes in the hominin lineage occurred with the advent of the genus Homo, which clearly exhibited a suite of derived traits marking them as a truly new kind of primate. Brain size increased considerably (Holloway, Broadfield, Yuan, Schwartz, & Tattersall, 2004), sexual dimorphism decreased, and the toolkit became much more sophisticated. These were also the first of our ancestors to leave Africa, successfully colonizing much of the Old World, and they were thus far more flexible in colonizing and exploiting new niches than australopithecines. Homo erectus is the traditional moniker for the most widespread and successful of these hominins, although authorities often reserve erectus for the Asian members of this group and use Homo ergaster for those remaining in Africa. The Acheulean hand-axes that formed an important component of their durable toolkit were sophisticated tools, far beyond the capabilities of modern apes; indeed, we modern humans cannot make these tools without considerable practice and hard work. Despite this, hand-axe technology remained essentially the same for about one million years, strongly suggesting that Homo erectus/ergaster did not have the full cognitive and cultural capabilities of modern humans, and by inference did not have full modern human language. Because of this contrast between features resembling ours (larger brain and body size, sophisticated tools, ecological flexibility) and features unlike ours (cultural stasis over a prolonged period), these hominins are often considered prime candidates for possessing some sort of protolanguage, with some but not all DCLs of modern language.

Another increase in brain size and cognitive sophistication is represented by a suite of fossils often referred to as Homo heidelbergensis in a broad sense (Ruff, Trinkaus, & Holliday, 1997). These hominins produced excellently balanced wooden throwing spears (Thieme, 1997), and are thought to be close to the common ancestor of Neanderthals and modern humans (debate still rages over whether H. antecessor may be the more suitable moniker; Bermúdez de Castro et al., 1997; Endicott, Ho, & Stringer, 2010). In Europe and Asia, these hominins were succeeded by Homo neanderthalensis and a newly discovered species, the Denisovans (see below). Neanderthals had brain sizes identical to or exceeding those of modern humans, and their hunting practices and stone tool kits approached ours in complexity, but many crucial symbolic aspects of the artifacts of modern humans are found rarely or not at all in association with Neanderthals (Mellars, 1998b, 2004; Tattersall, 2016). However, despite the excellent fossil and archaeological record left by Neanderthals, and decades of discussion, their cognitive abilities remain highly controversial (e.g., Dediu & Levinson, 2013; Lieberman, 2007; Stringer & Gamble, 1993; Wynn & Coolidge, 2004)—ranging from “Neanderthals were just like us” to “they lacked key cognitive capacities including symbolic language.” Given their utter disappearance in Europe shortly after the arrival of modern humans from Africa, it seems more plausible to assume at least some cognitive differences. Fortunately, high-quality paleogenetic data are now available for both Neanderthals and Denisovans, data that offer clear hope of progress beyond this impasse.

Genetic and paleogenetic data

One of the most exciting empirical developments of the last decade, relevant not only to language evolution but more broadly to our understanding of the human condition, is in molecular genetics and genomics (cf. Fisher, 2016; Pääbo, 2014). Public databases containing detailed, whole-genome sequences for thousands of individuals are freely available, as are the genomes of thousands of nonhuman species, including our nearest ape relatives. We also have high-quality genomes for two extinct archaic hominins, Neanderthals and Denisovans. The fact that all of these resources are freely available means that science has entered a new era where vague talk of “mutations” can transition to hard facts about gene sequences and allele distributions. These new data also offer unprecedented ways to test hypotheses about language evolution and to uncover the timeline through which genes specifically involved in various DCLs changed and spread through our species. If a single empirical development warrants optimism and excitement about the coming decades of language evolution research, it is these advances in genetics and genomics.

Nonetheless, these data pose daunting challenges, even for the professional molecular geneticists and bioinformaticians who produce them, and especially for outsiders from linguistics or cognitive science for whom the data themselves and the tools to analyze them are terra incognita. It is unfortunate that a simplistic “gene for language” approach still tends to dominate the popular press and many scholarly discussions, when it is now clear that most traits involve multiple alleles, and that the function of any gene needs to be understood within the context of the many other genes with which it interacts. Thus, in the same way that today’s neuroscientists focus less on individual neurons and more on neuronal circuits (and less on specific brain regions than on global brain functional networks), the new genomic era forces us to think more in terms of genetic circuits, gene regulation, and interactions among multiple alleles than was typical during the pregenomic era. The multicomponent approach also provides a valuable scaffolding for understanding the specificity and/or generality of the effects of particular genetic changes. To these essentially conceptual realignments one must add the vast diversity of genes (the names alone—mostly barely pronounceable acronyms—are daunting), and the rapidly changing analytical approaches for studying them. Thus, integrating linguistic, psychological, and neural perspectives with the data rapidly pouring out of sequencing labs and onto the Internet is anything but trivial. But, as I will argue below, it will be worth it, because genes provide the closest thing to “fossils of language” we will ever have, and paleo-DNA in particular provides the analog of a time machine, taking us back nearly half a million years to the split between early Neanderthals and modern humans.

There are essentially three time scales at which genomic data are applicable. The longest is that studied by comparing humans with other living species, from bacteria to chimpanzees, and stretches from more than one billion years ago (10⁹ years) to our separation from chimpanzees and bonobos roughly 6 million years ago. The shortest time scale involves, potentially, the genomes of all humans alive today, and is important from a linguistic viewpoint because of the widely accepted observation that humans from all human cultures today have an equivalent capacity for acquiring language, and that the language acquisition system is thus a human universal (barring clinical cases). Examination of the differences among modern humans thus allows the exclusion of a vast amount of genetic diversity as essentially irrelevant to language. More interestingly, we can use signatures of selection derived from comparisons of modern human genomes to gain insights into when particular selective events may have occurred, though the accuracy of this method grows poor past 30–50 thousand years ago (Przeworski, 2002). Finally, and most excitingly, an intermediate time scale is offered by gene sequences from extinct hominins including Neanderthals, which push our time scale back to at least 500 KYA (kilo-years ago) and can be expected to push it back further with the acquisition of more fossil DNA. I now review findings from each of these time scales in turn.

The long time scale: Comparative genomics

Starting with the genomes of living species, we can trace evolution back many millions of years (for vertebrates) or even several billion years (for all living things). Using the comparative approach to compare human genomes with those of other species, we can reconstruct the evolution of most of human biology (because most of human biology is shared with at least some other species). Returning to eye evolution, trichromatic color vision is a trait that is rare in mammals but shared by humans and other apes (as well as various other primates). But it turns out that the ancestral vertebrate had at least trichromatic color vision (as evidenced by color vision in living fish, reptiles, and birds), and that the ancestral mammal lost one or more color-sensitive proteins, presumably as an adaptation to a nocturnal lifestyle (Collin & Trezise, 2004; Jacobs, 1993). Then, via a gene duplication event and subsequent functional divergence of the duplicates, trichromacy was regained in some primates, including our own ancestors. This nonintuitive evolutionary trajectory is just one example of how a comparative genomic approach allows the confident reconstruction of our evolutionary past in exquisite detail—and the truth is not simple. This approach puts the vast majority of our evolutionary history within reach, from the origins of life to our separation from chimpanzees about 6 MYA, and it allows us to confidently catalog the many components of our language faculty that predated the evolution of language in recent humans.

Of course, given the exceptionality of human language relative to ape communication, some of the most pressing questions concern the differences between humans and other apes. The most famous such genetic difference is the FOXP2 gene, a gene involved in oral motor control and speech, discussed in detail in Simon Fisher’s contribution to this issue (Fisher, 2016). But unfortunately, finding further clear differences will involve both luck and hard work, because there has been so much genetic change that it is difficult to come to grips with it all. Roughly 20 million single-nucleotide substitutions (involving about 40% of all proteins) differentiate chimpanzees from humans (Pääbo, 2014). Most of these changes are presumed to have no biological effect, but simply represent accumulated genetic “noise.” Thus, searching for the key functional genetic changes underlying human/chimpanzee differences involves searching for needles in a genetic haystack, requiring probabilistic tools to screen for regions of interest.

One approach is to use bioinformatics tools to seek out signatures of positive selection in the human genome, but due to the relatively rapid time decay of such signatures, this approach is limited to rather recent changes. Still, such comparisons have yielded several genetic changes of interest. A nonlinguistic example involves spines on the penis, which are present in chimpanzees but thankfully absent in humans: This loss appears to involve the loss of function of an enhancer of the gene involved in spine development (McLean et al., 2011). Similarly, inactivation of the gene MYH16, which codes for a muscle protein expressed particularly in the temporalis jaw muscle, may have led to the reduction in this muscle’s size and in jaw robusticity (Stedman et al., 2004). Finally, and more relevant to language, a similar loss of enhancer function led to inhibition of the GADD45G gene, which appears to limit cell division in the developing brain (McLean et al., 2011). Decreased activity of this inhibitory gene may be one of several changes involved in the expansion of brain size during hominin evolution. As these examples show, the loss of gene function may be just as important as gains of functionality for understanding human differences. They also illustrate the nonintuitive way genetic circuitry and interactions—“inhibit an enhancer of an inhibitor”—often underlie even apparently simple phenotypic changes.

The central quantitative measure of selection in comparisons between species is the dN/dS ratio: the ratio of nonsynonymous to synonymous base-pair changes. Because the genetic code is redundant, there are multiple three-base-pair codons in DNA that yield the same amino acid in the coded protein; changes among these are termed “synonymous” mutations. Only those DNA changes that actually change the resulting protein—the nonsynonymous mutations—are expected to be “visible” to natural selection. Thus, when the dN/dS ratio is high (an excess of protein-changing substitutions, above the neutral expectation of dN/dS ≈ 1), we can infer that the corresponding region has been under positive selection. This approach yields the strongest signal in cases where a particular region of the genome has been under continued selection, for example, in regions involved in disease resistance.
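To make the logic concrete, here is a toy sketch of the counting step behind a dN/dS comparison. A real analysis (e.g., the Nei–Gojobori method) additionally normalizes by the number of synonymous and nonsynonymous sites and corrects for multiple substitutions, all of which this sketch omits; the aligned sequences below are invented:

```python
# Standard genetic code, built from the usual NCBI table ordering (T, C, A, G)
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

def count_coding_changes(seq1, seq2):
    """Classify codon-level differences between two aligned coding
    sequences as synonymous (same amino acid) or nonsynonymous."""
    syn = nonsyn = 0
    for i in range(0, len(seq1) - 2, 3):
        c1, c2 = seq1[i:i + 3], seq2[i:i + 3]
        if c1 != c2:
            if CODON_TABLE[c1] == CODON_TABLE[c2]:
                syn += 1
            else:
                nonsyn += 1
    return nonsyn, syn

# Toy sequences: one synonymous difference (GGT -> GGC, both glycine) and
# one nonsynonymous difference (AAA -> GAA, lysine -> glutamate).
n, s = count_coding_changes("ATGGGTAAA", "ATGGGCGAA")
print(n, s)  # 1 1 -- the raw counts entering a dN/dS-style comparison
```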

Another approach to finding the functional genetic needles in the haystack of nonfunctional changes involves searching for so-called human accelerated regions, or HARs. These are regions of the genome that are, in general, highly conserved (e.g., among mammals or vertebrates), suggesting that they are functionally important, but have nonetheless changed considerably during human evolution. One such region involves a particular RNA sequence, HAR1F, that is expressed specifically in a subclass of human cortical neurons (Pollard et al., 2006) and may play a role in specifying the six-layered structure of the neocortex. This gene illustrates another surprise of the genomic era: the importance of “noncoding” RNA in biology and development. Classically, RNA was seen as little more than the middleman between DNA and the proteins that do the real work. But it is now clear that even RNA that is not translated into protein can, by itself, play myriad biological roles, and HAR1F is an interesting example, again involved in brain development.

Another theme of the genomic revolution is the importance of gene duplication in evolution. This can be either gene duplication with subsequent divergence (already mentioned above for color vision) or simply the making of more copies of a gene, so that more protein product ends up being expressed. A potentially language-relevant example of the first phenomenon is SRGAP2, a gene that has been duplicated three times in humans relative to other apes. The protein coded by one human copy is truncated, and binds with the normal protein in such a way that neurons expressing this truncated human form have a larger number of long dendritic spines (Charrier et al., 2012), potentially modifying the receptive field properties and/or operative time scales of neural circuits. As these examples show, multiple genetic differences between chimpanzees and humans, functionally relevant to human cognition and perhaps language evolution, can already be listed and studied at present, even though the total list of differences involves some 20 million base pairs.

The short time scale: Comparing human populations

Comparisons among living human populations and individuals can also play an important role in testing hypotheses about the genetic bases of language, and can offer clues about evolution. The most obvious role of contemporary diversity in DNA, as already mentioned, is negative: When some allele shows variation among living human populations, this variation can be assumed to be irrelevant to language, given that there is no known significant variation in the language faculty among living humans. Contemporary genetic data reveal, incontrovertibly, that “racial” differences between existing human populations are literally skin deep: they concern appearance, but no known genes involved in cognition or neural function (Pääbo, 2014).

Another important role concerns disorders with an inherited genetic basis, where individual variations can offer insights into the role of the genes involved in language development. The discovery of the FOXP2 gene, originally uncovered via a British family with a severe speech and language disorder (Fisher, 2016; Fisher, Vargha-Khadem, Watkins, Monaco, & Pembrey, 1998), is a well-known example. But there are many other disorders with genetic bases where genomes of afflicted individuals shed light on genes involved in language (I will call these “language-related genes,” or LRGs). Relevant disorders can either be rather specific to components of language—like the speech and language disorder associated with FOXP2, specific language impairment (van der Lely & Pinker, 2014), or dyslexia (Mozzi et al., 2016; St Pourcain et al., 2014; Wang, Chen, et al., 2015)—or they can be broader disorders like autism, which have important consequences for, but are not specific to, language (Graham & Fisher, 2015; Raff, 2014; Rodenas-Cuadrado, Ho, & Vernes, 2014). A particularly interesting result involves CNTNAP2, a gene coding for a neurexin specifically expressed in the human cortex and involved in cortical development. This gene was initially discovered due to its interactions with FOXP2, and variants in one portion of CNTNAP2 are associated with language disorders and autism (Arking et al., 2008; Rodenas-Cuadrado et al., 2014; Vernes et al., 2008). Clinical genetic research is now exploding due to the easy availability of individual genomes and the growth of individualized medicine, and the data that it generates will play a key role in understanding the complex genetic basis for language acquisition and processing.

One intriguing aspect of variation among living humans is that, due to the large size of the human population relative to the genome, virtually all mutations compatible with life probably exist, right now, in some human somewhere (Pääbo, 2014). Thus, mutations in any gene thought to be involved in language could potentially be sought out, and the resulting phenotype studied. Although this may seem far-fetched at present, the rapid decrease in price and increase in usage of whole-genome sequencing means that such targeted searching may well be within reach in a decade.

But the most potentially exciting application of data from living humans is its use to derive inferences about our (relatively recent) evolutionary past. A now-classic example is the analysis of the recent evolution of lactose tolerance in some populations (in northern Europe and parts of Africa), which results from genetic variants that became common only in dairy-farming populations—an excellent example of gene-culture coevolution (Burger, Kirchner, Bramanti, Haak, & Thomas, 2007; Tishkoff et al., 2007). Another nice example is the AMY1 gene, which codes for the starch-digesting enzyme amylase. In the ancestral state (e.g., in chimpanzees, Neanderthals, or some hunter-gatherers) there is a single copy of this gene, but most living humans today have multiple copies. The number of copies correlates with reliance on a starch-rich diet (probably initially due to exploitation of roots and tubers, and later due to agriculture), with up to nine copies found in some contemporary populations. Such population-level differences in allele frequencies can be used to test hypotheses about early human movements, such as the origins and spread of Indo-European language speakers into Europe and India, complementing historical linguistic data (Bouckaert et al., 2012; Cavalli-Sforza & Piazza, 1993; Gray & Atkinson, 2003). But again, none of these differences are likely to provide clues to genes involved specifically in language, because of the homogeneity of the human language faculty. Indeed, given that humans had already occupied many continents by around 50 KYA, with virtually no gene flow between (say) Australia and North America after then, no genes are likely to have achieved species-wide fixation since that time. Any genes fixed at the population, but not the species, level can thus be presumed to have little relevance to the biological basis of the language faculty.

Present-day variation can also hold clues to the more distant past, the most prominent examples being the various “signatures of selection” that result from strong selection on a particular gene. Various measures have been proposed to quantify such within-species positive selection, including Tajima’s D and the McDonald–Kreitman test (Nielsen, 2005). For example, strong positive selection can lead to a “selective sweep,” in which a particular allele goes to fixation (becomes present throughout the population, replacing any other alleles). Limited recombination around such genes means that the entire region of the chromosome surrounding the selected allele shows less variation than the background level (Hancock & Di Rienzo, 2008; Maynard Smith & Haigh, 1974). A careful analysis of such regions can support inferences about when in the past that gene went to fixation, and thus allow us to (roughly) order the acquisition of different alleles (Nielsen, 2005). Unfortunately, recombination leads this signal to decay quite rapidly in evolutionary terms (Przeworski, 2002), so the time depth of this method is limited and will not allow us to peer back, in humans, more than about 100,000 years. Both selective sweeps and other proposed signatures of selection also tend to confound demographic phenomena with selection (Nielsen, 2005); in the case of humans, we know that there was a population bottleneck (a small founder population of our species, in Africa) followed by a massive, ongoing expansion of the human population over the last 100,000 years, so demographic effects are expected to be strong. All of these factors limit the utility of such population-level analyses for understanding language evolution.
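For readers curious what such a signature looks like computationally, here is a minimal sketch of Tajima’s D, which contrasts two estimates of genetic diversity—the mean pairwise difference (π) and Watterson’s estimate based on the number of segregating sites—using the standard constants from Tajima’s (1989) formulation. Strongly negative values over a genomic region are one classic hint of a recent selective sweep. The haplotypes below are invented, and real analyses operate on many more sequences and sites, with the demographic caveats just discussed:

```python
from itertools import combinations

def tajimas_d(haplotypes):
    """Toy Tajima's D from a list of equal-length sequences (one per
    individual sampled from a population)."""
    n = len(haplotypes)
    # Segregating sites: positions with more than one allele in the sample
    S = sum(1 for column in zip(*haplotypes) if len(set(column)) > 1)
    if S == 0:
        return 0.0
    # pi: mean number of pairwise differences between sampled sequences
    pairs = list(combinations(haplotypes, 2))
    pi = sum(sum(a != b for a, b in zip(h1, h2)) for h1, h2 in pairs) / len(pairs)
    # Normalizing constants from Tajima (1989)
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n**2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    theta_w = S / a1  # Watterson's estimator
    return (pi - theta_w) / ((e1 * S + e2 * S * (S - 1)) ** 0.5)

# Toy sample of five short haplotypes with two segregating sites
haps = ["AATGC", "AATGC", "AATGC", "ACTGC", "AATGA"]
print(round(tajimas_d(haps), 2))  # roughly -0.97 in this invented sample
```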

The medium time scale: Paleo-DNA from archaic humans

All of the genetic data discussed above are exciting in that they provide unprecedented insights into evolution over long (comparative) or short (modern human) time scales, based on the genomes of living organisms. These data are central to attempts to understand the complex linkage between genotype and phenotype, which remains a key unsolved challenge in contemporary biology. But the time scales involved leave a gap between around 6 MYA (our divergence from chimpanzees) and about 100 KYA (the rough limit of backward extrapolation from existing humans). This is unfortunate, because this is precisely the time period within which language evolution occurred in our species, and whatever genes make language possible became fixed in the ancestral population(s) of modern humans. Thus, it is extremely auspicious that a new source of data is available for precisely this time period: “paleo-DNA” retrieved from the bones of now-extinct species.

We now have excellent-quality genomes for two extinct archaic hominins, the Neanderthals and Denisovans. These two related species of archaic humans, whose homelands were Europe and Asia, respectively, evolved there independently of modern Homo sapiens. Neanderthals have a rich fossil and archaeological record, while Denisovans were unknown until their genome was sequenced (and thus represent the first fossil hominin species to be discovered based on DNA alone). I will focus on Neanderthals here. Although the bones from which these genomes come are not, themselves, very old (about 50,000 years), they push back the time span covered by genomic approaches to roughly 500,000 years, when modern human populations split away from the Neanderthal/Denisovan lineage.

Because of the rich archaeological record for Neanderthals, we know quite a lot about their culture and its development over time (see this issue’s contribution by Tattersall, 2016). These were very robust and powerful humans, with brain sizes slightly larger in absolute terms than those of modern humans, and with a sophisticated toolkit that enabled a hunter-gatherer lifestyle in the tundra-like conditions of Pleistocene Europe. Despite the excellent record, and their obvious cognitive sophistication, Neanderthals did not show the exponential rate of cultural change characterizing our own species from about 80,000 years ago to the present (Mellars, 1998b, 2004; Tattersall, 1999). As already noted, most archaeologists and anthropologists conclude that, despite their many similarities to us, Neanderthals lacked some cognitive and/or linguistic features of our species (Gunz, Neubauer, Maureille, & Hublin, 2010; Hublin, 2009; Mellars, 1998a; Mithen, 2005; Schepartz, 1993; Shea, 2003; Wynn & Coolidge, 2004; for a dissenting view, see Dediu & Levinson, 2013). While I agree with the premise that Neanderthals lacked something, I think it unlikely that they lacked language entirely; rather, a multicomponent perspective suggests that they already possessed some aspects of modern language and cognition, but lacked one or more others. In other words, Neanderthals possessed some form of protolanguage, representing a crucial intermediate step between our common ancestor with chimpanzees and modern humans. The key question then becomes which components of language were shared and which differed, and this is a question that paleo-DNA data are uniquely capable of answering.

Analyses that compare Neanderthal and modern human DNA provide an invaluable window into language evolution. If we take just those genetic differences in which Neanderthals share the ancestral allele with chimpanzees, but all modern humans share a novel, derived allele, we come up with a rather short and manageable list (Prüfer et al., 2014): Although there are about 31,000 base pairs that differ, only about 3,000 are within gene regulatory regions and thus reckoned to influence gene expression, and there are only 96 fixed amino acid changes in 87 different proteins. A disproportionate number of these genes code for proteins involved in brain development, and thus represent plausible candidates for cognition- and language-related genes. Crucially (unlike the chimpanzee/human difference list), this list is short enough that each of these genetic differences can be explored using modern molecular methods (e.g., inserting the modern human allele into genetically engineered cell lines of mice and observing resulting phenotypic differences).
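The filtering logic behind such a list is simple enough to sketch. The variant-record format below is a made-up stand-in for real variant-call data, not the actual pipeline of Prüfer et al. (2014):

```python
def fixed_derived_in_humans(variants):
    """Keep sites where the Neanderthal genome retains the chimpanzee
    (ancestral) allele while every sampled modern human carries one and
    the same derived allele. `variants` is a list of dicts with keys
    'chimp', 'neanderthal', and 'humans' (alleles across a human panel)."""
    hits = []
    for v in variants:
        ancestral = v["chimp"]
        if v["neanderthal"] != ancestral:
            continue  # Neanderthal does not retain the ancestral state
        human_alleles = set(v["humans"])
        if len(human_alleles) == 1 and ancestral not in human_alleles:
            hits.append(v)  # a derived allele fixed in modern humans
    return hits

# Toy example: only the first record passes the filter.
variants = [
    {"chimp": "A", "neanderthal": "A", "humans": ["G", "G", "G"]},
    {"chimp": "A", "neanderthal": "G", "humans": ["G", "G", "G"]},
    {"chimp": "A", "neanderthal": "A", "humans": ["A", "G", "G"]},
]
print(len(fixed_derived_in_humans(variants)))  # 1
```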

Perhaps the most noteworthy early result of this paleo-DNA work is the finding that Neanderthals (and Denisovans) shared with us the modern derived variant of the FOXP2 gene, indicating (contra the original predictions of Enard et al., 2002, based on selective-sweep methods) that the modern variant evolved and was fixed at least 500 KYA, before the split between us and Neanderthals. Given the importance of FOXP2 in human oro-motor control, particularly of complex sequences (Alcock, Passingham, Watkins, & Vargha-Khadem, 2000; Vargha-Khadem, Gadian, Copp, & Mishkin, 2005), this suggests that Neanderthals already possessed some form of speech. Nonetheless, humans differ from Neanderthals in a regulatory region of the FOXP2 gene, a binding site for the transcription factor POU3F2 (Maricic et al., 2013), which suggests that some changes in FOXP2 expression, potentially relevant to spoken language, evolved after our split from Neanderthals. These results support the notion that Neanderthals had some components of spoken language, but not others, and that sophisticated vocal control and speech evolved earlier in human evolution than full modern language; they thus provide support for musical protolanguage hypotheses like that of Darwin and many others. They are evidence against the prediction of gestural protolanguage hypotheses that speech was a late acquisition, occurring only after other components of language (e.g., syntax and semantics) were already in place (cf. Fitch, 2010).

Despite this exciting progress in testing hypotheses about language evolution, clearly a single gene comparison cannot resolve the myriad debates revolving around such hypotheses, and many more LRGs will need to be understood before any firm conclusions can be drawn (cf. Mozzi et al., 2016). But the critical point illustrated by the fact that Neanderthals, unlike chimpanzees, shared our version of FOXP2 is that paleo-DNA allows us in principle to test specific evolutionary hypotheses and address long-standing debates in a way that, only a decade ago, would have been unthinkable. The remaining difficulties at this point concern not access to the genetic data, which are available free to everyone, but rather the still vexing complexity of the mapping between gene variants and the phenotypic traits of interest in language. The good news is that a very large research community, mostly made up of clinicians and molecular biologists with no specific interest in language evolution, is already hard at work resolving those issues. Every increase in our understanding of particular gene variants relevant to language, cognition, or the brain can now be immediately checked against the Neanderthal genome to see whether modern human variants evolved before or after our split from this extinct species (and to support further inferences about the kind of protolanguage Neanderthals spoke).

Synthesis: A staged-protolanguages model of language evolution

To show how the overall hypothesis-testing framework outlined above can be concretely put into action, I will now offer my own new model of language evolution, a model that proposes an order of acquisition for each of the key components introduced in the “What evolved?” section and explains in what contexts they were adaptive. I will build freely upon the previous models of protolanguage reviewed in the “Models of language evolution” section and offer concrete testable predictions based on the types of evidence discussed in the “Empirical data” section. My aim here is illustrative: to show that a model can be constructed that is consistent with all available data, and that makes clear testable predictions. Throughout, I aim to give due attention and respect to previous research, highlighting areas of agreement and disagreement with previous workers.

My model is composed of multiple separate hypotheses about key innovations in language evolution. It proposes four clear stages since our divergence from our LCA with chimpanzees, each of them associated with the acquisition of one or more specific novel capabilities: “key innovations” on the way to modern language (cf. Liem, 1973, 1990). I start with a brief overview of these proposed stages and then go into more detail about how my proposal differs from (or borrows from) those of others.

Stage 1 Vocal learning: Singing Australopithecus

The first stage involved the acquisition of vocal learning capabilities to generate learned vocal sequences without any propositional meaning: a “prosodic protolanguage” (I tentatively suggest that this stage was associated with Australopithecus).

Stage 2 Mimesis: Mimetic Homo erectus

In the next, crucial, stage, this vocal learning capacity combined with the preexisting gestural abilities of the LCA to form a richer and more elaborated mimetic protolanguage (Donald, 2016) that combined gesture and expressive learned vocalizations in shared group rituals and information exchange. During this stage, which I associate with Homo erectus, pressure to learn and elaborate both vocal and manual sequences left its traces in the more elaborate technology of the Acheulean—a one-million-year period during which this mimetic protolanguage was the main communication system, and these highly successful hominins expanded into the entire Old World. Because of these deep roots, clear traces of this stage remain evident in modern humans in music, dance, mime, and their expression in group rituals. The selective pressures during this stage were essentially about group bonding and mate choice among adults, but children would certainly have participated fully in these activities.

Stage 3 Propositional meaning: Semantic Homo antecessor

Although the previous two stages find clear analogs in the animal world, the next proposed stage is unique to the hominin lineage: The key innovation was the use of the complex, socially shared mimetic sequences to share propositional meanings for the first time. Previous mimetic sequences were meaningful only in the sense that music and dance are: they connote occasions and express moods or aesthetic trajectories, but cannot convey specific abstract thoughts. I suggest that this third stage was associated with the last common ancestor of Neanderthals and humans (Homo heidelbergensis/antecessor). The driving force for this key innovation, which added explicit semantics to a preexisting mimetic communication system, was the sharing of detailed propositional information with close kin, especially between parents and their offspring (this was thus a “mother tongue,” driven by kin selection to raise inclusive fitness; Fitch, 2004), with close similarities to what Kevin Laland terms “teaching” (Laland, 2016, 2017). At this point, the information-transmission capabilities of our ancestors made a huge leap, but protolanguage was still limited in communicative scope by its restriction to use among close kin, where honest communication could be trusted.

From a structural viewpoint, this protolinguistic system would retain the sequential structures typifying the previous mimetic stage, but these sequences would already reflect a certain amount of hierarchical structure stemming from the preexisting hierarchical structure of thought, without there being any specific syntactic markers of such structure. I posit that this propositional protolanguage continued to be the system used by Neanderthals and Denisovans. The crucial problem solved at this stage was “honest” communication of accurate information to others, the evolution of which is hard to explain among unrelated adults, but easy to understand in the context of knowledgeable adults sharing information with their offspring and close kin (cf. Fitch, 2004).

Stage 4 Syntactic Homo sapiens

The fourth and final stage—modern, fully syntactic language—occurred only in anatomically modern Homo sapiens, sometime between 200 and 80 KYA. In stage-three propositional protolanguage, sequences reflected hierarchical structure only accidentally: signalers attempting to express hierarchical thoughts might unconsciously provide cues to the hierarchy (e.g., via word order, or the pauses and prolongations typical of mimetic protolanguage). During this last stage, a rich social communicative “ecosystem” appeared within which children were embedded, creating competition for successful acquisition of the information in these signals and putting selective pressure on children to rapidly and successfully acquire and generalize them (cf. Locke, 2016). I suggest that it was during this last stage that human “dendrophilia” (Fitch, 2014)—our ability and propensity to perceive hierarchical structure in sequences—arose. While this “mental tree reading” ability would initially be selected in children, it would not disappear in adulthood, and it was in the new context of propositional information transfer among unrelated adults that the Machiavellian side of hierarchical mind-reading had its most telling effects, for it began the upward spiral of multiply embedded theory of mind, epistemic vigilance (essentially skepticism and distrust), and tit-for-tat reciprocal sharing of valued information among nonkin that still typifies our species today.

As will be clear to those familiar with the literature, the “SMSS” (song, mimesis, semantics, syntax) model proposed above builds upon and synthesizes many previous ideas about language evolution, and each of its individual stages, key innovations, and selective pressures has been discussed before. The first “prosodic protolanguage” stage is closely allied to Darwin’s musical protolanguage hypothesis. The key innovation of this first stage was vocal learning, driven by much the same pressures that drove vocal learning in many other vocal-learning species, and requiring, mechanistically, the acquisition of direct cortico-motor connections (Jürgens, 2002). During this “singing ape” stage, sequencing of learned signals was already in place, adding a selective pressure for rapid and accurate sequence learning of the kind typified by many modern songbirds (Jarvis, 2004b; Kroodsma, 2005; Marler & Slabbekoorn, 2004). But the signals produced, though complex and shared among groups, were not “music” in the modern sense (for discussion, see Fitch, 2013). In the second, “mimetic protolanguage” stage, vocal learning combined with preexisting gestural capabilities to yield a much richer communicative system; although this stage combines aspects of musical and gestural protolanguages, my model differs from gesture-first accounts in not proposing any point at which gesture alone was central: learned vocal displays were there from the beginning. In this I closely follow Donald, Kendon, and others, with acknowledgement of Arbib’s more nuanced “upward spiral” conception of gestural-vocal interaction (Arbib, 2005, 2016).

Once an elaborate, socially shared but nonpropositional communication system is in place, the major problem is how propositional semantics could be added to it. In Jespersen’s words (citing Wilhelm von Humboldt): “How did man become, as Humboldt somewhere defined him, ‘a singing creature, only associating thoughts with the tones’?” (Jespersen, 1922; von Humboldt, 1836). Humboldt’s original phrase was “because man, as a type of animal, is a singing creature, but with thoughts bound to the tones” (my translation, p. 76). This is the major problem faced by musical protolanguage theories, and there can be little doubt that both the mimetic potential of gesture and the onomatopoetic nature of sound helped pave the way for this key innovation (Blasi et al., 2016). The key questions then are “why only us?” (of all singing creatures, why are we the only ones to have done this?) and “what selective pressures?” (what would drive signalers to honestly communicate valuable information, and thus perceivers to attend to them?). I have previously argued that communication among kin was the key selective pressure on signalers, and that the existence of large bodies of learned knowledge was the key selective pressure on perceivers (Fitch, 2004, 2007). Kevin Laland extends this argument in the current issue, and although I find his term “teaching” somewhat misleading, since to me it connotes a conscious and formal intention to teach, we essentially agree concerning the basic evolutionary story and the problems solved by this approach (see also Nowicki & Searcy, 2014). To my knowledge, the idea that this “mother tongue” was the communicative system typifying Homo heidelbergensis/antecessor, and later Neanderthals and Denisovans, is novel.

The basic hypothesis that the final stage of language evolution involved the acquisition of full modern hierarchically embedded syntax is shared by many previous writers (notably Bickerton, 1990; Chomsky, 2010; Jackendoff & Pinker, 2005). While incorporating many of the insights of these authors, my hypothesis differs from theirs in the following ways. First, I do not posit any “one word” stage of language evolution: Protolanguages always involved sequences, and words were “condensed out” of holistic utterances by learners, rather than first appearing as isolated building blocks to be put together later. In terms of word origins, then, mine is an “analytic” model rather than a “synthetic” model (Arbib, 2005; Wray, 2000). Furthermore, I posit that much of the syntactic work needed for modern language had already been done via prolonged selection for sequence processing and generation in the previous stages: mimetic and propositional protolanguages already had essentially all of the neural mechanisms required for modern spoken phonology and prosody in place. Thus, the key innovation at this stage is dendrophilia (Fitch, 2014), a domain-general proclivity to perceive hierarchical structure that came to be applied not only to language but also to music and the decorative arts. Here I concur with Chomsky (2010), Berwick and Chomsky (2016), and Chomsky (2016) that it is the flexible and unrestricted capacity for hierarchical embedding that is central to our species’ cognitive uniqueness; I differ from them in seeing this capacity as flexible enough to immediately play a role in both structured thought and linguistic communication, as well as in other domains (it is only at this stage that music in its modern sense, replete with hierarchical structure, was born; cf. Patel, 2016). Elsewhere, I have explored the neural changes needed to achieve this final stage of dendrophilia (Fitch, 2014); in short, dendrophilia requires both extensive connections between temporal and parietal areas and the prefrontal regions surrounding Broca’s area (the arcuate fasciculus), and a great expansion of Broca’s area itself (Friederici, 2016; Rilling et al., 2008; Schenker et al., 2010). Finally, I agree with Keller (1995), Deacon (1997), Heine and Kuteva (2002), Steels (2016), Kirby (2017), and many others that much of the complexity evident in the syntax of modern languages has arisen repeatedly through grammaticalization processes of cultural evolution and required no further neural changes beyond those needed for dendrophilia.
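Because “dendrophilia” is at bottom a computational claim, a minimal sketch may help make it concrete (the toy tree and sentence below are invented for illustration). A hierarchically embedded representation carries information that its flattened surface sequence does not: a purely sequential learner receives only the word string, in which the long-distance link between “the dog” and “ran” is invisible.

```python
# Illustrative sketch: hierarchical structure vs. its surface sequence.
# The nested tuple encodes a center-embedded clause as a toy parse tree.

from itertools import chain

tree = ("S",
        ("NP", "the", "dog",
         ("RC", "that", ("NP", "the", "cat"), ("V", "chased"))),
        ("V", "ran"))

def leaves(node):
    """Flatten a tree into its surface word sequence, discarding structure."""
    if isinstance(node, str):
        return [node]
    _label, *children = node
    return list(chain.from_iterable(leaves(child) for child in children))

# A sequential learner perceives only this flat string; the embedding that
# connects "the dog" to "ran" (skipping the relative clause) is lost.
print(leaves(tree))
# ['the', 'dog', 'that', 'the', 'cat', 'chased', 'ran']
```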

In summary, the model proposed above provides a synthesis of many previous ideas about language evolution, incorporating multiple hypotheses from the literature about the order in which the different key innovations of language arose, what neural changes were needed, and why these were selectively advantageous at the time. This model makes a slew of testable predictions, and cashing most of them out requires only a better understanding of neural mechanisms and their genetic basis (normal science, conducted with living organisms); with that in hand, we can use the existing genomes of Neanderthals and Denisovans (and, one hopes, eventually those of Homo heidelbergensis and Homo erectus) to validate or invalidate these predictions. Indeed, some of these predictions (e.g., that Neanderthals shared the vocal sequencing capabilities of modern humans) are already supported by the fact that they shared our modified FOXP2 gene (see above and cf. Fisher, 2016). However, most of the predictions, for example that Neanderthals had poorer theory of mind (were less Machiavellian) and lacked dendrophilia, will require a better understanding of the genetic basis of these traits (e.g., from the study of autism spectrum disorder for theory of mind, or of the development of the arcuate fasciculus for dendrophilia). But again, these advances are part of modern clinical and developmental genetics: no time machines are required to test these hypotheses.
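To illustrate the kind of genome-based prediction-testing described above, the following sketch checks whether each taxon carries the two derived, human-specific amino-acid substitutions in the FOXP2 protein (at positions 303 and 325, which Neanderthals are known to share); the data structure is invented for illustration, whereas a real test would parse published archaic-genome alignments.

```python
# Hypothetical sketch of a genomic prediction test. Variant data are invented
# placeholders; a real analysis would parse published genome alignments.

# Derived, human-specific FOXP2 substitutions (protein position -> residue).
DERIVED_FOXP2 = {303: "N", 325: "S"}

# Toy per-taxon FOXP2 residues at the two diagnostic positions.
FOXP2_RESIDUES = {
    "Homo sapiens": {303: "N", 325: "S"},
    "Neanderthal":  {303: "N", 325: "S"},  # shares the derived alleles
    "Chimpanzee":   {303: "T", 325: "N"},  # ancestral state
}

def shares_derived_foxp2(taxon: str) -> bool:
    """True if the taxon carries both derived substitutions."""
    residues = FOXP2_RESIDUES[taxon]
    return all(residues.get(pos) == aa for pos, aa in DERIVED_FOXP2.items())

for taxon in FOXP2_RESIDUES:
    print(f"{taxon}: shares derived FOXP2 = {shares_derived_foxp2(taxon)}")
```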

Conclusions

In this introductory review I obviously could not discuss every idea about language evolution or every source of data relevant to its study, but I do hope that this survey of hypotheses and data whets the reader’s appetite for more, and illustrates a few key points. First, contrary to an oft-stated opinion, there is a refreshingly large volume of data relevant to language origins, once the problems and models have been clearly stated and an open-minded approach based on strong inference is adopted. In particular, different models of language evolution make different predictions in multiple empirical domains, and data capable of discriminating among such well-specified models are often either already available or can be gathered with current methods.

Second, the relevant data are almost bewilderingly diverse and voluminous: they span a set of disciplines that no single scholar, however knowledgeable, could hope to master individually. Future progress in understanding the evolution of language will therefore require productive collaborations among researchers from different disciplines, something easier said than done.

Finally, and most excitingly, this flood of incoming data offers a unique new promise for contemporary researchers. These data are often generated by researchers who have no particular interest in language evolution per se, but who see themselves as broadly studying cognitive neuroscience, paleo-genomics, brain evolution, theoretical linguistics, or human (cultural) evolution. For those willing and able to educate themselves about these data and about the issues specific to language evolution, and who adopt a hypothesis-testing framework, such data can allow real progress. A clear example is the abundant genetic data from many living species (including thousands of individual humans) as well as from an ever-increasing number of extinct hominins—data freely available online to any interested person. There is also steady growth in other relevant freely available data (e.g., the WALS database of linguistic structure; Haspelmath, Dryer, Gil, & Comrie, 2005). Thus, despite the technical challenges of accessing and understanding such data, any scholar or team of scholars can in principle use this body of knowledge without undertaking the vast cost and effort of gathering it themselves. The study of language evolution has reached the “big data” stage, and harnessing these data will require significant changes in how we develop and test models of language evolution.

In the final section, I offered an illustration of the kind of theorizing that will be needed to effect these changes: comprehensive models that seek to explain all of the key innovations needed to yield modern language, considered from mechanistic, phylogenetic, and adaptive viewpoints. Such models—and mine is only one of many plausible suggestions—consist of multiple specific, testable hypotheses. The data available for such testing include comparative, linguistic, neural, and paleontological findings, but progress will lean heavily on genomics, since the closest thing available today to a time machine is DNA, especially the paleo-DNA of extinct hominins. It is unfortunate that these genomic data do not (currently) extend further back in time (e.g., to Homo erectus or Australopithecus), but we should not look this gift horse in the mouth. Never before have hypotheses about the linguistic capabilities of Neanderthals been so clearly and empirically testable.

To conclude, the evolution of language has been termed “the most challenging scientific problem of our time,” and from a sociological point of view the interdisciplinary collaboration it requires is indeed a challenge. But it is also one of the most interesting problems in modern science, and it concerns an issue absolutely central to understanding our own species: how we rose, within a few million years, from being an ecologically peripheral African ape to being the most important (and dangerous) species alive on our planet today. The work reviewed here, and the remaining articles in this special issue, suggest that a deep understanding of this ancient problem may be attainable in the next few decades.