Introduction

British archaeologist Steven Mithen, born in 1960, has garnered worldwide acclaim for his influential books The Prehistory of the Mind (1996) and The Singing Neanderthals (2005). The former has been celebrated for its masterful writing, while the latter presents a captivating argument for the musicality of Neanderthals. Despite their contributions, a focused examination of language – a cornerstone in the evolution of the human mind – was conspicuously absent. Mithen addresses this gap in his latest book, The Language Puzzle: How We Talked Our Way Out of the Stone Age (2024), offering a thorough, interdisciplinary exploration of the origins and evolution of language.

In The Language Puzzle, Mithen addresses the puzzle of language evolution from numerous perspectives. At the same time, perhaps because of the breadth of this undertaking, the book could seem to lack a simple, compelling thesis, around which to center the enormous amounts of data collected. While no such thesis is very explicitly stated, the book does, in my view, advocate for a significant paradigm shift in our understanding of language origins. Specifically, Mithen challenges the prevalent gestural origins hypothesis, proposing instead what might be termed an iconic vocal origins hypothesis.

In brief, Mithen’s core suggestion, as I understand it, is that early languages would have been mostly or possibly entirely iconic. Conventionalization would enter into the picture only later, motivated by the inherent constraints of iconic languages (150–151). Iconic words functioned as a transitional “bridge between the barks and grunts of the chimpanzee-like calls … and the dominance of arbitrary words within fully modern language” (152). This new perspective, whether overtly or subtly, informs most if not all of the book’s chapters. My review will primarily focus on critically evaluating this hypothesis.

Before delving any deeper into Mithen’s proposal, here’s an overview of the book’s contents by chapter. Chap. 2 provides a relatively standard archaeological overview of human prehistory over the past 6 million years, up until the advent of agriculture. Chap. 3 shifts focus to language, introducing distinctions such as those between vocal expressions and language, different word types, and the concepts of compositionality and prosody. Chap. 4 examines the vocalizations of monkeys and apes, and Chap. 5 the vocal tract anatomy of modern humans, chimpanzees, and the relevant data available from paleontology. Chap. 6 delves into the history of philosophy, to revive the iconic origins of language hypothesis. Chap. 7 explores what prehistoric tools can teach us about the evolution of (vocal) language. Chap. 8 turns to insights from computer science, and Chap. 9 to those from developmental psychology. Chap. 10 speculates on the role fire played in bringing people together at times when little practical work could be done. Chap. 11 compares the brains of humans and chimpanzees, and revisits some of Mithen’s ideas about cognitive fluidity from The Prehistory of the Mind (1996). Chap. 12 discusses developments in the field of genetics, and Chap. 13 how new words might be created, for example by word borrowing, compounding, blending and clipping. The final chapters focus on thought and symbolism.

In the remainder of this article, I discuss Mithen’s case for the evolution of language primarily within the vocal rather than gestural domain, and on an iconic rather than symbolic basis. Following this overview, I critically evaluate this proposal. The final section reviews the concluding chapter of Mithen’s book, and attempts to situate his account of the evolution of behavioral modernity within the ongoing debate about (dis)continuity.

The iconic origins of language hypothesis

About three decades ago, the discipline of primatology underwent something of a paradigm shift in focus and methodology. Initially, primatologists gravitated toward vocal communication, mirroring the way humans acquire language. The pioneering work of Cheney and Seyfarth in the 1980s on vervet monkey calls briefly ignited excitement about the potential for symbolic, human-like communication among non-human primates. However, the following decades saw a gradual shift away from vocal communication, and since then a significant body of research has focused on non-verbal communicative acts like gestures, pantomime, gaze following, and referential acts, occasionally aided by Theory of Mind frameworks. Over the past twenty years, this trend has led many researchers to embrace some version of the gestural origins of language hypothesis: the idea that human language evolved out of gestural modes of communication.

In Chap. 4 of The Language Puzzle, Mithen takes a firm stance against the gestural origins of language hypothesis. Alongside his dismissal of Chomsky’s universal grammar – a rejection that has become more mainstream today – Mithen’s argument unfolds in two critical steps. First, it is suggested that “the highly specialized nature of human anatomy and neurology for speaking and listening implies a long and gradual modification of a pre-existing system for vocal expression” (75). In other words, modern human languages are taken to rest on a cognitive system for vocal expression that was already in place at the time of the last common ancestor, likely one similar to those found in extant non-human primates. This view is grounded in the idea that the adaptations in the human vocal and auditory system would be too significant to have evolved during a shorter time span. Mithen further suggests that for modern humans “gesture supports but does not constitute language unless it is heavily formalized into sign language” (75). This too would indicate that language did not evolve from gesture; even a multi-modal origin is rejected.

Second, the gestural origins of language hypothesis would lose further plausibility in light of a better alternative. Interestingly, according to Mithen, versions of this alternative were prevalent in science and philosophy before Ferdinand de Saussure and Charles Sanders Peirce emphasized arbitrariness and conventionality as key features of symbols, including words. Mithen seeks to revive this classic view, which states that vocal words would at first have been iconic, rather than symbolic. Such a notion of iconic language serves to bridge the chimpanzee-like vocal expressions of the last common ancestor and modern human language. Mithen suggests this transition is not as dramatic as it may appear; only “a small cognitive shift would transform their [chimpanzees’] vocal communications into something more recognizable to us as language” (93).

Thus, the first step of the argument establishes a vocal rather than a gestural origin of language, and the second step specifies that in prehistoric languages, the relation between the vocal signifier and the signified would have been predominantly iconic. In Chap. 6, Mithen delves into the history of philosophy, for further support of this iconic origins hypothesis. His exploration begins with the Greeks, notably Socrates and Plato, who postulated the idea of a “Perfect Language” – interpreted by Mithen as an iconic language, one which has sounds correspond most adequately to the objects they refer to. Epicurus then contributed by emphasizing the role of the environment in shaping language, proposing it is a natural response to environmental stimuli, which may differ per community as surroundings change. Much later, Rousseau, Herder and others developed these lines of thought further, for example with ideas about synaesthesia (“interconnections of the most different senses”, 131), which Mithen sees corroborated by contemporary research on onomatopoeia. All these threads lead Mithen to conclude that “iconic words likely played a key role in the evolution of language” (144). By contrast, the conventionality of words, as typically emphasized in Peirce-based approaches, would have emerged later, once a certain threshold of iconic languages had been reached.

Mithen, then, seeks to revive a version of the more traditional yet long-overlooked iconic origins of language hypothesis, and specifically within the vocal domain. The essence of his endeavor lies in the conviction that iconic words could have functioned as a transitional bridge toward more fully symbolic languages. More concretely, Mithen considers that “H. erectus had remained reliant on iconic words, these providing sufficient verbal communication to transmit tool-making skills” (179).

As mentioned, Mithen’s book has enormous breadth, and it is impossible to assess the wealth of data discussed in light of this theory. My central concern, however, is that an abundance of data does not of itself create a convincing theory or argument. In spite of its rich interdisciplinarity, which alone makes the book worthwhile reading, Mithen’s theory strikes me as somewhat underdeveloped, to the point that it is unclear what exactly he is suggesting or why. This is mainly for two reasons. First, there is a conspicuous absence of a discussion of the gestural origins hypothesis, and of primate gestures in general – the focal point of the past decades of primatological research. Arguably, this is required for the reader to assess the feasibility of Mithen’s criticism and of his alternative. Second, it seems to me that a lot of the data used to support the idea of prehistoric iconic vocal languages can also be used to argue for alternative theories. I will explore these two critical evaluations in more detail in the next section.

Discussion

Mithen’s dismissal of the importance of gestures for language evolution is motivated mainly on page 75, where a gestural and even a multi-modal origins hypothesis is rejected, apparently for two simple reasons: (i) modern human vocal and auditory anatomy presupposes a long adaptive period, and (ii) modern gestures do not constitute language unless heavily formalized. While the first point is convincing, it does not effectively argue against multi-modal origins. The second point strikes me as more questionable. After all, Mithen’s main suggestion is that prehistoric languages would have lacked formalization anyway, relying on iconicity rather than conventional rules. If so, then the infrequency of formalized gestural systems today does not necessarily indicate that gestures were not important in earlier times, when there were no formalized languages.

In any case, I suspect that the few arguments Mithen presents against gestural origins likely will not change the mind of anyone defending that idea. A review of literature on primate gestures is, in this regard, a significant omission. Not only have gestures been the focal point of recent research on primate social cognition, but a discussion of the status quaestionis there is arguably necessary for the reader to assess the plausibility – and even to understand the significance – of Mithen’s alternative.

An unfortunate consequence of this omission is that Mithen misses out on what seems to be a perfectly viable variant of his own theory: an iconic gestural origins hypothesis. I suspect that a lot of the data collected in support of iconic vocal origins could also be made to support an iconic gestural origins hypothesis (some of the data cited is in fact about gestures, 151). To give an example, Mithen appeals to the presence of onomatopoeia in contemporary languages as possibly indicating iconic origins of words. Now it has long been shown that gestures and bodily expressions pervade human communication, and some of these certainly qualify as iconic as well. Therefore, an analogous argument could be constructed for iconic gestural origins.

Presumably, Mithen has his reasons for not discussing gestures or their iconic components. Yet it is odd that the option of iconic gestural origins is not even touched upon in the book. It just seems a lot easier to imagine a duo of H. erectus hominins together enacting a successful hunt, or a hilarious event they witnessed earlier, than it is to imagine them describing a simple series of events with iconic sounds. What could that even look like? Modern humans are quite good at acting out events and imitating each other and animals, whether with hands or their whole bodies. It is not too far-fetched to imagine this was once useful when vocabularies were limited, and possibly later in cross-cultural communication, where there might have been language boundaries. By contrast, at least for modern humans, telling a simple story with only iconic sounds is virtually impossible.

How, then, are we to envision an iconic vocal communicative system in practice? Although arguably the central thesis of the book, this concept nowhere becomes very clear. We can certainly imagine someone imitating the sound of a pig, a bird, or perhaps a spear traveling through the air. But such examples might be quite few. Going from what I see around me: how would I go about imitating a chair, a tennis racket, a cup, or a plant by vocal sound? Again, I could probably enact all these things relatively easily using my hands and arms, but I would struggle to do it verbally, and any listener would have to put in a lot of work to infer what I am trying to say. Even if I manage to get the referential intention (what I refer to) across, what about the social intention (what I want the other to do)? Talk about a language puzzle!

Let me be clear that the following idea does strike me as quite plausible: word invention is partially motivated by iconicity. This might be called the iconic motivation for word invention hypothesis. That is to say, if one has to invent a name for something, it makes sense to start from something already understood by the community (283). This could be an existing word which is somehow associated with it, or possibly some (acoustic) feature of the object which everyone is familiar with. The circumstance of word invention, however, is not the core of Mithen’s proposal, as I understand it. Mithen’s main suggestion is that prehistoric verbal communicative systems consisted entirely, or else primarily, of iconic words. Such lexical items are processed, cognitively speaking, in that manner, and this would form the cognitive foundation for later conventional sign use.

In The Language Puzzle, Mithen draws on data from various disciplines in support of this view. Among other things, he points to traces of iconicity in modern languages (mainly in English, but the point generalizes). It is said that basic vocabulary items show “persistent sound-meaning associations” irrespective of language families (145). For example, both /i/ and /l/ would connote small (134, 145), soft consonants would cross-culturally be associated with mother (135), and the /n/ with nose, reflecting the sound made by this organ (145). Also, in English, words related to “unhurried movement consistently start with sl-, such as in slow, slide, slur, slouch, and slime” (137), and so on.

One possible objection to this line of argument is that the evidence does not straightaway warrant the conclusion. Even if there is some iconicity to words like slide, slow, and slouch, it seems more plausible to cash this out in terms of my earlier iconic motivation for word invention hypothesis, rather than to conclude that there once existed completely iconic languages. Arguably, for the majority of words (including the ones just mentioned), a perfectly adequate signifier is not even conceivable. Instead, the consistent cross-cultural correlation of certain signifiers (or parts thereof) and what they signify might be explained by saying that word invention is sometimes motivated, in part, by resemblance.

Another problem with the argument is that modern humans usually only become aware of any iconic roots once they reach adulthood. This could mean that they actually processed the words as conventional symbols, not as icons. For example, I learned the Dutch word kraai (for crow) as a child, but I never realized its iconic roots (it resembles the sound these animals make) until I had reached adulthood. It is possible that the iconicity helped me to remember the word when first learning it, as Mithen also suggests (144), but this is uncertain. Human infants learn thousands and thousands of new words, which are all stored internally, with or without underlying resemblance.

Perhaps iconic words had certain practical benefits at times when they could not be stored internally. In such a hypothetical scenario, hominins could use the word “kraai” whenever they saw this type of bird. In other words, because they would not have access to a suitable lexical item stored internally, the alternative would be to make a sound that resembles the referent in the vicinity. This way, hominins could effectively communicate about (at least some) objects, without necessarily having to store lexical items.

The idea of an iconic language, then, could make more sense when assuming that lexical items are not stored internally. However, this is a challenging scenario to imagine, as it means prehistoric hominins effectively had to reinvent the wheel each time they spoke. While such a scenario is conceivable, it seems all too evident that humans have evolved to store lexical items, and not to mimic objects by sound – they are much better at mimicking with hands and bodies. Arguably, internal storage of conventionalized lexical items better fits the steady cortical expansion seen during the entire Paleolithic (244–245), including in times with little change in the material record, and it better matches the way words are used and acquired by humans today.

There might be more nuanced ways to flesh out Mithen’s proposal. Consider the following illustration. We can imagine that a roar-sound, as an iconic word, at some point functioned to refer to an animal making such a sound. More specifically, this roar-sound might have been used only in the presence of such an animal, and specifically within a certain social practice, like hunting. On the basis of this shared practice, listeners might be able to infer the social intention of this communicative act, as a request for a cooperative hunting effort. At the same time, the very same roar-sound might have functioned in different social settings, and with different social implicatures. For example, when preparing for sleep, it might be used to indicate a threat, and to motivate a cooperative effort to scare off a dangerous animal. We might further speculate that although this roar-sound could have been stored internally, others might only have been able to understand the social intention correctly when partaking in the relevant social practice (hunting or sleep preparation).

A lexical item of the sort just described would be characterized by what I have elsewhere called practice-embeddedness. Communicators would rely on a shared practice awareness to provide the relevant contextual clues, which is particularly useful in the absence of more extensive vocabularies (van Mazijk 2024a, b, c). Pointing, which I have elsewhere argued was essential to early hominin language evolution (van Mazijk 2024a), works similarly: we here have a single sign whose meaning varies conveniently depending on shared context understanding. Not only does the referential intention depend on the immediate surroundings (the sign’s referent depends on what I point out), but also the social intention – “what do I want the other to do?” – relies on shared contextual understanding of what we are doing. Communicative success thus depends on live social practices, which could lower cognitive demands of early sign use (van Mazijk 2024c).

In my view, shared practice awareness is key to bridging the simplest referential acts (pointing and basic words) and modern symbolic languages. The outline I just gave, using concepts such as social intentions and practice-embeddedness, might be compatible with Mithen’s account (although this is seemingly contradicted when Mithen says that iconic words of H. erectus were “understood by others without their referents being present”, 372). Still, the theory of the practice-embeddedness of early speech acts does not depend on the iconic vocal origins theory. Early words may have been strongly context-dependent, even when they were not (all) iconic. Put differently, while successful communication always depends on shared context understanding, iconicity is but one way to capitalize on shared context. Besides practice understanding, shared embodiment and shared emotions might also contribute significantly.

I worry that some of the other data cited by Mithen concerning how words change (273–313) might also be turned against his proposal. That words change rapidly will be familiar to anyone who has read century-old texts. For example, according to Mithen, the Old English word saelig once meant “blessed”, but apparently changed to mean “innocent”, “weak”, and then “silly” (interestingly, the Dutch word zalig still means “blessed”, but now also “delicious”). In Chap. 13 of The Language Puzzle, Mithen offers interesting insights into the complex dynamics of such change. He discusses various ways in which words can evolve, such as by word borrowing, compounding, blending, and clipping.

Yet the very fact that words change seems to suggest that they are cognized by humans as conventionalized signs. After all, a sufficient amount of change implies a loss of any initial iconicity. It is thus contradictory to say that H. erectus spoke with iconic words and also to say that those words changed as they do today. In Mithen’s defense, it is conceivable that words only started changing rapidly after the hypothesized threshold of iconic languages had been reached. Indeed, he suggests that in H. erectus times, when “words drifted too far from their iconic roots, their meaning became lost, as did the word itself”, and consequently all H. erectus languages around the world were fundamentally alike (377). While this is conceivable, the research cited concerning how words evolve does nothing to support that idea. If anything, it suggests that humans evolved to process signs as conventionalized symbols, potentially as a result of their propensity to imitate one another, rather than objects.

In summary, the iconic vocal origins hypothesis is a captivating idea, but it faces some serious challenges, which Mithen may not have adequately handled. While not strictly unimaginable, it is certainly challenging to conceive of H. erectus speaking a purely iconic language, and there might not be good enough reasons to accept this demanding idea. There seem to be traces of iconicity in the languages we speak, and this, along with other data, testifies to a role of iconicity in word invention. Yet evidence pertaining to how words change, are acquired, and used by humans might indicate, contrary to Mithen’s proposal, that humans have evolved to store lexical items, to imitate each other’s behavior and languages rather than objects, and hence to process words as conventional rules. Perhaps, then, Plato’s “Perfect Language” was no less mythical than his “World of Ideas” after all.

Continuity or discontinuity? From icons to symbols to metaphors

Arguably the central controversy in contemporary Paleolithic archaeology concerns the question of whether “modern behavior” – typically framed in terms of either syntactical capabilities or “modern symbolism” – arose suddenly or gradually. Proponents of continuity approaches assert that modern behavior evolved incrementally, without abrupt changes. Such accounts typically involve a mosaic of different elements which gradually came together. Conversely, discontinuity approaches usually highlight the importance of genetic change – what Bickerton (2000) once called “factor X”. Chomsky is a famous discontinuity theorist about language (Hauser et al. 2002, 1572), and Wynn and Coolidge (2010) and Tattersall (2021) are reputable representatives in archaeology. Nevertheless, it appears that for many scholars today, a sudden “creative explosion” (Pfeiffer 1982) in the Upper Paleolithic has lost most of its original appeal, and this has put significant pressure on discontinuity accounts.

Considering Mithen’s position within this debate is worthwhile. Ironically, though probably intentionally, The Language Puzzle is itself constructed in the form of a jigsaw puzzle (334), with the initial 15 chapters presenting various pieces of the puzzle, while the synthesis of these pieces is deferred until the concluding chapter. Therefore, the final chapter is essential for exploring Mithen’s imaginative outline of how language may have evolved.

Regarding the earlier time periods, the final chapter leans towards a continuity perspective. In summary, Mithen proposes that the vocal cries of the last common ancestor (6 million years ago) evolved into iconic words by approximately 2.8 million years ago, with varying degrees of compositionality developing later. Up until around 750,000 years ago, no significant thresholds are mentioned. The languages of H. erectus before this time were all iconic, and consequently “unbelievably monotonous” (377).

Around 750,000 years ago, Mithen identifies a first hypothetical threshold. This threshold seems to stem directly from Mithen’s own commitment to the idea of prehistoric iconic languages, which require a bridge to conventionalization and grammaticalization, purported to occur after 750,000 years ago with Homo heidelbergensis. This transition is primarily facilitated by an increase in brain size, allowing for an expanded lexicon (377–379). Arbitrary words could then emerge “as the sounds and meanings of iconic words transformed by their repeated use involving mishearing, mispronunciation and misunderstanding” (379).

At the end of this transformation, one final threshold remained. This is said to be a limitation “within the brain”, namely “domain-specific mentality” (382). According to Mithen, Neanderthal language evolved completely under “the constraint of this domain-specific mentality” (384). As a consequence of this genetically fixed limitation, Neanderthals “were unable to blend together the knowledge and ways of thinking used within each domain” (383). Their various thoughts and mental processes were, as it were, compartmentalized in different brain areas. While they did speak, using a limited lexicon of iconic and arbitrary words, they lacked metaphorical thinking, symbolic thought, and abstract concepts (385–387) – those traits which spark human creativity.

This last threshold is breached with modern H. sapiens in Africa, whose brains achieve so-called cognitive fluidity. Interestingly, this discussion brings us back to Mithen’s first classic, The Prehistory of the Mind from 1996, where the idea of cognitive fluidity likewise did a lot of heavy lifting. The idea originated within a Fodor-inspired modularity framework, with different modules functioning independently from one another. As Mithen summarizes his view in The Language Puzzle, “the neural networks for each domain were [for Neanderthals] isolated from each other”, as a consequence of which “[k]nowledge and the ways of thinking required for activity in the social world were detached from those used for engaging with the natural world and for making tools” (382).

Needless to say, our understanding of Neanderthals as well as of the brain’s modularity has changed dramatically since the 1990s, with much recent evidence challenging Mithen’s earlier hypothesis. For instance, a key element of Mithen’s (1996) argument for Neanderthal domain-specific mentality was the lack of archaeological evidence at the time indicating their ability to work with materials other than stone. Mithen (1996) then proposed that Neanderthals had a specialized cognitive module for stone manipulation, which supposedly prevented them from recognizing the potential of other materials.

Clearly, such gaps in the records made it easier to defend domain-specific mentality thirty years ago than it is now. For example, unlike thirty years ago, it is now generally believed – also by Mithen – that Neanderthals did speak in complex ways, interacted significantly with H. sapiens, and possessed some symbolic culture, such as wearing necklaces. Significant behavioral gaps have therefore been closed in recent years, at least partially, which could make the idea that Neanderthals were intrinsically incapable of connecting various modules somewhat less compelling.

Moreover, in recent years, there has been a growing scientific consensus on the importance of Neanderthal demographics, compared to those of H. sapiens. As Mithen also notes, in modern H. sapiens a limited lexicon is often associated with lower population densities – a perspective he extends to Neanderthals. Henrich (2017) has further substantiated this view, arguing that smaller group sizes significantly impact the statistical likelihood of discovering new ideas and techniques, thereby further constraining the successful transmission of knowledge across generations, and limiting cultural accumulation.

This could leave one wondering whether, had Mithen included a chapter specifically on the social lives of earlier hominins, the other puzzle pieces might have been pieced together in a different way. While Mithen’s broad outline of Neanderthal speech is compelling, the concept of cognitive fluidity has not necessarily gained plausibility over the past thirty years. The very notion of a threshold, constituted by genetic mutations and brain globularity, could seem to suggest a discontinuity between modern and archaic humans, and does little to explain how incremental changes, possibly influenced by social context, might have driven the evolution of language in the Middle to Later Stone Age transition. The globular shape of the H. sapiens brain is nowadays also often associated with self-domestication, again implying a strong social context (Benítez-Burraco et al. 2018; Neubauer et al. 2018). Thus, a more detailed consideration of the social lives of archaic and modern humans seems necessary to provide insight into how a gradual transition to modern behavior, rather than a sudden breach of a threshold, could have been possible.

Conclusion

Mithen’s new book provides a wealth of insights into the cognitive and physiological foundations of spoken language. It seamlessly integrates numerous scientific perspectives, and is essential reading for anyone interested in the evolution of human language. While the synthesis of all this information into a cohesive theory about iconic vocal origins is not flawless, The Language Puzzle stands as a testament to Mithen’s ability to engage readers and provoke further thought on some of the most fascinating issues in contemporary science and philosophy.