A Comparative Approach to the Evolution of Human Language: Insights from Empirical Studies

In the last few decades, a surge of studies has emerged with the aim of applying an empirical and comparative approach to animal vocal communication as a way to enhance our understanding of the evolution of the mechanisms enabling human language (hereafter, language). The present work adopts this approach, and includes the following methodological steps: (a) identifying and defining core abilities involved in language; (b) tracing the presence of these abilities in multiple closely-related or phylogenetically independent species; (c) analyzing the factors that may have boosted the evolution of these abilities into their current form in language.

This method, which assumes that core mechanisms underpinning language are broadly shared across nonhuman animals (hereafter, animals), sheds light on the evolutionary link between language and other animal communication systems and thus on the biological foundations of language. Specifically, two main classes of shared traits provide information on the evolution of mechanisms involved in language: “homologies” and “analogies” (see Fitch 2017). Homologies are traits that are shared between different species and that were present in their last common ancestor. This type of shared trait sheds light on the direct genetic inheritance underpinning the studied traits and on their phylogenetic path. For instance, humans and chimpanzees have the ability to mentally represent the goals, intentions, perceptions, and knowledge of other individuals (Fitch et al. 2010; Schmelz et al. 2011; Tomasello et al. 2003). It is, therefore, most parsimonious to suggest that, rather than evolving convergently in a relatively short period of evolutionary time, this trait was present in their last common ancestor, which lived between 6 and 7 million years ago (Kumar et al. 2005). Analogies, on the other hand, are traits that were not present in the last common ancestor of the focal species, but evolved convergently as a result of similar selective pressures. Hence, the study of this type of shared trait provides insights into its adaptive function. An example of an analogous trait is the ability to arrange notes within songs, which humans share with songbirds (Berwick et al. 2011). Much research suggests that this ability, which serves as sexual advertisement in both groups, has evolved under comparable sexual selection pressures (Charlton 2014; Darwin 1871; Miller 2000).

By comparing data across species, this approach sheds light on anatomical, cognitive, and neural commonalities between humans and other animals that can be identified as factors enabling the emergence of linguistic communication in humans. In this regard, it is important to emphasize that the uniqueness of language might not stem from one or more core mechanisms that are specific to humans. On the contrary, language might have evolved from the integration of multiple mechanisms, each of which can be individually traced (sometimes in a simpler form) in at least one other animal species (Fitch 2010, 2017).

Here, I will take a cross-species, comparative approach to studying language evolution by examining three core abilities underpinning language that are, to some extent, shared with nonhuman species (Fitch 2017; Fitch and Zuberbühler 2013; Rendall et al. 2009; Townsend et al. 2018): (a) the ability to identify and produce phonemes; (b) the ability to process compositional rules underlying vocal utterances; (c) the ability to associate vocal sounds with meanings. In particular, I will highlight the role of a key communicative factor, namely emotional intonation of the voice, with the aim of shedding light on its facilitating effect on the evolution of these three cognitive abilities underpinning language.

Previous research has suggested that the expression of emotions through voice modulation or musical communication, which has been attested across multiple animal species, might have paved the way for the emergence of language in the first hominids (Altenmüller et al. 2013; Brown 2017; Darwin 1871; Filippi 2016; Filippi and Gingras 2018; Filippi et al. 2019; Fitch 2010; Panksepp 2009; Thompson et al. 2012). However, empirical studies addressing the facilitating effect of emotional intonation on each of these three core abilities within a cross-species and evolutionary perspective have never been conducted. In fact, finding this facilitating effect in animal species would provide support for the hypothesis suggested here, namely that emotional intonation might have boosted the evolution of the ability to process phonemes and combinatorial structures, and to associate words with meanings out of comparable abilities reported in animal species. Specifically, in the present work, I suggest that emotional intonation might have boosted the evolution of these abilities, facilitating cognitive processes such as selective attention, perception, memorization, and learning.

In support of this hypothesis, I will firstly review research attesting the presence of the ability to identify and produce phonemes, process compositional rules, and associate vocal sounds with meanings in animals. Secondly, I will review studies indicating that emotional vocalizations are used as a communication code across a wide variety of animal species (cf. Darwin 1872). Thirdly, I will link this research to recent work on the facilitating effect of emotional intonation of the voice on the human ability to perceive speech sounds within compositional structures and associate words with meanings. Finally, I will integrate these studies within a unified framework on the facilitating effect of emotional intonation on language evolution, suggesting specific research questions that can be addressed empirically within a cross-species perspective.

Language-Related Abilities and Emotional Intonation in Animals

The Animal Ability to Identify and Produce Phonemes

Much research has addressed phoneme identification—i.e., fine-tuned perceptual discrimination of vowel- and consonant-like sounds in animals. In this regard, a study reported that one chimpanzee, who had long been exposed to spoken English before being tested, was able to recognize spoken words, even when spectrally degraded (Heimbauer et al. 2011). Further work shows that animals can learn categorical discrimination of distinct phonemes along an acoustic continuum. For instance, macaques (Macaca mulatta) (Kuhl and Padden 1983) and budgerigars (Melopsittacus undulatus) (Dooling and Brown 1990) can learn to discriminate between voiced and voiceless consonants in the pairs /ba/-/pa/, /da/-/ta/, /ga/-/ka/. Similarly, chinchillas (Chinchilla laniger), a mammalian species with auditory abilities similar to humans, can be trained to discriminate a voiced plosive consonant, /d/, from a voiceless one, /t/ in the initial position of a syllable (Kuhl and Miller 1975).

In addition to research on animals’ ability to learn to identify fine-grained differences between phonemes, extensive research has addressed their ability to produce phoneme-like sounds. A fundamental theory in the study of animal vocal production is the so-called source-filter theory, which identifies two main factors affecting the vocal output: the “source” and the “filter” (Fitch 2000; Titze 1994). The source of vocal sound production is the larynx in mammals, amphibians, and reptiles, and the syrinx in birds. Specifically, vocal sound is generated in the source by tissue vibrations driven by the passage of air through the vocal folds. The rate of the vocal folds’ opening-closing cycles determines the fundamental frequency of the vocal sound (F0), which corresponds to the perceived pitch of the voice. Subsequently, the sound reaches the supralaryngeal vocal tract, i.e., the filter, where certain frequencies are enhanced while others are attenuated by articulation of various parts of the filter, e.g., the lips or the tongue. This results in concentrations of acoustic energy in particular frequency bands (called “formants”), which are perceivable in vowels and consonants (Fant 1960). For instance, if you produce a sequence of different vowels that are equal in duration, F0, and amplitude, the perceived acoustic variation results from differences in formant frequencies.
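The filtering step can be illustrated with a textbook idealization that is not part of the studies cited above: if the vocal tract is treated as a uniform tube, closed at the source end and open at the lips, its resonances fall at odd multiples of a quarter wavelength. The sketch below assumes this idealization, an assumed speed of sound of 350 m/s, and a hypothetical function name; real vocal tracts are not uniform tubes, so the values are only indicative.

```python
# Idealized illustration only: a uniform tube closed at the source end
# and open at the lips (a quarter-wavelength resonator).

SPEED_OF_SOUND_M_S = 350.0  # assumed value for warm, humid air


def uniform_tube_formants(tract_length_m, n_formants=3):
    """Return the first resonances (Hz) of an idealized uniform tube.

    Uses F_n = (2n - 1) * c / (4 * L), with c the speed of sound and
    L the tube length in meters.
    """
    if tract_length_m <= 0:
        raise ValueError("tract_length_m must be positive")
    return [
        (2 * n - 1) * SPEED_OF_SOUND_M_S / (4.0 * tract_length_m)
        for n in range(1, n_formants + 1)
    ]


if __name__ == "__main__":
    # Hypothetical tract lengths chosen for illustration.
    for label, length in [("shorter tube, 0.100 m", 0.100),
                          ("longer tube, 0.175 m", 0.175)]:
        rounded = [round(f) for f in uniform_tube_formants(length)]
        print(label, rounded)
```

Under these assumptions, the longer tube yields lower resonances than the shorter one, while the source (F0) does not enter the computation at all, which is the independence of source and filter that the theory describes.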

Following Lieberman et al. (1969), until the last two decades it was commonly assumed that nonhuman mammals (including nonhuman primates) are not able to articulate sounds included in human speech due to an anatomical limitation in the filter, namely a larynx that sits high in the vocal tract. This has been argued to limit the range of articulatory movements in the vocal tract, and hence the formants that could be produced. However, a growing body of converging data from empirical studies and computer models of animal vocal production has undermined Lieberman’s hypothesis. For instance, research shows that, when resting, the larynx of red and fallow deer (Cervus elaphus and Dama dama, respectively) is in a position comparable to that of humans, and retracts even lower during vocalization (Fitch and Reby 2001). Furthermore, Boë et al. (2017) reported that vocalizations of baboons (Papio papio) have the formant structure of human [ɨ æ ɑ ɔ u] vowels. This finding suggests that, unless the ability to produce these vowels emerged independently in humans and baboons, the ability to articulate vowel-like sounds may be traced to the last common ancestor from which humans and Cercopithecoidea diverged, about 25 million years ago (Stevens et al. 2013). Consistent with this work, a study adopting a computer model based on vocal tract configurations of living rhesus macaques (Macaca mulatta) confirmed that the primate vocal apparatus is potentially capable of producing human-like vowel sounds, as well as a variety of consonants that are widely shared across languages, including stop consonants (e.g., /h/, /m/, /w/, /p/, /b/, /k/, and /g/) (Fitch et al. 2016). This study implies that the human ability for speech required the evolution of specific neural connections between forebrain and laryngeal muscles, rather than anatomical changes in the vocal apparatus.
Importantly, this research supports findings from previous studies hypothesizing that, in humans, direct neural connections between the laryngeal motor cortex (LMC) and the brainstem laryngeal motoneurons (which are, in turn connected to the laryngeal muscles), as well as the location of the LMC in the primary motor cortex (as opposed to its location in the premotor cortex in nonhuman primates), might have been key evolutionary steps enabling the ability to control complex laryngeal movements involved in producing learned vocal utterances (Jürgens 2002, 2009; Simonyan 2014; Simonyan and Horwitz 2011). In monkeys, and presumably, other nonhuman primates, the LMC is linked only indirectly—namely, through the reticular formation—to the laryngeal motoneurons in the brainstem (Simonyan 2014). Critically, their innate vocal production seems to be enabled by a specific voice control system in the brain, involving the brain stem and spinal cord sensorimotor phonatory nuclei only (Simonyan and Horwitz 2011). This might explain why the destruction of the LMC region in monkeys does not affect their innate vocal production (Simonyan 2014), which can take place without involving voluntary coordination and control of laryngeal muscles. Comparative studies on primate vocal production are a clear example of how research can help shed light on which trait was present (the anatomy of the vocal tract) and which was still missing (e.g., direct neural connections from motor cortical regions onto laryngeal motoneurons) before full-blown speech evolved.

Strikingly, this picture could be placed into a broader evolutionary scale to gain a wider perspective on the selective pressures enabling the emergence of the neural connections necessary for articulating human speech sounds. Indeed, although much research reports on animals’ ability to produce novel sounds and sound combinations by imitation (see section below), only a few species of mammals and birds seem to be able to learn to modulate their vocal tract to imitate words and sentences in existing human languages, e.g., Asian elephants, Elephas maximus (Stoeger et al. 2012); captive harbor seals, Phoca vitulina (Ralls et al. 1985); gray seals, Halichoerus grypus (Stansbury and Janik 2019); and birds (Grey parrot, Psittacus erithacus, Pepperberg 2010; mynah bird, Acridotheres tristis, Stefanski and Klatt 1974). Social bonding can be identified as a potential selective factor boosting the ability to learn to produce novel sounds that are not included in the given species’ vocal repertoire (Stoeger et al. 2012). Hence, this research provides crucial insights into the key role of social pressures in language evolution and is consistent with work suggesting that social bonding (which is likely highly connected to the use of vocal emotional intonation in inter-individual communication) might have promoted the evolution of neural connections enabling the production of human spoken language (Dunbar 2003).

Generally, although it is plausible that species that are able to produce speech sounds can equally discriminate them at a perceptual level (cf. Pulvermüller 2005), more research is needed on this topic within a cross-species perspective. This will favor a broader understanding of the evolutionary pressures behind neural and anatomical predispositions for identifying and producing phonemes.

The Animal Ability to Process Compositional Rules in Vocal Utterances 

A number of cross-species studies revolve around the ability to process vocal sequences according to compositional rules. This line of research aims at understanding the evolutionary precursors and selective pressures that led from the ability to parse simple forms of compositionality, which has been demonstrated in multiple species, to the human ability to parse fully-fledged syntactic systems of languages (Russell and Townsend 2017). Although much research on this topic is still ongoing, our understanding of the evolution of the human ability for syntax has significantly advanced in the last two decades (Collier et al. 2014; Engesser and Townsend 2019; Townsend et al. 2018; Zuberbühler 2020). For instance, it has been shown that Campbell’s monkeys (Cercopithecus campbelli) add an acoustic modifier (i.e., a sort of affix) to predator-specific alarm calls (Ouattara et al. 2009). In this way, the meaning of the alarm call is no longer perceived as linked to a predator, but to the presence of a general disturbance. Intriguingly, the ability to process compositional structures was also found in birds, providing insights for comparative studies on its evolutionary origins. For example, Engesser et al. (2016) showed that southern pied babblers (Turdoides bicolor) respond to combinations of alert and recruitment calls with mobbing-like behavior, while no obvious reaction is elicited by control combinations of foraging and recruitment calls. In a similar study, Suzuki et al. (2016) report that, in Japanese tits (Parus minor), the combination of two calls—namely of a call typically eliciting scanning behaviors in the listeners, and a call typically eliciting approach behavior to the caller—results in the combination of these two behaviors, i.e., scanning and approach. In a control experiment, the inversion of these two calls did not elicit any behavior, suggesting that these birds are processing the call combination according to a specific order. 
Therefore, these studies suggest that the southern pied babblers and the Japanese tits are sensitive to compositional properties of call sequences, and that structural changes impair signal perception. These systems can be fruitfully compared with compositional structures in language, where variation of words within a sequence (e.g., changing “gimme a break” into “apple a break” or “break a gimme”) can turn a well-formed and meaningful spoken utterance into an ill-formed and meaningless sequence of words.

Further cross-species studies on the ability to discriminate syntactical structures have typically adopted artificial grammars that are created following specific formal rules. For instance, Spierings and ten Cate (2016) found that zebra finches (Taeniopygia guttata) are able to discriminate units of their own vocal repertoire arranged in an XYX or XXY structure, and that budgerigars (Melopsittacus undulatus) can discriminate these structures and generalize the underlying grammatical rule to novel elements they were never trained on during a previous rule learning phase.
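The difference between discriminating trained strings and generalizing a rule to novel elements can be made concrete with a small sketch. The two functions below are hypothetical illustrations, not the procedure of the cited study: one classifies a triplet purely by the identity relations among its positions, while the other only recognizes strings it has stored.

```python
def triplet_pattern(triplet):
    """Classify a three-element sequence by identity relations only.

    The labels of the elements are irrelevant; only which positions
    hold the same element matters, so novel elements are handled.
    """
    first, second, third = triplet
    if first == third and first != second:
        return "XYX"
    if first == second and second != third:
        return "XXY"
    return "other"


def memorized_pattern(triplet, training_set):
    """Item-based strategy: recognize only triplets seen in training."""
    return training_set.get(tuple(triplet), "unknown")


if __name__ == "__main__":
    # Hypothetical training items built from elements "a" and "b".
    training = {("a", "b", "a"): "XYX", ("a", "a", "b"): "XXY"}
    novel = ("q", "r", "q")  # elements never used in training
    print(triplet_pattern(novel))
    print(memorized_pattern(novel, training))
```

Only the relational strategy assigns the novel triplet to the trained structure, which is the kind of behavior that generalization tests with untrained elements are designed to detect.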

A fundamental strand of comparative research on animals’ ability to process compositional structures has attempted to identify the cognitive abilities that enable humans (but not other animal species) to process more complex compositional structures in language. This research builds on the assumption that the human-specific ability to express an open-ended number of thoughts using a finite set of linguistic units relies on recursion (Everaert et al. 2015), i.e., the operation of embedding constituents within constituents of the same kind (Pinker and Jackendoff 2005; cf. Martins 2012). Building on this assumption, Hauser et al. (2002) proposed that the ability to use recursion might be the key computational ability that differentiates the syntactical competence of humans from combinatorial abilities found in animals (cf. Bolhuis et al. 2018). Within this conceptual framework, much research has relied on the so-called “Chomsky hierarchy” (Chomsky 1956, 1959) as a way to guide empirical work. This hierarchy provides a theoretical structure to identify and classify different levels of computational powers, each corresponding to a specific “grammar”. Each grammar includes a finite number of symbols, rules, and operators to apply to these symbols. One of the aims of this classification is to identify the level of computational power that enables an automaton to process natural languages on a mere mathematical and abstract level, i.e., excluding aspects such as lexical semantics, interactional dynamics, or context. This highly formal character of grammars favors well-controlled cross-species investigations of computational abilities that are foundational to language (O’Donnell et al. 2005). Hence, this research framework enables the investigation of animals’ computational capacities along a complexity axis, which includes the computational capacity underpinning natural language processing.

As Fitch and Friederici (2012) explain in their exhaustive and, at the same time, intuitive overview of the formal language theory at the base of Chomsky’s hierarchy, a crucial distinction within this hierarchy is between “regular” and “supra-regular” grammars. This distinction is important because it provides a line of demarcation between the computational abilities that are necessary to process very simple structures and those that are necessary to process hierarchical syntactic structures in natural languages. Regular grammars can be computed by the simplest class of automata (called “finite state automata”), using basic computational rules, namely, transition probabilities between a finite number of “states” (e.g., phonemes, syllables, or words). Examples of strings that can be processed by regular grammars are “(AB)n”, where the automaton has to accept any number n of “AB” bigrams, or “AB*A”, where any number of B units can occur between the A units at the edges. These basic rules are not enough to process the structural complexity of natural languages (Jäger and Rogers 2012). However, they might suffice to process phonological sequences, an ability that humans might share with other animals (Fitch 2018a). In contrast, “supra-regular” grammars, which include multiple subsets of grammars, rely on more complex rules and more computational power than a finite state automaton provides. An example of a supra-regular grammar is a context-free grammar, which can be computed by a “pushdown automaton”. For instance, the AnBn sequence, where a number of B elements follows the same number of A elements, can be processed by this type of automaton, but not by a finite state one (Fitch and Friederici 2012; O’Donnell et al. 2005), which is not able to count and compare (Jäger and Rogers 2012).
Crucially, the set of supra-regular grammars varies in the amount of computational power required to process dependencies between the constitutive elements of an expression. Importantly, this set includes grammars that can process dependencies within recursive structures, such as A1A2A3B3B2B1, where the same pattern AB is nested in itself, following a center-embedded structure (Chomsky 1956, 1959). As Jäger and Rogers (2012) explain, an example of nested dependencies is given by the English construction “neither-nor”, repeated multiple times within the same sentence, as in “Neither did Mary think she would neither go to the cinema nor eat pizza, nor did I”.
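The contrast between the two classes of automata can be sketched in a few lines of code. The recognizers below are minimal illustrations written for this purpose, with hypothetical function names: the first two use only a fixed set of states and a transition table, as a finite state automaton does, while the third needs an unbounded memory (a stack) to check that the number of B elements matches the number of A elements.

```python
def run_finite_state(string, transitions, start, accepting):
    """Run a deterministic finite state automaton over a string."""
    state = start
    for symbol in string:
        state = transitions.get((state, symbol))
        if state is None:  # no transition defined: reject
            return False
    return state in accepting


def accepts_ab_n(string):
    """Recognize (AB)n with n >= 1 using three states."""
    transitions = {
        ("start", "A"): "after_A",
        ("after_A", "B"): "after_AB",
        ("after_AB", "A"): "after_A",
    }
    return run_finite_state(string, transitions, "start", {"after_AB"})


def accepts_a_bstar_a(string):
    """Recognize AB*A: any number of B units between two edge A units."""
    transitions = {
        ("start", "A"): "middle",
        ("middle", "B"): "middle",
        ("middle", "A"): "end",
    }
    return run_finite_state(string, transitions, "start", {"end"})


def accepts_an_bn(string):
    """Recognize AnBn with n >= 1 using a stack, as a pushdown automaton would."""
    stack = []
    seen_b = False
    for symbol in string:
        if symbol == "A":
            if seen_b:  # an A after a B breaks the AnBn shape
                return False
            stack.append("A")
        elif symbol == "B":
            seen_b = True
            if not stack:  # more B elements than A elements
                return False
            stack.pop()
        else:
            return False
    return seen_b and not stack


if __name__ == "__main__":
    for s in ["ABAB", "AABB", "ABBBA", "AAABBB", "AABBB"]:
        print(s, accepts_ab_n(s), accepts_a_bstar_a(s), accepts_an_bn(s))
```

The finite state recognizers never store more than their current state, however long the input is. The third recognizer must remember how many A elements it has seen, and that memory grows with n, which is why no fixed number of states suffices for AnBn.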

The first study to use the distinction between regular and supra-regular grammars to compare humans and animals (specifically, cotton-top tamarins) was conducted by Fitch and Hauser (2004). In this study, the authors found that cotton-top tamarins are able to process (AB)n sequences, i.e., regular grammars, but fail to process AnBn sequences, i.e., supra-regular grammars, while humans, as predicted, succeeded in processing both grammars. Following up this work, a number of studies have probed how phylogenetically widespread the ability to process regular and supra-regular grammars is. To date, the majority of studies have found that multiple species of animals are able to process regular grammars, specifically, (AB)n sequences (ravens, Corvus corax, Reber et al. 2016; kea, Nestor notabilis, and pigeons, Columba livia, Stobbe et al. 2012; cf. ten Cate and Okanoya 2012) and perceptual dependencies between edge stimuli in ABnA sequences both in the visual domain (chimpanzees, Pan troglodytes, Sonnweber et al. 2015; cotton-top tamarins, Saguinus oedipus, Versace et al. 2019) and in the auditory domain (squirrel monkeys, Saimiri sciureus, Ravignani et al. 2013; cotton-top tamarins, Saguinus oedipus, Newport et al. 2004; common marmosets, Callithrix jacchus, Reber et al. 2019). In addition, although some research has suggested that birds are able to process supra-regular grammars (Abe and Watanabe 2011; Gentner et al. 2006), subsequent studies have shown that these birds might have used simple strategies, which do not require computational power at the level of supra-regular automata, to parse these structures (Ravignani et al. 2015; Van Heijningen et al. 2009). However, in a recent study, Jiang et al. (2018) provided, for the first time, compelling evidence that an animal species, specifically the macaque monkey (Macaca mulatta), is able not only to parse, but also to produce a sequence according to a supra-regular grammar, namely, a “mirror” (context-free) grammar of the form ABCCBA. Here, the second part of the string is a mirror image of the first part, thus including a center-embedding organization. The authors tested pre-school children on the same task, and found that, compared to monkeys, who needed a massive amount of training to learn the grammar, humans learned to master the grammar with only a little training. These findings suggest that monkeys possess these computational competences, although they do not have the same human inclination to use them (Fitch 2018b).
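The structure of a mirror sequence such as ABCCBA can be sketched with a last-in, first-out memory. The code below is a minimal illustration under stated assumptions, not a model of the experimental task: the function names are hypothetical, and the recognizer assumes that the whole sequence is available so that its midpoint is known.

```python
def produce_mirror(first_half):
    """Extend a sequence with its own mirror image, e.g. ABC -> ABCCBA.

    The first half is pushed onto a stack and then popped, so the
    element stored last is the first one to be reproduced.
    """
    stack = list(first_half)
    produced = list(first_half)
    while stack:
        produced.append(stack.pop())
    return produced


def is_mirror(sequence):
    """Check whether the second half mirrors the first half.

    Assumes the full sequence is given, so the midpoint is known.
    """
    items = list(sequence)
    if not items or len(items) % 2 != 0:
        return False
    half = len(items) // 2
    stack = items[:half]
    for symbol in items[half:]:
        if stack.pop() != symbol:
            return False
    return True


if __name__ == "__main__":
    print("".join(produce_mirror("ABC")))
    print(is_mirror("ABCCBA"), is_mirror("ABCABC"))
```

The pairing of the outermost A elements around the inner B and C elements is the center-embedded dependency described above. A simple repetition such as ABCABC is rejected because it pairs elements in the same order rather than in the reversed order.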

Here, it is important to stress that much debate is currently ongoing regarding the assumption that recursion is the defining computational system of language (Christiansen and Chater 2015; Evans and Levinson 2009; Parker 2007; Perruchet and Rey 2005). Nevertheless, comparative research relying on grammars defined within Chomsky’s hierarchy is effective for a systematic investigation of the ability of animals to process different levels of structural complexities in the vocal domain. This, in turn, may provide key insights into the evolution of the human ability to parse compositional patterns.

But what is the evolutionary advantage of the animals’ ability to produce and process compositional structures? Crucially, in animal communication systems, higher levels of structural complexity in compositional structures allow for the transmission of information with greater degrees of complexity compared to vocalizations with simpler structures (Nowicki and Searcy 2014). In this regard, research indicates that higher levels of vocal complexity typically co-occur with the predisposition to learn to articulate a signal by imitating (and modifying) someone else’s signal (Nottebohm 2002). Hence, the tendency to learn vocally might have been a key factor in the evolution of the human ability to identify and produce syntactical structures in language. Extensive research has addressed the phylogenetic path of the ability for vocal learning, and the selective pressures underpinning its evolution (cf. Martins and Boeckx 2020). In particular, animal research on this topic has mainly focused on three groups of birds (parrots, hummingbirds, and songbirds) (Beecher and Brenowitz 2005; Jarvis 2006). Recently, this line of research has been complemented by studies on phylogenetically distant mammalian species, including terrestrial and marine mammals (e.g., African elephants, Loxodonta africana, Poole et al. 2005; Egyptian fruit bat, Rousettus aegyptiacus, Prat et al. 2015; humpback whale, Megaptera novaeangliae, Cerchio et al. 2001; California sea lion, Zalophus californianus, Reichmuth and Casey 2014).

The Animal Ability to Associate Vocal Utterances with Meanings

Much research aimed at pinpointing the evolutionary precursors of the human ability for word-meaning association in animal communication systems has focused on animals’ ability to understand the link between vocal utterances and their meaning, i.e., the information they express or refer to (Dawkins and Krebs 1978; Macedonia and Evans 1993; Marler et al. 1992; Wiley 1983). For instance, studies indicate a strong link between acoustic features of the signal and information related to the body size and the emotional state of the signaler (Owren and Rendall 2001). Body size has been demonstrated to be reliably cued by the formant structure of mammalian vocalizations. Specifically, individuals with bigger bodies have lower formant frequencies than smaller individuals (domestic piglets, Sus scrofa domesticus, Garcia et al. 2016; koala, Phascolarctos cinereus, Charlton et al. 2011; rhesus macaques, Macaca mulatta, Fitch 1997; humans, Pisanski et al. 2014; for cross-species studies, see: Bowling et al. 2017; Charlton and Reby 2016; Taylor and Reby 2010). In accordance with these studies, research on the perception of vocal indicators of body size suggests that formants are also the most reliable acoustic parameters for perception of size-related variation in animals (e.g., whooping cranes, Grus americana, Fitch and Kelley 2000; red deer, Cervus elaphus, Charlton et al. 2007a, b; dog, Canis lupus familiaris, Faragó et al. 2010), and between species (Taylor et al. 2008). Similar mechanisms seem to be at play in the perception of body size and related information through acoustic features of the voice in humans. Indeed, research shows that formants are linked to size perception (Ohala 1984; Pisanski et al. 2014; Rendall et al. 2009) and dominance (Puts et al. 2006) in humans, and suggests that back vowels (e.g., /o/, /a/) are associated with big objects and front vowels (e.g., /i/, /e/) are associated with small objects (see Lockwood and Dingemanse 2015a for a review).
In addition, Auracher (2017) reports that human participants associate back vowels with larger sizes, aggression, strength, and social dominance, and front vowels with small sizes, weakness, fearfulness, and social subordination. Interestingly, the author found that, in this association process, the semantic content of the pictures (e.g., elephant vs. rabbit) overrides the actual size of the depicted objects (manipulated, for instance, by using an image of the elephant that was smaller than the image of the rabbit).

In addition, Bowling et al. (2017) showed that body size inversely correlates to F0 in a wide variety of mammalian species. This study is consistent with Morton’s (1977) “motivational-structural rules” hypothesis, which states that in mammals and birds, harsh, low-frequency vocalizations are used in competitive contexts to signal physical dominance, whereas more tonal, high-frequency vocalizations are used in fearful or appeasing contexts to signal submission. Recent research has extended these findings, suggesting that larynx size (in particular, vocal fold length), which might not be proportional to body size, predicts F0 better than body size (Garcia et al. 2017).

Critically, research has found evidence for the ability to process simple spoken sound-meaning associations in animals. Dogs (Kaminski et al. 2004), parrots (Pepperberg 2006), and chimpanzees (Savage-Rumbaugh et al. 1993) have all been shown to be able to infer which specific object a word refers to. Finally, comparative research on animal communication has described animal calls as “word-like” vocal units in that these calls are associated with specific objects or events, akin to the referential nature of human words. For instance, in a very influential study, Seyfarth et al. (1980) suggested that vervet monkeys (Chlorocebus pygerythrus) have three distinct alarm calls, associated with ‘snake’, ‘eagle’, and ‘leopard’, respectively. These calls elicit appropriate behaviors in the listeners, such as looking up upon hearing the call emitted by the signaler in response to the presence of an eagle. More recently, research has revisited these original findings and adopted state-of-the-art techniques for acoustic data analyses (Fischer and Price 2017; Price et al. 2015). These studies highlight that animal calls do not “carry” information on the basis of an arbitrary association between sounds and meanings, as in the case of human words. On the contrary, in primates, vocalizations are genetically determined and are triggered by emotional and cognitive states of the signalers, which are reflected in specific acoustic features of the signal. The perception of these acoustic features, combined with contextual cues, allows listeners to associate the signal with its eliciting stimuli, and subsequently select the appropriate responses (Wheeler and Fischer 2012).

Within the comparative approach proposed here, studies on emotional expression through voice intonation are particularly relevant to the study of the evolution of the ability to associate arbitrary vocal utterances with their meaning. Indeed, as I will describe in the next sections, emotional expressions are widespread across a wide variety of vocalizing animal species (Darwin 1872), and, within humans, across cultures (Barrett and Bryant 2008; Sauter et al. 2015; Scherer et al. 2001). This makes emotional expressions a good candidate for enhancing our understanding of the dynamics underpinning the evolution of the human ability for speech processing and word-meaning associations.

Vocal Emotional Expression: A Cross-Species Comparative Approach

The study of emotional expression through voice intonation in animals may provide crucial insights to reconstruct the dynamics underpinning language evolution (Darwin 1871; Filippi 2016; Filippi and Gingras 2018; Filippi et al. 2019). Across animal species, emotions serve adaptive functions, favoring actions that promote survival, such as a fight-or-flight response to an attacking predator in the surroundings (Nesse 1990). In addition, emotional stimuli engage selective attention (Kret et al. 2016) and favor associative learning in animals (McGaugh 2004; Seymour and Dolan 2008).

Importantly, changes in emotional states may create tension in the muscles involved in vocal phonation, for instance, those involved in respiration (diaphragm and intercostal muscles) and, importantly, in the vocal folds (Ladefoged 1996; Titze 1994). These changes affect vocal sound production, generating audible differences between vocalizations emitted in intense emotional states and those emitted in less intense ones. In line with Darwin (1871, 1872), I assume that the mechanisms of production of emotional vocalizations might be evolutionarily conserved across species (Filippi et al. 2017a), and were, presumably, in place at the time the first hominins diverged from the last common ancestor between humans and chimpanzees.

Multiple studies have addressed emotional communication in animals, focusing on discrete emotions, such as fear or rage (Camperio Ciani 2000; Forkman et al. 2007). However, as Mendl et al. (2010) observe, this approach may narrow down the range of emotions that can be assessed in animals within a comparative approach. In fact, a research framework that is best suited for comparative analyses is offered by the dimensional approach (Russell 1980), in which emotions are described according to two dimensions: arousal (low/calm or high/excited) and valence (positive or negative). Crucially, the investigation of arousal, which relies on quantitative measures of physiological correlates of emotional activation of signalers, serves cross-species comparison very well (Briefer 2012). In addition, this quantitative approach allows researchers to identify vocal indicators of arousal levels in the vocalizing animals. In an extensive review, Briefer (2012) reports that across the vast majority of studied mammalian species (including humans, see Banse and Scherer 1996; Johnstone and Scherer 2000), heightened levels of arousal are expressed through a shift in energy distribution towards higher frequencies, higher values of frequency-related parameters, amplitude contour, and vocalization rate, and shorter inter-vocalization intervals. In addition to studies on emotional arousal expression in mammals, research reports that a songbird species, the black-capped chickadee (Poecile atricapillus), encodes in its calls the degree of threat posed by small or large predators, which presumably trigger low and high arousal emotional states, respectively (Templeton et al. 2005). Specifically, the higher the threat, the higher the number of D notes at the end of the call.

The ability to identify emotional states in vocal signals, which may be produced within social interactions (Altenmüller et al. 2013; Bryant 2013), favors survival of conspecifics in contexts such as territory defense or predation (Cross and Rogers 2006; Desrochers et al. 2002; Owings and Morton 1998). In addition, survival chances may be improved by “eavesdropping” on another species’ alarm calls, even when these are acoustically different from the listener’s own (de Boer et al. 2015; Fallow et al. 2011; Kitchen et al. 2010; Lea et al. 2008; Magrath et al. 2009). In line with these studies, recent work has found that humans and black-capped chickadees can discriminate high versus low arousal calls across a large variety of vocalizing species, spanning all classes of vocalizing vertebrates (Congdon et al. 2019; Filippi et al. 2017a, b).

Finally, the ability to identify emotional activation in the signaler (conspecific or heterospecific) may determine survival of newborns, who can express their needs very effectively through voice intonation, thus enabling their caregivers to respond appropriately (marmoset monkey, Callithrix jacchus: Tchernichovski and Oller 2016; Zhang and Ghazanfar 2016; human: Fernald 1992). Interestingly, Lingle and Riede (2014) found that mule deer (Odocoileus hemionus) and white-tailed deer (Odocoileus virginianus) mothers are sensitive to high arousal, negatively-valenced vocalizations of infants of a variety of mammalian species (e.g., mule deer, Odocoileus hemionus; bighorn sheep, Ovis canadensis; marmots, Marmota flaviventris; bats, Lasionycteris noctivagans; Australian sea lions, Neophoca cinerea; and Subantarctic fur seals, Arctocephalus tropicalis), provided that the F0 values of these vocalizations fall within the frequency range produced by infants of their own species.

Taken together, these studies are consistent with Darwin’s (1871) hypothesis that emotional communication in animals is produced through mechanisms underpinning voice production that are conserved across phylogenetically distant species. This hypothesis is in line with a growing body of studies attesting to the human ability to identify vocal emotions across widely different cultures (Barrett and Bryant 2008; Sauter et al. 2015; Scherer et al. 2001). Emotional communication might, therefore, be biologically ancient and immune to the influence of cultural dynamics.

In light of the evidence reviewed in this section, it is worth addressing how emotional intonation, as a communication code used across a wide variety of animal species, affects language processing in humans. This line of investigation will provide insights into the dynamics underlying the emergence of language from nonhuman animal communication systems.

Emotional Intonation: Facilitating Effect on Language Processing

The Human Ability to Identify and Produce Phonemes (Within Compositional Structures): Facilitating Effect of Emotional Intonation 

“I cannot doubt that language owes its origin to the imitation and modification, aided by signs and gestures, of various natural sounds, the voices of other animals, and man’s own instinctive cries. [...] we may conclude from a widely-spread analogy that this power would have been especially exerted during the courtship of the sexes, serving to express various emotions, as love, jealousy, triumph, and serving as a challenge to their rivals. The imitation by articulate sounds of musical cries might have given rise to words expressive of various complex emotions.” (Darwin 1871, p. 56)

In accordance with Darwin’s hypothesis on the origins of language, extensive research has identified a positive effect of emotional stimuli on cognition, particularly on attentional, perceptual, and memory resources, which are at the core of language processing (Dolan 2002; Kotz and Paulmann 2011; Storbeck and Clore 2008). Indeed, multiple studies indicate that the presentation of emotional written words, images, or sounds enhances the processing of target stimuli that are presented before or after the given emotional stimulus. This results in higher accuracy in recalling the target stimulus and facilitates associative learning between the emotional stimulus and the target one (Finn and Roediger 2011; Guillet and Arndt 2009; Riegel et al. 2016). For instance, high arousal auditory stimuli, independently of their valence, affect selective attention, favoring perception and memorization of salient visual stimuli (namely, letters with higher contrast font within a set of visually presented letters) (Sutherland and Mather 2018). Importantly, the effects of emotional stimuli on perception, attention, memory, and learning are mediated by primary brain networks in the limbic system that humans and animals share, in particular, the amygdala (Dolan 2002; Phelps and LeDoux 2005; Seymour and Dolan 2008).

Consistent with these studies, research on auditory emotional words shows that, in adults, shifts towards higher F0 mean and F0 variation positively affect the perceptual salience of these spoken words, engaging attention, and ultimately, favoring their intelligibility (Davis et al. 2017; Dupuis and Pichora-Fuller 2014; Nencheva et al. 2020). Numerous studies have addressed this topic focusing on the special speech register that human caregivers use when addressing infants (hereafter infant-directed speech or IDS). Crucially, emotional intonation in this type of speech is prominent and effective in conveying communicative functions such as alerting, comforting, alarming, or disapproving (Fernald 1992; Trainor et al. 2000). Fernald et al. (1989) found that, compared to adult-directed speech (ADS), IDS is characterized by higher values related to F0, shorter utterances, and longer pauses. The authors found that this result applies to six different language groups (American English, British English, Japanese, German, Italian, and French), suggesting that voice modulation in IDS is shared across human societies. In addition, expanded pitch contours and longer vowel duration in IDS, compared to ADS (Andruski and Kuhl 1996; Fernald and Simon 1984; Kuhl et al. 1997), favor infants’ discrimination of vowel categories (de Boer 2005; Trainor and Desjardins 2002; Werker et al. 2007). Similarly, voice onset time (VOT) in stops (namely, /b/, /d/, /t/, /g/, /k/) is longer in IDS than in ADS (Englund 2005). These findings are corroborated by research showing that 7–8-month-old infants are better at recognizing words spoken in IDS compared to words spoken in ADS (Singh et al. 2009).

Generally, in speech, and, in particular, in the case of IDS, spoken sound identification occurs within sentences, hence, within compositional structures. Indeed, a higher order of language processing at which emotional intonation may play a key role consists of producing and parsing well-formed connections between spoken words or phrases, according to compositional rules. To my knowledge, the effect of emotional intonation on these processes has never been investigated directly. In contrast, extensive research has focused on linguistic prosody, i.e., the prosodic structure of utterances that is used, for instance, to recognize words within sentences, to emphasize a particular word in a sentence, or to distinguish a command from a statement or a question (Cutler et al. 1997). Studies suggest that linguistic prosody has a crucial role in bootstrapping syntax comprehension and in marking the beginning and the end of a phrase (Soderstrom et al. 2003). In addition, Gussenhoven (2002, 2016) has addressed anatomical and physiological factors affecting the use of voice intonation to mark questions, utterance start/end, topic continuity, or focus in speech. For instance, he suggested that high pitch typically signals the beginning of an utterance, and low pitch signals its ending. This follows from the fact that, when starting an utterance, the subglottal air pressure in the speaker is higher than towards its end. In accordance with these studies, previous research has shown positive effects of emphasized prosodic features on language comprehension (Klieve and Jeanes 2001) and on the perception of interrogative forms (See et al. 2013) in children with hearing deficits.

The Human Ability to Associate Words with Meanings: Facilitating Effect of Emotional Intonation

A critical aspect of language to consider within the present research framework is the ability to associate arbitrary sequences of vocal sounds (i.e., words) with meanings. Notably, word-meaning association is an essential part of word learning, where categorical, conceptual and social factors come into play (Waxman and Gelman 2009). One of the most efficient paradigms to investigate word-meaning association is the cross-situational word learning paradigm, where participants are exposed to a series of visual images containing a target referent, while hearing a target word that always co-occurs exclusively with the corresponding referent (Yu and Smith 2007). Research applying this paradigm to an artificial language learning experiment suggests that marking a target word with IDS-typical exaggerated F0 contours benefits the learners’ ability to associate the target word and the target visual referent into a word-meaning pairing (Filippi et al. 2014, 2017c). This research is consistent with previous work suggesting that IDS-typical F0 prominence facilitates word-meaning mapping in preverbal infants (Ma et al. 2011; cf. Fernald and Mazzie 1991). In addition, much work has focused on the relative prominence of emotional intonation and lexical content within a task of emotional meaning identification. For instance, in a recent study, Filippi et al. (2017d) adopted a Stroop task in which participants had to identify the meaning of an emotional word by either focusing on emotional intonation (hence, ignoring lexical content) or the other way around, by focusing on lexical content, while ignoring emotional intonation. In this task, the two channels can be congruent, as in the case of the word “happy” spoken with a happy intonation, or incongruent, as in the case of “happy” spoken with a sad intonation. The authors found that, in the incongruent condition, when participants had to ignore emotional intonation and identify the emotional meaning focusing on lexical content, they were significantly less accurate than when they had to ignore lexical content and identify the emotional meaning conveyed by intonation. These findings are echoed by multiple studies reporting the higher salience of emotional intonation over lexical units also at a brain level (Schirmer et al. 2002; Schirmer and Kotz 2006). The attested prominence of emotional intonation over lexical content corroborates the hypothesis that the ability to process emotional content through voice intonation is older than phonetic processing, and might have favored its emergence. Within this research framework, Aryani and Jacobs (2018) addressed the interaction between semantic content and phonemes iconically associated with high emotional arousal, for instance plosives or hissing sibilants (as assessed in Aryani et al. 2018). The authors found that words where semantic content and constituent phonemes (e.g., the plosive consonant /k/ in “Krieg” [war]) are congruent in the expression of arousal are processed faster and more accurately. These findings are consistent with further studies showing facilitating effects at a neural level, provided by the interaction between phonemes and semantic content in emotional word processing tasks (Aryani et al. 2019).
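
As a rough illustration of the logic behind the cross-situational word learning paradigm mentioned above (Yu and Smith 2007), the following Python sketch tallies how often each heard word co-occurs with each visible referent across ambiguous trials, and then pairs every word with its most frequent referent. The pseudo-words, the referents, and the function name `learn_cross_situational` are hypothetical and chosen only for illustration. Simple co-occurrence counting is just one possible model of the learner; it is not the procedure used in the cited studies, and it does not model any effect of intonation.

```python
from collections import defaultdict


def learn_cross_situational(trials):
    """Pair each word with the referent it co-occurs with most often.

    `trials` is a list of (words, referents) tuples. Within a single
    trial the word-referent mapping is ambiguous; it can only be
    resolved by accumulating evidence across trials.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for words, referents in trials:
        for word in words:
            for referent in referents:
                counts[word][referent] += 1
    # For each word, keep the referent with the highest co-occurrence count.
    return {word: max(refs, key=refs.get) for word, refs in counts.items()}


# Hypothetical trials: two pseudo-words heard while two objects are shown.
trials = [
    (["bosa", "gasser"], ["dog", "ball"]),
    (["bosa", "manu"], ["dog", "cup"]),
    (["gasser", "manu"], ["ball", "cup"]),
]

lexicon = learn_cross_situational(trials)
for word, referent in lexicon.items():
    print(word, "->", referent)
```

In this toy input no single trial identifies the referent of a word, yet each word co-occurs twice with one referent and only once with the others, so the mapping can be recovered from the accumulated counts.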


Taken together, the studies reviewed here are in line with the hypothesis that emotional intonation scaffolded the emergence of the following core abilities involved in language: phoneme identification and production (within compositional structures) and word-meaning association. In support of this hypothesis, firstly, I reviewed studies tracing the presence of simpler forms of these abilities across a variety of vocalizing animal species. These studies shed light on the evolutionary precursors of the corresponding abilities involved in language, and on the selective pressures driving their emergence. Secondly, I described studies providing compelling evidence of the use of emotional intonation across multiple animal species, including humans. Consistent with this literature, research indicates that emotional stimuli activate the most evolutionarily ancient components of the brain that are shared between humans and animals (Dolan 2002; Phelps and LeDoux 2005; Seymour and Dolan 2008). The link between emotional communication in animals and the linguistic abilities reviewed here is provided by research indicating the enhancing effect of emotional stimuli on cognitive processes involved in language, namely, perception, selective attention, memory and learning in humans (Dolan 2002; Storbeck and Clore 2008). Hence, it is plausible that emotional intonation might have boosted the evolution of abilities involved in language from comparable ones found in animal communication systems. This line of argumentation is further corroborated by evidence that emotional intonation facilitates the perception of phonemes and spoken word-meaning association in human infants and adults (reviewed above).

This review is in accordance with previous studies that, building on Darwin’s work (1871), and within a comparative approach to animal communication, suggest that the expression of emotion through prosodic (or “musical”) voice modulation set the stage for the emergence of language (Altenmüller et al. 2013; Brown 2017; Darwin 1871; Filippi 2016; Filippi and Gingras 2018; Filippi et al. 2019; Panksepp 2009; Thompson et al. 2012). Furthermore, the present work extends previous research on the physiological foundations of the use of voice intonation for linguistic purposes (Gussenhoven 2002).

Importantly, this review supports previous research suggesting that the dynamics underlying language evolution can be investigated by integrating empirical evidence from multiple disciplines (Fitch 2017). Within this approach it is beneficial to explore language as a complex ability made of core components whose evolutionary dynamics can be investigated separately, rather than as a monolithic block. In particular, this work opens avenues for the empirical investigation of specific research questions I plan to address in follow-up studies, namely, whether emotional intonation facilitates the following abilities in animals and preverbal humans: (1) producing and identifying phonemes; (2) processing and learning compositional rules in vocal utterances; (3) associating unfamiliar spoken words with their meaning.

In order to enhance our understanding of the evolution of language (which is an intrinsically multimodal system; Levinson and Holler 2014), studies on the vocal and emotional origins of language need to be integrated with research on primates’ abilities for gestural communication (Meguerditchian et al. 2013). Furthermore, it is crucial to bridge this research with investigations on the role of time-coordinated interactions in the emergence of language processing abilities (Filippi 2016; Filippi et al. 2019; Levinson 2016; Ravignani et al. 2019), and on the evolution of pragmatic abilities such as, for instance, theory of mind (Fitch et al. 2010; Scarantino 2018). Finally, within this research framework, comparative work on animal communication abilities needs to be connected to studies tracking language evolution in the hominin line (e.g., Blasi et al. 2019).

To conclude, I would like to highlight that emotional intonation is strongly connected to the social dimension of linguistic communication (Sander et al. 2005), and might have driven the typically human “urge” to create socio-emotional bonds and share information with conspecifics (Fitch 2010). Hence, by investigating emotional communication as a communication code widely used across animal species, and which may have been critical in fostering the emergence of spoken language, we ultimately begin to elucidate a fundamental dimension of humans: the species-specific drive for interpersonal communication.