Like language, music is a distinctive behavioral trait of humans. However, the current understanding of the role of music in shaping human evolution, and of the origins of music itself, remains far from clear, in contrast to what is known about the contribution of language (but see Honing, 2019; Perlovsky, 2017; Schulkin, 2013; Tomlinson, 2015; Wallin et al., 2000, for some hypotheses). At the same time, notable parallels exist between the structural and functional properties of music and language (see Jackendoff, 2009, for a useful review), to the extent that some authors have argued in favor of their common evolutionary origins (Brown, 2000; Harvey, 2017; see de Boer & Ravignani, 2021, for a recent critical view). In this paper, we wish to substantiate this view with a new model that builds heavily on current findings and methodologies of evolutionary linguistics. Just as language types emerged throughout human history as humans became more tolerant and prosocial, following a steady reduction in reactive aggression (Benítez-Burraco & Progovac, 2020), music acquired a diverse typology, complexity, and functionality that accompanied its global spread.

We start by reviewing the commonalities between music and language, one of the very few available sources for reconstructing the evolutionary prehistory of both. Next, we outline the self-domestication hypothesis of human evolution and explain its benefits for modeling the evolution of music and language. Finally, we discuss possible ways in which music interacted with language during their parallel development.

Music and Language: Common Evolutionary Roots

Overall, existing hypotheses about the origins of music fall into two general classes. The first one regards music as a by-product of the extended use of some preexisting biologically important capacity, such as vocal signaling, sound imitation, auditory analysis, motor coordination, problem solving, and linguistic communication. The second class of theories claims that music was selected for some evolutionary advantage(s).

Music shares many commonalities with language. Both feature numerous functions, typologies of complexity, and a pronounced evolutionary continuity with the cognitive and communicative abilities of other species. In other words, many characteristic traits of human musical and linguistic communication can be traced in animal communication (see Corballis, 2020; Cowley & Kuhle, 2020; Pereira et al., 2020, for recent views).

We shall list the most important similarities between language and music. First, a number of structural parameters of music—pitch, rhythm, meter, tempo, dynamics, articulation, and timbre—are also exploited by language (Besson & Schön, 2001; Filippi et al., 2019; Heffner & Slevc, 2015; Patel, 2003; Rohrmeier et al., 2015; Slevc, 2012). For example, pitch changes are used to distinguish different words in tonal languages, or different sentence types, as in prosodic intonation (see Nikolsky & Benítez-Burraco, 2022, for ample discussion).

Second, music parallels language in many of its common functions:

  1. the expressive function, especially conveying emotions (Altenmüller et al., 2013; Cook, 2002; Eerola & Vuoskoski, 2013; Gabrielsson & Juslin, 2003; Johnson-Laird & Oatley, 2010; Juslin, 2005, 2011, 2013; Krumhansl, 2002; Mohn et al., 2010; Nikolsky, 2015a; Panksepp & Trevarthen, 2009; Peretz, 2013; Perlovsky, 2012; Schiavio et al., 2017; Trainor, 2010; van Goethem & Sloboda, 2011),

  2. the phatic function, in other words, reinforcing interpersonal and social bonding (Boer & Fischer, 2012; Clarke et al., 2015; Clayton, 2016; Cross, 2009; Dunbar, 2012a, b; Harvey, 2017, 2020; Mehr et al., 2021; Savage et al., 2020; Trevarthen, 2002),

  3. the conative function, in other words, calling to action (Karl & Robinson, 2015; Kühl, 2011; Leman, 2009; Liszkowski et al., 2012; Mehr et al., 2021; Monelle, 2006; Nazaikinsky, 2013; Rodman & Rodman, 2010; Tagg, 2012; Tarasti, 1998; Vuust & Roepstorff, 2008), and

  4. the mnemonic function, that is, memory conservation (Belfi et al., 2015; Boer & Fischer, 2012; Janata et al., 2007; Levitin, 2019; Nikolsky, 2016b; Tamm, 2019; van Dijck, 2006; Will, 2004).

Third, as with languages, all human cultures have developed different music systems to support important musical behaviors that fulfill specific social and psychological roles. The form-function links between language and music remain quite stable across various cultures and societies. Although Western ethnomusicologists have tended, over the past 40 years, to deny the global universality of specific structural patterns of pitch and rhythm organization, their stance seems to be driven by political reasons, mainly fear of a Eurocentric bias in conducting scientific comparative study of the world’s music traditions (Blacking, 1977; Gourlay, 1984; Hood, 1977; List, 1971, 1984; Nattiez, 2012; Supičič, 1983).Footnote 1 The arguments for the nonexistence of musical universalities all rest on the absence of specific higher-order combinatorial patterns in certain music cultures; they do not address the omnipresence of certain basic principles of music-making (Brown & Jordania, 2013; Fitch, 2017; Grauer, 1996; Justus & Hutsler, 2005; Kolinski, 1978; Lomax, 1977; McAdams, 1989; Nketia, 1984; Savage et al., 2015; Tagg, 2012; Verhoef & Ravignani, 2021).Footnote 2

A number of common elementary “surface-level” music constructs are virtually omnipresent across the globe and rely on perceptual mechanisms that are already active immediately after birth:

  1. In practically every music culture, listeners recognize musical sounds as more pleasant than other types of sounds and are eager to listen to them for a long time, over and over again (Alworth & Buerkle, 2013; Fitch, 2006; Granot, 2017; Hefer et al., 2009; Lots & Stone, 2008; Nieminen et al., 2011; Salimpoor & Zatorre, 2013; Schubert, 2009; Snowdon, 2021; Watanabe, 2008).

  2. Listeners distinguish pleasant (consonant) from unpleasant (dissonant)Footnote 3 simultaneous combinations of musical sounds and only vary in judging which specific combinations are considered “consonant” versus “dissonant” (Bidelman & Krishnan, 2009; Brandl, 2008; Cazden, 1959, 1972, 1980; Lots & Stone, 2008; McPherson et al., 2020; Messner, 2006, 2013; Schellenberg & Trehub, 1996; Tenney, 1988; Terhardt, 1974b).

  3. Listeners distinguish melodic steps from leaps (Alekseyev, 1986; Bendixen et al., 2015; Bregman, 1994; Larson, 1997; Nazaikinsky, 1977; Rags, 1980; Sievers et al., 2013; Stefanics et al., 2009; Tiulin, 1937; van Noorden, 1975).

  4. Listeners distinguish regular integer-ratio rhythms from irregular rhythms (Arom, 2006; Brown & Jordania, 2013; Drake, 1998; Drake & Bertrand, 2001; Fitch, 2012; Fraisse, 1982; Jacoby et al., 2021; Monahan, 1993; Pressing, 1983; Ravignani et al., 2016).

  5. Listeners distinguish binary metric groups from ternary (Abecasis et al., 2005; Bergeson & Trehub, 2006; Clayton, 2000; Fraisse, 1982; Iyer, 1998; Jacoby et al., 2021; London, 2004; Monahan, 1993; Potter et al., 2009; Temperley, 2009).

  6. Listeners distinguish fast tempi from slow (Baruch & Drake, 1997; Collier & Collier, 2007; Dalla Bella et al., 2001; Ellis, 1992; Fraisse, 1982; Levitin & Cook, 1996; McAuley, 2010; Trainor et al., 2004; van Noorden & Moelants, 1999).

  7. Listeners experience music as virtual movement of a certain character, analogous to physical motion, but in imaginary space, formed by the alternation of tension-inducing and relaxation-inducing structures (Fraisse, 1982; Friberg & Sundberg, 1999; Iyer, 1998; Jackendoff & Lerdahl, 2006; Larson, 2012; Larson & McAdams, 2004; Larson & Vanhandel, 2005; Nazaikinsky, 1988; Nikolsky, 2015b; Rothfarb, 1988).

  8. Listeners use no more than 12 pitch-classes (most commonly 5–7 of different sizes) and employ logarithmic incrementation to distinguish between them within a pitch-set (Balzano, 1980; Beliayev, 1990; Brown & Jordania, 2013; Gill & Purves, 2009; Honingh & Bod, 2011; Jacoby et al., 2019; Korsakova-Kreyn, 2013; Mazel, 1952; McAdams, 1989; McBride et al., 2022; Sethares, 2005; Shepard, 2010), a practice that possibly shares roots with linguistic prosody (Fenk-Oczlon, 2017; Kolinsky et al., 2009; Schwartz et al., 2003; Terhardt, 1984).
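The logarithmic spacing mentioned in the last item can be illustrated with a small calculation. The sketch below is not taken from the studies cited; it assumes the conventional cent measure (1,200 cents per octave) and a hypothetical helper, `interval_in_cents`, to show that equal frequency ratios correspond to equal steps on a logarithmic scale, whatever the register.

```python
import math


def interval_in_cents(f_low: float, f_high: float) -> float:
    """Size of the interval between two frequencies (Hz) in cents.

    Illustrative helper (not from the source): 1200 cents = one octave,
    so the result depends only on the frequency ratio.
    """
    return 1200.0 * math.log2(f_high / f_low)


# The same 2:1 ratio spans the same logarithmic distance in any register.
low_octave = interval_in_cents(110.0, 220.0)
high_octave = interval_in_cents(880.0, 1760.0)

# Dividing the octave into 12 equal logarithmic steps gives 100 cents each.
semitone = interval_in_cents(440.0, 440.0 * 2 ** (1 / 12))
```

Twelve-tone equal temperament is used here only as a familiar case; the claim in the text concerns logarithmic incrementation in general, not any one tuning system.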

Numerous experimental studies suggest the existence of universal cross-cultural patterns of musical communication (Argstatter, 2016; Balkwill & Thompson, 1999; Egermann et al., 2015; Fritz et al., 2006, 2009; Juslin & Laukka, 2003; Kwoun, 2009; Laukka et al., 2013; Sievers et al., 2013; Smith & Williams, 1999; Stevens & Byron, 2009; Trehub et al., 1993; Yurdum et al., 2022). This line of research is extremely important for evaluating the claims of Western ethnomusicologists and identifying the common biomusicological foundation that underlies the world’s music cultures.

Fourth, newborns show an innate predisposition to acquire music no less than language. Indeed, there is evidence that fetuses distinguish music from environmental sounds during the last months of gestation, and newborn infants even remember music they were exposed to during gestation (Parncutt, 2016). Acquisition of music occurs implicitly, even in the absence of formal training (Rohrmeier & Rebuschat, 2012). Infants routinely learn multiple music systems just as they learn multiple languages used by their caretakers (Wong et al., 2009). The development of musical skills in childhood seems to proceed in the direction of building new culture-specific skills of identifying culturally important conventional patterns of musical sounds (e.g., “chords” and “keys”), based on the biologically ingrained foundation of synesthetic perception of musical pitch, rhythm, timbre, and dynamics (see the discussion in Nikolsky, 2022). Ontogenetically, this line of development, from implicit “natural” (onomatopoeic) learning, which is therefore cross-cultural and general, to explicit “cultural” (convention-based) learning, is not that different from linguistic acquisition (Berry et al., 2002; Dasen, 2012; Greenfield et al., 2003; Johnson & White, 2020; Kidd et al., 2018; Monaghan et al., 2014). This emergence of “cultural” forms of musicking from “natural” forms must be responsible for the significant correlation between the geographic distribution of specific genetic variations and specific folk music traditions, as revealed by recent studies (Brown et al., 2013; Le Bomin et al., 2016; Pamjav et al., 2012).

Finally, music perception and production rely on specific brain circuits, the impairment of which leads to distinctive, music-specific deficits (i.e., amusia) (Perrone-Capano et al., 2017; Reybrouck et al., 2018; Stewart et al., 2006; Tillmann et al., 2015; Vuust et al., 2022). This substrate overlaps extensively with the substrate implicated in language impairments, specifically in syntax processing (Asano, 2022; Brown et al., 2006; Harvey, 2017; Sun et al., 2018; but see Chen et al., 2021, for an opposing view).

Overall, just as one can argue for a human linguisticality, that is, the set of capacities that enable humans to learn and use languages in all their diverse forms (after Haspelmath, 2020), one can argue for a human musicality, understood as an innate predisposition to perceive and create music, encompassing all the perceptual, cognitive, and behavioral aspects of music. Our contention here is that these parallels can also be extended to the evolutionary domain. Retaining the parallel with language(s), music should in no way be regarded as a recent cultural invention.Footnote 4 Musicality must be an ancient capacity that has manifested in different types of music along the long evolutionary pathway of Homo sapiens, reflecting the milestones in the cultural evolution of our species, as well as important cognitive and behavioral changes.

In view of the similarities reviewed above, some scholars (most notably, Brown, 2000) have suggested that language and music might share common evolutionary roots. However, as noted by Cross and colleagues (2013), even if this were the case, there are several likely scenarios of their emergence: music developing from language (Spencer’s view), language emerging from music (Darwin’s view), or language and music splitting from a common musilanguage (Brown’s view) and afterwards following different, but still related (and perhaps interacting), trajectories (Harvey, 2020). In this paper, we propose a new model of the evolution of music that adheres to the last of these possibilities.

What Music Functions Can Tell About the Evolution of Music

Much as for language, one can think of diverse functions for which music might have been selected, and even estimate a timeline for the selection of each type of function. Most of the functions of music mentioned in the previous section can be characterized as “external” to the subject and thus fulfill some social role: for example, (1) the establishment and consolidation of social bonds within human groups (Dunbar, 2012a, b; Harvey, 2017, 2018; Savage et al., 2020) and (2) the conveyance of credible information to others either for signaling mate quality (e.g., Merker, 2000; Miller, 2000) or for coping with progressively complex social conflicts of interest (Mehr et al., 2021).

Nonetheless, an “internal” role for music has been hypothesized as well, such as Perlovsky’s (2017) view of music as a tool for overcoming unpleasant emotions that result from our interaction with the environment. Often, “external” functions of music, most notably those related to social bonding, impact the “internal” state of a subject by influencing the stress-response systems or the reward systems (see Dunbar, 2012a, b; Harvey, 2020; Savage et al., 2020, for discussion). Accordingly, it is not an easy task to infer an evolutionary path for these functions.

One promising approach is to cross-examine the codependencies between the most common music functions, based on the music skills required to process the music structures that characterize each of these functions.Footnote 5 Like language, music is structurally determined by the functions it regularly performs (listed below). Once forged, such structures, in turn, start supporting and conserving the function that shaped them.Footnote 6 As a result, these formative functions form complex dependencies whereby one function cannot operate unless another function is accessible.Footnote 7 More importantly, some functions build the foundation for others, supporting new modes of interaction with the physical and, particularly, the cultural environment.

In a recent paper (Nikolsky & Benítez-Burraco, 2022) we presented a thorough reconstruction of the entire chain of dependencies of the most common formative music functions, tracking them down to the primordial hedonistic function that underlies all others. We identified 14 operational functions in the recent research literature (Bispham, 2018; Boer & Fischer, 2012; Brown, 2005; Clayton, 2016; Dissanayake, 2005; Levitin, 2019; Perlovsky, 2014; Savage et al., 2015; Schäfer et al., 2012; Schäfer & Sedlmeier, 2009; Stefanija, 2007; Trevarthen, 2009; van Goethem & Sloboda, 2011):Footnote 8

  • hedonistic stimulation (make music or listen to it to experience pleasure),

  • emotional communication (make or listen to music that expresses one’s current emotional state or characterizes a state of a third party),

  • emotional regulation (make or listen to a selected type of music to maintain a desired emotional state or to change an undesired one),

  • compliance to norms (ritualizing one’s behavior and organizing one’s feelings and goals in accordance with some ideal, collective task, or belief),

  • recreation (entertain an individual or a social group by doing something not totally predictable, such as improvising, exploring a new instrument, or playing some singing/vocalization games),

  • interpersonal bonding (secure close relations with another individual or a social group by sharing a musical experience with them),

  • coalition status display (publicly display one’s membership in a specific social group or project and affirm a wish-to-be social identity),

  • physical aid (support a specific pattern of physical motion in one’s daily work, play, or workout, collective or solitary),

  • learning aid (stimulate the discovery of new things and help remember important information, as in children’s learning songs),

  • contemplating an event (evoke the imagery of an important occasion, holiday, season, sporting event, place of interest, landmark, or monument),

  • calling to action (music signaling, as in military bugle signals or herding calls—i.e., supporting language-like commands—and the creative use of such semiosis to entertain the audience, as in “program music”),

  • conservation of memories (preserve a valuable memory for an individual and their close family/friends, usually nostalgic, and maintain one’s mental integrity under pressure),

  • self-promotion (exhibit one’s music faculties to increase confidence, self-esteem, and/or earn respect or show superiority),

  • personal profiting (earn money and/or fame by making music as a professional occupation).

We cannot dedicate much space to the discussion of these functions here and will only cover those points that directly relate to the evolution of music and language.

Figure 1 summarizes the codependencies that we established in our 2022 paper (see Nikolsky & Benítez-Burraco, 2022, for details). “Hedonistic stimulation” does not depend on any other function and is not only cross-cultural but also shared with a number of nonhuman species. Therefore, it is placed at the root. “Personal profiting” and “calling to action” do not support other functions. Therefore, they are placed at the top. The other functions are distributed in between according to their dependencies.Footnote 9

Fig. 1 Evolutionary development of operational functions of music. Fourteen operational functions are placed along two axes: temporal (vertical) and social (horizontal). The former (on the left, in pink) reflects the operational dependencies between all functions, which is generally representative of the ontogenetic pattern of acquisition of music skills throughout childhood. On the right (in purple), the corresponding phylogenetic line of development is outlined. The horizontal axis shows the gradual social expansion in the use of functions throughout childhood. The ellipsis after the name of a function indicates that this function keeps developing toward engaging a greater number of participants, the extent of which is reflected by the relative length of the surrounding box after the ellipsis. Black arrows show the derivative relations between functions. A blue rectangle at the bottom encloses functions that are undifferentiated from verbal communication and characteristic for the “musilanguage.” A green rectangle marks the functions that are differentiated from verbal communication but are not autonomous from it, representative of protomusic and earliest forms of “personal music.” Darker green distinguishes more biologically dependent functions from more culturally varied ones. A yellow rectangle encloses functions specific to music. Darker yellow distinguishes functions based on informal, orally transmitted, and implicit musical grammars from formally learned, notation-based, and explicit grammars.

Note that the lower-order functions form the succession that fits the pattern of acquisition of musical skills throughout childhood (see the discussion in Nikolsky, 2022). “Hedonistic stimulation” by music seems to be inborn and universal. It supports and enables the acquisition of every other music function. “Learning aid” capitalizes on the capacity of music to bring pleasure, connecting it to the disposition to learn new things and the mnemonic power of music (evident in the earworm phenomenon and the efficacy of music therapy in treating dementia). Multimodal interaction with the mothering figure, whose singing, motherese, touch, movements, and gestures altogether shape this “learning” function, teaches an infant the principles of communication. “Interpersonal bonding” emerges from the ongoing communication with caretakers, usually set by the mother and thereafter expanded to other close relatives. Based on the observed patterns of vocal communication, by the second year of life, infants engage in active musicking—in the form of solitary musical babbling, which introduces the “recreation” function. However, musical babbling remains very similar to verbal babbling. All four of these basic functions are engaged in verbal acquisition too.

Hours of dedicated practice of self-initiated vocalizations, accompanied by spontaneous physical movements, lead to the discovery of the expressive capacities of melodic leaps, steps, directionality, dynamics, and, eventually, rhythm and tempo. Infants learn melodic movement as they learn physical movement. Mastering melodic leaps and steps accompanies learning to walk. Thereby, music evolves into the “physical aid” function. Through solitary exploration of singing while moving, playing with toys, drawing, and so on, children discover that certain types of melodic motion suit certain types of physical motion. Specific musical patterns become associated with the affective characteristics of the accompanying locomotion and with the imaginary characters of toys and protagonists of drawings. From this point on, musical expression focuses on “emotional communication,” and verbal expression, on referential communication. However, both keep sharing the same functions: like music, speech conveys emotions, accompanies locomotion in play-games, and entertains (tongue-twisters, nursery rhymes).

Musical expression becomes autonomous from speech once children begin using the skills they have learned in emotional communication to control their emotional state: avoid negative emotions, bring themselves into a state required by a social situation, and so on. “Emotional regulation” opens the door to “compliance to norms,” as children begin learning ritual behaviors for different environmental settings. Music comes in handy in organizing “rituals” for collective activities (work songs, play songs, anthems, hymns, theme songs). Since the execution of such activities keeps involving a greater number of participants and increasing the distance of musical communication beyond the intimate space (typical for motherese), the tonal organization of music begins to acquire pitch orientation (see Nikolsky & Benítez-Burraco, 2022, for details). The emerging pitch patterns become more and more culture-specific, averaging the knowledge and preferences of the growing pool of participants in musicking.

Variety in learned musical rituals enables one to display one’s “coalition status” to a growing number of people and thereby demonstrate which norms one chooses to abide by. In this way, music preferences turn into something like a social “identity badge.” This function is exceedingly important among teenagers, laying the ground for another function, very important for adults: “conservation of memories.” Music patronized in youth is usually cherished throughout life and serves to maintain one’s integrity. The latter, in turn, becomes indispensable for “self-promotion.” Raising one’s self-esteem and earning respect through performance and patronage of sophisticated music requires stylistic consistency and adherence to the earlier established values.

“Self-promotion” can evolve into “personal profiting” for those who achieve technical proficiency and artistic integrity in music.Footnote 10 In modern societies, where music schooling supports music notation, reproduction, and wide distribution of music compositions, this option might be quite lucrative—in contrast to folk music cultures, where active musicking constitutes the norm. In turn, notation and formal schooling facilitate accumulation of knowledge and acquisition of basic arranging skills, especially valuable for the evolution of “conservation of memories” into “contemplating an event.” The latter supports the capacity to use music in reference to specific circumstances in their absence from the immediate environment (e.g., contemplating Christmas by listening, playing, or imagining the sound of carols during summer). Building a lexicon of music idioms to refer to many culturally important events (including foreign and exotic) leads to the acquisition of the most advanced musical function—“calling to action.” It supports the capacity to suggest affective states, characters, imagery, and attitudes by choosing and arranging suitable music structures and convincingly rendering them for the audience. This function is almost entirely based on cultural conventions and learning.
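The succession of functions described in the preceding paragraphs can be summarized as a small dependency graph. The sketch below is only an illustration: the edges are a simplified, linearized reading of the prose (each function is linked to the one the text presents it as growing out of), not a transcription of Figure 1, and the name `BUILDS_ON` is hypothetical. It uses the standard-library `graphlib` module (Python 3.9 or later) to derive an order in which every function appears after the ones it builds on.

```python
from graphlib import TopologicalSorter

# Each function maps to the function(s) it builds on, as read from the
# prose above (an assumed simplification; Figure 1 may contain more links).
BUILDS_ON = {
    "hedonistic stimulation": set(),
    "learning aid": {"hedonistic stimulation"},
    "interpersonal bonding": {"learning aid"},
    "recreation": {"interpersonal bonding"},
    "physical aid": {"recreation"},
    "emotional communication": {"physical aid"},
    "emotional regulation": {"emotional communication"},
    "compliance to norms": {"emotional regulation"},
    "coalition status display": {"compliance to norms"},
    "conservation of memories": {"coalition status display"},
    "self-promotion": {"conservation of memories"},
    "personal profiting": {"self-promotion"},
    "contemplating an event": {"conservation of memories"},
    "calling to action": {"contemplating an event"},
}

# A topological order lists every function after the ones it builds on.
order = list(TopologicalSorter(BUILDS_ON).static_order())

# Root: a function that builds on nothing.
roots = [name for name, deps in BUILDS_ON.items() if not deps]

# Top: functions that no other function builds on.
supported = set().union(*BUILDS_ON.values())
tops = sorted(name for name in BUILDS_ON if name not in supported)
```

Under these assumed edges, the single root is “hedonistic stimulation” and the two top-level functions are “calling to action” and “personal profiting,” which matches the placement described for Figure 1.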

Phylogeny Meets Ontogeny

This entire line of ontogenetic musical development finds a close match in phylogenetic development—after all, a cultural phenomenon can exist in no other way but through the successful transfer from one generation to another in quantities sufficient for its survival. The success of this transfer is largely determined by the psychophysiological limitations of a learning youngster and the ability of adult experts to cater to that person (see Nikolsky, 2020a, for thorough discussion). Hence, the infantile functions correspond to the musilanguage stage in the evolution of music and language. Both are characterized by the prevalence of personal and duetic settings—epitomized, respectively, in babbling and motherese. Presence of a diverse repertory of relatively well-structured signals, adopted as a standard to convey certain types of information, must have distinguished human musilanguage from animal communication. Longer altriciality and ever-growing capacity to accumulate knowledge must have promoted this diversification. The typical forms of preverbal interactions between infant and caregiver provide at least some idea of how the musilanguage systems might have been put into use by humans before the emergence of modern articulate speech and true human musicality (Harvey, 2017).

Differentiation of musical and verbal acquisition around the age of 3–5 years corresponds to the divergence of protomusic and protolanguage. Their separation from the preceding musilanguage probably occurred due to the discovery and appreciation of singing and metro-rhythm (likely discovered through entrained knapping during the collective manufacturing of stone tools). This stage can be characterized by the crystallization of protogenresFootnote 11 (forms of musicking developed to accompany collective hunting and repelling of predators or personal caretaking, such as mothering and grooming). The phylogenetic equivalent of the formation of a mother-child “microcosm” (and its further expansion into a “macrocosm” of friends and acquaintances) would be the emergence of a family nucleus and a significant reduction of aggression within it among early humans (followed by the expansion of this nucleus). The direct connection between the increase in attachment behaviors, so instrumental for the evolution of language and music, and the hormonal effects of the peptide oxytocin on music-related activities has been thoroughly discussed by Harvey (2020).

The next phase of ontogenetic development—learning to express musical emotions and to use them to optimize one’s state—marks the onset of a new phylogenetic stage of “protomusic” turning into “music,” fueled by the emergence of musical mode. The latter can be defined as a social convention for combining certain types of musical sounds into sets for expression of a particular topic. Musical modes are inseparable from musical genres: in virtually every folk music culture, each basic genre (e.g., lullaby)Footnote 12 supplies one or a few suitable characteristic musical modes (so that all applications of the same genre sound recognizably the same).

In the case of timbre-oriented music, musical modes are timbral: they join not pitch-classes but “timbre-classes” (as in jaw harp music; see Nikolsky et al., 2020). Timbre-matching has been reported in mother-infant communication (Malloch, 2000, 2004), driven by the instinct to adjust one’s vocalizations to those of an interlocutor. The emergence of this capacity likely occurred in the late Paleolithic and marked the birth of timbral music from protomusic.

The next evolutionary advance was the conversion of timbre-classes into pitch-classes and the transition from timbre- to pitch-orientation. Ontogenetically, this transition usually occurs at the age of 3–5 years through the practice of “objectivization” of pitch values in music, when salient pitch changes become associated with physical objects, qualities, and events based on the synesthetic connections between melodic motion and physical motion as observed by children in their environment (see Nikolsky, 2022).Footnote 13 The other important factors contributing to the emergence of pitch orientation are:

  1. long chains of folk-style person-to-person transmission (see Nikolsky & Benítez-Burraco, 2022, Chap. 5),

  2. spread of collective singing with the accompaniment of rhythmic and melodic musical instruments (Morley, 2013),

  3. concentration of people in the confined space of caves that became the preferred form of shelter toward the end of the Paleolithic (the reverberation converts melodic intervals into harmonic intervals by prolonging the “tails” of preceding melodic tones; Nikolsky & Benítez-Burraco, 2022),

  4. musicking at distances where listeners cannot discriminate between different timbre-classes, especially if the distance changes during the same session of musical communication (as in herding; see Nikolsky, 2020b).

The fifth phase of musical ontogenesis corresponds to the evolutionary stage when musical keys emerged from musical modes (first documented in ancient Greece; see Nikolsky, 2016c). The extensive use of keys within a particular music culture led to the formation of “tonality,” which came to replace “modality” (Nikolsky & Benítez-Burraco, 2022). In short, this stage is characterized by the adoption of standardized tuning, as defined by the practice of tuning musical instruments most important for a given musical culture. Musical keys “canonize” specific sets of pitch-classes, convenient for playing on the preferred musical instruments. Such sets become adopted for other musical instruments and vocals within the same music culture through the practice of mixed ensemble performance. With the advance of ensemble music and rise of formal music theory, a culture establishes an assortment of keys for conventional forms of expression across all the important genres, generating a “tonality”—a system of keys (i.e., a set of sets of pitch-classes). Western classical tonality constitutes just one particular case of tonality. Indian raga, Arabic maqam, Persian dastgah, or Chinese yuye each implement “tonality” in their own way, according to their cultural values.
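The description of a tonality as “a set of sets of pitch-classes” maps directly onto a nested data structure. The sketch below is illustrative only: it assumes the common convention of numbering pitch-classes 0 to 11 starting from C, and it uses two Western major keys purely as familiar examples. The names `C_MAJOR`, `G_MAJOR`, and `tonality` are hypothetical.

```python
# Pitch-classes numbered 0-11 starting from C (assumed convention).
C_MAJOR = frozenset({0, 2, 4, 5, 7, 9, 11})  # C D E F G A B
G_MAJOR = frozenset({7, 9, 11, 0, 2, 4, 6})  # G A B C D E F#

# A "tonality" in the sense used above: a set of keys, each key being
# a set of pitch-classes. frozenset is hashable, so keys can be nested
# inside an ordinary set.
tonality = {C_MAJOR, G_MAJOR}

# Pitch-classes the two example keys have in common, and those in which
# they differ.
shared = C_MAJOR & G_MAJOR
differing = C_MAJOR ^ G_MAJOR
```

The same structure would accommodate other systems mentioned in the text, with different pitch-class sets and, where the tuning requires it, a different way of labeling pitch-classes.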

If the musilanguage and protomusic stages are characterized by cross-cultural uniformity, since they rely mostly on the innate forms of encoding information into auditory signals (what we call anthropophonic and onomatopoeic intonation types), the tonality stage exhibits maximal cultural diversity and minimal universality of the musical expressive means. That is, the task of comprehending music created within a tonality system absolutely requires a listener to learn the conventions of the corresponding music culture. In contrast, comprehension of the earliest forms of tonal organization of music can rely on the biomusicological universalities and synesthetic environmental associations. The challenge of conducting a comparative study of music (synchronic and diachronic) is that each musical function, once established, remains accessible, supporting the higher-order functions, while becoming adjusted to the broader user-base. Functions, as well as music genres and traditions that rely on such genres, do not become replaced by newer functions, genres, and traditions, but accumulate, disappearing only after a prolonged absence of any use.Footnote 14

For example, the foundation for the “physical aid” function is prepared by the mother moving the infant’s limbs in concert with her motherese talking and singing. Embodied in this way, sound patterns are further explored by a child during sessions of solitary babbling, accompanied by spontaneous self-induced locomotion. The discovered correspondences between melodic and physical motion are further explored in singing that accompanies solitary playing with dolls, toys, and in drawing, where each character receives a dedicated musical pattern. As the child grows up, such games start involving playmates and including nursery rhymes, ditties, and popular songs, rearranged for each instance of application. As children learn the assortment of patterns of various musical movements, they can participate in work-songs and other music-based activities together with adults (which is exceedingly common in traditional societies). In modern urban culture, teenage children rapidly advance to the stage of mass consumption of music—they switch from active performance typical for earlier childhood to passive listening and learn to select music for background listening while doing something (e.g., during physical exercise). This way, the initially personal use of melodic motion ends up expanding to involve up to thousands of participants (e.g., a session of rhythmic gymnastics streamed over the internet) as the “physical aid” function passes through developmental rounds with a broadening user-base. Similar development must have taken place in the cultural evolution of music as human societies grew in size and complexity, and music was put to the service of a greater number of users.

The most important takeaway from the variability of musical functions and their cumulative nature is that any analysis and comparison of music should involve the entirety of relevant musical functions, their structural implementation, and the quantity of their users.

The Formative Power of Cultural Transmission on Music and Language

To add a final piece to the evolutionary puzzle, we need to point out that cultural transmission per se exerts a formative power over music structures—just as verbal structures are shaped by transmission chains. For instance, Lumaca and Baggio (2017) experimentally demonstrated how transmission altered pitch and rhythm aspects of the transmitted pattern, resulting in diatonization of the initial model—in other words, chromatic semitones being systematically replaced by diatonic whole tones, thereby increasing music’s compliance with conventional keys. The formative power of transmission goes as far as to transform ekmelic intonations—gradual changes in pitch and indefinite pitch values (like pitch contours of spoken sentences)—into emmelic intonations (incremental changes in pitch with definite pitch values) at the end of a transmission chain (Verhoef, 2012; Verhoef et al., 2014).

Discretization and diatonization seem to occur because the transmitter tends to complicate a specific pattern in an attempt to increase its expressivity, whereas the receiver tends to simplify it for the sake of easier learning (Kirby et al., 2015). This trade-off eventually results in the increased compressibility of the encoding and the regularization of the variables. The longer the transmission chain, the stronger the effect. Iterated learning generates natural selection for optimal acoustic distinctiveness, supporting the transformation of non-combinatorial signals into combinatorial signals (Zuidema & de Boer, 2009). The same process is at work in linguistic and musical transmissions: each receiver intuitively strives to minimize entropy while learning a structure, which promotes compression of information and the emergence of compression regularities, thereby generating grammars (Tamariz & Kirby, 2016). Here, yet another peculiarity of transmission comes into play—each new learner tends to bring into uniformity those structures that just slightly differ (Smith & Wonnacott, 2010). This leads to crystallization of grammatical rules.
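The regularizing effect of a transmission chain can be illustrated with a toy simulation. The sketch below is purely illustrative and is not a model taken from the studies cited above: it assumes a pattern with two competing variants, and it assumes that each learner slightly exaggerates whichever variant was heard more often. The function names and the `bias` exponent are hypothetical choices made for this example.

```python
import math


def regularize(share, bias=1.5):
    """One learner: map the share of variant A that was heard to the share
    that the learner will produce. A bias above 1 (a hypothetical parameter)
    exaggerates whichever variant was already more frequent."""
    a = share ** bias
    b = (1.0 - share) ** bias
    return a / (a + b)


def entropy(share):
    """Shannon entropy, in bits, of a two-variant distribution."""
    return -sum(q * math.log2(q) for q in (share, 1.0 - share) if q > 0.0)


def transmission_chain(initial_share, generations, bias=1.5):
    """Pass the pattern down a chain of learners and record the share of
    variant A produced at every generation (the starting share included)."""
    shares = [initial_share]
    for _ in range(generations):
        shares.append(regularize(shares[-1], bias))
    return shares


# A pattern that starts out variable (60% variant A, 40% variant B).
chain = transmission_chain(0.6, 10)
entropies = [entropy(share) for share in chain]

for generation, (share, bits) in enumerate(zip(chain, entropies)):
    print(f"generation {generation}: share of A = {share:.6f}, entropy = {bits:.6f} bits")
```

Under these assumptions, the majority variant takes over and the entropy of the distribution falls with every generation, and a longer chain produces a stronger effect. A perfectly balanced starting point (a share of 0.5) stays balanced, which is a limitation of this deterministic toy rather than a claim about real learners.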

A number of scholars have denied that music has grammar, meaning, and compositionality. The reasons for this are numerous:

  • confusion over the typology of music functions and uses,

  • disregard for music structures and analysis of music form, common among Western ethnomusicologists,

  • absence of a general definition of music and disinterest in coining it,

  • demise of comparative ethnomusicology in the West after WWII for political reasons, and

  • a pronounced Eurocentric bias among many Western cognitive scientists and developmental psychologists who hold Western classical music as the universal or ultimate model of tonal organization.

Nonetheless, what tells music apart from other auditory phenomena, we believe, is music’s overall orientation toward putting the listener in a specific premeditated emotional state and keeping them in that state for an extended period of time—and doing this repeatedly, so the same type of sonic material becomes associated with a specific type of semantic content by means of public convention (see Nikolsky, 2015a, 2020b; Nikolsky & Benítez-Burraco, 2022). [Footnote 15] We realize that the idea of tying the concept of music to emotion appears unattractive to many scholars with a background in classical music composition, performance, and music history, ever since Stravinsky and the post-WWII avant-garde won critical acclaim in Western academia and among prestigious cultural philanthropic organizations. [Footnote 16] However, any attempt to tweak the general definition of music in order to incorporate the latest short-lived (just a century long) development of just one music tradition (albeit a very important one) is methodologically wrong (generalizing on a sample size of one). We cannot name a single non-Western musical tradition that abstains from using musical emotions and musical genres (which usually serve to assign affective qualia to specific music structures, generating convention-based semiosis in music). [Footnote 17]

Morphologically, music closely follows language in employing both combinatoriality and compositionality, although, as noted above, there is some controversy as to whether music syntax is processed in the same cortical regions as language syntax. Music combines many meaningless elementary units—pitch-, rhythm- and timbre-classes, metric beats, and voices in texture—to generate meaningful morpho-syntactic units, such as motifs, chords, rhythmic figures, metric groups, and textural components (e.g., accompaniment, counter-melody, pedal tone) that carry certain semantic values (sighing motif, sad chord, bouncing rhythmic figure, leisurely swaying ternary meter, stiffening pedal tone, etc.). These morpho-syntactic units are conjoined according to a set of rules that distinguish each musical tradition, enabling listeners to identify a tradition by ear (Nazaikinsky, 1982). For instance, in Gregorian plainchant, melodic leaps, regular meters, the so-called dotted (or “punctured”) rhythms, chords, and chromatic alterations are to be avoided altogether (Ferreira, 1997), whereas in Western military march music they are encouraged (Monelle, 2006). Mastering such traditions requires apprenticeship so a layperson can learn their compositional principles.

Historic ethnomusicology testifies to the fundamentality of compositional organization of music. Western, Arabic, Persian, Indian, and Chinese classical music traditions each feature hundreds if not thousands of treatises on music composition. [Footnote 18] Western compositional music theory is rooted in the ancient Greek theory of rhetoric, understood as the craft and science of bringing the audience into a specific emotional disposition (Bartel, 1997; Bonds, 1991; Harrison, 1990; Kallberg, 1988; Keller, 1973; Mabbett, 1990; Meier, 1990; Vickers, 1984; Zakharova, 1983). [Footnote 19] The musical implementation of rhetoric occurred initially through the liturgical practice of composing sermons and supporting the verses required for liturgy with music (Murphy, 1981), but by the eighteenth century the theory of musical rhetoric firmly held its ground in purely instrumental and secular forms of music (see Mattheson & Harriss, 1981). Other musical cultures featured their own pathways of developing musical rhetoric (see Dorchak, 2016; López-Cano, 2020; Powers, 1980; Rink, 1989; Smith, 1971; Theodosopoulou, 2019), including such a recent development as composing music for advertising (Scott, 1990).

Chain transmissions inherently introduce and magnify cultural biases in music structures and combinatorial and compositional rules since different cultures favor different structural features in response to culturally dependent factors, such as popularity and social prestige. The same applies to the domain of speech (Verhoef et al., 2014). More generally, experiments involving artificial languages suggest that the cultural transmission of linguistic structures promotes compressible regularities, combinatorial rules, and compositionality (Kirby et al., 2015; Tamariz & Kirby, 2016). The analysis of sign languages spontaneously developed by isolated deaf populations also suggests that some basic properties of language (such as duality of pattern) are lacking at the beginning of transmission and emerge gradually as a result of increased interactions between signers (Dachkovsky et al., 2018; Sandler et al., 2005).

In the case of music, it is more difficult to identify “idiomatic” structures (i.e., music lexicons of specific music-user communities) and combinatorial rules (i.e., conventional music grammars). The reason for this might be the growing prevalence of “tree-like” transmission—in other words, chain-like passing of a music work from a person to a group (Nettl, 2005). [Footnote 20] “Tree-like” transmission tends to replace folk-style “linear,” person-to-person transmission as notation, formal music theory, and professional forms of public performance begin to obtain a greater share in a musical culture. Notation and theory substantially aid learning, thereby reducing the formative power of simplification in learning on the part of the listeners throughout the transmission chain.

The presence of an audience, in turn, incentivizes performers to intuitively amplify their expression in order to increase rhetorical control over the listeners. As a result, the innovation rate in exploring newer expressive means grows—structural patterns are modified more at each new act of transmission. Subsequently, the diversity of the emerging variants increases since each of the multiple listeners inevitably introduces slight variations in the learned music when they pass it on to new listeners. The compound effect of the tree-like transmission greatly exceeds that of linear transmission. Prevalence of linear transmission makes music cultures that remain primarily “personal” in their music usage (e.g., Nenets or Nganasan) stand out as amazingly conservative in comparison to music cultures that primarily employ collective forms of performance and listening. The larger the number of ensemble performers typical for a given tradition (e.g., orchestral music) and the size of its audience (e.g., concert hall, radio), the higher the innovation rate (Alekseyev, 1976, 1986, 1988). Naturally, the greater the discrepancy between synchronic and diachronic invariants of the same musical structure, the vaguer its structural and semantic characteristics and the weaker the combinatorial rules of its use. Language does not have this problem because, in everyday use, “person-to-person” distribution remains prevalent over “person-to-group” (for the discussion of harmonization versus individualization, see Harvey, 2017).
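The difference in compound effect between linear and tree-like transmission can be made concrete with a small counting exercise. The sketch below rests on a deliberately idealized assumption that is ours and not part of the cited work: every listener passes the piece on once, and every act of passing it on yields one slightly altered variant. The function name and the audience sizes are hypothetical.

```python
def variants_in_circulation(audience_size, generations):
    """Count every variant of a piece produced over a number of generations,
    assuming (as an idealization) that each listener retransmits the piece
    and introduces a slight variation of their own.

    audience_size = 1 corresponds to linear, person-to-person transmission;
    audience_size > 1 corresponds to tree-like, person-to-group transmission.
    """
    total = 1      # the original piece
    carriers = 1   # people able to pass the piece on in the current generation
    for _ in range(generations):
        carriers *= audience_size  # each carrier reaches a whole audience
        total += carriers          # each new listener holds a new variant
    return total


linear = variants_in_circulation(audience_size=1, generations=5)
tree = variants_in_circulation(audience_size=10, generations=5)

print("linear transmission:", linear)   # 6 variants
print("tree-like transmission:", tree)  # 111111 variants
```

Under this idealization, five generations of person-to-person transmission yield six variants, whereas five generations of transmission to audiences of ten yield 111,111. Real traditions will fall far short of such counts, since most listeners never retransmit what they hear, but the exponential growth shows why collective performance settings can be expected to diversify much faster.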

Music is more oriented to the expression, transmission, and prolonged experience of emotions, whereas language is more optimized for delivering prompt referential information. Therefore, oral verbal encoding is designed for quick peer-to-peer streaming, where information has to be constantly chunked by parsing the stream of sounds, identifying words in it, retrieving their meanings, interpreting phrases, and constructing the meaning in a cumulative way. All of this relies on clarity of phonemic and morphological contrasts, while prioritizing the processing speed and robust error-correction.

Conversely, music prioritizes continuity and homogeneity of the sounds within the same musical phrase. Music is designed to elicit particular affective states in the listener, allowing them to immerse themselves in the music and fully engage with the experience of those states. This requirement:

  • slows down music’s transmission rates, giving music a meditative appearance,

  • causes music to simultaneously engage multiple aspects of expression, each with its own proprietary “idiomatic” patterns (rhythmic, metric, melodic, harmonic, etc.), and

  • grounds music in iconic semiosis and synesthetic correspondences between the musical meaning and the acoustic attributes of music sounds.

This distinction between music and language is far from being clear-cut. Language also conveys emotional contents and is partially iconic—especially poetic speech, in which iconicity facilitates word learning and communication while systematicity facilitates category learning. Linguistic arbitrariness, iconicity, and systematicity interact in complex ways under the effects of cultural selection to reshape not only a language’s vocabulary but also its grammar, promoting compositionality and regularity (Dingemanse et al., 2015). Nevertheless, the differences between musical and linguistic oral transmissions are sufficient to make music functions form operational relations quite different from language functions. Notably, music functions rely on each other to such an extent that higher-order functions can hardly be fully operational without lower-order functions being effectively engaged.

Consequently, the study of the evolution of music requires the consideration of all music functions in their systemic relations. Most disagreements between extant theories of the evolution of music seem to originate from the limitation of study to only a few functions, specific to the earliest or latest stages of evolutionary development, while ignoring the other functions. Moreover, both biological and cultural factors need to be considered on par and in their interaction.

Human Self-Domestication and Language Evolution

In the next two sections we present a model of music evolution to account for musical functions and for biological and cultural factors formative for music structure and function. Our model is based on a recent account of human evolution, namely, the hypothesis of “human self-domestication” (HSD), which has been successfully applied to the characterization of the evolution of language in our species (Benítez-Burraco & Progovac, 2020; Thomas & Kirby, 2018). Because of the parallels between music and language discussed above, we expect this evolutionary model to be applicable to music.

The HSD hypothesis supports the view that the human phenotype is, to a large extent, the outcome of an evolutionary process similar to that of animal domestication. In nonhuman mammals, domestication initially involved selection for tameness and resulted in a set of distinctive traits—physical, cognitive, and behavioral—that usually co-occurred, forming the domestication syndrome (Wilkins et al., 2014; see Lord et al., 2020, and Sánchez-Villagra et al., 2019, for critical views). This might be due to the fact that tameness reduces the input to the neural crest, an embryonic structure that supports the ontogenetic development of numerous body parts (Wilkins et al., 2014; see Lord et al., 2020; Sánchez-Villagra et al., 2016). The HSD hypothesis builds on the findings of many domestication traits in humans, including smaller skulls/brains (compared with archaic humans), reduced hair, neotenic features (e.g., extended childhood and increased playing behavior), and, particularly, reduced levels of reactive aggression (Fukase et al., 2015; Leach, 2003; Plavcan, 2012; Shea, 1989; Somel et al., 2009; Stringer, 2016; Zollikofer & Ponce de León, 2010).

Diverse factors have been hypothesized to trigger HSD, including the rise of co-parenting, the advent of community living, changes in our foraging ecology, climate deterioration, and the colonization of new environments (Brooks & Yamamoto, 2021; Pisor & Surbeck, 2019; Spikins et al., 2021). All in all, these factors might have promoted a selection toward less reactive and more prosocial behaviors, thereby instilling in humans a constellation of physical, behavioral, and cognitive changes characteristic of domestication. Many human-specific traits, such as our enhanced social cognition, increased cooperation, and finally, advanced technology and sophisticated culture, are the products of domesticate-like adaptation (see Hare, 2017, for an overview). This collective cooperativity that extends beyond the familial gene pool does not necessarily equate to domestication, but it quite closely resembles its principal traits.

It seems to us that HSD presents a useful evolutionary framework for linguistic studies, especially for capturing those aspects of languages that are thought to emerge through a cultural mechanism. It is worth remembering that the earliest hominids, who had high levels of reactive aggression, practiced musilanguage rather than “language” and must have cultivated signals similar to animal communication. The latter simply could not support the “duality of patterning” (Hockett, 1960) and combinatoriality. Therefore, the “linguistic” component in musilanguage is harder to see than the “musical” component, although there is evidence that animal communication uses referential as well as motivational information, each coded differently (Manser, 2010). Indeed, animal communication comes much closer to human music than to human language due to its dedication to showing the signaller’s affective state (Fitch, 2006). There is neurophysiological evidence that “full language” must have crystallized later than “full music” because the acoustic characteristics of primate vocalizations are mainly determined by music-like features that serve as the foundation of verbal acquisition for human infants (Koelsch, 2009).

However, concluding from this that language evolved from music, as argued by Fitch (2010), seems a stretch. The principal arguments against this scenario were summarized by Tallerman (2013):

  1. Phonological systems do not evolve in isolation from semantics, as if they were “bare vocal sounds.” Consonants and vowels are linguistic entities, and phonological expansion derives from a growing vocabulary of words—not the other way around. It is the developing lexical system that brings to life phonological gestures (de Boer & Zuidema, 2010; Lindblom, 1998; Studdert-Kennedy, 2011; Zuidema & de Boer, 2009).

  2. Despite being more similar to animal vocalizations than human language is, human song remains fundamentally different from animal vocalization. Learned animal vocalizations lack transposability of intentions (i.e., repeated use of the same signal in different circumstances) and abstraction of the representation of an affective state, which are the hallmarks of musical emotions. A single animal call is the basic unit of animal communication—produced instinctively in response to the actual stimulus present in the environment (Zuberbühler, 2017). Moreover, learned animal vocalizations are limited to displays of fitness (Naguib & Riebel, 2014), are season- and gender-specific (Slater, 2011), and relate to mating or territory-defending situations (Slater, 2001)—unlike human music.

  3. Learned animal vocalizations (some ethologists and researchers of animal communication call them “animal songs”) have a critical period of acquisition, are learned holistically, and take months before an animal can deliver them (Hurford, 2012). In contrast, humans can learn songs at any life-stage, doing it incrementally and rather quickly. Evidently, human song-learning engages very different neuro-physiological mechanisms and constitutes not an extension but a parallel evolutionary development to animal song—as Fitch himself recognizes (Fitch, 2010:184). [Footnote 21]

  4. Finally, it is hard to explain how and why music-like aspects of hominin vocalizations would have reduced their musicality and given rise to consonants that are fundamentally “unmusical” and notably absent in animal communication (Kolinsky et al., 2009). The musicality of speech comes from prosody, and prosody comes from joining words into phrases. Musical phrases have nothing in common with linguistic phrases other than the misleading term “phrase” (Benjamin et al., 2015)—linguistic phrases are built around words and their categorical relations, whereas musical phrases are determined primarily by the breathing rate that characterizes different emotional states (greater excitement transpires in shorter phrases) and general release of tension (harmonic and melodic) toward the end of a phrase, which accompanies expiration (Alekseyev, 1976). [Footnote 22]

As we see it, the evolutionary continuity of animal communication and human music is superficial—human song and animal song constitute independent developments—and there is no reason to trace the origins of language from music. Under closer scrutiny, animal communication combines the semiotic characteristics of both human music and language (Manser, 2010):

  • Animals use referential calls (i.e., they refer to specific attributes of the eliciting external stimuli to enable the receivers of these signals to react to these external stimuli) when encountering predators, discovering a food resource, and in agonistic social interactions.

  • Animals use motivational calls (i.e., calls that display the emotional state of a caller, so the receivers react to this emotional state) in all other situations. Ontogenetically, motivational calls are acquired before referential calls and appear to be simpler in structure.

Musilanguage must have just inherited referential and motivational specialization from animal communication and advanced it to the next evolutionary stage—building the repertories of calls of both types and introducing some transposability of their use. In this process, each type obtained a set of characteristic structural features that allowed listeners to distinguish both types upon hearing them. Motivational calls probably resembled the repertory of infantile vocalizations during the first few months of life, categorized into negative cries of various sorts and positive cooing—all characterized by prolonged use, as with music (typically, as long as the emotional state lasts).

Referential musilanguage calls likely resembled the earliest attempts of an infant to point to specific things in a dialogic communication with a caretaker with the aid of gestures: shorter and more reliant on turn-taking than the “monologic” motivational vocalizations. Such a “wordless” linguistic component is what Brown outlined in his 2017 amendment of the musilanguage theory with his new “prosodic scaffold” model (Brown, 2017). According to it, musilanguage conveyed primarily affect-related information in two principal ways:

  1. through “affective prosody” (music-like) by means of anatomically available and innate impulse-driven modulations of pitch, loudness, and tempo—which remain global and holistic for the entirety of a call;

  2. through “intonational prosody” (speech-like) by filling a prosodic scaffold with phoneme-like deictic utterances—employing both global and local mechanisms for conveying linguistic modality (e.g., question versus statement) and emphasis (stress, prominence, focus).

The speech-like way must have evolved from the music-like way through an ongoing adaptation of the reflex-based vocalizations in response to the most common environmental situations. Such vocalizations were probably reshaped by their chain transmission and natural selection for the most effective patterns of communication under the pressure of time—in other words, to successfully deliver signals as soon as possible, enabling live updates on critical changes in the environment. The demand of urgency probably pushed “intonational prosody” toward language, in contrast to “affective prosody,” focused on the caller’s expression rather than the task of keeping listeners up-to-date. Supported by hominins’ capacity for accumulation of knowledge, the newly forged intonational patterns were memorized and preserved (in contrast to animal communication), leading to the invention of consonants, formation of syllables, and eventual adoption of basic conventional words for the most common objects. [Footnote 23]

With regard to HSD, musilanguage, protomusic, and protolanguage all fall outside its scope, since currently available data do not indicate the presence of a domesticated phenotype among extinct hominins, and the data coming from developmental psychology and ethnomusicology are applicable to Homo sapiens only. Extrapolating our conclusions on the factors at play (see Nikolsky & Benítez-Burraco, 2022, for details), it is plausible to expect that hominins who practiced protomusic and protolanguage, perhaps even musilanguage, had lower levels of reactive aggression than nonhuman primates. Some traits established for Homo erectus might be interpreted as promoting cooperative behaviors between closely related partners: hunting and gathering in groups, caring for injured and sick group members (Leroy et al., 2011), need for helpers during delivery due to large cranial size, caretaking assistance due to longer altriciality (Boaz & Ciochon, 2004), and migration to colder climates, where hardship of survival was likely to encourage mutual support in such activities as communally planned big game hunting, maintaining fire, and making clothes and huts—all suggestive of some form of communication between the participants (Mania & Mania, 2004). However, such arguments remain speculative until more conclusive archaeological evidence is uncovered.

For humans, HSD can account for the evolution of abilities and behaviors that enable the cumulative growth of linguistic complexity through already ongoing, multigenerational learning and use. This involves language teaching and practicing, promoted by a more prosocial and neotenic phenotype. In a series of related papers, Progovac and Benítez-Burraco (2019; Benítez-Burraco & Progovac, 2020, 2021) have developed a detailed model of how HSD might have contributed to the evolution of language (and of languages). At the time of the emergence of early humans, reactive aggression was still high, and consequently, communication through language must have been limited to single-word commands, threats, and exclamations, mostly aimed at conveying emotions. Patient and cooperative turn-taking, using long utterances, and conveying referential meanings, frequently observed in present-day interactions, were simply unattainable back then.

Increasing HSD supported stronger in-group networks, involving more diverse, frequent, and prolonged contacts between their members. Cooperative turn-taking must have become more common and elaborated, enabling the development of linguistic structures via cultural transmission. It is plausible to expect that single-word utterances were replaced by rudimentary two-slot grammars made of nouns and verbs to express predications. These earliest grammars might have been primarily used for creating colorful derogatory expressions (since emotional reactivity was still quite high), contributing to further increase in HSD, as these derogatory utterances helped replace physical reactive aggression with less-harmful verbal aggression.

The main reason for the positive feedback loop between reactive aggression and grammar is the functional connection and partial overlap of the brain mechanisms that support combinatoriality and control of reactive aggressivity. To give just one example, in learned aggressive actions (a form of controlled aggression), the prefrontal cortex regulates the activity of the hypothalamus (a component of the “core aggression circuit”) and the striatum (part of the “learned aggression circuit”; Lischinsky & Lin, 2020). But the striatum plays a key role in grammar processing as part of the procedural memory and, more generally, of the cortico-subcortical networks responsible for hierarchical processing (Teichmann et al., 2015). Evidence of this functional connection/partial overlap is the concurrence of the difficulties in processing structural aspects of language with the aggressive outbursts in clinical conditions, caused by striatal dysfunction (Rosenblatt & Leroi, 2000; Savage, 1997; Zgaljardic et al., 2003). Accordingly, from an evolutionary point of view, one can expect that reduced reactive aggression, resulting from increased HSD, demanded additional control of subcortical structures by the cortex, which also promoted cross-modality. In other words, the ability to combine information from different cognitive domains was pivotal for merging linguistic items (see Benítez-Burraco & Progovac, 2021, for a more detailed discussion).

Once HSD reached its peak at the end of the Upper Paleolithic (Cieri et al., 2014), behaviors conducive to the advance in linguistic complexity via cultural mechanisms proliferated: more frequent and diverse social contacts, longer learning periods, more frequent practicing, and so on. Such changes likely put in place the first hierarchical grammars that expressed transitivity. Languages with such grammars are called esoteric. These languages typically exhibit larger sound inventories and complex phonotactics, opaque morphologies (with more irregularities and morpho-phonological constraints), limited semantic transparency (abundant idioms and idiosyncratic speech), reduced compositionality, and less sophisticated syntactic devices. These features are common for languages spoken by isolated human groups, living in small, close-knit communities with high proportions of native speakers—a rough proxy for languages spoken by present-day hunter-gatherer societies.

The transition from the Upper Paleolithic to the Neolithic was accompanied by cardinal changes in social organization as a result of steady demographic growth and climatic changes. Growing social interactions brought to life extensive social networks, promoting trading and mating, while also unleashing intergroup hostilities over competition for limited natural resources. The necessity to regulate conflicts, convey decontextualized meanings, and exchange technological know-how with unrelated individuals favored the emergence of another type of language—exoteric. These languages typically feature expanded vocabularies and increased syntactic complexity (including greater reliance on recursion), as well as greater compositionality and enhanced semantic transparency—all advanced at the cost of simpler phonological inventories and sound combinations, and more regular morphologies. Proxies for such languages are those spoken by present-day agriculturalist societies, particularly state-governed autochthonous ones. Since these languages are also suitable for conscious planning, establishing alliances, conducting warfare, and, ultimately, supporting the emergence of cultural institutions related to war and peace, their emergence can be linked to the advent of proactive aggression that became more widespread during the transition from the Neolithic to the rise of first civilizations.

Our model of evolution of human languages under the effects of HSD can also explain modern pragmatics and linguistic modes of interaction. A reduction in reactive aggression is beneficial for cognitive and behavioral changes necessary for the emergence of rules of turn-taking and complex inferential abilities, both of which are cornerstones of our conversational abilities. On the cognitive side, the expansion of pair-bonding to nonreproductive relationships marked a crucial achievement in social organization. The potentiating of cross-modal thinking, instrumental for linguistic chunking, enabled conventions of figurative uses of language (e.g., metaphors and metonyms) and pragmatic inferencing. On the behavioral side, increased HSD favored prolonged face-to-face interactions, long-term cooperation, and consideration for others’ needs. Overall, these cognitive and behavioral changes enabled communication of more complex meanings by indirect means (see Benítez-Burraco et al., 2021 for a detailed view).

In general, this HSD model ties the evolution of language to changes in aggression management, both reactive and proactive, ultimately connecting specific linguistic structural features with the HSD-related behavioral and cognitive changes, based on their shared neurobiological substrate. At the same time, this model establishes a strong continuity between communication and cognitive abilities exhibited by other species, while also supporting cultural niche construction, cultural evolution, and gene-culture coevolution as key factors that accounted for the exclusiveness of language to human communication. Our contention here is that the same model can also be applied to human musicality, music types, and functions of music, not only because of the common origins of music and language, but mostly because of the common effects of changing levels of reactive and proactive aggression.

Human Self-Domestication and the Evolution of Music

As noted, we find our HSD model of coevolution more parsimonious than those accounts that hypothesize different rationales and mechanisms for the evolution of music and of language. Our approach reconciles hypotheses about music evolution that have been presented as irreconcilable, such as the “social bonding hypothesis” (Savage et al., 2020) and “the credible signal hypothesis” (Mehr et al., 2021). Moreover, this model explains better than other models how different types of music diachronically emerged through a cultural mechanism—which was previously examinable mostly through memetic approaches (see Jan, 2018). Overall, we hypothesize that the gradual changes in the subtle balance between reactive and proactive aggression could help us understand the steady complexification of music, the emergence of its new functions, and the transformation of the old ones, as well as the past and present distribution of musical types and genres as reported by ethnomusicologists.

Our model is summarized in Fig. 2, which presents music evolution vis-à-vis language evolution to highlight their common origins and their parallel evolutionary pathways under the effects of HSD, changes in paleoclimatic conditions, demographic changes, and relevant cognitive and behavioral innovations. We support the view that the evolution of music systems and languages can be conceived as two different products of the same biological/cultural processes, heavily influenced by the increased feedback loop between the reduction of reactive aggression and the sophistication of language and music structures and uses.

Fig. 2. The timeline of the coevolution of music and language. The figure reflects the evolution of types of music vis-à-vis the evolution of types of languages in regard to the changes in human socialization patterns under the effects of increased HSD (reproduced from Nikolsky & Benítez-Burraco, 2022, Fig. 7).

In brief, once a musilanguage emerged from the building blocks rooted in animal communication, cognition, and behavior, protomusic started to diverge from protolanguage, later evolving into timbre-based music, and thereafter, into pitch-based music, ultimately generating collective forms of music that can be found in many present-day societies. Still, these stages should not be viewed as a clear-cut “monolithic” order of things. Since environmental and social conditions instrumental for HSD are always in the process of transformation, HSD levels are prone to vary from one place to another, and from one human group to another (see, e.g., Gleeson & Kushnick, 2018, for sexual dimorphism under the HSD effects). Therefore, we expect significant historic and geographic overlaps between different evolutionary musical types globally. As the available ethnomusicological data suggest, the schemes of tonal organization that characterize different stages of evolution of musical structures tend to build on each other, retaining the previous formations. Even in music cultures of modern Western countries that are based on full-fledged tonality (the conventional key system of Western classical music), it is often possible to identify traces of the older methods of tonal organization (musical modes, including those that feature fewer than seven pitch-classes—i.e., five strata of tonal organization in traditional Lithuanian music; see Leisiö, 2002). Traces of earlier music usually survive in specific folk genres—most commonly, within the venerated epic and religious traditions.

It would be unrealistic to expect that each stage in our model started at the same time worldwide. This is in line with current evidence of modern human behavior having appeared in different regions at different points of time (Ashton & Davis, 2021). We reserve the possibility that other close hominins, particularly Neanderthals and Denisovans, will fit in the first stages of our model, if evidence of their human-like management of reactive aggression emerges. Below, we provide a more detailed description of our model.

Summary of Our Four-Stage Model

Before the advent of our species, roughly 300,000 years ago, a musilanguage stage can be hypothesized for the hominin clade. The likely distinction between animal communication and this pre-human musilanguage was the presence of conventional acoustic forms of expression for conveying common emotional and deictic information between the members of the same social group (loud collective signals to fend off dangerous predators, individual grunting patterns to accompany caretaking activities, etc.). Such signals were probably not coordinated in pitch and time between multiple participants, featuring a jumbled “isophonic” texture (Nikolsky, 2018), very much like the howling of a wolf pack. But unlike animal communication, musilanguage signals can be hypothesized to address specific group members, to vary in sonic patterns based on application, and to be passed on from one generation to another (see Nikolsky, 2020a). A communication system capable of enhancing sociality and altruistic behavior is critical to the promotion of cooperation at times of environmental stresses, so frequent throughout the Paleolithic. It is plausible that waves of mass hominin migration from Africa were enabled by the prosocial influence of the musical component in a musilanguage system.

Stage 1 in our model (protomusic) starts with the emergence of archaic, anatomically modern humans (AMHs) endowed with cognitive innovations, particularly a new neuronal workspace that entailed greater connectivity between distant brain regions and could overcome the limits of core knowledge systems, supporting basic forms of cross-modal thinking (Boeckx & Benítez-Burraco, 2014). Two innovations distinguish protomusic from both musilanguage and protolanguage. The first, the emergence of singing, was a likely outcome of an attempt to maximize the intensity of phonation in distant calls (Maclarnon & Hewitt, 2004) and in collective vocalizations designed to scare off predators or to ambush prey (Jordania, 2011, 2017). The second innovation, the accidentally discovered sounds of flintknapping, probably gave birth to the world’s earliest musical instrument, a pair of rocks hit or rubbed against each other in the manner of modern claves or guiro (Montagu, 2004). Rhythmic knapping is known to be a natural by-product of entrainment during collective manufacturing of stone tools (Zubrow & Blake, 2006). Such rock percussion was definitely used in prehistoric times (Boivin et al., 2007) and still survives in aboriginal societies in performance rituals, where it is ascribed magic properties (Duncan-Kemp, 1952).

Within this stage, some interaction between protomusic and protolanguage was likely to have occurred. Consider the case of lullabies and motherese, both of which can be related to our prolonged (in comparison with other primates) altriciality period and shortening of interbirth intervals that posed the need for collective caretaking of multiple children. It is quite possible that specific musical intonations that globally characterize lullabies (e.g., the descending leaps by about 300 cents—Fernald, 1992; Reigado et al., 2011) might have been cultivated within the motherese throughout the millennia of its application. Because we still lack precise knowledge of cognitive and behavioral features of earlier hominins (including their social life), we cannot rule out the possibility that Neanderthals (and, perhaps, Denisovans) also exhibited some sort of protomusic since they have been hypothesized to share with humans the basic capacity to sing (Mithen, 2005) and to have had some form of culture, particularly symbolic behavior (Mellars, 1996; D’Errico et al., 2003).

Around 200 kya, the long Riss Glaciation began, and climatic conditions became harsher. Frequent alternations of extreme cooling and warming caused significant fluctuations in sizes of social groups. Depopulation periods increased the value of cooperation in harsh environments, strengthening bonds and stimulating interpersonal communication. During subsequent periods of demographic growth, newly established patterns of communication were cultivated over larger territories and involved a larger number of people. The seesaw demographic alternations favored selection for increased prosociality and promoted personal and interpersonal uses of protomusical behaviors.

Two formats, solitary musicking to entertain oneself during prolonged solitary activities (the babbling model) and duetting of closely related persons (the motherese model) during the times of depopulation, provided a fertile ground for the invention of “musical mode.” Bonded couples intuitively matched the sonic characteristics of their vocalizations, as observed in modern-day motherese, and solitary musicking gave an opportunity to explore the combinatorial capacities of the matched common patterns of expression. The resulting set of sounds that pleased the sensibilities of music-makers was thereafter conserved for future musicking, available for those who overheard such musicking. Tone-matching probably originated from mother-infant interaction, characterized by instinctive mutual imitation of the expressive vocal attributes (Malloch & Trevarthen, 2009) and fueled by oxytocin (Harvey, 2020). Much of this mimicking is confined to the domain of timbre, which makes it the most likely substrate for the earliest musical modes. A set of timbre-classes, selected and repeatedly used to express specific semantic contents, constitutes what can be called a “musical timbral mode” (Nikolsky et al., 2020), which is the type of music we hypothesize for Stage 2 in our model.

The “natural” (anatomy-driven) rules of binding acoustic properties of an auditory signal with emotional semantic content, typical for animal vocal communication, were ultimately replaced by “cultural” conventions that often violated the “natural” order of things (as is characteristic of present-day human music). Here, the peculiar institution of personal song must have been particularly instrumental (Nikolsky et al., 2020). In numerous music cultures of Indigenous hunter-gatherers of the extreme North, whose lifestyle comes the closest to that of early humans during the Quaternary glaciation, each person is assigned a song that indicates one’s place of origin, ethnicity, kin, age, occupation, and personality type (Nikolsky et al., 2020; Sheikin, 2002). The information conveyed in a personal song is crucial for avoidance of incest in marriages in lightly populated areas. Its honest use is protected by a widespread ancestor cult and by social conventions imperative for one’s survival in harsh environments.

All in all, personal song presents a likely model for a transitory stage between the animal-like protomusic and the full-fledged human music. Personal song resembles animal songs in marking territoriality and ancestrality while assisting mating (see Bradbury & Vehrencamp, 2011). But in sharp contrast to instinct-driven animal songs, parents in Indigenous societies actually “compose” personal songs for their newborns: they deliberately use tone-classes (entailing timbre, rhythmo-meter, and pitch contours) to represent the child’s temper that they observe during the first days of parenting. Ultimately, the coexistence of personal songs and timbre-oriented music traditions among numerous ethnicities of Siberia and the Russian Far East, as well as the inherent spatial limitation of timbral music (timbral modulations are practically inaudible beyond the distance of a few meters), make a timbre-oriented personal song a very likely candidate for the forms of music characterizing our Stage 2. The development of personal song is directly related to the ongoing reduction in reactive aggression since the circulation domain of one’s personal song is limited to one’s extended family and characterized by greater tolerance in comparison to relations with outsiders. Also, the ongoing everyday musicking by individual owners of a personal song was likely to promote greater emotional control, thereby contributing to the general reduction of interpersonal conflicts within a community. The evidence of such a mediative and regulatory role of music has been provided by numerous recent studies of the enhancing influence of music on the inhibitory control in children (Bolduc et al., 2021; Bugos et al., 2022; Hennessy et al., 2019; Joret et al., 2017; Moreno & Farzan, 2015).

Around 110 kya, the Riss-Würm Interglacial ended, and the climate deteriorated again, leading to the Last Glaciation, which lasted until 10 kya. We see this period, when HSD reached its peak (Cieri et al., 2014) and behavioral modernity spread over most parts of the world, as Stage 3 in our model. For music, the primary achievement toward the end of this stage was the emergence of cross-cultural pitch orientation, evident in the uncovering of more than a hundred “bone flutes” in caves, often in bundles, over a wide region from Germany to Spain, dated to 36–30 kya (Morley, 2013). Similarities in their construction (D’Errico et al., 2003) suggest ongoing cultural interaction throughout 45–30 kya along the Danube corridor (Higham et al., 2012).

The rise of pitch orientation can be attributed to several factors. Between around 110 and 10 kya, caves with fire became common places for human occupation (Kempe, 1988). Cave reverberation is distortive for timbre-classes but resonant for pitch-classes (e.g., it makes familiar voices unrecognizable but amplifies a pitch value). Reflections from the walls make pitch changes more salient due to the prolonged decay of each sustained pitch level. Inhabited Paleolithic caves usually resonate at a specific frequency, about 110 Hz (Devereux, 2006), and contain stalactites usable as lithophones; in some caves they produce sophisticated scales (Dams, 1985). Both resonance and lithophones might have provided the reference pitch for singing. The most resonant locations in such caves often contain paintings, dated from 35 kya onward (Díaz-Andreu & García, 2012). The same affiliation characterizes “sounding rocks,” some of which contain marks of hitting, indicative of their ritual musical usage (Morley, 2013). Most Paleolithic “bone flutes” were uncovered in caves (Morley, 2013), testifying to the pitch orientation of European Paleolithic cave-dwellers.

Another pitch-inducing factor is the intuitive tuning-in that occurs when numerous singers try to sing the same melody: they tend to resolve sustained inharmonious combinations of tones (Zarate et al., 2010) into harmonically “perfect” intervals of unison, octave, fifth, and fourth (Tallmadge, 1984). All things considered, cave singing had the power to direct singers’ attention to the fundamental frequency and harmonicity, while promoting timbral uniformity in pitch changes. Together with the above-mentioned tendency of chain transmission to discretize pitch, these conditions were likely to convert earlier timbral modes into pitch-sets, thereby widening the collective use of music and promoting the reduction in reactive aggression. In turn, the self-domestication features promoted extensive prosociality, favorable for communal cave living and collective use of music.

The advent of the Holocene, roughly 10 kya, marks the final stage in our model, when population growth resulted in prolonged intergroup contacts, extensive social networks for trade and intermarriage, and, in many cases, escalated conflicts between larger human groups. A new type of aggression, proactive aggression, became widespread. All of these promoted a new type of music that entailed standardized intervallic typologies and tuning, as well as prescriptive rules for combining pitch-classes. Standardization of pitch- and interval-classes and pitch- and interval-sets inevitably reduces the diversity of musical modes, necessitating the institution of formal music training and introducing the notion of musical error (Nikolsky, 2016a). Music becomes professionalized and regulated by political or religious authorities. Rather free and loose usage of a multitude of musical modes that characterizes all-inclusive musicking in folk family and village traditions gives way to restrictive (“correct”) implementation of just a handful of musical keys, often supported by some sort of musical notation. Such transformation is documented in the history of ancient Babylonian (Dumbrill, 2005) and Greek (West, 1992) music systems.

Standardization of keys boosted the development of orchestral and choral music, invention of instrumental families, and the genesis of cyclic music forms that contained contrasting movements, complexities that were inaccessible before the standardization (Nikolsky, 2016b). Music, performed and auditioned en masse in service of the state and/or religion, became a political weapon in hostilities between countries, consolidating citizens across kins, clans, and castes against the supposed negative influence of neighboring cultures. Political use of music and language, where language unites communities by conveying ideas and reasons for their support while music backs the language by instilling the appropriate emotional states, culminated in the twentieth century, comprising official propaganda in the majority of the world’s nation-states. For this reason, we periodize this fourth stage as continuing until the present.

Conclusions

In this paper, we have outlined our model of the coevolution of music and language under the influence of aggression management throughout human evolution. Enabled by the reduction in reactive aggression—due to a number of paleo-environmental factors—music and language started as undifferentiated forms of emotional and referential signaling within musilanguage. Initially, they abided by the principles of animal communication, relying on the single-signal “monologic” display of the signaler’s affective state and the deictic reference to something observable to the signaler. Growing control of aggression within the basic family units promoted development and intergenerational transmission of patterns of communication, eventually forming two autonomous systems.

  • Protomusic specialized in regulating the emotional states of individuals in their solitary activities and everyday interactions.

  • Protolanguage specialized in timely delivery of referential information (including live streaming) and directed and coordinated important collective activities.

The capacity of music to promote empathy and bonding favored the formation and transmission of lexical and grammatical conventions instrumental for the complexification of language. Crystallization of musical timbral modes marked the bifurcation of music and speech.

  • Music focused on the aesthetic appreciation of sonic attributes, evolving toward the selection of holistic idiomatic patterns whose acoustic properties were suitable for evoking specific emotional states common to a given lifestyle and provided easy integration of these patterns into a continuous stream.

  • Language focused on effective encoding of important referential information, evolving toward the selection of contrasting, easy-to-process phonemes, the combination of which could supply enough words to refer to the surrounding objects and frequently occurring events.

Hence, language headed toward symbolic semiosis, driven by the need to quickly update information, in contrast to music heading to iconic semiosis, to satisfy the need to secure emotional contagion by means of prolonged exposure to a specific musical emotion.

Increased cooperation and social interaction favored the emergence of pitch-oriented music, which became effective at long-distance communication to a large number of people. Subsequently, pitch orientation turned into a tool of social mediation, forging formats of collective performance that distinguished music from language to an even greater extent. Speakers took turns, whereas singers sang together. At this point, music counterbalanced language along the axis of opposition of “me” versus “us.” Language supported individual awareness, bringing to light differences between individual interlocutors, whereas music carried the opposite effect of emphasizing what was in common between multiple performers.

In the long run, language promoted individualization and analysis, offset by music that promoted integration and synthesis. Music compensated for the negative social and psychological effects of language use (e.g., propensity of individualization to lead to intergroup conflicts), while language compensated for the potential negative side effects of music (e.g., suppression of individual interests in favor of the interests of an entire social group). The antithesis and mutual compensation of music and language were further intensified as both reached their exoteric stages. Music became the means of inspiring masses to feel a certain way (most commonly, patriotic, family-bound, and religious), whereas language became the instrument of reasoning, frequently counterposed to “feelings.” Music and language developed an antinomy of “heart” versus “mind.” Their dichotomy still fuels our cultural life today.

Overall, we have argued for a gradual coevolution of different types of music and of languages as the structure of human groups became more complex and diversified as a result of changing the balance between reactive and proactive forms of aggression. If early stages in the evolution of music and language were characterized by the curbing of reactive aggression, later stages became associated with the rise of and increase in proactive aggression. Our model provides a unified view of the evolution of language and music under the effects of changes in human cognition and behavior, which can and should be tested by subsequent studies.