The Ontogeny of Vocal Sequences: Insights from a Newborn Wild Chimpanzee (Pan troglodytes schweinfurthii)

Observations of early vocal behaviours in non-human primates (hereafter primates) are important for direct comparisons between human and primate vocal development. However, direct observations of births and perinatal behaviour in wild primates are rare, and the initial stages of behavioural ontogeny usually remain undocumented. Here, we report direct observations of the birth of a wild chimpanzee (Pan troglodytes schweinfurthii) in Budongo Forest, Uganda, including the behaviour of the mother and other group members. We monitored the newborn’s vocal behaviour for approximately 2 hours and recorded 70 calls. We categorised the vocalisations both qualitatively, using conventional call descriptions, and quantitatively, using cluster and discriminant acoustic analyses. We found evidence for acoustically distinct vocal units, produced both in isolation and in combination, including sequences akin to adult pant hoots, a vocal utterance regarded as the most complex vocal signal produced by this species. We concluded that chimpanzees possess the capacity to produce vocal sequences composed of different call types from birth, albeit in rudimentary forms. Our observations are in line with the idea that primate vocal repertoires are largely present from birth, with fine acoustic structures undergoing ontogenetic processes. Our study provides rare and valuable empirical data on perinatal behaviours in wild primates.


Introduction
Primates communicate using a limited vocal repertoire, which largely develops in species-specific ways (Seyfarth & Cheney, 1997). The acoustic structure of calls uttered by infants typically resembles that of the corresponding adult call types, suggesting that vocal structures develop under strong genetic control (Hammerschmidt & Fischer, 2008; Janik & Slater, 2000; Owren et al., 2011), with some room for socially acquired call variants (Ruch et al., 2018; Snowdon, 2009). The acquisition of novel call types is virtually absent in wild primates (Fischer & Hammerschmidt, 2020; Tyack, 2020; but see Lameira, 2017). Overall, primate development of species-typical calls results from a combination of genetic, social, and environmental influences, though the relative role of each is still debated (Fedurek & Slocombe, 2011). Data on very early utterances shortly after birth are critical to assess the departure point in vocal ontogeny, prior to social and environmental influences. However, most primate births occur at night (Dunn, 2012) and are difficult to observe due to the unpredictability of parturition and maternal avoidance of other group members (Nishie & Nakamura, 2018; Otali & Gilchrist, 2006; Ramsay & Teichroeb, 2019), probably as a response to infanticide risk (Palombit, 2012). As a result, primate perinatal behaviours remain poorly understood (Trevathan, 2015), despite their theoretical relevance for developmental research (Nagy, 2011).
Vocal development has usually been analysed at three different levels: (1) production learning: how individuals modify specific acoustic features of calls after exposure to others' calls; (2) usage learning: how individuals give existing calls in new contexts or combine them as part of new vocal sequences; and (3) comprehension learning: how individuals respond appropriately to the vocalisations of others (Janik & Slater, 2000; Seyfarth & Cheney, 1997; Vernes et al., 2021). In primates, production learning is regarded as mostly fixed, while usage and comprehension learning are considered more flexible (Seyfarth & Cheney, 1997; Snowdon, 2009). However, this model is largely based on studies of alarm calls, which are expected to be less flexible than calls with more social functions, which should instead be the focus when making comparisons with human vocal development (Elowson et al., 1992; Snowdon et al., 1997).
The majority of data on primate vocal development stem from studies of monkey vocalisations (Seyfarth & Cheney, 1997; Tomasello & Zuberbühler, 2002). However, monkeys are arguably less directly relevant to studies of language evolution than great apes (Fischer & Hage, 2019; Fitch & Zuberbühler, 2013), who share a more recent last common ancestor with humans (Langergraber et al., 2012). Great apes often produce sequences of calls (e.g., chimpanzees, Pan troglodytes verus: Girard-Buttoz et al., 2022), including combinatorial structures (e.g., bonobos, Pan paniscus: Schamberg et al., 2016; gorillas, Gorilla gorilla beringei and Gorilla gorilla gorilla: Hedwig et al., 2014), with some evidence for socially learned call variations (e.g., orangutan, Pongo pygmaeus wurmbii and Pongo pygmaeus abelii: Lameira et al., 2022). While both humans and great apes use a limited set of sounds comparable in size (e.g., McComb & Semple, 2005; Moran et al., 2012), the ability to combine these sounds hierarchically to form vocal sequences varies greatly between humans and other apes, and sets apart human language from the communication of other animals (Hauser et al., 2002; Townsend et al., 2018). Although vocal learning abilities in great apes are clearly more constrained than in humans, investigating the degree to which great ape vocal sequences are socially learned or hard-wired, and thus present from birth, can inform us about the evolution of more complex vocal structures.
There are only a handful of direct observations of perinatal behaviour in one of our two closest living relatives, the chimpanzees (Fujisawa et al., 2016; Goodall & Athumani, 1980; Kiwede, 2000; Nishie & Nakamura, 2018; Zamma & Shabani, 2012). As a consequence, very little is known about newborn chimpanzee vocal behaviour, although subsequent stages of vocal development are somewhat better documented (e.g., Dezecache et al., 2019; Laporte & Zuberbühler, 2011; Plooij, 1984; Taylor et al., 2021). Early qualitative descriptions indicate that the first vocalisations of wild chimpanzees are comparable to the corresponding adult call types, such as grunts, whimpers, cries, and screams (Plooij, 1984). Human-reared chimpanzees initially exhibit vocal output that has some similarities to that produced by human infants in the first months of life (Kojima, 2008), although these vocalisations are often elicited by human caretakers or researchers (Bard, 1998; Kojima, 2008). Human infants are special, however, in producing highly variable and functionally flexible vocal sequences, referred to as babbling, a form of vocal exploration considered a milestone during language acquisition (Oller, 2000; Oller et al., 2021). Typically, babbling starts soon after birth, consists of a subset of the acoustic features characterising the adult repertoire, and does not require a social context or to be communicative (Oller, 2000; ter Haar et al., 2021). In addition to simple vocal practice, one probable function of this peculiar behaviour is to enhance social interactions and bonding with caregivers (Locke, 2006; Oller & Griebel, 2008). However, evidence for babbling-like vocal behaviour is absent in chimpanzees (Oller et al., 2019; ter Haar et al., 2021).
The vocal repertoire of wild chimpanzees consists of a relatively small number of acoustically distinct call types that can grade into each other (Crockford, 2019;Goodall, 1986;Marler & Tenaza, 1977). Different call types often appear in sequences (Crockford & Boesch, 2005;Girard-Buttoz et al., 2022;Leroux & Townsend, 2020), such as pant hoots and food grunts (Leroux et al., 2021) or screams and barks (Fedurek et al., 2015). Whether such call combinations function to convey different information is still unclear and a topic of ongoing research (Engesser & Townsend, 2019;Zuberbühler & Lemasson, 2014). Furthermore, no study to date has investigated how and when the capacity to produce vocal sequences appears during chimpanzee ontogeny.
In this study, we report on the vocal behaviour of a newborn wild chimpanzee in the Sonso community of Budongo Forest, Uganda. We used two methods of call classification: (1) qualitative categorisation of calls based on spectrograms and auditory features, supplemented by auditory categorisation by independent human experts, and (2) quantitative soft clustering analysis to determine distinct acoustic clusters and discriminant function analyses to investigate pant hoot production across age categories.

Study Site and Population
We studied the Sonso chimpanzee community in Budongo Forest, western Uganda. Chimpanzees from the Sonso community have been studied and followed daily by field assistants since 1990 (Reynolds, 2005). At the time of the study, the community was composed of 71 individuals, including nine adult males and 31 adult females (Table SI Supplementary Material). The main individuals involved in the study were members of the Kutu family (Table I).

Data Collection
On the 20th of November 2019 at 10:12 am, the first and second authors observed the birth of KU7. Three additional researchers attended part of the afterbirth period and assisted with data collection and identification of callers. We collected audio recordings of all vocalisations produced by KU7 and opportunistically recorded the vocalisations produced by other individuals in the party, using a Sennheiser MKH416 directional microphone (Sennheiser Electronic GmbH & Co. KG, Wedemark, Germany) with a Marantz PMD661 MkII solid-state recorder (Marantz, Kanagawa, Japan; sample rate 44.1 kHz, resolution 32 bits, 'wav' format). We defined party composition as all individuals present within a radius of approximately 35 m of the focal individual (Newton-Fisher, 1999). We set the recorder's gain on level 9 to maximise the signal/noise ratio due to the softness of calls and the distance from the subject (approx. 15 m). We maintained this distance due to the delicate nature of the event and to reduce any effects of our presence on the chimpanzees' behaviour. We dictated observations into the microphone or noted them using CyberTracker (ver. 3.496) on a Samsung Xcover 4 portable device (Samsung Group, Seoul, South Korea). We recorded videos using a Panasonic VHC-770 HD (resolution: 1920*1080/50p). We recorded all relevant events and changes in the behaviour of all individuals in the party, and recorded the composition of the party continuously.

Qualitative Acoustic Analysis
We inspected audio recordings to extract vocalisations using spectrograms generated with Praat software (ver. 6.0.42) and Sennheiser HD650 headphones. We transformed calls with the Fourier function using a Hanning window function and 1024 time steps. Four authors independently categorised call types (Table II) based on auditory features and inspection of spectrograms using published chimpanzee vocal repertoires composed of nine call types (Table III). If one of the four authors disagreed with the categorisation, we used group majority to determine the call type. There were no instances where more than one author disagreed with the categorisation.
To provide a more comprehensive and diverse assessment of the call types, we asked seven independent experts in chimpanzee vocal communication (Table SII Supplementary Material), blind to any information about the recordings, to categorise the calls recorded from KU7, estimate the age of the caller, and comment on the vocal structures. We provided an unlabelled audio file in which we collated all the calls produced by KU7 in chronological order, with sequences separated by 1 s of silence (Online Resource 1).

Quantitative Acoustic Analysis
We manually extracted six acoustic features from each call unit using Praat software (ver. 6.0.42): duration of each exhaled unit, fundamental frequencies (F0) at the start, middle, and end of the unit, maximum and minimum F0, and range of the F0. We selected these features based on the acoustic data extractable from the recordings and on measurements typically considered when determining call types in chimpanzees (e.g., Marler & Tenaza, 1977; Mitani et al., 1999; Mitani & Brandt, 1994; Slocombe & Zuberbühler, 2010). We only considered exhaled vocal units to make our acoustic analyses comparable with previous studies of pant hoots (e.g., Clark & Wrangham, 1993; Desai et al., 2021; Fedurek et al., 2013a, b, 2017; Mitani et al., 1992, 1999; Riede et al., 2007; but see Crockford et al., 2004), but also noted the number of inhaled (panted) units produced between exhaled units when these were visible on the spectrogram. Because of the quiet nature of the newborn vocalisations, the presence of environmental background noise, and the distance between the newborn and the microphone, we could not use automated procedures to extract acoustic features.

Table III Call type definitions, based on Marler and Tenaza (1977), Slocombe and Zuberbühler (2010), and Taylor et al. (2021). The definition of pant hoot is based on Marler and Hobbett (1975) and Notman and Rendall (2005). Example spectrograms for each call type produced by chimpanzees in the Sonso community, Budongo Forest, Uganda, are presented in Slocombe and Zuberbühler (2010)

Call type: Definition
Bark: Short, loud, and noisy call with abrupt onset. Generally low-frequency.
Grunt: Short and soft call that can be either tonal or noisy. Generally low-frequency and produced with variable rhythm.
Hoo: Short and tonal call with highest frequency and amplitude at the start then decreasing over the duration of the call. The call can be either soft (e.g., rest hoo) or loud (e.g., alarm hoo).
Laughter: Short, soft, and noisy sounds produced while inhaling and exhaling. Generally low-frequency and produced with irregular rhythm.
Pant: Short, soft, and low-frequency unvoiced sounds.
Scream: Loud and high-frequency call that can be either noisy or harmonic. Acoustic energy is usually present during exhalation.
Squeak: Short, high-frequency, and clear tonal call.
Whimper: Soft, low-frequency and tonal call similar to "hoo" calls. Units can increase in frequency within a sequence.
Pant hoot: Vocal sequence composed of up to four acoustically distinctive phases produced in this order and composed of:
  Introduction: Low-frequency and tonal calls that acoustically resemble "hoo" calls but are longer in duration and alternated with inhaled tonal elements.
  Build-up: Short and low-frequency calls produced both during inhalations and exhalations in rapid rhythm. Intensity and frequency typically increase over the phase.
  Climax: High-amplitude and high-frequency calls that resemble "scream" calls.
  Let-down: Short and low-frequency calls which acoustically resemble the build-up phase but with decreasing intensity and frequency.

Clustering Analysis
The general approach to studying how experience mediates vocal development is to catalogue the different call types across developmental stages, using acoustic measurements and classification algorithms (Bradbury & Vehrencamp, 2011; Kershenbaum et al., 2016). A common problem is that vocal repertoires are often graded, making objective classifications particularly challenging. Human vocal behaviour is also highly graded, yet receivers still perceive transitions in categorical ways, suggesting that human perceptual judgements can be used to disambiguate gradual transitions (Deecke & Janik, 2006; Janik, 1999). For animal vocal repertoires, data-driven categorisation approaches are preferable, mainly because the degree to which human perceptual bias reflects that of other species remains unclear, and because they allow systematic comparisons across communities (Crockford, 2019). Soft clustering methods based on fuzzy set theory (Zadeh, 2008) are well suited to describing graded vocal repertoires of primates (e.g., chacma baboons, Papio ursinus: Wadewitz et al., 2015), an approach that is also promising for chimpanzees (e.g., immature chimpanzees: Taylor et al., 2021). We used fuzzy c-means clustering to identify the best fitting model for the number of clusters representing different call types in the newborn vocalisations. The fuzzy c-means algorithm measures the degree to which sounds belong to categories based on their acoustic properties without restricting them to a single category, capturing more details than hard clustering methods, including the graded transition between call types. We analysed the stability and reliability of model solutions to evaluate the extent to which the optimal description of the calls depended on a small number of acoustic parameters, and how robust optimal descriptions were to overlap between clusters.
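To illustrate what graded membership means in practice, the following minimal Python sketch computes the standard fuzzy c-means membership of a single call from its distances to each cluster centre. It is an illustration only: the function name, the distances, and the exponent values are hypothetical, and the sketch is not the implementation used for the analyses reported here.

```python
def memberships(distances, fuzziness):
    """Standard fuzzy c-means memberships of one point.

    distances: distance from the point to each cluster centre (all > 0).
    fuzziness: fuzziness exponent (> 1); larger values give softer clusters.
    """
    exponent = 2.0 / (fuzziness - 1.0)
    result = []
    for d_i in distances:
        total = sum((d_i / d_j) ** exponent for d_j in distances)
        result.append(1.0 / total)
    return result


# Hypothetical call that is twice as far from centre B as from centre A.
distances = [1.0, 2.0]
for fuzziness in (1.1, 1.5, 2.0, 2.5, 3.0):
    u = memberships(distances, fuzziness)
    print(fuzziness, [round(x, 3) for x in u])
```

With a low exponent the call belongs almost entirely to the nearer cluster; as the exponent grows, the memberships move towards 1/K (here 0.5), at which point cluster assignment is no longer informative.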
We z-transformed the acoustic features prior to analysis to prevent the influence of measurements with different scales (i.e., Hz and s) on cluster solutions. Since fuzzy c-means clustering is based on the individual acoustic features of each call instead of the total number of calls available (Wadewitz et al., 2015), the small number of newborn vocalisations we recorded was not a limiting factor because a minimum number of data points for each call type is not required.
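As a minimal sketch of the z-transformation described above, the following Python snippet centres each feature on zero and scales it by its sample standard deviation, so that features measured in seconds and in hertz end up on the same scale. The function name and the measurements are hypothetical and are not data from this study; the use of the sample standard deviation is an assumption.

```python
from statistics import mean, stdev


def z_transform(values):
    """Centre a feature on 0 and scale it to unit sample standard deviation."""
    m = mean(values)
    s = stdev(values)  # sample SD (n - 1 denominator); assumed convention
    return [(v - m) / s for v in values]


# Hypothetical measurements for three call units (not data from the study).
durations_s = [0.10, 0.20, 0.30]
start_f0_hz = [300.0, 400.0, 500.0]

print(z_transform(durations_s))
print(z_transform(start_f0_hz))
```

Both features map to approximately -1, 0, and 1 after transformation, although their raw values differ by three orders of magnitude.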
We adjusted two parameters to identify the best cluster solution to describe the newborn calls: the maximum number of clusters extracted (K), and the 'fuzziness parameter' (μ), which limits the degree of overlap between clusters (i.e., lower values allow less overlap between clusters). We ran fuzzy models using the "fanny" implementation in the "cluster" package (ver. 2.1.2, Maechler et al., 2021), varying K values between a minimum of two (required to quantify gradation) and a maximum of seven, which matched the number of call types we assessed qualitatively and was in line with Taylor et al. (2021). We varied μ values starting at 1.1 with increments of 0.5 following Taylor et al. (2021) and stopped at 3.0 when all membership coefficients were too close to 1/K, which corresponds to the limit of the algorithm to assign cluster membership to calls (Zadeh, 2008). All models considered converged within 500 iterations. We evaluated the fit and confidence of each solution based on the mean silhouette value of all data points combined, which represents how separable the acoustic clusters are. Silhouette values range from −1 to 1. Positive values represent data points that are closer to their primary cluster, indicating some degree of confidence with regard to their cluster membership, while negative values represent data points that overlap between clusters and are potentially misclassified (Wadewitz et al., 2015). We assessed the reliability of the model by looking at the range of μ values obtained for any given K value, which provided an indicator of 'gradedness'. Solutions for which low and high μ values can be extracted are regarded as more robust to overlap between clusters. Using the "clValid" package (ver. 0.7, Brock et al., 2008), we assessed the stability of the clusters by calculating four measurements that compare the clustering based on all variables with the clusterings obtained after systematically removing one variable at a time. These measurements indicate how much the clusters depend on a small number of acoustic parameters, and thus how 'generalisable' the cluster separations are. The four measurements we used are: the mean proportion of non-overlap between data points (APN), the mean distance between data points in the same cluster (AD), the mean distance between the cluster's centre and the data points in the same cluster (ADM), and the mean variance of data points in the same cluster (FOM) (Brock et al., 2008). Given that our aim was to categorise calls into clusters, we gave priority to mean silhouette values to identify the best model. We extracted a hard-clustering solution for the best-fitting model and assigned all the calls to their primary cluster membership. We then examined the distribution of qualitatively categorised call types in each cluster.
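The silhouette criterion used to rank models can be sketched as follows. This Python example computes the standard silhouette value of each point for a hard cluster assignment, assuming Euclidean distances; the points, labels, and function names are hypothetical, and the code illustrates the general definition rather than the exact computation performed by the R packages cited above.

```python
from math import sqrt
from statistics import mean


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def silhouette_values(points, labels):
    """Silhouette value of each point; 0.0 for singleton clusters."""
    values = []
    for i, (p, own) in enumerate(zip(points, labels)):
        by_cluster = {}
        for j, (q, label) in enumerate(zip(points, labels)):
            if j != i:
                by_cluster.setdefault(label, []).append(euclidean(p, q))
        if own not in by_cluster or len(by_cluster) < 2:
            values.append(0.0)
            continue
        a = mean(by_cluster[own])  # mean distance within own cluster
        b = min(mean(d) for label, d in by_cluster.items() if label != own)
        values.append((b - a) / max(a, b))
    return values


# Hypothetical z-transformed calls forming two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
well_assigned = silhouette_values(points, [0, 0, 1, 1])
misassigned = silhouette_values(points, [0, 0, 1, 0])

print(round(mean(well_assigned), 3))
print([round(v, 3) for v in misassigned])
```

In the first assignment every point has a clearly positive value; in the second, the last point is placed in the distant cluster and receives a negative value, flagging a likely misclassification.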

Comparison with Pant Hoots
We conducted additional acoustic analyses to compare vocal sequences produced by the newborn which contained calls resembling pant hoot phases with pant hoots from infant, juvenile, sub-adult, and adult males of the Sonso community. We selected pant hoots produced during resting or feeding events, because pant hoots vary depending on the behavioural context of production (Fedurek, Zuberbühler, & Dahl, 2016b; Notman & Rendall, 2005), and the newborn vocalised while resting or potentially before or after nursing. Although pant hoot sequences can be composed of repeated vocal units from a single phase (Soldati et al., 2022), we selected calls composed of two or more phases to be consistent with previous studies (e.g., Fedurek et al., 2014; Mitani et al., 1999; Notman & Rendall, 2005). To control for potential differences between the sexes in the acoustic structure of pant hoots (e.g., Holden, 2017), we only selected male pant hoots. We selected recordings based on their overall quality (lack of background noise or overlap with other callers) and good signal-to-noise ratio. Although these recordings were of higher quality than recordings of the newborn, we extracted the acoustic measurements manually in the same way to avoid introducing a potential bias. We sampled the first and the middle vocal units of the introduction and climax phases, for up to four units from each pant hoot. This allowed us to take into consideration the acoustic gradation that can occur within phases. Where there was an even number of units, we chose the first of the two middle units (as in Desai et al., 2021). In total, we extracted features from 189 vocal units (42 pant hoots) produced by three infants, four juveniles, four sub-adults, and four adults, with a minimum of two pant hoots per individual (Table SVIII).
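The unit-sampling rule can be expressed as a short Python sketch. The function name and the unit labels are hypothetical, and the handling of phases with only one or two units (where the first and the middle unit coincide) is an assumption rather than a documented rule.

```python
def first_and_middle(units):
    """Return the first and the middle unit of a phase.

    With an even number of units, the first of the two middle units is used.
    Assumption: when first and middle coincide, the unit is returned once.
    """
    if not units:
        return []
    middle_index = (len(units) - 1) // 2
    if middle_index == 0:
        return [units[0]]
    return [units[0], units[middle_index]]


# Hypothetical phases with five and six units.
print(first_and_middle(["u1", "u2", "u3", "u4", "u5"]))
print(first_and_middle(["u1", "u2", "u3", "u4", "u5", "u6"]))
```

Applied to the introduction and the climax of one pant hoot, this rule yields at most four units per call.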
To determine whether the acoustic structure of Phase 1 and 2 calls produced by the newborn differed from the introduction and climax phases produced by infant, juvenile, sub-adult, and adult individuals from the Sonso community, we used permuted discriminant function analyses (pDFA; Mundry & Sommer, 2007), following previous studies (e.g., Leroux et al., 2021; Soldati et al., 2022). To analyse the introduction phase, we used 19 vocal units from 11 calls produced by four adults, 21 units from 11 calls produced by four sub-adults, six units from four calls produced by three juveniles, and 19 units from ten calls produced by four infants. Together with the Phase 1 calls (n = 26), we obtained a total of 91 calls. To analyse the climax phase, we used 17 units from 11 calls produced by four adults, 20 units from 11 calls produced by four sub-adults, 18 units from nine calls produced by four juveniles, and ten units from ten calls produced by three infants. Together with the Phase 2 calls (n = 18), we obtained a total of 93 calls. Before analysis, we assessed multicollinearity to avoid including correlated acoustic parameters. We removed one parameter at a time, each time the one with the highest variance inflation factor (VIF), using the 'performance' R package (version 0.8.0, Lüdecke et al., 2021) until we obtained a set of variables with low correlation. In the final set of four variables (Duration, Start F0, End F0, Range F0), the highest VIF for introduction calls was 2.85 and the highest VIF for climax calls was 1.63. We assessed the distribution of the data, and when variables were not normally distributed and this could be improved, we applied a log or square-root transformation. We then used nested pDFA with 1000 permutations to test whether the acoustic structure of the newborn's Phase 1 and 2 calls differed significantly from the corresponding phases produced by the other age categories (Mundry & Sommer, 2007). In comparison with a conventional DFA, a pDFA allows the inclusion of repeated data points per individual and controls for unbalanced data sets at the same time. We included the 'ID' of the caller as a control factor.
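The logic of testing classification accuracy against permuted group labels can be sketched in Python as below. This is a deliberately simplified stand-in and not the nested pDFA of Mundry and Sommer (2007): it replaces the discriminant function with a nearest-centroid classifier, uses leave-one-out cross-classification, and permutes labels freely rather than within the nested structure of callers. All names and data are hypothetical.

```python
import random
from math import sqrt


def euclidean(p, q):
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def centroid(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


def loo_accuracy(features, groups):
    """Leave-one-out accuracy of a nearest-centroid classifier."""
    correct = 0
    for i, (x, g) in enumerate(zip(features, groups)):
        train = {}
        for j, (y, h) in enumerate(zip(features, groups)):
            if j != i:
                train.setdefault(h, []).append(y)
        predicted = min(train, key=lambda h: euclidean(x, centroid(train[h])))
        correct += predicted == g
    return correct / len(features)


def permutation_p_value(features, groups, n_perm=1000, seed=1):
    """Proportion of label arrangements classified at least as well as observed."""
    rng = random.Random(seed)
    observed = loo_accuracy(features, groups)
    shuffled = list(groups)
    at_least = 1  # the observed arrangement counts as one permutation
    for _ in range(n_perm - 1):
        rng.shuffle(shuffled)
        if loo_accuracy(features, shuffled) >= observed:
            at_least += 1
    return observed, at_least / n_perm


# Hypothetical, clearly separated groups (not data from the study).
features = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
groups = ["A"] * 4 + ["B"] * 4
print(permutation_p_value(features, groups))
```

A small p-value indicates that the groups are classified better than expected by chance; a large one, as reported for the newborn comparisons, indicates that the groups cannot be told apart on the measured features.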
We conducted all statistical analyses in R (version 4.1.2, R Core Team, 2021).

Ethical Note
Data collection was entirely observational, adhering to the ASAB guidelines for the treatment of animals during behavioural studies (Association for the Study of Animal Behaviour, 2018).

Data Availability
Data on the newborn vocal and perinatal behaviours generated or analysed during this study are included in this article and its supplementary information files.

Results
A detailed report of the birth is available in the Supplementary Material. Video and audio recordings are available as Online Resources (1, 3-5).

Qualitative Call Classification
We recorded 70 call units from the newborn during 2 hours and 15 minutes of observation (0.5 per minute). These calls were divided into 12 separate vocal occurrences (also referred to as 'utterances'; call rate 0.1 per minute), of which three were single calls and nine were call sequences (see Online Resource 2 for the acoustic spectrograms). Vocal sequences contained a mean of 7.4 vocal units (range 2-17). We identified barks (n = 2), grunts (n = 8), hoos (n = 6), squeaks (n = 2), whimpers (n = 8), and units that we labelled as part of a pant hoot (n = 44). We distinguished two variants which we refer to as "pant hoot phase 1" (n = 26), hereafter Phase 1 for brevity, and "pant hoot phase 2" (n = 18), hereafter Phase 2 (Fig. 1). The four vocal sequences that included Phase 1 or Phase 2 were composed of a mean of 11 vocal units (range 5-15) when excluding other call types. Of all the newborn's Phase 1 and 2 units, 36% (n = 15) included panted units between exhaled units (Fig. 1). Five of seven independent experts agreed with our decision to classify KU7's Phase 1 and Phase 2 calls as resembling adult pant hoots (Table SIV Supplementary Material). One expert classified the calls as either pant hoots or whimpers, and one expert classified the calls as whimpers. The experts did not reliably classify the other call types (barks, grunts, hoos, squeaks, and whimpers), citing challenging conditions (soft signal volume and background noises), but two experts reported the presence of quiet hoos and grunts among these other call types. Three experts classified the caller as a young individual, one as a juvenile, one as an infant, one as immature, and one as either a juvenile or a young adult.
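The rates and totals reported above follow from simple arithmetic, shown here as a short Python check. The variable names are ours; the counts are those given in the text.

```python
observation_minutes = 2 * 60 + 15  # 2 hours and 15 minutes

call_counts = {
    "bark": 2,
    "grunt": 8,
    "hoo": 6,
    "squeak": 2,
    "whimper": 8,
    "phase 1": 26,
    "phase 2": 18,
}
total_units = sum(call_counts.values())
utterances = 3 + 9  # single calls plus call sequences

unit_rate = total_units / observation_minutes
utterance_rate = utterances / observation_minutes

print(total_units)               # 70
print(round(unit_rate, 1))       # 0.5
print(round(utterance_rate, 1))  # 0.1
```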

Quantitative Analyses
We extracted six acoustic features from 70 call units and used them to model the best clustering of acoustically similar units using fuzzy analyses (Table IV). Overall, we obtained 20 unique models with two to seven clusters (K) and fuzziness parameters (μ) of 1.1, 1.5, 2.0, and 2.5 (Fig. 2). Two-, three-, four-, and five-cluster solutions could be calculated up to μ = 2.0, while six- and seven-cluster solutions could be calculated up to μ = 2.5. The model that best fit our data was calculated with three clusters (K = 3) and with a fuzziness parameter of 1.1 (Fig. 2), scoring the highest mean silhouette value of 0.450, which indicates confidence in the overall solution. Although this model was not the most stable, only 6% of call units (n

Cluster Composition
We calculated the percentage of each call type (determined qualitatively) that belonged to each of the three clusters in the best fitting model identified by the quantitative analysis (Fig. S2 Supplementary Material). The first cluster was composed of grunts and whimpers (Table V). The second cluster consisted of barks, squeaks, Phase 2 calls, Phase 1 calls, and grunts (Table V). The third cluster consisted of hoos, Phase 1 calls, and Phase 2 calls (Table V).

Call Combinations
KU7 produced single calls in three instances (grunts only) and nine different call sequences (range: 2-17 units). Of the nine sequences, four were combinations of units from different call types (Fig. 3). Overall, calls from two or three different clusters were produced in a single combinatorial structure, and two to four different calls were combined in a structure.

Comparison with Pant Hoots
Pant hoots were composed of a mean of 15 vocal units in infants (range 1-18), 12.6 units in juveniles (range 1-12), 10.2 units in sub-adults (range 1-15), and 8.1 units in adults (range 1-15). A panted unit followed 43% of vocal units produced by infants (n = 21), 59% of units in juveniles (n = 20), 73% of units in sub-adults (n = 37), and 94% of units in adults (n = 44). Vocal usage rate varied with phase and age category (Table SIX). While introduction, build-up, and climax phases were observed across all age categories, the let-down phase was not observed in infants and juveniles (Table SIX). Furthermore, the let-down phase was rarely produced by sub-adults and adults, and the build-up phase was rarely produced by adults (Table SIX). The discriminant function could not operate when including the introduction (produced by infant, juvenile, sub-adult, and adult males) and the Phase 1 calls produced by the newborn. This was likely because the within-group variance of the variables was lower than the level accepted by the function, which might indicate that variables are collinear or constant (Venables & Ripley, 2002). Because the variables we considered were not collinear and since pant hoots have mainly been studied in sub-adult and adult males, we repeated the analysis including only individuals from these age categories. The results are not compatible with the idea that calls are acoustically different in newborn vs sub-adult and adult males (expected correctly cross-classified: 42%, p = 0.257). For the climax and Phase 2 calls, the results are also not compatible with the idea that calls are acoustically different in newborn vs infant, juvenile, sub-adult, and adult males (expected correctly cross-classified: 26%, p = 0.164). We repeated the analysis including only the newborn, sub-adult, and adult individuals, and found similar non-significant results (expected correctly cross-classified: 40%, p = 0.266).

Discussion
We qualitatively discriminated seven call types from the 70 units produced by a newborn chimpanzee immediately after birth, in line with what was previously reported for older chimpanzee infants (Plooij, 1984). The majority of units were given as part of sequences. Interestingly, the newborn also produced vocal structures resembling pant hoots. Quantitative analyses revealed three acoustically distinct clusters of calls, with calls from different clusters combined into the same sequence. All call types were also produced in isolation, with the exception of Phase 2 (pant hoot) calls and squeaks, which were only produced in combination with other calls. From these data we concluded that chimpanzees have the capacity to combine some call types into larger structures from birth.
We can suggest four hypotheses explaining what could trigger the newborn's vocal production. First, the newborn produced a series of long vocal sequences to attract the mother's attention, for example due to discomfort or desire to be nursed. However, we did not notice any of the more typical calls for such contexts (i.e., cries and whimpers; Dezecache et al., 2021) and the newborn was always in bodily contact with the mother during our observations, making this explanation unlikely.
Second, the newborn's vocal behaviour may represent a rudimentary form of babbling (Oller, 2000), which serves as vocal practice and to elicit care-giving (ter Haar et al., 2021). However, the newborn produced vocal sequences composed of repeated vocal units that were rhythmically produced and only contained a subset of adult calls. Furthermore, the sequences lacked variable acoustic structures, did not elicit vocal or social responses from the mother, were produced at lower call rates than human and marmoset infants (Elowson et al., 1998a, 1998b; Oller et al., 2021; Snowdon & Elowson, 2001) and at comparable rates with chimpanzee and bonobo infants (Kojima, 2008; Oller et al., 2019; Taylor, 2020), making the babbling hypothesis an unlikely explanation. In marmosets, one of the few primates where babbling has been reported, parents engage in vocal feedback and exchanges (Takahashi et al., 2015), while chimpanzee mothers rarely direct vocalisations to their offspring (Schick et al., 2022). While language-trained or human-raised chimpanzees have been reported to produce vocalisations similar to babbling when interacting with researchers (Hayes & Hayes, 1951, pp. 106-108), the behaviour is different from human infant babbling in terms of variety, quantity, and duration (Kellogg, 1968; Kojima, 2008). In contrast, the newborn vocalisations presented some similarities with protophone-like sounds produced by young bonobo infants (Oller et al., 2019). These sounds are regarded as akin to the exploratory protophones produced by human infants since birth during low- to moderate-arousal contexts, without requiring social stimulation, and prior to babbling (Oller, 2000; Oller et al., 2016). However, the newborn chimpanzee produced sounds at much lower rates than humans, without clear signs of playfulness, and without interacting vocally with the mother, all of which also characterise vocal behaviours in infant bonobos (Oller et al., 2019). One possibility is that the period of vocal exploration in great apes is very reduced and limited to the earliest developmental phase, although further observations are necessary to test this possibility.
Third, the newborn's vocal behaviour may have been an artefact resulting from limited vocal control. Interestingly, within sequences, Phase 1 calls were always followed by Phase 2 calls, never the other way around, and sequences terminated with squeaks followed by grunts. This call order is akin to that of pant hoots, in which introduction units are followed by climax units, and akin to how pant hoots tend to be followed by food grunts in call combinations (Leroux et al., 2021). These observations do not support the artefact hypothesis, although there may have been anatomical constraints on vocal production. For instance, in adult pant hoots, the climax is never produced in isolation but is always preceded by an introduction or build-up phase (Soldati et al., 2022), perhaps because producing high-pitched and high-amplitude calls requires more time and effort (Riede et al., 2007).
Finally, intra-uterine auditory exposure to conspecifics' calls may have affected the newborn's vocal development. While we cannot address this hypothesis with our data, it has been documented in marmosets (Narayanan et al., 2022), humans (Gervain, 2018; Varga et al., 2019), and songbirds (Colombelli-Négrel et al., 2021), all of which are regarded as vocal learners (Vernes et al., 2021). Further studies are needed to clarify the effects of pre- and post-natal auditory exposure on the development of great ape vocalisations.
The most puzzling aspect of the newborn's vocal behaviour was the presence of vocal structures that acoustically and visually resembled chimpanzee pant hoots, with clear resemblance to adult as well as infant and juvenile pant hoots. Most experts rated these structures as pant hoot attempts. Although some experts rated them as whimpers, cluster analyses revealed that they did not belong to the whimper cluster. Phase 1 calls closely resembled the introduction phase and Phase 2 calls closely resembled the climax phase, which was supported by the results of the discriminant analyses. All the hoo calls produced by the newborn belonged to the Phase 1 cluster, in line with the idea that the introduction can be seen as a variant of hoos (Crockford, 2019). The production of panted units, a characteristic feature of pant hoots, followed an incremental pattern from the newborn through all age categories, suggesting that it develops during ontogeny. In addition, we agree with the experts, who pointed out that the overall 'rhythmicity' of the newborn sequences is characteristic of pant hoots. Thus, the newborn produced vocal structures resembling pant hoots, suggesting that such complex structures are part of the innate vocal repertoire of chimpanzees. However, the newborn utterances differed from adult pant hoots in lacking phases that correspond to the build-up and let-down. Alternating phases with a relatively low rate of unit production, such as the introduction and climax, with phases exhibiting a fast-paced and panted unit production, such as the build-up and let-down, is a key feature of pant hoots.
The differences between the arguably rudimentary form of the newborn pant hoot and adult pant hoots suggest that this call type undergoes some ontogenetic processes. Immature individuals might learn to produce certain phases in specific contexts or as part of structurally varying sequences (usage learning; Janik & Slater, 2000; Marshall et al., 1999), as well as pant hoots that resemble the pant hoots of group members or social partners (production learning; Ruch et al., 2018). The latter hypothesis is supported by the presence of community dialects (Crockford et al., 2004; Mitani et al., 1992) and by the stronger call similarity between social partners (Mitani & Brandt, 1994; Mitani & Gros-Louis, 1998), although genetic or ecological factors might also explain community differences (Mitani et al., 1999; Desai et al., 2021). While some developmental changes result from the maturational process (Nishimura et al., 2003), systematic study of the acquisition of vocal capacities, especially at the early ontogenetic stages, is a key missing element in the current debate on vocal learning in primates, particularly in great apes (e.g., Fischer et al., 2015; Watson et al., 2015a, 2015b). Our observations fit with the idea that primate vocal repertoires are largely fixed and present from birth (Fischer & Hammerschmidt, 2020), although they also indicate that fine acoustic structures undergo ontogenetic processes.
The limitations of our study include the short observation period and the small dataset, which reduce our ability to generalise from our study. Having complete access to the context of production typically facilitates call classification, although it can also be misleading with graded and flexible calls (Fischer & Price, 2017; Schamberg et al., 2018). For instance, pant hoots are produced across most contexts and in response to conspecific calls (Goodall, 1986), while food grunts are related to both feeding and agonistic events (Ischer et al., 2020; Marler & Tenaza, 1977). The categorisation of calls based on acoustic features is less subject to biases when the context of production is particularly unclear or flexible, as in pant hoots and immature calls. When applying fuzzy clustering on small datasets, calls can appear more discrete since they are less likely to represent the entire repertoire. In addition, extracting a small number of features can lead to higher spread of values but does not necessarily indicate better separation (Wadewitz et al., 2015). Because we did not observe the production of typical newborn utterances such as cries or screams (Kojima, 2008), it is possible that their absence affected our quantitative analyses. Specifically, the best solution was the most distinct but was also more influenced by a smaller subset of features. However, in a recent study the repertoire of infant chimpanzees was best described by a two-cluster model, with evidence of a potential third cluster (Taylor et al., 2021), which provides further validity to the model describing the newborn vocalisations. We do not claim that there are only three call types in the repertoire of newborn chimpanzees. To determine its true size, it will be necessary to investigate how receivers react to each vocalisation (Seyfarth & Cheney, 2017), including to graded variants (e.g., Fischer, 1998; Fischer et al., 2001).
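To make the point about fuzzy clustering and apparent discreteness concrete, the following is a minimal sketch of fuzzy c-means in plain Python. It is not the analysis pipeline used in this study: the function names, the single acoustic feature, and all values are hypothetical and chosen only for illustration. The gap between a call's two largest cluster memberships is used here as a simple index of how discrete (large gap) or graded (small gap) that call appears.

```python
import math

def memberships_for(points, centers, m):
    """Fuzzy membership of each point in each cluster (each row sums to 1)."""
    rows = []
    for p in points:
        # Inverse-distance weighting controlled by the fuzzifier m (> 1).
        inv = [max(math.dist(p, c), 1e-12) ** (-2.0 / (m - 1.0)) for c in centers]
        total = sum(inv)
        rows.append([v / total for v in inv])
    return rows

def fuzzy_c_means(points, init_centers, m=2.0, n_iter=200, tol=1e-9):
    """Minimal fuzzy c-means; returns final centroids and memberships."""
    centers = [tuple(c) for c in init_centers]
    n_dims = len(points[0])
    for _ in range(n_iter):
        u = memberships_for(points, centers, m)
        new_centers = []
        for k in range(len(centers)):
            # Centroid = mean of the points weighted by membership ** m.
            w = [row[k] ** m for row in u]
            w_sum = sum(w)
            new_centers.append(tuple(
                sum(wi * p[d] for wi, p in zip(w, points)) / w_sum
                for d in range(n_dims)))
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, memberships_for(points, centers, m)

def membership_gap(row):
    """Difference between the two largest memberships of one call."""
    top = sorted(row, reverse=True)
    return top[0] - top[1]

# Hypothetical one-feature dataset: two tight groups of calls plus one
# acoustically intermediate (graded) call as the last entry.
calls = [(0.0,), (0.2,), (0.4,), (4.0,), (4.2,), (4.4,), (2.2,)]
centers, memberships = fuzzy_c_means(calls, init_centers=[(0.0,), (4.4,)])
gaps = [membership_gap(row) for row in memberships]
```

In this toy dataset, the calls in the two tight groups have large membership gaps, whereas the intermediate call is shared almost equally between clusters. If a small sample happens to contain no intermediate calls, every gap is large and the repertoire looks more discrete than it may actually be.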
The observation of a birth in a wild chimpanzee community provided a rare opportunity to investigate the vocal behaviour of a newborn chimpanzee, the starting point of a long developmental trajectory. The newborn demonstrated the capacity to combine different call types into larger vocal sequences. Some of these combinatorial structures were composed of unique calls that shared several characteristics with pant hoots and were identified by expert human listeners as such. Consequently, our study suggests that acoustically complex structures, akin to adult pant hoots, are part of the chimpanzee vocal repertoire from birth, and that these sequences are subject to ontogenetic processes that shape their acoustic structure. While extensive work has been conducted on adult male calls, and combinatorial capacities have recently gained attention, further work is necessary to elucidate the production, usage, and comprehension of complex vocal sequences in primates from an ontogenetic perspective. Although we remain careful in interpreting observations from a single individual, we believe that our study provides a valuable contribution to the study of chimpanzee vocal development that will hopefully encourage further research on the ontogeny of great ape vocalisations.