Introduction

In many animals, acoustic communication plays a central role in a variety of behavioural contexts such as mate selection, territorial defence, predator avoidance, group cohesion and foraging (Bradbury and Vehrencamp 2011). During the last few decades, there has been a large effort to unravel the evolution of acoustic communication in different taxa, resulting in the identification of four main sources of acoustic divergence. Firstly, sexual selection has been recognized as a powerful evolutionary driver of acoustic signal divergence, given the involvement of signals in mate choice (Wilkins et al. 2013; Pearse et al. 2018; Friis et al. 2021; Mikula et al. 2021). For instance, a number of studies have shown the existence of character displacement following secondary contact between closely related species (Pfennig and Pfennig 2010). Secondly, environmental selection through ambient noise strongly influences signal evolution, and the strength of the constraints imposed by noise on acoustic communication differs according to species richness and the risk of acoustic interference (Robert et al. 2019). Studies on birds and insects revealed selection for acoustic signal characteristics and signalling behaviours that overlap the least with those of other species (niche partitioning: Sueur 2002; Luther 2009; Robert et al. 2021). Thirdly, indirect environmental selection may affect traits that are also involved in the production and/or modulation of vocalizations (i.e. a pleiotropic effect). Indeed, ecological selection may act on morphological traits such as body size or beak shape in birds, resulting in an indirect selective effect on acoustic signals since these traits are also involved in acoustic signal production (Podos and Nowicki 2004). Lastly, both cultural and genetic drift can also drive acoustic divergence in the absence of selective pressure (Wilkins et al. 2013). 
For instance, in the Greenish Warbler (Phylloscopus trochiloides), spatial divergence in territorial calls is mainly related to genetic distance and not to ecological divergence (Irwin et al. 2008). Such a neutral process is also expected to be more important in the case of complex acoustic signals and learning (Wilkins et al. 2013).

As outlined by Wilkins et al. (2013), these evolutionary drivers are not mutually exclusive, and their combined effect will also depend on the environmental constraints imposed on signal transmission. Indeed, as first suggested by Chapuis (1971) and Morton (1975), birds living in forest may vocalize at a lower frequency than species living in open habitat because lower frequencies are less degraded than higher frequencies during propagation in closed habitats. Recently, this environmental constraint on sound propagation has been extensively examined in mating signals across different taxa and under different environmental constraints (Ey and Fischer 2009; Roca et al. 2016), especially in birds, for which meta-analyses (Boncoraglio and Saino 2007; Roca et al. 2016) and relatively large phylogenetic analyses (Derryberry et al. 2018; Pearse et al. 2018; Friis et al. 2021; Mikula et al. 2021) are now available. Although some of these studies suggest a moderate effect of habitat on birdsong divergence, which is mainly associated with one spectral acoustic feature, the peak frequency (i.e. the frequency for which amplitude/energy is maximum, Boncoraglio and Saino 2007; Pearse et al. 2018), other studies did not find such an effect (Friis et al. 2021; Mikula et al. 2021). This relatively weak effect of environmental constraints on acoustic signal divergence could be explained by evolutionary trade-offs applying to the function of the studied acoustic signal (Boncoraglio and Saino 2007; Ey and Fischer 2009). For instance, in the case of long-range mate attraction, senders likely have to trade off the transmission efficiency needed to attract conspecific mates over long distances against the energetic cost of signalling and the risk of eavesdropping by predators or parasites (Ey and Fischer 2009). 
Moreover, as demonstrated in the Silvereye (Zosterops lateralis), such an acoustic adjustment to environmental constraints could result from phenotypic plasticity (e.g. Silvereyes are capable of flexible adjustments of call frequency, amplitude, and duration to maximize signal-to-noise ratio in noisy environments), a phenomenon which could be widespread among birds as a response to anthropogenic environmental constraints (Potvin and Mulder 2013). Lastly, since the effect of environmental constraints on signal propagation increases with distance, a weaker effect of these constraints is expected on signals emitted at close range than on those used in long-range communication.

Studying acoustic signals associated with functions other than mate selection or territorial defence may be particularly helpful to disentangle the relative contribution of each evolutionary driver, as well as their interaction with the effect of environmental constraints (Ey and Fischer 2009; Wilkins et al. 2013; Billings 2018; Friis et al. 2021). In birds, as in mammals, acoustic communication is also largely involved in behavioural contexts other than mate selection and/or territorial defence, such as parental care, predator avoidance, group cohesion and foraging (Caro 2005; Catchpole and Slater 2008). In the case of predator avoidance, two main types of acoustic signals have been documented: flee alarm calls and mobbing calls, each associated with a distinct and quite contrasted function (Marler 1955; Magrath et al. 2015). While flee alarm calls are emitted by escaping individuals and trigger freezing or fleeing in listeners, mobbing calls are emitted by individuals trying to gather other potential prey in order to deter a predator (Pettifor 1990; Pavey and Smyth 1998; Magrath et al. 2015). Both the intensity of mobbing and its success in deterring the predator are positively related to the size of the group attracted by the caller (Robinson 1985; Verbeek 1985; Picman et al. 1988). Thus, the function of mobbing calls requires the sender to be easily localizable (Marler 1957; Hurd 1996), a property that could be enhanced by both a broad frequency bandwidth and a long duration. In his seminal work on the distinctiveness of alarm call function according to these specific acoustic features, Marler (1957) suggested that these functions should lead to a convergence of the related acoustic signals in diverse taxa, or to a lack of divergence if the calls are produced by phylogenetically closely related species. Mobbing calls, therefore, provide a clear foundation to anchor comparative analyses on the evolution of acoustic communication. 
Nevertheless, only a few studies have investigated acoustic variation in mobbing calls across bird species (Latimer 1977; Ficken and Popp 1996; Proppe et al. 2010; Wheatcroft and Price 2015; Billings 2018), and hardly any have quantitatively compared this variation to that observed in other types of vocalizations (but see Latimer 1977; Wheatcroft and Price 2015). Such a lack of data is not surprising if we consider the specificity of mobbing calls. Indeed, mobbing calls cannot be characterized solely according to the context (i.e. response to a predator) but mainly by the suite of stereotyped behaviours displayed by the emitter (Wheatcroft and Price 2015).

As outlined by Odom et al. (2021), taking within-species variation into account is especially important for vocal behaviours, given the large variation that may exist within and among populations of a species. However, among the few studies that compared the divergence of acoustic signals according to their respective function (Hu and Cardoso 2010; Martin et al. 2011; Potvin et al. 2011; Sturge et al. 2016; Friis et al. 2021), none considered within-species variation or performed separate phylogenetic analyses for each acoustic signal (i.e. function). This is problematic for two reasons. First, species tend to exhibit more similar patterns than expected by chance, because of both the non-independence of the observations (i.e. phylogenetic relatedness among species) and the risk of measurement error or within-species variation (see the review by Freckleton et al. 2002 for a formal explanation and examples). Second, ignoring within-species variance can lead to biased and imprecise estimates (Ives et al. 2007).

In this study, we investigated the relative divergence of acoustic features across songbird species between sexual signals (territorial songs) and antipredator signals (mobbing calls), two vocalizations involved in different behavioural contexts and for which acoustic features are under different selection. For this purpose, among the bird species for which multiple song recordings are available (e.g. Derryberry et al. 2018: 276 species; Pearse et al. 2018: 578 species), we carefully selected those for which several mobbing call recordings were also clearly established (according to the context and the associated behaviour of the emitter, as outlined above), resulting in a restricted set of 23 species. On a structural level, sexual selection is expected to favour longer and more complex songs (increased number and diversity of note types) in birds (Nowicki and Searcy 2004; Leitão et al. 2006). Overall, songs involve a variety of different notes, while mobbing calls are often combinations of short notes with simple frequency patterns (Marler and Slabbekoorn 2004). However, this can vary hugely across species; some Paridae species tend to have much simpler songs, yet possess complex mobbing calls. For example, Black-capped Chickadees (Poecile atricapillus) have a larger variety of note types in their mobbing calls than in their song. Nuthatches (Sitta sp.), on the other hand, have songs and mobbing calls of similar complexity, while many other species have songs that are more complex than their mobbing calls. It is important to note that recent work found that some mobbing calls are composed by combining functionally distinct call types and are more complex than previously thought (Engesser et al. 2016; Suzuki et al. 2016; Dutour et al. 2019).

Furthermore, while mobbing calls are clearly expected to facilitate rapid and efficient localization of the sender (Marler 1955), this is probably less true for songs. In the case of mobbing, the signaller benefits from being perfectly localizable, and selection should favour mobbing calls that encode precise information about signaller location whatever the distance of the receiver, since mobbing efficiency depends on the size of the group attracted by the signaller (large groups are more efficient than small groups at repelling predators; Krams et al. 2009; Wheatcroft and Price 2018). Indeed, since the caller is already localizable by the predator it is mobbing, the arrival of any species, prey or predator, is more likely to turn to its advantage than to that of the mobbed predator: large groups of prey reduce the individual risk of predation (Sordahl 1990), and the arrival of other predators may also compromise predatory success through competitive interference. In accordance with these expectations, previous studies documented the role of the frequency bandwidth in making the sender localizable (Marler 1955; Dooling and Searcy 1985; Aubin and Jouventin 2002), and this acoustic characteristic is shared with aggressive and distress signals (Marler 1957; Morton 1977; Jurisevic and Sanderson 1998). While Marler (1955) predicted that mobbing calls would be structured for maximum localizability (i.e. abrupt, lower-frequency and broadband calls), Owings and Morton (1998) offered an alternative hypothesis that predicted the same structural pattern, but for a different reason. Under the motivational–structural code, mobbing calls, being inherently aggressive, should be low-pitched and harsh. Whichever hypothesis applies, broadband calls should be selected during mobbing in order to facilitate the attraction of help. Recent work also found that emission rates (i.e. the number of vocalizations produced per minute) during mobbing are higher than during singing (Cordonnier et al. 2023), and high emission rates are associated with the closest approach to the sender (Randler 2012; Dutour et al. 2022). 
Territorial songs, in contrast, serve several functions, mainly mate attraction (as well as mated pair maintenance) and territorial defence. Songs may be produced as solos either by males or females, as well as by pairs in duets (Langmore 1998; Catchpole and Slater 2008). Songs are thus also expected to enhance emitter location, especially by neighbours, rivals and potential mates, which are usually in the vicinity of the territory (Mathevon et al. 2008). However, the attraction of conspecifics can also increase predator eavesdropping (Zuk and Kolluru 1998). Thus, signaller location accuracy should rather result from a trade-off: while the benefits of efficient communication are expected to favour songs that do not deteriorate over long distances, in order to reach intended receivers, selection for avoiding eavesdropping costs should favour songs ensuring low locatability of the signaller by predators and parasites (Boncoraglio and Saino 2007). Moreover, since the context of singing is less urgent than that of mobbing, territorial songs should be under weaker selection to enhance emitter location than mobbing calls. In our view, mobbing calls are selected for “rapid and perfectly public” location information, while this is not the case for territorial songs. In such situations, the selective pressure exerted on temporal or spectral acoustic features should be quite different between the two vocalization types. 
Finally, mobbing calls are given in anti-predatory contexts and are expected to vary according to the situation encountered (e.g. type of threat: Baker and Becker 2002; Templeton et al. 2005; Kalb et al. 2019) and the physiological state of the emitter (e.g. breeding season, arousal state and resource availability). For instance, arousal induced by the predation risk is suggested to alter the duration and intensity of mobbing calls (Templeton et al. 2005; Kalb et al. 2019), and the type of predator has been shown to affect the temporal and structural features of mobbing calls in tits (Suzuki 2014; Kalb et al. 2019). In contrast, songs are produced for mate recognition and selection as well as territorial defence and thus are expected to give rise to a more stereotyped vocalization (Collins 2004; Marler and Slabbekoorn 2004).

The main objectives of our study are to ask: (1) whether the divergence of acoustic features varies according to the function of the vocalization (i.e. mobbing calls vs. territorial songs) when controlling for phylogenetic inertia, and (2) to what extent these divergences are also constrained by allometry, as indicated by species size. Our main predictions are as follows: (1) songs are more complex (i.e. increased number and type of notes) and have a longer duration than mobbing calls, (2) mobbing calls have a larger frequency bandwidth, higher maximum and peak frequencies, and a lower minimum frequency than territorial songs, and (3) mobbing calls are more variable than songs (variation in note richness, i.e. the number of unique note types per vocalization, large intraspecific variability and lower species specificity).

Methods

Species selection

To be included in our study, species had to meet the following three requirements: (1) vocal repertoire information had to be available for the species, (2) several good-quality recordings had to be available, allowing us to take intraspecific variation into account in the analyses, and (3) the species had to be included in the phylogenetic tree of Jetz et al. (2012) and Jetz et al. (2014). The second point is particularly important because, although the songs of a very large number of species are well known, mobbing calls are far less so, which restricts the number of species available for studying mobbing calls compared with songs (e.g. Pearse et al. 2018). Moreover, we mostly used recordings of mobbing calls available from online databases (see below for more details), in which (i) these calls are very often classified as alarm calls, and (ii) mobbing calls can be a combination of alarm calls and recruitment calls (Engesser et al. 2016; Suzuki et al. 2016; Dutour et al. 2019), making them easily confounded with other signals when the context of production is unknown. We therefore restricted our selection to sound files that we could confirm were associated with a mobbing event (for example, when notes by the recordist indicated “calls given in response to approach at nest” or “alarm call series in reaction to playback of Pygmy Owl (Glaucidium passerinum) calls”). In total, 23 species from nine oscine families met these requirements: 1 species from the Fringillidae, 1 species from the Meliphagidae, 2 species from the Muscicapidae, 8 species from the Paridae, 3 species from the Parulidae, 3 species from the Sittidae, 1 species from the Thraupidae, 2 species from the Troglodytidae, and 2 species from the Vireonidae. We used online resources (http://www.oiseaux.net/) and field guides (Handbooks of birds; Del Hoyo et al. 1999) to gather the average body mass of each species.

Acoustic recordings and analysis

For the 23 species, we collected high-quality (sampling rate of 44.1 or 48 kHz) mobbing calls and territorial songs from the Macaulay Library (Cornell Lab of Ornithology, http://macaulaylibrary.org) or the Xeno Canto online database (http://www.xeno-canto.org) (Fig. S1). Since some bird species have more than one song and/or mobbing call in their repertoire, we used “typical” songs or mobbing calls for the majority of species in our study (see, for instance, Salis et al. 2021). Most of these mobbing calls have already been used in previous mobbing studies (e.g. for American species see Abolins‐Abols et al. 2017; for European species see Randler and Vollmer 2013). All song recordings were obtained from the Xeno Canto database with search criteria specifying the type of vocalization as “song” (and also as “calls” for the Crested Tit (Lophophanes cristatus), because these two types of vocalizations are often confused in this species) and quality “A” (i.e. the highest recording quality as rated by users; although 5% of the song recordings have a “B” quality) (see Online Appendix 2 Table S1 for a full data set containing song source information). When necessary, we consulted written descriptions of songs to distinguish song from other types of vocalizations. We did not use recordings of juveniles, which might still be learning their vocalizations. Additionally, David J. Wheatcroft provided recordings of Eurasian Wren (Troglodytes troglodytes, n = 1) mobbing calls and Robert D. Magrath those of an Australian species, the Noisy Miner (Manorina melanocephala, n = 10). Lastly, we recorded mobbing calls of most European species (25 individuals in total among 7 species) in situations where focal birds responded to Pygmy Owl calls, using a Fostex FR2LE digital recorder connected to a Sennheiser ME67-K6P microphone (see Dutour et al. 2016). 
Since songs and calls can vary across populations, we used recordings made in different populations across each species' range in order to encompass the range of song/call variation of the species (Table S1). The exceptions were the mobbing calls of the Noisy Miner, which were all recorded in the same area in Australia, and the majority of the mobbing calls of the Crested Tit, the Coal Tit (Periparus ater), the Great Tit (Parus major), and the Chaffinch (Fringilla coelebs), which were recorded in the Auvergne-Rhône-Alpes region in France.

We obtained measurements of acoustic features from spectrograms in Avisoft SASLab Pro, following an established method (Dutour et al. 2017). Recordings were in WAV format. The recordings from Xeno-Canto were in mp3 format and were converted to WAV format in Goldwave. (We verified that conversion did not affect any of the acoustic measurements, see Online Appendix 3 Table S2 for details.) Our sampling method considered both individual and species signal variability. At the individual level, we measured on average 7.4 vocalizations (± 0.23 SD; range 1–10), and for each individual we selected the vocalization that best represented the average for that individual using principal component analysis (see Online Appendix 4 for more details). The number of calls/songs per individual used in this study is higher than in previous studies (e.g. Wheatcroft and Price 2015). Each species was represented by an average of 4.5 individuals for mobbing calls (± 1.7 SD; range 2–10) and 4.9 individuals for territorial songs (± 0.5 SD; range 3–5). This corresponds to a total of 216 individual recordings. For each sound recording, we measured or calculated seven acoustic features from spectrograms. Four of them are spectral features: (1) peak frequency (the frequency for which amplitude is maximum, in Hz); (2) maximum frequency (highest frequency of the call, in Hz); (3) minimum frequency (lowest frequency of the call, in Hz); (4) frequency bandwidth (difference in Hz between maximum frequency and minimum frequency, measured on a linear amplitude spectrum). The three others describe temporality and complexity: (5) vocalization duration (hereafter duration, in s); (6) pace (i.e. the vocalization duration / total number of notes); (7) note richness (see Online Appendix 1 Fig. S1). 
Maximum and minimum frequencies were identified as the frequencies at which the sound amplitude drops 20 dB below the peak amplitude (amplitude of the loudest frequency), which captures the vast majority of sound energy in songs/mobbing calls while being generally robust to interference by background noise in our recordings. To ensure that the total number of notes and the number of different notes that we found were not aberrant, we consulted written descriptions of the vocalizations for each species.
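
The measurement rules described above (peak frequency, minimum and maximum frequencies at 20 dB below the peak, bandwidth, and pace) can be illustrated with a minimal Python sketch. This is not the Avisoft SASLab Pro procedure used in the study: the function names, the toy spectrum, and the assumption that the spectrum is supplied as one amplitude value in dB per frequency bin are hypothetical and serve only as an illustration.

```python
import numpy as np

def spectral_features(freqs_hz, amplitude_db, threshold_db=20.0):
    """Peak, minimum and maximum frequency, and bandwidth from an amplitude spectrum.

    Minimum/maximum frequencies are the lowest/highest bins whose amplitude is
    within `threshold_db` of the peak amplitude (hypothetical helper).
    """
    freqs_hz = np.asarray(freqs_hz, dtype=float)
    amplitude_db = np.asarray(amplitude_db, dtype=float)
    peak_idx = int(np.argmax(amplitude_db))
    peak_db = amplitude_db[peak_idx]
    # Indices of all bins that are no more than threshold_db below the peak
    within = np.flatnonzero(amplitude_db >= peak_db - threshold_db)
    min_hz = freqs_hz[within[0]]
    max_hz = freqs_hz[within[-1]]
    return {
        "peak_hz": float(freqs_hz[peak_idx]),
        "min_hz": float(min_hz),
        "max_hz": float(max_hz),
        "bandwidth_hz": float(max_hz - min_hz),
    }

def pace(duration_s, n_notes):
    """Pace as defined in the text: vocalization duration / total number of notes."""
    return duration_s / n_notes

# Toy spectrum (made-up values): amplitude falls by 1 dB per 100 Hz away from 4000 Hz
freqs = np.arange(0, 10001, 500)
amp_db = -np.abs(freqs - 4000) / 100.0
features = spectral_features(freqs, amp_db)
# features["min_hz"] is 2000.0 and features["max_hz"] is 6000.0, so bandwidth is 4000.0
```

Under these assumptions, a vocalization lasting 2 s and containing 8 notes has a pace of `pace(2.0, 8)`, i.e. 0.25 s per note.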

Phylogenetic framework

We based our analyses on the phylogenetic tree distributions from the BirdTree database (Jetz et al. 2012; http://birdtree.org). For both the ‘Hackett’ and ‘Ericson’ backbones, we sampled 100 trees (with 9993 or 6670 Operational Taxonomic Units, OTUs each), which were pruned to generate tree distributions for all species in our dataset (except the Japanese Tit (Parus minor), which is not available in the database). Based on these distributions, we used TreeAnnotator version 2.4.7 to generate four maximum clade credibility (MCC) trees (i.e. one tree for each method), setting branch lengths equal to ‘Common Ancestor’ node heights. The four final composite trees were similar in topology, and we finally used the composite tree based on the Ericson 9993 method. Because the Great Tit is the species nearest to the Japanese Tit and the two hybridize (Paeckert et al. 2005; Kvist and Rytkönen 2006; Johansson et al. 2013), we added the Japanese Tit to the tree with a value close to that of the Great Tit to obtain the final tree (see Online Appendix 5 Fig. S3). The phylogenetic variance–covariance matrix was obtained from the transformation of the final phylogenetic tree under a Brownian motion model.
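
As an illustration of how a tree translates into a variance–covariance matrix under Brownian motion, the covariance between two species is proportional to the branch length they share between the root and their most recent common ancestor. The Python sketch below applies this rule to a hypothetical three-tip tree; the tree, the tip names and the function name `bm_vcv` are assumptions made for the example and are not taken from the study.

```python
import numpy as np

def bm_vcv(parent, length, tips):
    """Brownian-motion variance-covariance matrix of a rooted tree.

    `parent` maps each non-root node to its parent, and `length` maps each
    non-root node to the length of the branch leading to it. Entry (i, j) is
    the total branch length shared by tips i and j on their paths from the root.
    """
    def edges_to_root(node):
        edges = set()
        while node in parent:      # the root has no parent entry
            edges.add(node)        # a branch is identified by its child node
            node = parent[node]
        return edges

    paths = {tip: edges_to_root(tip) for tip in tips}
    vcv = np.zeros((len(tips), len(tips)))
    for i, a in enumerate(tips):
        for j, b in enumerate(tips):
            vcv[i, j] = sum(length[edge] for edge in paths[a] & paths[b])
    return vcv

# Hypothetical ultrametric tree ((A:1,B:1)AB:2,C:3)root
parent = {"A": "AB", "B": "AB", "AB": "root", "C": "root"}
length = {"A": 1.0, "B": 1.0, "AB": 2.0, "C": 3.0}
vcv = bm_vcv(parent, length, ["A", "B", "C"])

# Dividing by the tree height gives a standardized matrix with unit diagonal
vcv_std = vcv / vcv.diagonal().max()
```

In this toy tree, A and B share two of their three units of branch length, so their standardized covariance is 2/3, whereas C shares no branch with either and has covariance 0.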

Statistical analysis

We used bivariate phylogenetic mixed models embedded in a Bayesian framework and implemented in the MCMCglmm package (Hadfield 2010) to investigate how acoustic features diverge between the two vocalization types while controlling for species size and the phylogenetic relatedness among species. We also controlled for the effect of habitat to verify that there was no relationship between the acoustic features and habitat (Mikula et al. 2021; Friis et al. 2021). It is important to note that our sample of 23 species mostly consists of temperate species, with a limited range of habitats (only two types of habitats: closed and semi-open habitat; see Online Appendix 6 for the habitat classification). More specifically, the same acoustic feature measured for the two vocalization types was introduced as two dependent variables in the same model, and both species habitat and species size were introduced as explanatory terms. In these models, the contribution of the phylogenetic relatedness was taken into account through a separate covariance estimate of the acoustic features for both vocalization types. The two vocalization types were thus considered as two evolving traits, allowing us to estimate both their respective phylogenetic heritability and their phylogenetic correlation. All acoustic features, except the note richness, were analysed using a normal distribution for the error term. These acoustic features were also standardized using Z-scores before the analyses in order to improve model optimization, and species size was standardized to facilitate the interpretation of estimates (Schielzeth 2010). The note richness was analysed using a log-link function and a Poisson distribution for the error term. 
The effect of the phylogenetic relatedness among species was accommodated through a random effect based on the standardized phylogenetic variance–covariance matrix, and the variance decomposition method was then used to calculate the proportion of variance explained by the phylogenetic relatedness, i.e. the phylogenetic heritability (hereafter referred to as H2, equivalent to the phylogenetic signal of Pagel; de Villemereuil et al. 2016). Since each vocalization type was recorded on several subjects per species, we were also able to estimate the species-specific effect (i.e. the proportion of the between-species variance relative to the total variance, discarding the phylogenetic random effect and calculated separately for each vocalization type), measured as the adjusted intra-class correlation (\(ICC_{\text{adj}} = \frac{\sigma_{\text{spc. specific}}^{2}}{\sigma_{\text{spc. specific}}^{2} + \sigma_{\text{residual}}^{2}}\), hereafter referred to as ICC, Nakagawa et al. 2017). A larger residual variance of an acoustic feature for one vocalization type than for the other, associated with a lower ICC for the former than for the latter, therefore indicates that the acoustic feature is more versatile in the former than in the latter. However, if the ICC remains similar, this rather suggests that the range of the acoustic feature is simply larger for the former than for the latter. The phenotypic mean within each vocalization type was computed by averaging over the fixed effects, and all these estimates (H2, phenotypic mean, ICC) were computed on the observed scale using the QGglmm package (de Villemereuil et al. 2016). Finally, the methodology proposed by Nakagawa et al. (2017) was used to calculate the marginal coefficient of determination (hereafter referred to as partial R2) associated with the acoustic feature variance explained by species size.
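
The two variance ratios defined above can be sketched in a few lines of Python. This is a simplified illustration for a Gaussian trait, not the QGglmm computation used in the study: the function names and the variance values are made up, and the heritability denominator is assumed here to contain only the phylogenetic, species-specific and residual variances (i.e. ignoring any variance attributed to the fixed effects).

```python
import numpy as np

def phylo_heritability(var_phylo, var_species, var_resid):
    """Share of the (random-effect) variance attributed to phylogeny, per posterior draw."""
    var_phylo = np.asarray(var_phylo, dtype=float)
    var_species = np.asarray(var_species, dtype=float)
    var_resid = np.asarray(var_resid, dtype=float)
    return var_phylo / (var_phylo + var_species + var_resid)

def icc_adj(var_species, var_resid):
    """Adjusted ICC: between-species variance over between-species plus residual variance."""
    var_species = np.asarray(var_species, dtype=float)
    var_resid = np.asarray(var_resid, dtype=float)
    return var_species / (var_species + var_resid)

# Two made-up posterior draws of the variance components for one vocalization type
var_phylo = [0.2, 0.3]
var_species = [0.3, 0.3]
var_resid = [0.5, 0.4]

h2_draws = phylo_heritability(var_phylo, var_species, var_resid)
icc_draws = icc_adj(var_species, var_resid)
```

Applying the functions to each posterior draw, as above, yields a posterior distribution for each ratio, which can then be summarized by its mean and credible interval.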

Given the moderate sample size of our dataset (i.e. 23 species) and the low expected phylogenetic or species-specific variance relative to the residual variance (especially in the case of mobbing calls), we selected two sets of inverse Wishart priors, one weakly informative (i.e. nu = 2, V = 0.02) and the other more informative (i.e. nu = 2, V = 0.33). Sensitivity of the results to the priors was assessed using Gelman and Rubin’s convergence diagnostic (Gelman and Rubin 1992), based on the calculation of the potential scale reduction factor (hereafter referred to as psrf) between Markov chains simulated under both priors. The estimates of the explanatory terms (i.e. intercept for each vocalization type, habitat effect, species size effect) were found to be insensitive to the prior (i.e. psrf < 1.05 in the worst case), but this was not so for the phylogenetic variance parameter (i.e. psrf = 1.5 in the worst case), as expected given our moderate sample size. For each parameter, we reported the mean of the highest posterior density distribution as well as the lower and upper limits of its 95% credible interval (hereafter referred to as 95% CI) on the latent scale (see Online Appendix 7 Tables S4 and S5). Furthermore, although the Bayesian approach is particularly suitable in the case of a moderate sample size as in our study, it remains sensitive to unbalanced sample sizes. We therefore also tested whether our results remained unchanged after discarding the five species for which the number of recordings was less than three for at least one vocalization type. As could be expected, all results were similar, although the 95% credible intervals of the estimates were inflated (see Online Appendix 8 Tables S6, S7, S8 and S9).
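
For readers unfamiliar with the potential scale reduction factor, the following Python sketch computes its basic form (Gelman and Rubin 1992) for one parameter from several chains of equal length. It omits the degrees-of-freedom correction applied by some software implementations, and the function name and chain values are hypothetical, so it should be read as an illustration of the principle and not as the exact computation performed in the study.

```python
import numpy as np

def psrf(chains):
    """Basic potential scale reduction factor for one parameter.

    `chains` is a 2-D array-like with one row per chain and one column per draw.
    """
    chains = np.asarray(chains, dtype=float)
    n_draws = chains.shape[1]
    # W: mean of the within-chain variances
    within = chains.var(axis=1, ddof=1).mean()
    # B: between-chain variance, scaled by the number of draws per chain
    between = n_draws * chains.mean(axis=1).var(ddof=1)
    # Pooled estimate of the posterior variance
    pooled = (n_draws - 1) / n_draws * within + between / n_draws
    return float(np.sqrt(pooled / within))

# Made-up draws of a variance parameter under two priors
chain_weak_prior = [0.10, 0.12, 0.11, 0.13]
chain_informative_prior = [0.30, 0.32, 0.31, 0.33]

# Chains that agree give a value at or below about 1; chains that disagree give a larger one
agreeing = psrf([chain_weak_prior, chain_weak_prior])
disagreeing = psrf([chain_weak_prior, chain_informative_prior])
```

In this toy case the two priors lead to clearly separated chains, so `disagreeing` is far above the 1.05 threshold mentioned in the text, which is the situation interpreted there as sensitivity to the prior.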

Results

Divergence of the temporal features and complexity between mobbing calls and territorial songs

Overall, the variability of temporal features was clearly higher within mobbing calls than within territorial songs (as indicated by the comparison of the intraspecific residual variance between the two vocalization types for both duration and pace, see Table 1). Both the phylogenetic heritability and the species specificity were comparable between vocalization types whatever the feature, but while the phylogenetic heritability was estimated to be low to moderate (although with a large credible interval), the species specificity was relatively high, especially for the pace (Table 1). The vocalization duration was slightly shorter among mobbing calls than among territorial songs, as indicated by the weak difference between their respective posterior distributions (Table 2), and the pace was similar between the two vocalization types (Table 2). Neither acoustic feature was affected by species size or species habitat (see Online Appendix 7 Table S4).

Table 1 Phylogenetic heritability (H2 observed, equivalent to the phylogenetic signal of Pagel), intra-class correlation (ICC, species specificity), and intraspecific residual variance (\({\sigma }_{res}^{2}\), variance among individual recordings) of the three acoustic features involved in temporality and complexity for the two types of vocalizations (mobbing call and territorial song)
Table 2 Posterior distribution of the mean value of the three acoustic features involved in temporality and complexity for the two types of vocalizations (mobbing call and territorial song)

In contrast to temporal features, the note richness differed substantially between vocalization types: both the phylogenetic heritability (i.e. H2) and the species specificity (i.e. ICC) were substantially lower for mobbing calls than for territorial songs, while the residual variance was relatively low whatever the vocalization type (Table 1, Fig. 1). Moreover, the mean value of the note richness was also substantially lower for mobbing calls than for territorial songs (Table 2, Fig. 1) but was affected by neither species habitat nor species size (see Online Appendix 7 Table S4).

Fig. 1

Posterior distribution of the mean value of the note richness for the two types of vocalizations (mobbing call: dark grey; territorial song: light grey) and posterior distribution of the effect size (i.e. average difference between the two types of vocalizations). Solid lines depict mean values and dotted lines 95% credible intervals

Divergence of the spectral features between mobbing calls and territorial songs

As with temporal features, all spectral features except the minimum frequency were much more variable within mobbing calls than within territorial songs, as indicated by the comparison of the intraspecific residual variance between the two vocalization types (Table 3). The phylogenetic heritability and the species specificity were also low to moderate for these spectral features whatever the vocalization type (Table 3).

Table 3 Phylogenetic heritability (H2 observed, equivalent to the phylogenetic signal of Pagel), intra-class correlation (ICC, species specificity), and intraspecific residual variance (\({\sigma }_{res}^{2}\)) of the four spectral features for the two types of vocalizations (mobbing call and territorial song). Values are reported with 95% credible intervals [CIs]

Moreover, the frequency bandwidth was substantially larger within mobbing calls than within territorial songs (Table 4, Fig. 2). This difference was also larger for large species than for small ones, as revealed by the effect of species size according to the vocalization type (see Online Appendix 7 Table S5 and Fig. S4), even though the effect of species size remained weak, as indicated by its relative variance contribution (R2 for the effect of species size, for mobbing calls and territorial songs respectively: 0.02 and 0.14). Results were similar for the maximal frequency (Table 4, see Online Appendix 7 Table S5). The peak frequency was also slightly higher within mobbing calls than within territorial songs, but the corresponding effect size was weak (Table 4). The minimal frequency was similar between the two vocalization types (Table 4), although, like the peak frequency, it decreased with species size (see Online Appendix 7 Table S5). In contrast, none of the four spectral features was affected by species habitat, whatever the vocalization type (see Online Appendix 7 Table S5).
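The effect sizes reported here are summaries of a posterior distribution of the difference between the two vocalization types. A minimal sketch of such a summary is given below, assuming paired posterior draws of the two means and an equal-tailed credible interval; the function names and the draws are hypothetical and purely illustrative, and the interval type is an assumption rather than a description of our analysis.

```python
from statistics import mean


def percentile(sorted_values, q):
    """Quantile q (between 0 and 1) of sorted values, by linear interpolation."""
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] * (1 - fraction) + sorted_values[upper] * fraction


def summarise_effect(draws_call, draws_song, level=0.95):
    """Posterior mean and equal-tailed credible interval of (call - song)."""
    if len(draws_call) != len(draws_song) or not draws_call:
        raise ValueError("both inputs must hold the same, non-zero number of draws")
    differences = sorted(c - s for c, s in zip(draws_call, draws_song))
    tail = (1 - level) / 2
    return (
        mean(differences),
        percentile(differences, tail),
        percentile(differences, 1 - tail),
    )


# Synthetic draws, for illustration only
effect_mean, ci_low, ci_high = summarise_effect(
    [1.0, 2.0, 3.0, 4.0, 5.0], [0.0, 0.0, 0.0, 0.0, 0.0]
)
print(effect_mean, ci_low, ci_high)
```

An interval that excludes zero, as in this toy example, would correspond to a clear difference between vocalization types, whereas an interval centred near zero would correspond to a weak effect size.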

Table 4 Posterior distribution of the mean value of the four spectral features for the two types of vocalizations (mobbing call and territorial song). All acoustic variables Z-transformed, values are reported with 95% credible intervals [CIs]
Fig. 2 Posterior distribution of the mean value of the frequency bandwidth for the two types of vocalizations (mobbing call: dark grey; and territorial song: light grey) and posterior distribution of the effect size (i.e. average difference between the two types of vocalizations). Solid lines depict mean values and dotted lines 95% credible intervals

Discussion

Overall, our results show that, at the individual level, almost all acoustic features were substantially more variable within mobbing calls than within territorial songs. The phylogenetic heritability was moderate and similar between the two vocalization types for all acoustic features except note richness, for which it was substantially higher in songs than in mobbing calls. The species specificity, as revealed by the residual part of species variance not explained by phylogenetic relatedness (i.e. ICC), roughly followed the same pattern, although it was always slightly higher than the phylogenetic heritability, especially for pace, for which it was high whatever the vocalization type. Both the note richness and the frequency bandwidth, as well as, to some extent, the duration and the maximal frequency, were found to segregate vocalizations. The difference between the two vocalization types was less evident for the other acoustic features. We also detected an effect of species size on the spectral features, but for the frequency bandwidth and the maximal frequency, the strength of the size effect varied markedly according to the vocalization type. In contrast, we did not find evidence for an effect of species habitat on any acoustic feature. These results are thus in good agreement with previous studies, but they also highlight the substantial part of species specificity in vocalizations, as well as its importance in deciphering the relative evolutionary divergence between vocalizations according to their functional support. Furthermore, it is worth noting that our results remained unchanged after discarding species with few recordings (see Online Appendix 8), suggesting that our interpretation is not skewed by a reduced and unbalanced sample size of recordings between vocalization types.
Finally, it is important to note that the 23 species sampled here are mainly temperate species of one major passerine clade. In future work, it would be interesting to explore the acoustic differences between vocalization types with a larger sample of species.

Divergence of acoustic features according to vocalization type

If vocalizations are not subject to the same level of context-dependence, this should translate into a difference in the extent of variability observed at the individual level. Mobbing calls are given in an anti-predatory context and are expected to convey information that may vary according to the specific threat (Baker and Becker 2002; Templeton et al. 2005; Kalb et al. 2019). Songs, on the other hand, are produced for mate recognition and selection as well as territorial defence, and are thus expected to be more stereotyped (Collins 2004; Marler and Slabbekoorn 2004). These expectations are well supported by our results, since we found almost all acoustic features, whether temporal or spectral, to be more variable at the individual scale in mobbing calls than in songs.

However, the fact that mobbing calls are more variable than songs at the individual scale does not imply that their acoustic features strongly diverge from those of songs. Indeed, the extent to which the acoustic features diverge between these two vocalization types should depend on their relative involvement in supporting the function associated with each vocalization. In particular, one may expect the acoustic features that are involved in the functional support of both vocalizations to be more similar between vocalizations and to exhibit both a similar phylogenetic heritability and a similar species specificity. Conversely, acoustic features whose involvement differs between these two vocalization types should exhibit mean phenotypic divergence and should also be subject to a different level of phylogenetic correlation, as well as a different level of within-species variance.

As revealed by our results, pace was the sole acoustic feature to exhibit, for both vocalization types, a very similar phenotypic mean, a high and equivalent species specificity, as well as a relatively large phylogenetic heritability. Several studies have highlighted the importance of bill size, as well as bill shape, in explaining variation in the temporal features of bird songs, notably pace (Derryberry et al. 2018; Garcia and Tubaro 2018; Demery et al. 2021). Beak size is also known to exhibit a large phylogenetic signal (Gardner et al. 2016). As we did not consider beak size in our analyses, this could explain the relatively large phylogenetic signal we detected for pace. Moreover, a lower intraspecific variance than between-species variance in beak size could also explain the high species specificity of pace that we detected for both vocalizations, and therefore deserves future attention.

Both the complexity and the duration of signals are expected to be enhanced by sexual selection in songs (Catchpole and Slater 2008), and several studies have reported a correlation between the strength of sexual selection and these acoustic features (Soma and Garamszegi 2011; Robinson and Creanza 2019; but see also Crouch and Mason-Gamer 2019). In contrast, mobbing calls are not driven by sexual selection and exhibit lower duration and note richness (number of different note types: Marler 1955; Marler and Slabbekoorn 2004). Our results confirm this hypothesis, although the difference between mobbing calls and songs was less evident for duration than for note richness. Interestingly, unlike non-Paridae species, which have vastly more complex songs than mobbing calls (mean note richness for songs and mobbing calls: 6.7 and 1), the mobbing calls of Paridae are not much more complex than their songs, which are composed of flat tonal notes (mean note richness for songs and mobbing calls: 1.9 and 2). This suggests that non-Paridae species have more complex songs than mobbing calls, whereas Paridae calls and songs are similar in note richness. Recently, Friis et al. (2021), who compared acoustic features between songs and non-alarm-related calls across a larger dataset (> 500 species), reported a similar phylogenetic signal for song duration (\(\lambda =\) 0.45), suggesting a labile evolution of this acoustic feature in songs. However, our results also reveal that both the phylogenetic signal and the species specificity of duration and note richness are lower within mobbing calls than within songs. It seems, therefore, that the duration and complexity of mobbing calls exhibit a higher evolutionary lability than those of songs, which contrasts with recent findings on the comparison of songs and contact calls (Friis et al. 2021).
As outlined above, mobbing calls may vary greatly according to the threat perceived by the caller, involving either gradual signals (Templeton et al. 2005; Kalb et al. 2019) or referential ones (Suzuki 2015), both of which are related to changes in the duration and note richness of mobbing calls. Such variation in mobbing calls has been suggested to give rise to a higher evolutionary lability compared to more specialized calls (Wheatcroft and Price 2015). Moreover, mobbing calls are generally intended to reach a broader audience than just conspecifics (Hurd 1996; Dutour et al. 2016), contrary to other vocalizations such as contact calls (e.g. Friis et al. 2021). It is therefore also likely that only the important acoustic features of mobbing calls are well conserved across species (or have converged).

The efficacy of mobbing calls fully depends on their ability to favour emitter localization, since both the intensity of mobbing and its success in deterring the predator are positively related to the size of the group attracted by the caller (Krams et al. 2009). Although songs are also expected to enhance emitter location (Mathevon et al. 2008), we predicted that they are under weaker selection to do so than mobbing calls, since this property is less essential for the functional support of songs; in our view, mobbing calls are selected for a “rapid and perfectly public” location information, whereas songs are selected for “rough” location information (see introduction for details). Previous studies have documented the role of the frequency bandwidth in making the sender localizable (Marler 1955; Dooling and Searcy 1985; Aubin and Jouventin 2002). Our results are congruent with this hypothesis, since the frequency bandwidth was larger for mobbing calls than for territorial songs, indicating that this spectral feature is an essential component of mobbing calls (Marler and Slabbekoorn 2004) and is less important for songs. One may therefore suggest that this feature constitutes the acoustic solution towards which all species converged in order to enhance localization when emitting mobbing calls, as suggested by Marler (1955).

Effect of species size according to the type of vocalization

Our results reveal that larger species had a lower peak frequency and minimal frequency than smaller species, whatever the vocalization type. These results are consistent with previous research on mobbing calls (Billings 2018) and songs (Mason and Burns 2015; Mikula et al. 2021; Friis et al. 2021). The positive correlation between syrinx size and body size may explain the relationship between body size and these spectral features, since a larger syrinx can produce lower frequencies (Bowman 1979; Wallschläger 1980). Importantly, an inverse allometry was found for the frequency bandwidth and the maximal frequency, but only in territorial songs and not in mobbing calls. It is worth noting that we also found a difference in allometric strength for the two temporal features (duration and note richness) for which we detected a difference in the phenotypic mean between the two vocalization types. Thus, it appears that the acoustic features that differ most between songs and mobbing calls are also those for which the allometric strength is more pronounced for songs than for mobbing calls. As recently outlined by Friis et al. (2021), a stronger allometry of spectral features in one vocalization relative to another should indicate a stronger deterministic boundary set by body size on that vocalization. The similar allometry of peak frequency and minimal frequency that we found in both vocalization types may be explained by the fact that both mobbing calls and songs need to transmit information across long distances to attract individuals. The allometry of the frequency bandwidth and the maximal frequency, found only in territorial songs, could be explained by the function associated with each vocalization type: in mobbing calls, larger species also keep a high maximal frequency to maintain the large bandwidth required to facilitate location, which is not the case for territorial songs.
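A difference in allometric strength between vocalization types can be pictured as a difference between the slopes of an acoustic feature regressed on body size. The sketch below uses ordinary least squares on already transformed values, which is a deliberate simplification and not the phylogenetic model used in this study; the function name and all numbers are hypothetical and synthetic.

```python
def ols_slope(body_size, feature):
    """Least-squares slope of an acoustic feature on body size.

    Both inputs are assumed to be already transformed (e.g. log- or
    Z-transformed) and to hold one value per species.
    """
    if len(body_size) != len(feature) or len(body_size) < 2:
        raise ValueError("need two sequences of equal length with at least two values")
    mean_x = sum(body_size) / len(body_size)
    mean_y = sum(feature) / len(feature)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(body_size, feature))
    s_xx = sum((x - mean_x) ** 2 for x in body_size)
    if s_xx == 0:
        raise ValueError("body size must vary between species")
    return s_xy / s_xx


# Synthetic values: the feature decreases with size in songs but not in calls
size = [1.0, 2.0, 3.0]
slope_song = ols_slope(size, [5.0, 3.0, 1.0])
slope_call = ols_slope(size, [4.0, 4.0, 4.0])
print(slope_song, slope_call)
```

In this toy example the slope is steeper in absolute value for songs than for calls, which is the kind of contrast described above for the frequency bandwidth and the maximal frequency.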

Conclusion

In summary, our work suggests that the divergence of acoustic features varies according to their function (mobbing calls vs. territorial songs) in the 23 passerine species studied. We found that phylogeny explained acoustic variation in only one of the variables measured: note richness. Since our dataset was restricted to nine oscine families, the relative effect of shared ancestry for characters involved in sound production or modulation, such as syrinx morphology or beak shape (Derryberry et al. 2018), is likely reduced. Moreover, most studies have also reported an important effect of body size on several acoustic features (Wallschläger 1980; Ryan and Brenowitz 1985; Billings 2018; Derryberry et al. 2018). Our results indicate that species size influences spectral features. They also reveal that the acoustic characteristics vary differently according to the type of signal; almost all acoustic features were more variable within mobbing calls than within territorial songs. For the acoustic variables related to temporality and complexity, there is a greater versatility at the intraspecific level (revealed by the tendency for mobbing calls to have a lower ICC than territorial songs). Likewise, the spectral properties of mobbing calls showed a wider range of values than that observed for territorial songs. However, our sample sizes (for both the number of species and the number of recordings per species) are too low to confirm this trend, and larger samples are required. Furthermore, it would be particularly interesting to extend the comparative study to other types of vocalizations associated with different functions, especially flee alarm calls, since one should expect the opposite trend for the location of the sender (Marler 1955, 1957).
Finally, although our method accounted for intraspecific variation in each type of vocalization, it does not allow a proper separation of phenotypic plasticity, which would require measuring intra-individual variation. Future work will need to test the importance of the extent of variability at the individual scale when performing comparative analyses across species.