Introduction

Why endemic species exhibit restricted geographical ranges is a key question in biogeography and ecology (Olivieri et al. 2015). Endemics may occupy a limited ecological niche (Williams et al. 2009), or originate by hybridization occurring only in the area of contact between progenitor species (Grünig et al. 2021), or they may be relics from a wider past range (‘paleo-endemics’; Favarger and Contandriopoulos 1961; Petrova et al. 2015) or recent species (‘neo-endemics’, sensu Stebbins and Major 1965) yet to disperse (Behroozian et al. 2020). Alternatively, genome duplication (polyploidy), effectively representing ‘instantaneous’ sympatric speciation (Mayr 1963), may be a major driver of plant evolution (Levin 1983; Otto and Whitton 2000; Soltis et al. 2014; Van de Peer et al. 2017). Polyploidy occurs when the cell cycle includes chromosome duplication but lacks the subsequent stages of cell component separation, resulting in cells with multiple chromosome complements. This can occur during mitosis (somatic doubling) or meiosis (non-reduction during sporogenesis), either within populations of a single species (autoploidy), or subsequent to interspecific hybridization (allopolyploidy; Ramsey and Schemske 1998). Polyploidy is often accompanied by larger nuclei and cells (genome size and nuclear and cell volumes are directly related; Cavalier-Smith 2005; Hodgson et al. 2010), altering physiological and morphological traits of offspring (see Van de Peer et al. 2020) that affect fitness (e.g. a greater tendency towards vegetative reproduction; Herben et al. 2017; see Soltis and Soltis 2016). Altered phenotypes change the ecology of polyploids with respect to diploid progenitors: e.g. larger cells result in larger organs (the ‘giga effect’), producing larger flowers of different colours and scents favouring different pollinator species (see Rezende et al. 2020).
With different sets of chromosomes, polyploids are often reproductively isolated from mother plants (Husband and Sabara 2004; Laport et al. 2016; Lavania 2020). Ancient events of genome doubling are often associated with increased rates of speciation (Husband et al. 2013; Soltis and Soltis 2016; Landis et al. 2018). Polyploidy occurs frequently in plants, particularly in angiosperms (Soltis and Soltis 2016; Lavania 2013, 2020): all angiosperms show evidence of multiple polyploidy events, except for Amborella trichopoda, the sister to all other living angiosperms, which possesses only the ancestral ancient polyploidy event shared by all living flowering plants (Soltis et al. 2009; Jiao et al. 2011; Amborella Genome Project 2013; Lavania 2020).

However, despite being widely recognised as an important process for plant evolution and ecology, the extent to which polyploidy represents a general mechanism in the emergence of endemic species throughout the angiosperms has yet to be investigated.

Understanding the extent to which polyploidy is associated with endemism is complicated by the fact that ‘endemic’ is an ill-defined term. In part this is due to the historical use of the word, the original meaning being ‘a constant background presence in a particular area’ (e.g. “yellow fever is endemic to tropical Africa”), the opposite of ‘epidemic’ (i.e. spreading out of control). This meaning does not preclude the species being found elsewhere. In biology and ecology ‘endemic’ has taken on a meaning similar to ‘precinctive’, i.e. restricted to a precinct or place and found nowhere else. However, the precinct is often delimited on a case-by-case basis using artificial criteria such as geopolitical boundaries, which are variable in extent and often biologically irrelevant. The term ‘endemic’ is context dependent and applied over various scales (i.e. ‘continental endemic’, ‘local endemic’ or ‘narrow endemic’; Lavergne et al. 2004; Coelho et al. 2020), and may be considered to include the ecological requirements of the species and degree of habitat specificity (Boakes et al. 2010; Beck et al. 2014; Fithian et al. 2014). Indeed, terms such as ‘narrow endemism’ or ‘micro-endemism’ are typically qualified with information on the number of populations, degree of isolation and the genetic structure, environmental requirements and dispersal capacity of the taxon (Médail and Baumel 2018), specific information that does not exist for many species. For the purposes of the present analysis the basic phenomenon under study is that of geographical range restriction, and to avoid the problems of scale surrounding the word ‘endemic’, here we explicitly define species as endemic (precinctive or range restricted) vs. non-endemic (relatively cosmopolitan) based on a number of criteria, including geographical distribution but also recognition as ‘endemic’ or ‘sub-endemic’ in national floras, in many cases combined with a specific epithet of the Latin binomial name that suggests belonging to a particular geographical location. We do not attempt to distinguish micro-endemics from more generally range-restricted endemics.

A further complication arises because ploidy level is not always easy to define, even when the basic number of chromosomes is known. This can result from ancient autopolyploidy or allopolyploidy (Parisod et al. 2010; Lavania 2013; Zozomová-Lihová et al. 2014), sometimes followed by diploidization and occasionally chromosome number reduction (i.e. diploidized paleo-polyploids; Tamayo-Ordóñez et al. 2016; Qiao et al. 2019). Moreover, some taxa show intraspecific variability, with multiple chromosome counts (different ploidy levels) arising from relatively frequent polyploidy events (Husband et al. 2013; Vimala et al. 2021). While polyploidy multiplies sets of chromosomes (and thus has striking effects during karyotype evolution), a range of processes can subtly rearrange single chromosomes, altering chromosome number or characteristics such as DNA content. These include insertion, deletion or duplication, inversion and intra- or inter-chromosomal reciprocal translocation, particularly evident for paleo-species that have accumulated changes over time (see Schubert and Lysak 2011; Vimala et al. 2021). As polyploidy drives change in the overall set of chromosomes, but complicating processes can further alter the number of chromosomes, the relationship between the number of chromosomes and polyploidy is not necessarily straightforward. With this caveat in mind, in the present study it is assumed that entire genome multiplication in the sporophyte generation (2n) is principally affected by polyploidy, and the number of chromosomes is used as a quantitative measure to represent the net result of karyotype evolution.

Additionally, the distinction between ‘neo-endemic’ and ‘paleo-endemic’ is also ambiguous. Despite the considerable attention given to the classification of endemic species in terms of when they originated (Favarger and Contandriopoulos 1961; Stebbins and Major 1965; Maers and Giller 2013), an absolute age threshold differentiating paleo- from neo-endemics remains undefined. While species younger than 1 million years are clearly neo-endemics (Kraft et al. 2010), the issue becomes complex for less recent species. Indeed, the term paleo-endemic refers more to a process than to a particular time or period per se (i.e. endemism by restriction or fragmentation of a previously extensive range). Ferreira and Boldrini (2011), for example, addressed the problem by suggesting the combination of a dated phylogeny (i.e. estimated age and degree of systematic isolation) with environmental context (based on stratigraphy and the age of underlying rocks). Lazarina et al. (2019) considered reproductive and geographical isolation, while Mishler et al. (2014) proposed a method based on their relative phylogenetic endemism index, to distinguish centres of neo- and paleo-endemism. Unfortunately, these methods are too unwieldy to be used for a prompt distinction between neo- and paleo-endemics in large datasets.

Particular relevance has been given to apo-endemics, or polyploids diverged from diploid progenitors (Favarger and Contandriopoulos 1961). However, the origin of a polyploid and divergence in the case of sympatric speciation is not always possible to date (Doyle and Egan 2010). Indeed, dating polyploidy events and their role in creating new taxa has so far been limited to the timing of major clade emergences (Wood et al. 2009), lacking sufficient detail to compare particular species within families or genera. In the present study, rather than entering into the debate regarding what constitutes a neo- or a paleo-endemic, we refer simply to the time period elapsed since the divergence of the taxon. Thus, the absolute timescale (in millions of years) is used here as a framework, and from here on we explicitly avoid referring to arbitrary ‘neo-’ and ‘paleo-’ classes.

In summary, the comparison of estimated taxon age (Ma since divergence) against the number of chromosomes will test whether the chromosome complement is highest in recent species; information on the occurrence range of species will allow assessment of whether the phenomenon is general within the angiosperms or relatively prevalent in endemics. Based on these data, the principal objective of the present study is to assess whether polyploidization events are principal drivers of the emergence of new endemic species. Specifically, it is hypothesized that, despite a prevalence of diploid taxa throughout evolutionary time: (1) the highest chromosome counts are evident for angiosperm taxa that have diverged recently, (2) higher numbers of chromosomes are particularly evident for recent endemic (cf. non-endemic) taxa, and (3) the character of the ploidy level/divergence time relationship is consistent throughout the angiosperms, from ancient to recently diverged clades.

Materials and methods

Data mining

The relationship between genome duplication and the timing of speciation for endemic angiosperms was investigated using sporophyte chromosome count data, in the context of ‘time since divergence’ and geographical presence data. These data were collated from databases containing values from the scientific literature, and directly from the literature itself, aiming to broadly represent both endemic and non-endemic taxa across the angiosperms. The recent phylogeny of Leebens-Mack et al. (2019) was used, and the dataset specifically aimed to represent all major angiosperm clades and the largest families, starting with ANA-grade taxa (represented by Nymphaeaceae; other families in this clade are too under-represented in terms of both chromosome number and taxon age data), and including the monocots (represented by Poaceae and Orchidaceae), Magnoliids (Magnoliaceae), Ranunculales (Ranunculaceae), Caryophyllales (Caryophyllaceae), Asterids (Apiaceae, Campanulaceae, Asteraceae, Ericaceae, Primulaceae), Saxifragales (Saxifragaceae) and Core Rosids (Brassicaceae, Euphorbiaceae, Fabaceae, Rosaceae, Violaceae). Taxonomic name standardization was ensured using data from The Chromosome Counts Database (CCDB, v1.47: ccdb.tau.ac.il/browse), based on the automatic taxonomic name resolution software Taxonome (Kluyver and Osborne 2013) and The Plant List (v1.1: www.plantlist.org). Species with both ‘accepted’ and ‘unresolved’ taxonomic status were included; subspecies and varieties were discarded.

The diploid number of chromosomes for the sporophyte generation was obtained from the CCDB (last access: October 2020) and used as a quantitative proxy of ploidy level. Only one count per species was included, except when different counts were reported equally often in the database. When multiple counts were reported for a species, the modal value was retained; for species exhibiting multiple modal values, all were retained (e.g. 2n = 25, 2n = 30, 2n = 36 for Paphiopedilum victoria-mariae; Orchidaceae). Missing sporophyte values were calculated by doubling the gametophyte counts. B chromosomes were not considered. Negative values, 0 and 1 were considered errors (being biologically improbable) and discarded. When possible, the source material used for the counting was checked: usually mitotic counts were made using the root-tip squash method (Miller 1961), while meiotic counts were made from floral buds (see Windham et al. 2020). Chromosome counts for each taxon are presented in Table S1.
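The count-selection rules described above can be sketched in a few lines of Python. This is only an illustration of the stated rules, not the procedure actually used to build Table S1: the function name `retained_counts` and the input layout (two lists of reported counts per species) are hypothetical, and applying the error filter to the sporophyte-equivalent value (2n) is an assumption.

```python
from collections import Counter


def retained_counts(sporophyte, gametophyte=()):
    """Return the 2n value(s) retained for one species.

    `sporophyte` and `gametophyte` are lists of reported counts
    (hypothetical layout; the real CCDB export differs).
    """
    # Negative values, 0 and 1 are treated as errors and dropped.
    valid = [c for c in sporophyte if c > 1]
    if not valid:
        # Missing sporophyte values: double the gametophyte counts.
        # Assumption: the error filter is applied to the resulting 2n.
        valid = [2 * n for n in gametophyte if 2 * n > 1]
    if not valid:
        return []
    tally = Counter(valid)
    highest_frequency = max(tally.values())
    # Keep the modal value; if several values tie, keep all of them.
    return sorted(c for c, f in tally.items() if f == highest_frequency)


print(retained_counts([26, 26, 52]))      # single modal value
print(retained_counts([25, 30, 36]))      # three-way tie, all retained
print(retained_counts([], [11, 11, 11]))  # gametophyte counts doubled
```

The three-way tie mirrors the Paphiopedilum victoria-mariae example, where each reported count is retained as a separate record.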

The estimated taxon age was obtained from the public database TimeTree: The Timescale of Life (TTOL, www.timetree.org; last access: September 2021). Molecular dating has been applied to an increasing number of species (the largest dated phylogenetic tree for the angiosperms comprises more than 36,000 species, belonging to ~ 8400 genera, 426 families and all orders; Janssens et al. 2020), but heterogeneity in datasets, sequences, calibrations and the software used can yield different estimates for the same species, often hindering comparison between the results of different studies (Pulquério and Nichols 2007). TTOL provides a comprehensive synthesis of data published between 1987 and 2013 (3998 studies; www.timetree.org/references) for 50,632 species, of which 14,465 are angiosperms, and offers data uniformity, rapid data access and a robust foundation in the scientific literature (Hedges et al. 2006, 2015; Kumar and Hedges 2011; Kumar et al. 2017). Divergence time between taxa is estimated through a hierarchical average linkage method (Hedges et al. 2015). Note that while confidence intervals for average divergence time estimates are not reported for all taxa in the present study, among-study variance does occur due to a variety of factors, including differences in calibrations and gene and taxon sampling between studies. These interval estimates are calculated and reported by TTOL (for more details, see www.timetree.org/faqs#q2). Thus, in the present study, it is implicit that taxon age values represent estimated means based on a range of sampling methods employed across the scientific literature. A preliminary check of data included in the TTOL estimates was made from original chronograms in specific published papers, cited by TTOL. An arbitrary value of 0.001 Ma was assigned to extremely recent nodes when a specific “estimated time” was not indicated [e.g. Adenocarpus hispanicus (Fabaceae), Anemone hepatica, Anemonastrum narcissiflora (Ranunculaceae), Magnolia coco, Magnolia obovata and Magnolia officinalis (Magnoliaceae); Table S1 presents estimated divergence times for all study species].

For the purposes of the analysis, we classified species as endemic (range-restricted) on a case-by-case basis using a combination of quantitative data (geographical range) guided by qualitative information such as designation as ‘endemic’ in national and regional floras. The geographical range for each species was obtained from public databases: the Global Biodiversity Information Facility (GBIF: https://www.gbif.org/), the Plants of the World Online portal (www.plantsoftheworldonline.org), taxon-specific databases (i.e. the Global Compositae Database, GCD: www.compositae.org; the Campanula portal: www.campanula.e-taxonomy.net) or specific papers (i.e. for Campanulaceae: Kandemir 2007 for Campanula coriacea; Crowl et al. 2015 for Catopsis delicatula; for Asteraceae: Zhang et al. 2011 for the genera Soroseris, Stebbinsia and Syncalathium). The highest richness of precinctive species is found in biodiversity hotspots (Cañadas et al. 2014; Noroozi et al. 2018), which have been identified in 36 areas around the globe (Conservation International, www.conservation.org; Critical Ecosystem Partnership Fund, www.cepf.net), and range from 18,972 km² (New Caledonia) to 2,373,057 km² (Indo-Burma region). Such heterogeneity often requires the identification of smaller, higher-concentration areas within these regions (“hotspots within hotspots”) with endemism again being considered across differing scales (Cañadas et al. 2014; Noroozi et al. 2018). Based on the geographical extent of these hotspots, a range not exceeding 600,000 km² was one factor in the decision to classify a species as endemic. This threshold was chosen in order to include the remaining vegetation of the 36 Biodiversity Hotspots (see Table S2) according to Conservation International. Since the pre-industrialisation extension of some hotspots (i.e. Indo-Burma region, Brazil’s Cerrado, or Mediterranean Basin) exceeds 2,000,000 km², and also includes urban areas, only the extension of the remaining vegetation (rather than the historical extent) was considered. Classification as endemic or not was also decided by determining whether species are recognised as endemic in national or local floras (for example: New Zealand Plant Conservation Network, http://www.nzpcn.org.nz; Cellinese et al. 2009 for endemic Campanulaceae of Crete; Brochmann et al. 1997 for endemics of Cape Verde; see Table S1 for details of each case, including flora languages), and when species epithets of Latin names indicated belonging to a geographical location (e.g. Amelanchier nantucketensis).

To check whether latitude affected data availability, for both the distribution of endemic species and chromosome counts, a control analysis was performed. Data records (Table S1) were randomised, and a subset of 500 species extracted and assigned a latitudinal zonation class: ‘Tropical’ (species occurring between the Tropics of Cancer and Capricorn, i.e. 23° 27′ N and 23° 27′ S), ‘Subtropical’ (between latitude 23° 27′ and 35° in each hemisphere, following www.globalbioclimatics.org), ‘Temperate’ (between latitude 35° and 66° 33′, the polar circles) and ‘Polar/Alpine’ (species at latitudes above the polar circles, or growing at elevations above 2000 metres above sea level, m a.s.l.). Species equally spread across two or more zones were considered as ‘Cosmopolitan’. To assess the distribution of endemic species according to latitude, the proportion of endemics with respect to the total number of species in each zone was calculated.
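The zonation classes above amount to a small set of threshold rules, which can be sketched as follows. The function names, the per-record input of (latitude, elevation) pairs, the handling of values falling exactly on a boundary, and the reading of ‘equally spread’ as a tie between the most frequent zones are all assumptions made for illustration; the original assignment may have been made differently.

```python
from collections import Counter

TROPIC = 23 + 27 / 60        # 23° 27′, Tropics of Cancer and Capricorn
SUBTROPIC_LIMIT = 35.0       # upper limit of the subtropical belt
POLAR_CIRCLE = 66 + 33 / 60  # 66° 33′, polar circles
ALPINE_ELEVATION_M = 2000    # m a.s.l.


def zone(latitude_deg, elevation_m=0.0):
    """Assign one occurrence record to a latitudinal zonation class."""
    abs_lat = abs(latitude_deg)  # both hemispheres are treated alike
    if abs_lat > POLAR_CIRCLE or elevation_m > ALPINE_ELEVATION_M:
        return "Polar/Alpine"
    if abs_lat <= TROPIC:
        return "Tropical"
    if abs_lat <= SUBTROPIC_LIMIT:
        return "Subtropical"
    return "Temperate"


def species_zone(occurrences):
    """Classify a species from a non-empty list of (latitude, elevation) records."""
    if not occurrences:
        raise ValueError("at least one occurrence record is required")
    ranked = Counter(zone(lat, elev) for lat, elev in occurrences).most_common()
    # Assumption: a tie between the most frequent zones means 'Cosmopolitan'.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "Cosmopolitan"
    return ranked[0][0]


def endemic_proportion(species):
    """Proportion of endemics per zone from (zone, is_endemic) pairs."""
    totals, endemics = Counter(), Counter()
    for zone_name, is_endemic in species:
        totals[zone_name] += 1
        endemics[zone_name] += int(is_endemic)
    return {name: endemics[name] / totals[name] for name in totals}
```

For example, `species_zone([(10.0, 150), (12.5, 300), (45.0, 200)])` falls in the ‘Tropical’ class under these assumptions, because two of the three records lie between the tropics.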

Data analysis

The coverage rate of the available data for each family was calculated as a percentage ratio between the number of species included in the analysis and the total number of species (both accepted and unresolved) reported in The Plant List. Three separate analyses were performed both on the totality of data collected (referred to as ‘angiosperms’) and on subsets for single families, further subdivided into analysis of all species (endemics and non-endemics), and endemics and non-endemics treated separately. For Nymphaeaceae, only one analysis was performed due to lack of data on endemic species, while for Magnoliaceae, Rosaceae, Saxifragaceae and Violaceae, analysis of endemic species was not performed, due to insufficient data (10 spp. or fewer). Analyses were performed using the statistical software R (v3.5.1; R Core Team 2018). Data were plotted according to the estimated time since divergence (x axis) and the number of chromosomes (y axis), using the ggplot2 package (Wickham et al. 2020).

To investigate the maximum number of chromosomes exhibited by taxa over geological time (time since taxon divergence), an upper boundary regression was applied, fitting the regression curve only to the highest values of the dataset. Boundary functions are widely used in ecology to highlight the maximum effects of processes, otherwise obscured by the weight of mean values (Pierce 2014, and references therein). To remove the effects of redundant chromosome counts within each family, age values along the x axis were divided into periods (‘bins’) of 1 million years, and regression fitted to the five highest y values within each bin. The function applied was an exponential decay with the formula: \(y=A{\mathrm{e}}^{\left(-kx\right)}-c\), where y is the sporophyte number of chromosomes, A is the initial quantity or y intercept (estimated y value for x = 0), k is the decay constant, x is the estimated time since divergence and c is the lowest y value for each family. The c parameter was introduced to obtain a horizontal asymptote equal to the lower chromosome count and avoid curves tending to zero, as zero chromosomes would be biologically unrealistic.
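The binning and fitting steps can be illustrated with a minimal Python sketch (the analysis itself was run in R, and the code below is not the authors' script). Two points are assumptions: the offset is added, y = A·exp(−kx) + c, so that the curve tends to c at large x as the stated asymptote requires, whereas the formula above is printed with the opposite sign; and, with c held fixed at the lowest count, the curve is fitted by ordinary least squares on ln(y − c), a simpler stand-in for nonlinear regression. The function names are hypothetical.

```python
import math
from collections import defaultdict


def upper_boundary_points(points, bin_width=1.0, top_n=5):
    """Keep the top_n highest chromosome counts within each age bin.

    `points` is a list of (age_Ma, chromosome_count) pairs.
    """
    bins = defaultdict(list)
    for age, count in points:
        bins[math.floor(age / bin_width)].append((age, count))
    selected = []
    for key in sorted(bins):
        highest_first = sorted(bins[key], key=lambda p: p[1], reverse=True)
        selected.extend(highest_first[:top_n])
    return selected


def fit_decay(points, c):
    """Estimate A and k in y = A*exp(-k*x) + c, with c fixed.

    Uses a log-linear least-squares fit; points with y <= c are skipped
    because ln(y - c) is undefined for them.
    """
    usable = [(x, math.log(y - c)) for x, y in points if y > c]
    n = len(usable)
    mean_x = sum(x for x, _ in usable) / n
    mean_z = sum(z for _, z in usable) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in usable)
    sxz = sum((x - mean_x) * (z - mean_z) for x, z in usable)
    slope = sxz / sxx
    intercept = mean_z - slope * mean_x
    return math.exp(intercept), -slope  # (A, k)
```

A typical call would first reduce a family's records with `upper_boundary_points(records)` and then pass the result to `fit_decay(selected, c=min(count for _, count in records))`. A log-linear fit weights points differently from a nonlinear fit, so parameter estimates would not be expected to match the reported values exactly.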

Finally, the percentage ratio between the number of polyploids (sensu Wood et al. 2009) and the total number of species for each family included in the analysis was calculated to test whether taxonomic groups are differentially predisposed to polyploidy, in terms of both formation and establishment. Data are available in Microsoft Excel spreadsheet format (Table S1).

Results

Analyses were performed on a total of 4530 records, representing 4210 species, 1344 (31.9%) of which were classified as endemic according to the combination of criteria used. For each family, the coverage rate of collected data generally did not exceed 4% of known species (Table S3), with the exception of Ranunculaceae (5.8%), Apiaceae (6.2%), Magnoliaceae (9.1%) and Campanulaceae (12.6%). According to the estimated crown age (Table S3), the oldest families (> 100 Ma) are Rosaceae (106 Ma, 95% CI 41–161 Ma), Orchidaceae (105 Ma, 95% CI 97–113 Ma) and Magnoliaceae (104 Ma, 95% CI 95–113 Ma), while Brassicaceae (46 Ma, 95% CI 19–71 Ma), Caryophyllaceae (52 Ma, 95% CI 39–62 Ma) and Apiaceae (54 Ma, 95% CI 29–57 Ma) are the most recent. However, uncertainty related to estimated crown age was sometimes substantial: in Rosaceae, the extreme case, the confidence interval differed by 120 million years. Fabaceae (71–80 Ma), Violaceae (67–77 Ma) and Ranunculaceae (75–88 Ma) exhibited the least variable estimates.

Recurrent numbers of chromosomes were evident in all families, which sometimes represent a high proportion of data, for example, 2n = 22 for Apiaceae (58%); 2n = 34, Campanulaceae (38.8%); 2n = 38, Magnoliaceae (67.7%); 2n = 32, Ranunculaceae (54.9%); and 2n = 48, Violaceae (39%). Model parameters (k = decay constant; A = y intercept) and statistics (R2adj and p-value) were extracted during model production in the R environment. Hereafter, the adjusted R2 values for each general analysis [total species (T) representing endemics (E) plus non-endemics (N)] are indicated by R2adjT, while for each analysis performed on endemics and non-endemics separately variance is indicated by R2adjE, and R2adjN, respectively. Upper boundary regression applied to the entire dataset (Fig. 1) showed a significant exponential decay trend (R2adjT = 0.48; R2adjE = 0.49; R2adjN = 0.44; p always < 0.0001) between the number of chromosomes and the estimated time since divergence, which was three times steeper in endemic species (k = 0.12) with respect to non-endemics (k = 0.04). The estimated y intercept was higher in endemics compared to non-endemics (A = 164 cf. 111, respectively; Fig. 1).
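As a rough aid to reading the decay constants, the age interval over which the exponential term of the upper boundary halves is ln 2 / k. The values below are simple arithmetic on the reported k values, not results from the analysis, and they ignore the offset c:

```latex
\[
x_{1/2} = \frac{\ln 2}{k}, \qquad
x_{1/2}^{\text{endemic}} = \frac{0.693}{0.12} \approx 5.8\ \text{Ma}, \qquad
x_{1/2}^{\text{non-endemic}} = \frac{0.693}{0.04} \approx 17.3\ \text{Ma}
\]
```

On this reading, the exponential component of the boundary falls by half roughly three times faster with taxon age in endemics than in non-endemics.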

Fig. 1

The relationship between time since taxon divergence and number of chromosomes (as a proxy for ploidy level) across all major clades of the angiosperms, applied to: a the entire dataset, b endemic species and c non-endemic species. Squares represent endemic species, and circles represent non-endemic species. Broken lines represent ‘upper boundary regressions’, or 3-parameter Lorentzian regressions fitted to the five highest values in each ‘bin’ or 1 million year interval (bin value data points are filled in dark grey; points under the upper boundary curve are unfilled). Dotted lines represent the ± 95% confidence interval.

In analyses performed on single families, results were varied but common patterns were evident. The overall negative relationship between the number of chromosomes and taxon age was confirmed in the majority of the families, with differing degrees of significance: Campanulaceae (R2adjT = 0.42; R2adjE = 0.22; R2adjN = 0.31, p < 0.0001; Fig. 2a–c), Asteraceae (R2adjT = 0.46; R2adjE = 0.45; R2adjN = 0.23; p < 0.0001; Fig. 2d–f), Fabaceae (Fig. 2g–i), Poaceae (Fig. 2j–l), Caryophyllaceae (Fig. S1a–c), Ranunculaceae (Fig. S1g–i) and Rosaceae (Fig. S1d–f). However, similar exponential decay patterns were only evident as non-significant trends for endemic Primulaceae (Fig. S1k; R2adjE = 0.065; p = 0.116). The relationship was stronger for endemics than for non-endemics or the entire family, as confirmed by the higher values of k (e.g. Asteraceae, k = 0.15 for endemics vs. 0.04 for non-endemics and 0.11 for the entire family; Caryophyllaceae, k = 0.23 vs. 0.15 and 0.16, respectively). The negative trend was of borderline significance for endemic Apiaceae (R2adjE = 0.06; Fig. S2b), not significant for endemic Orchidaceae, in contrast to the family as a whole and non-endemics (R2adjT = 0.19; R2adjE = −0.02; R2adjN = 0.14; p = 0.51 in endemics and < 0.001 for the family and non-endemics; Fig. S2d–f), and non-significant for Violaceae (R2adjT = 0.05; R2adjN = 0.03; p always > 0.1; Fig. S2g–i) and Magnoliaceae (negative values for R2adjT and R2adjN; p always > 0.6; Fig. S2j–l), although data were lacking for endemics of these latter two families (Fig. S2h, k).

Fig. 2

The relationship between time since taxon divergence and number of chromosomes for examples of angiosperm families exhibiting a declining upper boundary relationship: a–c Campanulaceae, all spp., endemic spp. and non-endemic spp., respectively, d–f Asteraceae, g–i Fabaceae and j–l Poaceae. Squares represent endemic species, and circles represent non-endemic species. Broken lines represent ‘upper boundary regressions’, or 3-parameter Lorentzian regressions fitted to the five highest values in each ‘bin’ or 1 million year interval (bin value data points are filled in dark grey; points under the upper boundary curve are unfilled). Dotted lines represent the ± 95% CI. Note that x and y data ranges are different for each family.

For Ericaceae, regressions were not significant (R2adj always < 0.01; p always > 0.1; Fig. 3). In contrast, Brassicaceae (Fig. 4) and Euphorbiaceae (Fig. S3) exhibited statistically significant negative slopes for non-endemics and the families as a whole, while regressions for endemics were not significant (negative R2adj, p ~ 0.7) and tended to increase with taxon age (k = −0.03 and −0.009, respectively). Finally, Nymphaeaceae (Fig. 5) showed a significant positive relationship (i.e. a negative decay constant), with k = −0.04, R2adjT = 0.29 and p = 0.03. A similar, but non-significant, tendency was shown in Saxifragaceae (k = −0.02 for the family and k = −0.01 for non-endemics, Fig. S4a–c), with negative R2adj values and p always > 0.3.

Fig. 3

Non-significant trends in the family of Ericaceae, with a positive slope in a the analysis of the whole family and c non-endemic species only, contrasted with a negative slope for b endemics. Squares represent endemic species, and circles represent non-endemic species. Broken lines represent ‘upper boundary regressions’ fitted to the five highest values in each ‘bin’ (points filled in dark grey; points under the upper boundary curve are unfilled). Dotted lines ± 95% CI

Fig. 4

Significant negative relationship between the number of chromosomes and estimated taxon age in a the entire family of Brassicaceae, c non-endemic Brassicaceae and b non-significant positive trend in endemic Brassicaceae. Squares represent endemic species, and circles represent non-endemic species. Broken lines represent ‘upper boundary regressions’ fitted to the five highest values in each ‘bin’ (points filled in dark grey; points under the upper boundary curve are unfilled). Dotted lines ± 95% CI

Fig. 5

The weakly significant positive trend between taxon age and chromosome number in Nymphaeaceae. Broken lines represent ‘upper boundary regressions’ fitted to the five highest values in each ‘bin’ (points filled in dark grey; points under the upper boundary curve are unfilled). Dotted lines ± 95% CI

The proportion of polyploid taxa within each family was found to differ substantially between families (Fig. S5): polyploidy was evident for the majority of taxa in Violaceae (92%), Primulaceae (76%) and Campanulaceae (68%), while it occurred in less than 10% of species of the Fabaceae (8%) and Orchidaceae (6%). Note that there was no statistically significant geographical bias (in terms of bioclimatic zones) for either the number of chromosomes (Fig. S6) or the proportion of the flora that were endemic species (Fig. S7).

Discussion

We demonstrated a negative exponential relationship between maximum number of chromosomes and time since divergence with a decay constant three times greater for endemic angiosperms with respect to non-endemics (k = 0.12 cf. 0.04, respectively; Fig. 1). Moreover, the estimated y intercept was much higher for endemics with respect to non-endemics (A = 164 cf. 111, respectively), indicating that recently diverged species with higher numbers of chromosomes are more likely to be range restricted (endemic). Indeed, the results broadly support the hypotheses that polyploidy is particularly evident in recently diverged angiosperms (Hypothesis 1), especially for recent endemic species (Hypothesis 2). While this phenomenon was evident for most of the families investigated, it was not always observed, and the hypothesis of a mechanism working consistently across the angiosperms (Hypothesis 3) was only partially supported. The distribution of chromosome counts with time (i.e. towards recent ages) highlighted the pattern of progressive multiplication of the chromosome set, with high concentrations of records corresponding to diploid, tetraploid and hexaploid counts (2n = 2x, 4x and 6x, respectively): this is particularly evident in Apiaceae, Caryophyllaceae, Ranunculaceae and Rosaceae. Thus, polyploidization appears to be an important mechanism for the emergence of new species generally.

Contrasting results for different families suggest that phylogenetic effects operate within each family (revealed by the direction, range and variability of patterns; Fig. S5). For Ericaceae, the regression analyses were not significant: it is likely that other prevailing mechanisms, such as adaptive radiation, drive the emergence of new ericaceous species. Despite showing an overall negative tendency between the number of chromosomes and taxon age, a similar interpretation could explain the weak significance for Orchidaceae, also indicated by the highly variable 95% confidence intervals.

In Saxifragaceae, Magnoliaceae, Nymphaeaceae and Violaceae, results were probably affected by limited data availability. This could also explain the lower significance of the analyses for endemics and non-endemics with respect to the family-level analysis in Primulaceae, and the atypical trend in endemic Euphorbiaceae and Brassicaceae, characterised by wide and irregular 95% confidence intervals. In particular, the estimated age for endemic Brassicaceae did not exceed 9 Ma, with only seven species older than 3 Ma (i.e. Cochlearia aragonensis, Draba hederifolia, Streptanthus glandulosus, Vella asperum, Vella bourgaeana, Vella pseudocytisus and Vella spinosa). In these situations, the nature of the data requires careful consideration: four of these seven species belong to the same genus, Vella, are endemic to Spain and were dated from a single study (i.e. Simón-Porcar et al. 2015); independence of observations could not be assured, since V. asperum, V. bourgaeana and V. pseudocytisus are closely related (see also Siljak-Yakovlev and Peruzzi 2012). This close relationship is likely to affect species distribution: these species are also found in the same habitat (disturbed xerophytic shrublands on gypsum substrate; Gómez Campo 1993). With regard to karyotypes, the three species share the same basic chromosome number (x = 17), but while V. bourgaeana is diploid (2n = 2x = 34), V. pseudocytisus is mainly tetraploid (2n = 4x = 68), and V. asperum is hexaploid (2n = 6x = 102) (Simón-Porcar et al. 2015), further confirming the role of polyploidy in speciation events. For these reasons, detailed here for a single prominent case, all analyses based on a restricted number of records should be interpreted with caution.

Additionally, weaker regressions or positive trends were generally evident for older families: Orchidaceae (105 Ma), Magnoliaceae (104 Ma), Nymphaeaceae (89 Ma), Saxifragaceae (88 Ma) and Ericaceae (82 Ma; Table S3). This suggests that for certain ancient clades polyploidization may not be the main driver of speciation, although Rosaceae, the oldest clade (106 Ma), agreed with the hypothesis of higher polyploidy occurrence in recently diverged species. Notably, some of these ancient families (i.e. Nymphaeaceae, Saxifragaceae, Rosaceae) showed high percentages of polyploid taxa (sensu Wood et al. 2009; higher than 45%; Fig. S5). In Saxifragaceae, a high number of chromosomes in older taxa combined with the lack of a relationship over time could indicate an initial burst of speciation via polyploidy followed by a lesser involvement of polyploidy in speciation. Therefore, polyploidy could be relatively widespread even across ancient families without necessarily being the main driver of speciation in them.

Why recent polyploids are relegated to limited ranges is not immediately evident from our dataset. Polyploids are often adaptable species able to survive in harsh environmental contexts, and are thus at an advantage when colonizing new habitats (Flovik 1940; Brochmann et al. 2004; Manzaneda et al. 2012; te Beest et al. 2012; Mas de Xaxars et al. 2016; Paule et al. 2018; Stevens et al. 2020). Intriguingly, alteration of phenotype by polyploidy suggests that plant functioning and fitness may be fundamentally changed. However, a preliminary classification of the species in our dataset according to Grime’s CSR ecological strategies (method of Pierce et al. 2017) showed that no particular strategy class was associated with polyploidy: species adapted to survive competition, stress or disturbance all exhibited an extremely wide range of chromosome numbers (data not shown). Rather than reflecting fitness and adaptation, the high incidence of endemic polyploids could depend mainly on the limited time available for dispersal and colonization of new areas, as is evident for certain species (Behroozian et al. 2020). Polyploid plant species show a greater tendency to rely on vegetative reproduction (Herben et al. 2017), which is usually ineffective for wide or rapid dispersal (Winkler and Fischer 2001; Herben et al. 2016). Indeed, the rearrangements of the chromosome set and the encumbrance created by multiple chromosomes in polyploid cells hinder meiosis, imposing disadvantages for sexual reproduction (Herben et al. 2017). Another polyploid trait affecting dispersal, related to the ‘giga effect’, is the production of larger and heavier seeds (Stevens et al. 2020), which is of particular importance to wind-dispersed species. Moreover, some studies (e.g. Corneillie et al. 2019; Mo et al. 2020) have shown slower growth rates for polyploids compared to diploids. Together, these traits may limit dispersal, colonization and seedling recruitment processes for polyploids.

The present study has a number of limitations that should be considered in interpreting these results. For example, the available data cannot indicate the extent to which anagenesis (speciation via evolutionary change directly within a single lineage) occurs, and speciation by cladogenesis is assumed. Increased availability of data regarding phylogenies (inclusion of greater numbers of species), tropical species (which are less well studied; Prance 1977; Prance and Campbell 1988; Sosef et al. 2017), species molecular dating, chromosome counts and distributions will allow this analysis to be refined and tested further. This particularly applies to families with less complete records (e.g. Ericaceae, Magnoliaceae, Nymphaeaceae), in order to determine whether contrasting results for these families are indicative of truly different patterns or are data artefacts.

Conclusion

Chromosome duplication is more prevalent in recent angiosperms, in particular young endemics, confirming the role of polyploidy as a key driver of recent endemism throughout the flowering plants. This pattern was generally evident across flowering plant families, but some cases in which patterns were lacking may reflect insufficient data availability. Dispersal limitation is more likely for polyploid taxa, potentially explaining why recently evolved polyploids tend to remain relegated to small ranges compared to diploids. However, the majority of young species (both endemics and non-endemics) are diploid, and thus polyploidy is not an exclusive driver of endemism.