Introduction

Studying evolutionary and genetic variation in crop species is a significant research avenue for discovering novel attributes that may increase worldwide food and nutrient requirements (Ramirez-Villegas et al. 2022). Yet, climate change jeopardizes global crop production, potentially worsening malnutrition, poverty, and sustainable development, especially in developing countries (Peng et al. 2020). In this context, avocado (Persea americana Mill., Lauraceae) arises as a highly nutritious fruit tree crop that is becoming a top commodity worldwide (Sommaruga and Eldridge 2021). However, current avocado plantations rely extensively on the monoclonal propagation of a few commercial varieties, either as rootstocks or scions. In the long term, these low-diversity plantations may prove unsustainable due to the lack of a variable genetic pool capable of responding to biotic and abiotic stresses in the face of climate change (Ingvarsson and Dahlberg 2019). Thus, avocado pre-breeding efforts require exploring the centers of diversity to find new sources of genotypes and alleles capable of enriching the genetic bases of key traits, including the adaptative potential to different agroecosystems (McCouch 2004).

The genus Persea Mill. evolved as two different subgenera, Persea and Eriodaphne, each containing 12 and 11 species, respectively (van der Werff 2002; Pironon et al. 2020). The Persea subgenus, to which P. americana belongs, is highly diverse in Mesoamerica and South America. Therefore, these regions are considered diverse hotspots for avocados and tree species from the Lauraceae family (Pironon et al. 2020). The avocado tree has been dated to at least 9000 YBP (Years Before the Present) and was first used and selected in the Tehuacan Valley in the State of Puebla in Mexico at around 8000 YBP (Smith 1966, 1969). Based on this evidence, several authors proposed this place as the avocado’s center of origin (Arumuganathan and Earle 1991; Galindo-Tovar et al. 2007; Alcaraz and Hormaza 2007; Piperno 2011; Calderón-Vázquez et al. 2013). Interestingly, the oldest archaeobotanical record for this species in Colombia is from the Calima Region in the middle Cauca Valley, and was also dated within the same geological period, 7830 ± 140 YBP. This opens the question of where the Colombian avocado originated and how it dispersed to the region (Piperno 2011). In pre-Columbian times, the obligate route for exchanging crops and wild relatives between Mesoamerica and South America was through northwest South America, either by foot through the Darien Gap and the Isthmus of Panama or by canoes across the Gulf of Morrosquillo and the Lesser Antilles (Larranaga et al. 2021). Thus, when the Spanish arrived in America during the Conquest, the distribution of avocados already ranged from Mesoamerica to Ecuador and Peru, and its dispersion was further reinforced by the establishment of human populations following antique pre-Columbian commercial routes (Bergh and Ellstrand 1986; Wolters 1999; Galindo-Tovar et al. 2008). Despite this compelling archeological evidence, the study of the genomic diversity of this species in the northwest of South America, including Colombia, remains in its infancy (Chen et al. 2009; Galindo-Tovar and Arzate-Fernández 2010).

The avocado species currently comprises three horticultural groups (also known as races) that differ in origin, genetic diversity, and horticultural characteristics. These groups are classified as Guatemalan (P. americana var. guatemalensis (L.) Wms.), Mexican (P. americana var. drymifolia (Schlecht. et Cham.) Blake) and West Indian (P. americana var. americana Mill.) (Rendón-Anaya et al. 2019). The Guatemalan group originated in the mid-altitude highlands of Guatemala and is characterized by having small fruits and late fruit maturity. The Mexican group originated in the mid-altitude highlands of Mexico and is known for its early fruit maturity and cold tolerance. In contrast, the West Indian group originated in southern Mexico and Central America’s lowlands and has larger fruits with low oil content (Rendón-Anaya et al. 2019). Some reports prefer referring to this last group as “lowland” (Galindo-Tovar and Arzate-Fernández 2010; Solares et al. 2023), yet we stick to the more conventional naming of West Indian. The renaming partly obeys the fact that this group has open questions about its origin and dispersion routes (Galindo-Tovar and Arzate-Fernández 2010). For instance, Spanish chroniclers described avocado trees with West Indian characteristics in northern South America, including Colombia, Ecuador, and Peru, both in mountainous regions as well as on the Pacific Coast and the Amazon basin (Galindo-Tovar and Arzate-Fernández 2010).

Several genetic studies of avocados have supported the three major recognized groups and their evolutionary origin using molecular genetic markers such as ESTs (Expressed Sequence Tags), SSRs (Single Sequence Repeats), and SNPs (Single-Nucleotide Polymorphism) (Gross-German and Viruel 2013; Rubinstein et al. 2019; Ge et al. 2019a; Talavera et al. 2019). In these studies, the Mexican and Guatemalan groups appear more closely related to each other than the lowland West Indian group. Furthermore, several avocado germplasm banks have characterized their accessions at the genetic level to identify the ancestry of the conserved genotypes. These include the Venezuelan gene bank-INIA-CENIAP (Ferrer-Pereira et al. 2017), the US National gene bank repository (SHRS ARS USDA) in Miami (Boza et al. 2018), and the Spanish gene bank (Cañas-Gutiérrez et al. 2019). More recently, the first genome assembly of an avocado cultivar (Hass) provided a reference sequence of 980 Megabases-Mb (Rendón-Anaya et al. 2019). This genomic information reinforced the substructure of the major avocado groups and provided evidence for the hybrid origin of the commercially famous Mexican × Guatemalan var. Hass (Rendón-Anaya et al. 2019). However, a recent analysis suggests that the Hass variety has a complete ancestry to the Guatemalan group (Solares et al. 2023). The latter report also determined the genome sequence of another avocado cultivar, the Gwen variety (Solares et al. 2023), the first to be successfully assembled to a chromosome level.

Although the growing genomic resources of avocados promise to aid the conservation, discovery, and breeding of new commercial varieties, avocado germplasm from the South American tropics is still poorly characterized (Cañas-Gutiérrez et al. 2022). Furthermore, local accessions and seedling rootstocks in the region lack proper group classification, obscuring the dawn of the nursery plant material, and jeopardizing its optimum resilient and sustainable deployment into already complex geographies. Eventually, nurseries could apply early marker screening at seedling saplings before grafting for racial ancestry, genetic value, disease resistance, and against detrimental alleles (Reyes-Herrera et al. 2020). Still, until then, the lack of genetic traceability also challenges fruit yield and quality and, therefore, overall profitability. Although native avocado production in Colombia is mainly for internal consumption of fresh and processed fruit, the northern Andes also presents favorable agro-climatic conditions (from 1600 to 2200 m above sea level–masl) for producing and exporting avocado cv. Hass. The international avocado market is particularly profitable at the end of the year when worldwide demand is high for guacamoles during Thanksgiving, Christmas, and New Year’s Eve festivities, as well as during the Super Bowl in early February, yet the offer is unsatisfied because of a valley in the harvest at producing countries but Colombia (Rios Castaño and Tafur Reyes 2003). Recent exports to Europe, Japan, the USA, and other countries totaling 146 million (M) USD prove this point and reinforce Colombia’s potential as a supplier of avocado markets abroad (Navarro-Villa 2022). Colombian production ranks second with 876.7 thousand tons (FAO 2022), and Mexico remains the top producer and consumer (Galindo-Tovar et al. 2011; FAO 2022).

Despite the avocado’s undeniable nutritional and commercial value, their plantations are becoming more vulnerable to diseases and long-term climate change effects due to a restricted number of clones used as commercial rootstocks and scions. Therefore, this study aimed to leverage local avocado gene pools from northwest South America by combining the reported SNP allelic diversity of classical avocado groups with the reduced representation genotyping of native accessions and varieties from the Colombian avocado germplasm bank, as well as different seedling rootstocks from commercial orchards in the northwestern Andes of Colombia, a significant producer hub of cv. Hass for exportation. Specifically, we focused on the following research questions: (1) Does the Colombian avocado germplasm belong to the recognized genetic groups from Mesoamerica, or does it correspond to a new gene pool of P. americana? (2) What is the genetic group identity of the rootstocks used at commercial orchards in the Colombian northwest Andes? and (3) What is the best evolutionary scenario to explain the genetic origin of the Colombian avocado germplasm based on plausible dispersal routes from Mesoamerica? Exploring the questions above would inform whether Colombian avocado resources provide novel evolutionary and allelic diversity beyond the classical groups from Mexico and Central America.

Material and methods

Plant material and reference genomic sequences

This study analyzed the genetic diversity of 456 avocado samples (P. americana) from two sources: the avocado collection of the Colombian Germplasm Bank (CGB) (n = 144) and Seedling Rootstocks (SR) (n = 240) of commercial orchards from the northwest Andes of Colombia, a main avocado-producing region (Table S1). Additionally, sequences of genomic resources of 71 individuals from the study of Talavera et al. (2019) representing the three recognized avocado groups (Guatemalan, GU; Mexican, ME; and West Indian, WI) and hybrids among these groups were included as reference materials (RM). Finally, the genome sequence of P. schiedeana was used as an outgroup (O). The latter is a sample obtained from the high-altitude germplasm bank of “Fundación Salvador Sánchez Colín” (CICTAMEX, S.C) located at La Cruz Experimental Center in Coatepec Harinas, Mexico (Rendón-Anaya et al. 2019).

The CGB is managed by AGROSAVIA (Colombian Agricultural Research Corporation) at the Palmira research station in the province of Valle del Cauca (3°32′22.0″N 76°18′13.0″W, 1000 masl). Of the 144 conserved accessions, 16 (“Bacon”, “Booth_5”, “Booth_7”, “Booth_8”, “Choquette”, “Duke_7”, “Edranol”, “Fuerte”, “Hass”, “Lula”, “Pollock”, “Reed”, “Simmonds”, “Topa Topa”, “Waldin”, and “Zutano”) are commercial varieties or cultivars (Commercial Materials, CM), and 128 correspond to Colombian native and criollo trees (native × introduced hybrids) (Fig. 1). Some of these accessions were sampled from contrasting agroclimatic regions, particularly areas with high-humidity soils, where it is probable to find tolerant genotypes to Phytophthora cinnamomi (Rodríguez-Henao et al. 2017), a major threat in avocado plantations. Following the International Bioversity manual for avocados, this germplasm has been characterized by juvenile morphological and botanical traits (IPGRI 1995). The SR had 240 seedling rootstocks from eight commercial orchards in the Antioquia province, Colombia’s primary avocado-producing region (Table S1). This germplasm was previously screened using SSR markers (Cañas-Gutierrez et al. 2019; Reyes-Herrera et al. 2020).

Fig. 1
figure 1

Geographic location of the avocado genotypes conserved in the CGB with passport data. The color of the samples represents the assignation of these samples to one specific genetic cluster (Colombian Andean and Colombian Caribbean, West Indian-WI, or hybrids/admixtures) based on the following ancestry analysis results

Molecular genetic characterization of CGB and SR materials

DNA isolation

We extracted genomic DNA from foliar tissues of 144 accessions conserved in the CGB and from root tissues of 240 seedling rootstocks (SR) using the DNeasy Plant Mini Kit (QIAGEN, Germany) with the following modifications: first, we added 450 µL of the AP1 solution to an equal volume of 20% SDS (sodium dodecyl sulfate), and we incubated the samples for 30 min at 65 °C and froze for 60 min at − 20 °C. DNA quality was verified by electrophoresis in 1% agarose gels dyed with Sybr Safe (Invitrogen), and the DNA concentrations were measured in a Qubit® 2.0 fluorometer (Life Technologies).

Genomic libraries construction and sequencing

Each avocado DNA sample was double digested using PstI (CTGCA-37 °C) and ApeKI (G/CWGC-75 °C) enzymes, following the recommendations of New England BioLabs. Genomic libraries of 300 base pairs (bp) were constructed from the digested DNA using the NEBNext Ultra DNA kit for Illumina (E7103L). We used the 96 NEBNext Multiplex Oligos kit indexes for Illumina (E6609L) to identify the sequences corresponding to each sample. Each genomic library was quantified through fluorescence in a Qubit 2.0, and the average size of the fragments of some random samples was determined by digital electrophoresis on a Tape Station 4200 (Agilent). Finally, the genomic libraries were diluted to a concentration of 10 nM and pooled in groups of 96 samples. The pooled samples were sequenced using a paired-end strategy with 150 sequencing cycles in Illumina HiSeq X equipment (Macrogen, Inc. Korea). The demultiplexing of sequences for each genotype was implemented using the bcl2fastq Illumina software (Illumina 2022). Raw data sequences were deposited in the Sequence Read Archive (SRA) of the National Center for Biotechnology Information (NCBI) database under the Bioproject number PRJNA878519.

SNPs identification for the CGB, SR, RM, and O samples

To discover SNPs among all avocado samples, we made the SNP calling using the pipeline reported by Osorio-Guarín et al. (2020). In summary, for all sequence files of each analyzed sample (downloaded and generated in this study), we used the FastQC software (Andrews 2010) to verify the quality of the Fastq files. The adapter’s primers and sequences with a quality of less than Q30, and a length of less than 50 bp were filtered using Trim-Galore software (Krueger 2012). The Burrows-Wheeler Aligner (BWA) software (Li and Durbin 2009) was used to align the sequences against the P. americana var. Hass cultivar reference genome version 2 (~ 913 Mb) (Rendón-Anaya et al. 2019). We used Picard (BROAD Institute 2022b) and the Genome Analysis ToolKit-GATK (BROAD Institute 2022a) software to remove PCR duplicates, correct mapping quality assignment, and realign and recalibrate the reads. The SNP calling was done using the Unifiedgenotyper option of the GATK algorithm. Finally, we used the VCFtools software (Danecek et al. 2011) for maintaining the SNP markers with a Minimum Allele Frequency (MAF) of 5%, a minimum of 2X of sequencing depth (SD) per position, keeping biallelic SNPs and excluding InDels (insertion-deletion). Samples and SNPs with percentages higher than 20% (genotype level) and 5% (marker level) of missing data were excluded.

Because the nucleotide sequences were gathered from different tissues (leaves and roots) as well as from published genomic resources (Rendón-Anaya et al. 2019; Talavera et al. 2019), we differentially implemented a final filter for polymorphic SNPs discovery to get three goal-oriented datasets with contrasting SNP density (Table 1). Summary statistics, such as Mean Depth of Sequencing (MDS), nucleotide diversity (Pi), transition/transversion ratio (Ts/Tv), MAF, and Polymorphism Information Content (PIC), were determined in dataset 1 using VCFtools software (Danecek et al. 2011) and R-SNPready package (Granato et al. 2018) in R software (R project 2022).

Table 1 Avocado SNPs datasets generated to implement the goal-oriented analyses proposed in this study

We relied on three datasets and approaches, as described below, to answer the research questions proposed in this study about the classification of the Colombian avocado germplasm given the classical races (ME, GU, and WI), as well as the presence and evolutionary origin of potential new genetic groups.

Unsupervised non-parametric genetic clustering

Population stratification was characterized using dataset 1, which included 4931 SNPs identified in 199 samples (Table 1), using two complementary non-parametric approaches from the unsupervised machine learning paradigm: partitioning and hierarchical clustering. We had to perform a preliminary step to reduce the dimensionality of the SNP dataset through principal component analysis (PCA) (Alhusain and Hafez 2018; Foote et al. 2019) using the R-glPca function in the R-adegenet package (Jombart and Ahmed 2011). To find the optimum number of principal components needed as input in the clustering analysis, we carried out the Tracy‐Widom test (Tracy and Widom 1994; Patterson et al. 2006) to obtain the statistical significance of each component employing the Eigenstrat function performed in R‐AssocTest (Wang et al. 2020) as in Foote et al. (2019).

Once dimensionality was reduced in 21 principal components (66.28% accumulated variance), we determined the optimal number of genetic pools or K groups for each unsupervised non-parametric clustering algorithm. Then, we conducted in the R software two complementary methods. First, we ran the NbClust (Charrad et al. 2014) algorithm, an internal measure’s function that integrates 30 indexes to determine the optimal K value based on ward.D2 method and the Euclidean distance, evaluating three (K = 3) to eight (K = 8) putative genetic pools. Second, we ran the optCluster (Sekula et al. 2017), an improvement of the traditional clValid algorithm (Brock et al. 2008) that optimizes both the K score, and the clustering approach based on internal stability and biological measures. In this sense, the optCluster function validated at once all clustering algorithms by each approach using cross-entropy and genetic algorithm methods, using weighted Spearman footrule distance (Kumar and Vassilvitskii 2010) across three (K = 3) to eight (K = 8) putative genetic pools. We explored the optimal number of clusters using available non-parametric algorithms, such as partitioning and hierarchical clustering. For partitioning clustering, we used K-means (MacQueen 1967; Lloyd 1982), Partitioning Around Mmedoids (PAM), and Clustering Large Applications (CLARA) (Kaufman and Rousseeuw 1990) algorithms. For hierarchical clustering, we ran Aglomerative Nesting (AGNES) and Divisive Analysis (DIANA) algorithms (Kaufman and Rousseeuw 1990) using the ward.D2 method.

Maximum likelihood phylogenetic reconstruction

The phylogenetic analysis was conducted using dataset 2, which included 3899 SNPs for 202 avocado samples (Table 1) and an outgroup (O), the close relative species P. schiedeana. Although this species could potentially hybridize with avocados, it behaved as an outgroup for our analyses (Ashworth and Clegg 2003). In a preliminary test (data not shown), in which we also used Phoebe bournei as an outgroup, P. schiedeana was more related to this species than avocados. Therefore, we decided to use only P. schiedeana as an outgroup because we could recover more SNPs. The alignment of nucleotide sequences across all target SNPs was visualized and verified in Geneious software (Kearse et al. 2012). The phylogenetic tree was reconstructed using the Maximum Likelihood (ML) method in the PhyML software (Guindon et al. 2010) with 1000 bootstrap replicates.

Genetic structure and contemporary directional migration rates among avocado genetic clusters

To quantify the genetic differentiation among clusters detected in the unsupervised non-parametric genetic cluster (K = 4) and phylogenetic (K = 5) analyses, we conducted inferences of molecular variance (AMOVA) using dataset 1. We calculated the global Phi scores (like FST, both provide population differentiation summary statistics) using 1000 permutations in the R-poppr package (Kamvar et al. 2014). Pairwise FST values were calculated in the R-dartR package (Gruber et al. 2018) to compare genetic clusters, while the genetic diversity was measured through observed (Ho) and expected (He) heterozygosity and inbreeding coefficient statistics (FIS) calculated in the R-SNPready software (Granato et al. 2018). Genotypes from the CGB with geo-referenced information (Fig. 1) were used to implement Mantel correlation tests with 1000 permutations between genetic differentiation and geographic distance within and between pairs of clusters. The R-geosphere package (Hijmans et al. 2021) was utilized to gather geographic distances. The R-ape package via the mantel.test function was used for the explicit test. Two diagrammatic versions were drawn in the same R software depicting genetic distance vs. geographic distance (km), as well as decaying correlograms against distance classes, the latter using the mantel.correlog functionality also from the R-ape package.

Finally, we conducted a Bayesian Markov Chain Monte Carlo (MCMC) analysis on dataset 1 using the BayesAss software v.3.05 (Wilson and Rannala 2003) to estimate contemporary and directional gene flow migration rates (m) among the four K = 4 (CO, ME, GU, and WI) and five K = 5 (ME, GU, WI, Colombian Andean, and Colombian Caribbean) genetic groups detected in the unsupervised non-parametric genetic clustering and phylogenetic analyses, respectively. We used the BA3SNP option (Mussmann et al. 2019). We set the parameters for migration rates (m: 0.15 in both analyses), allele frequencies (a: 0.3 in both analyses), and inbreeding coefficients (FIS: 0.01 in K = 4 and 0.005 in K = 5) to achieve an optimal acceptance rate between 20% and 60%, as recommended in the manual. We ran three independent analyses for each K using different seed values (100, 200, and 300) evaluated in 10 M iterations and 1 M burn-in, sampling every 1000 steps. We checked the convergence of the results and calculated the mean, standard deviation, and 95% Confidence Intervals (CI) of gene flow over the three independent runs, using the Tracer v.1.7.2 software (Rambaut et al. 2018). We considered gene flow migration rates with CI over 0 as significant. We drew the pairwise FST and m results using the R-qgraph package (Epskamp et al. 2012).

Ancestry inferences

To determine which groups of avocados are spanned by the CGB and SR genotypes, we ran two independent ancestry analyses using datasets 1 and 3, because each matrix presented a different number of SNPs, 4931 and 227, respectively (Table 1). These datasets included genotypes with high genetic ancestry (> 80%) to one of the genetic groups (ME, GU, WI, and CO, the Colombian group) determined in the unsupervised non-parametric genetic cluster analysis. In the case of the CO group, we selected samples representing the two possible genetic sub-populations detected in the Colombian collection (Andean and Caribbean) through phylogenetic analysis. We used samples with a clear genetic ancestry for these five genetic clusters as reference genotypes to implement a supervised analysis in ADMIXTURE software (Alexander and Lange 2011). These analyses were run for a K of five (ME, GU, WI, Colombian Andean, and Colombian Caribbean) using a cross-validation (CV) of 10. The CGB and SR samples with > 80% ancestry were assigned to a specific genetic cluster, and mixed samples (≤ 80%) were cataloged as admixtures/hybrids based on their admixture proportion (e.g., Colombian Andean × Colombian Caribbean).

Reconstruction of evolutionary demographic scenarios

We implemented Approximate Bayesian Computation (ABC) using the DIYABC Random Forest v.1.0 software (Collin et al. 2021), an extended version of DIYABC v.2.1.0. This tree-based classification method uses supervised machine learning to classify evolutionary relationship scenarios and estimate their parameter robustness based on backward simulations. With that in mind, we simulated 120,000 sets for the 4931 selected loci of dataset 1 (Table 1), 10,000 trees per scenario, and 1000 as the number of out-of-band testing samples to estimate the historical parameters given the scenario with the highest posterior probability. Because there is no experimental estimate of the SNP mutation rate available for avocados, the mean mutation rate followed an a priori uniform distribution. Each locus had a possible range of two allelic states, as expected for bi-allelic SNPs in a diploid species. Moreover, we set the divergence time as the number of generations to the most recent common ancestor as equal to or higher than in this order: \({t}_{3}\ge {t}_{2}\ge {t}_{1}\). First, we considered five groups: Mexican (ME), Guatemalan (GU), West Indian (WI), and the two identified Colombian groups (Andean and Caribbean). However, trial runs never showed convergence (data not included). Therefore, we simplified the analysis using previous parameters into four groups: the Mesoamerican groups (i.e., ME, GU, and WI) and the Colombian group (CO) with accessions from the two subgroups (i.e., Andean and Caribbean). This strategy showed convergence.

For these four genetic groups, we constructed 18 different demographic scenarios, grouping them into four families of hypotheses. The null hypothesis (scenario 1) was that all four groups (ME, GU, WI, and CO) diverged at the same time. In contrast, the other three families of hypotheses tested if CO split early, late, or nested compared with the other three recognized groups. The first family of hypotheses (nine scenarios, named from scenarios 2 to 10) reconstructed an early divergence of CO compared to ME, GU, and WI. The second family of hypotheses (three scenarios, named from scenarios 11 to 13) reconstructed a demographic history where CO diverged in a nested manner compared to ME, GU, and WI. Finally, the third family of hypotheses (five scenarios, from 14 to 18) reconstructed a demographic history where CO diverged later compared with ME, GU, and WI groups (Fig. S1). These scenarios reflect a uniform distribution for the possible genealogies. In other words, instead of only inputting likely topologies based on the results from previous archeological reports and genetic outputs from this work, we opted to embrace a broader spectrum of prior genealogies. Additionally, rather than recovering a single optimum scenario, we identified a family of probable genealogies with comparable posterior probabilities. This strategy is an updated utilization of the Bayesian thinking, which can concurrently handle multiple genealogies with similar likelihoods without limiting itself to a parametric singularity of the posterior distribution.

After selecting the three best scenarios according to the number of votes, we converted from the number of generations to years the resultant posterior distributions for the divergence times \({t}_{1}\), \({t}_{2}\), and \({t}_{3}\) (i.e., summarized by mean, standard deviation, and CI at 0.05 and 0.95). This transformation relied on the generation time (g), the expected average time in years between two consecutive generations, which correlates with the onset of the reproductive phase. Since the avocado species is a perennial woody tree crop (Miller and Gross 2011) with a variable reproductive phase depending on environmental and genetic factors, the conversion from the number of generations to absolute time was not trivial and had to acknowledge multiple expectations of the generation time g. We utilized three different calibration points to account for this variability in generation times across studies. A conservative generation time of g = 10 years represented the earliest age of fruiting observed among 11 pairwise cultivar crosses (i.e., ranging between three and 14 years) (Thorp et al. 2015). However, we also utilized generation times of g = 30 years and g = 50 years because avocado trees in nature may take longer to reach sufficient fecundity to produce enough viable fruits and offspring (Krome 1956; Goodall et al. 1970; Zuazo et al. 2021). Ultimately, these multiple conversion points enabled considering uncertainty thresholds for explicit comparisons with geological profiles and archeological records.

We finally compared the timing of divergence between pairs of lineages across the best models with two hypotheses on the evolution and genetic diversification of avocados, both inspired by archeological data and Spanish chroniclers’ descriptions (Galindo-Tovar et al. 2007; Galindo-Tovar and Arzate-Fernández 2010). The subgenus Persea probably diversified as a product of available habitats in the Pleistocene (i.e., from 2,580,000 to 11,700 YBP) (Galindo-Tovar et al. 2007). If so, avocado dispersion and diversification would have to be mediated by big mammals and hunter-gatherers humans as they migrated from Mesoamerica to northern South America through Central America and the Isthmus of Panama (Galindo-Tovar et al. 2007). In contrast, if the dispersal and diversification of avocados occurred more recently during agriculture and village life, we would expect divergence times during the Holocene, the most recent period of the Quaternary that began approximately 11,650 YBP until today (Krome 1956; Goodall et al. 1970; Galindo-Tovar et al. 2007; Miller and Gross 2011; Thorp et al. 2015; Zuazo et al. 2021).

Results

SNP discovery in avocado panels

The sequencing of 384 avocado genotypes from CGB and SR generated over 3,132,393,704 sequence reads (Table S1), which were combined with available sequences from 71 reference genotypes (Talavera et al. 2019), and the outgroup, P. schiedeana (Rendón-Anaya et al. 2019). In CBG and SR genotypes, the number of DNA sequences obtained was similar from leaf (CGB-3 M to 52 M) and root (SR-1.3 M to 56 M) tissues (Table S1). However, regarding the SNP calling, the matrices obtained from leaf DNA sequences had the highest number of SNPs. In contrast, matrices including sequences from root tissues had many missing data and a low number of polymorphisms. Therefore, creating a single dataset was inviable due to significant differences in SNP numbers between leaf and root tissue sequences. We then generated three different SNP datasets depending on the goal of each analysis, each including different samples (CGB, SR, RM, and O). The SNPs numbers ranged from 227 in dataset 3, which included 289 genotypes from SR and CGB, as well as RM materials from ME, WI, GU, Colombian Andean, and Colombian Caribbean groups, with high ancestry (> 80%) for each genetic cluster, to 4931 SNPs in dataset 1, which consisted of 199 genotypes from CGB and RM genotypes (Table 1).

The global SNP summary statistics were calculated using dataset 1, which had the highest SNP number (4931) mapped in 8082 contigs of the Hass genome reference assembly. This dataset had a Ts/Tv proportion of 1.54, with 2988 Ts and 1943 Tv substitutions. These SNPs presented an MDS mean of 55, ranging from 15 (Ctg0042_632283 SNP) to 245 (Ctg1142_29090); while the nucleotide diversity (Pi) presented a mean of 0.275 ranging from 0.064 (Ctg0037_1227237) to 0.500 (Ctg0882_23451). On the other hand, the polymorphisms had a mean MAF of 0.255, ranging from 0.100 (Ctg1752_17921) to 0.500 (Ctg0066_1021952), and the SNPs were highly informative, with a mean PIC of 0.283, ranging between 0.160 (Ctg1752_17921) and 0.380 (Ctg0066_1021952).

Avocado genetic stratification was consistent with four genetic groups

The two unsupervised non-parametric clustering approaches consistently showed four clusters as genetic groups (Table S2 and Table S3). Initially, we obtained three principal components, where the first component explained 26.77%, the second component explained 12.98%, and the third component explained 6.93% of the variance from filtered SNPs. The percentage of variation explained by the first three principal components is a typical value in allogamous species for dimensionality reduction analysis using PCA from SNPs, as demonstrated in the initial study by Talavera et al. (2019), who employed the same dimensionality reduction algorithm as ours, obtaining comparable partitions across components. However, we performed the unsupervised training analysis using all the first 21 principal components suggested by the Tracy-Widom test, which accounts for 66.28% of the total variance, reserving only the first two components for 2D visualization of the clustering results. Based on the first 21 principal components, the NbClust function suggested four optimal clusters by running the ward.D2 method (Table S2).

Meanwhile, independent of the clustering approach or validation method as cross-entropy and genetic algorithms, the optCluster function recurrently suggested four optimal clusters. PAM was the best partitional clustering method (Cross-Entropy score: 25.82, Genetic Algorithm score: 25.50), and AGNES was the best hierarchical method (Cross-Entropy score: 20.82, Genetic Algorithm score: 20.82). Therefore, the optCluster function suggested PAM as the best clustering algorithm compared to AGNES.

The avocado genotypes were grouped into four clusters that we identified based on available passport data as follows: GU cluster (n = 43), CO cluster (n = 79), ME cluster (n = 19), and WI cluster (n = 58) (Fig. 2). The PCA biplot showed the CO group separated in the first component, while the other three clusters consisted of ME, GU, and WI groups from Talavera et al. (2019). The CO cluster was exclusively composed of Colombian native accessions, while the other three groups included RM genotypes with complete ancestry of GU, ME, and WI groups, as well as RM genotypes with admixtures between Mesoamerican groups (e.g., GU × WI, GU × ME, and ME × WI), commercial materials (CM), and accessions conserved in the CGB with high ancestry of ME (“Bacon”, “Duke_7”, “Fuerte”, “Zutano”, “Topa Topa”, and VAPA_066), GU (“Edranol”, “Hass”, “Reed”, “Simmonds”, QUIQUI_029, and TACO_047), and WI (“Booth_5”, “Booth_7”, “Booth_8”, “Choquete”, “Lula”,  “Pollock”, “Waldin”, and other 31 Colombian CGB accessions (Fig. 2 and Table S3).

Fig. 2
figure 2

Principal component analysis (PCA) biplot showing the genetic diversity of 199 avocado genotypes of CGB and RM analyzed with 4931 SNPs. The colors indicate the assignment to a genetic group based on the Partitional Around Medoids (PAM). Abbreviations are as follows: Guatemalan, GU cluster; Mexican, ME cluster; West Indian, WI cluster; and Colombian, CO cluster. The black letters indicate the three recognized avocado groups (GU: Guatemalan, ME: Mexican, and WI: West Indian) based on the results of Talavera et al. (2019). The varieties or cultivars conserved in CGB are marked as CM (Commercial Materials)

Phylogenetic analysis supported stratification across four major genetic pools and suggested a substructure in the Colombian genetic group

A phylogenetic tree was constructed using the SNP alignment data of 203 avocado genotypes. The tree consisted of four well-supported clades corresponding to the reference groups and a new clade comprising mainly Colombian genotypes (CO group) from the Colombian germplasm bank (Fig. S2). However, some accessions did not fall within a single clade. Based on the clustering analysis, they corresponded to hybrids or admixtures. Thus, we created a new phylogenetic analysis excluding those hybrids or admixtures, with 92 samples and 3899 SNPs (Fig. 3). The study confirmed four clades with bootstrap support (BS) of 100% that match the clustering analyses (Fig. 1). The clade that first diverged corresponded to samples belonging to the ME group. The accessions from the GU group formed the next clade in position, followed by the clade with the WI materials. The CO and WI groups were related in two supported clades in the most derived phylogeny positions. The CO clade was divided into two subgroups. Based on passport data, the avocado genotypes mainly from the Andean region (Colombian Andean: Cauca, Huila, Quindio, Risaralda, Santander, Meta, and Tolima provinces) were regrouped in a well-supported clade (BS = 100%), while a second weaker-supported clade (BS = 43%) regrouped samples from the Caribbean region (Colombian Caribbean: Bolivar, Cesar, and Magdalena provinces). From this filtered dataset (without hybrids), seven accessions from the Colombian germplasm fell in the Mesoamerican clades, four genotypes (“Waldin”, QUIBU_11, VADU_10, and VADU_14) in the WI clade, one (“Hass”) in the GU clade, and two (“Topa Topa”, and “Duke_7”) in the ME clade (Fig. 3).

Fig. 3
figure 3

Maximum likelihood phylogenetic tree showing the relationships with 3899 SNPs among 91 avocado accessions (> 80% of ancestry for each genetic group) from CGB (Colombian Andean and Colombian Caribbean), Reference Materials (RM) of Mexican (ME), Guatemalan (GU), and West Indian (WI) ancestry, and the P. schiedeana outgroup. The values on nodes are the percentage of the bootstrap values for 1000 replicates. Abbreviations are as follows: ME: Mexican, GU: Guatemalan, WI: West Indian group, and CO: Colombian genetic groups. Andean: CGB genotypes from the Andean region of Colombia, Caribbean: CGB genotypes from the Caribbean region of Colombia

Genetic divergence was concordant with botanical and spatial differentiation

The AMOVAs for the K = 4 and K = 5 analyses showed that most of the genetic variation was found among genetic groups (K = 4: 54.86% and K = 5: 55.10%), with 45.13% (K = 4) and 44.89% (K = 5) of variation within the clusters. This result indicated a solid genetic structure and significant genetic differentiation (p = 0.001) between the different avocado genetic groups detected in K = 4 (Phi: 0.548) and K = 5 (Phi: 0.551). The pairwise FST values indicated that the CO group had significant (p = 0.001) genetic differentiation from the ME (FST of 0.49), GU (FST of 0.47), and WI (FST of 0.30) groups in the K = 4 analysis (Fig. 4a). The Andean and Caribbean Colombian groups also showed genetic differences (FST of 0.14), but lower compared to the differences between the Colombian groups against ME (pairwise FST of 0.48 and 0.50 for Caribbean and Andean), GU (pairwise FST of 0.46 and 0.48 for Caribbean and Andean), and WI (pairwise FST of 0.27 and 0.31 for Caribbean and Andean) groups in the K = 5 analysis (Fig. 4c). In both analyses using K = 4 and K = 5, the genetic differences among the Mesoamerican groups were intermediate and presented values between 0.21 (ME vs. GU) and 0.30 (WI vs. ME) (Fig. 4a, c). These results suggest a strong genetic structure and differentiation between the avocado genetic groups, particularly the Colombian and Mesoamerican groups.

Fig. 4
figure 4

Pairwise FST a, c and contemporary migration rates m b, d values among the four (K = 4) and five (K = 5) avocado genetic clusters determined by population genetic structure and phylogenetic analysis. In the Figure, GU corresponds to the Guatemalan genetic group, CO corresponds to the Colombian group, ME corresponds to the Mexican genetic group, WI corresponds to the West Indian genetic group, COA corresponds to the Colombian Andean genetic group, and COC corresponds to the Colombian Caribbean genetic group

The gene flow according to directional migration rates (m) calculated for K = 4 (Fig. 4b) and K = 5 (Fig. 4d) in BayesAss converged and did not show differences between the several simulations implemented for each dataset. The results suggested that gene flow among the different genetic groups of avocados was generally low (< 0.1), except for the directional gene flow between the ME and GU groups (ME → GU: 0.15 in K = 4 and 0.16 in K = 5). Also, some low but significant directional gene flow was detected between ME and WI (ME → WI: 0.04 in both K analyses), WI and Colombian Caribbean (WI → Caribbean: 0.02 in K = 5), and Colombian Andean and Caribbean (Andean → Caribbean: 0.03 in K = 5). Additionally, significant bidirectional gene flow scores were detected between WI and GU (GU → WI: 0.02 and WI → GU: 0.05 in both K analyses), as well as between WI and CO (WI → CO: 0.08 and CO → WI: 0.01 in K = 4), and WI and Colombian Andean (WI → Andean: 0.06 and Andean → WI: 0.02 in K = 5) (Fig. 4b and d). These results indicate that gene flow among different avocado genetic groups is limited, and there is some degree of isolation and differentiation between these groups.

In addition, we explored the genomic geographic divergence within the Colombian sub-populations, and their relationship with the WI group, using the samples with geo-referenced information. We implemented explicit correlations of genetic distance with geographic gradients via Mantel tests between pairs of clusters. Here, we were able to recover a significant (p < 0.05) trend of intra-group isolation by distance (IBD) only when admixed individuals were included, as expected under rampant mobility (Fig. S3 and Fig. S4). However, the IBD trend vanished when the analysis was limited to the purified WI and Colombian Andean and Caribbean clusters. This result may suggest that the same underlying process that helps shape admixture could boost regionalized geographic structure.

The overall population of avocados exhibited intermediate levels of Ho (0.22) and He (0.27), with a positive value of FIS (0.19). When comparing the genetic groups identified in the K = 4 and K = 5 analyses, the Mesoamerican groups WI (Ho: 0.28 and He: 0.24), GU (Ho: 0.24 and He: 0.20), and ME (Ho: 0.22 and He:0.19) were generally more heterozygous than the Colombian group (CO) in the K = 4 analysis (Ho: 0.18 and He: 0.16). In the K = 5 analysis, Colombian Andean (Ho: 0.18 and He: 0.16) and Caribbean (Ho: 0.18 and He: 0.15) subgroups exhibited similar levels of heterozygosity. Negative FIS values were observed for all genetic groups; however, the CO group had higher negative FIS values (− 0.08 in K = 4) compared to the GU (− 0.18), WI (− 0.14), and ME (− 0.12) genetic groups. Furthermore, the K = 5 analysis identified distinct FIS values between the Colombian Andean (− 0.13) and Caribbean (− 0.20) subgroups (Table 2).

Table 2 Genetic diversity summary statistics (mean and range) in the overall avocado population and in the four and five genetic clusters detected in this study

Limited Mesoamerican genetic ancestry in the Colombian germplasm supported the presence of a new avocado group in Colombia

The CGB conserves avocado genotypes with ancestry from all avocado genetic groups, with 55.8% having complete ancestry (pure genotypes) to the five clusters recovered in previous analyses. Most CGB genotypes had genetic ancestry from Andean (34.1%) and Caribbean (13.2%) groups native to Colombia. A minor proportion (8.5%) of CGB genotypes presented pure ancestry of WI (5.4%: “Pollock”, “Waldin”, QUIBU_011, QUIPI_001, QUIPI_003, VABU_010, and VABU_014), ME (1.6%: “Duke_7” and “Topa Topa”), and GU (1.6%: “Reed” and “Hass”). The other 44.2% of the CGB genotypes were admixed or hybrids with diverse admixture patterns (18 possible combinations of hybrid backgrounds). The highest proportion of avocado hybrids had mixed ancestry from WI × GU (10.1%), WI × Colombian Andean (4.7%), and WI × GU × Colombian Andean (3.1%). Meanwhile, NATU_001 and CANO_008, some of the few accessions from the CGB reported as tolerant to P. cinammomi (Rodríguez-Henao et al. 2017), were admixed (53% WI × 38% Colombian Andean × 8% Colombian Caribbean for NATU_001, and 63% WI × 22% GU × 14% Colombian Andean for CANO_008 (Fig. 5a, Table S3, and Table S4). On the contrary, in the SR, most genotypes (75.7%) had a pure ancestry of all genetic groups except the ME group. The genotypes with pure genetic ancestry were distributed in the Colombian Caribbean (37.7%), Andean (11.3%), GU (23.4%), and WI (3.5%) groups. The remaining genotypes were admixed (24.24%), the majority having a Colombian ancestry of Caribbean × Andean (13.4%) and Andean × Caribbean (4.3%) (Fig. 5b and Table S4).

Fig. 5
figure 5

Barplot ancestry analysis results implemented in the CGB and SR germplasm using genotypes with complete ancestry backgrounds of Mexican (ME), Guatemalan (GU), West Indian (WI), Colombian Andean, and Colombian Caribbean Colombian (CO) genetic groups. a CGB ancestry analysis barplot. b SR ancestry analysis barplot

The Colombian group diverged during the Pleistocene after the ME, GU, and WI splits

We used the DIYABC random forest method to test the best scenario of evolutionary genealogical relationships among groups explaining the evolution of the Colombian group (CO) against the other groups (i.e., ME, GU, and WI) (Fig. S1). From all 18 scenarios, scenario 15 fitted best our data with 1460 votes out of 10,000 random trees and a posterior probability of 0.550. The topology of this scenario is the same as the phylogenetic analyses (Fig. 6a and Fig. S2). The subsequent two best scenarios were 16 and 17, with 1141 votes and 1092 votes, respectively (Fig. 6b, c). All three scenarios corresponded to the hypothesis of later divergence for CO compared to the ME, GU, and WI groups. The best three scenarios showed that CO and WI are sister clades with a recent common ancestor between them. The difference among the three best scenarios was whether ME or WI had the earliest divergence (Fig. 6a, c, d).

Fig. 6
figure 6

Evolutionary relationships of avocado groups reconstructed by the DIYABC random forest methodology. a The most probable evolutionary scenario (scenario 15) of avocados in Mesoamerica and northern South America showed 1460 votes and a posterior probability of 0.550. This scenario suggests an earlier divergence of the Mexican (ME) group at t3, followed by the Guatemalan (GU) group divergence at t2 and a later divergence between the Colombian group (CO) and the West Indian (WI) group at t1. b The second-most probable scenario (Scenario 16) had 1141 votes. This scenario suggests a parallel evolution of two ancient clades that diverged at t3. One clade split at t2, originating ME and GU. The other clade split at t1, forming CO and WI. (c) The third most probable scenario (Scenario 17) had 1092 votes. This scenario suggests an earlier divergence of the GU at t3, followed by the ME divergence at t2 and a later divergence between the CO and WI at t1. The divergence times are not scaled. df The divergence time in years of the best three scenarios, 15, 16, and 17, respectively, assuming 10 years (green), 30 years (orange), and 50 years (violet) of generation time (g). The white circles represent the mean estimates, the thick hash marks indicate the standard deviation, and the thin hash marks indicate the confidence interval at 95%. The gray vertical dot line separates the Pleistocene and the Holocene within the Quaternary era. The Pleistocene started 2,000,000 years ago and ended 12,000 years ago. At that time, the Holocene started until the present

According to the best three scenarios with the higher number of votes, all four genetic groups diverged during the Pleistocene regardless of the generation time (g) used as calibration. This result supports that divergence is ancient and occurred before the Holocene when agriculture and village life started in the Americas. Thus, in the best scenario (scenario 15 with 1460 votes and p = 0.550), the median divergence time of the ME group (t3) occurred between 87,079 YBP (95% CI: 36,020–95,300 at g = 10) and 435,393 YBP (95% CI: 180,100–476,502 at g = 50). Then, the median divergence time of the GU group (t2) occurred between 57,070 YBP (95% CI: 28,490–80,680 at g = 10) and 285,351 YBP (95% CI: 142,450–403,398 at g = 50). Finally, the later median divergence time between WI and CO (t1) occurred between 34,879 YBP (95% CI: 6570–60,058 at g = 10) and 174,393 YBP (95% CI: 32,850–300,291 at g = 50) (Fig. 6b).

In the case of the second-best scenario (scenario 16 with 1,141 votes), the oldest median divergence time between two ancestral avocado clades (t3) occurred between 87,351 YBP (95% CI: 58,790–95,828 at g = 10) and 436,755 YBP (95%: 293,950–479,139 at g = 50). Moreover, the divergence between ME and GU (t2) occurred between 65,500 YBP (95% CI:37,220–86,756 at g = 10) and 327,499 YBP (95% CI: 186,100–433,781 at g = 50). In parallel, the divergence between the CO and WI (t1) occurred between 21,746 YBP (95% CI: 2,440–44,988 at g = 10) and 108,732 YBP (95% CI: 12,200–224,941 at g = 50) (Fig. 6d).

Finally, in the third-best scenario (scenario 17 with 1092 votes), the median divergence time of the GU group (t3) occurred between 74,968 YBP (95% CI: 47,570–97,945 at g = 10) and 374,840 YBP (95% CI: 327,850–487,474 at g = 50). Then, the median divergence time of the ME group (t2) occurred between 59,626 YBP (95% CI: 46,240–85,654 at g = 10) and 298,130 YBP (95% CI: 231,200–428,268 at g = 50). Finally, the most recent median divergence time between WI and CO (t1) occurred between 28,726 YBP (95% CI: 2420–37,120 at g = 10) and 143,631 YBP (95% CI: 12,100–185,599 at g = 50) (Fig. 6f).

Discussion

Our study analyzed two collections of Colombian avocado genotypes: (1) the Colombian Germplasm Bank (CGB), which contains both native avocados and commercial varieties, and (2) seedling rootstocks (SR) from commercial orchards in the Antioquia province, a major Colombian avocado-growing region. The comparison of genome-reduced sequences of the Colombian germplasm with avocados from Mesoamerica, including the Mexican (ME), Guatemalan (GU), and West Indian (WI) groups, as well as hybrids among these, reveals that the majority of CGB and SR accessions belong to a new Colombian genetic group (CO), divided into Andean and Caribbean sub-populations. While some CGB and SR accessions belonged to the classical Mesoamerican groups, this discovery highlights untapped genetic diversity in northwest South America that could be highly beneficial for tree crop improvement.

Enabling genomic resources for avocado rootstocks

This study does not report a unique marker dataset due to significant differences in SNP coverage for sequences gathered from leaf and root tissues. Consequently, we had to split the reference mapping and SNP calling steps by tissue type among sequences to generate three datasets for each research question. Regardless of the final number of SNPs in each dataset, these variants enable us to approach the research questions at different levels of resolution. The efficiency of nucleic acid extraction processes varies depending on the tissue’s properties, such as physical characteristics (e.g., lignification level) and chemical composition (e.g., presence of secondary metabolites). Therefore, when utilizing root tissue for genomic approaches, it is recommended to sample tissues with consistent and optimized conditions (e.g., dryness and size) (Fisk et al. 2010; Zeng et al. 2015; Oliveira et al. 2015; Miller et al. 2017). Alternatively, stooling or layering (Knight et al. 1927; Webster 1995) could also be employed to facilitate the collection of leaf tissue from rootstocks. These propagation approaches could convey more effective DNA extraction methods for avocado rootstocks by inducing leaf tissue formation rather than exclusively relying on the root system.

The Colombian avocado germplasm does not belong to the classical Mesoamerican groups but is consistent with a new gene pool of P. americana

All analyses implemented in this study reinforce the well-described differentiation among the classical Mesoamerican groups (ME, GU, and WI), where ME and GU are genetically closer than WI (Rubinstein et al. 2019; Ge et al. 2019b; Talavera et al. 2019; Wienk et al. 2022; Solares et al. 2023). However, our inferences went one step further. We demonstrated a consistent and robust genetic differentiation of the previously reported Mesoamerican groups from a new genetic group composed exclusively of genotypes native to northwest South America, i.e., the Colombian (CO) group. This novel genetic group further exhibits a genetic substructure associated with the eco-geographic origin, allowing the differentiation of avocado genotypes from Colombia’s Andean and Caribbean regions. Cañas-Gutierrez et al. (2019) also report signals of population structure between avocados from the North and the South of Colombia by scoring SSRs markers in samples of the CGB and some commercial cultivars from the Antioquia province. Various genetic summary statistics support this cryptic population structure. For instance, concerning gene flow patterns, the Colombian group had low gene flow with ME and GU but higher gene flow from the WI group. At the local level, these results suggest that the gene flow in the Caribbean group comes from the Andean and WI groups. The low gene flow and significant genetic differentiation linked with geography may indicate ecological divergence, an ad hoc hypothesis that deserves further testing (Cañas-Gutierrez et al. 2019).

Meanwhile, another genetic summary statistic associated with different levels of population stratification was heterozygosity (i.e., correlated with negative FIS scores). The Colombian group presents lower heterozygosity values than the Mesoamerican groups, which could reflect a bottleneck during colonization. High levels of heterozygosity in this species are expected to be reinforced by outcrossing (ranging from 74% to 96%) due to the avocado’s peculiar floral biology behavior known as synchronous protogynous dichogamy. In this system, the female flower function predates the male flower, which minimizes selfing and promotes genetic admixture (Borrone et al. 2008; Solares et al. 2023). Alternatively, the positive selection of highly heterozygous individuals in specific environments may be a signature of heterosis in hybrids with mixed ancestries among groups, a pattern already reported by Reyes-Herrera et al. (2020). Finally, heterozygosity may as well vary by other exogenous forces such as tree distribution within orchards, edaphoclimatic variability, and agronomic management (Borrone et al. 2008; Sánchez-González et al. 2020). Although the underlying cause of divergence and its correlates may remain elusive, the Colombian groups reported in this study still provide new insights into avocados’ phylogenetic diversity, population structure, and the potential drivers of dispersal.

Finally, this is the first report of the Colombian (CO) avocado group, probably associated with the absence of native avocado samples from Colombia in previous studies (Ge et al. 2019a, b; Rubinstein et al. 2019; Solares et al. 2023; Talavera et al. 2019; Wienk et al. 2022). As the sampling of native avocado trees across the Americas continues increasing, we may be able to identify other genetic clusters locally adapted to distinctive agroecological regions. For example, Solares et al. (2023) recently included some samples from Costa Rica, a new possible genetic group that also seems to be under differentiation from the prevalent Mesoamerican trinity. Therefore, future studies require embracing potential comprehensive centers of avocado genetic diversity beyond Mesoamerica while better clarifying the origin of such possible novel groups.

Seedling rootstocks from commercial orchards result from rampant admixture among genetic groups with potential heterotic effects

The CGB and SR germplasm exhibited high numbers of pure genotypes with ancestries predominantly from the Colombian groups, with the Caribbean ancestry being more frequent in commercial genotypes and the Andean ancestry in the CGB collection. This result could be the effect of the distinct sampling approaches employed by the two collections. The SR primarily includes rootstocks used in commercial orchards (Cañas-Gutiérrez et al. 2019), while the CGB comprises native “criollo” genotypes, some of which may have a natural tolerance to P. cinammomi (Rodríguez-Henao et al. 2017). Moreover, 5% of the CGB and 23% of the SR populations display pure ancestries of the classical Mesoamerican groups. This result suggests that the seedlings used for commercial plantations also have a proportion of Mesoamerican ancestry in their genetic background. Our finding aligns with the expectation that plant genotypes utilized in these commercial plantations could result from modern introductions (Cañas-Gutiérrez et al. 2019).

Previous genetic structure reports have shown recurrent evidence of hybridization and introgression among the classical Mesoamerican groups (Rubinstein et al. 2019; Ge et al. 2019b; Talavera et al. 2019; Wienk et al. 2022; Solares et al. 2023). Therefore, it will not be surprising that if Mesoamerican ancestries were recovered from seedling rootstocks in Colombia, they could as well harbor signatures of admixture. Indeed, 44% and 24% of genotypes of the CGB and SR populations were classified as hybrids. Within the CGB, some hybrids displayed ancestries from Mesoamerican groups, such as WI × GU (Rodríguez-Henao et al. 2017). The remaining hybrids (20.2%) have backgrounds between the Colombian Andean and other groups, although hybrids with ancestries between Andean and Caribbean are relatively scarce in the CGB (4.6%). In contrast, most hybrids in the SR population have backgrounds between the Colombian Andean and Caribbean groups (18%). These findings indicate that the commercial seedlings used for grafting may result from hybridization within the Colombian germplasm and other Mesoamerican groups. This result is interesting, especially considering that current rootstock planting material in Colombia is usually gathered from seeds of commercial varieties such as “Waldin”, “Duke_7”, “G_755”, and “Lula” as well as some Colombian genotypes like La Torre, Tumaco, and Villagornona, in addition to “criollo” native trees (Ríos-Castaño et al. 2005; Bernal and Díaz 2020). The fact that most rootstocks used in Colombia’s main avocado production region have admixed origins may result in novel heterotic combinations that could assist adaptation to the country’s specific agro-ecological conditions (a hypothesis already embraced by Reyes-Herrera et al. (2020) in the same plantations using SSR markers). Although insightful for rootstock breeding efforts, this finding does not yet provide clues on the evolutionary origin of the Colombian genotypes, a matter we fully embrace in the next section.

The Colombian lineage of avocados diverge from the West Indian group during the Pleistocene

The best demographic model from our study (scenario 15) is concordant with the recent genome-wide phylogeny from Solares et al. (2023) and with our phylogeny. Both results suggest that the Mexican lineage diverged first. However, the subsequent best scenarios (scenarios 16 and 17) open the question about whether the Guatemalan group experienced the second most ancient divergence or if the Mexican and Guatemalan groups are both sister clades of the West Indian and Colombian groups.

Despite these undefined divergence nodes, the 95% confidence intervals of their timing suggest that the evolution of avocado, its long-distance dispersal, and possible adaptive divergence started ~ 430,000 YBP, a more recent estimate compared to the divergence time of 1.3 M YBP calculated from the wide-genome phylogeny of Solares et al. (2023). However, both calculations match the climatic fluctuations of the Pleistocene, a period from 2.5 M to 11,700 YBP. During the Pleistocene, the Central American isthmus had already uplifted completely, joining North America and South America and enabling the massive migration interchange of fauna, flora, and humans among continents (Galindo-Tovar et al. 2008). Thus, our data are consistent with long-distance dispersal mediated by big mammals such as the giant ground slots that inhabited the Americas during the Pleistocene. These species probably consumed avocados, dispersed their seeds, and generated isolation of specific populations for long periods before any human did (Diamond 1999; Barlow 2002; Chen et al. 2009).

More recently, the West Indian and the Colombian groups share a common ancestor before the onset of the Holocene (i.e., 34,879 YBP in scenario 15, 21,746 YBP in scenario 16, and 28,726 YBP in scenario 17). Our results suggest that the four main groups of P. americana were already differentiated when the first humans arrived in the USA. During that epoch, the dispersion of avocado seeds and propagules may have happened by spontaneous growth from leftovers of the first hunter-gatherers’ humans that reached America soon after the last glacial maximum (LGM) ~ 16,000 to 13,000 YBP. They probably consumed avocados in Mesoamerica as they migrated southward, reinforcing their dispersion (Wiersum 1997; Diamond 1999; Galindo-Tovar and Arzate-Fernández 2010). Consequently, our evidence sustains the hypothesis that tropical trees such as avocados developed together and enabled the establishment of American cultures (Galindo-Tovar and Arzate-Fernández 2010). Avocados could be one of the first trees recurrently used in situ across different places in the tropical America, ultimately promoting sedentary life and agriculture development (Smith 1966, 1969; Galindo-Tovar and Arzate-Fernández 2010).

Meanwhile, the first human activities in northwest South America, nowadays Colombia, date around 11,000 YBP (Oyuela-Caycedo 2008). Thus, given the archeological evidence, Colombia must have been a mandatory trading path bringing Mesoamerican and South American plant crops (Oyuela-Caycedo 2008). Specifically, trading activities by diverse human cultures extended from Mexico and Honduras toward northern South America. Archeological evidence points towards in situ utilization of the Mexican avocado group around 8000 YBP in the Tehuacan Valley (Smith 1966, 1969). Further archeological evidence dates the avocado seeds in the Moche Valle of Peru between 2500 and 1800 YBP, and on the Peruvian Pacific coast around 1500 YBP. Likewise, the Caral culture from Peru consumed avocados even before maize 1200 years ago (Heiser 1979; Pozorski 1979; Skidmore 2005). The presence of cassava in Tabasco (Mexico) 4600 years ago offers evidence for recurrent bidirectional trading, including plant resources from the Amazon basin, a plausible pattern for avocados, too (Healy 1978; Pope et al. 2001; Marcus 2003).

Our genetic data agree with a common ancestor between the West Indian and the Colombian groups. Archeological and historical evidence suggests that the West Indian group originated in the lowlands of Yucatan and Belize. According to Galindo-Tovar and Arzate-Fernández (2010), migration may have started as follows: Maya groups living in this area utilized West Indian avocados and migrated eastward, reaching Honduras with evidence of avocado consumption by the Papayeca culture around 1200–1000 YBP. Finally, when the Spanish arrived on the Caribbean coast of South America, they described the presence of avocado trees in Yaharo (Colombia) in 1519. Other chronicles described avocados in the Peruvian Amazonia in 1542 and Ecuador in 1748, matching West Indian characteristics (Galindo-Tovar and Arzate-Fernández 2010). Overall, the three best evolutionary demographic scenarios and the geographic distribution of the genetic groups fit archeological data and chroniclers’ descriptions (Galindo-Tovar et al. 2008; Galindo-Tovar and Arzate-Fernández 2010). Interpreting the genetic data in the light of archeological and documentary evidence supports long-distance exchange promoted by human groups in areas where avocado races were already established. Such germplasm exchange in the last 11,000 years may have produced, until present times, the high levels of admixture observed in P. americana accessions and the isolation by distance pattern that arises when hybrid accessions are considered (i.e., “rampant” mobility hypothesis). Although the evolutionary history between Mexican and Guatemalan groups remains open and is beyond the scope of the current sampling, our study reinforces the origin and adaptation of avocados to middle-high altitudes in Mexico and Guatemala.

Morphological and genomic convergence of the avocado fruit parsimoniously supports a common ancestor between the West Indian and Colombian groups, which opens new questions for future studies. For instance, did the pre-adaptation of the lowland West Indian group from Central America enable the colonization of South America? After colonization, did the West Indian group radiate across northern South America, and did it re-adapt to new local conditions diverging in the Colombian groups? Recent genome-wide evidence supporting wild avocados from Costa Rica as a probable distinct ecotype arising from ancient hybridization between the Mexican group and other antique lineages opens further exciting questions (Solares et al. 2023). Did different groups reach South America and hybridize with the nascent sub-populations? Future research efforts aiming to test the above hypotheses will require extensive sampling within Colombia and surrounding areas, covering all distribution ranges, such as Costa Rica in Mesoamerica and Ecuador and Peru in the north of South America. We hypothesize ad hoc that as this study supports a new genetically differentiated sub-population in Colombia, expanding characterization to those countries could reveal further primary or secondary centers of diversity.

Extensive molecular sampling also enables compelling reconstructions of the demographic, evolutionary, and selective processes interplayed during the avocado colonization of South America and its subsequent divergence and adaptation. After all, genomic heterogeneity obliges us to acknowledge that multiple, sometimes conflicting signatures differently imprint discrete positions across the genome (Ellegren and Galtier 2016). It is, therefore, practical to move from a pre-conception that coerces all genetic markers into a ubiquitous underlying process. Instead, embracing more credible genomic thinking requires concurrently acknowledging the existence of porous genome sections more prone to exhibit gene flow (Stölting et al. 2013) compared to low-recombining genomic islands of divergence (Wolf and Ellegren 2017), particularly among groups. This shift in the interpretation of the scale and drivers of heterogeneity from the population level into the molecular fine-tuning is already appreciated in several crop plant species (Cortés et al. 2018), avocados not being the exception. For instance, Rendón-Anaya et al. (2019), comparing FST- and dxy statistics, determined genomic divergence profiles among avocado groups from Mesoamerica, especially against the Hass background. Also considering the three traditional Central American groups, Solares et al. (2023) performed selective sweep and FST- and π-inspired divergence mapping, focusing on the Gwen varietal and the two main heterodichogamy flowering types. Interestingly, authors found gene enrichments for stress response, terpene production, and metabolic processes linked to interracial divergence. These processes could as well underlie the separation of the Colombian group.

Therefore, it remains to be assessed via genomic scans whether the Colombian sub-populations, as part of their divergence and adaptation across northern South America, recruited genomic standing variation dating back to its Mesoamerican ancestors or recalled novel variants that arose after the formation of the isthmus of Panama. Our current data may offer some tendencies in this regard. Reports by Rendón-Anaya et al. (2019) and Solares et al. (2023) utilized whole-genome resequencing (WGR), which confers more coverage of highly recombined regions experiencing insufficient linkage disequilibrium, a common trend in long-lived allogamous tree species (Kelleher et al. 2012). Unfortunately, so far, we have been unable to carry out those detailed genome-wide reconstructions of the inter-racial divergence and selection profiles against the new Colombian subgroups because the NCBI-uploaded assembly of Rendón-Anaya et al. (2019) lacked the pseudomolecule information. A new reanalysis of our dataset using chromosome-level information of the recent Gwen’s reference genome by Solares et al. (2023) could provide more resolution for divergence and selective sweep mapping, which may, in turn, sustain the distinctiveness of the Colombian clade.

Perspectives

The novel sub-population from northwest South America indicates that previously unexplored avocado resources may have persisted via divergent selection across isolated pockets of cryptic diversity at secluded hills and valleys of the northern Andes and the Caribbean savanna. Therefore, further habitat-based population-guided collections (Castañeda-Álvarez et al. 2016) are crucial to prioritize the conservation of avocado genetic resources and better characterize these isolated pockets of diversity (López-Guzmán et al. 2021), both lowland and mid-altitude. Notably, we recommend new sampling trips at the foothills of the Sierra Nevada de Santa Marta, the massif of Montes de María, and the vicinity of Santa María la Antigua del Darién (southeast of the Isthmus of Panama). The former, an isolated mountain range in northern Colombia separated from the Andean range, is where avocados were first described during colonial times (Reyes-Herrera et al. 2020). The second is a reminiscence of an antique orchard of splendid local avocado trees. Yet abandoned decades ago by the conflict and the plagues, its plus tree survivors are candidates to source adaptation and resistance. The latter corresponds to the first settlement founded by conquistadors in mainland America (ca. in 1510), initially named Dariena but deserted a couple of decades after due to the unforgiving climate and the fierceness of the indigenous people. All three are likely sources of isolated exotic variation for avocados, without excluding other similar hotspots of cryptic avocado diversity in Ecuador and even south of the Huancabamba depression, a critical biogeographical barrier for Andean plant taxa in the northern Andes of Peru (Weigend 2002).

Novel genetic resources will boost avocado breeding

Modern platforms for allelic discovery (Hickey et al. 2017) would benefit by targeting cryptic pockets of genetic diversity and habitat-based population-guided collections (Castañeda-Álvarez et al. 2016), as suggested here for avocado resources in the northern Andes. This way, avocado rootstock pre-breeding efforts will not rely exclusively on exogenous diversity, likely without sufficient pre-adaptations to the local conditions of northern South America. Ultimately, the identification, conservation, and utilization of novel adaptive sources (Cortés et al. 2020) among native avocados and related wild species will enable diversifying rootstock selection by offsetting the winnowing effect (McCouch 2004) of clonal tree propagation in natural genetic variation (Ingvarsson and Dahlberg 2019).

Classical diversifying rootstock selection of multi-clonal genotypes and half-sib families (Fernández-Paz et al. 2021) may be boosted by genome-enabled predictions targeting discreet genetic pools (Arenas et al. 2021), such as the Colombian cluster described in this study. Early screening at nurseries could target racial constitution, and desired categorical and quantitative trait scores in adulthood. Predictive markers of the demographic substructure linked with local stresses may also assist biotic resistance (Guevara-Escudero et al. 2021). Multi-clonal and seedling rootstock breeding from local genetic resources (Mickelbart and Arpaia 2002) could broaden the genomic basis of avocado adaptation in regions with contrasting and complex ecologies, such as the northern Andes. Families and farmers also keep a reservoir of native avocado trees in their backyards as traditional orchards and living fences (Galindo-Tovar et al. 2008). These old avocado trees may too source novel rootstock adaptations (Cañas-Gutiérrez et al. 2022), so far unseen in current commercial plantations (Reyes-Herrera et al. 2020). Yet, satisfying the growing demand (Reyes-Gómez et al. 2023) for high-quality avocados (Astudillo-Ordóñez and Rodríguez 2018) with uniform superior rootstocks is an urgent requirement in current nurseries and plantations (Ernst 1999), for which diversified clonality may be more appealing (Ingvarsson and Dahlberg 2019). Even in this case, cryptic Colombian avocado variation may provide the desired adaptive allelic combinations and broad sense heritiabilities to be clonally propagated as a panel of elite rootstock genotypes.

Meanwhile, rootstock breeding could be parallelized with genetic screening for innovative fruit quality traits to modernize and diversify the fresh avocado market beyond Hass. It will not be surprising that arcane avocado gene pools and natural segregation continue sourcing new fruit varieties, as demonstrated by the unexpected discovery of Hass-like cultivars such as Carmen, Gem, Gwen, Maluma, and Méndez (Kremer-Köhne 1999). Regional consumption in Central and northern South America may even offer unexploited market targets for non-Hass fruit types because consumers in those regions are historically used to commercialize and consume “criollo” avocados, often experiencing a Hass aversion. After all, market preference in northern South America tends to prioritize lowland West Indian large and watery fruits, many of which are unknown outside local villages.

Bridging genomic divergence with morphological differentiation

Morphological analyses in fruit tree crops like avocados are fundamental to corroborate the horticultural racial distinction among differentiated genetic clusters. Each classical avocado group exhibits specific characteristics in its morphology, phenology, ecology, and adaptation (Barrientos Priego 2010). However, cryptic populations could be geographically isolated across different agro-climatic conditions without notorious phenotypic differentiation (Campos-Rojas et al. 2008), as expected at the initial phases of allopatric divergence (Coyne and Orr 2004). Definitory morphological characteristics appropriate for and exclusive to the Colombian population are still lacking, yet they may convey a clearer racial subdivision. López-Galé et al. (2022) suggested using fruit and seed traits to racially discriminate seed-donor “criollo” avocados. Oil quality could as well be an indicator to discern Colombian genotypes (Campos-Rojas et al. 2008).

Even without morphological distinctiveness, cryptic genetic diversity from the Colombian population can still offer standing variation for adaptation to regions where avocado plantations are expanding, while coping with in situ climate change effects (Aitken and Whitlock 2013). Future studies may rely on species distribution modeling (SDM) using current and forecasted climate data, as per geo-referencing of herbaria specimens and germplasm (López-Hernández and Cortés 2019), to predict the future potential distribution of avocado groups (Franklin 2010; Peterson et al. 2011), as well as their local genetic adaptation (Cortés et al. 2022). This approach will enable the prediction of the most suitable environmental conditions for avocado plantations in light of the oncoming conditions as part of an assisted gene flow platform targeting long-lived perennial crops such as fruit trees (Aitken and Bemmels 2016). Current ecological preferences (temperature, humidity, altitude) for planting avocados from different races (Ashworth and Clegg 2003) may be uncoupled with the most dramatic oncoming climatic scenarios. Facilitation and antagonistic biotic interactions could as well be distorted under changing climate. In both cases, the Colombian populations will very likely provide novel alternatives. For instance, the Colombian group closer to the West Indian lowland race seems better adapted to humid tropical areas, a candidate reservoir for rootstock selection of resistance to soil-borne diseases, including P. cinammomi (Gross-German and Viruel 2013).

In short, although this study is the first to unveil the Colombian avocado population as a genetically distinct cluster, its phenotypic novelty is yet to be determined. Newly available reference alignments for avocado have increased the resolution for genomic scans (Solares et al. 2023) and haplotype resolved phasing (Nath et al. 2022), which could, in turn, assist the discovery of previously unnoticed segregating trait variation via reverse genetics. Meanwhile, underutilized allelic combinations within the Colombian cluster can source avocado rootstock adaptive breeding and cultivar diversifying selection.

Conclusions

This study identified two ancient Colombian genetic clusters of avocados beyond the three traditionally recognized races. These cryptic groups were genetically different and presented minor heterozygosity scores, with possible origin from the West Indian group (alternatively, although less parsimonious, it could also be the case that the West Indian radiate from the Colombian group). The Colombian sub-populations are in two distinct geographic regions, the Andes and the Caribbean, which may reflect divergent local adaptation after the initial colonization from Mesoamerica during the Pleistocene. Exploring the avocado genetic resources in South America will allow identifying genotypes with superior characteristics adapted to diverse agro-ecologies, which may source the selection of new cultivars and rootstock genotypes.