Introduction

The genetic differentiation of populations is the basis of evolutionary divergence and speciation, and a key focus of research in evolutionary biology (Hufford and Mazer 2003). Most studies focus on changes in allele frequencies associated with population divergence, but there are also other population parameters which change (Bohonak 1999; Metcalf and Pavard 2007) and can in turn influence their evolutionary trajectories (Trickett and Butlin 1994). Such parameters include genetic diversity, heterozygosity and linkage disequilibrium (LD), all of which define the recombination landscape of a population (Backström et al. 2010; Booker et al. 2017). This landscape may vary not only between populations but also within the genome (Sperisen et al. 2001; Colonna et al. 2014; Gion et al. 2016; Phillips et al. 2018; Soltis and Soltis 2021). The resolution at which such genome-wide variation can now be studied has greatly improved with advances in DNA sequencing and the availability of genome assemblies (Phillips et al. 2018; Soltis and Soltis 2021). These advances have opened the path to more detailed understanding of population and genome variation in LD, which can also be practically applied, for example in genomic selection and genome-wide association genetic studies (GWAS) to inform the density of markers required (Goddard and Hayes 2007; Slatkin 2008; Neale and Kremer 2011).

The non-random association of alleles at different loci, as measured by LD, typically decays with increasing physical separation between loci, and variation in this distance is a population parameter that can influence future recombination outcomes (Slatkin 2008). LD can be studied at various scales, from within genes up to genome-wide averages, with the scale chosen resulting in different aspects of population history and evolution being revealed (Mueller 2004). For instance, examining LD decay distance within a gene can provide insights into the variability inherent in that gene and the associated local rate of recombination and gene conversion (McVean et al. 2004; Xing et al. 2007; Slavov et al. 2012). Alternatively, as LD is influenced by the interplay of gene flow, selection and drift (Slatkin 2008), genome-wide LD can be used to infer population history events. For example, the detection of LD differences between populations can be indicative of past genetic bottlenecks or hybridisation events experienced by one population and not another (Zhang et al. 2004; Noonan et al. 2006). Notable examples of this have been found in human studies, with variable LD observed in different populations thought to have arisen from genetic bottlenecks arising from founder effects (Park 2019). These bottlenecks often have a broad-scale effect on the genome, creating a genome-wide LD signature (Slatkin 2008). In contrast, selective sweeps may cause localised increases in LD around relatively recent advantageous polymorphisms due to insufficient time for recombination to occur (Kim et al. 2007). Localised differences in LD among populations may also arise through variation in chromosomal architecture such as inversions and translocations (Faria et al. 2019; Mostert-O'Neill et al. 2021).

The breeding system of a species will heavily influence the extent of population differentiation. The novel selection pressures faced by crops, for instance create distinctive signatures of extensive LD between different gene pools and populations (Hamblin et al. 2011), which has been observed in many species under domestication (Caldwell et al. 2006; Mather et al. 2007; Chen et al. 2012). While population differences in LD have been observed in relatively few plant species in the wild, some of which include the herb Arabidopsis thaliana (Ostrowski et al. 2006) and the forest tree Populus trichocarpa (Slavov et al. 2012), these impacts are still prevalent. Most forest tree species are outcrossing, leading to high genetic diversity and rapid decay of LD compared to many other plant species (Neale and Kremer 2011; Plomion et al. 2016). Broadly similar genome-wide average LD decay distances ranging from 3 to 6 kb have been observed in tree genera such as Populus (Slavov et al. 2012) and Eucalyptus (Silva-Junior and Grattapaglia 2015) in studies employing 21–30 k markers, although studies in very short regions or employing more dense marker sets have revealed major drops in LD occur within 16–900 bp (Eckert et al. 2009; Arumugasundaram et al. 2011; Marroni et al. 2011; Lu et al. 2016; Murray et al. 2019) (see Discussion).

Belonging to the family Myrtaceae, Eucalyptus is a genus of economically and ecologically important forest trees native to Australia and islands to its north, but grown all over the world for pulp, charcoal and solid wood (Doughty 2000). Eucalyptus globulus, native to south-eastern Australia (Dutkowski and Potts 1999), is one of the major hardwood species grown in pulpwood plantations in temperate regions of the world (Potts et al. 2004; Downham and Gavran 2020; Tomé et al. 2021). A predominantly outcrossing species (Hardner and Potts 1995; Mimura et al. 2009), E. globulus has been classified into 13 races based on distribution and divergence in quantitative genetic traits (Dutkowski and Potts 1999). Genetic structure studies using microsatellite markers have postulated five broad genetic groupings of the races (Yeoh et al. 2012). However, there is only moderate neutral marker differentiation between these races (F ST = 0.09) despite large disjunctions across the species’ natural range, consistent with ongoing, long-distance pollen dispersal (Steane et al. 2006; Mimura et al. 2009; Yeoh et al. 2012). The effect of gene flow is superimposed on a complex history, with evidence for ancient hybridization (McKinnon et al. 2001, 2004b), and environmental adaptation at multiple scales (James and Bell 2000; Foster et al. 2007; Dutkowski and Potts 2012; Freeman et al. 2018; Costa e Silva et al. 2020). Previous studies have used low-density markers to examine population genetic structure (Steane et al. 2006; Yeoh et al. 2012; Costa e Silva et al. 2017) and coarse, genome-wide LD decay (Cappa et al. 2013; Durán et al. 2017) in E. globulus, but access to a more extensive set of markers would allow finer-scale characterisation of genetic structure and differentiation across the genome, and permits comparisons of patterns of LD decay distance.

To this end, this study aims to elucidate the population genetic structure and genome characteristics of E. globulus using a high density set of molecular markers. We examine how LD decay and genomic diversity vary among the different races of E. globulus as well as within and among chromosomes. We also identify genomic regions contributing disproportionately to race and chromosome differentiation. We hypothesise that the various races will exhibit different signatures of LD and diversity at the genome-wide level due to their different evolutionary histories.

Methods

An E. globulus provenance-progeny trial, planted in 1989 near Latrobe in Tasmania, Australia, by Gunns Ltd (now Forico Pty Ltd) was selected for sampling. The trial was established with 570 open-pollinated E. globulus families collected from wild mother trees, with each family represented by a two-tree plot, in each of five replicates. Trees were sampled from this trial within the races of Furneaux (FURN), King Island (KI), West Coast (WC), North-eastern Tasmania (NETAS), South-eastern Tasmania (SETAS) and Southern Tasmania (STAS), along with combined sampling from Gippsland and the Strzelecki Ranges (GIPP), and Western and Eastern Otways (OT) (Fig. 1), until approximately 20 trees from different families had been sampled per race or race combination (hereafter simply referred to as race) (Thavamanikumar et al. 2014). DNA was extracted from cambial scrapings as described in Tibbits et al. (2006).

Fig. 1
figure 1

Geographical distribution of Eucalyptus globulus mother trees sampled from each race in the genetic trial. The grey area indicates the natural distribution of E. globulus. The number in brackets indicates the number of OP progeny sourced from different families from each locality. Mother trees were only represented once per race

Genotyping by whole genome shotgun sequencing

Genomic libraries were prepared using Nextera DNA Library Preparation Kit (Illumina). Double-stranded genomic DNA was used as input for the Nextera ‘tagmentation’ reaction (DNA fragmentation and tagging). The reactions were performed as per the manufacturer’s protocol, replacing the column clean-up (disassociation) step with a 0.6% SDS (sodium dodecyl sulphate) treatment. Library amplification was performed using Nextera Index 1 (i7) and Index 2 (i5) adapters on a Stratagene Mx3005P™ qPCR system (Agilent Technologies), with a thermal profile as follows: 72 °C for 3 min; 98 °C for 30 s; and 5 cycles of 98 °C for 10 s, 63 °C for 30 s and 72 °C for 3 min. SYBR Green I (Thermo Fisher Scientific) in 1:1000 dilution was added to the PCR mixtures for monitoring the amplification process. Amplified libraries were pooled and purified with MinElute PCR Purification Kit (Qiagen) or AMPure XP SPRI (Solid Phase Reversible Immobilisation) paramagnetic beads (Agencourt, Beckman Coulter). Purified library pools were then subjected to size selection using gel electrophoresis by excising the DNA fragments within the desired size range (300–450 bp). The excised libraries were extracted from the gel with QIAquick Gel Extraction Kit (Qiagen) prior to quality assessment with High Sensitivity D1000 ScreenTape using a 2200 TapeStation system (Agilent Technologies). Once assessed, libraries were sequenced using a HiSeq 2000 Sequencing System (Illumina) at the Victorian Government Department of Economic Development, Jobs, Transport and Resources (DEDJTR), Centre for AgriBioscience (AgriBio), Bundoora.

Alignment to a draft genome assembly and SNP calling

A draft genome assembly for E. globulus (X46 v33) was generated from whole genome Illumina paired-end, mate-pair and synthetic long reads from an E. globulus clone (X46) sequenced in a previous project (Bayly et al. 2013; Myburg et al. 2014), and was reference-initiated through alignment to the E. grandis reference genome (Myburg et al. 2014). The Vision software suite (Gydle Inc.) was applied in rounds of iterative automated resolution steps with interactive consensus editing, assembly visualisation and manual curation of end junction sequences to generate the assembly version used in this work. The reference used to initiate the process is not assumed to have the correct linear order and the resolving process specifically detects rearrangements and inversions breaking and re-joining the assembly where conflicts with the raw data are detected. The final draft assembly comprised 13,674 contigs, had a total length of 530.65 Mbp and a contig N50 of 84.7 kbp. While the final assembly was not scaffolded, genome structure is known to be highly conserved between E. globulus and E. grandis (Hudson et al. 2012), so contigs were anchored to the E. grandis reference genome (Myburg et al. 2014) based on comparable sequence to allow contigs to be assigned to one of the 11 chromosomes. A total of 1,162 contigs with length > 100 kbp representing 245 Mbp of the genome were used in this study (see below).

The raw sequence reads from the sampled E. globulus mother trees were quality filtered and aligned using Nuclear v3.2.6 (Gydle Inc., Québec, Canada) to the draft E. globulus genome assembly (X46 v33). SNP variants were discovered using find-snp v2.6.16 (Gydle Inc.). Variants were output in VCF (Variant Call Format) v.4.2 format and filtered or manipulated using VCFtools v0.1.15 (Danecek et al. 2011).

SNP filtering and analysis

From an initial set of 5,594,130 SNPs, data were filtered using VCFtools v0.1.15 (Danecek et al. 2011), initially to SNPs residing on contigs of size > 100 kbp, and individuals to those with < 15% missing data. SNPs were then filtered for quality, with SNPs exhibiting > 10% missing data, a minor allele frequency < 0.2, a minimum read depth < 4 or non-biallelic SNPs removed. To minimise the impact of genotyping errors and paralogous loci, SNPs that deviated from HWE in an extreme manner (i.e. p < 1 × 10−7) were also removed (Pavan et al. 2020). Finally, contigs with < 30 SNPs were removed, resulting in a set of 203,337 SNPs across 1,162 contigs for LD and F ST analysis (see below). The relatively low minimum read depth threshold of 4 was used to achieve the target of at least 30 SNPs per contig examined (for a minimum of 435 SNP pair comparisons). With this relaxed threshold, there was a risk of introducing bias towards homozygotes due to allelic dropout and therefore inflating LD estimates (Nielsen et al. 2011), so average read depth of each race was computed to compare with estimates of LD. The minor allele frequency threshold of 0.2, which is conservative compared to other LD studies (Slavov et al. 2012; Silva-Junior and Grattapaglia 2015), was also adopted to help compensate for the low read depth (in places) of the sequencing data and limit the impact of genotyping errors. Generally lower coverage and a high proportion of missing data resulted in the West Coast race being entirely excluded based on the above criteria.

Population structure between the races was examined using both Admixture v1.3 (Alexander et al. 2009) and fastStructure v1.0 (Raj et al. 2014), using a subset of the above markers. This subset was generated by further filtering to a more stringent missing data rate < 6% and pruning marker pairs in LD (r 2 > 0.2), resulting in a subset of 19,907 unlinked SNPs. In both algorithms (using default parameters), three independent runs were performed for each value of k between 2 and 8. The most probable value of k was defined by cross-validation error in Admixture, or maximised likelihood in fastStructure. The Admixture analysis was also repeated for each chromosome separately once the most probable value of k was selected. Results were visualised using R (R Core Team 2019), with the colour-blind-friendly palette generated using the package viridis (Garnier 2017).

To complement the results from the structure analysis, a principal component analysis (PCA) was undertaken using the same subset of SNPs and individuals. Population differentiation was also examined with F ST using the same subset of markers and the package hierfstat (Jombart 2015). Pairwise F ST were also calculated for each chromosome separately, weighted using the Weir and Cockerham (1984) estimator. To visualise the pairwise results, unweighted pair group method with arithmetic mean (UPGMA) trees were generated using the package phangorn (Schliep 2011). To examine areas of the genome that were contributing to the differentiation, global F ST (i.e. between all pairwise combination of races) was calculated for every contig using the sliding window approach implemented in VCFtools v0.1.15 (Danecek et al. 2011) with the window size set to span entire contigs. These contigs identified to have elevated F ST values were aligned to the E. grandis reference genome, and annotated genes in these regions were tested for enrichment of GO terms using the R package SNP2GO (Szkiba et al. 2014).

Pairwise estimates of linkage disequilibrium (LD) measured by the squared correlation of allele frequencies (r 2) were obtained across every contig, both within races and across races (global estimates) using PLINK v1.9 (Chang et al. 2015) and the set of 203,337 SNPs. A decay curve was fitted for each contig using a nonlinear regression of pairwise LD based on the expectation of r 2 for drift-recombination equilibrium (Hill and Weir 1988) using an R script created by Marroni et al. (2011). As numerous thresholds for measuring LD exist in the literature, we explored the impact of three different thresholds. The first two thresholds, used in other eucalypt studies, employed the decay curve to measure average base pair distance to both a standard r 2 decay threshold of 0.2 and to half-maximum decay (Silva-Junior and Grattapaglia 2015; Murray et al. 2019; Otyama et al. 2019). The third threshold was set to the background LD level within the dataset (Falush et al. 2003; Mather et al. 2007). Background LD was established by selecting five random contigs assigned to each of the 11 chromosomes and calculating r 2 between all SNP pairs possible (excluding those within contig). The upper 95th percentile of these values was set as the background LD threshold. Using the individual contig r 2 estimates, average decay distance and standard errors were calculated at both the race and chromosome level, weighting for contig length and excluding those with curves that failed to reach convergence. Chromosome level comparisons were based on the global r 2 estimates for each contig, whereas race level comparisons were based on r 2 values estimated within each race for each contig, regardless of chromosome. Significant differences in decay distance between races and chromosomes were tested using 1-way ANOVA and Tukey’s HSD test, with the variation among contigs within either factor used as the error in each analysis. Potential patterns of short-range LD were explored by averaging the r 2 values between pairs of SNPs within 5 kb, divided into bins of 100-bp separation (i.e. average r 2 for SNPs separated by 1–100 bp, 101–200 bp etc.). This averaging of the r 2 values among the proximal pairs of SNPs ignored their contig of origin. Genetic diversity was measured using observed and expected heterozygosity and nucleotide diversity (π; setting window size equal to contig size) across contigs using PLINK v1.9, for comparison with the average distance to LD decay at the range-wide, race and chromosome (across race) levels, and tested for significance using ANOVA. To contrast population patterns of selection, Tajima’s D was also calculated at these levels using VCFtools. All results were visualised using the package ggplot (Wickham 2016) in R (R Core Team 2019).

Average LD distance of all contigs (weighted for contig length) were plotted according to their position on the 11 chromosomes. Curves were fitted to the data using locally weighted scatterplot smoothing (LOESS) polynomial regression with a smoothing parameter of 0.1 using R (R Core Team 2019). Peaks in LD were examined for significance by first identifying contigs that were above the 95th percentile in LD decay distance. To test that the clustering of these high LD contigs along the chromosome was greater than expected by chance alone, the position of every contig was randomised across the chromosome. Ten windows along the chromosome were delimited to include 10% of total contigs in each window, and the number of high LD contigs in each window was recorded. This randomisation was repeated 1,000 times. Peaks (within windows) containing a greater number of high LD contigs than the upper limit of the frequency distribution obtained by this bootstrapping were determined to be significant (p < 0.001). This process was repeated for each chromosome. Peaks in expected heterozygosity were also examined in this manner.

Results

Population structure analysis

After filtering SNPs to the criteria detailed in the Methods, 1,162 contigs with 203,337 SNPs were selected for use in LD analyses across 136 individuals. The average contig was 209,837 bp in length, with an average of 175 SNPs per contig spaced at an average of every 1,127 bp. Using a subset of 19,907 unlinked markers, population structure was examined using Admixture and fastStructure, fitting values of k from 2 to 8 (Fig. 2). Both algorithms found k = 4 to be the best at explaining the structure in the data, as it maximised the maximum likelihood values and showed the lowest cross-validation error in fastStructure and Admixture, respectively (Online Resource 1). At k = 4, there was a strong geographic influence on clustering as expected, broadly representing genotypes originating from the following: (1) Tasmania (STAS, SETAS and NETAS; purple group in Fig. 2); (2) the Furneaux (FURN) group of islands (green); (3) southern Victoria (Eastern Otways and Western Otways (OT), Strzelecki Ranges and Gippsland (GIPP); yellow); and (4) King Island (KI; dark blue). Most individuals share some level of genetic cluster identity with individuals from geographically neighbouring races, with the exception of FURN which shows very little admixture with GIPP. The apparent lack of admixture between SETAS and other races is also notable. These observations were also mirrored in a PCA on the same data (Online Resource 2).

Fig. 2
figure 2

Clustering of Eucalyptus globulus trees into different genetic groups using (a) Admixture and (b) fastStructure. Results are shown for values of k from 2 to 8. Each vertical line represents one genotype (an open pollinated progeny from a wild tree), and colours represent partitioning of the genotype to each cluster. Genotypes are grouped by race, including Furneaux (FURN), King Island (KI), North-eastern Tasmania (NETAS), South-eastern Tasmania (SETAS) and Southern Tasmania (STAS), with combined groupings of Gippsland with Strzelecki Ranges (GIPP), and Western Otways with Eastern Otways (OT)

The programs fastStructure and Admixture differed in the stability of genetic groupings past k = 4. The results from fastStructure consistently showed five major genetic groupings from k = 5–8 (with some instability in some genotypes at k = 8) with a distinct separation of GIPP and OT and a stable southern Tasmania group. Admixture was more variable, with NETAS and STAS becoming more distinct from SETAS at k = 5, and GIPP and OT separating at k = 6. Admixture also suggested a less homogenous KI clustering. Concerning the two combined groupings (GIPP and OT), the Strzelecki Ranges race did not separate from Gippsland at any values of k with either algorithm while the Western Otways and Eastern Otways races were only separated at k = 8 in Admixture, suggesting that the molecular genetic differences between these races are small and pooling them into single groups is reasonable.

To provide additional support for these groupings, population differentiation was examined using F ST values. The overall F ST value for the species was 0.059, indicating moderate differentiation between the races (Hartl and Clark 1997). Pairwise comparison of F ST showed that the SETAS and OT races were the most differentiated in allele frequencies, closely followed by SETAS and KI (Table 1). Broadly speaking, there was lower race differentiation within than between Victorian and Tasmanian races (Fig. 3), while the Bass Strait island races were somewhat intermediate, with FURN generally closer to Tasmanian races and KI generally closer to Victorian races, which is in agreement with the population structure (e.g. k = 2 in Fig. 2) and principal component (Online Resource 2) analyses. Examination of differentiation between the races (both in population structure and differentiation) one chromosome at a time revealed that the pattern of differentiation among races was also generally stable across chromosomes (not shown).

Table 1 Pairwise F ST calculated between each race of Eucalyptus globulus. Race abbreviations are explained in Fig. 2
Fig. 3
figure 3

Unweighted pair group method with arithmetic mean (UPGMA) phylogram of pairwise F ST values between the Eucalyptus globulus races

Examining population F ST across the genome revealed specific regions where F ST levels are elevated, indicative of regions of the genome that contribute highly to the differentiation observed among the races (Fig. 4). Specifically on chromosomes 1, 2, 6, 8, 9, 10 and 11, regions exceeding a conservative F ST value of 0.15 indicated a high level of differentiation (Hartl and Clark 1997). Examining the annotated genes within contigs at these regions of high differentiation in the E. grandis reference genome revealed 107 genes; however, these genes were not enriched for any specific gene ontology term (Online Resource 3).

Fig. 4
figure 4

Sliding window of global F ST across the Eucalyptus globulus genome. Positions are based on contig alignment with the Eucalyptus grandis reference genome. Chromosomes are delimited by vertical broken lines and labelled. Horizontal broken line indicates the F ST threshold for a window of high differentiation

Linkage disequilibrium and diversity

The distance to threshold decay (r 2 = 0.2), half-maximum decay and background decay (r 2 = 0.039) was calculated within all contigs and all samples in a global analysis, with genome-wide LD decaying to the threshold level within 3,974 bp, to half-maximum within 3,086 bp and to the background rate within 30,101 bp (Table 2). Of the approximately 25 million SNP comparisons, 90% fell below r 2 = 0.2, while 50% of most proximal SNPs (median distance 160 bp) also fell below this threshold. At short ranges, the largest reduction in average allele frequency correlation between SNP pairs was moving from distances of 1–100 bp (average r 2 = 0.37, n = 123,446) to 101–200 bp (average r 2 = 0.29, n = 90,373; Fig. 5). Average allele frequency correlation reaches r 2 = 0.2 within 2.2 kb in this graph, indicative of the variance in results obtained from different methodologies (in this case pooling across SNP pairs regardless of contig of origin versus averages of sliding windows or contigs). For a typical contig, it was extremely rare to see pairwise r 2 > 0.2 beyond 20 kb (Online Resource 4a), while for some contigs higher LD values were common up to 100 kb or beyond (Online Resource 4b & 4c). Although genetic indistinguishability (SNPs seemingly in perfect LD over long distances) can be a factor influencing LD estimates (Skelly et al. 2015), examination of contig distributions of LD decay distances typically revealed stable patterns with no long-distance indistinguishability (Online Resource 4).

Table 2 Linkage disequilibrium, genetic diversity and inbreeding coefficients in Eucalyptus globulus races
Fig. 5
figure 5

Average squared allele frequency correlation (r 2) between SNP pairs within 5 kb in Eucalyptus globulus based on global estimates ignoring race of origin. Each bar represents SNP pairs divided into bins of 100-bp distances (1–100 bp, 101–200 bp etc.)

Strong differences were observed amongst races, with average decay distance to threshold level varying at the race level from 4,646 to 12,249 bp (F 6,8071 = 59.9, p < 0.001), to half-maximum varying from 3,064 to 8,816 bp (F 6,8071 = 51.3, p < 0.001) and to background level from 78,163 to 142,824 bp (F 6,8071 = 101.9, p < 0.001; Table 2). Chromosomes also differed significantly, with average global LD decay distance to the threshold level varying from 3,815 to 11,183 bp (F 10,8067 = 19.9, p < 0.001), to half-maximum varying from 1,116 to 7,761 bp (F 10,8067 = 18.7, p < 0.001) and to background decay rate varying from 72,241 to 132,154 bp (F 10,8067 = 25.9, p < 0.001; Table 3). Both chromosome and race had a significant effect on LD, with chromosomes 1 and 10 and races OT and SETAS showing significantly slower average LD decay across all contigs, and chromosome 5 showing the fastest average LD decay (Table 2, Fig. 6). The trends in the differences among chromosomes and among races were the same regardless of the threshold used and thus the commonly reported (Luikart et al. 2019) threshold decay (r 2 = 0.2) was chosen as the LD statistic to explore longer-range LD further.

Table 3 Linkage disequilibrium and genetic diversity for the Eucalyptus globulus chromosomes. The parameters presented are global estimates across all races
Fig. 6
figure 6

Average distance to LD decay (r 2 = 0.2) in Eucalyptus globulus in (a) races and (b) chromosomes. Error bars indicate ± standard error. Significant differences indicated by differing letters as determined by Tukey’s HSD test. Race abbreviations are explained in Fig. 2. Chromosome values are global estimates that ignore race of origin

Heterozygosity was also significantly influenced by race (F 6,136 = 27.6, p < 0.001), with range-wide expected heterozygosity (H e) of 0.423 but race H e ranging from 0.408 to 0.436, with the lowest diversity seen in OT, SETAS and KI (Table 2). An excess of heterozygotes was seen in all races, although the significantly lower observed heterozygosity (H o) in OT and SETAS (F 6,136 = 41.2, p < 0.001) made the difference between H o and H e in these races negligible. After comparing the average read depth per race, we found that OT and SETAS had the highest read depth of all samples despite their lower observed heterozygosity, suggesting no bias had been introduced due to allelic dropout caused by the read depth criteria adopted. Tajima’s D was significantly influenced by race (F 6,8127 = 223.5, p < 0.001), with races all exhibiting positive values (which may be upwardly biased due to the minor allele filtering applied) and ordered similarly to expected heterozygosity (Table 2, Table 3), suggesting higher rates of balancing selection or more recent bottlenecks in the more diverse populations. Nucleotide diversity was also significantly influenced by race (F 6,8127 = 17.3, p < 0.001), and ordered similarly to expected heterozygosity, with the exception of OT and SETAS which unexpectedly had slightly higher nucleotide diversity. As nucleotide diversity can be interpreted as the percentage of sites expected to differ between two randomly selected individuals in the population, this pattern appears to be caused by the greater diversity of observed genotypes in these less heterozygous populations. Each population statistic was on average more variable among races than among chromosomes (Table 3).

Characterising LD variation within chromosomes

Mapping LD decay across the genome showed variable decay distance along all chromosomes, with several showing slow LD decay in the central third (Fig. 7). Highly significant clustering (as determined by bootstrapping, p <  = 0.001) of contigs with slow LD decay was detected on all chromosomes barring 5, 7 and 11 (Online Resource 5). These peaks in LD decay distance were present (at various intensities) in the same chromosomal region across all races while spanning different contigs in the region, suggestive of common genomic features influencing decay distance. There are several smaller peaks present on other chromosomes (for example those in the first half of chromosome 9) but these were generally associated with areas of slow LD decay unique to the OT and SETAS races. These two races had the highest average genome-wide LD decay distance (Table 2) and, while there were unique peaks, for the most part, their contigs with slower LD decay compared with other races were spread across the genome (Fig. 7). Several of these race-specific contigs had LD decay distances of over 100 kb (> 25 X the global across-race estimate), suggesting large haplotype blocks exist in these genomic regions in these races.

Fig. 7
figure 7

LD decay distance along each Eucalyptus globulus chromosome by race. Each point represents the average distance to LD decay (r 2 = 0.2) for a contig in a race, with chromosome position of the contigs based on synteny with the E. grandis reference genome. Curve drawn using locally weighted scatterplot smoothing (LOESS) polynomial regression with a smoothing parameter of 0.1. Contigs associated with the two races with the highest average distance to LD decay are coloured

Reduced diversity (i.e. expected heterozygosity) is generally associated with extended distance to LD decay (Smith et al. 2005), which we observed broadly at the race level, although not reaching significance (r =  − 0.61, p = 0.14; Table 2). When comparing contig-level expected heterozygosity to the chromosome LD peaks, very few genomic regions with reduced diversity featured increased LD (Fig. 8), with the exception of a small number of low-diversity contigs near the LD peak on chromosomes 2 and 10. Bootstrapping of contig positions on chromosome 2 showed the clustering of low diversity contigs in this window associated with the peak in LD decay distance was significant (p < 0.01). Races were relatively consistent in their genetic diversity across chromosomes (with OT and SETAS exhibiting consistently lower diversity). Contig-level observed heterozygosity showed a similar pattern, with OT and SETAS exhibiting much lower heterozygosity overall and a significant (from bootstrapping, p < 0.01) association of slow LD with contigs exhibiting low observed heterozygosity on chromosome 2 (Online Resource 6).

Fig. 8
figure 8

Expected heterozygosity (H e) along each Eucalyptus globulus chromosome. Each point represents a contig position based on synteny with the E. grandis reference genome for each race. The solid curve shows the average expected heterozygosity across all races, drawn using locally weighted scatterplot smoothing (LOESS) polynomial regression with a smoothing parameter of 0.1. The dotted curve shows the smoothed LD decay curve along chromosomes, averaged across all races (see Fig. 6). Contigs associated with the two races with the highest average distance to LD decay are coloured. Red arrows along the x-axis indicate positions of contigs with elevated F ST (see Fig. 4)

Notable findings from a previous study of recombination rate in E. globulus were an unusually strong positive correlation between gene density and recombination rate, and a negative correlation between recombination rate and genetic diversity, both at the chromosome level (Gion et al. 2016). At the chromosome level, there was no significant correlation between gene density (obtained via annotated genes in the E. grandis reference genome) and average LD decay distance (r = 0.625, p > 0.05, Online Resource 7). However, examining gene density versus proportion of contigs with high LD decay distances in ten windows across each chromosome revealed a weak negative correlation (r =  − 0.32, p < 0.001, Online Resource 8) which, if assuming high LD as a proxy for lower recombination, would be consistent with Gion et al. (2016).

Discussion

This study contributes to the increasing literature regarding genome-wide characterisation of population characteristics in wild plants (Ostrowski et al. 2006; Slavov et al. 2012; Silva-Junior and Grattapaglia 2015; Murray et al. 2019). Several key points have emerged. First, the genetic structure and degree of differentiation between E. globulus races was largely shaped by the disjunctions in its distribution, as observed in previous microsatellite studies (Steane et al. 2006; Yeoh et al. 2012). Secondly, the genome-wide LD decay distance estimated with our SNP data set varied significantly between the races of E. globulus. Regardless of the LD threshold used, the South-eastern Tasmanian and Otways races exhibited greater than expected differences from their neighbouring races, with significantly greater LD decay distances and lower observed heterozygosity, along with lower outbreeding (F) and expected heterozygosity, likely reflecting their distinct evolutionary histories. Lastly, some genomic regions contributed disproportionately to both differentiation and LD decay distance across all races (Fig. 7), possibly indicating regional genomic features such as repetitive content or chromatin architecture influencing these estimates.

Variation between species

Eucalyptus globulus has a mixed mating system but is predominantly outcrossing (Mimura et al. 2009) and thus would be expected to have reasonably low genome-wide average LD in contrast to plants with higher levels of selfing (Nordborg 2000; Flint-Garcia et al. 2003). Indeed, the global LD decay distance to r 2 = 0.2 averaged across contigs of 4 kb estimated with our genome-wide E. globulus SNP data set is quite low in comparison with annual selfing plant species such as wild soybean (75 kb, Lam et al. 2010), chickpea (450 kb, Bajaj et al. 2015) and mungbean (60 kb, Noble et al. 2018), and perennial outcrossing species such as Arabidopsis lyrata (> 100 kb, Hämälä and Savolainen 2019), but similar to that estimated for other outcrossing tree species such as E. grandis (4–6 kb, Silva-Junior and Grattapaglia 2015) and poplar (3–6 kb, Slavov et al. 2012).

However, direct comparisons of LD decay distance between studies are problematic and affected by factors including variations in methodology, filtering (especially minor allele frequency which can cause LD inflation; Boyles et al. 2005), and marker type and density, as well as the point on the decay curve comparisons are made. Therefore, it is important to consider the scale at which LD is being examined and the conclusions which can validly be drawn from results at the scale examined. For example, the average LD decay distances to r 2 = 0.2 of 3–6 kb for E. grandis and poplar employing 21 k–30 k markers (Slavov et al. 2012; Silva-Junior and Grattapaglia 2015) are in marked contrast with distances to half-maximum decay of 92–113 bp reported in another Eucalyptus study examining LD with millions of SNPs (Murray et al. 2019), which is likely to reflect the different patterns and processes impacting long- and short-range LD, respectively (Andolfatto and Nordborg 1998; Frisse et al. 2001; Pritchard and Przeworski 2001; Tenesa et al. 2007; Slavov et al. 2012). Similarly, even within E. globulus, variable LD decay distances have been reported, no doubt representing different scales of study, marker densities and population characteristics. An early study of SNPs in 12 candidate genes in four native E. globulus populations, for example suggested that LD within genes decayed rapidly to an r 2 = 0.2 within 500 bp (Thavamanikumar et al. 2011), whereas relatively low-density genome-wide studies of E. globulus breeding populations suggest that the distance to reach this average LD threshold is greater (Durán et al. 2017; Ballesta et al. 2018, 2019), and close to our global estimate of 4 kb (although the estimate from one complex breeding population reported r 2 = 0.2 at 21.8 kb (Quezada et al. 2022)). Evidence for the impact of marker density/scale of observation on estimates of LD decay distance comes from our fine-scale examination of global LD decay distance using the subset of markers separated by less than 5 kb, which show much more rapid decay than our contig average estimates. The steepest decline in LD occurred between 100 and 200 bp, consistent with the distance to half-maximum decay reported in Murray et al. (2019). This initial rapid LD decline over short genomic distance no doubt reflects short-range LD, shaped by ancient mutations and recombination events (Tenesa et al. 2007), along with the complicating factor of gene conversion (Andolfatto and Nordborg 1998; Frisse et al. 2001). In contrast, the LD decay threshold of r 2 = 0.2 on which we focused likely reflects a long-range LD more influenced by events in relatively recent population history such as bottlenecks, admixture and selection (Tenesa et al. 2007; Slavov et al. 2012), as well as chromosomal structural features (Vallejo et al. 2018; Todesco et al. 2020; Valenzuela et al. 2021).

Long-lived woody species like forest trees are known for their generally higher diversity compared to other life forms (Hamrick et al. 1992; Broadhurst et al. 2017), and those that outcross like poplar and pine often show higher observed than expected heterozygosity (Lu et al. 2016; Bothwell et al. 2017). An excess of heterozygotes was observed across all E. globulus races, consistent with the outbreeding nature of E. globulus, animal-mediated pollen dispersal (Mimura et al. 2009) and strong-post-dispersal selection against the products of inbreeding (Costa e Silva et al. 2014; Nickolas et al. 2019). While heterozygote excess was not observed in previous studies of E. globulus races (Jones et al. 2006; Steane et al. 2006; Yeoh et al. 2012), this is likely due to the nature of the markers used (all the cited studies used microsatellites), as the presence (Ollitrault et al. 2015), direction and intensity (Ebadi et al. 2019; Zimmerman et al. 2020) of the difference between observed and expected heterozygosity can vary widely between microsatellites and SNPs.

Variation between races

Population structure analyses suggested that the main divergence between E. globulus races is found between the mainland and Tasmania (k = 2), in agreement with previous microsatellite studies in E. globulus (Steane et al. 2006; Yeoh et al. 2012), with the geographically intermediate island races in western and eastern Bass Strait markedly different from each other. Accordingly, the best structure grouping was the four genetic groups: Victoria, King Island, Tasmania and the Furneaux islands. However, our next best grouping in the fastStructure analysis separated the mainland groups Gippsland and Otways consistent with microsatellite studies reported by Yeoh et al. (2012), Costa e Silva et al. (2017) and Thavamanikumar et al. (2014). Indeed, the five-group structure is the most consistent with the disjunctions between the sampled races, i.e. those imposed by Bass Strait and the relatively short disjunction between the Otways and Gippsland races. The pairwise differentiation between races also matched the relationships observed by Callister et al. (2021) (with the exception of the King Island and Otways pairing, possibly due to our pooling of the two eastern and western Otways races). It should be noted that the current study was unable to include the small disjunct western races of E. globulus which, in other analyses, link with the King Island and Otways races (Steane et al. 2006; Yeoh et al. 2012; Callister et al. 2021). In combination with evidence from shared chloroplast haplotypes (which are maternally inherited and therefore dispersed by seed only), this provides further evidence of a western migration route linking the Otways to Tasmania via a land-bridge formed in glacial periods (McKinnon et al. 2001) and now only above water through King Island. Shared chloroplast haplotypes between Gippsland and the Furneaux Islands, but not between these races and Tasmanian races (Freeman et al. 2001) argues for a historic barrier to seed dispersal between the Furneaux Islands and Tasmania. There are also some shared chloroplast haplotypes between Western Tasmania and the races on the east coast of Tasmania. In contrast, this study showed admixture in the nuclear genome between the north-eastern Tasmania and the Furneaux Islands races, consistent with the findings of chemical and microsatellite studies (Wallis et al. 2011; Yeoh et al. 2012). All these patterns can be reconciled by a ring migration model, whereby E. globulus migrated southward by seed to Tasmania from the Otways, onto the west coast of Tasmania, through Southern Tasmania and northward on the east coast of Tasmania but not reaching the Furneaux Islands by seed from Tasmania (Freeman et al. 2001). The Furneaux islands were then likely populated by seed from the Gippsland area, with subsequent pollen-mediated gene flow causing the admixture seen in the nuclear genome between the Furneaux Islands and Tasmania.

While LD decay distance varied among races in E. globulus, the average global LD decay distance for the combined races was less than almost any value obtained for a single race, at all thresholds of r 2 (barring one instance of race half-maximum decay). This is expected, as LD measurements are a function of allele frequencies and thus pooling sub-populations is likely to average any race-specific divergences in allele frequency (Evans and Cardon 2005). Therefore, a species-wide estimate is likely to under- or over-estimate LD decay distance in the presence of population structure, as has been previously reported in Populus trichocarpa (Slavov et al. 2012) and E. grandis (Silva-Junior and Grattapaglia 2015), and will obscure signals of distinct evolutionary histories experienced by sub-populations. For instance, in P. trichocarpa LD decay distance varied from around 2.5 kb in one sub-population up to > 6 kb in several others, with the higher values thought to be caused by bottlenecks or founder effects experienced over the last few hundred generations (Slavov et al. 2012). In the present case, reconciling the slow LD decay and low heterozygosity in the South-eastern Tasmanian and Otways races with their evolutionary history is difficult. These races of E. globulus both coincide with/are adjacent to hypothesised forest glacial refugia (Freeman et al. 2001; McKinnon et al. 2004a; Nevill et al. 2014), and there is evidence of extensive ancient hybridization with other eucalypt species in south-eastern Tasmania (McKinnon et al. 2004b). Expectations would suggest that these refugia should harbour higher than average genetic diversity relative to the other races (Médail and Diadema 2009), but this was not seen in our SNP data. Indeed, the dominant signal in these populations is one associated with higher inbreeding, with almost no heterozygote excess, in contrast to that observed in the other races (Wright 1943). A higher F-statistic can indicate geographic sub-structuring (De Meeûs 2017), which could be argued for the Otways race as in our study it is a combination of the Eastern and Western Otways races. However, this is unlikely due to the homogeneity of the genetic structure in both Otways and South-eastern Tasmania and would not explain their low observed heterozygosity. While evidence from chloroplast sharing suggests that ancient hybridization may have affected the South-eastern Tasmania race, any effect on nuclear genome diversity appears to be small (McKinnon et al. 2004b). Likewise, while more recent hybridization can lead to increases in LD, it is also generally associated with increases in diversity (Fridman 2015; Goicoechea et al. 2015), which was not observed here. Population bottlenecks/founder events (such as post-glacial migration from refugia) occurring post-hybridization could potentially lead to a genome-wide signal of slow LD decay (Finlay et al. 2017), which is consistent with our observations in the Otways and South-eastern Tasmania races. In contrast, the impact of a selective sweep would generally be localised around the advantageous allele rather than genome-wide (Galtier et al. 2000). To this effect, we did observe areas with a coincidence of low diversity and slow LD decay, one on chromosome 2 which was most evident in the Otways race, and one on chromosome 10 most evident in South-eastern Tasmania (Fig. 7). As both genome-wide and localised patterns are present in these two races, it is possible both processes contribute to their slower than average LD decay. Furthermore, given there are also observed peaks in slow LD decay which appear to be specific to these two races (e.g. chromosome 9) that did not coincide with low heterozygosity peaks, the possibility that chromosomal rearrangement variants also contribute to their slower than average LD decay cannot be dismissed (Mostert-O'Neill et al. 2021).

Variation between and within chromosomes

There was distinct variation between chromosomes when race was accounted for, with chromosomes 1 and 10 exhibiting significantly slower LD decay than any other chromosome, and chromosome 5 exhibiting the fastest LD decay. At the broad chromosome-wide scale, differences in recombination rate are likely to be the main driver of differences in LD (Flint-Garcia et al. 2003). Considering broad chromosome characteristics, chromosome length is one factor that is known to influence recombination rate (Shen et al. 2017). Consistent with this observation, chromosomes 1 and 10 are among the shorter and chromosome 5 among the longer chromosomes in E. grandis and have the highest and lowest recombination rates, respectively (Gion et al. 2016). Although this is also likely the case in E. globulus given their known synteny (Hudson et al. 2012), the recombination rates are in the opposite direction as would be predicted by the LD values exhibited in this study (except in the presence of high gene density, see next section). However, chromosomes 4 and 9 are similarly short and chromosomes 3 and 8 are similarly long, all of which exhibit LD decay distances closer to the average. Another possibility contributing to chromosomal differences and some of the isolated peaks in LD is differences in chromosome structural polymorphisms, as reported in E. grandis (Mostert-O'Neill et al. 2021). Looking within chromosomes, large peaks of elevated decay distance were found in similar genomic regions in all races on most chromosomes, suggesting common genomic features influencing LD at these locations. Notably, the largest peaks were found on chromosomes 1 and 10, likely contributing to their significantly higher LD than other chromosomes. Some peaks also coincided with areas that exhibited high levels of F ST, which when taken together is suggestive of key areas of differentiation between the races. Interestingly, a similar pattern of global F ST values (elevated F ST in the same regions of the same chromosomes) has been observed before in a study of six different Eucalyptus species (Hudson et al. 2015). Many of the species-differentiating markers identified in that study align well with peaks in F ST in this study (Online Resource 9), implying a level of conservation in these features across the subgenus. These could represent large adaptive haploblocks (Todesco et al. 2020) caused by recombination-inhibiting structural variation such as inversions (Vallejo et al. 2018). Many of the peaks in LD and F ST were in a broadly central region of the chromosome and may also be indicative of supressed recombination around the centromere. There is cytogenetic evidence for centromere position in E. globulus chromosomes; however, we cannot identify homologous chromosomes between the reference genome and the cytology study (Ribeiro et al. 2016). Other elements of chromosome structure could also play a role, for example polymorphic transposable elements have been shown to elevate LD decay distance (Choudhury et al. 2019), while chromatin architecture has also been associated with changes in LD (Smith et al. 2005; Henderson 2012; Jabbari et al. 2018).

Genomic regions subject to selective sweeps due to adaptive genes will also result in reduced diversity and subsequent slower decay of LD (Kim and Nielsen 2004), which was observed on chromosome 2 in this study. This is more likely in areas of high gene density, as they are generally subject to higher rates of selection and subsequently exhibit lower diversity, such as seen in E. grandis (Silva-Junior and Grattapaglia 2015). Therefore, in regions of high recombination rate along with high gene density, the typical positive correlation between recombination rate and diversity may be reversed by purifying selection, as selection may have increased efficacy due to high recombination rate (Pál et al. 2001). Notably, broad selective sweeps have been shown to leave a signature of slow LD decay (Kim and Nielsen 2004), which would develop despite the high recombination rate as areas become more homogenous (Gion et al. 2016). We found chromosome average LD in E. globulus to be negatively correlated with diversity and positively correlated with gene density, consistent with this hypothesis.

Conclusions

Genetic differentiation was detected between the different races of E. globulus, and their groupings reflected the main disjunction across its distribution. The levels of linkage disequilibrium and diversity in E. globulus were comparable to other genome-wide and range-wide estimates in other outbreeding tree species, including eucalypts. This study emphasises the differences between short-and long-range LD and focuses on contig variation in longer-range decay (r 2 = 0.2), which is shown to vary between races, suggestive of specific evolutionary histories. Differences were also observed within and between chromosomes, suggesting genomic features or processes associated with recombination suppression or diversity reduction. The mechanisms underlying the variation in LD decay in specific areas of the genome require further study and will need to be considered carefully when performing marker-phenotype associations.