Introduction

Plants need to retain water to survive in dehydrating habitats (Yeats and Rose 2013). They achieve this through the plant cuticle, a hydrophobic membrane that covers most of their aerial organs (Liu et al. 2021). While the cuticle’s primary function is to restrict non-stomatal water loss and control the exchange of solutes and gases between plants and the atmosphere, it also protects against environmental stressors, such as high temperature, UV radiation, microorganisms and insects (Domínguez et al. 2011). As such, it is one of the most important adaptations that allowed plants to transition from aquatic to terrestrial environments (Bhanot et al. 2021). The cuticle is chemically composed of an insoluble polymer called cutin and a wax mixture that includes hydrophobic lipids consisting primarily of very long-chain fatty acids and their derivatives, such as aldehydes, alkanes, primary and secondary alcohols, ketones and wax esters (Lee and Suh 2015; Domínguez et al. 2017). The quantity and chemical composition of cuticular waxes vary across species, organs, tissues and time (Nawrath 2006) and can change in response to biotic or abiotic stress (Shepherd and Wynne Griffiths 2006; Lewandowska et al. 2020).

The biosynthesis of cuticular waxes in flowering plants is a complex process controlled by a large number of genes (Xue et al. 2017; Ai et al. 2022). Among these, the ECERIFERUM (CER) gene family plays an important role in the biosynthesis of very long-chain fatty acids, which are a main component of cuticular waxes (Hannoufa et al. 1993; Jenks et al. 1995). The most studied member of this family is the gene ECERIFERUM1 (CER1), which encodes the biosynthesis of alkanes through decarboxylation of fatty acid metabolites and is implicated in drought tolerance (Aarts et al. 1995; Bourdenx et al. 2011; He et al. 2022). Alkanes are usually omnipresent in the cuticle of most species and organs, where they frequently accumulate to high concentrations (Samuels et al. 2008), increasing drought tolerance by reducing the permeability of the plant surface (Li et al. 2019). Transgenic experiments also showed that another member of the CER gene family, ECERIFERUM3 (CER3), interacts with CER1 to produce alkanes in Arabidopsis (Bernard et al. 2012). CER1 and CER3 share a common ancestry and their proteins are structurally similar (Wang et al. 2019), having evolved from the fusion of ERG3/FAH and WAX2 domains, which are their N-terminus and C-terminus, respectively (Chaudhary et al. 2021). The WAX2 domain is crucial for wax synthesis as its mutations can drastically affect the total amount of wax (Rowland et al. 2007). Schematic representations of their exon–intron structure show 10 exons for CER1, while CER3 presents 11 exons in Arabidopsis (Rowland et al. 2007; Sakuradani et al. 2013). There are also two CER1 homologues identified in Arabidopsis. One of these, CER1-LIKE1, is known to interact with CER3 to produce alkanes of different chain lengths compared to those produced by CER1 and is expressed in different organs and tissues (Pascal et al. 2018).

Phylogenetic analysis with diverse species of Archaeplastida (i.e., land plants, green and red algae and glaucophytes) has shown that CER1 and CER3 copy numbers tend to increase as plants evolved. Thus, ancient terrestrial lineages such as bryophytes, lycophytes and ferns present a low number of CER1 and CER3 genes, while seed plants usually have more copies. Similarly, gymnosperms often show a lower number of copies of these genes than angiosperms, which can themselves vary within genera (Chaudhary et al. 2021). For instance, a phylogenetic study revealed that Quercus mongolica has a much-expanded number of CER1 copies compared with other species of this genus, with both tandem and dispersed duplicates detected (Ai et al. 2022). The authors suggested that this expansion could contribute to the adaptability of Quercus species to drought. Increasing the copy number has been suggested as an adaptation to environmental variation in several polyploid species, such as tobacco or wheat (Limin and Fowler 1989; Deng et al. 2012). In addition, an effect of copy number on function has also been reported for other genes, such as GhDREB1B in cotton, where chilling tolerance increases with copy number (Wang et al. 2021). The effect of gene structural variation, such as intron loss, on function and environmental adaptation has been less studied. One study showed that intron-poor members of the CIPK gene family are more highly expressed in response to drought stress than the intron-rich genes in soybeans (Zhu et al. 2016). No study has examined the relationship between variation in CER gene structure or copy number and environmental stressors such as drought.

Eucalypts are a group of trees and shrubs from the Myrtaceae family comprising the genera Eucalyptus L’Her. (~ 750 species), Corymbia K.D. Hill and L.A.S. Johnson (~ 100 species) and Angophora Cav. (10 species) that are naturally distributed in Australia and Malesia (Nicolle 2022a). Like the majority of Myrtaceae species, eucalypts are diploid with 2n = 22 (Grattapaglia et al. 2012). Eucalypts comprise a mixture of diverse and depauperate lineages, which are adapted to nearly every Australian environment (Thornhill et al. 2019; Slee et al. 2020). The adaptability and high growth rate of some species make eucalypts the most economically important hardwood trees worldwide for the production of timber, fibre and energy (Turnbull 1999). A key trait that is taxonomically and ecologically important in eucalypt species is the presence of glaucous waxy leaves (Barber 1955; Hallam and Chambers 1970). These waxes protect eucalypts not only from water loss (Hoffmann et al. 2013) but also from other environmental stressors such as frost (Keller et al. 2013), high radiation (Close et al. 2007), insects (Edwards 1982; Jones et al. 2002) and pathogens (Santos et al. 2019). The chemical composition and amount of cuticular waxes in eucalypts are influenced by both environmental and genetic factors (Koch et al. 2006; Gosney et al. 2016) and can be extremely variable between eucalypt species, with alkanes ranging from 0.6 to 74.3% of the total wax load (Li et al. 1997). Previous research showed quantitative trait loci (QTL) for wax yield (Gosney et al. 2016) and drought damage (Gosney et al. 2016; Ammitzboll et al. 2020) co-located with CER candidate genes in E. globulus. In E. grandis, a single copy of the gene CER3 was identified, while several copies of CER1 were detected (Chaudhary et al. 2021). Although their architecture is currently unstudied, gene families in eucalypts with multiple copies are often arranged in tandem duplicate arrays (Myburg et al. 2014; Li et al. 2015a; Healey et al. 2021), which may be the case here.

In this study, we performed a genome-wide survey for CER1 and CER3 genes across the main eucalypt lineages and compared them to those of other tree species of the Myrtaceae family, as well as Arabidopsis. We characterized their position in the genome, copy number, exon–intron structure and their phylogenetic relationships to better understand their evolution in eucalypts. We aim to determine (i) whether eucalypt CER gene duplication is ubiquitous across the genera, (ii) the extent to which CER genes exhibit structural variation, and (iii) if copy number or structural variation is associated with the species home-range environmental variation.

Materials and methods

Acquisition of CER sequences

To represent the taxonomic diversity of eucalypts, we used genome assemblies from 22 eucalypt species spanning the genera Angophora (1 species), Corymbia (2 species) and Eucalyptus (19 species), which cover all the subgenera and sections of the group that have a genome currently available. The genomes of another 6 tree species of the Myrtaceae family were included as an outgroup (Melaleuca alternifolia) or sister taxa (Psidium guajava, Rhodamnia argentea, Metrosideros polymorpha, Syzygium aromaticum, Leptospermum scoparium) for comparison with the eucalypts (Thornhill et al. 2015), giving a total of 28 tree species studied (Supplementary Table S1). All the genomes were obtained from public repositories (see Myburg et al. 2014; Izuno et al. 2016; Thrimawithana et al. 2019; Wang et al. 2020; Ahrens et al. 2021; Healey et al. 2021; Voelker et al. 2021; Ferguson et al. 2023) except for E. globulus, which was obtained from Agriculture Victoria, Australia, and Leptospermum scoparium, which was obtained from the Aotearoa Genomic Data Repository (https://www.genomics-aotearoa.org.nz/data). The E. grandis and C. citriodora genomes were obtained from Phytozome 13 (https://phytozome-next.jgi.doe.gov), whereas the remaining genomes were obtained from GenBank (https://www.ncbi.nlm.nih.gov/genbank/). Most of these GenBank genomes were de novo assembled into very large contigs and scaffolded into chromosomes using the E. grandis assembly as a reference (Ferguson et al. 2023). Melaleuca alternifolia and E. pauciflora were not assembled into chromosomes, while E. grandis, E. globulus, C. citriodora, Psidium guajava, Rhodamnia argentea, Metrosideros polymorpha and Syzygium aromaticum were assembled to chromosome level de novo. To check that the chromosome assignment and orientation of Psidium guajava, Rhodamnia argentea, Metrosideros polymorpha and Syzygium aromaticum matched the most syntenic ones in eucalypts, the genomes of these species were aligned against the E. grandis genome using minimap2 (Li 2018) and their chromosomes were oriented and renamed if needed (Supplementary Figure S1). The chromosomes of all other species were numbered and oriented following E. grandis, as is the convention in eucalypts.

The peptide sequences of CER1 (AT1G02205) and CER3 (AT5G57800) were obtained from the Arabidopsis genome (TAIR) via keyword search in Phytozome 13. We also obtained the peptide sequences of the two CER1 homologues in Arabidopsis (AT1G02190 and AT2G37700), which we named CER1a and CER1b, respectively, for inclusion in later analyses. The sequences of CER1 and CER3 were used in a tBLASTn search (Altschul et al. 1990) of all selected Myrtaceae genomes to find genomic regions that likely contain CER genes (e-value < 1e−03). The synteny of the genomic regions containing CER genes was examined through pairwise whole genome alignment with MUMmer (Marçais et al. 2018), using the same parameters as Ferguson et al. (2023), namely the tool nucmer (--maxmatch -l 40 -b 500 -c 200), and summarised using syri (Goel et al. 2019). Some alignments were sourced directly from Ferguson et al. (2023).
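The e-value filter applied to the tBLASTn hits can be illustrated with a minimal sketch. It assumes the hits were exported in the default 12-column BLAST+ tabular format (-outfmt 6); the function name filter_hits and the example rows are hypothetical and are not taken from the study.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def filter_hits(handle, max_evalue=1e-3):
    """Keep hits with e-value < max_evalue.

    Returns tuples of (query, subject, start, end, evalue), with start <= end.
    """
    kept = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        rec = dict(zip(COLUMNS, row))
        evalue = float(rec["evalue"])
        if evalue < max_evalue:
            # Minus-strand hits are reported with sstart > send; normalise.
            start, end = sorted((int(rec["sstart"]), int(rec["send"])))
            kept.append((rec["qseqid"], rec["sseqid"], start, end, evalue))
    return kept


# Hypothetical example rows (made-up identifiers and coordinates).
rows = [
    ["CER1", "Chr04", "62.1", "120", "40", "2", "1", "120", "5400", "5041", "2e-40", "150"],
    ["CER1", "Chr07", "30.0", "40", "25", "1", "10", "50", "100", "219", "0.5", "28.1"],
]
example = "\n".join("\t".join(r) for r in rows)
hits = filter_hits(io.StringIO(example))
```

Only the first example row passes the threshold; the second is discarded because its e-value (0.5) is above 1e−03.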

Identification of genes, pseudogenes and exon–intron structure

To determine gene coordinates and exon–intron borders of putative CER genes, we first obtained from GenBank or Phytozome the nucleotide sequences of the genomic region (± 1 kb from hit) identified above and then used GeneWise (Birney et al. 2004) through the online platform EMBL-EBI (https://www.ebi.ac.uk/Tools/psa/genewise) (Madeira et al. 2022) using the nucleotide and peptide sequences of the genes CER1 and CER3 from Arabidopsis. Putative genes with reading frame shifts or insertions/deletions leading to premature stop codons were classified as pseudogenes. The chromosome number and position within a chromosome of each CER1 and CER3 gene and pseudogene were identified in each Myrtaceae species. The relative disposition of CER genes and pseudogenes with other non-CER transcripts was checked in the genomes of E. grandis, E. globulus and C. citriodora, which are the only eucalypt species that currently have annotations. To do that, we used the tools JBrowse (https://phytozome-next.jgi.doe.gov/jbrowse/index.html) for E. grandis and C. citriodora and the Integrative Genomics Viewer (https://igv.org/) for E. globulus.
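The pseudogene criterion described above can be sketched in simplified form. This is not the GeneWise procedure used in the study: it assumes a hypothetical, already spliced coding sequence, treats a length that is not a multiple of three as a proxy for a frame shift, and flags any stop codon before the final codon as premature. The function name classify_cds is an illustrative assumption.

```python
# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_cds(cds):
    """Return 'gene' or 'pseudogene' for a spliced coding sequence (simplified)."""
    cds = cds.upper()
    if len(cds) % 3 != 0:
        # Crude proxy for a reading-frame shift caused by an indel.
        return "pseudogene"
    # Codons with ambiguous bases are translated as 'X'.
    protein = "".join(
        CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, len(cds), 3)
    )
    if "*" in protein[:-1]:
        # A stop codon before the last codon is a premature stop.
        return "pseudogene"
    return "gene"
```

For example, classify_cds("ATGGCTTAA") gives "gene", whereas classify_cds("ATGTAAGCTTAA") gives "pseudogene" because of the internal TAA codon.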

Conserved domains were identified on each gene using the online tool Batch CD-Search (https://www.ncbi.nlm.nih.gov/Structure/bwrpsb/bwrpsb.cgi) at NCBI’s Conserved Domain Database (Marchler-Bauer et al. 2010). The Gene Structure Display Server (GSDS v2, http://gsds.gao-lab.org/index.php) was used to visualise the exon–intron structure and position of conserved domains within genes (Hu et al. 2014).

Phylogenetic analysis

Multiple alignments of amino acid sequences of all Myrtaceae CER1 and CER3 genes as well as the Arabidopsis CER1, CER3, CER1a and CER1b were performed in MUSCLE (Edgar 2004). To mitigate potential errors, the multiple sequence alignment was inspected visually, assessing gap placement and potentially poorly aligned positions; no trimming was necessary as we found no inaccuracies. Using these aligned peptide sequences, a phylogenetic tree was generated in IQ-TREE v1.6.12 with 1000 ultrafast bootstrap replicates (Nguyen et al. 2015). The software FigTree v1.4.4 was used to visualise the phylogenetic tree (http://tree.bio.ed.ac.uk/software/Figtree).

Association of copy number and exon–intron structure with environmental variation

Values of ten environmental variables were obtained from the Atlas of Living Australia (https://www.ala.org.au/) for the natural distribution of 21 eucalypt species in Australia, with the exception of E. camaldulensis, which was excluded from the analysis because of its extremely wide distribution in Australia (Slee et al. 2020). The environmental variables included elevation (ELE) and nine climate variables, which were mean annual temperature (MAT), mean temperature of the coldest quarter (MTCQ, 3-month period), mean temperature of the warmest quarter (MTWQ), a thermic index of continentality (TIC) calculated as the difference between MTWQ and MTCQ (Tuhkanen 1980), mean precipitation of the wettest quarter (MPWQ), mean annual precipitation (MAP), annual mean radiation (RAD) and two moisture indices used to describe overall aridity. These indices were the annual heat–moisture index (AHM), which was calculated using the equation AHM = (10 + MAT)/(MAP/1000), and a modified summer heat–moisture index (SHM), which was calculated using the equation SHM = MTWQ/(MAP/1000) (Wang et al. 2006), replacing the mean warmest month temperature of the original equation by MTWQ. Records outside a species’ natural distribution were identified using a modified z-score outlier test based on the median absolute deviation of the mean annual temperature, mean annual precipitation and elevation (Iglewicz and Hoaglin 1993). Records with a z-score > 3.5 were checked manually against the known natural species distribution and deleted when appropriate (Jordan et al. 2016). For each of the 21 species, the average of each environmental variable was then calculated from the retained distributional records.
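The two moisture indices and the outlier screen can be sketched as follows. The sketch reads the denominator of both indices as MAP/1000 and assumes MAT and MTWQ in °C and MAP in mm; the modified z-score follows the usual Iglewicz and Hoaglin form, 0.6745 × (x − median)/MAD. Function names and the example values are hypothetical, and the absolute value of the score is compared with 3.5 so that outliers in both directions are flagged.

```python
from statistics import median


def ahm(mat, map_mm):
    """Annual heat-moisture index: (10 + MAT) / (MAP / 1000)."""
    return (10.0 + mat) / (map_mm / 1000.0)


def shm(mtwq, map_mm):
    """Modified summer heat-moisture index: MTWQ / (MAP / 1000)."""
    return mtwq / (map_mm / 1000.0)


def modified_z_scores(values):
    """Modified z-scores based on the median absolute deviation (MAD)."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # Scores are undefined when MAD is zero; flag nothing in that case.
        return [0.0 for _ in values]
    return [0.6745 * (v - med) / mad for v in values]


def flag_outliers(values, threshold=3.5):
    """Mark records whose absolute modified z-score exceeds the threshold."""
    return [abs(z) > threshold for z in modified_z_scores(values)]


# Hypothetical mean annual temperatures (°C) for occurrence records.
temps = [10, 11, 12, 11, 10, 50]
flags = flag_outliers(temps)
```

In this made-up example only the last record is flagged. In the study, flagged records were then checked manually against the known distribution rather than deleted automatically.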
To explore the association between the copy number of CER genes and the adaptation of eucalypts to environmental variation, Pearson’s correlations were calculated between the number of functional genes per eucalypt species and the average of each environmental variable for the species distribution. To explore the effect of the gene structure on environmental adaptation in eucalypts, we split the eucalypt species into two contrasting groups according to the results of the exon–intron structure analysis. We then used Welch’s t-test to compare the means of the two groups for each environmental variable.
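The two statistics named above can be computed as in the following minimal sketch, which uses only the Python standard library. It returns Pearson's r, and the Welch t statistic with its Welch–Satterthwaite degrees of freedom; p-values would additionally require a t-distribution routine (for example from a statistics package), which is omitted here. Function names and inputs are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    Uses sample variances, so the two groups need not share a variance.
    """
    va = variance(a) / len(a)  # squared standard error of group a
    vb = variance(b) / len(b)  # squared standard error of group b
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df
```

Here x would hold, for instance, the number of functional CER1 genes per species and y the species averages of one environmental variable, while a and b would hold the values of one variable for the two exon–intron structure groups.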

Results

Copy number variation across all Myrtaceae

The genome-wide search on the 28 Myrtaceae tree species revealed 250 genomic regions containing sequences with high similarity to Arabidopsis CER1 or CER3. 162 sequences were annotated as genes, from which 135 were CER1 and 27 were CER3 genes. A total of 88 sequences were identified as pseudogenes, of which 87 were CER1 pseudogenes and only one was a CER3 pseudogene (Supplementary Table S2). While there was only one copy of the gene CER3 per species in almost all species (the exception was Leptospermum scoparium with no CER3 detected), multiple copies of CER1 were found in all the Myrtaceae tree species (Table 1). Notably, the copy number of CER1 genes ranged from 2 to 10 in eucalypts, whereas it varied from 1 to 4 for the other Myrtaceae tree species. Similarly, the number of CER1 pseudogenes was also greater in eucalypts than the other Myrtaceae species. Within eucalypts, the copy number of CER1 ranged from 3 to 10 for the genus Eucalyptus and from 2 to 7 for its sister genus Corymbia and related Angophora, with a similar pattern detected in the pseudogenes. Within the genus Eucalyptus, multiple species were represented from the two most important subgenera—Symphyomyrtus and Eucalyptus—and considerable variation in the number of CER1 copies was detected within these subgenera. The Symphyomyrtus species ranged from 3 to 10 copies of CER1 and species from the subgenus Eucalyptus had between 5 and 7 (Table 1).

Table 1 Copy number of ECERIFERUM1 (CER1) and ECERIFERUM3 (CER3) genes and pseudogenes (in brackets) identified in the studied genomes of eucalypt and other Myrtaceae tree species by chromosome

Conserved architecture of CER genes in Myrtaceae

All the Myrtaceae species with their genomes assembled at the chromosome level showed the majority of their CER1 genes and pseudogenes on chromosome 4 (Table 1). CER1 genes (and occasionally pseudogenes) were also present on chromosome 8 in most species of the genus Eucalyptus, except for six of the ten species representing subgenus Symphyomyrtus. CER1 genes were also absent from chromosome 8 in Angophora and Corymbia and the other Myrtaceae species, suggesting a recent evolutionary origin within only Eucalyptus for CER1 on this chromosome. Multiple copies of CER1 were present for C. citriodora on chromosome 9, but absent in Corymbia maculata (a eucalypt species not included in our dataset since its subgenus and section were already covered by the closely related C. citriodora) when a tBLASTn search of Arabidopsis CER1 was performed (data not shown). Eucalyptus curtisii, E. melliodora and E. marginata, which belong to different Eucalyptus subgenera, showed CER1 genes and/or pseudogenes on chromosome 11, which appear to be independent translocations based on the lack of synteny between these loci (Fig. 1). The CER1 on chromosome 11 of E. curtisii and E. marginata are related to copies on chromosome 4, while the CER1 of E. melliodora on chromosome 11 is more related to CER1 on chromosome 8. A single translocation of CER1 to chromosome 6 was observed in E. globulus. Most of the species showed CER3 genes and pseudogenes on chromosome 3 (Table 1), but translocations were also observed from this chromosome to chromosomes 1 and 8 for E. guilfoylei and E. cladocalyx, respectively. The chromosomal position of some CER genes or pseudogenes was undetermined (Un, Table 1) because some genomes were not assembled to chromosome level, or in other cases, CER genes/pseudogenes were identified within the unassembled component of the data in GenBank.

Fig. 1

Pairwise alignment of chromosomes 4, 8 and 11 for E. curtisii, E. marginata and E. melliodora. Black bars indicate the locations of CER1 genes/pseudogenes on the only three species with copies on chromosome 11. Grey lines indicate matching syntenic regions between the three eucalypt species. Green lines indicate translocations. Note the synteny of the region surrounding the CER1 loci on chromosomes 4 and 8 but the lack of synteny for the region surrounding the CER1 loci on chromosome 11. None of the CER1 genes on chromosome 11 share a common ancestry

The relative position of CER1 genes and pseudogenes within chromosome 4 shows the copies to be tandem repeats in a region that is syntenic across all studied species, spanning a region up to 1.6 Mbp (~ 4% of chromosome 4), except for Eucalyptus erythrocorys and Psidium guajava, which spanned regions of 4.6 and 7.4 Mbp, respectively (Supplementary Table S2; Supplementary Figure S2). While this region is in a more-or-less central position on chromosome 4 in most species, in E. regnans and E. tenuipes, in particular, it is shifted towards the chromosome end. A closer look at this region in E. grandis and E. globulus (the only Eucalyptus with annotations to date) reveals that CER1 genes and pseudogenes occur as interspersed repeats, since E. grandis has 40 non-CER1 transcripts dispersed among the eight CER1 genes and pseudogenes, while E. globulus has 46 non-CER1 transcripts dispersed among the 12 CER1 genes and pseudogenes (data not shown). In the case of C. citriodora (the only other eucalypt with an available annotation), there is only one CER1 gene copy and no pseudogene on chromosome 4. The tandem arrangement of CER1 sequences was also observed in chromosome 8 for some species of Eucalyptus, spanning a region up to 0.14 Mbp (Supplementary Table S2; Supplementary Figure S3), again in a syntenic region.

Gene structural variation across multiple Myrtaceae lineages

Most Myrtaceae orthologues of Arabidopsis CER1 had 10 exons and 9 introns. The ERG3 domain occupied exons 3 and 4 completely and exon 5 partially, while the WAX2 domain occupied exon 7 partially and exons 8, 9 and 10 completely, exon 10 being the last exon, adjacent to the 3′ end. In some cases, a partial occupation of exon 3 by the ERG3 domain was observed (Fig. 2). We detected the loss of the 7th or 8th intron in the WAX2 domain for 24 CER1 genes from 15 species that included Metrosideros polymorpha, C. citriodora and Eucalyptus from diverse subgenera (Fig. 2, Supplementary Figure S4). Gene size of CER1 ranged from 5.9 to 18.9 kbp, with most of the genes around 7 kbp. Likewise, the exon–intron structure of CER3 in the Myrtaceae species followed the exon–intron structure of Arabidopsis CER3, with 11 exons and 10 introns (Supplementary Figure S5). Corymbia calophylla was the only species exhibiting structural variation in CER3, with the loss of the first exon and intron. The ERG3 domain of CER3 occupied the same exons and showed the same variations described above for CER1, while the WAX2 domain occupied exon 8 partially and exons 9, 10 and 11 completely, adjacent to the 3′ end. CER3 gene size was relatively uniform among the different Myrtaceae species, averaging 4.5 kbp (Supplementary Figure S5).

Fig. 2

Range of structural variation of CER1. Genes are shown in the following order: Arabidopsis thaliana CER1; typical E. globulus CER1 with ERG3 domain occupying the exons 3 and 4 completely and the exon 5 partially; typical E. viminalis CER1 with ERG3 domain occupying the exon 4 completely and the exons 3 and 5 partially; typical C. calophylla CER1 with the 7th intron lost in the WAX2 domain; typical E. brandiana CER1 with the 8th intron lost in the WAX2 domain; and atypically long E. tenuipes CER1. Eucalypt sequences were named using species followed by chromosome number, number of exons (e9 or e10) and copy number on the chromosome if there were more than one

Phylogenetic relationships between CER genes

The phylogenetic analysis showed three main clades of CER genes for the Myrtaceae species (Fig. 3). Clades 1 and 2 were phylogenetically related to Arabidopsis CER1, whereas Clade 3 was related to Arabidopsis CER3. In addition, the Arabidopsis genes CER1a and CER1b (homologues of Arabidopsis CER1), were not closely related to Myrtaceae CER1 or CER3, thus the duplication of CER1 observed in Myrtaceae is independent of that in Arabidopsis. The duplication of CER1 into Clades 1 and 2 is relatively old as both of these clades include almost all eucalypt species and many of the other Myrtaceae species (Fig. 4a and b). Clade 2 includes 5 of the 6 other Myrtaceae species and 21 of the 22 eucalypts while Clade 1 includes all eucalypt species but only two of the other Myrtaceae species and, importantly, does not include Melaleuca alternifolia which is from a more ancient divergence within Myrtaceae compared to eucalypts as shown in the phylogeny of CER3 (Fig. 4c). Therefore, the duplication of CER1 into Clades 1 and 2 most likely occurred within Myrtaceae. The internal structure of Clades 1 to 3 generally reflected the species taxonomy (Fig. 4a, b and c) and phylogenies such as the one of Thornhill et al. (2015).

Fig. 3

Phylogeny of CER1 and CER3 genes for eucalypts, other Myrtaceae tree species and Arabidopsis. The scale represents amino acid substitutions per site. Bootstrap values are displayed on branches. Arabidopsis sequences are shown using chromosome number and eceriferum gene (CER1, homologues CER1a and CER1b, and CER3). CER1 is shown in Clades 1 and 2 in Myrtaceae including eucalypts, while CER3 is only present in Clade 3

Fig. 4

Details of the phylogeny of CER1 and CER3 for eucalypts and other Myrtaceae tree species based on peptide sequences. a Clade 1 and b Clade 2 of the phylogeny, which comprised only CER1 genes, and c Clade 3 of the phylogeny, which comprised only CER3 genes. The legend differentiates non-eucalypt Myrtaceae, subgenus within eucalypts, or genus when the taxon has no subgenus. Sequences were named using species or genus followed by chromosome number, number of exons (e9 or e10) and copy number on the chromosome if there were more than one. I, II and III are subclades that group sequences with intron loss. IV is a subclade that groups sequences from chromosome 8. A single asterisk (*) indicates identical sequences for the same species. A double asterisk (**) indicates identical sequences for different species. The scale represents amino acid substitutions per site

The CER1 phylogeny grouped eucalypt sequences with intron loss into three small subclades (subclades I, II and III; Fig. 4a), reflecting the subgeneric structure. These three subclades only included Eucalyptus species. Note that the phylogeny was not affected by the intron–exon structure as it was based on the peptide sequences. The CER1 genes present on chromosome 8 were grouped into one single subclade (subclade IV; Fig. 4a) and included all 8 subgenera of Eucalyptus, but not the other two genera of eucalypts (Corymbia and Angophora). Thus, this translocation to chromosome 8 appears to be specific to the genus Eucalyptus. Exact copies of CER1 were found within the same species for E. globulus and E. brandiana (Fig. 4a and b), suggesting they are the result of recent tandem duplications. Exact copies were also found for the taxonomically close species E. globulus and E. viminalis, suggesting the presence of this gene in a common ancestor (Fig. 4a). As the phylogeny of CER3 includes only one copy per species, the position of sequences more clearly followed the species taxonomy (Fig. 4c).

Copy number and structural variation association with species home-environment

Correlations between the number of copies of CER1 genes per species and the average of each environmental variable for the eucalypt home range were not statistically significant (p > 0.05, Table 2), suggesting no association between the number of CER1 copies and environmental variation. Based on the exon–intron structure of the CER1 genes, eucalypt species were classified into two groups as follows: (i) species that present a complete gene structure (i.e., no intron loss, n = 8), which included A. floribunda, C. citriodora, E. erythrocorys, E. guilfoylei, E. virginea, E. leucophloia, E. grandis and E. melliodora; and (ii) species in which at least one CER1 gene showed intron loss (n = 13), which included C. calophylla, E. curtisii, E. tenuipes, E. cloeziana, E. marginata, E. pauciflora, E. regnans, E. microcorys, E. pumila, E. globulus, E. viminalis, E. cladocalyx and E. brandiana. No statistically significant differences were found for the environmental variables when the two groups of eucalypts were compared with a Welch’s t-test (p > 0.05, Table 2), suggesting that no association between gene structure and environmental variation exists. Since no major changes in copy number or gene structure were noticed in CER3, no such tests were performed.

Table 2 Association between CER1 copy number and gene structure with environmental variation in eucalypts

Discussion

Most of the basic knowledge on eceriferum genes comes from research on the model plant species Arabidopsis (Koornneef et al. 1989; Aarts et al. 1995; Jenks et al. 1995; Rowland et al. 2007), but the development of high-throughput sequencing and the availability of new complete genomes have recently allowed the study of eceriferum homologues at the genome-wide level in other plant species. These species include non-woody species such as sunflower, tomato, wheat, passion fruit (Ahmad et al. 2021; He et al. 2022; Rizwan et al. 2022; Wu et al. 2022), as well as woody species such as jujube, oak, Chinese chestnut (Li et al. 2021; Ai et al. 2022; Zhao et al. 2022) and in the present case, the eucalypts, sister taxa and the outgroup. By manually annotating the CER1 and CER3 genes across the 28 Myrtaceae genomes, we showed that CER3 copy number was conserved across the Myrtaceae, while CER1 was highly variable, with multiple lineage-specific tandem duplications in eucalypts, along with a translocation event conserved in many subgenera. In addition, we identified variation in the exon–intron structure of CER1, detecting the loss of intron 7 or 8 in different lineages of Myrtaceae. We did not find evidence to link variation in gene structure and copy number with the capacity of eucalypts to adapt to different environments. The information presented in this study highlights the variability of the CER1 genes, presumably because of the instability induced by tandem arrays, which is in contrast to the stability of the single copy CER3 in eucalypts.

Eucalypts had a higher number of copies of CER1 genes than the Myrtaceae outgroup and sister species, and especially higher than Arabidopsis. Most often these copies were grouped in the phylogeny, indicating lineage-specific duplication events. These duplications were most common for a localised region of chromosome 4 which harboured between 1 and 18 genes/pseudogenes depending on the species, suggesting that tandem duplication was the mechanism responsible for these patterns. Similar gene expansion has been observed in other gene families of E. grandis (Li et al. 2015b), E. globulus (Külheim et al. 2015) and C. citriodora (Butler et al. 2018) and is a feature of eucalypts, which have a high proportion of tandemly duplicated genes (Myburg et al. 2014; Healey et al. 2021). Expansion of the CER1 gene family specifically has also been observed in oaks, with Ai et al. (2022) suggesting this may confer greater adaptability to drought, potentially through a dosage effect from increased gene product (Kondrashov 2012; Kuzmin et al. 2022). This seemed to be a plausible explanation for the highly variable copy number observed across the eucalypts, especially considering the diverse environments and stresses to which these eucalypts are exposed. However, we found no obvious association between CER1 copy number and aridity (measured as moisture indices) or any other environmental variable measured for the species’ current distributions. This lack of association is not entirely unexpected, given the disparate CER1 copy number among species occupying similar environmental niches in our study. For instance, E. globulus and E. viminalis, which share a large area of their distribution and are phylogenetically close, showed dissimilar copy numbers of the CER1 gene in our study, yet when they co-occur they show similar susceptibility to drought (Kirkpatrick and Marks 1985).
A more detailed investigation of this issue is needed, and an investigation of CER1 expression levels will be required to determine if dosage amplification of CER1 contributes to the environmental adaptability of eucalypts.

Chromosome 3 appears to be the ancestral position for CER3 in Myrtaceae, given the conservation of the position of this gene across the phylogeny. Similarly, CER1 was primarily located in a syntenic region on chromosome 4 in the Myrtaceae species studied, which implies that this is the ancestral position for Myrtaceae. The tandem duplication of CER1 on chromosome 4 which gave rise to two large clades of CER1 (Clades 1 and 2) is likely to be Myrtaceae specific since Clade 1 did not contain sequences from the outgroup Melaleuca alternifolia, which is the most basal Myrtaceae divergence sampled in our study (Thornhill et al. 2015). A large proportion of the Eucalyptus species examined also had CER1 genes or pseudogenes on chromosome 8, which was not the case in the related Angophora and Corymbia or other Myrtaceae, suggesting a more recent duplication and translocation event was responsible for this novel locus. There were also species-specific duplications and translocations. For CER1, this included the positions on chromosome 11 observed in E. curtisii, E. marginata, E. melliodora and the position on chromosome 9 in C. citriodora. For CER3, this included the position on chromosome 8 in E. cladocalyx and the position on chromosome 1 in E. guilfoylei. These species-specific inter-chromosomal translocations may be the result of specific evolutionary events for these species, although assembly errors cannot be discounted (Wang et al. 2020). Indeed, the translocation of CER1 to chromosome 9 in C. citriodora may be an assembly error as there are no copies present on this chromosome in the related C. maculata (data not shown).

Intron loss was observed in only one of the two clades of CER1 (Clade 1), where it occurred in multiple lineages including the genus Corymbia, diverse subgenera of Eucalyptus, and Metrosideros polymorpha, whose genus is thought to have diverged from the eucalypts around 65–70 million years ago (Thornhill et al. 2015). Our study suggests that introns of the CER1 genes were independently lost several times in the evolution of Myrtaceae, as the cases of intron loss were dispersed among different subclades within Clade 1 (Fig. 4a). This pattern has been observed previously, with intron loss and gain reported in different members of the eceriferum gene family for sunflower (Ahmad et al. 2021). In addition, Rizwan et al. (2022) suggested that the gene family has gone through several rounds of intron loss and gain during its evolution. However, none of these studies reported intron loss in eceriferum genes to be a frequent event, as we do here. Similar recurrent intron loss for other genes has been found in other evolutionary lineages in both plants (Wang et al. 2014; Milia et al. 2015) and animals (Cho et al. 2004; Coulombe-Huntington and Majewski 2007), but this appears to be the first observation of this phenomenon in eucalypts. Research on animals, plants and malaria parasites has shown that intron loss was more frequent in highly duplicated genes (Castillo-Davis et al. 2004; Lin et al. 2006; Roy and Penny 2007), which accords with our observed differences between CER1 and CER3 in eucalypts. Other research in yeasts, mice and Arabidopsis showed that stress genes with fast-changing expression levels had significantly lower intron densities, possibly to avoid delays in transcript production and energetic costs associated with increased transcript length (Jeffares et al. 2008). This suggests that selection may favour intron loss during evolution in such cases. This hypothesis may be applicable to the observed intron loss in CER1, whose expression level is known to increase quickly in response to drought stress in Arabidopsis (Bourdenx et al. 2011) and several other plant species (He et al. 2022; Wu et al. 2022; Gao et al. 2023). However, we did not find that intron losses in CER1 were linked to variation in the species’ home environments, including measurements of aridity. Species-specific evolutionary pathways could have contributed to the loss of introns at any point in their evolutionary history, independently of the environment, which is consistent with Penny et al. (2009), who consider the attribution of intron loss to a particular environmental condition to be speculative.

Intron loss was restricted to introns 7 and 8 of CER1, which may be explained by the intron loss mechanisms or by the features of these introns. Intron loss can be due to stochastic processes such as direct genomic deletion or nonhomologous end joining during repair of DNA double-strand breaks (Cohen et al. 2011; Fawcett et al. 2011). However, a third mechanism, the reverse transcription model (Roy and Gilbert 2005), may offer an explanation, particularly as only specific introns were lost in our study (i.e., always intron 7 or 8). According to this mechanism, introns are lost through gene conversion by a retrotransposed copy of a spliced transcript of the gene, often spanning multiple intron positions. Given that reverse transcription initiates at the 3′ end of genes and the enzyme reverse transcriptase frequently dissociates from the template prematurely, introns closer to the 3′ end, like introns 7 and 8 in CER1, are more prone to be lost. However, the adjacent introns 7 and 8 were never lost simultaneously in our study. The absence of concurrent loss is an argument against the reverse transcription model, which is reported to favour the concurrent loss of neighbouring introns (Roy and Gilbert 2006; Ma et al. 2015). Specific features of these introns may also help explain their loss. For example, small introns, like intron 8, are more likely to be lost than larger introns (Coulombe-Huntington and Majewski 2007; Loh et al. 2007). Although the reason for this size effect is still not completely understood, one potential explanation is that regulatory modules are less likely to occur in short introns, so that their loss is less deleterious (Wang et al. 2014). However, it is notable that introns 7 and 8 interrupt the conserved functional WAX2 domain of CER1. Functional domains uninterrupted by introns can provide a selective advantage due to a greater capacity for exon shuffling, as hypothesised by Liu and Grigoriev (2004). However, changes in exon position were not observed in our study, which argues against this hypothesis. Intron loss remains a subject of great interest for molecular biologists because of its importance in the evolution of life (Rodríguez-Trelles et al. 2006), but at this stage it remains poorly understood (Rogozin et al. 2012; Milia et al. 2015).

In conclusion, our study of CER1 and CER3 genes across the main eucalypt lineages and other Myrtaceae species showed that gene structure and copy number varied markedly among species for CER1, but were highly conserved for CER3. Several evolutionary events were specific to eucalypts, such as a high level of tandem duplications or specific inter-chromosomal translocations in Eucalyptus. Although no association was found between CER1 gene structure or copy number and the environment of origin of the studied species, more research is needed to investigate the link between the observed variation in the eceriferum genes and their possible contribution to adaptability.