Introduction

The genus Lavandula (lavenders) is composed of over 32 morphologically distinct species. Among these, a few species including L. angustifolia (English Lavender, or Lavender), L. latifolia (spike lavender), and their hybrid L. x intermedia (Lavandin) are widely grown around the globe for their essential oils (EOs) (Upson et al. 2004), which are frequently used in perfumes, pharmaceutical preparations, cosmetic products, and antiseptics, among others. With an estimated production rate of over 1500 metric tons/annum (mostly extracted from L. angustifolia and L. x intermedia), these oils significantly contribute to the multibillion-dollar flavor and fragrance industry worldwide.

Although lavender has been developed as a model system for studying EO production (Lane et al. 2010) and numerous genes that contribute to EO production in this plant have been reported (Falk et al. 2009; Lane et al. 2010; Demissie et al. 2012), many questions regarding the regulation of EO metabolism and storage in these plants remain unanswered.

The quality of lavender EOs greatly depends on the characteristic scent of the oil, which is determined by certain monoterpenes. For example, the monoterpene camphor contributes an off-odor, and its presence in the oil lowers quality and, hence, market value. Conversely, high levels of linalool and linalyl acetate are desired in lavender oils. Because many high-yield lavender species/cultivars (e.g., L. x intermedia) produce high quantities of undesired constituents (such as camphor), there is great interest in enhancing oil yield and composition in lavenders. These objectives are readily achievable through targeted breeding and/or plant biotechnology. However, an adequate understanding of molecular elements that control the production of EO constituents in lavenders must first be developed.

To better understand the genetic makeup and key pathways that control EO production, secretion, and storage, we generated a first full-length draft genome assembly of lavender using short reads from the Illumina platform. This resource can lead to a better understanding of the genome architecture, gene clusters, and repeat content and, ideally, to the generation of genetic markers for plant breeding programs.

De novo genome assembly of lavender

The genome of L. angustifolia (Maillette) was sequenced to a total of ~ 100 × coverage on the Illumina HiSeq 2000 platform using a combination of paired-end and mate-pair libraries, all sequenced at the 100 bp × 2 setting. The best de novo genome assembly was obtained using FERMI (Li 2012) for contig assembly and OPERA (Gao et al. 2011, 2016) for scaffolding. The initial genome assembly was further improved using an in-house transcriptome-based program and GapCloser (Luo et al. 2015). The final draft assembly contains 869,786,077 bp in 84,291 scaffolds with an N50 of 96,735 bp, and the total non-gap sequence is 688,040,719 bp, or 79.1% of the final draft genome. The repeat sequences annotated by RepeatMasker (Smit et al. 2013), RepeatModeler (http://www.repeatmasker.org/RepeatModeler/), and RepeatProteinMasker (http://repeatmasker.org) constitute 42.8% of the lavender draft genome. The genome has a GC content of 38.1% (Table 1 and Fig. 1a).
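
As an illustration of how the summary statistics above are defined, the following minimal Python sketch computes an N50 and a non-gap fraction. It is not the code used in this study: the function names and the toy scaffold lengths are hypothetical, and only the two base-pair totals are taken from the text.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together contain at least half of the total assembled bases."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must be non-empty")


def non_gap_fraction(total_bp, non_gap_bp):
    """Fraction of the assembly that is real sequence rather than gap filler."""
    return non_gap_bp / total_bp


# Toy scaffold lengths (made up for illustration, not from the assembly).
toy_scaffolds = [80, 70, 50, 40, 30, 20, 10]
print("toy N50:", n50(toy_scaffolds))

# Totals reported in the text for the final draft assembly.
pct = 100 * non_gap_fraction(869_786_077, 688_040_719)
print(f"non-gap sequence: {pct:.1f}%")
```

Applied to the actual list of 84,291 scaffold lengths, the same N50 definition would be expected to return the reported value, provided the assembler's definition matches the one assumed here.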

Table 1 Summary statistics for the lavender draft genome assembly
Fig. 1 Summary features of the Lavandula angustifolia (Lavender) draft genome. a Pie chart showing the genome composition. Categories denoted by “*” exclude overlapping repeats and gaps. b Estimation of genome completeness of the lavender draft genome and comparison with four other published plant genomes using Benchmarking Universal Single-Copy Orthologues (BUSCO)

Regarding the genome size of lavender, there have been two conflicting reports, one suggesting a very large size of 5574 (Zonneveld et al. 2005) and another reporting a genome size ranging from 772 Mbp to 880 Mbp (Urwin et al. 2007). The ~ 870 Mbp size of our draft genome assembly is close to our own experimentally determined genome size of approximately 850 Mbp, obtained using a qPCR-based method (Wilhelm 2003) (data not shown). A similar size prediction was also obtained from the raw Illumina reads using Kmergenie (Chikhi and Medvedev 2014) (data not shown). Therefore, our data help resolve the dispute on lavender genome size by supporting an estimated genome size of around 850 Mbp.

The genome completeness determined by Benchmarking Universal Single-Copy Orthologues (BUSCO) (Simão et al. 2015) indicates a high-quality draft genome, covering 1292 (89.7%) complete single-copy orthologues (SCOs) of the 1440 plant-specific sequences (Embryophyta data set from the BUSCO data sets) (Fig. 1b). The presence of 30 fragmented SCOs brings the total number of SCOs to 1322 and the genome completeness to 91.8%. In comparison with the published genomes of several other plants, including model plants, the completeness of the lavender draft genome is comparable to that of the maize genome (1294 complete SCOs and 31 partial SCOs; completeness at 92%), which was obtained using more resources (Schnable et al. 2009), and is significantly better than that of the mint genome (848 complete SCOs and 183 partial SCOs; completeness at 71%) (Vining et al. 2017) (Fig. 1b).
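
The completeness percentages above follow directly from the reported counts. The short Python sketch below reproduces that arithmetic; the function name `busco_summary` is hypothetical, and the script is a convenience check rather than part of the BUSCO software.

```python
def busco_summary(complete, fragmented, total):
    """Summarize BUSCO counts as percentages of the reference set."""
    found = complete + fragmented
    return {
        "complete_pct": round(100 * complete / total, 1),
        "found": found,
        "found_pct": round(100 * found / total, 1),
    }


TOTAL_EMBRYOPHYTA = 1440  # size of the plant-specific set cited in the text

# Counts as reported in the text (complete, fragmented/partial).
lavender = busco_summary(1292, 30, TOTAL_EMBRYOPHYTA)
maize = busco_summary(1294, 31, TOTAL_EMBRYOPHYTA)

print("lavender:", lavender)
print("maize:", maize)
```

Here "found" counts both complete and fragmented orthologues, which is the sense in which the text reports 1322 SCOs and 91.8% completeness for lavender.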

An interesting observation among the complete SCOs is the presence of 696 that are complete and duplicated. Sequence analysis using BLASTn (Altschul et al. 1990) combined with the BUSCO score analysis revealed that at least 547 (78%) of the 696 complete and duplicated SCOs represent real duplication events in the genome, as distinct from possible assembly errors due to genetic heterogeneity between homologous chromosomes. Indeed, possible duplication events resulting in polyploidy ranging from 2n = 36 to 2n = 54 have been reported for L. angustifolia (Upson et al. 2004). In this context, our data indicate that the variety used in this study (Maillette) is most likely a polyploid line.

De novo genome annotation

Using an automated and highly configurable de novo annotation pipeline (MAKER) with transcriptome data (Adal et al. 2018) and NCBI plant reference protein sequences to guide gene prediction by Augustus (Stanke et al. 2006; Campbell et al. 2014; Holt and Yandell 2011), we identified 218,000 initial gene models. This list was filtered based on minimal gene length and/or homology to known protein sequences to generate a final set of 62,141 protein-coding genes. In addition, we identified 2003 tRNA and rRNA genes using tRNAScan (Lowe and Eddy 1996) and RNAmmer (Lagesen et al. 2007), respectively, which brings the total number of identified genes to 64,144. The total regions of genes, CDS, and introns cover 26.8, 9.2, and 5.1% of the lavender draft genome, respectively (Fig. 1a and supplementary Table S1). In comparison, the lavender genome has a density of protein-coding sequences (CDS) that is ~ 3 times higher than that of the maize genome but more than four times lower than that of Arabidopsis (supplementary Table S1).
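
The filtering step described above can be sketched as follows. This is only an illustration under stated assumptions: the text does not give the length threshold or say exactly how the length and homology criteria were combined, so `MIN_GENE_LENGTH_BP`, the `GeneModel` record, and the `require_both` switch are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GeneModel:
    gene_id: str
    length_bp: int
    has_protein_homology: bool  # e.g. a hit against known plant proteins


# Hypothetical threshold; the actual cutoff is not stated in the text.
MIN_GENE_LENGTH_BP = 300


def keep_model(model, min_length=MIN_GENE_LENGTH_BP, require_both=False):
    """Decide whether an initial gene model passes the filter.

    The text says "minimal gene length and/or homology", so both readings
    are exposed: require_both=True demands both criteria, while the default
    accepts a model that satisfies either one.
    """
    long_enough = model.length_bp >= min_length
    if require_both:
        return long_enough and model.has_protein_homology
    return long_enough or model.has_protein_homology


# Toy gene models, made up for illustration.
models = [
    GeneModel("g1", 1200, True),
    GeneModel("g2", 150, True),
    GeneModel("g3", 900, False),
    GeneModel("g4", 90, False),
]

kept_either = [m.gene_id for m in models if keep_model(m)]
kept_both = [m.gene_id for m in models if keep_model(m, require_both=True)]
print(kept_either, kept_both)
```

The two readings can give very different gene counts on real data, which is why the combination rule is left as an explicit parameter here.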

We also annotated repetitive elements in the draft genome using RepeatMasker (Smit et al. 2013), which revealed a total repeat content of ~ 43%, a level lower than that of most published plant genomes of similar size. For example, the ~ 900 Mbp tomato genome (Sato et al. 2012) consists of 68% repeats (Jouffroy et al. 2016), and the 2104 Mbp maize genome (Schnable et al. 2009) consists of 82.2% repeats (supplementary Table S1). In comparison, the 119 Mbp genome of the model plant Arabidopsis thaliana (Weinig et al. 2002) contains only 21% repeats (supplementary Table S1). Among the repeats, as expected for most plant genomes, LTR retrotransposons are the dominant type, contributing 18% of the genome or 44% of the repeats. The lower-than-expected overall repeat content of lavender could be due partly to selective loss following genome polyploidization and partly to a biased distribution of gaps in repeat regions. The former is in agreement with the higher density of CDS exons. The latter can be resolved by additional sequencing to close the gaps.

Genes for EO pathways

Considering that EO production is the main characteristic of lavender plants, we performed a detailed analysis of the genes associated with EO pathways. The draft lavender genome covers all of the known genes encoding the four classes of enzymes required for all stages of EO biosynthesis (Tables 2, 3 & supplementary Fig. S1). In comparison with the four other published plant genomes, including mint, the only other published genome from the mint family, to which lavender belongs (Vining et al. 2017), we observed that lavender has the largest gene copy numbers for 15 of the 35 genes involved in EO biosynthesis (Table 2). For the methylerythritol phosphate (MEP) pathway, the lavender genome has 13 copies of DXS, one more than mint, while the three other genomes each have only a single copy. Interestingly, the lavender genome has seven copies of HDR, the last gene in the MEP pathway, while all other genomes, including mint, have only a single copy. It is likely that having a larger copy number of the first and last genes of the MEP pathway permits the plant to produce terpene compounds (essential oils, resins, etc.) more efficiently, as reported for other plants such as pine (Kim et al. 2009).

Table 2 Comparison of gene copy numbers for the DXP, MVA pathways, and prenyltransferases that produce linear precursors for isoprenoid production in various plant genomes
Table 3 Previously cloned monoterpene synthase, sesquiterpenes synthase, and acetyltransferase genes responsible for producing mono- and sesquiterpene essential oil constituents in lavenders

In addition, the lavender genome has a high copy number for the two precursor enzymes, geranylgeranyl pyrophosphate synthase (GGPPS) and farnesyl pyrophosphate synthase (FPPS). The lavender genome is much more similar to the mint genome than to the non-mint genomes in that both contain many EO genes, including 13 of the 15 terpene synthase (TPS) genes critical for EO synthesis, that are absent from the non-mint plants (Table 2); these genes contribute to the special EO-producing characteristics of the mint family plants.

A surprising finding was that the lavender genome contains only two copies of the R-linalool synthase gene (which contributes most to EO production), while it contains more than two copies of some other terpene synthase genes that contribute minimally to the EO in lavenders (e.g., limonene synthase and 3-carene synthase) (Table 3). Given that linalool synthase is responsible for the production of two of the most abundant EO constituents (linalool and its acetylated form linalyl acetate, which make up > 80% of the EO in most L. angustifolia and L. x intermedia species), and that transcripts for the gene are among the most abundant mRNA species in floral oil glands (Lane et al. 2010), we anticipated a higher copy number in the genome. Taken together, these results indicate that the high linalool output reflects very strong expression of the R-linalool synthase gene in floral oil glands rather than a high gene copy number.

We also searched for orthologous sequences/clusters using the OrthoVenn tool (Wang et al. 2015) as a way of validating the genome annotation process. A total of 8413 clusters, comprising 76,424 gene sequences (data not shown), are shared among all five plant genomes compared, and we were able to find orthologous sequences in the Arabidopsis (64), rice (58), maize (87), and tomato (776) genomes. We were unable to include the mint genome in this analysis due to the lack of protein sequences. Tomato shares the largest number of genes with lavender among the plants compared. A search for orthologous sequences containing terpene synthase genes resulted in 20 orthologous clusters, divided into characterized and uncharacterized TPS genes. Orthologous sequences for the S-linalool synthase gene are present in all plants, with three copies in the tomato genome, one more than in lavender. Interestingly, R-linalool synthase is absent in rice and maize and has a single copy in the tomato genome, one fewer than in lavender (supplementary Table S2). Among the uncharacterized TPS genes, the genes for ent-copalyl diphosphate synthase and solanesyl diphosphate synthase are shared among all plants. The other three genes [cis-abienol synthase, gamma-cadinene synthase, and (+)-epi-alpha-bisabolol synthase] are present only in lavender (supplementary Table S2).

Conclusion

In conclusion, the lavender genome assembly reported here represents the first relatively complete, high-quality draft genome for this plant. The initial analysis of the genome sequences reveals a genome optimized for EO production, with large copy numbers of many EO pathway genes and genes unique to EO-producing plants. Future work may involve estimating the level of genome heterozygosity, identifying any genome duplication events during the course of the plant’s evolution, closing the gaps in the draft assembly using long-read sequencing platforms, and using optical mapping technologies to map the genome sequences to their chromosomal locations. Given the economic importance of lavender and the applications of its EOs in many industries, the lavender draft genome sequences can serve as a significant genomics resource for the lavender research communities.

Author contribution statement

SSM and PL designed and funded the research. SSM, AMA and LSS prepared gDNA for sequencing and provided ESTs and transcriptome sequences used in annotating the draft assembly. RPNM and PL conducted bioinformatics work to generate the draft genome assembly and annotate it. RPNM, PL and SSM prepared the manuscript. All authors read and approved the manuscript.