Background & Summary

Lactuca sativa L. (Asteraceae), known as lettuce, is considered one of the most important vegetable crops1,2,3,4,5. Originating in the coastal Mediterranean regions, lettuce was featured in Egyptian tombs around 2,500 BC2,3. Today, lettuce is cultivated as diverse horticultural varieties for different purposes, including leafy types (looseleaf, crisphead, romaine, and butterhead) and non-leafy types (stem and oilseed), each with distinct morphological characteristics6,7. Leafy lettuces, particularly looseleaf and crisphead, are consumed globally in salads and hamburgers, and are also popular in hotpot cuisine in China and grilled with red meat in other parts of Asia. Compared to crisphead, looseleaf lettuce grows faster, can be harvested earlier, and tolerates abiotic stress better. Thus, looseleaf lettuce is an important horticultural type for the annual leafy vegetable supply, and genomic research could greatly enhance its economic value.

A high-quality reference genome is crucial for identifying genetic variations, conducting phylogenetic research, and facilitating molecular marker-assisted breeding. For L. sativa, a representative species of the genus Lactuca in the Asteraceae family, the first reference genome, from a crisphead type (L. sativa var. capitata cv. Salinas), was released in 2017, with a genome size of 2.38 Gb and a contig N50 of 36 Kb8. With advancements in sequencing technology and broader use of Lactuca species, additional genome assemblies have been published, including those for two wild relatives (L. saligna and L. virosa) and one stem lettuce (L. sativa var. angustana cv. Yanling1)9,10,11. Although these data are useful for identifying intraspecific variation, only two chromosome-level genome assemblies of cultivated lettuce (the crisphead and stem types) have been generated to date8,11. A single or limited number of reference genomes for an economically important crop is insufficient for exploring genetic diversity, which hinders genomic research and molecular breeding12,13. A high-quality genome assembly for the looseleaf type is therefore needed for identifying genetic variations, inferring phylogenetic relationships among different horticultural types, and facilitating comparative genomic analysis and genetic improvement in lettuce.

In this study, we generated a chromosome-level and near-complete reference genome assembly for looseleaf lettuce (L. sativa var. crispa cv. Green Elegance) using PacBio high-fidelity reads (~46×), Oxford Nanopore reads (~13×), Illumina short reads (~50.39×), and Hi-C reads (~97×). The assembled genome (Green Elegance) had a total length of 2.59 Gb, with a contig N50 of 205.47 Mb and a BUSCO completeness score of 98.39%. A total of 2,580.61 Mb (99.61%) of the genome sequences were anchored to nine chromosomes, featuring 14 recognizable telomeres and 11 gaps. Genome annotation predicted 41,375 protein-coding genes and 77.11% repetitive sequences. These genomic resources provide a roadmap for further genetic and evolutionary investigation.

Methods

Sample collection, library construction and sequencing

Looseleaf lettuce (Lactuca sativa var. crispa cv. Green Elegance) was provided by the Beijing Vegetable Research Center, Beijing Academy of Agriculture and Forestry Science, Beijing, China (Fig. 1). The seedlings were grown in a growth chamber at the Beijing Vegetable Research Center under a photoperiod of 16-hour light (200 μmol m−2 s−1) and 8-hour dark at 25 °C. Fresh and healthy leaves were collected at the rosette stage and immediately frozen in liquid nitrogen for genome survey and sequencing (Table 1). For transcriptomic sequencing, samples included mature leaves, young seedlings (including roots), and inflorescence (Table 1). Newly developed tender leaves, maintained under moist and low-temperature conditions, were used to construct the Hi-C library (Table 1).

Fig. 1

Morphological characteristics of the looseleaf lettuce (Lactuca sativa var. crispa) cultivar Green Elegance sequenced for genome assembly. Images show the looseleaf lettuce (cv. Green Elegance) in the field (a) and after harvest (b), including a fully expanded leaf, a longitudinal section, and the top and side views of the Green Elegance plant.

Table 1 Summary of the sequencing data generated for the looseleaf lettuce (L. sativa var. crispa cv. Green Elegance) genome assembly.

High molecular weight genomic DNA was extracted from leaves using a modified CTAB (cetyltrimethylammonium bromide) method14. RNA was removed by adding RNase A. The quality of the DNA was assessed using agarose gel electrophoresis, which confirmed excellent integrity of the DNA molecules.

For Illumina sequencing, a short-read library with an average insert size of 350 bp was constructed and sequenced on an Illumina NovaSeq platform (Illumina, CA, USA) using the PE150 program. This yielded 135.8 Gb of raw data. After quality filtering, 124.16 Gb (50.39×) of clean reads were obtained for genome size estimation, sequence correction, and assessment of heterozygosity and repeat content (Table 1 and Fig. 2).

Fig. 2

K-mer analysis of L. sativa var. crispa cv. Green Elegance. The plot displays the frequency distribution of 21-mers, with the genome size of L. sativa cv. Green Elegance estimated at 2.46 Gb, a heterozygosity rate of 0.211%, and a duplication rate of 1.66%.

For PacBio HiFi sequencing, genomic DNA was fragmented to ~15 Kb to construct a long-read library following the manufacturer’s instructions (Pacific Biosciences, CA, USA). The SMRTbell library was constructed using the SMRTbell Express Template Prep Kit 2.0 (Pacific Biosciences). Library size and quantity were assessed using the FEMTO Pulse (Agilent Technologies, Wilmington, DE) and the Qubit dsDNA HS Assay Kit (Life Technologies, Carlsbad, CA, USA). The library was loaded at a concentration of 55 pM using diffusion loading. Single-molecule real-time (SMRT) sequencing was conducted in Circular Consensus Sequencing (CCS) mode on a single 8 M SMRT Cell on the PacBio Sequel II System. After filtering out low-quality reads and adapter sequences, we obtained 109.27 Gb (46×) of clean subreads with a read-length N50 of 17.81 Kb.

Long-read sequencing using the PromethION platform from Oxford Nanopore Technologies (ONT) was performed to fill assembly gaps. High-quality genomic DNA was fragmented to ~8 Kb using a gTube, and the library was constructed with the Ligation Sequencing Kit 1D (Nanopore, SQK-LSK109). We generated 85.80 Gb of raw data, and after filtering out adapters, low-quality reads, and reads shorter than 2 Kb, 32.55 Gb of clean reads with a clean N50 length of 100,148 Kb were obtained (Table 1).

Genome size and heterozygosity estimation

Short Illumina reads were quality-filtered using fastp15 (v0.12.4; settings ‘-q 10 -u 50 -y -g -Y 10 -e 20 -l 100 -b 150 -B 150’). The quality-filtered reads were used for genome size estimation. We counted 21-mers with Jellyfish16 (v2.1.4; k-mer size 21), and GenomeScope17 (v2.0; default settings) was used to estimate a genome size of 2.46 Gb and a genome-wide heterozygosity rate of 0.21% of sites (Fig. 2).
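The principle behind k-mer based genome size estimation can be illustrated with a minimal sketch. It is not the GenomeScope model, which fits a mixture distribution to the full k-mer spectrum; it only applies the simpler rule that genome size is approximately the total number of non-error k-mers divided by the depth of the main peak. The function names, the `min_depth` error cutoff, and the toy histogram are assumptions introduced for illustration; the input format assumed is the two-column (depth, count) text produced by a k-mer histogram tool.

```python
def parse_histo(text):
    """Parse a two-column k-mer histogram: 'depth count' per line."""
    histogram = {}
    for line in text.splitlines():
        if line.strip():
            depth, count = line.split()
            histogram[int(depth)] = int(count)
    return histogram


def estimate_genome_size(histogram, min_depth=5):
    """Return (estimated genome size, peak depth) from a k-mer histogram.

    min_depth is a hypothetical cutoff: k-mers seen fewer times are
    assumed to be dominated by sequencing errors and are ignored.
    """
    solid = {d: c for d, c in histogram.items() if d >= min_depth}
    # Assumption: the depth with the most distinct k-mers is the homozygous peak.
    peak_depth = max(solid, key=solid.get)
    total_kmers = sum(d * c for d, c in solid.items())
    return total_kmers / peak_depth, peak_depth


# Toy histogram, not real data.
toy = parse_histo("1 1000\n2 200\n19 50\n20 100\n21 50\n40 10\n")
size, peak = estimate_genome_size(toy)
print(f"peak depth = {peak}, estimated size = {size:.0f} k-mers")
```

For a genome with low heterozygosity, such as the 0.21% reported here, the homozygous peak is expected to dominate the spectrum, which is the situation this simplified rule assumes.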

De novo genome assembly

High-accuracy Circular Consensus Sequencing (CCS) data were assembled with hifiasm (v 0.16)18 into 156 contigs, with a longest contig of 282.47 Mb and a contig N50 of 205.47 Mb (Table 2). This resulted in a total genome sequence size of 2.59 Gb.
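For readers less familiar with the contiguity metric reported above, N50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. A minimal sketch follows; the function name and the toy lengths are illustrative only and are not the contig lengths of this assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length at which the running sum reaches half the total is the N50.
        if running >= half_total:
            return length
    raise ValueError("lengths must not be empty")


# Toy example: total = 32, so half = 16, reached after the two 8s.
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

The same function applies unchanged to read lengths (read N50) and scaffold lengths (scaffold N50).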

Table 2 Comparison of genome assemblies of L. sativa var. crispa, L. sativa var. capitata, L. sativa var. angustana, L. saligna, and L. virosa.

To anchor contigs, 251.37 Gb of clean read pairs from the Hi-C library were mapped to the polished Green Elegance genome using BWA (bwa-0.7.17) with the default parameters. Invalid read pairs, such as those arising from self-ligation, non-ligation, PCR amplification, and random breaks, were filtered out. After correction and filtration, we obtained 77 high-accuracy scaffolds with a scaffold N50 length of 320.76 Mb and a total scaffold length of 2,590.68 Mb (Table 2). We successfully anchored 2,590.61 Mb (approximately 100%) of the genome into nine groups, which were designated as the nine chromosomes of Green Elegance, using the agglomerative hierarchical clustering method in Lachesis19 (Fig. 3). Lachesis was then used to order and orient the clustered contigs. A total of 2,580.61 Mb (99.61%) was successfully ordered and oriented on the nine chromosomes (Table 3). The Hi-C contact heatmap, generated using HiCExplorer v3.720, revealed nine distinct groups based on interaction intensities between bins (bin size of 800 Kb), indicating high-quality chromosome construction (Fig. 4). The final chromosome-level assembly had chromosome lengths ranging from 205,466,188 bp to 407,155,607 bp, encompassing 99.6% of the total sequence (Table 3). After gap filling with ONT sequencing data, 11 gaps remained across eight chromosomes, with one chromosome being gap-free. Fourteen telomeres, including 11 complete telomeres longer than 1 Kb, were distributed across the nine chromosomes (Fig. 5, Tables 2 and 3). This genome assembly of L. sativa var. crispa cv. Green Elegance represents a significant improvement in genome continuity (contig N50), gap number, and chromosome anchoring compared to the other sequenced Lactuca plants, including L. sativa var. capitata cv. Salinas, L. sativa var. angustana cv. Yanling1, L. saligna, and L. virosa (Table 2).
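The binning step that underlies a Hi-C contact heatmap can be sketched as follows. This is a simplified illustration and not the HiCExplorer implementation: each valid read pair is assigned to a pair of fixed-size genomic bins (800 Kb, as stated above), and contacts per bin pair are counted. The function names, the input layout, and the toy read pairs are assumptions made for the example.

```python
from collections import Counter

BIN_SIZE = 800_000  # 800 Kb, the bin size stated in the text


def bin_contacts(pairs, bin_size=BIN_SIZE):
    """Count contacts per bin pair.

    pairs: iterable of ((chrom_a, pos_a), (chrom_b, pos_b)) mapped read pairs.
    Returns a Counter keyed by an order-independent pair of (chrom, bin index).
    """
    counts = Counter()
    for (chrom_a, pos_a), (chrom_b, pos_b) in pairs:
        bin_a = (chrom_a, pos_a // bin_size)
        bin_b = (chrom_b, pos_b // bin_size)
        # Sort so that (A, B) and (B, A) land in the same matrix cell.
        counts[tuple(sorted((bin_a, bin_b)))] += 1
    return counts


def intra_fraction(counts):
    """Fraction of contacts whose two ends fall on the same chromosome."""
    total = sum(counts.values())
    intra = sum(n for (a, b), n in counts.items() if a[0] == b[0])
    return intra / total if total else 0.0


# Toy read pairs, not real data.
toy_pairs = [
    (("chr1", 100), ("chr1", 900_000)),
    (("chr1", 850_000), ("chr1", 50)),
    (("chr1", 10), ("chr2", 10)),
    (("chr2", 1_700_000), ("chr2", 1_650_000)),
]
matrix = bin_contacts(toy_pairs)
print(matrix)
print(intra_fraction(matrix))
```

In a well-ordered assembly, intra-chromosomal contacts are expected to dominate and to concentrate near the diagonal of each chromosome block, which is the pattern the heatmap is inspected for.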

Fig. 3

Circos plot of the L. sativa var. crispa cv. Green Elegance genome assembly. Chromosome ideograms (a), transposable element (TE) density (b), simple sequence repeat (SSR) density (c), gene density (d), and GC content (e) are shown from the outermost to the innermost layer.

Table 3 Statistics of the L. sativa var. crispa chromosomes after assembly by Hi-C and gap filling.
Fig. 4

Hi-C heatmap of L. sativa var. crispa cv. Green Elegance. The intra-chromosome interaction density is represented by varying colors and was calculated using a bin size of 800 Kb.

Fig. 5

The distribution of telomeres and genomic gaps on nine chromosomes of L. sativa var. crispa cv. Green Elegance. After assembly and gap filling, there are 9 centromeres, 14 telomeres and 11 gaps across the nine chromosomes. Gene density (bin size = 1 Mb) is also shown.

Repetitive sequences annotation

Transposable elements (TEs) were identified using a combination of homology-based and de novo approaches. A de novo repeat library was first constructed with RepeatModeler (http://www.repeatmasker.org/RepeatModeler/)21. Full-length long terminal repeat retrotransposons (FL-LTR-RTs) were identified using LTRharvest (v1.5.9)22 and LTR_finder (v2.8)23, and a high-quality library was produced with LTR_retriever24. The de novo TE sequence library and known TE sequences from the Dfam (v3.5) database were combined to create the final TE sequence set for the Green Elegance genome, which was classified using RepeatMasker (v4.12)25. Tandem repeats were annotated using Tandem Repeats Finder (TRF v4.09)26 and the MIcroSAtellite identification tool (MISA v2.1)27 with the default parameters (definition: 1-10 2-6 3-5 4-5 5-5 6-5; interruptions: 100). In total, transposable elements and tandem repeats accounted for 77.11% and 4.14% of the Green Elegance genome sequence, respectively, amounting to 2.00 Gb and 107.58 Mb (Table 4).
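The MISA definition string above pairs each repeat-unit length with a minimum number of tandem copies (for example, mononucleotide units repeated at least 10 times and dinucleotide units at least 6 times). The sketch below illustrates that thresholding with regular expressions. It is a simplified approximation and not MISA itself: it does not merge compound SSRs (the "interruptions" setting), and because the scan is non-overlapping it may miss repeats that start inside a rejected match. The function names and the toy sequence are hypothetical.

```python
import re

# unit length -> minimum number of tandem copies, from the definition in the text
SSR_DEFINITION = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def is_primitive(motif):
    """True if the motif is not itself a repetition of a shorter unit (e.g. 'AA')."""
    n = len(motif)
    return not any(
        n % k == 0 and motif == motif[:k] * (n // k) for k in range(1, n)
    )


def find_ssrs(seq, definition=SSR_DEFINITION):
    """Return sorted (start, end, motif, copies) tuples, 0-based half-open."""
    seq = seq.upper()
    hits = []
    for unit_len, min_copies in sorted(definition.items()):
        # One captured unit followed by at least (min_copies - 1) exact copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_copies - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            if is_primitive(motif):
                copies = (match.end() - match.start()) // unit_len
                hits.append((match.start(), match.end(), motif, copies))
    return sorted(hits)


# Toy sequence, not real data: a poly-A run of 10 and an (AG)6 repeat.
toy = "GG" + "A" * 10 + "CC" + "AG" * 6 + "TT"
for hit in find_ssrs(toy):
    print(hit)
```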

Table 4 Statistics of repetitive element annotation.

Gene prediction and functional annotation of protein-coding genes

Three approaches (de novo prediction, homology search, and transcript-based assembly) were integrated for annotating protein-coding genes in the genome (Table 5). De novo gene models were predicted using two ab initio gene-prediction software tools, Augustus (v3.1.0)28 and SNAP (Korf, 2004). For homology-based prediction, GeMoMa (v1.7) was used with reference gene models from various species, including Arabidopsis thaliana, Oryza sativa, L. sativa var. capitata cv. Salinas, L. sativa var. angustana, L. serriola, L. virosa, Helianthus annuus, Taraxacum kok-saghyz, and Artemisia annua. For transcript-based prediction, RNA-sequencing data were mapped to the reference genome using Hisat (v2.1.0)29 and assembled with Stringtie (v 2.1.4)17. GeneMarkS-T (v5.1) was used to predict genes based on these assembled transcripts. Additionally, PASA (v2.4.1) was employed to predict genes based on unigenes and full-length transcripts from PacBio/ONT sequencing assembled by Trinity (v2.11)30. Gene models from these approaches were integrated using EVM (v1.1.1) and updated with PASA. In total, 41,375 protein-coding genes with an average length of 3,744 bp were predicted in the Green Elegance genome (Table 6).
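Summary statistics such as gene number and average gene length can be recomputed from an annotation file. The sketch below assumes the annotation is in standard GFF3 format, where column 3 holds the feature type and columns 4 and 5 hold 1-based inclusive start and end coordinates. The function name and the toy records are hypothetical and do not reproduce the released annotation.

```python
def gene_length_stats(gff3_text):
    """Return (number of gene features, mean gene length in bp) from GFF3 text."""
    lengths = []
    for line in gff3_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        fields = line.split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue  # only count top-level gene features
        start, end = int(fields[3]), int(fields[4])
        lengths.append(end - start + 1)  # GFF3 coordinates are 1-based, inclusive
    mean = sum(lengths) / len(lengths) if lengths else 0.0
    return len(lengths), mean


# Toy annotation, not real data.
toy_gff3 = "\n".join([
    "##gff-version 3",
    "\t".join(["chr1", "demo", "gene", "1", "1000", ".", "+", ".", "ID=g1"]),
    "\t".join(["chr1", "demo", "mRNA", "1", "1000", ".", "+", ".", "ID=g1.t1;Parent=g1"]),
    "\t".join(["chr1", "demo", "gene", "2001", "2500", ".", "-", ".", "ID=g2"]),
])
print(gene_length_stats(toy_gff3))
```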

Table 5 Statistics of gene number by different annotation methods.
Table 6 Summary of gene annotation.

Gene functions were inferred by aligning predicted proteins to the National Center for Biotechnology Information (NCBI) Non-Redundant (NR), EggNOG31, KOG, TrEMBL32, InterPro33 and Swiss-Prot32 protein databases using Diamond blastp (diamond v2.0.4.142) and the Kyoto Encyclopedia of Genes and Genomes (KEGG) database34 with an E-value threshold of 1E-5. Protein domains were annotated with InterProScan (v5.34-73.0)35, while motifs and domains within gene models were identified using the Pfam database36. Gene Ontology (GO) IDs for each gene were obtained from TrEMBL, InterPro and EggNOG. In total, 41,004 (99.10%) of the predicted protein-coding genes in Green Elegance could be functionally annotated with known genes, conserved domains, and Gene Ontology terms (Table 6). This annotation ratio (99.10%) is the highest among the five Lactuca genomes compared, the other four being L. sativa var. capitata cv. Salinas, L. sativa var. angustana, L. saligna, and L. virosa (Table 6).
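The E-value filtering step can be illustrated with a short sketch. It assumes the alignments are available in the 12-column BLAST-style tabular format (query, subject, identity, length, mismatches, gap opens, query start/end, subject start/end, E-value, bit score); whether the authors used this output format is not stated, so this is an assumption. The function names and the toy hits are hypothetical. The sketch keeps, for each query, the highest-scoring hit that passes the 1E-5 threshold and reports the fraction of genes with at least one such hit.

```python
def best_hits(tabular_text, max_evalue=1e-5):
    """Return {query: (subject, evalue, bitscore)} for hits passing the threshold."""
    best = {}
    for line in tabular_text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue  # discard hits weaker than the E-value threshold
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


def annotated_percent(best, total_genes):
    """Percentage of genes with at least one retained hit."""
    return 100.0 * len(best) / total_genes


# Toy alignments, not real data.
rows = [
    ["g1", "P1", "90.0", "100", "5", "0", "1", "100", "1", "100", "1e-30", "200"],
    ["g1", "P2", "95.0", "100", "3", "0", "1", "100", "1", "100", "1e-40", "250"],
    ["g2", "P3", "40.0", "50", "30", "0", "1", "50", "1", "50", "1e-3", "35"],
    ["g3", "P4", "60.0", "80", "20", "0", "1", "80", "1", "80", "1e-6", "60"],
]
hits = best_hits("\n".join("\t".join(r) for r in rows))
print(hits)
print(annotated_percent(hits, total_genes=4))
```

Applied to the counts reported above, the same ratio gives 41,004 / 41,375 ≈ 99.10%.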

Whole genome synteny analysis

For synteny analysis, four other Lactuca genome assemblies (L. sativa var. capitata, L. sativa var. angustana, L. virosa and L. saligna) were aligned to the L. sativa var. crispa genome using Mummer (v 4.0)37 with the parameters -c 500 -b 500 -l 100 --maxmatch (Fig. 6). Raw alignment results were filtered using delta-filter with the parameters -1 -i 90 -l 500. Syntenic blocks were identified with MCScanX38 using the parameter -s 15 (number of genes required to call a collinear block) and visualized using jcvi v1.2.839 with the parameter --minspan=30.

Fig. 6

Whole genome synteny among five genome assemblies in the genus Lactuca. Conserved syntenic blocks are highlighted with lines of different colors, corresponding to the nine pseudo-chromosomes.

Data Records

The raw genomic sequencing data used for genome assembly are available in the Genome Sequence Archive (GSA)40 in the National Genomics Data Center (NGDC), Beijing Institute of Genomics (China National Center for Bioinformation)41, Chinese Academy of Sciences (https://bigd.big.ac.cn/gsa). The accession number CRA01487342 covers genome survey data, transcriptomic sequencing data, PacBio HiFi sequencing data, ONT sequencing data, and Hi-C sequencing data. The genome assembly and annotation files are available in the Genome Warehouse (GWH)43 in NGDC (accession number GWHERDY0000000044), GenBank (JBFTWI000000000)45 and Figshare (https://doi.org/10.6084/m9.figshare.25116548)46.

Technical Validation

To evaluate the completeness of the L. sativa var. crispa cv. Green Elegance (version 1.2) assembly, Illumina short-read and PacBio long-read data were mapped back to the assembly. The alignments were analyzed using Qualimap v.2.2.2. The mapping rates were 99.75% (average 48× coverage) for Illumina short reads and 99.85% (average 42× coverage) for PacBio long reads. BUSCO v5.2.247 with OrthoDB was used to assess genome completeness. In the genome synteny analysis, L. sativa var. crispa, L. sativa var. capitata and L. sativa var. angustana showed high conservation, providing compelling evidence that the gross genome structure has been accurately assembled (Fig. 6). We observed relatively few genomic arrangement ambiguities in the Hi-C contact heat map, though with some discontinuities, which were probably caused by highly repetitive sequences. Visual inspection of the Hi-C map also revealed that some points of ambiguity appeared to be centromeres, likely due to sequence similarity in these regions. Meanwhile, clear antidiagonals were also observed for several chromosomes in the Hi-C contact heat map, such as chromosome 7 and chromosome 8. Such a pattern may suggest a Rabl configuration of the chromosomes, which could be validated in future cytological investigations. Overall, 98.39% of BUSCOs were complete and 0.50% were fragmented in the assembled genome (Table 7). CEGMA (Core Eukaryotic Genes Mapping Approach) (v2.5) analysis showed that 99.78% (457 CEGs, Core Eukaryotic Genes) of CEGMA genes were present in the genome48. The LTR Assembly Index (LAI)49 of 17.34 indicated a high-quality genome assembly for L. sativa var. crispa cv. Green Elegance, with better continuity and completeness compared to other Lactuca species (Table 7). The higher ratio of complete BUSCOs and the higher LAI value, compared to the other four Lactuca genomes, indicate the superior quality of the genome assembly for L. sativa var. crispa cv. Green Elegance (v1.2).
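The completeness percentages above are simple ratios of recovered to expected marker genes, and can be recomputed as in the sketch below. The function names and the BUSCO-style toy counts are hypothetical, and the CEGMA denominator of 458 core genes is an assumption based on the size of the standard CEGMA gene set; it is not stated in the text. Under that assumption, 457 of 458 genes reproduces the reported 99.78%.

```python
def percent(part, total):
    """Percentage of `part` in `total`, rounded to two decimals."""
    return round(100.0 * part / total, 2)


def completeness_summary(complete, fragmented, total):
    """Break a marker-gene assessment into complete, fragmented and missing percentages."""
    missing = total - complete - fragmented
    return {
        "complete": percent(complete, total),
        "fragmented": percent(fragmented, total),
        "missing": percent(missing, total),
    }


# CEGMA check; the total of 458 is an assumption (see lead-in).
print(percent(457, 458))

# Toy BUSCO-style counts, not the counts of this assembly.
print(completeness_summary(complete=980, fragmented=5, total=1000))
```

With 98.39% complete and 0.50% fragmented BUSCOs, the remaining fraction of missing BUSCOs in this assembly is about 1.11%.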

Table 7 BUSCO and LAI assessments of L. sativa var. crispa, L. sativa var. capitata, L. sativa var. angustana, L. saligna, and L. virosa.