A near-complete chromosome-level genome assembly of looseleaf lettuce (Lactuca sativa var. crispa)

Zhang, Bin; Xue, Yingfei; Liu, Xue; Ding, Haifeng; Yang, Yesheng; Wang, Chenchen; Xu, Zhaoyang; Zhou, Jun; Sun, Cheng; Tang, Jinfu; Li, Dayong

doi:10.1038/s41597-024-03830-y

Data Descriptor
Open access
Published: 04 September 2024

Volume 11, article number 961, (2024)

Scientific Data

Bin Zhang^1,2,
Yingfei Xue³,
Xue Liu ORCID: orcid.org/0000-0002-7600-5647^1,2,4,
Haifeng Ding^1,5,
Yesheng Yang⁵,
Chenchen Wang⁶,
Zhaoyang Xu⁶,
Jun Zhou ORCID: orcid.org/0000-0003-3131-7804⁶,
Cheng Sun³,
Jinfu Tang⁴ &
Dayong Li^1,2,7,8


Abstract

Lettuce (Lactuca sativa L., Asteraceae) is one of the most important vegetable crops, known for its various horticultural types and significant morphological variation. The first reference genome of lettuce, a crisphead type (L. sativa var. capitata cv. Salinas), was previously released. Here, we report a near-complete chromosome-level reference genome for looseleaf lettuce (L. sativa var. crispa). PacBio high-fidelity sequencing, Oxford Nanopore, and Hi-C technologies were employed to produce the genome assembly. The final assembly is 2.59 Gb in length with a contig N50 of 205.47 Mb, anchored onto nine chromosomes, containing 14 recognizable telomeres and only 11 gaps. Repetitive sequences account for 77.11% of the genome, and 41,375 protein-coding genes were predicted, with 99.10% of these assigned functional annotations. This chromosome-level genome enriches the genomic resources available for the various horticultural types of lettuce and will facilitate the characterization of morphological variation and genetic improvement in lettuce.

Background & Summary

Lactuca sativa L. (Asteraceae), known as lettuce, is considered one of the most important vegetable crops^1,2,3,4,5. Originating in the coastal Mediterranean regions, lettuce was featured in Egyptian tombs around 2,500 BC^2,3. Today, lettuce is cultivated as diverse horticultural varieties for different purposes, including leafy types (looseleaf, crisphead, romaine, and butterhead) and non-leafy types (stem and oilseed), each with distinct morphological characteristics^6,7. Leafy lettuces, particularly looseleaf and crisphead, are consumed globally in salads and hamburgers, and are also popular in hotpot cuisine in China and grilled with red meat in other parts of Asia. Compared to crisphead lettuce, looseleaf lettuce grows faster, can be harvested earlier, and tolerates abiotic stress better. Thus, looseleaf lettuce is an important horticultural type for the annual leafy vegetable supply, and genomic research could greatly enhance its economic value.

A high-quality reference genome is crucial for identifying genetic variations, conducting phylogenetic research, and facilitating molecular marker-assisted breeding. Lettuce is a representative species of the genus Lactuca in the Asteraceae family. Its first reference genome, for a crisphead type (L. sativa var. capitata cv. Salinas), was released in 2017, with a genome size of 2.38 Gb and a contig N50 of 36 Kb⁸. With advancements in sequencing technology and broader use of Lactuca species, additional genome assemblies have been published, including those for two wild relatives (L. saligna and L. virosa) and one stem lettuce (L. sativa var. angustana cv. Yanling1)^9,10,11. Although these data are useful for identifying intraspecific variation, only two chromosome-level genome assemblies of cultivated lettuce (the crisphead and stem types) have been generated to date^8,11. A single reference genome, or a limited number of them, is insufficient for exploring the genetic diversity of an economically important crop, which hinders genomic research and molecular breeding^12,13. A high-quality genome assembly for the looseleaf type is therefore needed for identifying genetic variations, inferring phylogenetic relationships among different horticultural types, and facilitating comparative genomic analysis and genetic improvement in lettuce.

In this study, we generated a chromosome-level and near-complete reference genome assembly for looseleaf lettuce (L. sativa var. crispa cv. Green Elegance) using PacBio high-fidelity reads (~46×), Oxford Nanopore reads (~13×), Illumina short reads (~50.39×), and Hi-C reads (~97×). The assembled genome (Green Elegance) had a total length of 2.59 Gb, with a contig N50 of 205.47 Mb and a BUSCO completeness score of 98.39%. A total of 2,580.61 Mb (99.61%) of the genome sequences were anchored to nine chromosomes, featuring 14 recognizable telomeres and 11 gaps. Genome annotation predicted 41,375 protein-coding genes and 77.11% repetitive sequences. These genomic resources provide a roadmap for further genetic and evolutionary investigation.

Methods

Sample collection, library construction and sequencing

Looseleaf lettuce (Lactuca sativa var. crispa cv. Green Elegance) was provided by the Beijing Vegetable Research Center, Beijing Academy of Agriculture and Forestry Science, Beijing, China (Fig. 1). The seedlings were grown in a growth chamber at the Beijing Vegetable Research Center under a photoperiod of 16-hour light (200 μmol m⁻² s⁻¹) and 8-hour dark at 25 °C. Fresh and healthy leaves were collected at the rosette stage and immediately frozen in liquid nitrogen for genome survey and sequencing (Table 1). For transcriptomic sequencing, samples included mature leaves, young seedlings (including roots), and inflorescence (Table 1). Newly developed tender leaves, maintained under moist and low-temperature conditions, were used to construct the Hi-C library (Table 1).

Table 1 Summary of the sequencing data generated for the looseleaf lettuce (L. sativa var. crispa cv. Green Elegance) genome assembly.

High molecular weight genomic DNA was extracted from leaves using a modified CTAB (cetyltrimethylammonium bromide) method¹⁴. RNA was removed by adding RNase A. The quality of the DNA was assessed using agarose gel electrophoresis, which confirmed excellent integrity of the DNA molecules.

For Illumina sequencing, a short-read library with an average insert size of 350 bp was constructed and sequenced on an Illumina NovaSeq platform (Illumina, CA, USA) in PE150 mode, yielding 135.8 Gb of raw data. After filtering, 124.16 Gb (50.39×) of clean reads were obtained for genome size estimation, sequence correction, and assessment of heterozygosity and repeat content (Table 1 and Fig. 2).

For PacBio HiFi sequencing, genomic DNA was fragmented to ~15 Kb to construct a long-read library following the manufacturer’s instructions (Pacific Biosciences, CA, USA). The SMRTbell library was constructed using the SMRTbell Express Template Prep Kit 2.0 (Pacific Biosciences). Library size and quantity were assessed using the FEMTO Pulse (Agilent Technologies, Wilmington, DE) and the Qubit dsDNA HS Assay Kit (Life Technologies, Carlsbad, CA, USA). The library was loaded at a concentration of 55 pM using diffusion loading, and single-molecule real-time (SMRT) sequencing was conducted in Circular Consensus Sequencing (CCS) mode on a single 8 M SMRT Cell on the PacBio Sequel II System. After filtering out low-quality reads and adapter sequences, we obtained 109.27 Gb (46×) of clean reads with a read-length N50 of 17.81 Kb.

Long-read sequencing on the PromethION platform from Oxford Nanopore Technologies (ONT) was performed to fill assembly gaps. High-quality genomic DNA was fragmented to ~8 Kb using a gTube, and the library was constructed with the Ligation Sequencing Kit 1D (Nanopore, SQK-LSK109). We generated 85.80 Gb of raw data. After filtering out adapters, low-quality reads, and reads shorter than 2 Kb, 32.55 Gb of clean reads with an N50 length of 100,148 Kb were obtained (Table 1).

Genome size and heterozygosity estimation

Illumina short reads were quality-filtered using fastp¹⁵ (v0.12.4; settings ‘-q 10 -u 50 -y -g -Y 10 -e 20 -l 100 -b 150 -B 150’), and the filtered reads were used for genome size estimation. We counted 21-mers with Jellyfish¹⁶ (v2.1.4; k-mer size 21) and used GenomeScope¹⁷ (v2.0; default settings) to estimate a genome size of 2.46 Gb and a genome-wide heterozygosity rate of 0.21% (Fig. 2).
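The idea behind k-mer-based genome size estimation can be illustrated with a minimal Python sketch. It is not the Jellyfish or GenomeScope implementation: GenomeScope fits a mixture model to the k-mer spectrum, whereas this sketch uses the simpler rule "total k-mers divided by the depth of the main peak". The function names and the `min_depth` error cutoff are assumptions made for illustration only.

```python
from collections import Counter


def count_kmers(reads, k=21):
    """Count every k-mer of length k across an iterable of read strings."""
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:  # skip k-mers containing ambiguous bases
                counts[kmer] += 1
    return counts


def kmer_histogram(counts):
    """Map multiplicity -> number of distinct k-mers with that multiplicity."""
    return Counter(counts.values())


def estimate_genome_size(histogram, min_depth=2):
    """Estimate genome size as total k-mers divided by the main peak depth.

    Multiplicities below min_depth are treated as sequencing errors
    (an illustrative cutoff, not a value taken from the study).
    """
    usable = {d: n for d, n in histogram.items() if d >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    peak_depth = max(usable, key=lambda d: usable[d])
    total_kmers = sum(d * n for d, n in usable.items())
    return total_kmers // peak_depth
```

Unlike Jellyfish run in canonical mode, this sketch does not merge a k-mer with its reverse complement, and it holds all counts in memory, so it is only suitable for toy inputs.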

De novo genome assembly

High-accuracy Circular Consensus Sequencing (CCS) data were assembled with hifiasm (v0.16)¹⁸ into 156 contigs, with a longest contig of 282.47 Mb and a contig N50 of 205.47 Mb (Table 2). The total assembly size was 2.59 Gb.
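For readers unfamiliar with the N50 statistic reported here, the following short Python sketch shows the standard definition: the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the total assembly length. The function name is illustrative and the sketch is independent of hifiasm.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest sequence down until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length
```

For example, contigs of lengths 10, 8, 6, 4 and 2 sum to 30; the two longest already cover 18, which exceeds 15, so the N50 is 8.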

Table 2 Comparison of genome assemblies of L. sativa var. crispa, L. sativa var. capitata, L. sativa var. angustana, L. saligna, and L. virosa.

To anchor contigs, 251.37 Gb of clean read pairs from the Hi-C library were mapped to the polished Green Elegance genome using BWA (bwa-0.7.17) with the default parameters. Invalid read pairs, such as those arising from self-ligation, non-ligation, PCR amplification, and random breaks, were filtered out. After correction and filtration, we obtained 77 high-accuracy scaffolds with a scaffold N50 length of 320.76 Mb and a total scaffold length of 2,590.68 Mb (Table 2). Using the agglomerative hierarchical clustering method in Lachesis¹⁹, we anchored 2,590.61 Mb (100%) of the genome into nine groups, which were designated as the nine chromosomes of Green Elegance (Fig. 3). Lachesis was then used to order and orient the clustered contigs. A total of 2,580.61 Mb (99.61%) was successfully ordered and oriented on the nine chromosomes (Table 3).

The Hi-C contact heatmap, generated using HiCExplorer v3.7²⁰, revealed nine distinct groups based on interaction intensities between bins (bin size of 800 Kb), indicating high-quality chromosome construction (Fig. 4). The chromosomes in the final chromosome-level assembly ranged in length from 205,466,188 bp to 407,155,607 bp and encompassed 99.6% of the total sequence (Table 3). After gap filling with ONT sequencing data, 11 gaps remained across eight chromosomes, with one chromosome being complete. Fourteen telomeres, including 11 complete telomeres longer than 1 Kb, were distributed across the nine chromosomes (Fig. 5, Tables 2 and 3).

This genome assembly of L. sativa var. crispa cv. Green Elegance represents a significant improvement in genome continuity (contig N50), gap number, and chromosome anchoring compared to the other sequenced Lactuca plants, including L. sativa var. capitata cv. Salinas, L. sativa var. angustana cv. Yanling1, L. saligna, and L. virosa (Table 2).
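A Hi-C contact heatmap of this kind is built by binning the mapped positions of read pairs. The Python sketch below shows that binning step for a single chromosome at the 800 Kb bin size mentioned above. It is a simplified illustration, not the HiCExplorer implementation: the function name and the input format (a list of coordinate pairs) are assumptions, and no normalization or filtering is applied.

```python
def contact_matrix(pairs, chrom_length, bin_size=800_000):
    """Build a symmetric intra-chromosomal contact matrix.

    pairs: iterable of (pos1, pos2), the 0-based mapped coordinates of the
    two ends of each Hi-C read pair on the same chromosome.
    """
    n_bins = -(-chrom_length // bin_size)  # ceiling division
    matrix = [[0] * n_bins for _ in range(n_bins)]
    for pos1, pos2 in pairs:
        i, j = pos1 // bin_size, pos2 // bin_size
        matrix[i][j] += 1
        if i != j:  # mirror off-diagonal contacts to keep the matrix symmetric
            matrix[j][i] += 1
    return matrix
```

In such a matrix, a correctly ordered chromosome shows its strongest signal along the main diagonal, because loci that are close in the linear sequence contact each other most often.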

Table 3 Statistics of the L. sativa var. crispa chromosomes after assembly by Hi-C and gap filling.

Repetitive sequences annotation

Transposable elements (TEs) were identified using a combination of homology-based and de novo approaches. A de novo repeat library was first constructed with RepeatModeler (http://www.repeatmasker.org/RepeatModeler/)²¹. Full-length long terminal repeat retrotransposons (FL-LTR-RTs) were identified using LTRharvest (v1.5.9)²² and LTR_finder (v2.8)²³, and a high-quality library was produced with LTR_retriever²⁴. The de novo TE sequence library and known TE sequences from the Dfam (v3.5) database were combined to create the final TE sequence set for the Green Elegance genome, which was classified using RepeatMasker (v4.12)²⁵. Tandem repeats were annotated using Tandem Repeats Finder (TRF 409)²⁶ and the MIcroSAtellite identification tool (MISA v2.1)²⁷ with the default parameters (definition: 1-10 2-6 3-5 4-5 5-5 6-5; interruptions: 100). In total, transposable elements and tandem repeats accounted for 77.11% and 4.14% of the Green Elegance genome sequence, respectively, amounting to 2.00 Gb and 107.58 Mb (Table 4).
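The MISA definition string pairs each repeat-unit size with a minimum number of repeats: "1-10" requires a mononucleotide to occur at least 10 times in a row, "2-6" requires a dinucleotide to occur at least 6 times, and so on. The Python sketch below applies those thresholds to find perfect microsatellites with regular expressions. It is an illustration of the thresholds only, not MISA itself; the function name is hypothetical, and the "interruptions" setting (the maximum distance between two repeats that are merged into a compound microsatellite) is not implemented.

```python
import re

# Unit size -> minimum number of repeats, mirroring the definition quoted above.
MISA_DEFINITION = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def find_ssrs(sequence, definition=MISA_DEFINITION):
    """Return (start, end, unit, repeats) for perfect microsatellites.

    Coordinates are 0-based and end-exclusive.
    """
    seq = sequence.upper()
    hits = []
    for unit_size, min_repeats in sorted(definition.items()):
        # One captured unit followed by at least (min_repeats - 1) copies of it.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_size, min_repeats - 1))
        for match in pattern.finditer(seq):
            unit = match.group(1)
            # Skip units that are repeats of a shorter motif, e.g. "AA" or "ATAT".
            if any(unit == unit[:d] * (unit_size // d)
                   for d in range(1, unit_size) if unit_size % d == 0):
                continue
            repeats = len(match.group(0)) // unit_size
            hits.append((match.start(), match.end(), unit, repeats))
    return sorted(hits)
```

Filtering out units such as "AA" keeps a long mononucleotide run from being reported a second time as a dinucleotide repeat.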

Table 4 Statistics of repetitive element annotation.

Gene prediction and functional annotation of protein-coding genes

Three approaches (de novo prediction, homology search, and transcript-based assembly) were integrated to annotate protein-coding genes in the genome (Table 5). De novo gene models were predicted using two ab initio gene-prediction tools, Augustus (v3.1.0)²⁸ and SNAP (Korf, 2004). For homology-based prediction, GeMoMa (v1.7) was used with reference gene models from several species, including Arabidopsis thaliana, Oryza sativa, L. sativa var. capitata cv. Salinas, L. sativa var. angustana, L. serriola, L. virosa, Helianthus annuus, Taraxacum kok-saghyz, and Artemisia annua. For transcript-based prediction, RNA-sequencing data were mapped to the reference genome using HISAT (v2.1.0)²⁹ and assembled with StringTie (v2.1.4)¹⁷. GeneMarkS-T (v5.1) was used to predict genes based on these assembled transcripts. Additionally, PASA (v2.4.1) was employed to predict genes based on unigenes and full-length transcripts from PacBio/ONT sequencing assembled by Trinity (v2.11)³⁰. Gene models from these approaches were integrated using EVM (v1.1.1) and updated with PASA. In total, 41,375 protein-coding genes with an average length of 3,744 bp were predicted in the Green Elegance genome (Table 6).
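Summary statistics such as gene count and average gene length can be recomputed from a GFF3 annotation file. The Python sketch below is one way to do so, assuming a standard nine-column, tab-separated GFF3 file in which protein-coding loci have the feature type "gene"; the function names are hypothetical and the layout of the deposited annotation file has not been checked here.

```python
def gene_lengths(gff3_lines):
    """Yield the length of each feature of type 'gene' in GFF3 text lines."""
    for line in gff3_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue
        start, end = int(fields[3]), int(fields[4])
        yield end - start + 1  # GFF3 coordinates are 1-based and inclusive


def average_gene_length(gff3_lines):
    """Return (gene count, mean gene length in bp)."""
    lengths = list(gene_lengths(gff3_lines))
    if not lengths:
        raise ValueError("no gene features found")
    return len(lengths), sum(lengths) / len(lengths)
```

Passing an open file handle as `gff3_lines` streams the annotation line by line, so the whole file does not need to be read at once.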

Table 5 Statistics of gene number by different annotation methods.

Table 6 Summary of gene annotation.

Gene functions were inferred by aligning the predicted proteins against the National Center for Biotechnology Information (NCBI) Non-Redundant (NR), EggNOG³¹, KOG, TrEMBL³², InterPro³³, Swiss-Prot³² and Kyoto Encyclopedia of Genes and Genomes (KEGG)³⁴ databases using Diamond blastp (diamond v2.0.4.142) with an E-value threshold of 1E-5. Protein domains were annotated with InterProScan (v5.34-73.0)³⁵, while motifs and domains within gene models were identified using PFAM databases³⁶. Gene Ontology (GO) IDs for each gene were obtained from TrEMBL, InterPro and EggNOG. A total of 41,004 (99.10%) of the predicted protein-coding genes in Green Elegance could be functionally annotated with known genes, conserved domains, and Gene Ontology terms (Table 6). This annotation ratio is the highest among the five Lactuca genomes compared, the other four being L. sativa var. capitata cv. Salinas, L. sativa var. angustana, L. saligna, and L. virosa (Table 6).
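The E-value cutoff and the annotation ratio can be illustrated with a short Python sketch. It assumes DIAMOND's default tabular output (BLAST-style format 6, where the 11th and 12th columns hold the E-value and bit score) and keeps the highest-scoring hit per query; the function names are hypothetical, and whether the authors kept one hit or several per gene is not stated in the text.

```python
def best_hits(tabular_lines, max_evalue=1e-5):
    """Return {query: (subject, evalue, bitscore)} keeping the top-scoring hit.

    Expects 12-column, tab-separated lines: qseqid, sseqid, pident, length,
    mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore.
    """
    best = {}
    for line in tabular_lines:
        if not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue  # hit is weaker than the chosen threshold
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


def annotation_rate(n_annotated, n_genes):
    """Percentage of genes that received at least one functional annotation."""
    return 100.0 * n_annotated / n_genes
```

With the figures reported above, `annotation_rate(41004, 41375)` evaluates to roughly 99.10, matching the stated ratio.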

Whole genome synteny analysis

For synteny analysis, the genome assemblies of four other Lactuca accessions (L. sativa var. capitata, L. sativa var. angustana, L. virosa and L. saligna) were aligned to the L. sativa var. crispa genome using MUMmer (v4.0)³⁷ with the parameters -c 500 -b 500 -l 100 --maxmatch (Fig. 6). Raw alignment results were filtered using delta-filter with the parameters -1 -i 90 -l 500. Syntenic blocks were identified with MCScanX³⁸ using the parameter -s 15 (number of genes required to call a collinear block) and visualized using jcvi v1.2.8³⁹ with the parameter --minspan=30.
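The effect of the identity and length thresholds passed to delta-filter can be sketched in Python as follows. This is an illustration of the two cutoffs only (minimum 90% identity, minimum 500 bp alignment length): the `Alignment` record and function name are hypothetical, the input is assumed to be already parsed, and the one-to-one selection performed by the `-1` option is not reproduced.

```python
from typing import NamedTuple


class Alignment(NamedTuple):
    ref_start: int   # 1-based start on the reference
    ref_end: int     # 1-based end on the reference (inclusive)
    identity: float  # percent identity of the alignment


def filter_alignments(alignments, min_identity=90.0, min_length=500):
    """Keep alignments that meet both the identity and the length cutoff."""
    kept = []
    for aln in alignments:
        # abs() handles alignments reported in reverse orientation.
        length = abs(aln.ref_end - aln.ref_start) + 1
        if aln.identity >= min_identity and length >= min_length:
            kept.append(aln)
    return kept
```

Short or divergent matches are common in a genome that is about 77% repeats, so cutoffs like these remove much of the repeat-driven noise before syntenic blocks are drawn.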

Data Records

The raw genomic sequencing data used for genome assembly are available in the Genome Sequence Archive (GSA)⁴⁰ in the National Genomics Data Center (NGDC), Beijing Institute of Genomics (China National Center for Bioinformation)⁴¹, Chinese Academy of Sciences (https://bigd.big.ac.cn/gsa). The accession number CRA014873⁴² covers genome survey data, transcriptomic sequencing data, PacBio HiFi sequencing data, ONT sequencing data, and Hi-C sequencing data. The genome assembly and annotation files are available in the Genome Warehouse (GWH)⁴³ in NGDC (accession number GWHERDY00000000⁴⁴), GenBank (JBFTWI000000000)⁴⁵ and Figshare (https://doi.org/10.6084/m9.figshare.25116548)⁴⁶.

Technical Validation

To evaluate the completeness of the L. sativa var. crispa cv. Green Elegance (version 1.2) assembly, Illumina short reads and PacBio long reads were mapped back to the assembly, and the alignments were analyzed using Qualimap v.2.2.2. The mapping rates were 99.75% (average 48× coverage) for Illumina short reads and 99.85% (average 42× coverage) for PacBio long reads.

In the genome synteny analysis, L. sativa var. crispa, L. sativa var. capitata and L. sativa var. angustana showed high conservation, providing strong evidence that the gross genome structure has been accurately assembled (Fig. 6). We observed relatively few genomic arrangement ambiguities in the Hi-C contact heat map, although there were some discontinuities, which were probably caused by highly repetitive sequences. Visual inspection of the Hi-C map also revealed that some points of ambiguity appeared to be centromeres, likely due to sequence similarity in these regions. Clear antidiagonals were also observed in the Hi-C contact heat map for several chromosomes, such as chromosome 7 and chromosome 8. Such a pattern may suggest a Rabl configuration of the chromosomes, which could be validated in future cytological investigations.

BUSCO v5.2.2⁴⁷ with OrthoDB was used to assess genome completeness. Overall, 98.39% of BUSCOs were complete and 0.50% were fragmented in the assembled genome (Table 7). CEGMA (Core Eukaryotic Genes Mapping Approach) (v2.5) analysis showed that 99.78% (457) of the core eukaryotic genes (CEGs) were present in the genome⁴⁸. The LTR Assembly Index (LAI)⁴⁹ of 17.34 indicated a high-quality genome assembly for L. sativa var. crispa cv. Green Elegance, with better continuity and completeness compared to other Lactuca species (Table 7). The higher proportion of complete BUSCOs and the higher LAI value, compared to the other four Lactuca genomes, indicate the superior quality of the genome assembly for L. sativa var. crispa cv. Green Elegance (v1.2).

Table 7 BUSCO and LAI assessments of L. sativa var. crispa, L. sativa var. capitata, L. sativa var. angustana, L. saligna, and L. virosa.

Code availability

No custom code was used for this study. All data analyses were conducted using published bioinformatics software with default settings unless otherwise specified.

References

Wei, T. et al. Whole-genome resequencing of 445 Lactuca accessions reveals the domestication history of cultivated lettuce. Nat. Genet. 53, 752–760, https://doi.org/10.1038/s41588-021-00831-0 (2021).
Lindqvist, K. On the origin of cultivated lettuce. Hereditas 46, 319–350, https://doi.org/10.1111/j.1601-5223.1960.tb03091.x (1960).
de Vries, I. M. Origin and domestication of Lactuca sativa L. Genet. Resour. Crop Evol. 44, 165–174, https://doi.org/10.1023/A:1008611200727 (1997).
Zohary, D. The wild genetic resources of cultivated lettuce (Lactuca sativa L.). Euphytica 53, 31–35, https://doi.org/10.1007/BF00032029 (1991).
Křístková, E., Doležalová, I., Lebeda, A., Vinter, V. & Novotná, A. Description of morphological characters of lettuce (Lactuca sativa L.) genetic resources. A review. Hortic. Sci. 35, 113–129 (2018).
Lebeda, A., Ryder, E. J., Sideman, R., Ivana, D. & Křístková, E. in Genetic resources, chromosome engineering, and crop improvement Vol. 3 (ed R. J. Singh) 377–472 (2006).
Zhang, L. et al. RNA sequencing provides insights into the evolution of lettuce and the regulation of flavonoid biosynthesis. Nat. Commun. 8, 2264, https://doi.org/10.1038/s41467-017-02445-9 (2017).
Reyes-Chin-Wo, S. et al. Genome assembly with in vitro proximity ligation data and whole-genome triplication in lettuce. Nat. Commun. 8, 14953, https://doi.org/10.1038/ncomms14953 (2017).
Xiong, W. et al. The genome of Lactuca saligna, a wild relative of lettuce, provides insight into non-host resistance to the downy mildew Bremia lactucae. Plant J. 115, 108–126, https://doi.org/10.1111/tpj.16212 (2023).
Xiong, W. et al. Genome assembly and analysis of Lactuca virosa: implications for lettuce breeding. G3-GENES GENOM GENET 13, jkad204, https://doi.org/10.1093/g3journal/jkad204 (2023).
Shen, F. et al. Comparative genomics reveals a unique nitrogen-carbon balance system in Asteraceae. Nat. Commun. 14, 4334, https://doi.org/10.1038/s41467-023-40002-9 (2023).
Ballouz, S., Dobin, A. & Gillis, J. A. Is it time to change the reference genome? Genome Biol. 20, 159, https://doi.org/10.1186/s13059-019-1774-4 (2019).
Sun, Y., Shang, L., Zhu, Q.-H., Fan, L. & Guo, L. Twenty years of plant genome sequencing: achievements and challenges. Trends Plant Sci. 27, 391–401, https://doi.org/10.1016/j.tplants.2021.10.006 (2022).
Abu Almakarem, A. S., Heilman, K. L., Conger, H. L., Shtarkman, Y. M. & Rogers, S. O. Extraction of DNA from plant and fungus tissues in situ. BMC Res. Notes 5, 266, https://doi.org/10.1186/1756-0500-5-266 (2012).
Chen, S., Zhou, Y., Chen, Y. & Gu, J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34, i884–i890 (2018).
Marçais, G. & Kingsford, C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27, 764–770, https://doi.org/10.1093/bioinformatics/btr011 (2011).
Pertea, M. et al. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat. Biotechnol. 33, 290–295, https://doi.org/10.1038/nbt.3122 (2015).
Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170–175, https://doi.org/10.1038/s41592-020-01056-5 (2021).
Burton, J. N. et al. Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions. Nat. Biotechnol. 31, 1119–1125, https://doi.org/10.1038/nbt.2727 (2013).
Wolff, J. et al. Galaxy HiCExplorer 3: a web server for reproducible Hi-C, capture Hi-C and single-cell Hi-C data analysis, quality control and visualization. Nucleic Acids Res. 48, W177–W184, https://doi.org/10.1093/nar/gkaa220 (2020).
Flynn, J. M. et al. RepeatModeler2 for automated genomic discovery of transposable element families. Proc. Natl. Acad. Sci. USA 117, 9451–9457, https://doi.org/10.1073/pnas.1921046117 (2020).
Ellinghaus, D., Kurtz, S. & Willhoeft, U. LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons. BMC Bioinform. 9, 18, https://doi.org/10.1186/1471-2105-9-18 (2008).
Xu, Z. & Wang, H. LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic Acids Res. 35, W265–W268, https://doi.org/10.1093/nar/gkm286 (2007).
Ou, S. & Jiang, N. LTR_retriever: a highly accurate and sensitive program for identification of long terminal repeat retrotransposons. Plant Physiol. 176, 1410–1422, https://doi.org/10.1104/pp.17.01310 (2018).
Tarailo-Graovac, M. & Chen, N. Using RepeatMasker to identify repetitive elements in genomic sequences. Curr. Protoc. Bioinformatics 25, 4.10.1–4.10.14, https://doi.org/10.1002/0471250953.bi0410s25 (2009).
Benson, G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 27, 573–580, https://doi.org/10.1093/nar/27.2.573 (1999).
Beier, S., Thiel, T., Münch, T., Scholz, U. & Mascher, M. MISA-web: a web server for microsatellite prediction. Bioinformatics 33, 2583–2585, https://doi.org/10.1093/bioinformatics/btx198 (2017).
Stanke, M., Steinkamp, R., Waack, S. & Morgenstern, B. AUGUSTUS: a web server for gene finding in eukaryotes. Nucleic Acids Res. 32, W309–W312, https://doi.org/10.1093/nar/gkh379 (2004).
Kim, D., Langmead, B. & Salzberg, S. L. HISAT: a fast spliced aligner with low memory requirements. Nat. Methods 12, 357–360, https://doi.org/10.1038/nmeth.3317 (2015).
Grabherr, M. G. et al. Trinity: reconstructing a full-length transcriptome without a genome from RNA-seq data. Nat. Biotechnol. 29, 644–652, https://doi.org/10.1038/nbt.1883 (2011).
Huerta-Cepas, J. et al. eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids Res. 47, D309–D314, https://doi.org/10.1093/nar/gky1085 (2019).
Boeckmann, B. et al. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 31(1), 365–370 (2003).
Mitchell, A. et al. The InterPro protein families database: the classification resource after 15 years. Nucleic Acids Res. 43, D213–D221, https://doi.org/10.1093/nar/gku1243 (2015).
Kanehisa, M., Goto, S., Sato, Y., Furumichi, M. & Tanabe, M. KEGG for integration and interpretation of large-scale molecular data sets. Nucleic Acids Res. 40, D109–D114, https://doi.org/10.1093/nar/gkr988 (2012).
Jones, P. et al. InterProScan 5: genome-scale protein function classification. Bioinformatics 30, 1236–1240, https://doi.org/10.1093/bioinformatics/btu031 (2014).
Finn, R. D. et al. Pfam: the protein families database. Nucleic Acids Res. 42, D222–D230, https://doi.org/10.1093/nar/gkt1223 (2014).
Marçais, G. et al. MUMmer4: a fast and versatile genome alignment system. PLoS Comput. Biol. 14, e1005944, https://doi.org/10.1371/journal.pcbi.1005944 (2018).
Wang, Y. et al. MCScanX: a toolkit for detection and evolutionary analysis of gene synteny and collinearity. Nucleic Acids Res. 40, e49–e49, https://doi.org/10.1093/nar/gkr1293 (2012).
Tang, H. et al. Synteny and collinearity in plant genomes. Science 320, 486–488, https://doi.org/10.1126/science.1153917 (2008).
Chen, T. et al. The Genome sequence archive family: toward explosive data growth and diverse data types. Genomics Proteom. Bioinform. 19, 578–583, https://doi.org/10.1016/j.gpb.2021.08.001 (2021).
CNCB-NGDC Members and Partners. Database resources of the National Genomics Data Center, China National Center for Bioinformation in 2023. Nucleic Acids Res. 51, D18–D28, https://doi.org/10.1093/nar/gkac1073 (2023).
NGDC Genome Sequence Archive. https://ngdc.cncb.ac.cn/gsa/browse/CRA014873 (2024).
Chen, M. et al. Genome warehouse: a public repository housing genome-scale data. Genomics Proteom. Bioinform. 19, 584–589, https://doi.org/10.1016/j.gpb.2021.04.001 (2021).
NGDC Genome Warehouse. https://ngdc.cncb.ac.cn/gwh/Assembly/83750/show (2024).
NCBI GenBank. https://identifiers.org/ncbi/insdc:JBFTWI000000000 (2024).
Zhang, B. Gemome assembly and gene annotation files of Lactuca sativa var. crispa cv. Green Elegance. figshare. https://doi.org/10.6084/m9.figshare.25116548 (2024).
Simão, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V. & Zdobnov, E. M. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31, 3210–3212, https://doi.org/10.1093/bioinformatics/btv351 (2015).
Parra, G., Bradnam, K. & Korf, I. CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes. Bioinformatics 23, 1061–1067, https://doi.org/10.1093/bioinformatics/btm071 (2007).
Ou, S., Chen, J. & Jiang, N. Assessing genome assembly quality using the LTR Assembly Index (LAI). Nucleic Acids Res. 46, e126–e126, https://doi.org/10.1093/nar/gky730 (2018).

Acknowledgements

This research was supported by the Key Project at Central Government Level: The Ability Establishment of Sustainable Use for Valuable Chinese Medicine Resources (2060302), the Innovation and Development Program of Beijing Vegetable Research Center (KYCX202304), Beijing Joint Research Program for Germplasm Innovation and New Variety Breeding (G20220628003-01), and Collaborative Innovation Program of Beijing Vegetable Research Center (XTCX202302).

Author information

Authors and Affiliations

National Engineering Research Center for Vegetables, Beijing Vegetable Research Center, Beijing Academy of Agriculture and Forestry Science, Beijing, 100097, P. R. China
Bin Zhang, Xue Liu, Haifeng Ding & Dayong Li
State Key Laboratory of Vegetable Biobreeding, Beijing Vegetable Research Center, Beijing Academy of Agriculture and Forestry Science, Beijing, 100097, P. R. China
Bin Zhang, Xue Liu & Dayong Li
College of Life Sciences, Capital Normal University, Beijing, 100048, P. R. China
Yingfei Xue & Cheng Sun
State Key Laboratory for Quality Ensurance and Sustainable Use of Dao-di Herbs, National Resource Center for Chinese Materia Medica, China Academy of Chinese Medical Sciences, Beijing, 100700, P. R. China
Xue Liu & Jinfu Tang
Jingyan Yinong (Beijing) Seed Sci-Tech Co., Ltd., Beijing, 100097, P. R. China
Haifeng Ding & Yesheng Yang
College of Life Sciences, Shandong Normal University, Jinan, 250014, P. R. China
Chenchen Wang, Zhaoyang Xu & Jun Zhou
Beijing Key Laboratory of Vegetable Germplasms Improvement, Beijing, 100097, P. R. China
Dayong Li
Key Laboratory of Biology and Genetics Improvement of Horticultural Crops (North China), Beijing, 100097, P. R. China
Dayong Li

Contributions

D.L., J.T., and B.Z. designed and coordinated the study; X.L., H.D., Y.Y., C.W., Z.X., and B.Z. collected and prepared plant samples; Y.X., J.T., and C.S. performed the bioinformatic analyses; B.Z. and D.L. drafted the manuscript; J.T., J.Z. and C.S. revised the manuscript. All authors approved the final manuscript.

Corresponding authors

Correspondence to Jinfu Tang or Dayong Li.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Zhang, B., Xue, Y., Liu, X. et al. A near-complete chromosome-level genome assembly of looseleaf lettuce (Lactuca sativa var. crispa). Sci Data 11, 961 (2024). https://doi.org/10.1038/s41597-024-03830-y

Received: 14 March 2024
Accepted: 27 August 2024
Published: 04 September 2024
DOI: https://doi.org/10.1038/s41597-024-03830-y
Springer Nature Limited
