Objective

Carnivorous plants capture and digest insects and other animals using pitfalls and other traps. Nepenthaceae is one of the largest carnivorous plant families; its only genus, Nepenthes, contains 160–180 species [1]. The distinctive and attractive pitcher traps of Nepenthes species make them popular ornamental plants, which has led to their overexploitation in nature. All Nepenthes species are currently listed under the Convention on International Trade in Endangered Species of Wild Fauna and Flora (CITES) [2]. Nepenthes species are also widely used in traditional medicine, making them multi-purpose species [3]. Nepenthes species grow in moist and nutrient-poor environments and show varied ecological adaptations related to nutrient acquisition (such as the use of different prey as substrates) [1]. They are also divided into “Highland” and “Lowland” groups according to the altitudes at which they grow [1]. Ecological divergence therefore plays an important role in Nepenthes speciation. Moreover, hybridization is common among Nepenthes species, which creates uncertainty about their evolutionary relationships [1, 2]. Evolutionary and taxonomic studies of Nepenthes species, especially studies that draw on genome resources, are therefore much needed.

Nepenthes mirabilis is the most widely distributed species in the genus and can be found from southern China to Australia, Africa and some Pacific islands [4, 5]. It is also the only Nepenthes species naturally distributed in China [4, 5]. Genome resources for carnivorous plants are still lacking [6, 7], and no genomes have been reported for Nepenthes species. Here, we report whole genome and transcriptome assemblies of N. mirabilis generated from whole genome sequencing (WGS) and RNA-seq reads. The assemblies will be useful resources for comparative genomics and for the conservation of N. mirabilis and other carnivorous species. The data will also help to further understand the multifaceted adaptation of carnivorous plants at the genomic level.

Data description

One N. mirabilis sample was collected from Wuguishan (22°25’30.78” N, 113°26’16.37” E), Zhongshan City, China. Genomic DNA was isolated from its leaf tissues for WGS. Total RNA was extracted from both its leaf and flower tissues. Standard genomic and transcriptomic 150 bp paired-end libraries were constructed and sequenced on the MGI DNBSEQ-T7 sequencing platform at GrandOmics Biosciences (Beijing, China). RNA-seq reads from leaves and flowers were combined to obtain the total transcriptome reads of N. mirabilis.

Multiple programs were used to analyze the genomic and transcriptomic sequencing data. Default parameter settings were applied unless otherwise mentioned. The transcriptome reads were analyzed with the TransPi pipeline 1.3.0-rc [8], which included read filtering, de novo transcriptome assembly, and annotation. The raw WGS reads were trimmed using Sickle v. 1.33 [9] to remove bases with quality scores < 30 and reads shorter than 80 bp. The trimmed reads were corrected with RECKONER v. 1.1 [10]. KmerGenie v. 1.7044 [11] (with the parameters “-k 141 --diploid”) and GenomeScope 2.0 [12] (with the parameter “-m 10000”) were used to estimate the genome size of N. mirabilis. The k-mer frequencies used by GenomeScope were generated using Jellyfish 2.3.0 [13] with the parameter “-s 5G” in the “count” step and “-h 3000000” in the “histo” step. Using the quality-filtered WGS reads, the genome was assembled with Platanus 1.2.4 [14], Redundans 0.14a [15] (with the parameter “--identity 0.9”), P_RNA_scaffolder [16] (with the parameter “-e 1000000”), and Hapo-G 1.3.2 [17] (Table 1, Data file 1, Fig. 1: assembling steps) [18]. The assembled genome sequences were then uploaded to GenBank for contamination assessment, and the potential contaminant sequences were removed from the assembly. The repeat sequences of the contamination-free assembly were identified with RED v2.0 [19]. Using the repeat-masked assembly, gene prediction and annotation were performed with the funannotate pipeline v1.8.13 [20], in which “max_intronlen 1000000” was set for prediction. BUSCO 5.2.2 [21] was used to assess the quality of both the genome and transcriptome assemblies with the eudicots odb10-2020-09-10 database. Finally, the ploidy level of N. mirabilis was determined by nQuire using the “lrdmodel” function [22].
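To illustrate what the k-mer frequency input to GenomeScope represents, the following Python sketch counts k-mers in a set of reads and tabulates how many distinct k-mers occur at each multiplicity. It is a toy illustration, not the Jellyfish implementation. Counting canonical k-mers (a k-mer and its reverse complement treated as one) is an assumption here, since only the “-s 5G” and “-h 3000000” parameters are reported above; the function names are hypothetical.

```python
from collections import Counter

# Translation table for complementing unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_histogram(reads, k):
    """Map each multiplicity to the number of distinct canonical k-mers seen that often."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Skip k-mers containing ambiguous bases such as N.
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    # Collapse per-k-mer counts into a multiplicity histogram.
    return dict(Counter(counts.values()))


if __name__ == "__main__":
    toy_reads = ["ACGTACGT", "CGTACGTA"]
    for depth, n_kmers in sorted(kmer_histogram(toy_reads, 4).items()):
        print(depth, n_kmers)
```

On real data the histogram has one row per multiplicity, which is the two-column format that genome profiling tools read.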

Table 1 Overview of all data files/data sets

The WGS produced ~ 139.5 GB of raw data (Data file 2) [23], while the RNA-seq produced ~ 21.7 Gb of raw data for leaf tissues (Data file 3) [24] and ~ 27.9 GB of raw data for flower tissues (Data file 4) [25]. KmerGenie estimated the genome size of N. mirabilis at 791,766,648 bp with a best selected k-mer size of 99, following a comparison of different k-mer spectra. GenomeScope estimated the genome size at between 640,605,912 and 776,677,804 bp using k-mer sizes from 17 to 99 (Table 1, Data file 5, Fig. 2: genome size estimation) [26]. GenomeScope also indicated that the heterozygosity of N. mirabilis was between 0.3% and 1%, and that the repeat content ranged from 13.8% at 99-mer to 50.4% at 17-mer.
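The basic idea behind k-mer based genome size estimation can be sketched as the total number of k-mers divided by the depth of the main coverage peak. The Python example below is a deliberately simplified illustration of that idea and is not the model used by KmerGenie or GenomeScope, which fit statistical models to the whole spectrum. The function name, the `min_depth` error cutoff, and the toy histogram are assumptions for illustration only.

```python
def estimate_genome_size(hist, min_depth=5):
    """Estimate genome size as (total k-mers) / (peak depth).

    hist maps multiplicity -> number of distinct k-mers with that multiplicity.
    K-mers below min_depth are treated as sequencing errors and ignored
    (an assumed cutoff; real tools model the error peak explicitly).
    """
    solid = {depth: n for depth, n in hist.items() if depth >= min_depth}
    if not solid:
        raise ValueError("no k-mers at or above min_depth")
    # Depth with the most distinct k-mers approximates the coverage peak.
    peak_depth = max(solid, key=lambda depth: solid[depth])
    # Total k-mer occurrences contributed by non-error k-mers.
    total_kmers = sum(depth * n for depth, n in solid.items())
    return total_kmers / peak_depth


if __name__ == "__main__":
    # Toy histogram: an error peak at low depth and a coverage peak at depth 20.
    toy_hist = {1: 1000, 2: 200, 18: 50, 19: 80, 20: 100, 21: 80, 22: 50, 40: 10}
    print(estimate_genome_size(toy_hist))
```

In a heterozygous diploid the tallest peak may be the heterozygous one at half the homozygous depth, which is one reason why simple peak-based estimates can be off and why the estimate can shift with the k-mer size.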

The TransPi pipeline assembled a total of 339,802 transcripts (Data file 6) [27], with the longest transcript being 25,441 bp and an N50 of 945 bp. A total of 79,758 ORFs were obtained from the transcriptome assemblies (Data files 7 and 8) [28, 29], of which 55,993 were complete. The BUSCO assessment of the ORFs indicated 93.7% completeness (Data file 9) [30]. The GO annotations of all ORFs showed that, in the ‘Biological Processes’ category, they were mainly related to proteolysis and DNA integration (Table 1, Data file 10, Fig. 3: GO annotation of open reading frames) [31]. The full GO and Pfam annotations of the ORFs can be found in Data files 11 and 12 [32, 33].

The genome assembly was 691,409,685 bp in length, with 159,555 contigs/scaffolds and an N50 of 10,307 bp (Data file 13) [34]. The largest scaffold was 312,611 bp, and the shortest was 200 bp. The BUSCO assessment of the assembly indicated 91.1% completeness (Data file 14) [35]. The ploidy estimation by nQuire indicated that the genome of N. mirabilis is diploid, as the diploid model had the lowest delta likelihood relative to the free model (diploid delta likelihood: 626,726.449; triploid delta likelihood: 1,581,220.976; tetraploid delta likelihood: 1,425,949.675). Repeat prediction identified 44.21% (305,683,241 bp) of the genome as repetitive regions (Data file 15) [36]. Funannotate predicted 42,961 genes that code for 45,461 proteins (Data files 16–18) [37, 38, 39]. The gene annotation is shown in Data files 19–20 [40, 41].
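For readers less familiar with these summary statistics, the Python sketch below shows how an N50 is computed from a list of sequence lengths, and reproduces two pieces of arithmetic from the paragraph above using the reported values: the repeat percentage and the choice of the ploidy model with the lowest delta likelihood. The function and variable names are hypothetical, and the toy lengths in the usage lines are not from the assembly.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together cover at least half of the total assembly length."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no sequence lengths given")


# Repeat percentage: repetitive bases divided by assembly length (values from the text).
repeat_percent = round(305_683_241 / 691_409_685 * 100, 2)

# nQuire delta likelihoods from the text; the lowest value marks the
# fixed model that is closest to the free model.
delta_likelihood = {
    "diploid": 626_726.449,
    "triploid": 1_581_220.976,
    "tetraploid": 1_425_949.675,
}
best_model = min(delta_likelihood, key=delta_likelihood.get)

if __name__ == "__main__":
    print(n50([2, 3, 4, 5, 6]))  # toy lengths, not the real scaffolds
    print(repeat_percent, best_model)
```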

Limitations

The assembled genome is still fragmented and is not suitable for analyses of genome structure. Higher-quality genome assemblies based on long-read, Hi-C, and other sequencing technologies are needed.

In particular, obtaining an accurate genome size from short whole genome sequencing reads alone remains a challenge [42, 43], and this also applies to N. mirabilis in this study. Although KmerGenie and GenomeScope estimated comparable genome sizes at a k-mer size of 99 (791,766,648 bp and 774,029,535 bp, respectively), the genome size estimated by GenomeScope changed as the k-mer size increased from 17 to 71 [26]. Because a large k-mer size might introduce errors into genome size estimation [12], the genome size estimated at 99-mer could be inaccurate. At the small k-mer sizes of 17 and 21, GenomeScope estimated the genome size at 640,605,912 bp and 675,699,094 bp, respectively, values that are comparable to the final assembled genome size of 691,409,685 bp. Nevertheless, long HiFi sequencing offers an alternative that allows a larger k-mer size to be used while circumventing sequencing errors, so that the genome size and other features of the species can be characterized accurately [44].
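The comparison between the estimates and the assembled length can be made explicit with a short calculation. The Python sketch below uses only the values reported in this note and expresses the assembly length as a fraction of each estimate; the dictionary keys and variable names are hypothetical labels chosen for illustration.

```python
# Assembled genome length reported in this note (bp).
ASSEMBLY_BP = 691_409_685

# Genome size estimates reported in this note (bp).
estimates_bp = {
    "GenomeScope k=17": 640_605_912,
    "GenomeScope k=21": 675_699_094,
    "GenomeScope k=99": 774_029_535,
    "KmerGenie k=99": 791_766_648,
}

# Ratio of assembly length to each estimate; values near 1 mean the
# assembly length and the estimate agree closely.
ratios = {name: ASSEMBLY_BP / size for name, size in estimates_bp.items()}

if __name__ == "__main__":
    for name, ratio in ratios.items():
        print(f"{name}: assembly is {ratio:.1%} of the estimate")
```

Under these numbers the assembly is slightly longer than the 17-mer and 21-mer estimates and roughly 11 to 13% shorter than the 99-mer estimates.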

In this study, RED [19] was used for repeat sequence identification, given its speed, efficiency and accuracy in repeat detection [45, 46]. The repeat content estimated by GenomeScope varied dramatically among different k-mer sizes, which made those estimates less reliable; in comparison, the RED results provided an initial guideline for the percentage of repeat content in N. mirabilis. However, considering the fragmented nature of the assembly, repeat regions might still be missing or misassembled. Moreover, RED does not classify repeats into different types (e.g. retrotransposons, DNA transposons, and so on), so additional repeat detectors are needed if the evolution of different repeat types is under investigation [47].