Objective

Castanopsis is the third largest genus in the Fagaceae family [1], comprising about 120–130 species [1,2,3]. Fossil evidence indicates that Castanopsis was widely distributed in both the Northern and Southern Hemispheres from the Eocene to the Pliocene [2, 4], but it is currently distributed mainly in the subtropics and tropics of East and Southeast Asia [2, 4]. Castanopsis species are mainly canopy-dominant trees that can grow up to 25–40 m in height [5]. They are therefore the main components of evergreen broadleaved forests, safeguarding local biodiversity [2, 6]. Castanopsis species are good timber trees, and their seeds are edible [2, 3, 7]. They also contain many polyphenols and are used in traditional medicines [3]. Climate change is the main threat to Castanopsis species because of their restricted migration ability [3, 7]. A study of 32 dominant Castanopsis species in East Asia that grow from 5°N to 38°N predicted that their present high-richness distribution range will be reduced by 94.5%, on average, by 2070 [3].

China is a central region of the Castanopsis distribution and hosts approximately 60 species [3], half of which are endemic to China [1]. Castanopsis chinensis is distributed from South China to Vietnam. It is a pioneer, canopy-dominant tree in evergreen broadleaved forests and plays a key role in these ecosystems [3, 8]. As a fast-growing species that helps control soil erosion, it is also widely used in reforestation [7]. Because the distribution area of C. chinensis is heavily disturbed by human activity, most of its forests have been converted or degraded. We therefore present the C. chinensis genome to better understand the evolution and adaptation of the species and to support its future conservation, management and utilization.

Data description

We collected leaf samples of C. chinensis from an individual planted in the South China Botanical Garden, Guangzhou, China. To perform genome assembly and annotation, three sequencing libraries were constructed from genomic DNA or RNA extracted from the leaf tissues. The first library was used for long-read whole genome sequencing on a Nanopore PromethION sequencer, which generated about 139.5 GB of data (Data files 1–3) [9,10,11]. The second was used for short-read whole genome sequencing on an MGI DNBSEQ-T7 sequencer, which generated about 149.6 GB of data (Data file 4) [12], and the third was used for RNA sequencing (RNA-seq) on an MGI DNBSEQ-T7 sequencer, which generated about 29.7 GB of data (Data file 5) [13]. All sequencing on the MGI platform was performed in 150 bp paired-end mode.

The genome size of C. chinensis was estimated with KmerGenie v1.7044 [14] (with the parameters “-k 141 --diploid”) and GenomeScope 2.0 [15] (with a k-mer size of 21) from the short whole genome reads, which were first trimmed using Sickle v1.33 [16] with the parameters “-q 30 -l 80”. The genome sizes estimated by KmerGenie and GenomeScope were 1,143,475,699 and 744,772,109 bp, respectively. Nanopore long reads were processed with Porechop v0.2.4 [17] and ontbc v1.1 [18] to remove adapters, reads with quality scores < 7, and reads shorter than 5000 bp. NextDenovo v2.3.1 [19] was then used to assemble the genome from the filtered reads. Pseudohaploid [20] and Purge_Dups v1.2.6 [21] (run twice) were used to remove duplicated contigs. Racon v1.5.0 [22], Hapo-G v1.3.2 [23] and Polypolish v0.5.0 [24] were then used to polish the assembly, with Racon and Hapo-G each run twice. The final assembly consisted of 133 contigs with a total length of 888,699,661 bp and a contig N50 of 23,395,510 bp (Data file 6) [25]. BUSCO v5.4.6 [26] reported 98.3% completeness using the Eudicots odb10-2020-09-10 database (Data file 7) [27].
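The contig N50 reported above is the contig length at which the contigs of that length or longer together cover at least half of the assembly. As a minimal sketch of how such a value can be recomputed from a FASTA file, the Python snippet below uses a toy four-contig input; the function names `fasta_lengths` and `n50` and the toy sequences are illustrative assumptions, not part of the pipeline used in this study.

```python
def fasta_lengths(lines):
    """Yield one sequence length per FASTA record from an iterable of lines."""
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if current is not None:
                yield current
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        yield current


def n50(lengths):
    """Return the length L such that contigs >= L cover at least half the total."""
    if not lengths:
        raise ValueError("no contig lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy assembly (hypothetical sequences, not real data).
toy_fasta = [">c1", "ACGTACGTAC", ">c2", "ACGTAC", ">c3", "ACGT", ">c4", "AC"]
toy_lengths = list(fasta_lengths(toy_fasta))
print(toy_lengths, n50(toy_lengths))
```

For a real assembly, the lines of the FASTA file in Data file 6 could be passed to `fasta_lengths` in place of the toy list.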

RED v2.0 [28] and EDTA v2.1.0 [29] were used to predict repetitive elements in the assembly, and they identified 400,198,509 bp (Data file 8) [30] and 410,582,904 bp (Data file 9) [31] of repetitive sequence, respectively. Combining the RED and EDTA results yielded 496,557,194 bp of repetitive sequence (Data file 10) [32], accounting for 55.9% of the genome assembly. After soft-masking the repetitive elements in the assembly, BRAKER v2.0 [33] was used to predict the gene structures. BRAKER is an automated gene annotation pipeline that uses transcriptome data and reference protein sequences (Data file 11) [34]. The BRAKER results were then passed to the Funannotate pipeline v1.8.16 [35] to obtain an integrated gene set. The pipeline included the “train”, “predict” and “update” steps. In all steps, the parameter “--max_intronlen 1000000” was used. In the “predict” step, the parameters “--busco_seed_species arabidopsis --organism other --busco_db embryophyta” were added. The Funannotate pipeline finally produced 51,406 genes encoding 54,310 protein sequences (Data files 12–14) [36,37,38]. After gene structure prediction, functional annotation of the genes was performed using the “funannotate annotate” command in the Funannotate pipeline (Data files 15–16) [39, 40].
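The combined repeat total is smaller than the sum of the two individual predictions, which is what one would expect if overlapping annotations are counted only once. The sketch below illustrates that idea as an interval union on a single toy contig and then recomputes the reported repeat fraction. The interval coordinates, the half-open convention and the function names are assumptions made for illustration; the exact merging procedure used for Data file 10 is not specified here.

```python
def merge_intervals(intervals):
    """Merge overlapping or adjacent half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]


def covered_bp(intervals):
    """Count bases covered by at least one interval."""
    return sum(end - start for start, end in merge_intervals(intervals))


# Hypothetical repeat calls from two tools on one contig.
red_toy = [(0, 100), (250, 400)]
edta_toy = [(50, 150), (400, 450), (900, 1000)]
union_bp = covered_bp(red_toy + edta_toy)
print(covered_bp(red_toy), covered_bp(edta_toy), union_bp)

# Repeat fraction from the totals reported in the text.
combined_repeat_bp = 496_557_194
assembly_bp = 888_699_661
repeat_percent = 100 * combined_repeat_bp / assembly_bp
print(f"{repeat_percent:.1f}%")
```

In a real assembly the intervals would be merged separately for each contig before the covered bases are summed.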

Limitations

Although the genome size estimates from KmerGenie and GenomeScope were highly discrepant (1,143,475,699 and 744,772,109 bp, respectively), the final assembly size of 888,699,661 bp was comparable to previously reported genome sizes for Castanopsis species, including 878.6 Mb for C. tibetana [1] and 882.6 Mb for C. hystrix [41], and larger than the 785.5 Mb reported for Castanea mollissima [42]. Accurate genome size estimation with short reads has been reported to be challenging [43, 44]. Long HiFi sequencing data may therefore be needed to obtain an accurate size estimate [44]. Because of their long length and high accuracy, HiFi data can characterize genome size reliably with both small and large k-mers [43, 45,46,47,48], which helps determine the true value when short-read estimates disagree [43].
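To make the size of this discrepancy concrete, the short Python sketch below expresses each k-mer-based estimate as a percentage difference from the final assembly size. The three values are taken from the text; the variable and function names are illustrative assumptions.

```python
# Values reported in the text (bp).
kmergenie_bp = 1_143_475_699
genomescope_bp = 744_772_109
assembly_bp = 888_699_661


def percent_diff(estimate, reference):
    """Signed difference of an estimate from a reference, in percent."""
    return 100 * (estimate - reference) / reference


kmergenie_diff = percent_diff(kmergenie_bp, assembly_bp)
genomescope_diff = percent_diff(genomescope_bp, assembly_bp)

print(f"KmerGenie vs assembly:   {kmergenie_diff:+.1f}%")
print(f"GenomeScope vs assembly: {genomescope_diff:+.1f}%")
```

The assembly size lies between the two estimates, with KmerGenie above it and GenomeScope below it.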

The genome assembly in this report is still fragmented. It is therefore not suitable for complete analysis of genome structure, which hinders the investigation of complex genomic regions relevant to conservation and breeding [49,50,51]. Further high-quality genome assemblies (preferably complete and gapless) using ultra-long reads, Hi-C, and other sequencing technologies are needed [52].

Table 1 Overview of all data files/data sets