Background & Summary

Knoxia roxburghii (Sprengel) M. A. Rau (2n = 20, homotypic synonym: Knoxia valerianoides Thorel ex Pitard), a perennial herb naturally distributed in southern China and Southeast Asia, is a member of the Rubiaceae family and the Knoxia genus1. The dried roots of K. roxburghii, known as hongdaji in Chinese medicine, exhibit a significant therapeutic effect in treating cancer, carbuncles, diarrhoea, ascites, chronic pharyngitis, and schizophrenia2. Additionally, the plant is a crucial ingredient in various Chinese herbal formulations, such as ZiJinDing, which has been shown to possess antitumour properties by modern pharmacology3. Phytochemical studies have revealed that K. roxburghii is rich in anthraquinones, triterpenoids, lignans, coumarins, sitosterols, and other important compounds4,5. Anthraquinones, such as 3-hydroxymorindone, knoxiadin, and damnacanthal, are considered key active components of K. roxburghii, exhibiting diverse biological activities including anticancer, antibacterial, anticoagulant, and antiviral effects6,7. Triterpenoids, another significant component of K. roxburghii, have anti-inflammatory, anticancer, and antioxidant effects, and are considered primarily responsible for the anti-inflammatory and swelling-reducing activity of K. roxburghii8,9.

In recent years, the wild populations of K. roxburghii in China have faced an increased risk of extinction due to a surge in market demand10. Additionally, seed germination and emergence rates for this species are less than 1% under natural conditions, and the plant has a protracted maturation period11. K. roxburghii has been categorized as a first-class protected wild Chinese herbal medicine, and exploitation of its wild production areas has been prohibited12. As a result, artificially cultivated K. roxburghii has become the primary source of medicinal materials. Nevertheless, cultivation is plagued by southern blight and leaf spot, which have severely limited production13. Therefore, there is an urgent need to breed promising new K. roxburghii varieties to tackle this issue.

Whole-genome-level studies can provide insights for enhancing medicinal material quality, molecular breeding, wild resource conservation, and functional gene discovery and utilization in plants14,15,16. However, to date, no whole-genome sequence of K. roxburghii has been reported. In the present study, by using DNBSEQ sequencing, single-molecule real-time sequencing, and high-throughput chromosome conformation capture (Hi-C) sequencing technologies, we provide a de novo high-quality chromosome-level genome sequence for K. roxburghii. Of the assembled sequence, 99.78% was anchored to 10 chromosomes, with a total assembly length of 446.30 Mb and a scaffold N50 of 44.38 Mb. Transposable elements accounted for 68.92% (307.60 Mb) of the assembled genome sequence, with long terminal repeats (LTRs) being the dominant type. The LTR retrotransposon burst was estimated to have occurred approximately 0.2 million years ago. Phylogenetic analysis revealed that Copia and Gypsy elements could be grouped into eight and five lineages, respectively. The reference genome information obtained herein constitutes a valuable resource for promoting genetic improvement and elucidating the biosynthesis of active ingredients in this medicinal plant.

Methods

Sample collection and sequencing

For genomic DNA extraction, fresh leaves of K. roxburghii were collected from Chuxiong (N24°58′, E101°28′) in Yunnan Province, China. Additionally, stems, roots, buds, and leaves were gathered for transcriptome sequencing. The materials were immediately frozen in liquid nitrogen, transported to the laboratory, and stored at −80 °C. High-quality genomic DNA was extracted from leaves using the DNeasy Plant Mini Kit (QIAGEN, Valencia, CA, USA). Total RNA was extracted from each sample using the Direct-zol RNA kit (Zymo Research, Irvine, CA, USA) following the manufacturer’s instructions.

For short-read sequencing, paired-end DNBSEQ libraries were constructed using the NextEra DNA Flex Library Prep Kit (Illumina, San Diego, CA, USA) with an insert size of 350 bp and sequenced on the DNBSEQ-T7 platform (MGI Tech, Shenzhen, China). Quality control of the short sequencing reads was conducted using fastp v. 0.21.017 with default parameters. This process involved the removal of adapter sequences, contaminants, PCR duplicates, and reads in which more than 30% of the bases were of low quality. A total of 107.86 Gb of clean short reads (251.78 × coverage) were generated and used for subsequent data processing. The genome size was estimated to be 428.39 Mb, with a heterozygosity of 1.23% and a repetitive content of 46.86%, based on a previous K-mer distribution analysis18.
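The coverage quoted above follows from dividing the clean data volume by the estimated genome size. A minimal Python sketch of that arithmetic is given here; the function name `sequencing_depth` is introduced purely for illustration and is not part of the original analysis.

```python
def sequencing_depth(clean_bases_gb: float, genome_size_mb: float) -> float:
    """Return fold coverage: total clean bases divided by genome size."""
    if genome_size_mb <= 0:
        raise ValueError("genome size must be positive")
    # 1 Gb = 1,000 Mb, so convert the data volume to Mb before dividing.
    return clean_bases_gb * 1000.0 / genome_size_mb


# Values reported in the text: 107.86 Gb of clean short reads and an
# estimated genome size of 428.39 Mb.
short_read_depth = sequencing_depth(107.86, 428.39)
print(f"{short_read_depth:.2f} x")
```

The same function can be applied to any other library by substituting its clean data volume.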

For PacBio sequencing, the libraries were constructed with an insert size of 15 kb using the SMRTbell Template Prep Kits (Pacific Biosciences of California, Inc., CA, USA) and sequenced on the PacBio Sequel II platform in circular consensus sequencing (CCS) mode. After trimming the low-quality reads and adaptor sequences from the raw data, approximately 52.85 Gb of long reads were generated, covering approximately 123 × of the estimated genome size.

For Hi-C sequencing, the library was prepared according to the protocol described by Lieberman-Aiden et al.19. DNA was purified from proteins and randomly sheared into fragments of 300–700 bp in size. The resulting Hi-C library was sequenced on the Illumina NovaSeq 6000 sequencing platform using paired-end 150 bp reads. The raw data from Hi-C sequencing were processed using fastp. A total of 36.14 Gb (84.36 × coverage) of clean reads were obtained.

For Oxford Nanopore Technologies (ONT) sequencing, equal quantities of all RNA samples were pooled for PCR-cDNA library construction using the Ligation Sequencing Kit (SQK-LSK109) and sequenced on the PromethION sequencer (Oxford Nanopore Technologies, Oxford, UK). NanoFilt v. 2.8.020 (parameters: -q 7 -l 100 --headcrop 30 --minGC 0.3) was used to process the RNA-seq data. Finally, a total of 6.2 Gb of full-length RNA-seq data were obtained for genome annotation.
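To make the read-filtering thresholds concrete, the following Python sketch applies the same four criteria (mean quality of at least 7, length of at least 100 bases, removal of the first 30 bases, and a GC fraction of at least 0.3) to a single read. It is a simplified illustration and not NanoFilt's own code: the function names are hypothetical, and both the way mean quality is averaged (through error probabilities) and the choice to crop only after filtering are assumptions.

```python
import math


def mean_quality(phred_scores):
    """Average Phred quality, computed via the mean error probability (assumed)."""
    error_probs = [10 ** (-q / 10) for q in phred_scores]
    return -10 * math.log10(sum(error_probs) / len(error_probs))


def gc_fraction(seq):
    """Fraction of G and C bases in a non-empty sequence."""
    s = seq.upper()
    return (s.count("G") + s.count("C")) / len(s)


def filter_read(seq, quals, min_q=7, min_len=100, headcrop=30, min_gc=0.3):
    """Return the cropped (sequence, qualities) pair, or None if the read fails."""
    # Length is checked first, so empty reads never reach the divisions below.
    if len(seq) == 0 or len(seq) < min_len:
        return None
    if mean_quality(quals) < min_q:
        return None
    if gc_fraction(seq) < min_gc:
        return None
    # Crop the leading bases only after the read has passed all filters.
    return seq[headcrop:], quals[headcrop:]


# Toy read (not study data): 200 bases, GC fraction 0.5, uniform quality 20.
kept = filter_read("ACGT" * 50, [20] * 200)
print(len(kept[0]))
```

Reads that fail any criterion yield `None`, so a caller can simply skip them when streaming through a FASTQ file.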

Genome and chromosome assembly

The contig-level genome of K. roxburghii was assembled using Hifiasm v. 0.14.221 with default parameters. Two rounds of error correction were performed based on PacBio sequencing and Illumina NovaSeq sequencing data using NextPolish v. 1.3.122 (parameters: sgs_options = -max_depth 200; lgs_options = -min_read_len 1k -max_read_len 100k -max_depth 100; lgs_minimap2_options = -x map-ont) and Pilon v. 1.2323 (parameters: --fix all --changes), respectively. Heterozygous sequences (haplotigs) were removed using the Purge_haplotigs pipeline v. 1.0.424, with the high, mid, and low read-depth cut-offs set to 170, 55, and 5, respectively. The resulting genome assembly contained 446.30 Mb in 19 contigs, with a contig N50 of 42.26 Mb and a GC content of 35.98% (Table 1).
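The contig N50 reported above is the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the total assembly size. The Python sketch below illustrates this definition; the function name `n50` and the toy lengths are illustrative assumptions and are not taken from the assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    if not lengths:
        raise ValueError("at least one length is required")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that pushes the running sum to half the total is the N50 contig.
        if running >= half_total:
            return length


# Toy contig lengths (not the K. roxburghii contigs): total 30, half 15.
print(n50([2, 10, 4, 8, 6]))
```

The same function gives the scaffold N50 when it is supplied with scaffold lengths instead of contig lengths.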

Table 1 Global statistics for the Knoxia roxburghii genome assembly.

The Hi-C clean data were mapped to the draft genome using HiCUP v. 0.8.225 (parameters: --format sanger --longest 800 --shortest 150 --nofill N), followed by filtration to remove unmapped reads, invalid pairs, and PCR-induced duplicate sequences. ALLHiC v. 0.9.826 (parameters: -e GATC -k 10) was used to cluster the contigs into chromosomal groups, with subsequent ordering and orientation. The interactions between contigs were converted into a binary contact file using 3D-DNA v. 18041927 and Juicer v. 1.628. The assembly was then corrected manually in JuiceBox v. 1.11.0829 based on the intensity of chromosome interactions. Very short contigs without any interaction signal were placed in the “unassigned” category. The final chromosome-level genome sequence was obtained by filling each gap with 100 Ns. In total, 99.78% of the initially assembled sequences were anchored to 10 pseudo-chromosomes with lengths ranging from 42.02 Mb to 48.32 Mb (Fig. 1a, Table 2). The total length of the genome assembly was 446.30 Mb, with a scaffold N50 of 44.38 Mb (Table 1).
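The gap-filling convention described above, in which adjacent ordered contigs are joined by a fixed run of 100 N characters, can be sketched in Python as follows. The function `join_contigs` and the toy sequences are hypothetical; the sketch assumes the contigs have already been ordered and oriented and does not reproduce the output format of any scaffolding tool.

```python
def join_contigs(contigs, gap_size=100):
    """Concatenate ordered, oriented contigs, inserting a run of N between neighbours."""
    if gap_size < 0:
        raise ValueError("gap size must be non-negative")
    gap = "N" * gap_size
    return gap.join(contigs)


# Toy contigs (not study data): three contigs produce two 100-N gaps.
pseudo_chromosome = join_contigs(["ACGT", "GGCC", "TTAA"])
print(len(pseudo_chromosome), pseudo_chromosome.count("N"))
```

Because each gap has a fixed, known length, the number of gaps in a pseudo-chromosome can be recovered afterwards by counting runs of exactly 100 N.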

Fig. 1

Overview of the genomic features of Knoxia roxburghii. (a) Genomic features of K. roxburghii. Tracks from outside to inside (a–e) are as follows: chromosomes, gene density, repeat sequence density, GC content, and collinearity between the chromosomes; (b) Hi-C interaction heatmap for the K. roxburghii genome showing interactions among the ten chromosomes.

Table 2 Statistics of the pseudochromosome length obtained by Hi-C assisted assembly of Knoxia roxburghii.

Genome annotations

Three gene prediction methods, namely de novo-based, RNA-seq-based, and homologue-based, were combined to identify gene structures. For de novo-based prediction, AUGUSTUS v. 3.2.330 and GlimmerHMM v. 3.0.431 were used with default parameters. In the RNA-seq-based approach, the full-length transcript sequences were aligned to the reference genome using Minimap2 v. 2.1732 (parameters: -ax map-ont -xsplice -G 1000000). Subsequently, the alignment results were used as inputs in StringTie v. 1.3.333 for genome-based transcript assembly, and coding regions were then predicted using TransDecoder v. 2.0 (http://transdecoder.github.io). Homology-based predictions were performed with protein sequences from five reference species: Arabidopsis thaliana34, Coffea arabica35, Coffea canephora36, Leptodermis oblonga37, and Mitragyna speciosa38. The results of the three methods were integrated using MAKER v. 2.31.1039. Overall, a total of 24,507 genes were predicted, with an average gene length, average coding-sequence length, average exon length, and average exon number per gene of 4036.6 bp, 1205.64 bp, 318.24 bp, and 5.14, respectively (Table 3).

Table 3 Statistical results for the genetic structure of Knoxia roxburghii.

Gene functions were assigned by comparing the protein-coding gene models against the National Center for Biotechnology Information (NCBI) Non-redundant protein (NR) database (ftp://ftp.ncbi.nih.gov/pub/nrdb/), the Universal Protein Knowledgebase (UniProt) database40, and the Kyoto Encyclopedia of Genes and Genomes (KEGG) database41 using DIAMOND v. 2.0.11.14942 (parameters: --evalue 1e-5). The motifs and domains were identified using InterProScan v. 5.52-86.043 against multiple publicly available databases including ProDom44, PRINTS45, Pfam46, SMART47, PANTHER48, and PROSITE49. A total of 24,236 genes (98.89% of the 24,507 predicted protein-coding genes) were annotated using the above databases. Specifically, approximately 90.88%, 91.06%, 25.34%, 92.88%, 70.87%, and 69.22% were annotated in UniProt, NR, KEGG, InterPro, GO, and Pfam, respectively.

Transfer RNAs (tRNAs) were identified using tRNAscan-SE v. 2.0.750. Other non-coding RNAs (ncRNAs), such as microRNAs (miRNAs) and small nuclear RNAs (snRNAs), were identified using Infernal v. 1.1.251 by searching against the Rfam database52. In total, 1,053 rRNAs, 550 snRNAs, 81 miRNAs, and 387 tRNAs were predicted in the K. roxburghii genome (Table 4).

Table 4 Statistics of non-coding RNA prediction in the Knoxia roxburghii genome.

Transposable elements and annotation of repeat sequences

Repetitive elements were identified through transposable element annotation using the Extensive de novo TE Annotator (EDTA) program v. 2.0.153 (parameters: --sensitive 1 --anno 1). The insertion time was calculated using LTR_retriever54 with default parameters. TEsorter v. 1.355 (parameters: -db rexdb) was used to classify the clade level of LTR-RTs and extract LTR-RT protein domains. MAFFT v. 7.47556 (parameters: --auto) was utilized to align LTR-RT sequences, and a phylogenetic tree was constructed using IQ-TREE v. 2.2.2.657 (parameters: -bb 1000).
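Insertion times of LTR retrotransposons are commonly derived from the divergence between the two LTRs of an element, since the LTRs are identical at the time of insertion. The Python sketch below shows one such calculation under stated assumptions: a Jukes-Cantor correction of the observed divergence and a neutral substitution rate of 1.3e-8 per site per year, which is understood to be the LTR_retriever default but was not stated explicitly in the text. The function names and the example identity value are illustrative and are not data from this study.

```python
import math


def jukes_cantor_distance(identity):
    """Correct the observed divergence (1 - identity) for multiple substitutions."""
    p = 1.0 - identity
    if not 0.0 <= p < 0.75:
        raise ValueError("divergence must lie in [0, 0.75) for Jukes-Cantor")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def insertion_time_years(identity, rate=1.3e-8):
    """Estimate insertion time as T = K / (2 * rate), with K the corrected distance."""
    # The factor 2 reflects that both LTRs accumulate substitutions independently.
    return jukes_cantor_distance(identity) / (2.0 * rate)


# Illustrative LTR pair with 99.5% identity (assumed value, not study data).
print(f"{insertion_time_years(0.995) / 1e6:.2f} million years")
```

Under these assumptions, younger insertions correspond to higher LTR identity, so a burst of recent activity appears as a peak close to zero in the insertion-time distribution.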

Based on the high-quality reference genome in this study, 307.60 Mb of repetitive sequences of K. roxburghii were predicted (Table 6). Among the integrated results, 33.56% (149.76 Mb) of the sequences were long terminal repeat (LTR) retrotransposons, with LTR/Copia elements being the dominant class (28.71% of the whole genome, 128.15 Mb), followed by LTR/Gypsy elements (2.79% of the whole genome, 12.47 Mb). To investigate the evolutionary history of transposable elements (TEs) in the K. roxburghii genome, a distribution plot of identity values between genomic copies and their consensus sequences was generated. The distributions of LTRs showed a peak at 89% identity, which was larger than the peaks of the other TE types, indicating that LTR-retrotransposons were recently transposed in the genome of K. roxburghii (Fig. 2a). Additionally, the genome contained 3,394 LTR-RTs, and the LTR retrotransposon burst was estimated to have occurred approximately 0.2 million years ago (Fig. 2b). For LTR/Gypsy and LTR/Copia, phylogenetic trees revealed that repeat elements were organized into different clades and expanded in clusters (Fig. 2c,d).
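The genome proportions quoted above can be reproduced by dividing each repeat class length by the total assembly length of 446.30 Mb. The short Python sketch below performs this check; the function name `genome_percent` and the dictionary layout are illustrative assumptions, while the lengths are those reported in the text.

```python
def genome_percent(part_mb, total_mb):
    """Return the share of the assembly occupied by a feature, in percent."""
    if total_mb <= 0:
        raise ValueError("total length must be positive")
    return 100.0 * part_mb / total_mb


ASSEMBLY_MB = 446.30  # total assembly length reported in the text

# Repeat class lengths (Mb) as reported in the text.
repeat_lengths_mb = {
    "all repeats": 307.60,
    "LTR retrotransposons": 149.76,
    "LTR/Copia": 128.15,
    "LTR/Gypsy": 12.47,
}

for name, length_mb in repeat_lengths_mb.items():
    print(f"{name}: {genome_percent(length_mb, ASSEMBLY_MB):.2f}%")
```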

Fig. 2

Repeat sequence analysis of the Knoxia roxburghii genome. (a) Distribution of sequence identity values between genomic copies and consensus repeats in the K. roxburghii genome. (b) Distribution of estimated insertion times of LTR retrotransposons in the K. roxburghii genome. (c) Phylogenetic tree of Ty1/Copia-type LTR retrotransposons. (d) Phylogenetic tree of Ty3/Gypsy-type LTR retrotransposons.

Data Records

The BGI short reads, PacBio HiFi long reads, Hi-C reads, and RNA-Seq data have been deposited in the NCBI Sequence Read Archive with accession numbers SRR2577737258, SRR2578793459, SRR2495841360, and SRR2577516761, respectively. The genome assembly has been deposited in DDBJ/ENA/GenBank under the accession number JAUECX00000000062. The chromosomal assembly and gene annotation dataset have been deposited in the FigShare database at https://doi.org/10.6084/m9.figshare.2354256663.

Technical Validation

The integrity of the genome assembly was assessed using the sequence identity method. Reads from a small-fragment library were specifically selected and aligned to the assembled genome using BWA v. 0.7.17-r118864. The alignment rate of all small fragment reads to the genome was approximately 99.60%, and the coverage rate was approximately 99.49%, indicating consistency between the reads and the assembled genome.

We performed a Benchmarking Universal Single-Copy Orthologs (BUSCO) v. 4.1.465 analysis based on the embryophyta_odb10 database to assess the completeness of the assembly, which indicated that 97.50% of the BUSCOs were present as complete genes in the assembly (Table 5). Furthermore, 99.78% of the assembled sequences were successfully anchored to the 10 chromosomes. The accuracy of the chromosome assembly was indirectly confirmed by examining the Hi-C heatmap, which revealed a well-organized interaction contact pattern along the diagonals within and around the chromosome region (Fig. 1b). This observation provides additional support for the precision of the chromosome assembly.

Table 5 Statistics for BUSCO estimation for Knoxia roxburghii genome assembly and annotation.
Table 6 Statistics of repeat elements of the Knoxia roxburghii assembly.

To validate the predicted genes, we performed a BUSCO analysis. The analysis revealed a high reliability of the annotated results, as approximately 98.40% of the complete BUSCOs were identified (Table 5). The annotation results were considered acceptable since the number of predicted genes and structural characteristics of the K. roxburghii genome were consistent with those of the genomes of closely related species.