Background & Summary

Recent studies indicate that snakes gradually evolved from lizards during the Early Cretaceous period, approximately 117.68 million years ago1. According to the most recent entries in the Reptile Database (https://reptile-database.reptarium.cz/), there are over four thousand snake species distributed across all continents except Antarctica, occupying diverse ecological niches and demonstrating high species diversity2. This broad distribution and adaptation to various habitats make snakes a vital component of Earth’s biodiversity3. Furthermore, certain snakes have developed distinct characteristics through evolution. For instance, Viperidae and Elapidae snakes exhibit high venom potency4. Snakes within the subfamily Hydrophiinae have adapted to sea life5, while those in the Typhlopidae are adept at living in soil6. Consequently, snakes represent an irreplaceable subject for biodiversity and adaptive evolutionary research. In recent years, high-quality chromosome-level genomes of several snake species have been published, offering valuable insights into unique traits and snake evolution1,7,8,9,10. However, despite these advancements, there remains a significant dearth of available data on snakes, both in terms of quantity and quality of reference genomes, hindering further research in this field.

Malnate (1960) partitioned Natrix sensu lato based on several morphological characters and restored the genus Rhabdophis11, which was established by Fitzinger in 1843 with R. subminiatus as the type species. Unlike other snakes, members of the genus Rhabdophis possess glands beneath the skin of the neck and back, referred to as nuchal and dorsal glands, respectively12. In some species, these glands are confined to the neck (e.g., R. tigrinus)13. These glands harbor potent cardiotonic steroids known as bufadienolides (BDs), which serve as defensive toxins against predators13,14. According to the latest records from the Reptile Database, there are currently 34 known species of the genus Rhabdophis worldwide. However, to date, no reference genome is available for any species in the genus, which poses a challenge for genomic studies of these species.

In 1891, Boulenger described Tropidonotus nuchalis15 based on a specimen from Hubei, China, subsequently classified as Natrix nuchalis16 by Parker in 1925 and revised by Malnate in 1960 as R. nuchalis11. This species is known by the common name Hubei keelback and is widely distributed in China17. Its diet primarily consists of earthworms and firefly larvae. Notably, R. nuchalis acquires BDs from firefly larvae and stores them in its dorsal neck glands, making it an ideal candidate for studying genetic diversity and complex trait evolution14. However, current research on R. nuchalis primarily focuses on morphology18, phylogenetic relationships17, and biogeography19, and the absence of genomic data has hindered further exploration.

In this study, we successfully assembled and annotated the genome of R. nuchalis at the chromosome level by MGI short-read sequencing, PacBio Revio long-read sequencing, Hi-C20 sequencing, and RNA sequencing (RNA-seq) techniques. We estimated genome size and heterozygosity from clean short reads, performed long-read sequencing using the PacBio Revio System, and combined it with Hi-C20 reads to achieve chromosome-level assembly. Genome annotation was conducted using RNA-seq reads from five tissues (heart, spleen, lung, kidney, and muscle), published genomes of closely related species, and de novo prediction methods. Additionally, we assessed the quality of genome assembly using various metrics. Our efforts culminated in the first high-quality reference genome of the genus Rhabdophis, providing essential genetic data for studying adaptive evolution, genetic diversity, and resequencing analysis of R. nuchalis and the broader genus Rhabdophis.

Methods

Ethics statement

This study adhered to all pertinent ethical and legal guidelines and regulations. The collection of animals and extraction of tissues underwent thorough review and received approval from the Animal Ethics and Welfare Committee of Sichuan Agricultural University (Approval No. 20230121).

Sample collection

An adult female R. nuchalis (body length of 755 mm) was collected from the Shennongjia forest area (latitude: 31.683625, longitude: 110.418075) in Hubei Province, China, for genome sequencing and assembly. Six different tissues (heart, liver, spleen, lung, kidney, and muscle) were sequentially collected and rapidly frozen in liquid nitrogen upon collection, then stored at −80 °C. Liver tissue was utilized for MGI short-read sequencing, PacBio Revio HiFi long-read sequencing, and Hi-C sequencing, while the remaining five tissues were designated for RNA sequencing.

Library construction and sequencing

The collected tissues were sent to GrandOmics Biosciences Co., Ltd. (Wuhan, China), for DNA extraction, library construction, and sequencing. Genomic DNA (gDNA) was extracted from the liver following the manufacturer’s instructions and used for the construction of gDNA libraries. The integrity and purity of the gDNA samples were assessed using agarose gel electrophoresis.

For short-read sequencing, 1.5 μg gDNA was randomly fragmented by Covaris, following the guidelines specified in the device’s operating manual, and 300–400 bp fragments were selected with the Agencourt AMPure XP-Medium kit. The library was then constructed from the selected fragments using the AxyPrep Mag PCR clean-up Kit according to the manufacturer’s instructions. Finally, the qualified libraries were sequenced on the BGISEQ DNBSEQ-T7 platform. This yielded 108.82 Gb of raw reads, and 101.02 Gb of clean reads (with an average depth of coverage of 38.95×) were obtained after quality control using fastp v0.21.021 (Table 1). These clean reads were utilized for genome size estimation and to evaluate the accuracy of genome assembly.

Table 1 Statistics of sequencing data used for R. nuchalis genome assembly in this study.

For PacBio HiFi long-read sequencing, 5 µg of gDNA was used to construct SMRTbell libraries following PacBio’s standard protocol (Pacific Biosciences, CA, USA). The gDNA was sheared with g-TUBEs (Covaris, USA) to the fragment size expected for the library, followed by DNA damage repair, end repair, A-tailing, and ligation of hairpin adapters to both ends of the fragments using the SMRTbell Express Template Prep Kit 3.0 (Pacific Biosciences). After nuclease treatment of the SMRTbell library with the SMRTbell Enzyme Cleanup Kit, target fragments were size-selected using PippinHT (Sage Science, USA), and the prepared library was sequenced on a PacBio Revio instrument with the Revio Kit at GrandOmics. This resulted in 79.31 Gb of HiFi long reads (with an average coverage depth of 39.65×) for genome assembly (Table 1).

Hi-C libraries were constructed following a published protocol22 to scaffold the genome to the chromosome level. Key steps included fixation of liver samples with 2% formaldehyde, digestion with the DpnII enzyme, end repair, biotin-14-dCTP labeling, ligation with T4 DNA ligase, reversal of cross-links, and fragmentation of the DNA. Subsequently, the libraries were sequenced on the BGISEQ DNBSEQ-T7 platform. This generated 209.72 Gb of raw reads, and 209.72 Gb of clean reads (with an average depth of coverage of 107.70×) were obtained after quality control using fastp v0.21.021 (Table 1).

To improve the precision of genome annotation, RNA sequencing was conducted across five distinct tissues: heart, spleen, lungs, kidneys, and muscles. Each tissue underwent RNA extraction utilizing TRIzol reagent (Invitrogen, USA), followed by assessment of RNA purity and concentration using Nanodrop and Qubit, construction of RNA-seq libraries employing the MGIEasy RNA Sample Prep Kit (UW Genetics), and sequencing on the BGISEQ DNBSEQ-T7 platform. A minimum of 6 Gb of sequencing data was guaranteed for each tissue. In total, 40.26 Gb of raw reads were generated, with 40.14 Gb of clean reads obtained post quality control using fastp v0.21.021 (Table 1). These clean reads were utilized for transcriptome annotation of the genome.

Predicting genome size and heterozygosity

The genome size and heterozygosity of R. nuchalis were predicted using KMC v3.2.123 and GenomeScope v124 software before assembly. Initially, the short reads, post-quality control, underwent analysis with KMC v3.2.123 (parameter k = 17) to generate the k-mer frequency distribution table. Subsequently, the obtained k-mer frequency distribution table was analyzed using GenomeScope v124 software to derive genome prediction information. Finally, the prediction results indicated a genome size of 1.57 Gb and a heterozygosity of 1.20% (Table 2).
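The k-mer approach rests on a simple relation: genome size is roughly the total number of k-mer occurrences divided by the depth of the main peak in the k-mer histogram. The sketch below illustrates that relation only. The function name `estimate_genome_size`, the `min_depth` error cutoff, and the toy histogram are assumptions made for illustration; GenomeScope fits a statistical model to the histogram rather than using this shortcut.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Rough genome size estimate from a k-mer histogram.

    histogram: dict mapping k-mer multiplicity -> number of distinct k-mers.
    min_depth: multiplicities below this are treated as sequencing errors
               (hypothetical cutoff chosen for illustration).
    """
    kept = {d: n for d, n in histogram.items() if d >= min_depth}
    if not kept:
        raise ValueError("no k-mers above the error cutoff")
    # Main peak: the multiplicity shared by the most distinct k-mers.
    peak_depth = max(kept, key=kept.get)
    # Total k-mer occurrences = sum of multiplicity * distinct count.
    total_kmers = sum(d * n for d, n in kept.items())
    return total_kmers / peak_depth


# Toy histogram with made-up numbers (not data from this study).
toy = {1: 900, 2: 300, 19: 40, 20: 100, 21: 40}
print(estimate_genome_size(toy))
```

In a highly heterozygous genome the tallest peak can be the heterozygous one, which is one reason a model fit is preferable to this shortcut.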

Table 2 Statistical analysis of the size and heterozygosity prediction for the R. nuchalis genome (K-mer = 17).

De novo assembly of the R. nuchalis genome

De novo assembly of the R. nuchalis genome was conducted from the HiFi long reads using hifiasm v0.16.025. The preliminary assembly was compared against the NT (Nucleotide Sequence Database) library; sequences longer than 1 Mb were cut into 50 kb segments for this comparison, and contaminating sequences (non-target macroclasses, mitochondria) were removed to yield the final assembly. After contamination removal, the R. nuchalis assembly was 1.93 Gb in size, with a contig N50 of 104.79 Mb (Table 3).
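For readers unfamiliar with the N50 statistic quoted above, the following sketch shows how it is conventionally computed: it is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly. The function name `n50` and the toy lengths are illustrative assumptions, not values from this study.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first sequence that pushes the running sum past half
        # of the total assembly length defines the N50.
        if running >= half_total:
            return length


# Toy contig lengths (arbitrary units, made up for illustration).
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))
```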

Table 3 Assembly statistics for R. nuchalis are presented in two parts: the first part comprises the assembly results prior to Hi-C integration, while the second part showcases the outcomes of the Hi-C-assisted assembly process.

To assess the quality of the genome assembly, we first employed BUSCO v4.0.526 (Benchmarking Universal Single-Copy Orthologs) to evaluate completeness. This involved analyzing single-copy homologous genes in the OrthoDB database vertebrata_odb10. The analysis revealed that 3,270 (97.50%) out of 3,354 BUSCO groups were identified as complete, including 3,232 complete and single-copy BUSCOs (96.36%), and 38 complete and duplicated BUSCOs (1.13%), indicating high completeness of the assembled genome (Table 4).
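The percentages above follow directly from the reported counts. The short sketch below recomputes them; the helper name `busco_percentages` is an assumption for illustration, and because the text does not give fragmented and missing BUSCOs separately, they are reported here as one combined remainder.

```python
def busco_percentages(single, duplicated, total):
    """Convert BUSCO counts into percentages of the lineage dataset."""
    complete = single + duplicated

    def pct(n):
        return round(100 * n / total, 2)

    return {
        "complete": pct(complete),
        "single_copy": pct(single),
        "duplicated": pct(duplicated),
        # Fragmented and missing are not separated in the text,
        # so only their combined share is derived here.
        "fragmented_or_missing": pct(total - complete),
    }


# Counts reported for the genome assembly (vertebrata_odb10).
assembly = busco_percentages(single=3232, duplicated=38, total=3354)
print(assembly)
```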

Table 4 Statistics from the BUSCO assessment of the R. nuchalis genome assembly and annotation.

Furthermore, to evaluate the accuracy of the assembly, clean short reads and HiFi long reads were mapped to the R. nuchalis genome using BWA v0.7.1527 and minimap228, respectively. The results indicated that at a coverage depth of 1×, the clean short reads and HiFi long reads achieved 98.24% and 99.97% coverage across the entire genome, respectively (Table 5). This demonstrates the high accuracy of the genome assembly.
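The coverage figure reported here is a breadth of coverage: the fraction of genome positions covered by at least one read. A minimal sketch of that calculation is given below, assuming per-base depths are already available as a list (for example, parsed from the depth column of `samtools depth -a` output; the source does not state which tool produced the per-base depths). The function name and the toy values are hypothetical.

```python
def breadth_of_coverage(depths, min_depth=1):
    """Fraction of positions covered by at least `min_depth` reads."""
    if not depths:
        raise ValueError("depths must not be empty")
    covered = sum(1 for d in depths if d >= min_depth)
    return covered / len(depths)


# Toy per-base depths (made up; a real genome has ~1.9 billion positions).
toy_depths = [0, 3, 5, 0, 12, 7, 1, 9, 4, 2]
print(f"{breadth_of_coverage(toy_depths):.2%}")
```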

Table 5 Alignment of quality-controlled short reads and HiFi long reads to the assembled R. nuchalis genome.

Hi-C assisted assembly

We employed a multi-step process to assemble the genome of R. nuchalis to the chromosome level using quality-filtered Hi-C reads. Firstly, clean Hi-C reads were aligned to genomes assembled with HiFi long reads using bowtie2 v2.3.229 to obtain uniquely mapped paired-end reads. Subsequently, HiC-Pro v2.8.130 was utilized to identify and retain valid interacting paired-end reads from these uniquely mapped pairs while filtering out invalid sequences such as dangling-end, self-cycle, re-ligation, and dumped products.

The scaffolds were then clustered, ordered, and assigned to chromosomes using LACHESIS v131, and the assembly was manually adjusted in Juicebox v1.11.0832 to derive the final pseudochromosomes. The chromosomes, GC content, gene density, abundance of repetitive sequences, and ncRNA distribution of the genome were visualized using the advanced circos33 in TBtools II34 (Fig. 1B). The assembly comprises 20 chromosomes: 9 macrochromosomes and 11 microchromosomes (using the 50 Mb threshold applied in squamates35). Chromosome sizes ranged from 14.96 Mb to 411.07 Mb, giving a total genome size of 1.92 Gb (Tables 3, 6, and Fig. 1). The contig N50 was 104.79 Mb and the scaffold N50 was 204.96 Mb (Table 3). Together, these steps organized the assembly into chromosome-scale sequences.
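The macro/micro split described above is a simple size threshold. The sketch below applies a 50 Mb cutoff to a dictionary of chromosome sizes. Only the largest (411.07 Mb) and smallest (14.96 Mb) sizes come from the text; the middle entry, the labels, and the choice to treat exactly 50 Mb as a macrochromosome are assumptions for illustration.

```python
MACRO_THRESHOLD_MB = 50.0  # threshold used for squamates in the text


def classify_chromosomes(sizes_mb, threshold_mb=MACRO_THRESHOLD_MB):
    """Split chromosomes into macro- and microchromosomes by size."""
    macro = [name for name, size in sizes_mb.items() if size >= threshold_mb]
    micro = [name for name, size in sizes_mb.items() if size < threshold_mb]
    return macro, micro


sizes = {
    "largest": 411.07,      # reported maximum chromosome size
    "example_mid": 120.00,  # hypothetical value, not from the study
    "smallest": 14.96,      # reported minimum chromosome size
}
macro, micro = classify_chromosomes(sizes)
print("macro:", macro)
print("micro:", micro)
```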

Fig. 1

(A) Hi-C interaction heatmap for R. nuchalis. On the x-axis and y-axis, numbers 1 to 9 represent the 9 macrochromosomes and “10~20” corresponds to the 11 microchromosomes. (B) Circos diagram of the R. nuchalis genome. “Chr 01” to “Chr 20” represent chromosomes; (a) chromosomes; (b) GC ratio; (c) gene density; (d) abundance of repetitive sequences; (e) abundance of ncRNA; (b–e) per 100-kb window. At the center is an adult R. nuchalis photographed by Xuemei Tang.

Table 6 Statistics of 20 chromosomes of R. nuchalis genome (1–9 are macrochromosomes and 10–20 are microchromosomes).

Repeat sequence annotation

Repeat sequences, comprising tandem repeats (TRs) and transposable elements (TEs), were annotated in the genome of R. nuchalis using a combination of software tools and databases. For TRs, we employed GMATA v2.236 and Tandem Repeats Finder (TRF v4.07b37). GMATA v2.236 identified simple sequence repeats (SSRs), while TRF v4.07b37 identified all tandem repeats in the genome. For TEs, de novo and homology-based annotation were combined. First, TEs were annotated de novo using MITE-hunter38 and RepeatModeler v1.0.1139, together with LTR_FINDER40, LTR_harvest41, and LTR_retriever42 for concurrent detection of repeat sequences. Subsequently, the obtained libraries were compared with the TEclass Repbase database to categorize each repeat family using TEclass v2.1.343. Furthermore, RepeatMasker v1.33144 was used to search for both known and novel TEs by mapping sequences from the de novo repeat libraries and the Repbase repeat libraries onto the genome. Overlapping TEs belonging to the same repeat class were sorted and merged.

The results indicated that a total of 1.09 Gb of repetitive sequences were annotated in the genome of R. nuchalis, constituting 56.51% of the entire genome. Among these, TRs and TEs accounted for 13.78 Mb and 885.68 Mb, representing 0.72% and 46.02% of the whole genome, respectively. Class I and Class II TEs comprised 628.50 Mb and 257.18 Mb, contributing 32.66% and 13.36% of the entire genome, respectively (Table 7). This annotation characterizes the repetitive landscape of the R. nuchalis genome.
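The reported repeat figures can be cross-checked with simple arithmetic: the Class I and Class II sizes should add up to the transposable-element total, and dividing a size by its genome share should return roughly the assembly size. The sketch below performs both checks using only values quoted in the text; the helper name `implied_genome_size_mb` is an assumption for illustration.

```python
def implied_genome_size_mb(size_mb, percent_of_genome):
    """Genome size implied by a repeat category's size and genome share."""
    return size_mb / (percent_of_genome / 100)


# Values quoted in the text (Table 7).
class_i_mb, class_ii_mb = 628.50, 257.18
te_total_mb, te_percent = 885.68, 46.02

# Check 1: the two classes should sum to the transposable-element total.
te_sum_mb = class_i_mb + class_ii_mb
print(round(te_sum_mb, 2), "vs", te_total_mb)

# Check 2: the implied genome size should be close to the ~1.92 Gb assembly.
print(round(implied_genome_size_mb(te_total_mb, te_percent)), "Mb")
```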

Table 7 Statistical outcomes regarding repetitive sequences in the R. nuchalis genome (categorized by the type of repetitive sequences).

Gene structure annotation

In the structural annotation of the R. nuchalis genome, we initially applied RepeatMasker v1.33144 to soft-mask the annotated repetitive sequences. Subsequently, gene structure prediction was conducted through three methods: homology prediction, transcriptome prediction, and de novo prediction, with integration of the results to derive the final gene structure annotation. For homology prediction, comparisons were made with the genomes of five closely related species (Ahaetulla prasina7, Calamaria septentrionalis1, Pantherophis guttatus1, Thamnophis elegans NCBI accession GCA_009769535.1, and Thermophis baileyi8) using GeMoMa v1.6.145 software. Transcriptome prediction involved mapping quality-controlled RNA-seq reads to the R. nuchalis genome using STAR v2.7.3a46, followed by transcript assembly with Stringtie v1.3.4d47 and prediction of open reading frames (ORFs) using PASA v2.3.348. De novo prediction entailed reassembly of RNA-seq reads using Stringtie v1.3.4d and analysis with PASA v2.3.348 to generate a training set, followed by de novo gene prediction using Augustus v3.3.149. Finally, the predictions were integrated using EVM v1.1.148 (EVidenceModeler).

The results indicated that homology prediction, transcriptome prediction, and de novo prediction annotated 48,439, 18,203, and 20,575 genes, respectively, with a final count of 22,057 protein-coding genes annotated after EVM v1.1.148 integration. Among them, the average gene length and CDS length were 34,853.45 bp and 1,617.01 bp, respectively. Each gene contained an average of 9.12 exons, while the average lengths of exons and introns were 177.32 bp and 4,093.52 bp, respectively (Table 8).
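These summary statistics are internally consistent when 9.12 is read as the average number of exons per gene: exons per gene times mean exon length approximates the mean CDS length, and adding one fewer intron than exons approximates the mean gene length. The sketch below is a rough check only, since products of averages need not equal averages of products; the function name `estimate_lengths` is hypothetical.

```python
def estimate_lengths(exons_per_gene, mean_exon_bp, mean_intron_bp):
    """Approximate mean CDS and gene lengths from per-gene averages."""
    cds_bp = exons_per_gene * mean_exon_bp
    # A gene with n exons has n - 1 introns.
    gene_bp = cds_bp + (exons_per_gene - 1) * mean_intron_bp
    return cds_bp, gene_bp


# Averages quoted in the text (Table 8).
cds_est, gene_est = estimate_lengths(9.12, 177.32, 4093.52)
print(f"estimated CDS length:  {cds_est:.2f} bp (reported 1,617.01 bp)")
print(f"estimated gene length: {gene_est:.2f} bp (reported 34,853.45 bp)")
```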

Table 8 Statistical outcomes of gene structure annotation in the R. nuchalis genome, obtained through three different methods and subsequently integrated.

Gene function annotation

We have successfully completed the functional gene annotation of the R. nuchalis genome by utilizing five key public databases: GO (Gene Ontology)50, SwissProt51, NR (Non-Redundant protein Database), KEGG (Kyoto Encyclopedia of Genes and Genomes)52, and KOG (Eukaryotic Orthologous Groups of proteins)53. In the case of the GO database, we employed the default parameters of the InterProScan v5.3254 program for gene function annotation. For the remaining four databases, we utilized Blastp v2.7.1 to annotate gene functions. The results revealed that 13,451, 18,567, 19,655, 14,474, and 13,362 genes were annotated in GO50, SwissProt51, NR, KEGG52, and KOG53, respectively, accounting for 60.98%, 84.18%, 89.11%, 65.62%, and 60.58% of the total number of genes in R. nuchalis (Table 9). Notably, 9,343 genes were annotated across all five databases (Fig. 2). By integrating the annotation outcomes from these databases, we completed the functional annotation of 19,918 genes, representing 90.30% of the total gene count (Table 9, Fig. 2).
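The annotation rates above are each database's gene count divided by the 22,057 predicted protein-coding genes. The sketch below recomputes them from the counts in the text; the dictionary layout and the helper name `percent_annotated` are assumptions for illustration.

```python
TOTAL_GENES = 22057  # protein-coding genes after EVM integration

# Genes annotated per database, as quoted in the text (Table 9).
annotated = {
    "GO": 13451,
    "SwissProt": 18567,
    "NR": 19655,
    "KEGG": 14474,
    "KOG": 13362,
    "At least one database": 19918,
}


def percent_annotated(count, total=TOTAL_GENES):
    """Share of all predicted genes, as a percentage with two decimals."""
    return round(100 * count / total, 2)


rates = {db: percent_annotated(n) for db, n in annotated.items()}
for db, rate in rates.items():
    print(f"{db}: {rate}%")
```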

Table 9 Statistical findings from the functional annotation of genes within the R. nuchalis genome, sourced from five distinct databases and subsequently consolidated.
Fig. 2

Venn diagram illustrating the outcomes of functional annotation for R. nuchalis across five databases.

Subsequently, we conducted an evaluation of the genome annotation results. Initially, the annotated genes were assessed using BUSCO v4.0.526 based on the OrthoDB database vertebrata_odb10. The evaluation revealed that 3,237 complete genes were identified within 3,354 BUSCO groups, accounting for 96.51% of the database, underscoring the high completeness of the annotated genome of R. nuchalis (Table 4). Furthermore, we compared the genome of R. nuchalis with the published genomes of five closely related species, which exhibited a total gene count ranging from 18,213 to 22,959 genes (Table 10). Remarkably, R. nuchalis possessed 22,057 genes, aligning well with the published species (Table 10). Additionally, in terms of gene length, average CDS length, exon length, the average number of exons per gene, intron length, and the distribution of intron number, R. nuchalis exhibited consistency with the five closely related species (Table 10, Fig. 3).

Table 10 Comparison of results of R. nuchalis genome annotation with closely related species.
Fig. 3

Comparing the genomes of R. nuchalis assembled and annotated in this study with the published genomes of five closely related species. Each species is represented by a different colored line; A-F show the gene length, CDS length, exon number, exon length, intron number, and intron length distribution in the genomes of each species sequentially.

Non-coding RNA (ncRNA) annotation

The annotation of ncRNAs in the R. nuchalis genome was accomplished through a combination of database searching and model prediction methods. Specifically, tRNAs were annotated using tRNAscan-SE v2.055, while microRNAs, rRNAs, small nucleolar RNAs, and small nuclear RNAs were identified by searching the Rfam database56 using Infernal v1.1.2 cmscan57. Additionally, RNAmmer v1.258 was employed to annotate rRNAs and their subunits. In total, 3,599 ncRNAs were annotated in the R. nuchalis genome, including 397 rRNAs, 981 snRNAs, and 2,063 tRNAs (Table 11).

Table 11 Statistical results of Non-coding RNA annotation of the R. nuchalis genome (categorized by the type of Non-coding RNA).

Data Records

All the raw sequencing data generated in this study have been uploaded to the NCBI Sequence Read Archive (SRA) database with the accession number SRP50004559. The assembled chromosome-level genome data have been deposited in GenBank with the accession number GCA_039707465.160. The genome annotation data have been uploaded to Figshare (https://doi.org/10.6084/m9.figshare.25559178.v1)61.

Technical Validation

To assess the accuracy and completeness of the assembled genome of R. nuchalis, we conducted a BUSCO v4.0.526 assessment, identifying 3,270 complete BUSCO genes out of 3,354, indicating 97.50% completeness (Table 4). Furthermore, mapping clean short reads and HiFi long reads to the genome showed that they covered 98.24% and 99.97% of the genome, respectively, at a coverage depth of 1×, demonstrating high accuracy (Table 5). Additionally, for the gene structure annotation, BUSCO assessment yielded 3,237 complete genes out of 3,354 BUSCO groups, representing 96.51% completeness (Table 4). Comparison with five closely related species showed consistency in gene count and various gene parameters, supporting the reliability of the genome annotation (Table 10, Fig. 3).