Background & Summary

Freshwater mussels (Unionoida) represent the most diverse order of freshwater bivalves1 and are found in all regions of the world except Antarctica2. They not only play an important role in the food web structure and material cycling of ecosystems3,4 but also have high economic value as food5, in pearl cultivation6, and as a source of anti-tumor ingredients7. They have also been used as indicators for biological monitoring and the evaluation of heavy metal pollution8.

Freshwater mussels are benthic filter feeders9. Suitable substrate, water quality, and food are important factors for their survival and reproduction. In recent years, human activities such as river diversion, chemical pollution, and overfishing have caused serious damage to mussel habitats10. The life history of most mussels involves a parasitic larval stage (glochidia) that must attach to vertebrate hosts (primarily fish) to complete metamorphosis11, which increases their vulnerability2. The International Union for Conservation of Nature (IUCN) Red List reports that 173 species are extinct, endangered, or threatened, 99 are vulnerable or near threatened, and 84 are unclassified because data are deficient12.

There are 57 endemic species in China13, and eight species have now been listed as Grade II national protected animals14. The biodiversity and population size of freshwater mussels in large water bodies such as the Yangtze River15 and the Songhua River16 have declined significantly. Sinosolenaia oleivora is endemic to China. In 2022, S. oleivora was identified as one of the top ten characteristic aquatic germplasm resources by the Ministry of Agriculture and Rural Affairs. S. oleivora has fresh and tender meat, a delicious taste, and a high nutrient content17. In Fuyang of Anhui Province, Tianmen of Hubei Province, and other places, S. oleivora is a famous delicacy with a high economic value, and it is called the “abalone in Huaihe River.” It once had an extensive distribution, occurring in five freshwater lakes and the tributaries of the Yangtze and Huaihe Rivers18. Habitat fragmentation and other human activities (e.g., overfishing) have resulted in its endangerment19. Tianmen in Hubei Province and Fuyang in Anhui Province have established S. oleivora nature reserves to support this ecologically and economically vital resource.

Genomic data are considered fundamental for revealing biological characteristics, inferring evolutionary mechanisms, and promoting effective conservation20. To date, only seven freshwater mussel species have had their genomes sequenced (Table S1, Supplementary File)21,22,23,24,25,26,27,28, and only one of these is a Chinese species27. A whole-genome sequence for S. oleivora has been lacking. We applied multiple sequencing technologies, including Illumina NovaSeq 6000 sequencing, PacBio long-read sequencing (PacBio), and high-throughput chromosome conformation capture (Hi-C), to complete genome sequencing and assembly. Three methods, namely de novo prediction, homology-based prediction, and RNA-Seq-based prediction, were used for genome annotation. In addition, a comparative genomics analysis of S. oleivora and 10 other distantly related species was performed. This study provides important genomic resources for conservation and evolutionary research and may guide the genetic improvement of traits such as growth.

Methods

Sample collection and sequencing

One female S. oleivora was sampled from the national-level protection zone of the aquatic germplasm resource of S. oleivora in the Fuyang Division of Huaihe River (32.428725°N, 115.600287°E). Total DNA was extracted from the adductor muscle of S. oleivora using the DNeasy Blood and Tissue Kit (Qiagen, Germany) for genome sequencing. For short-read sequencing, a Covaris M220 was used to shear the DNA into 300–350 bp fragments. DNA library preparation was completed by terminal repair, A-tail addition, sequencing adapter ligation, DNA purification, and bridge PCR. These libraries were then sequenced on the Illumina NovaSeq 6000 platform using a paired-end (PE) sequencing strategy. For long-read sequencing, a PacBio HiFi library was generated according to the PacBio standard protocol using an SMRTbell Template Prep Kit 2.0 (Pacific Biosciences, USA) and sequenced on the PacBio Sequel II platform. A Hi-C library was prepared following the Hi-C library protocol29 and sequenced on the Illumina NovaSeq 6000 platform. Total RNA was extracted from the adductor muscle of S. oleivora using TRIzol reagent (Invitrogen, MA, USA) for transcriptome sequencing. The RNA-seq library was generated using the NEBNext® Ultra™ RNA Library Prep Kit (NEB, USA) for PE sequencing, and short reads were produced on the Illumina NovaSeq 6000 platform. A total of 192.1 Gb of Illumina data, 63.2 Gb of PacBio data, 191.8 Gb of Hi-C data, and 5.6 Gb of RNA-Seq data were obtained (Fig. 1, Table 1).

Fig. 1

Genome characteristics of Sinosolenaia oleivora.

Table 1 Statistics for the sequencing data of the Sinosolenaia oleivora genome.

Estimation of genome size

A K-mer-based method30 was applied to estimate the genome size, heterozygosity, and repeat content in S. oleivora. We performed a k-mer (k = 17) frequency distribution analysis using 192.1 Gb of Illumina clean data (Fig. 2). A total of 153,573,141,235 k-mers with a depth of 73 was obtained. The genome size was 2,025 Mb, the heterozygosity ratio was 0.78%, and the repeat sequence ratio was 61.37%.
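The basic relationship behind a k-mer-based size estimate can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes the simplest form of the estimate (total k-mer count divided by the depth of the main peak) and uses the two figures quoted above. The function name is hypothetical. The uncorrected value it produces is somewhat larger than the reported 2,025 Mb, which may reflect a correction step (for example, the exclusion of low-frequency erroneous k-mers) that is not detailed in the text.

```python
def estimate_genome_size(total_kmers: int, peak_depth: float) -> float:
    """Return an uncorrected genome size estimate in bases.

    Assumes the simplest k-mer relationship: size = total k-mers / main peak depth.
    """
    if peak_depth <= 0:
        raise ValueError("peak_depth must be positive")
    return total_kmers / peak_depth


# Values quoted in the text (k = 17).
total_kmers = 153_573_141_235
peak_depth = 73

raw_size_mb = estimate_genome_size(total_kmers, peak_depth) / 1e6
print(f"Uncorrected estimate: {raw_size_mb:.1f} Mb")
```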

Fig. 2

Frequency distribution of sample’s K-mer depth and K-mer species.

Genome assembly

PacBio HiFi reads were assembled using Hifiasm (v. 0.16.1-r375)31 with the default parameters. Redundant sequences were filtered out using Purge_Haplotigs (v1.0.4)32 with the cutoff parameters “-a 70 -j 80 -d 200”. Based on the PacBio sequencing data, the assembled genome length was 2090.51 Mb, comprising 302 contigs with a contig N50 of 23.99 Mb. The maximum contig length was 88.20 Mb and the GC content was 34.38% (Table 2).
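For readers unfamiliar with the N50 statistic reported above, the following sketch shows how it is conventionally computed from a list of contig lengths: the contigs are sorted from longest to shortest, and N50 is the length of the contig at which the running total first reaches half of the assembly size. The contig lengths in the example are invented toy values, not data from this assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Toy contig lengths (arbitrary units), for illustration only.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
toy_n50 = n50(toy_contigs)
print(toy_n50)
```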

Table 2 Genome assembly results of Sinosolenaia oleivora.

Hi-C-assisted chromosome-level assembly

To assemble the chromosome-level genome, Hi-C sequencing data were mapped and sorted against the draft genome assembly with Juicer (v1.6)33. The contigs were linked into 19 distinct chromosomes by 3D-DNA (v. 180922)34. Based on chromosome interactions, the contig orientation was corrected and suspicious fragments were removed from the contigs in Juicebox35. The genome contigs were further anchored and oriented to chromosomes by Hi-C scaffolding. The Hi-C library generated 191.8 Gb of clean data, with 55.56% valid pairs. A total of 302 contigs, accounting for 98.41% of the total assembled genome, were anchored onto 19 chromosomes. The 19 pseudo-chromosomes were clearly distinguished in the Hi-C heatmap, with strong intra-chromosomal interactions confirming a high-quality Hi-C assembly (Figs. 3, 4). This resulted in a high-quality genome of 2052.30 Mb, with a contig N50 of 20.36 Mb and a scaffold N50 of 103.57 Mb (Table 3).

Fig. 3

Chromosomes Hi-C heatmap of Sinosolenaia oleivora. Blocks represent height pseudochromosomes. The color bar represents contact density from white (low) to red (high). The same applies to Fig. 4.

Fig. 4

Genome-wide Hi-C heatmap of Sinosolenaia oleivora.

Table 3 Statistics of Hi-C assembly results of Sinosolenaia oleivora.

Repeat annotation, gene prediction, and gene functional annotation

Repeat elements of the S. oleivora genome were annotated by combining homology-based and de novo prediction methods. For homologous alignment, we used RepeatMasker (v4.1.2-p1)36 and Repeat-proteinmask (v4.1.0)37 to annotate the transposable elements (TEs) by comparing sequences to the Repbase database38. For de novo prediction, Tandem Repeats Finder (TRF) (version 4.09)39 was executed to detect tandem repeat elements based on sequence features. LTR_FINDER (v. 1.07)40 and RepeatModeler (v. 2.0.3)36 were used to construct a repeat library, which was then used to detect repetitive sequences with RepeatMasker (v. 4.1.2-p1)36. After eliminating redundancy, we obtained the final annotated repeat set. A total of 1171.79 Mb of repeat sequences were annotated, accounting for 56.05% of the total genome sequence (Table 4). The major repetitive elements were DNA transposons (15.74%), long interspersed nuclear elements (LINEs, 8.95%), and long terminal repeats (LTRs, 4.98%) (Table 5).
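One common way to eliminate redundancy among repeat annotations from several tools is to merge overlapping intervals before counting masked bases, so that a region reported by two programs is counted only once. The sketch below illustrates that idea under stated assumptions: the text does not specify how redundancy was removed, the intervals are invented toy values in 0-based half-open coordinates, and the function name is hypothetical. The final line simply re-derives the reported 56.05% from the 1171.79 Mb of repeats and the 2090.51 Mb contig-level assembly length.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching (start, end) intervals; ends are exclusive."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Toy repeat calls from two hypothetical tools on one sequence.
toy_calls = [(0, 100), (50, 150), (200, 260), (260, 300)]
non_redundant = merge_intervals(toy_calls)
masked_bp = sum(end - start for start, end in non_redundant)
print(non_redundant, masked_bp)

# Consistency check on the figures quoted in the text.
repeat_percent = round(1171.79 / 2090.51 * 100, 2)
print(repeat_percent)
```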

Table 4 Statistics of repetitive sequences in the Sinosolenaia oleivora genome.
Table 5 Statistics of transposable elements for the Sinosolenaia oleivora genome.

The genome sequence was soft-masked based on the repetitive element predictions and then used for protein-coding gene prediction. We employed three methods for gene prediction. For homology-based annotation, the protein sequences of Mizuhopecten yessoensis, Crassostrea gigas, Crassostrea virginica, and Mytilus galloprovincialis were downloaded from NCBI and aligned to the genome sequence using BLAST (E-value: 1e-5)41. Homologous sequences were then aligned to the corresponding matching proteins using GeneWise (v. wise2-4-1)42. For the RNA-seq-based annotation, transcriptomic data were assembled using Trinity (v2.11)43, and the assembled transcripts were aligned to the genome using BLAST (E-value: 1e-5)41. For de novo prediction, Augustus (v3.4.0)44 and Genscan (version 1.0)45 were used to generate de novo-predicted gene sets. MAKER (v2.31.10)46 was used to integrate the results from these methods to produce the final gene set. The genome sequence was also aligned to the single-copy ortholog database of Benchmarking Universal Single-Copy Orthologs (BUSCO)47. MAKER (version 2.31.10)48 and HiCESAP (Wuhan Gooalgene Co., Ltd., https://www.gooalgene.com/) were employed to merge all the data and filter out redundancies. The combination of de novo and homology-based methods predicted 22,971 protein-coding genes (Table 6). The predicted genes were functionally annotated against external protein databases, including SwissProt, InterPro, TrEMBL, Kyoto Encyclopedia of Genes and Genomes (KEGG), and Gene Ontology (GO). A total of 19,229 genes, accounting for 87.52% of all predicted genes, were annotated using public databases (Table 7).
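Soft-masking, as mentioned at the start of the paragraph above, conventionally means converting repeat regions to lowercase while leaving the bases themselves intact, so that gene predictors can down-weight repeats without losing sequence information. The following sketch illustrates the convention on a toy sequence. It is not the procedure used in this study (in practice a masking tool performs this step), and the function name and the 0-based half-open coordinates are assumptions made for the example.

```python
from typing import List, Tuple


def soft_mask(sequence: str, repeats: List[Tuple[int, int]]) -> str:
    """Lowercase the bases inside each (start, end) repeat interval.

    Coordinates are assumed to be 0-based with an exclusive end.
    """
    chars = list(sequence.upper())
    for start, end in repeats:
        if not 0 <= start <= end <= len(chars):
            raise ValueError(f"interval ({start}, {end}) is outside the sequence")
        chars[start:end] = [base.lower() for base in chars[start:end]]
    return "".join(chars)


# Toy sequence with two hypothetical repeat intervals.
masked = soft_mask("ACGTACGTAC", [(2, 5), (8, 10)])
print(masked)
```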

Table 6 Statistics of gene predictions in the Sinosolenaia oleivora genome.
Table 7 Functional annotations of predicted genes.

Noncoding RNAs (ncRNAs) in the genome of S. oleivora were predicted as follows. We used tRNAscan-SE (v1.3.1)51 to identify transfer RNAs (tRNAs); Infernal (v1.1.2)52, with the Rfam49 and miRBase50 databases, to annotate other ncRNAs, including microRNAs (miRNAs) and small nuclear RNAs (snRNAs); and BLAST (E-value: 1e-5)41 to identify ribosomal RNAs (rRNAs). In total, we annotated 119 miRNAs, 2643 tRNAs, 366 rRNAs, and 867 snRNAs, with average lengths of 98, 74, 254, and 168 bp, respectively (Table 8).

Table 8 Non-coding RNA annotation of the Sinosolenaia oleivora genome.

Comparative genomic analyses

To clarify the evolutionary position of S. oleivora, OrthoMCL (v2.0.9)53 with the parameter “-l 1.5” was used to detect orthologous groups among the protein sequences of S. oleivora and Mizuhopecten yessoensis, Biomphalaria glabrata, Crassostrea gigas, C. virginica, Lingula anatina, Lottia gigantea, Mercenaria mercenaria, Ostrea edulis, Pecten maximus, and Pomacea canaliculata. Single-copy orthologous genes were aligned using MUSCLE (v5)54. Based on these alignments, KaKs Calculator (v2.0)55 was used with default parameters to calculate the synonymous substitution rate (Ks). The S. oleivora genome shared 82,067 gene families and 17,699 single-copy genes with the ten other species. The S. oleivora genome contained 21,971 genes clustered into 18,312 gene families, including 2,273 unique families (Table 9). The phylogenetic tree was constructed from the multiple sequence alignment using RAxML (version 8.2.12)56 with the parameters “-f a -N 100 -m GTRGAMMA”. Divergence times were estimated using the MCMCtree program in PAML (v4.9)57 with the parameters clock = 3 and model = 0. Divergence times from the TimeTree database58 (http://www.timetree.org/) were used for calibration: L. anatina and C. gigas, 619.3 (582.0–689.2) MYA; B. glabrata and C. gigas, 544.1 (520.2–567.9) MYA; and P. canaliculata and B. glabrata, 444.6 (377.0–490.4) MYA. Divergence time analysis showed that S. oleivora was closely related to M. mercenaria, with a divergence time of 516.7 (486.9–541.0) MYA (Fig. 5).
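The selection of single-copy orthologous genes from a gene family clustering can be sketched as a simple filter: a family qualifies only if every species under comparison contributes exactly one gene. The example below is an illustration under stated assumptions, not the OrthoMCL output format; the dictionary layout, species labels, gene identifiers, and function name are all hypothetical.

```python
def single_copy_families(families, species):
    """Return families in which each listed species has exactly one gene.

    `families` maps a family ID to a dict of species -> list of gene IDs.
    """
    selected = {}
    for family_id, members in families.items():
        if all(len(members.get(sp, [])) == 1 for sp in species):
            selected[family_id] = {sp: members[sp][0] for sp in species}
    return selected


# Toy clustering for three hypothetical species.
toy_species = ["sp_a", "sp_b", "sp_c"]
toy_families = {
    "fam1": {"sp_a": ["a1"], "sp_b": ["b1"], "sp_c": ["c1"]},        # single copy
    "fam2": {"sp_a": ["a2", "a3"], "sp_b": ["b2"], "sp_c": ["c2"]},  # duplicated in sp_a
    "fam3": {"sp_a": ["a4"], "sp_b": ["b3"]},                        # missing in sp_c
}
single_copy = single_copy_families(toy_families, toy_species)
print(sorted(single_copy))
```

Only families passing this filter would normally be carried forward to alignment and tree building, since duplicated or missing members complicate the inference of species relationships.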

Table 9 Gene family clustering.
Fig. 5

Estimates of species divergence times.

CAFE59,60 was applied for gene family expansion and contraction analysis. Compared with the nearest ancestor, a total of 603 expanded and 1767 contracted gene families were found in S. oleivora (Fig. 6). Of these, 69 gene families (984 genes) were significantly expanded and 83 gene families (118 genes) were significantly contracted (p < 0.05). We then performed GO and KEGG enrichment analyses, and terms with adjusted p-values ≤ 0.05 were chosen for further analysis. The CODEML program of PAML (v4.9)57 was used to identify positively selected genes (PSGs), which were also subjected to enrichment analysis. A total of 552 protein-coding genes were positively selected in S. oleivora (FDR < 0.05, Table 10). GO and KEGG enrichment showed that the positively selected genes were mainly associated with DNA binding, the nucleolus, protein processing in the endoplasmic reticulum, the ribosome, and the mTOR signaling pathway (Figs. 7, 8).
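The FDR threshold applied to the positively selected genes implies a multiple-testing correction of the per-gene p-values. The text does not name the procedure, so the sketch below assumes the widely used Benjamini-Hochberg method purely for illustration; the p-values are invented, and the function name is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Toy p-values for four hypothetical genes.
toy_p = [0.01, 0.04, 0.03, 0.20]
toy_fdr = benjamini_hochberg(toy_p)
significant = [i for i, q in enumerate(toy_fdr) if q < 0.05]
print(toy_fdr, significant)
```

In this toy example only the first gene remains below the 0.05 threshold after adjustment, even though three raw p-values are below 0.05, which is the behavior a correction of this kind is intended to produce.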

Fig. 6

Numbers of gene families for expansion and contraction in Sinosolenaia oleivora. The green number represents the number of gene families that have expanded during the evolutionary process of a species, whereas the red number represents the number of gene families that have contracted.

Table 10 Protein-coding genes under positive selection in Sinosolenaia oleivora (FDR < 0.05).
Fig. 7

GO enrichment analysis of positively selected genes.

Fig. 8

KEGG enrichment analysis of positively selected genes.

Data Records

All sequencing data from three sequencing platforms have been uploaded to the NCBI SRA database (transcriptomic sequencing data: SRR2835217161, genomic Illumina sequencing data: SRR2655134462, genomic PacBio sequencing data: SRR2840605563, Hi-C sequencing data: SRR2840626464). The final chromosome-level assembled genome file has been uploaded to the GenBank database under the accession JBDPLI00000000065. Genome annotation files have been uploaded to the Figshare database66.

Technical Validation

Evaluating the quality of the DNA and RNA

The quality and concentration of the extracted DNA and RNA were assessed using a NanoDrop 2000 Spectrophotometer (Thermo Fisher Scientific, San Jose, CA, USA; OD260/280 and OD260/230) and a Qubit 3.0 Fluorometer (Thermo Fisher Scientific, San Jose, CA, USA) before sequencing, and their integrity was further evaluated on a 1% agarose gel stained with ethidium bromide.

Evaluating the quality of the genome assembly

We evaluated the genome assembly quality through the following measures: (i) The assembly was confirmed to belong to the target species by BLAST (E-value: 1e-5)26 comparison against the NCBI nucleotide database (NT library) (Table S2, S3, Supplementary File). (ii) Illumina short reads and PacBio reads were mapped onto the assembled genome using BWA (v. 0.7.17-r1188)67 and Minimap268, respectively, to evaluate the completeness and accuracy of the genome. The read-mapping rates were 99.27% and 99.74%, and the genome coverage rates were 99.7% and 99.98% for the Illumina and PacBio reads, respectively (Table 11), indicating high mapping efficiency and comprehensive coverage. (iii) BUSCO (v5.2.3)32 analysis was conducted to evaluate the assembly quality based on the mollusca_odb10 database. Of the 5295 BUSCO groups searched, 88.6% were identified as complete in the assembly, comprising 85.8% complete and single-copy BUSCOs and 2.8% complete and duplicated BUSCOs (Table 12).

Table 11 The alignment of Illumina and PacBio reads to Sinosolenaia oleivora.
Table 12 BUSCO analysis results of the Sinosolenaia oleivora genome.

Evaluating the quality of the genome annotation

BUSCO (v5.2.2)32 was used to evaluate the completeness of the genome annotation. The reference BUSCO database was mollusca_odb10. Among the 5295 BUSCO groups searched, 4575 (86.4%) were detected as complete in the genome annotation (Table 12).
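The BUSCO percentages quoted in this section follow from simple ratios, which can be re-derived as a consistency check. The sketch below uses only figures stated in the text; the helper function name is hypothetical.

```python
def percent(count: float, total: float, digits: int = 1) -> float:
    """Return count / total as a percentage rounded to `digits` decimals."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(count / total * 100, digits)


# Annotation completeness: complete BUSCOs out of all groups searched.
annotation_complete = percent(4575, 5295)

# Assembly completeness: single-copy plus duplicated complete BUSCOs.
assembly_complete = round(85.8 + 2.8, 1)

print(annotation_complete, assembly_complete)
```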