Background & Summary

The smallscale yellowfin, Plagiognathops microlepis, belongs to the cyprinid subfamily Xenocyprinae. It is a small to medium-sized fish that inhabits the middle and bottom layers of the water column and is widely distributed in the freshwater ecosystems of China1. Because it feeds on humus, organic debris and algae, it is often used as a tool fish to purify water and control algal blooms, and it plays an important role in freshwater systems2,3. Owing to its delicious taste, fast growth and low disease incidence, P. microlepis has been domesticated as an aquaculture variety4,5,6. Because of its distinctive diet, P. microlepis does not impair the growth of other fish species when co-cultured with them. Instead, it can help to increase yield and purify water, and so shows broad development prospects7,8. In recent years, artificial seedling breeding and stock enhancement and release of P. microlepis have been carried out in many regions of China9. Studies have also been conducted on the culture techniques and models7, population structure10, nutritional composition8, and biochemistry and toxicology2,3 of this fish, as well as on its role in improving water quality11. However, owing to the lack of genetic data resources, research on the evolutionary adaptation strategies of P. microlepis and on the molecular mechanisms underlying its desirable traits remains scarce, which limits our understanding and effective utilization of this fish.

In phylogenetic studies, the subfamily Xenocyprinae is a branch of Cyprinidae with relatively few species. These species are widely but discontinuously distributed in East Asia (especially in China) and have a long evolutionary history, which makes it possible to evaluate the differentiation of species and populations under historical environmental changes in East Asia12. The currently most widely accepted classification is that this subfamily includes 10 species in 4 genera13. However, as the number of phylogenetic studies has grown, different views have emerged on the phylogenetic positions and evolutionary history of some fish within this subfamily. Some studies consider P. microlepis to be the only species in Plagiognathops1,14, while others suggest that Plagiognathops is not a valid genus and that this fish should be classified in the genus Xenocypris13,15. Both views have been supported by phylogenetic evidence based on different molecular markers16. Nevertheless, most of these studies relied on mitochondrial genes or a small number of nuclear genes, which led to inconsistent results. With the development of sequencing technology, phylogenetic studies have entered the era of omics. However, in the subfamily Xenocyprinae, only the genome of Pseudobrama simoni has recently been deciphered17. A high-quality genome of P. microlepis is therefore also essential for conducting phylogenetic analysis at the genomic level and for resolving the controversy over the validity of the genus Plagiognathops.

In this work, we constructed a chromosome-level reference genome of P. microlepis by integrating 33.62 Gb of HiFi reads, 44.58 Gb of short reads, and 99.57 Gb of Hi-C reads. The assembly is 1004.34 Mb in size, of which 1004.12 Mb was anchored to 24 chromosomes, with a contig N50 length of 39.98 Mb. This genome contains about 57.64% (578.91 Mb) of repeat elements and 26,929 functionally annotated protein-coding genes. In addition, we assembled two chromosome-level haplotypes, Haploid A (997.69 Mb) and Haploid B (995.24 Mb), with contig N50 lengths of 36.21 Mb and 33.97 Mb, respectively. The completeness (>97.0% complete BUSCOs), consistency (>99.8% mapping ratio) and consensus quality values (>47.5) of all three assemblies were high. Overall, this genome will provide a reference for phylogenetic, adaptive evolutionary and genetic basis studies on P. microlepis and other cyprinid fishes, and may also provide valuable information for the regulation and restoration of freshwater ecosystems. The assembled haplotype genomes can additionally serve as a baseline for studies on allele-specific expression or conservation genomics.

Methods

Ethics statement

This work was approved by the Care and Use of Laboratory Animals in Yangtze River Fisheries Research Institute, Chinese Academy of Fishery Sciences (Wuhan, China).

Sampling and genome survey

A healthy female P. microlepis (body weight: 539.79 g) was collected from the original breeding farm in Suizhou, Hubei Province, China (Fig. 1). After anesthesia with MS222 (0.05% concentration), the muscle, heart, liver, brain, gill and spleen tissues were immediately sampled and frozen in liquid nitrogen, then transferred to −80 °C until further use. High-quality genomic DNA (gDNA) was extracted from muscle with a modified cetyltrimethyl ammonium bromide (CTAB) method18, while total RNA was isolated from all tissues using the Omega Bio-tek E.Z.N.A.® Total RNA Kit I (R6834, Omega, USA). The quality and concentration of DNA (and RNA) were assessed by 0.75% (and 1.5%) agarose gel electrophoresis, a NanoDrop One spectrophotometer (Thermo Fisher Scientific) and a Qubit 3.0 Fluorometer (Life Technologies, Carlsbad, CA, USA).

Fig. 1 Morphological characters of the Plagiognathops microlepis used for genome sequencing.

For the genome survey, libraries with a 350 bp insert size were constructed using the Nextera DNA Flex Library Prep Kit (Illumina, San Diego, CA, USA) and sequenced with a 150 bp paired-end (PE-150) strategy on the Illumina Novaseq 6000 platform. After the raw data (44.58 Gb) were obtained, sequencing adaptors and low-quality reads were filtered out using fastp (v 0.21.0). The 19-mer frequency depth distribution was constructed using Jellyfish (v 2.2.10)19, and the genome size was subsequently estimated with Jellyfish and GenomeScope (v 2.0)20. Finally, based on the resulting 43.72 Gb of clean data, the estimated genome size for P. microlepis was 949.39 Mb with a heterozygous ratio of 0.52 (Fig. S1).
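
As a rough illustration of how a k-mer depth distribution leads to a genome size estimate, the sketch below divides the total number of counted k-mers by the depth of the main peak. It is a simplified stand-in for the model fitting performed by GenomeScope, not the procedure used in this work; the toy histogram, the `min_depth` error cutoff and the function name are hypothetical, and the tallest non-error peak is assumed to be the homozygous peak.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Estimate genome size from a k-mer histogram.

    histogram maps depth -> number of distinct k-mers seen at that depth.
    Depths below min_depth are assumed to be sequencing errors (hypothetical cutoff).
    """
    usable = {depth: n for depth, n in histogram.items() if depth >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    # Main peak: the depth at which the most distinct k-mers occur.
    peak_depth = max(usable, key=usable.get)
    # Total k-mer occurrences divided by per-copy depth approximates genome length.
    total_kmers = sum(depth * n for depth, n in usable.items())
    return total_kmers / peak_depth


# Toy histogram (made-up values): error k-mers at depth 1-2,
# a small heterozygous peak near 20 and the main peak near 40.
toy_histogram = {1: 900, 2: 300, 20: 50, 40: 400, 41: 380, 80: 10}
size = estimate_genome_size(toy_histogram)
print(f"estimated size: {size:.1f} k-mers")
```

Real estimators additionally model heterozygosity and repeat content, which is why a dedicated tool is preferable for reported values.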

PacBio and Hi-C based whole-genome sequencing

For PacBio sequencing, the qualified high-quality gDNA was sheared to 15–20 Kbp and a SMRTbell library was constructed according to the manufacturer’s instructions, followed by sequencing on a PacBio Sequel II system in CCS mode. Raw subreads were obtained after filtering polymerase reads, and then processed using SMRT Link (v 8.0) with the parameters “--min-passes=3 --min-rq=0.99” to generate HiFi reads. As a result, a total of 33.62 Gb of HiFi reads were obtained with a mean read length of 17.82 Kb (Table 1).
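
The two thresholds above can be read as a simple per-read filter: a consensus read is kept only if it was built from at least three passes and has a predicted accuracy of at least 0.99 (about Q20). The sketch below illustrates that logic only; the `CcsRead` record and its field names are hypothetical and do not reflect the actual SMRT Link data model.

```python
import math
from dataclasses import dataclass


@dataclass
class CcsRead:
    """Hypothetical summary of one consensus read."""
    name: str
    num_passes: int       # number of subread passes used for the consensus
    read_quality: float   # predicted accuracy between 0 and 1


def is_hifi(read, min_passes=3, min_rq=0.99):
    """Return True if the read meets both thresholds quoted in the text."""
    return read.num_passes >= min_passes and read.read_quality >= min_rq


def accuracy_to_phred(accuracy):
    """Convert a predicted accuracy into a Phred-scaled quality value."""
    return -10 * math.log10(1 - accuracy)


reads = [
    CcsRead("read_a", num_passes=8, read_quality=0.999),
    CcsRead("read_b", num_passes=2, read_quality=0.995),  # too few passes
    CcsRead("read_c", num_passes=5, read_quality=0.98),   # accuracy too low
]
hifi_names = [r.name for r in reads if is_hifi(r)]
print(hifi_names, round(accuracy_to_phred(0.99), 2))
```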

Table 1 Summary of the sequencing data used in the assembly and annotation of Plagiognathops microlepis genome.

A Hi-C library was constructed by cross-linking the muscle tissue, digesting with the DpnII restriction enzyme, biotinylating the 5′ overhangs, performing blunt-end ligation, and shearing the DNA into 300–700 bp fragments21. Hi-C sequencing was performed on the Illumina Novaseq 6000 platform with the PE-150 strategy, and 99.57 Gb of raw data were obtained. After filtering with fastp (v 0.21.0), 98.31 Gb of clean Hi-C reads were retained (Table 1). All of the above sequencing was performed at Wuhan Benagen Technology Co., Ltd (Wuhan, China).

De novo assembly and Hi-C assembly

To generate the monoploid assembly and two haplotype-resolved assemblies, the Hi-C integration strategy was used for de novo assembly. HiFi long reads and Hi-C short reads were submitted simultaneously to hifiasm (v 0.16.1)22 to improve the accuracy of assembly and haplotype construction. Haplotypic duplications in the assembly were removed using purge_dups (v1.2.5, parameter: -f 0.9)23. This yielded a preliminary monoploid assembly of 1004.34 Mb and two haploid assemblies of 997.69 Mb (Haploid A) and 995.24 Mb (Haploid B), with contig N50 lengths of 38.80 Mb, 36.21 Mb and 33.97 Mb, respectively (Table 2). The assembled genome is slightly larger than both the survey estimate and the previously assembled genome of P. simoni (940.9 Mb), another member of Xenocyprinae.
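
For readers unfamiliar with the contig N50 statistic quoted above, the following minimal sketch shows how it is defined: contigs are sorted from longest to shortest, and N50 is the length of the contig at which the running total first reaches half of the assembly size. The contig lengths are toy values, not those of this assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that pushes the cumulative sum to half
        # of the total assembly length defines N50.
        if running >= half_total:
            return length


toy_contigs = [50, 40, 30, 20, 10]  # arbitrary units
print(n50(toy_contigs))
```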

Table 2 Statistics and evaluation of the genome assemblies of monoploid and two haploids.

For the chromosome-level assembly, clean Hi-C reads were mapped to the preliminary assembled genome and filtered using Juicer (v1.6)24 with default parameters. Only valid interaction pairs were retained for further analysis. As a diploid, P. microlepis is reported to have a chromosome number of 2n = 48 based on karyotype analysis9,25,26. Therefore, we used 3D-DNA (v 180419)27 and Juicer to scaffold the genome onto 24 chromosomes, and then used Juicebox (v 1.11.08)28 to manually adjust and orient the chromosomes and to draw the Hi-C interaction heatmap of contigs. Ultimately, 99.98% and 99.59% of the preliminary assembled contigs were anchored to the chromosomes of the monoploid (24 chromosomes) and the two haplotypes (48 chromosomes), respectively, yielding a haplotype-resolved, chromosome-level genome of P. microlepis (Fig. 2, S2 and Table 3). In the assembled monoploid and two haplotypes, 19 and 24 chromosomes, respectively, were gap-free (each consisting of a single contig).

Fig. 2 Circos plot (a) and Hi-C interaction heatmap (b) showing the features and interactions among chromosomes of the assembled P. microlepis genome. Tracks from outer to inner layers represent the 24 chromosomes (1), gene density (2), repeat elements density (3), GC content (4) and links of intragenomic syntenic blocks within 100 Kbp sliding windows.

Table 3 Statistics of the 24 anchored chromosomes of monoploid and two haploids.

The mitochondrial genome of P. microlepis was also assembled from the short reads using MitoZ (v 3.6) and GetOrganelle (v 1.7.1a). The resulting circular mitochondrial genome is 11,619 bp in size, with 13 protein-coding genes, 22 tRNAs and 2 rRNAs (Fig. S3).

Genome annotation

For repeat annotation of the P. microlepis genome, de novo prediction was first conducted using RepeatModeler (v 1.0.11, parameters: BuildDatabase -name mydb, RepeatModeler -database mydb -pa 10)29 to detect repetitive elements. LTR sequences were predicted and deduplicated with LTR_FINDER_parallel (parameters: -threads 16 -harvest_out -size 1000000 -time 300) and LTR_retriever (v 2.9.0). These de novo predicted sequences were merged with the RepBase library (v 20181026), and RepeatMasker (v 4.0.9, parameters: -nolow -no_is -norna -parallel 2) and RepeatProteinMask (v 4.0.9) were subsequently employed to identify repeat elements and TE_protein class repeat sequences30. Finally, all predicted results were merged and deduplicated, and 578.91 Mb of repeat sequences were identified, accounting for 57.64% of the assembled genome. The most abundant elements among these repetitive sequences were DNA transposons, which made up 31.55% (316.89 Mb) of the assembled genome (Table 4 and Fig. 3a).

Table 4 Summary of the transposable elements in P. microlepis genome.

Fig. 3 The repeat elements distribution and identified protein-coding genes in P. microlepis genome. (a) Distribution of divergence rate for transposable elements in the genome. (b) Venn diagram showing the number of shared and unique genes annotated with different databases.

Three prediction methods were used for the structural annotation of protein-coding genes: transcript mapping, ab initio prediction and homologous gene alignment. For the RNA-seq based method, 2 μg of qualified RNA from each tissue was pooled in equal amounts and an RNA-seq library was prepared with the NEBNext® Ultra™ RNA Library Prep Kit (#E7530L, NEB, USA). The constructed library was sequenced on the Illumina Novaseq 6000 platform to obtain 150 bp paired-end reads. After sequencing, the obtained data (11.48 Gb) were filtered with fastp (v0.21.0), aligned against the genome with HISAT2 (v2.1.0)31 and assembled with StringTie (v2.1.4)32. The ORFs of the assembled transcripts were predicted using TransDecoder (v 5.1.0). The ab initio prediction was conducted using Augustus (v3.3.2, parameters: --uniqueGeneId=true --noInFrameStop=true --gff3=on --strand=both)33, Genscan (v1.0)34 and GlimmerHMM (v3.0.4, parameter: -f -g)35 after repeat elements were masked in the genome. For the homology-based prediction, the protein sequences of Danio rerio (GCF_000002035.6), Ctenopharyngodon idellus (GCF_019924925.1), Megalobrama amblycephala (GCF_018812025.1), P. simoni17 and Hypophthalmichthys molitrix (unpublished data, provided by our lab) were downloaded from the NCBI database or GigaDB (http://GigaDB.org) and mapped to the genome using tblastn (v 2.7.1, parameter: -t 16 -q 7). Transcripts and coding regions in the P. microlepis genome were then predicted using Exonerate (v 2.4.0, parameter: -model protein2genome -showtargetgff 1)36. Finally, the gene sets predicted by the three methods were integrated using MAKER (v 2.31.10, parameter: maker_exe.ctl maker_opts.ctl maker_bopts.ctl -ignore_nfs_tmp -fix_nucleotides)37, and incomplete genes and genes with a very short CDS (<150 bp) were removed, resulting in a non-redundant reference gene set of 28,337 protein-coding genes. The average exon number per gene, exon length and CDS length were 9.35, 300.06 bp and 1,644.51 bp, respectively (Table 5).
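
The final filtering step can be pictured with a short sketch. It is an assumption-laden illustration rather than the filter actually applied: here a CDS counts as "complete" if its length is a multiple of three, it starts with ATG and it ends with a stop codon, and the gene identifiers and sequences are invented. Only the 150 bp cutoff comes from the text.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_complete_cds(cds):
    """Assumed completeness rule: in-frame length, start codon, stop codon."""
    cds = cds.upper()
    return len(cds) % 3 == 0 and cds.startswith("ATG") and cds[-3:] in STOP_CODONS


def filter_gene_models(cds_by_gene, min_cds_length=150):
    """Keep genes whose CDS is complete and at least min_cds_length bp long."""
    return {
        gene: cds
        for gene, cds in cds_by_gene.items()
        if len(cds) >= min_cds_length and is_complete_cds(cds)
    }


# Invented example sequences.
candidates = {
    "gene_ok": "ATG" + "GCT" * 50 + "TAA",       # 156 bp, complete
    "gene_short": "ATG" + "GCT" * 10 + "TAA",    # 36 bp, below the cutoff
    "gene_no_stop": "ATG" + "GCT" * 60,          # 183 bp, lacks a stop codon
}
kept = filter_gene_models(candidates)
print(sorted(kept))
```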

Table 5 The annotated genes and features based on different methods.

Functional annotation of protein-coding genes was carried out by aligning the predicted sequences against entries in the UniProt and NR databases using Diamond (v 2.0.11.149, parameter: --evalue 1e-5)38. Gene motifs and domains were searched using InterProScan (v 5.52–86.0, parameter: -goterms -pa -dp -verbose -cpu 20)39 and hmmscan (v 3.3.2, parameter: -E 0.01). The GO terms for genes were obtained from the corresponding InterPro or UniProt entries. Pathway annotation was performed using Diamond and KOBAS (v3.0) against the KEGG database. In total, 26,929 protein-coding genes were functionally annotated, representing 95.03% of all predicted genes (Fig. 3b, Table 6).
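
The percentages reported in this section follow directly from the counts and sizes given in the text, as the short check below shows. The helper name `percent` is ours; all input values are taken from the paragraphs above.

```python
def percent(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


# Functionally annotated genes out of all predicted protein-coding genes.
annotated = percent(26929, 28337)
# Repeat sequences (Mb) out of the assembled genome size (Mb).
repeats = percent(578.91, 1004.34)
# DNA transposons (Mb) out of the assembled genome size (Mb).
dna_transposons = percent(316.89, 1004.34)

print(annotated, repeats, dna_transposons)
```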

Table 6 Summary for genome function annotation based on different databases.

For non-coding RNA annotation, tRNAs in the genome were searched using tRNAscan-SE (v 2.0.12, parameters: -E -j tRNA.gff -o tRNA.result -f tRNA.struct -thread 16)40 based on the structural characteristics of tRNA, rRNAs were predicted using RNAmmer (v 1.2, parameters: -S euk -m tsu,lsu,ssu)41, and other ncRNA sequences were searched using INFERNAL (v 1.1.4, parameter: -cut_ga -rfam -nohmmonly -fmt 2)42 against the Rfam database. Ultimately, 2,442 miRNAs, 9,677 tRNAs, 4,455 rRNAs and 1,058 snRNAs were identified (Table 7). Based on the annotation results, syntenic blocks among the 24 chromosomes were identified using MCScanX (https://github.com/wyp1125/MCScanx, parameter: -a -e 1e-5 -s 5), and a circular diagram showing the distribution of gene density, repeat density, GC content and synteny in the genome was generated using the circlize package43 (Fig. 2a).

Table 7 Statistics of the annotated non-coding RNAs in the genome.

Chromosomal synteny analysis

To compare the structural characteristics of the genomes and to verify the accuracy of our assemblies, genomic synteny analysis was performed between P. microlepis and two related species, H. molitrix and M. amblycephala. Similar gene pairs and syntenic blocks between each pair of genomes were determined and visualized using LAST (v1170)44 and JCVI (v0.9.13)45. The results showed a high degree of collinearity across the three genomes, with the structures of most P. microlepis chromosomes unchanged relative to those of H. molitrix and M. amblycephala (Fig. 4). This high consistency also supports the quality of our assembled and annotated genome.

Fig. 4 Genomic synteny among P. microlepis, Megalobrama amblycephala and Hypophthalmichthys molitrix.

Data Records

The raw sequencing data reported in this paper have been deposited in the NCBI Sequence Read Archive (SRA) database under the accession numbers SRR27884027 (ref. 46), SRR27884028 (ref. 47), SRR27884029 (ref. 48) and SRR27884030 (ref. 49). The assembled nuclear and mitochondrial genomes are available in GenBank under the accessions GCA_040144785.1 (ref. 50) and PP836169.1 (ref. 51). The genome annotation results have been deposited in the figshare database52.

Technical Validation

Quality evaluation of the genome assembly and annotation

Completeness of the assembled genome was evaluated using BUSCO (v 5.3.0, parameter: -m prot -c 40 -long -f)53 with the actinopterygii_odb10 database, and 97.6%, 97.6% and 97.3% complete BUSCOs were found in the assembled monoploid, Haploid A and Haploid B genomes, respectively (Table 8), indicating high completeness of the assemblies. Consistency was estimated by mapping the Illumina short reads to the assembled genomes using BWA (v 0.7.17). High mapping rates (99.87–99.88%) and high coverage (>99.94%) were found for all three assemblies (Table 8). Using Merqury54, the consensus quality values (QV), which represent per-base consensus accuracy, were estimated to be 48.11, 48.01 and 47.93 for the assembled monoploid, Haploid A and Haploid B genomes, respectively. For the quality evaluation of chromosomes, strong interaction signals were found along the diagonals of the Hi-C heatmaps, with no obvious noise elsewhere (Fig. 2b and S2), supporting the accuracy of the chromosome assembly. Finally, to verify the accuracy of haplotype separation, the assembled Haploid A was aligned to Haploid B and the syntenic blocks between them were displayed. The result showed high similarity and high synteny between the two haploids, indicating accurate separation (Fig. S4).
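
To give the QV figures a more intuitive reading, the sketch below converts them into expected per-base error rates using the standard Phred relation QV = −10·log10(E). Only the three QV values come from the text; the function names are ours.

```python
import math


def qv_to_error_rate(qv):
    """Phred-scaled quality value -> expected per-base error probability."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate):
    """Expected per-base error probability -> Phred-scaled quality value."""
    return -10 * math.log10(error_rate)


reported_qv = {"monoploid": 48.11, "Haploid A": 48.01, "Haploid B": 47.93}
error_rates = {name: qv_to_error_rate(qv) for name, qv in reported_qv.items()}

for name, rate in error_rates.items():
    # 1 / rate is the expected number of bases per consensus error.
    print(f"{name}: {rate:.2e} errors per base (about 1 per {1 / rate:,.0f} bases)")
```

A QV near 48 therefore corresponds to roughly 1.5 expected errors per 100 kb of consensus sequence.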

Table 8 Completeness and accuracy evaluation of the genome and annotation.

BUSCO analysis was also performed to validate the quality of the genome annotation, and the result showed that 95% of the identified BUSCOs (93.5% single-copy and 1.5% duplicated) were complete (Table 8). In addition, the length distributions of genes, CDSs, introns and exons in the genomes of P. microlepis, D. rerio, P. simoni, H. molitrix, M. amblycephala and C. idellus were compared and found to be similar (Fig. S5), supporting the reliability of our genome annotation.