Background & Summary

The genus Acrossocheilus belongs to the subfamily Barbinae of the family Cyprinidae and comprises approximately 26 species, which are mainly native to Laos, Vietnam, and China1. These species exhibit diverse morphological characteristics and ecological habits, making them a good model for investigating the species origin and geographical distribution of freshwater fish2. In addition, their flesh is tender and delicious and is rich in polyunsaturated fatty acids (PUFA), giving them considerable market value. Recently, the freshwater grouper A. fasciatus has become a commercially emerging aquaculture fish due to its nutritive and ornamental value3. Moreover, as an omnivorous fish, A. fasciatus feeds on moss and algae, which can curb the overgrowth of these aquatic plants and thus help maintain ecological balance. Previous studies of A. fasciatus have primarily focused on its embryonic and larval development, gonad histological characteristics, phylogenetic relationships, population structure, and artificial breeding4,5,6. A. fasciatus also shows a significant difference in growth rate and body size between the sexes, with females growing faster than males (Fig. 1a), indicating that all-female breeding is of high commercial value in aquaculture7. However, our knowledge of the genetic and evolutionary mechanisms of A. fasciatus has been limited by the lack of genetic resources and genomic information. In this study, we employed an integrated strategy of PacBio, Illumina and Hi-C sequencing technologies to assemble a high-quality genome of 879.52 Mb with a scaffold N50 of 32.7 Mb (Fig. 1b,c). We believe that this high-quality chromosome-level genome will provide a valuable resource for breeding programs and evolutionary investigation.

Fig. 1

Workflow of the genome assembly and survey analysis in A. fasciatus. (a) A photograph of female and male A. fasciatus. ♂ indicates a male individual and ♀ a female individual. (b) The workflow used for genome sequencing. (c) Flow chart of the genome annotation. (d) The 17-mer distribution used for genome size estimation.

Methods

Sample collection and nucleic acid extraction

Mature, healthy A. fasciatus were obtained from the Zhejiang Institute of Freshwater Fisheries in Huzhou, Zhejiang Province, China. Muscle tissue from adult female A. fasciatus was used for DNA extraction by the SDS lysis method, while ovary, kidney, brain, testis, skin, and gill tissues were collected for total RNA extraction using a TRIzol kit following the manufacturer’s protocol. The high-quality gDNA was used for genome sequencing, and the total RNA isolated from all tissues was used for transcriptome sequencing.

Library construction and genome sequencing

For the Illumina platform, a paired-end library with an insert size of 350 bp was generated using the NEBNext® Ultra™ DNA Library Prep Kit (NEB, USA) following the manufacturer’s recommendations. In total, 41 Gb of Illumina short reads (47.56X coverage, Table 1) with paired-end 150 bp were generated. In parallel, HiFi SMRTbell libraries with an insert size of 20 kb were prepared using the SMRTbell Express Template Prep Kit 2.0 for long-read sequencing on the PacBio platform. Briefly, gDNA was sheared to 6–20 kb fragments using a g-TUBE, and the ssDNA overhangs were removed with Exo VII. DNA damage was then repaired for blunt-end ligation, and large-insert SMRTbell libraries were constructed after size selection and sequenced using the DNA Sequencing Reagent Kit. For the PacBio platform, approximately 32 Gb of PacBio reads (37.12X coverage, Table 1) were obtained, with a longest read of 47.52 kb and an N50 length of 14.56 kb.
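
The coverage values above are the ratio of sequenced bases to genome size. The following is a minimal sketch of that arithmetic, assuming the denominator is the 862.9 Mb k-mer-based estimate (the text does not say which genome size was used) and using the rounded yields of 41 Gb and 32 Gb; the function name is illustrative only.

```python
def coverage_depth(total_bases: float, genome_size: float) -> float:
    """Return mean sequencing depth (X): sequenced bases per genome base."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


# Assumption: the denominator is the 862.9 Mb k-mer-based estimate.
GENOME_SIZE = 862.9e6

illumina_x = coverage_depth(41e9, GENOME_SIZE)  # rounded Illumina yield
pacbio_x = coverage_depth(32e9, GENOME_SIZE)    # rounded PacBio yield

print(f"Illumina: {illumina_x:.2f}X, PacBio: {pacbio_x:.2f}X")
```

Because the yields are rounded, the results land within about 0.1X of the reported 47.56X and 37.12X rather than matching them exactly.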

Table 1 Statistics of the sequencing data for the A. fasciatus genome assembly.

Genome size estimation and assembly

Clean data generated from Illumina sequencing were subjected to k-mer analysis to estimate the genome size, heterozygosity, and proportion of repetitive sequences in A. fasciatus. Based on the 17-mer frequency distribution computed with Jellyfish v2.3.08 and GenomeScope v2.09, the genome size was estimated to be 862.9 Mb, with a heterozygosity of 0.56% and a repeat content of 47.09% (Fig. 1d). The 32.66 Gb of raw subreads from the PacBio Sequel platform were filtered, and the remaining clean subreads were error-corrected with Canu (v1.5)10 and pre-assembled into contigs using FALCON11. The assembled sequences were polished with Pilon (v1.22)12 using default parameters. The final assembled genome was 879.52 Mb in size, with 134 contigs and a contig N50 of 32.70 Mb (Table 2).
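
The idea behind k-mer-based genome size estimation can be illustrated with a simple peak-based calculation: total k-mers divided by the depth of the main peak. This is a simplified sketch and not the mixture-model fit that GenomeScope performs. The two-column "depth count" input mirrors the layout written by `jellyfish histo`, while the error cutoff `min_depth` and the toy histogram are assumptions made for illustration.

```python
def parse_histo(text: str) -> dict[int, int]:
    """Parse two-column 'depth count' lines into a {depth: count} dict."""
    histogram = {}
    for line in text.splitlines():
        if line.strip():
            depth, count = line.split()
            histogram[int(depth)] = int(count)
    return histogram


def estimate_genome_size(histogram: dict[int, int], min_depth: int = 5) -> float:
    """Estimate genome size as total k-mers divided by the main-peak depth.

    Depths below min_depth are treated as sequencing errors and ignored
    (the cutoff is an assumption, not a value from the text).
    """
    usable = {d: n for d, n in histogram.items() if d >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    peak_depth = max(usable, key=usable.get)            # most common depth
    total_kmers = sum(d * n for d, n in usable.items())  # k-mer occurrences
    return total_kmers / peak_depth


# Toy histogram, invented for illustration: error k-mers at depth 1-2,
# a main peak at depth 20, and a small two-copy repeat peak at depth 40.
toy_text = "1 5000\n2 800\n18 200\n19 600\n20 1000\n21 600\n22 200\n40 100\n"
toy_size = estimate_genome_size(parse_histo(toy_text))
print(toy_size)
```

In the toy data, 2,600 single-copy k-mers plus 100 k-mers present twice correspond to 2,800 genomic positions, which is what the estimate recovers.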

Table 2 Summary of the assembled genome for A. fasciatus genome.

Hi-C library preparation and sequencing

The Hi-C libraries were constructed following the standard protocol described previously, with certain modifications. First, female muscle samples were cross-linked with 4% formaldehyde, and the fixed tissues were homogenised and centrifuged to collect the nuclei, which were then digested with the MboI enzyme overnight at 37 °C. The proximal chromatin DNA was re-ligated using T4 ligase, and biotin-labeled Hi-C samples were specifically enriched using magnetic beads. After A-tails were added to the fragment ends, the Hi-C sequencing libraries were amplified by PCR and sequenced on the Illumina HiSeq-2500 platform (PE 150 bp). For chromosome-level assembly, the raw Hi-C sequencing data were first filtered using HiC-Pro v2.8.013, and the high-quality clean reads were aligned to the polished A. fasciatus genome using BWA (v0.7.10)14 with default parameters; the alignments were then sorted with samtools (samtools sort sample.sam --output-fmt BAM -o sample.sort.bam). Finally, 96.95% of the initially assembled sequences were anchored to 25 pseudo-chromosomes ranging in size from 24.09 to 54.14 Mb (Fig. 2a, Table S1), and the total length of the genome assembly was 879.52 Mb, with a contig N50 of 22.57 Mb and a scaffold N50 of 33.13 Mb (Table 2).
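
Contig and scaffold N50 values such as those quoted above follow the usual definition: the length L at which sequences of length L or longer cover at least half of the assembly. A minimal sketch of that calculation, with an invented list of lengths:

```python
def n50(lengths: list[int]) -> int:
    """Return the N50 of a list of sequence lengths.

    Sequences are taken from longest to shortest until their running
    total reaches half of the summed length; the length of the sequence
    that crosses that threshold is the N50.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    half_total = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length
    raise AssertionError("unreachable for non-negative lengths")


# Toy lengths, invented for illustration (total 150, half 75).
toy_n50 = n50([10, 50, 30, 40, 20])
print(toy_n50)
```

The same function applies to contigs and scaffolds alike; only the input list of lengths differs.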

Fig. 2

Chromosome-level assembly of the A. fasciatus genome and functional annotation. (a) Heat map of the Hi-C assembly of A. fasciatus. The color bar indicates the logarithm of the contact density. (b) Venn diagram of the numbers of genes annotated with different databases. (c) Comparison of different gene elements in the A. fasciatus genome with three other fish species.

Repetitive sequence annotation

Repeat elements in the A. fasciatus genome were annotated using a combination of homology alignment and de novo searches. The homology-based search for known repeat elements was performed against the RepBase database (http://www.girinst.org/repbase/)15 using RepeatMasker and RepeatProteinMask. For de novo annotation, we first employed LTR_FINDER16, RepeatModeler17 and RepeatScout18 to build a de novo repeat library, which was then used to predict repeat elements with RepeatMasker under default parameters. Additionally, tandem repeats were identified using Tandem Repeats Finder (TRF, http://tandem.bu.edu/trf/trf.html)19. In total, we identified 390.91 Mb of repetitive sequences, accounting for 44.45% of the assembled genome (Table 3).

Table 3 Classification of the predicted repeat sequences in the genome of A. fasciatus.

Gene prediction and functional annotation

Protein-coding genes were annotated by integrating three strategies: homology-based, de novo, and transcriptome-based prediction. For homology-based gene prediction, the published protein sequences of Sinocyclocheilus grahami, Puntius tetrazona and Carassius auratus were aligned to the A. fasciatus genome assembly using BLAST20 and Genewise21 with default parameters. Five de novo programs, Augustus22, GlimmerHMM23, SNAP24, GeneID25 and GENSCAN26, were used to predict coding regions in the repeat-masked assembly with default parameters. For the transcriptome-based annotation, the RNA-seq data were de novo assembled with Trinity (v2.1.1)27, and splicing variations were identified with PASApipeline (v2.4.1)28. Finally, a non-redundant reference gene set was established by merging the results of the three methods, yielding a total of 24,900 protein-coding genes (Fig. 2b, Table 4). We also compared gene structure parameters between A. fasciatus and three related species (S. grahami, C. auratus, P. tetrazona); the results showed similar distributions of coding DNA sequence (CDS) length, exon length and number, intron length and mRNA length among the sequenced fish genomes (Fig. 2c).

Table 4 Statistical analysis of predicted protein-coding genes in A. fasciatus genome.

Furthermore, all predicted genes were functionally annotated using the public databases SwissProt29, Nr (http://www.ncbi.nlm.nih.gov/protein), KEGG30, InterPro31 and Pfam (http://pfam.xfam.org/). Overall, a total of 24,000 genes (96.40%) were successfully annotated, with an average transcript length of 15,927.24 bp and an average CDS length of 1,627.71 bp (Table 5). Non-coding RNAs (ncRNAs) were also annotated: tRNAscan-SE (v2.0)32 was used to predict tRNAs, and Infernal (v1.1)33 was used to identify rRNAs, snRNAs, and miRNAs. In total, 43,620 non-coding RNAs were predicted, including 17,604 tRNAs, 9,157 rRNAs, 2,606 miRNAs and 2,548 snRNAs (Table 6).

Table 5 Summary of functional annotation in A. fasciatus genome.
Table 6 Statistics of annotated non-coding RNAs in the A. fasciatus genome assembly.

Gene family construction

First, the protein sequences of 13 other fish species, namely P. tetrazona, S. grahami, C. auratus, Opsariichthys bidens, Cyprinus carpio, Danio rerio, Ictalurus punctatus, Megalobrama amblycephala, Ctenopharyngodon idellus, Micropterus salmoides, Oreochromis niloticus, Cynoglossus semilaevis, and Larimichthys crocea, were downloaded from public databases. Sequences shorter than 50 amino acids were filtered out as low quality, and only the longest predicted transcript per locus was retained. Next, similarities between the protein sequences of all species were identified using an all-to-all BLAST search with an e-value cutoff of 1e-5. Finally, orthologous gene clusters were constructed using OrthoMCL34. In total, we identified 27,983 gene families across A. fasciatus and the 13 additional species, and 10,524 gene families and 604 single-copy gene families were present in all species (Fig. 3a). Moreover, gene families from A. fasciatus, O. bidens, S. grahami, D. rerio, C. carpio and C. auratus were further clustered; 13,850 gene families were shared by these species, and 262 gene families were specific to A. fasciatus (Fig. 3b). In addition, functional annotation of the gene families unique to A. fasciatus revealed enrichment of the phosphatidylinositol signaling system, GABAergic synapse, vitamin digestion and absorption, lysine degradation, and synaptic vesicle cycle pathways.
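
The pre-filtering step described above (discard proteins shorter than 50 amino acids, then keep the longest transcript per locus) can be sketched as follows. The `(locus_id, transcript_id, protein_sequence)` record layout and all identifiers are hypothetical; the text does not describe the actual scripts or file formats used.

```python
def longest_isoform_per_locus(records, min_length=50):
    """Keep, for each locus, the longest protein of at least min_length residues.

    records: iterable of (locus_id, transcript_id, protein_sequence) tuples
             (a hypothetical layout chosen for this sketch).
    Returns a dict mapping locus_id -> (transcript_id, protein_sequence).
    """
    best = {}
    for locus_id, transcript_id, sequence in records:
        if len(sequence) < min_length:
            continue  # shorter than 50 aa: treated as low quality
        current = best.get(locus_id)
        if current is None or len(sequence) > len(current[1]):
            best[locus_id] = (transcript_id, sequence)
    return best


# Toy records, invented for illustration.
toy_records = [
    ("gene1", "gene1.t1", "M" * 60),
    ("gene1", "gene1.t2", "M" * 80),  # longest isoform of gene1
    ("gene2", "gene2.t1", "M" * 30),  # too short, locus dropped
    ("gene3", "gene3.t1", "M" * 50),  # exactly 50 aa is retained
]
kept = longest_isoform_per_locus(toy_records)
print({locus: transcript for locus, (transcript, _) in kept.items()})
```

When two isoforms of a locus have the same length, this sketch keeps the one encountered first; the text does not specify a tie-breaking rule.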

Fig. 3

Comparative genomic analysis reveals phylogenetic positioning and genome evolution of A. fasciatus. (a) Statistics of orthologous gene families in 14 representative fish species. (b) Venn diagram of shared and unique orthologous gene families in A. fasciatus and four other teleosts. (c) Phylogenetic analysis and divergence time tree of A. fasciatus and other representative species. (d) Statistical analysis of contraction and expansion of gene families. (e) Comparative synteny analysis between A. fasciatus and zebrafish.

Phylogenetic and evolutionary analysis

All single-copy gene families were aligned with MUSCLE35 to generate a super alignment matrix, and a phylogenetic tree was constructed using RAxML36. Subsequently, the MCMCTree package in PAML37 was used to estimate divergence times. As expected, the evolutionary analysis showed that A. fasciatus and P. tetrazona clustered into one clade, and their divergence time was estimated to be 156.3 million years ago (Fig. 3c). Furthermore, gene family expansions and contractions were analyzed using CAFE (v3.1)38 with default parameters, based on the divergence times and phylogenetic relationships. A total of 38 gene families were significantly expanded and 135 were significantly contracted in A. fasciatus (Fig. 3d). Finally, chromosome synteny between A. fasciatus and D. rerio was analyzed using MCScanX39, and the diagram was generated with Circos. The synteny analysis showed that the chromosomes of A. fasciatus are highly homologous to those of D. rerio (Fig. 3e).

Data Records

All sequencing data have been uploaded to the NCBI database under project PRJNA1012810. The genomic Illumina sequencing data were deposited in the Sequence Read Archive (SRA) at NCBI under SRR2594994040, SRR2594994141. The genomic PacBio sequencing data were deposited in the SRA at NCBI under SRR2593343742. The transcriptomic sequencing data were deposited in the SRA at NCBI under SRR2594984043, SRR2594984144, SRR2594984245, SRR2594984346, SRR2594984447, SRR2594984548. The Hi-C sequencing data were deposited in the SRA at NCBI under SRR2594711549, SRR2594711650, SRR2594711751. The final chromosome assembly was deposited in GenBank at NCBI with accession number JAVLVS00000000052. The genome annotation file is available in figshare53. The data for the gene family construction are available in the figshare database54.

Technical Validation

DNA quantification and qualification

DNA degradation and contamination were monitored on 1.5% agarose gels. DNA purity was checked using the NanoPhotometer® spectrophotometer (IMPLEN, CA, USA). DNA concentration was measured using the Qubit® DNA Assay Kit in a Qubit® 2.0 Fluorometer (Life Technologies, CA, USA).

Quality control of raw sequencing data

To ensure that reads were reliable and free of artificial bias in the following analyses (low-quality paired reads, which mainly resulted from base-calling duplicates and adapter contamination), raw data were first processed through a series of quality control (QC) procedures using in-house C scripts. The QC criteria were as follows: (1) reads with ≥ 10% unidentified nucleotides (N) were removed; (2) reads with > 50% of bases having a Phred quality < 5 were removed.
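
The two QC criteria above can be expressed as a short read filter. This is an illustrative sketch in Python rather than the in-house C scripts, which are not described in detail; the Phred+33 quality encoding and the rule that a pair is dropped when either mate fails are assumptions.

```python
def phred33(quality_string: str) -> list[int]:
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(char) - 33 for char in quality_string]


def passes_qc(sequence: str, qualities: list[int],
              max_n_fraction: float = 0.10,
              low_quality: int = 5,
              max_low_fraction: float = 0.50) -> bool:
    """Apply the two stated criteria to a single read.

    A read is removed if (1) >= 10% of its bases are N, or
    (2) > 50% of its bases have Phred quality < 5.
    """
    if not sequence or len(sequence) != len(qualities):
        return False  # empty or malformed reads are rejected
    n_fraction = sequence.upper().count("N") / len(sequence)
    low_fraction = sum(q < low_quality for q in qualities) / len(qualities)
    return n_fraction < max_n_fraction and low_fraction <= max_low_fraction


def pair_passes_qc(read1, read2) -> bool:
    """Assumption: a pair is kept only if both mates pass."""
    return passes_qc(*read1) and passes_qc(*read2)


good = ("ACGTACGTAC", phred33("??????????"))  # all bases at Q30
ten_percent_n = ("NCGTACGTAC", phred33("??????????"))
print(pair_passes_qc(good, good), pair_passes_qc(good, ten_percent_n))
```

Note the boundary behaviour implied by the criteria: a read with exactly 10% N is removed, whereas a read with exactly 50% low-quality bases is kept.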

RNA quality evaluation

Before transcriptome sequencing, the quality of total RNA from the six tissues was validated. The concentration was measured with a Qubit Fluorometer, and the integrity was assessed using an Agilent 2100 Bioanalyzer. RNA samples with a total RNA amount ≥ 10 μg, RNA integrity ≥ 8, and rRNA ratio ≥ 1.5 were used for library construction.

Evaluation of the assembled genome

The completeness and accuracy of the A. fasciatus genome assembly were evaluated by multiple methods. First, Benchmarking Universal Single-Copy Orthologs (BUSCO, v5.4.4)55 and the Core Eukaryotic Genes Mapping Approach (CEGMA, v2.5)56 were used to assess the completeness of the assembled genome. The BUSCO results showed that, of the 3640 single-copy orthologs in actinopterygii_odb10, 98.3% were complete, 0.7% were fragmented, and 1.0% were missing. Moreover, the CEGMA evaluation showed that 96.77% (240/248) of the core eukaryotic genes (CEGs) were recovered. In addition, Merqury (v1.3)57 was run to evaluate the accuracy of the genome assembly, and a high quality value (QV) of 44.81 indicated that the assembly was of good quality. Taken together, these results suggest that the assembled A. fasciatus genome is of high quality at the chromosome level.
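
To put the QV in perspective, a Phred-scaled quality value converts to a per-base error probability through QV = -10 log10(E). The sketch below applies that conversion to the reported QV and re-derives the CEGMA percentage from the stated counts; the function name is illustrative only.

```python
def qv_to_error_rate(qv: float) -> float:
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


error_rate = qv_to_error_rate(44.81)      # QV reported for the assembly
bases_per_error = 1 / error_rate
cegma_percent = round(240 / 248 * 100, 2)  # CEGs recovered / CEGs assessed

print(f"error rate ~{error_rate:.2e} per base")
print(f"CEGMA completeness: {cegma_percent}%")
```

A QV of 44.81 therefore corresponds to roughly one expected error per 30 kb of assembled sequence.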