Gap-free genome assembly of anadromous Coilia nasus

Ma, Fengjiao; Wang, Yinping; Su, Bixiu; Zhao, Chenxi; Yin, Denghua; Chen, Chunhai; Yang, Yanping; Wang, Chenhe; Luo, Bei; Wang, Hongqi; Deng, Yanmin; Xu, Pao; Yin, Guojun; Jian, Jianbo; Liu, Kai

doi:10.1038/s41597-023-02278-w

Data Descriptor
Open access
Published: 06 June 2023

Volume 10, article number 360, (2023)

Scientific Data

Fengjiao Ma¹^na1,
Yinping Wang^1,2^na1,
Bixiu Su³^na1,
Chenxi Zhao³^na1,
Denghua Yin ORCID: orcid.org/0000-0002-1193-5333²,
Chunhai Chen³,
Yanping Yang²,
Chenhe Wang³,
Bei Luo³,
Hongqi Wang³,
Yanmin Deng⁴,
Pao Xu ORCID: orcid.org/0000-0001-7007-8530^1,2,4,
Guojun Yin^1,2,
Jianbo Jian ORCID: orcid.org/0000-0003-2187-5490³ &
…
Kai Liu ORCID: orcid.org/0000-0002-5730-3040^1,2,4

Abstract

The Chinese tapertail anchovy, Coilia nasus, is a socioeconomically important anadromous fish that migrates from coastal ocean waters to freshwater to spawn every spring. Analysis of the genomic architecture of C. nasus has been hindered by gaps in the previously released reference genomes. Here, we report the assembly of a chromosome-level gap-free genome of C. nasus by incorporating high-coverage and accurate long-read sequence data with multiple assembly strategies. All 24 chromosomes were assembled without gaps, representing the highest completeness and assembly quality. We assembled the genome with a size of 851.67 Mb and used BUSCO to estimate the completeness of the assembly as 92.5%. Using a combination of de novo prediction, protein homology and RNA-seq annotation, 21,900 genes were functionally annotated, representing 99.68% of the total predicted protein-coding genes. The availability of a gap-free reference genome for C. nasus will provide the opportunity for understanding genome structure and function, and will also lay a solid foundation for further management and conservation of this important species.

Measurement(s)	Coilia nasus • Gap-free genome assembly • sequence annotation
Technology Type(s)	DNBSEQ • Pacbio HiFi Sequencing • Nanopore Sequencing • Hi-C
Sample Characteristic - Organism	Coilia nasus
Sample Characteristic - Environment	freshwater
Sample Characteristic - Location	Taizhou City, Jiangsu Province, China

Background & Summary

Coilia nasus (C. nasus, FishBase ID: 45335; Fig. 1A), also called the Chinese tapertail anchovy, is a small to medium-sized migratory fish in the family Engraulidae, order Clupeiformes. Its native range extends from the coastal waters of China, Japan, and South Korea in the Northwest Pacific to interconnected freshwater tributaries (such as the Yellow River and Yangtze River). The middle and lower reaches of the Yangtze River and its affiliated lakes are the most important migration channels for C. nasus. During the spawning season, C. nasus migrates to river estuaries and further up to the middle and lower reaches of the Yangtze River and its connected lakes to spawn. Historically, some C. nasus reached Dongting Lake, over 1000 km up the Yangtze River, to spawn¹. After reproduction, these fish and their offspring return to the sea. Yangtze C. nasus used to be one of the most important economic fish and was known as one of the “Three Yangtze Delicacies”, along with two other fishes (Tenualosa reevesii and Takifugu fasciatus). Owing to its high social and nutritional value, Yangtze C. nasus has been in very high consumer demand in the last decade, which has driven up exploitation of the species. In 1973, the annual production of C. nasus from the Yangtze River reached 4142 tonnes²; it has since declined as a consequence of human influences, including overfishing, habitat degradation and other factors, reaching only 57.5 tonnes in 2012. Additionally, a strong response to stress (such as netting) often causes tissue damage and death in C. nasus, which reduces the survival rate and severely restricts the development of artificial breeding and large-scale cultivation of the species^3,4. As a result, C. nasus was listed as an endangered (EN) species in a 2018 Red List of Threatened Species report from the International Union for Conservation of Nature (IUCN) (www.iucnredlist.org).
To protect this important resource in the Yangtze River, various strategies have been developed. For example, China has banned commercial fishing of Yangtze C. nasus since 2018 and, since 2021, has enforced a 10-year fishing ban on the Yangtze River, the longest and strictest to date, which should benefit the recovery of wild stocks of C. nasus. It is therefore urgent to assess the genetic background of C. nasus to further understand the diversity and population dynamics of this species and to implement conservation efforts for wild populations.

Many anadromous fish show complex migration patterns known as natal homing (migration to the place of natal origin) or natal river migration (migration to a general ‘home’ area, but not necessarily natal water), which require remarkably precise orientation abilities⁵. In addition to the iconic example of salmonid migration, C. nasus may also exhibit natal homing behaviours⁶. If so, a question arises: how do these fish find their way back to their native rivers across such vast distances to locate their spawning grounds? This question can be divided into two parts. First, how does C. nasus orient itself at sea? Second, how does C. nasus locate its birth river? These questions have yet to be answered. Many different mechanisms have been demonstrated for orientation and navigation in some species with long-distance migration, including orientation using information from the sun, polarized light patterns, olfactory cues, and the Earth’s geomagnetic field^7,8,9. During spawning migration in freshwater, anadromous fish rely primarily on the olfactory system to locate their spawning grounds⁵. Migratory fishes also respond to external triggering factors (such as water current, light, temperature, food availability, and upstream distance), which can trigger internal cues (such as circadian rhythm, hormones, and fat deposits) to drive migration and influence migration propensity¹⁰. For C. nasus, olfactory receptor genes have been identified at the transcript level, but the precise navigational mechanism of migration is poorly understood¹¹. In many migratory animals, migration involves a suite of traits with substantial phenotypic variability, which are most likely under genetic control and have been shown to be highly heritable^12,13. Several studies have provided insight into the genetic mechanisms of migratory behaviour in animals that travel large distances and display precise homing ability^14,15,16.
Recent developments in genomics have resulted in a new and powerful molecular approach that could be used to study the genetics behind migratory behaviour. For example, many loci associated with migratory traits across many chromosomes in rainbow trout (Oncorhynchus mykiss) were found¹⁷. A region of the genome consisting of a large block of linkage disequilibrium in O. mykiss (on chromosome 5, or Omy5) is also reported to be closely associated with anadromy¹⁸. Based on the above studies, the identification of the chromosomal regions or genetic mechanisms by obtaining a gap-free reference genome will facilitate a deep understanding of the spawning migration behaviours of C. nasus.

Short-read sequencing technologies have led to a paradigm shift in biology over the last decade. More recently, genomics has seen dramatic advances due to improvements in DNA sequencing technologies and assembly methodology, allowing the generation of more complete genome assemblies. The fast development of long-read sequencing technologies, such as Pacific Biosciences (PacBio) HiFi and Oxford Nanopore Technologies (ONT) sequencing, has overcome early assembly limitations in continuity, correctness, and completeness, making it possible to understand the complexity and structure of genomes¹⁹. The long reads generated by PacBio HiFi and ONT are capable of resolving complex repetitive DNA regions on chromosomes, leading to highly contiguous assemblies with higher mapping certainty^20,21. As a complementary approach, high-throughput chromosome conformation capture (Hi-C) technology can capture three-dimensional chromatin structure information across the genome, and this spatial information can be used to order and orient contigs and scaffolds into chromosome-level assemblies²². Meanwhile, multiple assemblers based on different algorithms have provided an opportunity to generate high-quality assemblies and even achieve gap-free genomes.

Decoding complete genome sequence information is indispensable for the study of genomic variants and biological discoveries. In 2020, a high-quality reference genome of cultivated C. nasus was released²³. However, that version of the genome remained incomplete (87.1% complete), with many gaps. Gap-free genome assemblies are now a reality, allowing for nearly complete identification of genomic information, such as unique genes and structural variations (SVs)^24,25. In recent years, gapless genomes of many species have been deciphered, such as Arabidopsis (Arabidopsis thaliana), rice (Oryza sativa), watermelon (Citrullus lanatus), banana (Musa acuminata), wild strawberry (Fragaria vesca), and humans^26,27,28,29,30, but no gap-free genome assembly has been reported for C. nasus. In our study, we incorporated datasets of Pacific Biosciences (PacBio) HiFi reads, Nanopore ultra-long reads, MGI short reads and Hi-C reads to assemble a gap-free genome of C. nasus, successfully bridging all the remaining assembly gaps across each chromosome in the currently available reference genomes. A gap-free genome assembly is not only essential for developing genomic research on C. nasus but is also a valuable resource for comparative genomics and evolutionary studies in Coilia fishes.

Methods

Sample collection, otolith validation and DNA extraction

A muscle sample was collected from a female C. nasus with a body weight of 211.8 g that was captured on 19 April 2021 in the Taizhou section of the Yangtze River, Jiangsu Province, China (32°12′N, 119°54′E), using a research boat from the Freshwater Fisheries Research Center, Chinese Academy of Fishery Sciences (Fig. 1B). Sample collection was approved by the Department of Agriculture and Rural Affairs of Jiangsu Province, with the approval fishing licence code (Jiangsu) Scientific Fishing (2021) ZX-006 and −007. All specimen sampling was conducted in strict accordance with relevant guidelines and regulations established by the Animal Care and Use Committee of the Freshwater Fisheries Research Center, Chinese Academy of Fishery Sciences. Following a previously published method, the right sagittal otolith of the specimen was analysed with otolith element fingerprinting to confirm whether the fish was migratory³¹. As shown in Fig. 1C, the fluctuation patterns of Sr:Ca indicate a life history spanning freshwater, brackish water and seawater habitats, suggesting that the collected C. nasus specimen is typically anadromous. The muscle tissue below the dorsal fin was taken, quickly frozen in liquid nitrogen and stored at −80 °C until DNA sequencing for genome assembly.

WGS library and PacBio library construction, sequencing and assembly

Long-read sequencing was performed using the PacBio Sequel-II platform, and the short but accurate reads from the MGISEQ platform were analysed for genome survey and evaluation of the assembly.

For the WGS library of short insert reads, genomic DNA was extracted from the muscle tissue by using MZ 1.3 (hypervariable minisatellite probe), as well as locus-specific minisatellite probes (g3, MS1 and MS43). The DNA samples were then sheared into fragments between 50 and 800 bp using a Covaris E220 ultrasonicator (Covaris, Brighton, UK) according to the manufacturer’s recommendations. Fragments between 300 and 400 bp were selected to construct a single-stranded circular DNA library, which was sequenced on an MGISEQ-2000 platform. A total of 86.07 Gb of raw reads were generated (Table 1). Approximately 71.49 Gb of clean reads were retained after adapter sequence removal and low-quality read filtering by SOAPnuke v 2.0³² (parameters: -n 0.01 -l 20 -q 0.1 -i -Q 2 -G 2 -M 2 -A 0.5).
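
The read-level filtering step can be illustrated with a minimal Python sketch. It assumes that -n is the maximum fraction of N bases per read, -l is the base-quality threshold, and -q is the maximum fraction of bases below that threshold; the function name keep_read and the boundary handling are hypothetical, and this is not SOAPnuke's actual implementation (adapter removal and the remaining options are not modelled).

```python
def keep_read(seq, quals, max_n_rate=0.01, low_qual=20, max_low_qual_rate=0.1):
    """Return True if a read passes simplified N-content and base-quality filters.

    seq   -- read sequence as a string
    quals -- per-base Phred quality scores (ints), same length as seq
    The defaults mirror the reported values -n 0.01, -l 20, -q 0.1 (assumed meaning).
    """
    if not seq or len(seq) != len(quals):
        return False
    # Fraction of ambiguous bases in the read.
    n_rate = seq.upper().count("N") / len(seq)
    # Fraction of bases whose quality falls below the threshold.
    low_rate = sum(q < low_qual for q in quals) / len(quals)
    return n_rate <= max_n_rate and low_rate <= max_low_qual_rate


# Fraction of raw bases retained after cleaning, from the figures in the text.
retained_pct = 100 * 71.49 / 86.07
```

With the reported totals, retained_pct comes to roughly 83% of the raw data.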

Table 1 Summary of the sequencing data obtained for C. nasus genome assembly.

For the PacBio platform of long reads, genomic DNA was extracted from the same muscle tissue using a QIAGEN Blood & Cell Culture DNA Midi Kit following the manufacturer’s instructions (QIAGEN, Germany). After DNA preparation, two sequencing libraries were prepared according to the “Using SMRTbell Express Template Prep Kit 2.0 With Low DNA Input” protocol from PacBio and sequenced on PacBio Sequel II SMRT cells in circular consensus sequence (CCS) mode with an insert size of approximately 20 kb (Pacific Biosciences, USA). After the removal of low-quality reads, a total of 36.75 Gb of reads with a mean length of 15.4 kb were processed using the CCS version 4.0.0 (SMRTLink v 8.0.0) algorithm with the parameters “--minPasses 3 --minPredictedAccuracy 0.99 --minLength 500”.

With the HiFi reads of PacBio sequencing, the primary contigs were assembled using the default parameters of Hifiasm (v 0.15.1)³³. Then the Purge Haplotigs program³⁴ was used to remove redundant sequences with the parameters “-j 80 -s 80 -a 30”, which yielded a draft assembly with a size of approximately 850.52 Mb. The maximum contig size and N50 were 11.30 Mb and 0.97 Mb, respectively (Table 2).
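
For readers unfamiliar with the contiguity metric reported here, the following Python sketch shows the standard N50 definition: the length L such that contigs of length L or longer together contain at least half of the assembled bases. The function name n50 and the toy lengths are illustrative only and are not taken from the assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or read) lengths."""
    if not lengths:
        raise ValueError("no lengths given")
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest contig downwards until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


toy_contigs = [2, 3, 4, 5, 6]   # hypothetical lengths, total = 20
toy_n50 = n50(toy_contigs)      # 6 + 5 = 11 >= 10, so the N50 is 5
```

The same function applies unchanged to read lengths, for example when reporting the N50 of ONT reads.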

Table 2 Statistics of C. nasus genome.

Hi-C library preparation, sequencing and chromosome anchoring

To conduct the chromosome-level genome assembly, the draft genome contigs were anchored and oriented using the Hi-C data. In brief, muscle tissue (~1 g) of C. nasus was fixed with 1% formaldehyde for 10–30 min at room temperature to cross-link proteins that are involved in chromatin interaction in the genome. The restriction enzyme MboI (NEB, Ipswich, USA) was added to digest DNA, and fragments with flat or sticky ends were obtained. Biotin marking, proximity ligation, crosslinking reversal, and DNA purification followed procedures described in previous studies³⁵. The Hi-C library was made by capturing the biotin with magnetic beads and sequenced on the MGISEQ-2000 platform, and 125.09 Gb of Hi-C reads were generated (Table 1). A total of 105.32 Gb of clean data were obtained from the sequencing data using SOAPnuke v 2.0 with the parameters “-n 0.01 -l 20 -q 0.1 -i -Q 2 -G 2 -M 2 -A 0.5”.

The Hi-C sequencing data were aligned to the assembled contigs using BWA v 0.7.12³⁶. We also utilized the juicer pipeline v 1.5 to remove the erroneous mappings (MAPQ = 0) and duplicated contigs to obtain the interaction matrix. Following this, approximately 193.87 million read pairs (~55.23%) were used to anchor the contigs into chromosomes with the 3D-DNA pipeline v 180922³⁷. The 3D-DNA pipeline was also used to remove selected short contigs using default parameters. Scaffolds were manually checked and refined with JUICEBOX Assembly Tools (v 2.15.07)³⁸. Using these Hi-C data, the assembled sequences were anchored and oriented onto 24 chromosomes with a total length of 847.47 Mb, covering ~99.64% of the scaffold-level genome (Fig. 2A,B). Chromosome lengths ranged from 28.96 to 45.20 Mb (Table 3).
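
The anchoring figures above can be cross-checked with a few lines of Python. This sketch assumes that the "scaffold-level genome" corresponds to the 850.52 Mb draft assembly reported earlier and that the 193.87 figure counts millions of read pairs; both are readings of the text rather than statements from the authors, and the variable names are illustrative.

```python
anchored_mb = 847.47     # total length of the 24 chromosomes (from the text)
draft_mb = 850.52        # draft assembly size (assumed denominator)
valid_pairs_m = 193.87   # read pairs used for anchoring, assumed to be millions
valid_fraction = 0.5523  # reported share of read pairs used (~55.23%)

# Share of the draft assembly placed on chromosomes.
anchored_pct = 100 * anchored_mb / draft_mb

# Total number of Hi-C read pairs implied by the reported share, in millions.
total_pairs_m = valid_pairs_m / valid_fraction

print(f"anchored: {anchored_pct:.2f}%")
print(f"implied total read pairs: {total_pairs_m:.1f} million")
```

Under these assumptions the anchored share reproduces the reported ~99.64%, and the implied total is about 351 million read pairs.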

Table 3 Summary of assembled 24 chromosomes of C. nasus.

Oxford Nanopore PromethION library preparation, sequencing and assembly

For ONT sequencing, genomic DNA was extracted using the CTAB method (> 50 kb) with the SageHLS HMW library system (Sage Science) and was processed using the Ligation sequencing 1D kit (SQK-LSK109, Oxford Nanopore Technologies, Oxford, UK) according to the manufacturer’s instructions to prepare the ONT library. The genome was sequenced on the Nanopore PromethION platform (Oxford Nanopore Technologies) at the Genome Center of Grandomics (Wuhan, China). After filtering out reads with length < 5 kb and quality value < 7, a total of 24.33 Gb of ONT long reads were retained, with a read N50 of 47.11 kb and a longest read of 462.45 kb. The ultra-long ONT reads were corrected to improve the final consensus assembly by NECAT (https://www.nature.com/articles/s41467-020-20236-7) with the following parameters: ‘OVLP_FAST_OPTIONS = -n 500 -z 20 -b 2000 -e 0.5 -j 0 -u 1 -a 1000’ and ‘CNS_FAST_OPTIONS = -a 2000 -x 4 -y 12 -l 1000 -e 0.5 -p 0.8 -u 0’. The corrected ONT ultra-long reads were used to fill gaps in the above assembly by running three iterations of LR_Gapcloser (v1.0)³⁹ and TGS-GapCloser (v 1.0.1)⁴⁰ with the parameter “--min_match 2000”. The gap-free level was reached after three rounds of gap filling. With all these processes, we generated a genome assembly of C. nasus with a genome size of approximately 851.67 Mb and an N50 of 35.42 Mb (Table 2).
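
One simple way to verify that an assembly is gap-free is to count runs of N bases in each sequence, since scaffolding gaps are conventionally written as stretches of N. The Python sketch below is a minimal illustration of that check and is not part of the authors' pipeline; the function name count_gaps and the toy FASTA record names are hypothetical.

```python
import re


def count_gaps(fasta_text):
    """Return {sequence name: number of runs of N/n} for FASTA-formatted text."""
    seqs = {}
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the name.
            fields = line[1:].split()
            name = fields[0] if fields else ""
            seqs[name] = []
        elif line and name is not None:
            seqs[name].append(line)
    # Join the lines first so a gap spanning a line break counts once.
    return {n: len(re.findall(r"[Nn]+", "".join(parts))) for n, parts in seqs.items()}


toy_fasta = ">chrA\nACGTNNNNACGT\nNNAC\n>chrB example\nACGTACGT\n"
gap_counts = count_gaps(toy_fasta)   # chrA has two runs of N, chrB has none
is_gap_free = all(count == 0 for count in gap_counts.values())
```

For a genome of this size the whole file fits in memory on a typical workstation, although a streaming parser would be preferable for larger assemblies.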

Repetitive sequence annotation

A combined strategy based on de novo searches and homologue alignments was used to annotate whole-genome repeat elements. A de novo repetitive element database was built by RepeatModeler v 1.0.4⁴¹ and long terminal repeats were annotated by LTR_FINDER v 1.0.7⁴². For homologue prediction, DNA and protein transposable elements (TEs) were detected by RepeatMasker (v 4.0.7)⁴³ and RepeatProteinMasker (v 4.0.7)⁴⁴, respectively, based on the Repbase database. Tandem repeats were identified by Tandem Repeats Finder v 4.10.0⁴⁵. The combination of Repbase and our de novo TE library revealed that 38.26% of the assembled C. nasus genome was annotated as repetitive elements, of which short interspersed nuclear elements (SINEs) and long terminal repeats (LTRs) accounted for 0.44% and 8.86% of the whole genome, respectively, and long interspersed nuclear elements (LINEs) accounted for 9.98% (Table 4).

Table 4 Summary statistics of repetitive sequences annotation in C. nasus genome.

Protein-coding gene annotation

To obtain protein-coding genes, we employed de novo prediction, homology-based annotation and RNA-Seq assisted prediction. For de novo prediction, gene models of C. nasus were predicted by Augustus (v 3.2.1)⁴⁶ with default parameters. For homology-based prediction, protein sequences of six representative teleosts, including Clupea harengus (GCF_900700415.2), Danio rerio (GCF_000002035.6), Denticeps clupeoides (GCF_900700375.1), Electrophorus electricus (GCF_013358815.1), Oncorhynchus mykiss (GCF_013265735.2) and Sardina pilchardus (GCA_900499035.1), were downloaded from the National Center for Biotechnology Information (NCBI). GeMoMa (v1.8) was used to search coding structures based on transcriptome data and homologous proteins⁴⁷. For the transcriptome-based annotation, pooled RNA-seq reads from the liver, brain, and stomach were mapped onto the C. nasus genome by using Hisat2 (v 2.1.0)⁴⁸ with the following parameters: --sensitive --no-discordant --no-mixed -I 1 -X 1000 --max-intronlen 1000000. The aligned reads were assembled using Stringtie (v 1.3.5)⁴⁹ with the following parameters: -f 0.3 -j 3 -c 5 -g 100 -s 10000. Subsequently, TransDecoder (v 5.5.0; https://github.com/TransDecoder/TransDecoder) was used to identify the coding sequence with default parameters. The abovementioned transcriptome data and homologous proteins were merged by GeMoMa v1.8 software. A total of 21,971 protein-coding genes with a mean length of 23,357 bp were predicted (close to the 21,469 of Danio rerio; Table 5). The final gene sets were functionally annotated by aligning the gene sequences to KEGG (Kyoto Encyclopedia of Genes and Genomes, http://www.genome.jp/kegg/), Swiss-Prot (http://www.gpmaw.com/html/swiss-prot.html), TrEMBL (http://www.uniprot.org), KOG⁵⁰, and NR (NCBI nonredundant protein) databases using BLASTp v 2.2.26⁵¹ with an E-value threshold of 1E-5. The protein domains and motifs were annotated using InterProScan⁵². Gene Ontology (GO)⁵³ terms were obtained from the InterProScan results in this study. Approximately 99.68% (21,900 genes) of the total predicted genes were successfully annotated by at least one database (Fig. 3 and Table 6). Of these functional proteins, 17,190 genes (~78.24%) were supported by all five databases.
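
The "at least one database" and "all databases" counts are a union and an intersection over the per-database hit lists. The Python sketch below shows that calculation on toy data and then recomputes the reported percentages from the gene counts in the text; the function name annotation_summary, the gene identifiers and the toy database contents are hypothetical.

```python
def annotation_summary(all_genes, hits_by_db):
    """Count genes annotated by at least one database and by every database."""
    all_genes = set(all_genes)
    # Ignore hits for identifiers that are not in the predicted gene set.
    hit_sets = [set(hits) & all_genes for hits in hits_by_db.values()]
    any_db = set().union(*hit_sets)
    every_db = set.intersection(*hit_sets) if hit_sets else set()
    total = len(all_genes)
    return {
        "any": len(any_db),
        "all": len(every_db),
        "any_pct": 100 * len(any_db) / total if total else 0.0,
        "all_pct": 100 * len(every_db) / total if total else 0.0,
    }


# Toy example with hypothetical gene identifiers.
toy = annotation_summary(
    ["g1", "g2", "g3", "g4"],
    {"KEGG": ["g1", "g2"], "NR": ["g1", "g2", "g3"], "KOG": ["g1"]},
)

# Percentages recomputed from the counts reported in the text.
any_pct_reported = 100 * 21900 / 21971
all_pct_reported = 100 * 17190 / 21971
```

Both recomputed values round to the percentages given above (99.68% and 78.24%).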

Table 5 Statistics of predicted protein-coding genes in the C. nasus genome.

Table 6 Statistics of functional annotation of C. nasus.

Data Records

The sequencing dataset and genome assembly of C. nasus have been deposited in the Sequence Read Archive (SRA) under project number SRP405363⁵⁴. DNA sequencing data from the WGS library were deposited in the SRA at SRR22102323⁵⁵. DNA sequencing data from the ONT library were deposited in the SRA at SRR22102324⁵⁶. DNA sequencing data from the Hi-C library were deposited in the SRA at SRR22102325⁵⁷. DNA sequencing data from the PacBio HiFi library were deposited in the SRA at SRR22102326⁵⁸. This Whole Genome Shotgun project was deposited at GenBank under accession JAPTFL000000000⁵⁹. Moreover, files of the assembled genome, gene structure annotation and repeat prediction annotation of C. nasus were deposited in Figshare database under DOI code⁶⁰.

Technical Validation

Evaluation of the genome assembly

The paired-end short reads (including DNA and RNA sequencing) were mapped to the assembled genome using BWA, and 98.06% of the DNA reads and 96.21% of the RNA reads could be mapped. Furthermore, the HiFi sequencing data were mapped to the assembled genome using Minimap2, with a mapping rate of 99.82%⁶¹. In terms of assembly metrics such as contig length, gap number and BUSCO completeness, our new genome assembly showed a great improvement over the previously reported C. nasus genome. Compared with the previously published assembly²³, our gap-free genome assembly improved contiguity as measured by contig N50. Among the published genomes in Clupeiformes, the assembly in this study had the longest contig N50 length and was the first gap-free genome, suggesting that our C. nasus genome was of high quality (Table 2).

The completeness of the assembled genome sequence was evaluated using Benchmarking Universal Single-Copy Orthologs (BUSCO, v 5.1.0). The BUSCO analysis based on the actinopterygii_odb10 database showed that 92.5% of the expected actinopterygii_odb10 genes (single-copy genes: 90.7% and duplicated genes: 1.8%) were identified as complete, and 2.5% were found as fragmented genes in the genome assembly. The remaining 5.0% were missing from our C. nasus genome. Nevertheless, the completeness of the C. nasus genome was superior to that of other currently public Clupeiformes genomes.
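
BUSCO reports these categories in a compact one-line summary of the form C:..%[S:..%,D:..%],F:..%,M:..%. The Python sketch below parses such a line and checks that the categories are internally consistent; the summary string is constructed here from the percentages reported above rather than copied from the authors' output, and the function name parse_busco is hypothetical.

```python
import re

# Pattern for the percentage part of a BUSCO one-line summary.
BUSCO_LINE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%"
)


def parse_busco(summary):
    """Return the BUSCO category percentages as a dict of floats."""
    match = BUSCO_LINE.search(summary)
    if match is None:
        raise ValueError("not a BUSCO summary line")
    return {key: float(value) for key, value in match.groupdict().items()}


# Summary string assembled from the values reported in the text (an assumption).
scores = parse_busco("C:92.5%[S:90.7%,D:1.8%],F:2.5%,M:5.0%")

# Complete = single-copy + duplicated; all categories should sum to 100%.
complete_ok = abs(scores["S"] + scores["D"] - scores["C"]) < 0.05
total_ok = abs(scores["C"] + scores["F"] + scores["M"] - 100.0) < 0.05
```

Both checks pass for the reported values, so the single-copy, duplicated, fragmented and missing fractions are mutually consistent.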

Evaluation of the gene annotation

With this gap-free reference genome, we identified approximately 325.81 Mb repetitive sequences of the assembled C. nasus genome, accounting for 38.26% of the total genome sequences. The repetitive elements in the C. nasus genome sequences were masked, and the repeat-masked genome was used for the gene prediction (Tables 7, 8).
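
As a quick arithmetic check, the repeat fraction can be recomputed from the two lengths given in the text, and the class-level percentages can be converted to approximate lengths. This Python sketch assumes that the percentages are expressed relative to the 851.67 Mb assembly; the variable names are illustrative.

```python
genome_mb = 851.67   # assembled genome size (from the text)
repeat_mb = 325.81   # total repetitive sequence (from the text)

# Repeat fraction recomputed from the two lengths.
repeat_pct = 100 * repeat_mb / genome_mb

# Reported class-level percentages of the whole genome.
class_pct = {"SINE": 0.44, "LTR": 8.86, "LINE": 9.98}

# Approximate length of each class in Mb, assuming the same denominator.
class_mb = {name: genome_mb * pct / 100 for name, pct in class_pct.items()}

for name, mb in class_mb.items():
    print(f"{name}: {mb:.2f} Mb")
```

The recomputed fraction rounds to the reported 38.26%.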

Table 7 Summary of transposon element families in C. nasus based on various methods.

Table 8 Statistics of classified repeat in the C. nasus.

We also performed BUSCO analysis with the actinopterygii_odb10 database to assess the completeness of the coding sequences for C. nasus. The annotation comprised a total of 21,971 protein-coding genes, with an average of 11 exons per gene (Table 9). Approximately 99.68% (21,900 genes) of the total predicted genes were assigned at least one functional annotation, indicating a more complete annotation. Furthermore, we compared the conserved synteny between C. nasus and C. harengus to validate the chromosome assembly⁶². We observed highly conserved synteny and strict correspondence of chromosome assignment (Fig. 4).

Table 9 The evidence supporting gene models of the C. nasus genome.

Code availability

No specific code was developed for this work. The data analyses were performed according to the manuals and protocols provided by the developers of the corresponding bioinformatics tools in the methods.

References

Yang, Q. L., Gao, T. X. & Miao, Z. Q. Differentiation between populations of Japanese grenadier anchovy (Coilia nasus) in Northwestern Pacific based on ISSR markers: Implications for biogeography. Biochem Syst and Ecol 39, 286–296 (2011).
Article CAS Google Scholar
Shen, H. S. et al. In-depth transcriptome analysis of Coilia ectenes, an important fish resource in the Yangtze River: de novo assembly, gene annotation. Mar Genom 23, 15–17 (2015).
Article Google Scholar
Xu, G. C., Du, F. K., Li, Y., Nie, Z. J. & Xu, P. Integrated application of transcriptomics and metabolomics yields insights into population-asynchronous ovary development in Coilia nasus. Sci Rep 6, 31835 (2016).
Article ADS CAS PubMed PubMed Central Google Scholar
Du, F. K., Xu, G. C., Li, Y., Nie, Z. J. & Xu, P. Glyoxalase 1 gene of Coilia nasus: molecular characterization and differential expression during transport stress. Fisheries Sci 82, 719–728 (2016).
Article CAS Google Scholar
Bett, N. N. & Hinch, S. G. Olfactory navigation during spawning migrations: a review and introduction of the Hierarchical Navigation Hypothesis. Biol Rev 91, 728–759 (2016).
Article PubMed Google Scholar
Xuan, Z. Y., Jiang, T., Liu, H. B. & Yang, J. Otolith microchemistry and microsatellite DNA provide evidence for divergence between estuarine tapertail anchovy (Coilia nasus) populations from the Poyang Lake and the Yangtze River Estuary of China. Reg Stud Mar Sci 56, 102649 (2022).
Google Scholar
Brönmark, C. et al. There and back again: migration in freshwater fishes. Can J Zool 92, 467–479 (2014).
Article Google Scholar
Able, K. P. in Animal Migration, Orientation and Navigation (ed Gauthreaux, S. A.) Ch. 5 283–373 (Academic Press, 1980).
Alerstam, T., Hedenström, A. & Åkesson, S. Long-distance migration: evolution and determinants. Oikos: A Journal of Ecology 103, 247–260 (2003).
Article Google Scholar
Baerwald, M. R. et al. Migration-related phenotypic divergence is associated with epigenetic modifications in rainbow trout. Mol Ecol 25, 1785–1800 (2016).
Article CAS PubMed Google Scholar
Zhu, G. L., Wang, L. J., Tang, W. Q., Wang, X. M. & Wang, C. Identification of olfactory receptor genes in the Japanese grenadier anchovy Coilia nasus. Genes Genom 39, 521–532 (2017).
Article CAS Google Scholar
Liedvogel, M., Akesson, S. & Bensch, S. The genetics of migration on the move. Trends Ecol Evol 26, 561–569 (2011).
Article PubMed Google Scholar
Teplitsky, C., Mouawad, N. G., Balbontin, J., De Lope, F. & Møller, A. P. Quantitative genetics of migration syndromes: a study of two barn swallow populations. J Evolution Biol 24, 2025–2039 (2011).
Article CAS Google Scholar
Zhu, H. S., Gegear, R. J., Casselman, A., Kanginakudru, S. & Reppert, S. M. Defining behavioral ad molecular differences between summer and migratory monarch butterflies. BMC Biology 7, 14 (2009).
Article PubMed PubMed Central Google Scholar
Hecht, B. C., Campbell, N. R., Holecek, D. E. & Narum, S. R. Genome-wide association reveals genetic basis for the propensity to migrate in wild populations of rainbow and steelhead trout. Mol Ecol 22, 3061–3076 (2013).
Article CAS PubMed Google Scholar
O’Malley, K. G., Jacobson, D. P., Kurth, R., Dill, A. J. & Banks, M. A. Adaptive genetic markers discriminate migratory runs of Chinook salmon (Oncorhynchus tshawytscha) amid continued gene flow. Evol Appl 6, 1184–1194 (2013).
Article PubMed PubMed Central Google Scholar
Hale, M. C., Thrower, F. P., Berntson, E. A., Miller, M. R. & Nichols, K. M. Evaluating adaptive divergence between migratory and nonmigratory ecotypes of a salmonid fish. Oncorhynchus mykiss. G3 Genes Genom Genet 3, 1273–1285 (2013).
Google Scholar
Pearse, D. E., Miller, M. R., Abadía-Cardoso, A. & Garza, J. C. Rapid parallel evolution of standing variation in a single, complex, genomic region is associated with life history in steelhead/rainbow trout. Pro Biol Sci 281, 20140012 (2014).
Google Scholar
Amarasinghe, S. L. et al. Opportunities and challenges in long-read sequencing data analysis. Genome Biol 21, 1 (2020).
Article Google Scholar
Naish, M. et al. The genetic and epigenetic landscape of the Arabidopsis centromeres. Science 374, eabi7489 (2021).
Article PubMed PubMed Central Google Scholar
Pollard, M. O., Gurdasani, D., Mentzer, A. J., Porter, T. & Sandhu, M. S. Long reads: their purpose and place. Hum Mol Genet 27, 234–241 (2018).
Article Google Scholar
Hu, G. Q. Evaluation of 3D Chromatin Interactions Using Hi-C. Methods Mol Biol 2117, 65–78 (2020).
Article CAS PubMed PubMed Central Google Scholar
Xu, G. C. et al. Genome and population sequencing of a chromosome-level genome assembly of the Chinese tapertail anchovy (Coilia nasus) provides novel insights into migratory adaptation. Gigascience 9, 1–13 (2020).
Article Google Scholar
Li, K. et al. Gapless indica rice genome reveals synergistic contributions of active transposable elements and segmental duplications to rice genome evolution. Mol Plant 14, 1745–1756 (2021).
Article CAS PubMed Google Scholar
Song, J. M. et al. Two gap-free reference genomes and a global view of the centromere architecture in rice. Mol Plant 14, 1757–1767 (2021).
Article CAS PubMed Google Scholar
Zhang, Y. L. et al. The telomere-to-telomere gap-free genome of four rice parents reveals SV and PAV patterns in hybrid rice breeding. Plant Biotechnol J 20, 1642–1644 (2022).
Article CAS PubMed PubMed Central Google Scholar
Deng, Y. et al. A telomere-to-telomere gap-free reference genome of watermelon and its mutation library provide important resources for gene discovery and breeding. Mol Plant 15, 1268–1284 (2022).
Article CAS PubMed Google Scholar
Nurk, S. et al. The complete sequence of a human genome. Science 376, 44–53 (2022).
Article ADS CAS PubMed PubMed Central Google Scholar
Hou, X. R., Wang, D. P., Cheng, Z. K., Wang, Y. & Jiao, Y. L. A near-complete assembly of an Arabidopsis thaliana genome. Mol Plant 8, 1247–1250 (2022).
Article Google Scholar
Belser, C. et al. Telomere-to-telomere gapless chromosomes of banana using nanopore sequencing. Commun Biol 4, 1047 (2021).
Article CAS PubMed PubMed Central Google Scholar
Jiang, T., Liu, H. B., Hu, Y. H., Chen, X. B. & Yang, J. Revealing population connectivity of the estuarine tapertail anchovy Coilia nasus in the Changjiang River estuary and its adjacent waters using otolith microchemistry. Fishes 7, 147 (2022).
Article Google Scholar
Chen, Y. X. et al. SOAPnuke: a MapReduce acceleration-supported software for integrated quality control and preprocessing of high-throughput sequencing data. GigaScience 7, 1–6 (2018).
Article ADS MathSciNet PubMed PubMed Central Google Scholar
Cheng, H. Y., Concepcion, G. T., Feng, X. W., Zhang, H. W. & Heng, L. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat Methods 18, 170–175 (2021).
Article CAS PubMed PubMed Central Google Scholar
Roach, M. J., Schmidt, S. A. & Borneman, A. R. Purge Haplotigs: allelic contig reassignment for third-gen diploid genome assemblies. BMC Bioinformatics 19, 460 (2018).
Rao, S. S. P. et al. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell 159, 1665–1680 (2014).
Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
Dudchenko, O. et al. De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science 356, 92–95 (2017).
Durand, N. C. et al. Juicer provides a one-click system for analyzing loop-resolution Hi-C experiments. Cell Syst 3, 95–98 (2016).
Xu, M. Y. et al. TGS-GapCloser: a fast and accurate gap closer for large genomes with low coverage of error-prone long reads. GigaScience 9, giaa094 (2020).
Xu, G. C. et al. LR_Gapcloser: a tiling path-based gap closer that uses long reads to complete genome assembly. GigaScience 8, giy157 (2019).
Tarailo‐Graovac, M. & Chen, N. S. Using RepeatMasker to identify repetitive elements in genomic sequences. Curr Protoc Bioinformatics Chapter 4, Unit 4.10, 1 (2009).
Xu, Z. & Wang, H. LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic Acids Res 35, W265–W268 (2007).
Price, A. L., Jones, N. C. & Pevzner, P. A. De novo identification of repeat families in large genomes. Bioinformatics 21, i351–i358 (2005).
Bao, W., Kojima, K. K. & Kohany, O. Repbase Update, a database of repetitive elements in eukaryotic genomes. Mob DNA 6, 11 (2015).
Benson, G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res 27, 573–580 (1999).
Stanke, M., Diekhans, M., Baertsch, R. & Haussler, D. Using native and syntenically mapped cDNA alignments to improve de novo gene finding. Bioinformatics 24, 637–644 (2008).
Keilwagen, J., Hartung, F. & Grau, J. GeMoMa: homology-based gene prediction utilizing intron position conservation and RNA-seq data. Methods Mol Biol 1962, 161–177 (2019).
Kim, D., Paggi, J. M., Park, C., Bennett, C. & Salzberg, S. L. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat Biotechnol 37, 907–915 (2019).
Kovaka, S. et al. Transcriptome assembly from long-read RNA-seq alignments with StringTie2. Genome Biol 20, 278 (2019).
Korf, I. Gene finding in novel genomes. BMC Bioinformatics 5, 59 (2004).
Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J Mol Biol 215, 403–410 (1990).
Mulder, N. & Apweiler, R. InterPro and InterProScan: tools for protein sequence classification and comparison. Methods Mol Biol 396, 59 (2007).
Ashburner, M. et al. Gene Ontology: tool for the unification of biology. Nat Genet 25, 25–29 (2000).
NCBI Sequence Read Archive https://identifiers.org/insdc.sra:SRP405363 (2022).
NCBI Sequence Read Archive https://identifiers.org/insdc.sra:SRR22102323 (2022).
NCBI Sequence Read Archive https://identifiers.org/insdc.sra:SRR22102324 (2022).
NCBI Sequence Read Archive https://identifiers.org/insdc.sra:SRR22102325 (2022).
NCBI Sequence Read Archive https://identifiers.org/insdc.sra:SRR22102326 (2022).
Ma, F. J. et al. Coilia nasus isolate 0094818, whole genome shotgun sequencing project. GenBank https://identifiers.org/ncbi/insdc:JAPTFL000000000 (2022).
Ma, F. J. et al. Gap-free genome assembly of anadromous Chinese tapertail anchovy, Coilia nasus. figshare https://doi.org/10.6084/m9.figshare.21529488 (2022).
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
í Kongsstovu, S. et al. Using long and linked reads to improve an Atlantic herring (Clupea harengus) genome assembly. Scientific Reports 9, 17716 (2019).

Acknowledgements

This work was funded by the Monitoring of Aquatic Living Resources in Jiangsu Section in the Mainstream of the Yangtze River (2021-SJ-110-04), and the Monitoring of Aquatic Living Resources in Key Waters of Anhui Province (ZF2021-18-0786).

Author information

These authors contributed equally: Fengjiao Ma, Yinping Wang, Bixiu Su, Chenxi Zhao.

Authors and Affiliations

Wuxi Fisheries College, Nanjing Agricultural University, Wuxi, 214081, China
Fengjiao Ma, Yinping Wang, Pao Xu, Guojun Yin & Kai Liu
Key Laboratory of Freshwater Fisheries and Germplasm Resources Utilization, Ministry of Agriculture and Rural Affairs, Freshwater Fisheries Research Center, Chinese Academy of Fishery Sciences, Wuxi, 214081, China
Yinping Wang, Denghua Yin, Yanping Yang, Pao Xu, Guojun Yin & Kai Liu
BGI Genomics, BGI-Shenzhen, Shenzhen, 518083, China
Bixiu Su, Chenxi Zhao, Chunhai Chen, Chenhe Wang, Bei Luo, Hongqi Wang & Jianbo Jian
National Demonstration Center for Experimental Fisheries Science Education, Shanghai Ocean University, Shanghai, 201306, China
Yanmin Deng, Pao Xu & Kai Liu

Contributions

K.L., J.B.J., G.J.Y. and P.X. conceived of the study. D.H.Y., Y.M.D. and Y.P.Y. collected and prepared the samples. C.X.Z., C.H.C. and J.B.J. performed bioinformatics analysis. F.J.M., Y.P.W., B.X.S. and C.X.Z. wrote the manuscript with significant contributions from C.H.C., B.L., H.Q.W. and C.H.W. K.L. provided the financial support. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Pao Xu, Guojun Yin, Jianbo Jian or Kai Liu.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Ma, F., Wang, Y., Su, B. et al. Gap-free genome assembly of anadromous Coilia nasus. Sci Data 10, 360 (2023). https://doi.org/10.1038/s41597-023-02278-w

Received: 30 December 2022
Accepted: 30 May 2023
Published: 06 June 2023
DOI: https://doi.org/10.1038/s41597-023-02278-w
Springer Nature Limited

This article is cited by

Single molecule real-time sequencing data sets of Hypericum perforatum L. plantlets and cell suspension cultures
- Rajendran K. Selvakesavan
- Maria Nuc
- Gregory Franklin
Scientific Data (2024)
Telomere-to-telomere gapless genome assembly of the Chinese sea bass (Lateolabrax maculatus)
- Zhilong Sun
- Shuo Li
- Changwei Shao
Scientific Data (2024)
