Abstract
Long-read sequencing technologies can generate far more contiguous genome assemblies than short-read methods, but their higher cost is often a significant barrier. To address this, we explore mapping-based genome assembly and reference-guided assembly as cost-effective alternatives, and we assess how well each approach improves the contiguity of the Clarias batrachus and Culter alburnus draft genomes. Our findings show that an iterative mapping strategy reduces assembly errors. Specifically, the Mismatches per 100 kbp value for the C. batrachus genome decreased from 2447.20 to 2432.67 after three iterations, with a minimum of 2422.67 after two iterations. The N50 value for the C. batrachus genome increased from 362,143 bp to 1,315,126 bp, with a maximum of 1,315,403 bp after two iterations. With reference-guided assembly, we achieved Mismatches per 100 kbp values of 3.70 for C. batrachus and 0.34 for C. alburnus. Correspondingly, the N50 values for the C. batrachus and C. alburnus genomes increased from 362,143 bp and 3,686,385 bp to 2,026,888 bp and 43,735,735 bp, respectively. Finally, using the combined approach of Ragout and RagTag, we successfully applied the improved C. batrachus and C. alburnus genomes to comparative genomic studies. Through a comprehensive comparison of mapping-based and reference-guided genome assembly methods, we clarify the specific contributions of reference-guided assembly to reducing assembly errors and improving assembly continuity and integrity. These results establish reference-guided assembly, together with in silico libraries, as a promising and suitable approach for comparative genomics studies.
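The two metrics reported above can be illustrated with a minimal Python sketch. It assumes the standard definition of N50 (the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length) and the QUAST-style normalization of mismatches to 100,000 aligned bases. The function names `n50` and `mismatches_per_100kbp` and the toy inputs are hypothetical; they are not taken from the study's pipeline or data.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig/scaffold lengths (in bp)."""
    ordered = sorted(lengths, reverse=True)  # longest sequences first
    if not ordered:
        raise ValueError("at least one sequence length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first sequence that brings the running sum
        # to at least half of the total assembly length.
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


def mismatches_per_100kbp(mismatches: int, aligned_bases: int) -> float:
    """Normalize a mismatch count to a rate per 100,000 aligned bases."""
    if aligned_bases <= 0:
        raise ValueError("aligned_bases must be positive")
    return mismatches * 100_000 / aligned_bases


if __name__ == "__main__":
    toy_lengths = [80, 70, 50, 40, 30, 20]  # hypothetical contig lengths
    print("N50:", n50(toy_lengths))
    print("Mismatches per 100 kbp:", mismatches_per_100kbp(37, 1_000_000))
```

Because N50 depends only on the length distribution, it can rise through scaffolding even when base-level accuracy does not change, which is why the abstract reports it alongside the mismatch rate.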
Availability of Data and Material
This study used several publicly available data sets from the NCBI Sequence Read Archive and the European Nucleotide Archive (ENA): SRR7440018, GCA_003987875, GCA_013621035, GCA_011419295, GCA_009869775, GCA_024489055, and GCA_018812025.
Code Availability
Not applicable.
Funding
This work was supported by the Science & Technology Innovation Program of Hangzhou Academy of Agricultural Sciences (grant number 2022HNCT-01).
Author information
Contributions
Kai Liu and Nan Xie conducted the experiments; Kai Liu analyzed the data and wrote the manuscript; Yuxi Wang and Xinyi Liu participated in the data collection. All authors read and approved the final manuscript.
Ethics declarations
Ethics Approval
Approval from the Science and Technology Bureau of China and the Department of Wildlife Administration is not required for the experiments conducted in this paper when the fish in question are neither rare nor near extinction (first- or second-class state protection level). All activities comply with China's Wildlife Protection and Fishery Law.
Consent to Participate
The participant has consented to participation in the manuscript.
Consent for Publication
The participant has consented to the submission of the manuscript to the journal.
Conflict of Interest
The authors declare no competing interests.
About this article
Cite this article
Liu, K., Xie, N., Wang, Y. et al. The Utilization of Reference-Guided Assembly and In Silico Libraries Improves the Draft Genome of Clarias batrachus and Culter alburnus. Mar Biotechnol 25, 907–917 (2023). https://doi.org/10.1007/s10126-023-10248-x
DOI: https://doi.org/10.1007/s10126-023-10248-x