Genome sequence assembly algorithms and misassembly identification methods

Meng, Yue; Lei, Yu; Gao, Jianlong; Liu, Yuxuan; Ma, Enze; Ding, Yunhong; Bian, Yixin; Zu, Hongquan; Dong, Yucui; Zhu, Xiao

doi:10.1007/s11033-022-07919-8

Review
Published: 23 September 2022

Volume 49, pages 11133–11148, (2022)

Molecular Biology Reports

Abstract

Sequence assembly algorithms have evolved rapidly alongside the growth of genome sequencing technology over the past two decades. Assembly reconstructs the target genome mainly by iteratively extending sequences through their overlap relationships. Assembly algorithms are typically classified into several categories, such as the Greedy strategy, the Overlap-Layout-Consensus (OLC) strategy, and the de Bruijn graph (DBG) strategy. In particular, with the rapid development of third-generation sequencing (TGS) technology, several widely used assembly algorithms have been proposed to generate high-quality chromosome-level assemblies. However, because of genome complexity, the limited length of short reads, and the high error rate of long reads, the contigs produced by assembly may contain misassemblies that adversely affect downstream data analysis. Several read-based and reference-based methods for misassembly identification have therefore been developed to improve assembly quality. This work reviews the development of DNA sequencing technologies and summarizes sequencing data simulation methods, sequencing error correction methods, the mainstream sequence assembly algorithms, and misassembly identification methods. The heavy computational load makes the sequence assembly problem more challenging, so more efficient and accurate assembly algorithms, and alternative algorithms, need to be developed.
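To make the idea of iteratively extending sequences through their overlaps concrete, here is a minimal Python sketch of the Greedy strategy named above. It is an illustration only, not the method of any tool covered by the review: the function names `overlap` and `greedy_assemble` and the `min_len` threshold are assumptions introduced here, and the sketch ignores sequencing errors, reverse complements, and repeats.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when no overlap of at least `min_len` bases exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the largest overlap.

    Stops when no remaining pair overlaps by at least `min_len` bases
    and returns the resulting contigs.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs
        # Join the two sequences, writing the shared bases only once.
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


if __name__ == "__main__":
    # Hypothetical error-free reads chosen for illustration.
    print(greedy_assemble(["ATGGCG", "GCGTGC", "TGCAAT"]))
```

The all-pairs search in each round is quadratic in the number of sequences, which hints at why the computational cost mentioned in the abstract becomes a concern on real read sets.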

Abbreviations

bp: Base pair
CCS: Circular Consensus Sequencing
CLR: Continuous Long Reads
DBG: De Bruijn graph
DDF: Distance difference factor
FBG: Fuzzy-Bruijn graph
FCD: Fragment coverage distribution
FPR: False-positive rate
GFA: Graphical fragment assembly
GPU: Graphics processing unit
HiFi: High fidelity
HTS: High-throughput sequencing
kb: Kilobase pair
Mb: Megabase pair
OLC: Overlap-Layout-Consensus
POA: Partial order alignment
QVs: Quality values
SMRT: Single-Molecule Real-Time
SNP: Single nucleotide polymorphism
SV: Structural variation
SVs: Structural variations
tf-idf: Term frequency, inverse document frequency
TGS: Third-generation sequencing
WGS: Whole-genome sequencing

Funding

This work was supported by grants from the National Natural Science Foundation of China (61902094), the Heilongjiang Provincial Natural Science Foundation of China (QC2018082), the Shandong Provincial Natural Science Foundation of China (ZR2021MH036), the Science and Technology Program of Binzhou Medical University (50012304325), the Fundamental Research Funds for the Central University (2017-KYYWF-0140), the University Nursing Program for Young Scholars with Creative Talents in Heilongjiang Province (UNPYSCT) (UNPYSCT-2018183), the Doctoral Scientific Research Foundation of Harbin Normal University (XKB201916, XKB201801), the Science Foundation of School of Computer Science and Information Engineering of Harbin Normal University (JKYKYZ202004, JKYKYZ202104), and the Graduate Innovative Research Project of Harbin Normal University (HSDSSCX2021-31).

Author information

Yue Meng, Yu Lei and Jianlong Gao have contributed equally to this work.

Authors and Affiliations

School of Computer and Control Engineering, Yantai University, Yantai, Shandong, China
Xiao Zhu
School of Information Engineering, Zhengzhou University of Industrial Technology, Zhengzhou, Henan, China
Yue Meng
School of Computer Science and Information Engineering, Harbin Normal University, Harbin, Heilongjiang, China
Jianlong Gao, Yuxuan Liu, Enze Ma, Yunhong Ding & Yixin Bian
Department of Big Data and Intelligent Engineering, Shanxi Institute of Technology, Yangquan, Shanxi, China
Yu Lei
Center of Network and Information, Harbin Institute of Technology, Harbin, Heilongjiang, China
Hongquan Zu
Department of Immunology, Binzhou Medical University, Yantai, Shandong, China
Yucui Dong

Contributions

YM, YL (Yu Lei) and JG conceptualized and wrote the manuscript. YL (Yuxuan Liu), EM and YD (Yunhong Ding) prepared the figures. YB and HZ prepared the tables. YD (Yucui Dong) and XZ revised this review. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Yucui Dong or Xiao Zhu.

Ethics declarations

Conflict of interest

All authors declare no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file1 (DOCX 24 kb)

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Meng, Y., Lei, Y., Gao, J. et al. Genome sequence assembly algorithms and misassembly identification methods. Mol Biol Rep 49, 11133–11148 (2022). https://doi.org/10.1007/s11033-022-07919-8

Received: 20 March 2022
Accepted: 05 September 2022
Published: 23 September 2022
Issue Date: November 2022
DOI: https://doi.org/10.1007/s11033-022-07919-8
