Genome sequence assembly algorithms and misassembly identification methods

  • Review
Molecular Biology Reports

Abstract

Sequence assembly algorithms have evolved rapidly alongside the growth of genome sequencing technology over the past two decades. Assembly reconstructs the target genome mainly by iteratively extending overlap relationships between sequences. Assembly algorithms are typically classified into several categories, such as the Greedy strategy, the Overlap-Layout-Consensus (OLC) strategy, and the de Bruijn graph (DBG) strategy. In particular, the rapid development of third-generation sequencing (TGS) technology has prompted assembly algorithms that generate high-quality, chromosome-level assemblies. However, because of genome complexity, the limited length of short reads, and the high error rate of long reads, the contigs produced by assembly may contain misassemblies that adversely affect downstream data analysis. Several read-based and reference-based misassembly identification methods have therefore been developed to improve assembly quality. This review covers the development of DNA sequencing technologies and summarizes sequencing data simulation methods, sequencing error correction methods, mainstream sequence assembly algorithms, and misassembly identification methods. Because the heavy computational load makes the sequence assembly problem more challenging, more efficient and accurate assembly algorithms, as well as alternative algorithms, still need to be developed.
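The idea of iteratively extending overlaps can be illustrated with a toy Greedy-strategy sketch. This is not taken from any assembler discussed in the review; the function names `overlap` and `greedy_assemble`, the `min_len` threshold, and the example reads are all assumptions chosen for illustration, and the sketch ignores sequencing errors, reverse complements, and repeats.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` bases exists.
    """
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the longest overlap.

    Stops when no remaining pair overlaps by at least `min_len` bases
    and returns the resulting contigs.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return contigs
        # Join the two sequences, writing the shared bases only once.
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)]
        contigs.append(merged)


if __name__ == "__main__":
    print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"]))
```

The all-against-all comparison in each round is quadratic in the number of sequences, which hints at why computation is a central concern for real assemblers; OLC and DBG approaches organize the same overlap information into graphs rather than merging pairs one at a time.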

Abbreviations

bp: Base pair
CCS: Circular Consensus Sequencing
CLR: Continuous Long Reads
DBG: De Bruijn graph
DDF: Distance difference factor
FBG: Fuzzy-Bruijn graph
FCD: Fragment coverage distribution
FPR: False-positive rate
GFA: Graph fragment assembly
GPU: Graphics processing unit
HiFi: High fidelity
HTS: High-throughput sequencing
kb: Kilobase pair
Mb: Megabase pair
OLC: Overlap-Layout-Consensus
POA: Partial order alignment
QVs: Quality values
SMRT: Single-Molecule Real-Time
SNP: Single nucleotide polymorphism
SV: Structural variation
SVs: Structural variations
tf-idf: Term frequency, inverse document frequency
TGS: Third-generation sequencing
WGS: Whole-genome sequencing

Funding

This work was supported by grants from the National Natural Science Foundation of China (61902094), the Heilongjiang Provincial Natural Science Foundation of China (QC2018082), the Shandong Provincial Natural Science Foundation of China (ZR2021MH036), the Science and Technology Program of Binzhou Medical University (50012304325), the Fundamental Research Funds for the Central University (2017-KYYWF-0140), the University Nursing Program for Young Scholars with Creative Talents in Heilongjiang Province (UNPYSCT) (UNPYSCT-2018183), the Doctoral Scientific Research Foundation of Harbin Normal University (XKB201916, XKB201801), the Science Foundation of School of Computer Science and Information Engineering of Harbin Normal University (JKYKYZ202004, JKYKYZ202104), and the Graduate Innovative Research Project of Harbin Normal University (HSDSSCX2021-31).

Author information

Contributions

YM, YL (Yu Lei) and JG conceptualized and wrote the manuscript. YL (Yuxuan Liu), EM and YD (Yunhong Ding) prepared the figures. YB and HZ prepared the tables. YD (Yucui Dong) and XZ revised this review. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Yucui Dong or Xiao Zhu.

Ethics declarations

Conflict of interest

All authors declare no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file1 (DOCX 24 kb)

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Meng, Y., Lei, Y., Gao, J. et al. Genome sequence assembly algorithms and misassembly identification methods. Mol Biol Rep 49, 11133–11148 (2022). https://doi.org/10.1007/s11033-022-07919-8

  • DOI: https://doi.org/10.1007/s11033-022-07919-8
