Gap Filling as Exact Path Length Problem

  • Leena Salmela
  • Kristoffer Sahlin
  • Veli Mäkinen
  • Alexandru I. Tomescu
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9029)

Abstract

One of the last steps in a genome assembly project is filling the gaps between consecutive contigs in the scaffolds. This problem can be naturally stated as finding an \(s\)-\(t\) path in a directed graph whose sum of arc costs belongs to a given range (the estimate of the gap length). Here \(s\) and \(t\) are any two contigs flanking a gap. This problem is known to be NP-hard in general. We derive a dynamic programming solution that is simpler than those already known and is pseudo-polynomial in the maximum value of the input range. We implemented various practical optimizations of it and compared our exact gap filling solution experimentally to popular gap filling tools. Summed over all the bacterial assemblies considered in our experiments, our method fills 28% more gaps than the best previous tool, and the gaps it fills span 80% more sequence. Furthermore, the error level of the newly introduced sequence is comparable to that of the previous tools.
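
To make the problem statement concrete, the following is a minimal sketch of a textbook-style pseudo-polynomial dynamic program for it. It is not the authors' algorithm or implementation, which the abstract does not spell out. The sketch assumes positive integer arc costs and allows nodes to repeat (so it finds a walk rather than a simple path); the function name find_walk_in_range and the (u, v, cost) arc format are hypothetical choices made for illustration.

```python
from collections import defaultdict


def find_walk_in_range(arcs, s, t, lo, hi):
    """Find an s-t walk whose total arc cost lies in [lo, hi].

    arcs: iterable of (u, v, cost) triples with positive integer cost
          (an assumption of this sketch).
    Returns (list_of_nodes, total_cost), or None if no such walk exists.
    """
    incoming = defaultdict(list)
    nodes = {s, t}
    for u, v, cost in arcs:
        if cost <= 0:
            raise ValueError("this sketch requires positive integer costs")
        incoming[v].append((u, cost))
        nodes.update((u, v))

    if hi < 0:
        return None

    # pred[l][v] records the last arc (u, cost) of some s-v walk of total
    # cost exactly l. A missing key means no such walk exists.
    pred = [dict() for _ in range(hi + 1)]
    pred[0][s] = None  # the empty walk from s to itself
    for l in range(1, hi + 1):
        for v in nodes:
            for u, cost in incoming[v]:
                if cost <= l and u in pred[l - cost]:
                    pred[l][v] = (u, cost)
                    break

    # Report the smallest feasible length in the range and trace it back.
    for l in range(max(lo, 0), hi + 1):
        if t in pred[l]:
            walk, v, rem = [t], t, l
            while pred[rem][v] is not None:
                u, cost = pred[rem][v]
                walk.append(u)
                v, rem = u, rem - cost
            walk.reverse()
            return walk, l
    return None


if __name__ == "__main__":
    toy_arcs = [("s", "a", 2), ("a", "t", 3), ("s", "t", 9), ("a", "a", 1)]
    print(find_walk_in_range(toy_arcs, "s", "t", 6, 7))
```

The table has one row per length value up to the upper bound of the range, and each arc is inspected once per row. The running time is therefore proportional to the number of arcs times that upper bound, which is what makes the approach pseudo-polynomial rather than polynomial in the input size.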

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Leena Salmela (1)
  • Kristoffer Sahlin (2)
  • Veli Mäkinen (1)
  • Alexandru I. Tomescu (1)
  1. Helsinki Institute for Information Technology HIIT, Department of Computer Science, University of Helsinki, Helsinki, Finland
  2. Science for Life Laboratory, School of Computer Science and Communication, KTH Royal Institute of Technology, Solna, Sweden
