Approximate All-Pairs Suffix/Prefix Overlaps

  • Niko Välimäki
  • Susana Ladra
  • Veli Mäkinen
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6129)

Abstract

Finding approximate overlaps is the first phase of many sequence assembly methods. Given a set of r strings of total length n and an error-rate ε, the goal is to find, for all-pairs of strings, their suffix/prefix matches (overlaps) that are within edit distance k = ⌈εℓ⌉, where ℓ is the length of the overlap. We propose new solutions for this problem based on backward backtracking (Lam et al. 2008) and suffix filters (Kärkkäinen and Na, 2008). Techniques use nHk + o(nlogσ) + rlogr bits of space, where Hk is the k-th order entropy and σ the alphabet size. In practice, methods are easy to parallelize and scale up to millions of DNA reads.

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. 1.
    Burrows, M., Wheeler, D.: A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation (1994)Google Scholar
  2. 2.
    Roche Company. 454 life sciences, http://www.454.com/
  3. 3.
    Simpson, J.T., et al.: Abyss: A parallel assembler for short read sequence data. Genome Res. 19, 1117–1123 (2009)CrossRefMathSciNetGoogle Scholar
  4. 4.
    Morin, R.D., et al.: Profiling the hela s3 transcriptome using randomly primed cdna and massively parallel short-read sequencing. BioTechniques 45(1), 81–94 (2008)CrossRefGoogle Scholar
  5. 5.
    Li, R., et al.: Soap2. Bioinformatics 25(15), 1966–1967 (2009)CrossRefGoogle Scholar
  6. 6.
    Wicker, T., et al.: 454 sequencing put to the test using the complex genome of barley. BMC Genomics 7(1), 275 (2006)CrossRefGoogle Scholar
  7. 7.
    Ferragina, P., Manzini, G.: Indexing compressed texts. Journal of the ACM 52(4), 552–581 (2005)CrossRefMathSciNetGoogle Scholar
  8. 8.
    Ferragina, P., Manzini, G., Mäkinen, V., Navarro, G.: Compressed representations of sequences and full-text indexes. ACM Transactions on Algorithms (TALG) 3(2), article 20 (2007)Google Scholar
  9. 9.
    Gusfield, D.: Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press, Cambridge (1997)MATHGoogle Scholar
  10. 10.
    Hyyrö, H., Navarro, G.: Bit-parallel witnesses and their applications to approximate string matching. Algorithmica 41(3), 203–231 (2005)MATHCrossRefMathSciNetGoogle Scholar
  11. 11.
    Kärkkäinen, J., Na, J.C.: Faster filters for approximate string matching. In: Proc. ALENEX 2007, pp. 84–90. SIAM, Philadelphia (2007)Google Scholar
  12. 12.
    Kececioglu, J.D., Myers, E.W.: Combinatorial algorithms for dna sequence assembly. Algorithmica 13, 7–51 (1995)MATHCrossRefMathSciNetGoogle Scholar
  13. 13.
    Lam, T.W., Sung, W.K., Tam, S.L., Wong, C.K., Yiu, S.M.: Compressed indexing and local alignment of dna. Bioinformatics 24(6), 791–797 (2008)CrossRefGoogle Scholar
  14. 14.
    Langmead, B., Trapnell, C., Pop, M., Salzberg, S.L.: Ultrafast and memory-efficient alignment of short dna sequences to the human genome. Genome Biology 10(3), R25 (2009)Google Scholar
  15. 15.
    Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady 10(8), 707–710 (1966)MathSciNetGoogle Scholar
  16. 16.
    Li, H., Durbin, R.: Fast and accurate short read alignment with burrows-wheeler transform. Bioinformatics (2009), Advance accessGoogle Scholar
  17. 17.
    Mäkinen, V., Välimäki, N., Laaksonen, A., Katainen, R.: Unifying view of backward backtracking in short read mapping. In: Elomaa, T., Mannila, H., Orponen, P. (eds.) LNCS Festschrifts. Springer, Heidelberg (to appear 2010)Google Scholar
  18. 18.
    Mäkinen, V., Navarro, G.: Dynamic entropy-compressed sequences and full-text indexes. ACM Transactions on Algorithms 4(3) (2008)Google Scholar
  19. 19.
    Manber, U., Myers, G.: Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing 22(5), 935–948 (1993)MATHCrossRefMathSciNetGoogle Scholar
  20. 20.
    Myers, G.: A fast bit-vector algorithm for approximate string matching based on dynamic programming. J. ACM 46(3), 395–415 (1999)MATHCrossRefMathSciNetGoogle Scholar
  21. 21.
    Navarro, G.: A guided tour to approximate string matching. ACM Comput. Surveys 33(1), 31–88 (2001)CrossRefGoogle Scholar
  22. 22.
    Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Computing Surveys 39(1), article 2 (2007)Google Scholar
  23. 23.
    Pevzner, P., Tang, H., Waterman, M.: An eulerian path approach to dna fragment assembly. Proc. Natl. Acad. Sci. 98(17), 9748–9753 (2001)MATHCrossRefMathSciNetGoogle Scholar
  24. 24.
    Pop, M., Salzberg, S.L.: Bioinformatics challenges of new sequencing technology. Trends Genet. 24, 142–149 (2008)Google Scholar
  25. 25.
    Salmela, L.: Personal communication (2010)Google Scholar
  26. 26.
    Sellers, P.: The theory and computation of evolutionary distances: Pattern recognition. Journal of Algorithms 1(4), 359–373 (1980)MATHCrossRefMathSciNetGoogle Scholar
  27. 27.
    Wang, Z., Gerstein, M., Snyder, M.: Rna-seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics 10(1), 57–63 (2009)CrossRefGoogle Scholar
  28. 28.
    Weiner, P.: Linear pattern matching algorithm. In: Proc. 14th Annual IEEE Symposium on Switching and Automata Theory, pp. 1–11 (1973)Google Scholar
  29. 29.
    Zerbino, D.R., Birney, E.: Velvet: algorithms for de novo short read assembly using de bruijn graphs. Genome Research 18(5), 821–829 (2008)CrossRefGoogle Scholar

Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Niko Välimäki
    • 1
  • Susana Ladra
    • 2
  • Veli Mäkinen
    • 1
  1. 1.Department of Computer ScienceUniversity of HelsinkiFinland
  2. 2.Department of Computer ScienceUniversity of A CoruñaSpain

Personalised recommendations