Unified View of Backward Backtracking in Short Read Mapping

  • Veli Mäkinen
  • Niko Välimäki
  • Antti Laaksonen
  • Riku Katainen
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6060)


Mapping short DNA reads to a reference genome is the core task in recent high-throughput technologies for studying, e.g., protein-DNA interactions (ChIP-seq) and alternative splicing (RNA-seq). Several tools for the task (bowtie, bwa, SOAP2, TopHat) exploit the Burrows-Wheeler transform and the backward backtracking technique on it to map the reads to their best approximate occurrences in the genome. These tools use different tailored mechanisms for small error levels to prune the search phase significantly. We propose a new pruning mechanism that can be seen as a generalization of the tailored mechanisms used so far. It uses a novel idea of storing all cyclic rotations of fixed-length substrings of the reference sequence with a compressed index that is able to exploit the repetitions created to level out the growth of the input set. For RNA-seq, we propose a new method that combines dynamic programming with backtracking to map efficiently and correctly all reads that span two exons. The same mechanism can also be used for mapping mate-pair reads.
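
As an illustration of the baseline technique the abstract refers to, the following is a minimal sketch of backward backtracking over a Burrows-Wheeler index, allowing up to k mismatches. It is not the pruning mechanism proposed in the paper, nor the implementation of any of the tools named above. The function names (`build_index`, `search_with_mismatches`) are hypothetical, the index is built naively by sorting suffixes (only sensible for tiny inputs), and the text is assumed not to contain the terminator `$`.

```python
from collections import Counter


def build_index(text):
    """Naive BWT-style index for a small text (assumes '$' does not occur in text)."""
    t = text + "$"
    # Suffix array by direct sorting: fine for a demo, far too slow for a genome.
    sa = sorted(range(len(t)), key=lambda i: t[i:])
    bwt = "".join(t[i - 1] for i in sa)  # for i == 0 this wraps around to '$'
    alphabet = sorted(set(t))
    # C[c]: number of characters in t that are strictly smaller than c.
    counts = Counter(t)
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += counts[c]
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(t) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return sa, C, occ


def search_with_mismatches(text, pattern, k):
    """Return sorted start positions where pattern occurs with at most k mismatches."""
    sa, C, occ = build_index(text)
    hits = set()

    def extend(i, lo, hi, budget):
        # [lo, hi) is the suffix-array interval of the already-matched suffix
        # pattern[i+1:], with some characters possibly substituted.
        if i < 0:
            hits.update(sa[lo:hi])
            return
        for c in C:
            if c == "$":
                continue  # never match across the terminator
            cost = 0 if c == pattern[i] else 1
            if cost > budget:
                continue  # this branch would exceed the mismatch budget
            # One backward-search step: prepend c to the matched suffix.
            new_lo = C[c] + occ[c][lo]
            new_hi = C[c] + occ[c][hi]
            if new_lo < new_hi:
                extend(i - 1, new_lo, new_hi, budget - cost)

    extend(len(pattern) - 1, 0, len(text) + 1, k)
    return sorted(hits)


print(search_with_mismatches("ACGTACGT", "ACCT", 1))  # [0, 4]
```

The search reads the pattern right to left, and each branch is abandoned as soon as its suffix-array interval becomes empty or its mismatch budget is spent. Without further pruning the number of branches grows quickly with k, which is the cost that tailored pruning mechanisms aim to reduce.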


Personal genomics, full-text indexing, compressed data structures





Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Veli Mäkinen (1)
  • Niko Välimäki (1)
  • Antti Laaksonen (1)
  • Riku Katainen (1)
  1. Department of Computer Science, University of Helsinki, Finland
