mpscan: Fast Localisation of Multiple Reads in Genomes

  • Eric Rivals
  • Leena Salmela
  • Petteri Kiiskinen
  • Petri Kalsi
  • Jorma Tarhio
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5724)

Abstract

With Next Generation Sequencers, sequence-based transcriptomic or epigenomic assays yield millions of short sequence reads that need to be mapped back onto a reference genome. Upcoming versions of these sequencers promise even higher sequencing capacities; this may turn read mapping into a bottleneck, for which alternative pattern matching approaches must be explored. We present an algorithm and its implementation, called mpscan, which uses a sophisticated filtration scheme to match a set of patterns/reads exactly on a sequence. mpscan can search for millions of reads in a single pass through the genome without indexing its sequence. Moreover, we show that mpscan offers an optimal average time complexity, which is sublinear in the text length, meaning that it does not need to examine all sequence positions. Comparisons with BLAT-like tools and with six specialised read mapping programs (such as bowtie or zoom) demonstrate that mpscan is also the fastest algorithm in practice for exact matching. Our accuracy and scalability comparisons reveal that some tools are inappropriate for read mapping. Moreover, we provide evidence suggesting that exact matching may be a valuable solution in some read mapping applications. As most read mapping programs rely in some way on exact matching procedures to perform approximate pattern mapping, the filtration scheme we experimented with may prove useful in the design of future algorithms. The absence of a genome index gives mpscan a low memory requirement and the flexibility to run on a desktop computer, and avoids time-consuming genome preprocessing.
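To make the idea of filtration-based exact matching of many reads concrete, here is a minimal Python sketch. It is not the mpscan implementation, and the abstract does not describe the actual filtration scheme. It assumes all reads share the same length, and the names `scan_reads`, `build_qgram_filter`, and the parameter `q` are hypothetical. The filter is a simple one: if the last q-gram of the current text window occurs in no read, no read can match at any start position whose window would cover that q-gram, so the scan can skip ahead without examining those positions.

```python
from collections import defaultdict


def build_qgram_filter(reads, q):
    """Collect every q-gram that occurs in at least one read."""
    return {read[j:j + q] for read in reads for j in range(len(read) - q + 1)}


def scan_reads(text, reads, q=4):
    """Report (text_position, read_index) for every exact read occurrence.

    Illustrative assumption: all reads have the same length m.
    """
    if not reads:
        return []
    m = len(reads[0])
    if any(len(read) != m for read in reads):
        raise ValueError("this sketch assumes reads of equal length")
    if not 1 <= q <= m:
        raise ValueError("q must satisfy 1 <= q <= read length")

    qgrams = build_qgram_filter(reads, q)

    # Map each distinct read string to the indices of the reads equal to it.
    read_ids = defaultdict(list)
    for idx, read in enumerate(reads):
        read_ids[read].append(idx)

    hits = []
    i = 0
    while i + m <= len(text):
        # Filtration: inspect only the last q-gram of the window text[i:i+m].
        if text[i + m - q:i + m] not in qgrams:
            # Every window starting at i .. i+m-q contains this q-gram,
            # so none of them can equal a read: skip all of them at once.
            i += m - q + 1
            continue
        # Verification: exact lookup of the whole window among the reads.
        for idx in read_ids.get(text[i:i + m], ()):
            hits.append((i, idx))
        i += 1
    return hits
```

For example, `scan_reads("ACGTACGTTTGACCA", ["ACGT", "TTGA", "GACC"], q=2)` returns `[(0, 0), (4, 0), (8, 1), (10, 2)]`. The skip step is what allows a scan of this kind to avoid reading every text position; how often it fires depends on `q`, the read length, and the number of reads.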



Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Eric Rivals
    • 1
  • Leena Salmela
    • 2
  • Petteri Kiiskinen
    • 2
  • Petri Kalsi
    • 1
  • Jorma Tarhio
    • 2
  1. LIRMM, CNRS and Université de Montpellier 2, Montpellier, France
  2. Helsinki University of Technology, TKK, Finland
