mpscan: Fast Localisation of Multiple Reads in Genomes

Rivals, Eric; Salmela, Leena; Kiiskinen, Petteri; Kalsi, Petri; Tarhio, Jorma

doi:10.1007/978-3-642-04241-6_21

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 5724)

Included in the following conference series:

International Workshop on Algorithms in Bioinformatics

Abstract

With Next Generation Sequencers, sequence-based transcriptomic or epigenomic assays yield millions of short sequence reads that need to be mapped back onto a reference genome. The upcoming versions of these sequencers promise even higher sequencing capacities; this may turn the read mapping task into a bottleneck, for which alternative pattern matching approaches must be explored. We present an algorithm and its implementation, called mpscan, which uses a sophisticated filtration scheme to match a set of patterns/reads exactly on a sequence. mpscan can search for millions of reads in a single pass through the genome without indexing its sequence. Moreover, we show that mpscan offers an optimal average time complexity, which is sublinear in the text length, meaning that it does not need to examine all sequence positions. Comparisons with BLAT-like tools and with six specialised read mapping programs (like bowtie or zoom) demonstrate that mpscan is also the fastest algorithm in practice for exact matching. Our accuracy and scalability comparisons reveal that some tools are inappropriate for read mapping. Moreover, we provide evidence suggesting that exact matching may be a valuable solution in some read mapping applications. As most read mapping programs rely in some way on exact matching procedures to perform approximate pattern mapping, the filtration scheme we tested may prove useful in the design of future algorithms. The absence of a genome index gives mpscan its low memory requirement and the flexibility to run on a desktop computer, and avoids time-consuming genome preprocessing.
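
To make the idea of filtration-based, index-free multi-read matching concrete, here is a minimal Python sketch. It is not mpscan's actual filtration scheme, which the abstract does not detail; it substitutes a simple Wu-Manber-style q-gram shift table to show how a scan can skip genome positions while still reporting every exact occurrence. The names `build_shift_table` and `scan`, the parameter `q`, and the assumption that all reads share one length are illustrative choices, not taken from the paper.

```python
from typing import Dict, Iterator, List, Set, Tuple


def build_shift_table(reads: Set[str], m: int, q: int) -> Dict[str, int]:
    """Map each q-gram found in any read to the smallest distance between
    the end of that q-gram and the end of the read (0 = q-gram is a suffix)."""
    shift: Dict[str, int] = {}
    for read in reads:
        for end in range(q, m + 1):
            gram = read[end - q:end]
            dist = m - end
            if dist < shift.get(gram, m):
                shift[gram] = dist
    return shift


def scan(genome: str, reads: List[str], q: int = 3) -> Iterator[Tuple[int, str]]:
    """Yield (start position, read) for every exact occurrence of a read.

    Hypothetical sketch: assumes all reads have the same length m >= q.
    """
    read_set = set(reads)
    if not read_set:
        return
    m = len(next(iter(read_set)))
    if any(len(r) != m for r in read_set):
        raise ValueError("this sketch assumes all reads have the same length")
    if not 1 <= q <= m:
        raise ValueError("q must satisfy 1 <= q <= read length")

    shift = build_shift_table(read_set, m, q)
    default_shift = m - q + 1  # safe jump when the q-gram is in no read

    pos = m  # exclusive end of the current window genome[pos - m:pos]
    while pos <= len(genome):
        s = shift.get(genome[pos - q:pos], default_shift)
        if s == 0:
            # Filter passed: the window's last q-gram is a suffix of some
            # read, so verify the whole window against the read set.
            window = genome[pos - m:pos]
            if window in read_set:
                yield pos - m, window
            pos += 1
        else:
            # Filter failed: no read can end before pos + s, so skip ahead.
            pos += s


if __name__ == "__main__":
    for start, read in scan("GGACGTACGTTTTTA", ["ACGT", "CGTA", "TTTT"]):
        print(start, read)
```

In this sketch, the only preprocessing is on the read set, never on the genome, and most windows are dismissed after inspecting a single q-gram. When the q-gram appears in no read, the window jumps forward by `m - q + 1` positions, which is how a scan of this kind can avoid examining every genome position.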

Author information

Authors and Affiliations

LIRMM, CNRS and Université de Montpellier 2, Montpellier, France
Eric Rivals & Petri Kalsi
Helsinki University of Technology, P.O. Box 5400, FI-02015, TKK, Finland
Leena Salmela, Petteri Kiiskinen & Jorma Tarhio

Editor information

Editors and Affiliations

Center for Bioinformatics and Computational Biology, and Department of Computer Science, University of Maryland, College Park, MD, USA
Steven L. Salzberg
Department of Computer Sciences, The University of Texas at Austin, TX, USA
Tandy Warnow

Cite this paper

Rivals, E., Salmela, L., Kiiskinen, P., Kalsi, P., Tarhio, J. (2009). mpscan: Fast Localisation of Multiple Reads in Genomes. In: Salzberg, S.L., Warnow, T. (eds) Algorithms in Bioinformatics. WABI 2009. Lecture Notes in Computer Science, vol 5724. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04241-6_21

DOI: https://doi.org/10.1007/978-3-642-04241-6_21
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-04240-9
Online ISBN: 978-3-642-04241-6
eBook Packages: Computer Science, Computer Science (R0)
