A Fast Approximate Algorithm for Mapping Long Reads to Large Reference Databases

  • Conference paper
Part of the book series: Lecture Notes in Computer Science (LNBI, volume 10229)

Abstract

Emerging single-molecule sequencing technologies from Pacific Biosciences and Oxford Nanopore have revived interest in long-read mapping algorithms. Alignment-based seed-and-extend methods demonstrate good accuracy but face limited scalability, while faster alignment-free methods typically trade precision for efficiency. In this paper, we combine a fast approximate read mapping algorithm based on minimizers with a novel MinHash identity estimation technique to achieve both scalability and precision. In contrast to prior methods, we develop a mathematical framework that defines the types of mapping targets we uncover, establish probabilistic estimates of p-value and sensitivity, and demonstrate tolerance for alignment error rates up to 20%. With this framework, our algorithm automatically adapts to different minimum length and identity requirements and provides both positional and identity estimates for each mapping reported. For mapping human PacBio reads to the hg38 reference, our method is 290x faster than BWA-MEM, with a lower memory footprint and a recall rate of 96%. We further demonstrate the scalability of our method by mapping noisy PacBio reads (each \(\ge 5\) kbp in length) to the complete NCBI RefSeq database containing 838 Gbp of sequence and \(> 60,000\) genomes.
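The two ingredients named in the abstract, minimizer sampling and MinHash-based identity estimation, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the choice of hash (BLAKE2b truncated to 64 bits), and the parameters `k`, `w`, and `s` are assumptions made for illustration. The Jaccard-to-identity conversion uses the Mash distance formula of Ondov et al. (2016), and the abstract does not state whether the paper applies exactly this form.

```python
import hashlib
import math


def kmer_hash(kmer):
    """Deterministic 64-bit hash of a k-mer (the hash choice is an assumption)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minimizers(seq, k, w):
    """Return the set of minimizer hashes: the smallest k-mer hash in each
    window of w consecutive k-mers."""
    hashes = [kmer_hash(seq[i:i + k]) for i in range(len(seq) - k + 1)]
    if not hashes:
        return set()
    # A sequence shorter than one full window still yields a single window.
    n_windows = max(1, len(hashes) - w + 1)
    return {min(hashes[i:i + w]) for i in range(n_windows)}


def minhash_jaccard(a, b, s):
    """Bottom-s MinHash estimate of the Jaccard similarity of two hash sets."""
    sketch_a = set(sorted(a)[:s])
    sketch_b = set(sorted(b)[:s])
    # The s smallest hashes of the union act as a random sample of the union.
    merged = sorted(sketch_a | sketch_b)[:s]
    if not merged:
        return 0.0
    shared = sum(1 for h in merged if h in sketch_a and h in sketch_b)
    return shared / len(merged)


def identity_from_jaccard(j, k):
    """Mash-style conversion (assumed here): distance = -(1/k) * ln(2j / (1 + j)),
    identity = 1 - distance, clamped to [0, 1]."""
    if j <= 0.0:
        return 0.0
    identity = 1.0 + math.log(2.0 * j / (1.0 + j)) / k
    return max(0.0, min(1.0, identity))


if __name__ == "__main__":
    read = "ACGTTGCAACGTAGCTAGCTAGGATCCGATCGATCGTACG"
    region = "ACGTTGCAACGTAGCTAGCTAGGATCCGATCGATCGTACG"
    k, w, s = 8, 5, 16
    j = minhash_jaccard(minimizers(read, k, w), minimizers(region, k, w), s)
    print(j, identity_from_jaccard(j, k))
```

In a mapper of this kind, the read's minimizers would be looked up in an index of reference minimizers to find candidate regions, and the Jaccard estimate would then be computed per candidate. The index and the candidate search are omitted from this sketch.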

The rights of this work are transferred to the extent transferable according to title 17 § 105 U.S.C.

Acknowledgments

This research was supported in part by the Intramural Research Program of the National Human Genome Research Institute, National Institutes of Health, and the U.S. National Science Foundation under IIS-1416259.

Author information

Corresponding author

Correspondence to Adam M. Phillippy.

Copyright information

© 2017 Springer International Publishing AG (outside the US)

About this paper

Cite this paper

Jain, C., Dilthey, A., Koren, S., Aluru, S., Phillippy, A.M. (2017). A Fast Approximate Algorithm for Mapping Long Reads to Large Reference Databases. In: Sahinalp, S. (ed.) Research in Computational Molecular Biology. RECOMB 2017. Lecture Notes in Computer Science, vol 10229. Springer, Cham. https://doi.org/10.1007/978-3-319-56970-3_5

  • DOI: https://doi.org/10.1007/978-3-319-56970-3_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-56969-7

  • Online ISBN: 978-3-319-56970-3

  • eBook Packages: Computer Science, Computer Science (R0)
