A Fast Approximate Algorithm for Mapping Long Reads to Large Reference Databases

  • Chirag Jain
  • Alexander Dilthey
  • Sergey Koren
  • Srinivas Aluru
  • Adam M. Phillippy
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10229)

Abstract

Emerging single-molecule sequencing technologies from Pacific Biosciences and Oxford Nanopore have revived interest in long read mapping algorithms. Alignment-based seed-and-extend methods demonstrate good accuracy but scale poorly, while faster alignment-free methods typically trade precision for efficiency. In this paper, we combine a fast approximate read mapping algorithm based on minimizers with a novel MinHash identity estimation technique to achieve both scalability and precision. In contrast to prior methods, we develop a mathematical framework that defines the types of mapping targets we uncover, establish probabilistic estimates of p-value and sensitivity, and demonstrate tolerance for alignment error rates up to 20%. With this framework, our algorithm automatically adapts to different minimum length and identity requirements and provides both positional and identity estimates for each mapping reported. For mapping human PacBio reads to the hg38 reference, our method is 290x faster than BWA-MEM, uses less memory, and achieves a recall rate of 96%. We further demonstrate the scalability of our method by mapping noisy PacBio reads (each \(\ge 5\) kbp in length) to the complete NCBI RefSeq database containing 838 Gbp of sequence and \(> 60,000\) genomes.
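
The abstract names the two ingredients (minimizer sampling and MinHash-based identity estimation) without giving formulas. The following is a minimal sketch of how such ingredients are commonly combined, not the authors' implementation. All function names, the BLAKE2b hash, and the parameters `k`, `w`, and `s` are illustrative assumptions. The Jaccard-to-identity conversion is the Mash-style formula, which assumes independent, Poisson-distributed errors; whether the paper uses exactly this form is not stated in the abstract. Reverse-complement handling is omitted.

```python
import hashlib
import math


def kmer_hash(kmer: str) -> int:
    """Deterministic 64-bit hash of a k-mer (hypothetical helper)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minimizers(seq: str, k: int, w: int) -> set[int]:
    """Smallest k-mer hash in each window of w consecutive k-mers.

    Naive O(n * w) scan for clarity; a sliding-window minimum would be faster.
    """
    hashes = [kmer_hash(seq[i:i + k]) for i in range(len(seq) - k + 1)]
    if len(hashes) < w:
        # Sequence shorter than one full window: fall back to a single minimum.
        return {min(hashes)} if hashes else set()
    return {min(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}


def minhash_jaccard(a: set[int], b: set[int], s: int) -> float:
    """Bottom-s MinHash estimate of the Jaccard similarity of two hash sets."""
    sketch_a = set(sorted(a)[:s])
    sketch_b = set(sorted(b)[:s])
    # The s smallest hashes of the union form the sketch of the union.
    union_sketch = sorted(sketch_a | sketch_b)[:s]
    if not union_sketch:
        return 0.0
    shared = sum(1 for h in union_sketch if h in sketch_a and h in sketch_b)
    return shared / len(union_sketch)


def jaccard_to_identity(j: float, k: int) -> float:
    """Mash-style conversion from Jaccard similarity to estimated identity.

    Assumption: errors are independent and Poisson-distributed.
    """
    if j <= 0.0:
        return 0.0
    distance = -math.log(2.0 * j / (1.0 + j)) / k
    return max(0.0, 1.0 - distance)
```

A read would be sketched once with `minimizers`, compared against the minimizers of a candidate reference window with `minhash_jaccard`, and the result passed to `jaccard_to_identity` to obtain an identity estimate that can be checked against a minimum identity threshold.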

Keywords

Long read mapping, Jaccard, MinHash, Winnowing, Minimizers, Sketching, Nanopore, PacBio

Notes

Acknowledgments

This research was supported in part by the Intramural Research Program of the National Human Genome Research Institute, National Institutes of Health, and the U.S. National Science Foundation under IIS-1416259.

Copyright information

© Springer International Publishing AG (outside the US) 2017

Authors and Affiliations

  • Chirag Jain (1, 2)
  • Alexander Dilthey (2)
  • Sergey Koren (2)
  • Srinivas Aluru (1)
  • Adam M. Phillippy (2)

  1. Georgia Institute of Technology, Atlanta, USA
  2. National Institutes of Health, Bethesda, USA
