Changepoint Analysis for Efficient Variant Calling

  • Adam Bloniarz
  • Ameet Talwalkar
  • Jonathan Terhorst
  • Michael I. Jordan
  • David Patterson
  • Bin Yu
  • Yun S. Song
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8394)

Abstract

We present CAGe, a statistical algorithm that exploits the high sequence identity between sampled genomes and a reference assembly to streamline variant calling. Combining changepoint detection, classification, and online variant detection, CAGe calls simple variants quickly and accurately on the 90–95% of a sampled genome that differs little from the reference, while correctly learning which remaining 5–10% must be processed using more computationally expensive methods. CAGe runs on a deeply sequenced human whole-genome sample in approximately 20 minutes, potentially reducing the burden of variant calling by an order of magnitude after a single memory-efficient pass over the data.
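
To make the changepoint-detection step concrete, the following is a minimal sketch of penalized optimal partitioning applied to a per-position signal such as read coverage. It is not CAGe's implementation: the squared-error segment cost, the `penalty` value, and the function names `segment_cost` and `find_changepoints` are all assumptions chosen for illustration, and the simple quadratic-time recursion would need pruning to scale to a whole genome.

```python
import math


def segment_cost(prefix, prefix_sq, i, j):
    """Sum of squared deviations from the mean of values[i:j] (assumed cost)."""
    n = j - i
    s = prefix[j] - prefix[i]
    sq = prefix_sq[j] - prefix_sq[i]
    return sq - s * s / n


def find_changepoints(values, penalty):
    """Return the segment boundaries minimizing total cost + penalty per segment.

    Illustrative O(n^2) dynamic program; `penalty` is a hypothetical tuning
    parameter controlling how readily a new segment is opened.
    """
    n = len(values)
    # Prefix sums allow each segment cost to be computed in constant time.
    prefix, prefix_sq = [0.0], [0.0]
    for v in values:
        prefix.append(prefix[-1] + v)
        prefix_sq.append(prefix_sq[-1] + v * v)

    best = [0.0] * (n + 1)   # best[j]: optimal penalized cost of values[:j]
    best[0] = -penalty       # so that the first segment pays exactly one penalty
    last = [0] * (n + 1)     # last[j]: start index of the final segment in values[:j]

    for j in range(1, n + 1):
        best_cost = math.inf
        for i in range(j):
            c = best[i] + segment_cost(prefix, prefix_sq, i, j) + penalty
            if c < best_cost:
                best_cost = c
                last[j] = i
        best[j] = best_cost

    # Walk back through the stored segment starts to recover the changepoints.
    changepoints = []
    j = n
    while j > 0:
        i = last[j]
        if i > 0:
            changepoints.append(i)
        j = i
    return sorted(changepoints)


# Toy coverage track: a low-coverage stretch between two normal-coverage regions.
coverage = [30] * 10 + [5] * 10 + [30] * 10
print(find_changepoints(coverage, penalty=50.0))  # [10, 20]
```

In a pipeline of the kind the abstract describes, the segments between such changepoints would then be passed to a classifier that decides which regions are simple enough for fast variant calling; that step is not shown here.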

Keywords

genome complexity; next-generation sequencing; variant calling; changepoint detection

Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Adam Bloniarz (1)
  • Ameet Talwalkar (2)
  • Jonathan Terhorst (1)
  • Michael I. Jordan (1, 2)
  • David Patterson (2)
  • Bin Yu (1)
  • Yun S. Song (1, 2, 3)
  1. Department of Statistics, University of California, Berkeley, USA
  2. Computer Science Division, University of California, Berkeley, USA
  3. Department of Integrative Biology, University of California, Berkeley, USA
