Statistics in Biosciences, Volume 5, Issue 1, pp 3–25

Single Nucleotide Polymorphism (SNP) Detection and Genotype Calling from Massively Parallel Sequencing (MPS) Data

  • Yun Li
  • Wei Chen
  • Eric Yi Liu
  • Yi-Hui Zhou


Massively parallel sequencing (MPS), since its debut in 2005, has transformed the field of genomic studies. These new sequencing technologies have resulted in the successful identification of causal variants for several rare Mendelian disorders. They have also begun to deliver on their promise to explain some of the missing heritability from genome-wide association studies (GWAS) of complex traits. We anticipate a rapidly growing number of MPS-based studies for a diverse range of applications in the near future. One crucial and nearly inevitable step is to detect SNPs and call genotypes at the detected polymorphic sites from the sequencing data. Here, we review statistical methods that have been proposed in the past five years for this purpose. In addition, we discuss emerging issues and future directions related to SNP detection and genotype calling from MPS data.


Massively parallel sequencing; Next-generation sequencing; SNP detection; Genotype calling; Linkage disequilibrium (LD)



The authors thank Mingyao Li and Andrea Byrnes for critical reading of earlier versions of the manuscript. We are also grateful to an anonymous reviewer, whose comments have resulted in an improved manuscript. This research was supported by National Institutes of Health Grants R01 HG006292-01 and HG006703-01 (to Y.L.).



Copyright information

© International Chinese Statistical Association 2012

Authors and Affiliations

  1. Department of Genetics, University of North Carolina, Chapel Hill, USA
  2. Department of Biostatistics, University of North Carolina, Chapel Hill, USA
  3. Division of Pediatric Pulmonary Medicine, Allergy and Immunology, Department of Pediatrics, Children’s Hospital of Pittsburgh of UPMC, University of Pittsburgh School of Medicine, Pittsburgh, USA
  4. Department of Computer Science, University of North Carolina, Chapel Hill, USA
