Machine Learning-Based Imputation of Missing SNP Genotypes in SNP Genotype Arrays

Computational Medicine in Data Mining and Modeling

Abstract

This chapter introduces the missing value problem in SNP genotype data sets and gives a short overview of two commonly used imputation algorithms, fastPHASE and KNNimpute, that address it. The two algorithms are compared, and preliminary biological and mathematical background is provided to support a better understanding of the problem and of both algorithms.

Notes

  1. The agents of evolutionary change are mutation, genetic drift, sexual selection, natural selection, and gene flow.

Author information

Correspondence to Aleksandar R. Mihajlovic.

Copyright information

© 2013 Springer Science+Business Media New York

About this chapter

Cite this chapter

Mihajlovic, A.R. (2013). Machine Learning-Based Imputation of Missing SNP Genotypes in SNP Genotype Arrays. In: Rakocevic, G., Djukic, T., Filipovic, N., Milutinović, V. (eds) Computational Medicine in Data Mining and Modeling. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-8785-2_6

  • DOI: https://doi.org/10.1007/978-1-4614-8785-2_6

  • Publisher Name: Springer, New York, NY

  • Print ISBN: 978-1-4614-8784-5

  • Online ISBN: 978-1-4614-8785-2

  • eBook Packages: Computer Science; Computer Science (R0)
