Machine Learning-Based Imputation of Missing SNP Genotypes in SNP Genotype Arrays

Mihajlovic, Aleksandar R.

doi:10.1007/978-1-4614-8785-2_6

Abstract

This chapter introduces the missing value problem in SNP genotype data sets and gives a short overview of two commonly used imputation algorithms, fastPHASE and KNNimpute, that are applied to resolve it. The two algorithms are compared, and preliminary biological and mathematical background is provided to support a better understanding of both the problem and the algorithms.
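As a rough illustration of the nearest-neighbor idea behind KNNimpute-style methods, the following minimal sketch fills missing genotype calls by majority vote among the most similar individuals. It is not the chapter's implementation: the 0/1/2 genotype coding, the `MISSING` marker, the mismatch-fraction distance, the tie-breaking rule, and the function names `distance` and `knn_impute` are all assumptions made for this example.

```python
from collections import Counter

MISSING = None  # hypothetical marker for an uncalled genotype


def distance(a, b):
    """Fraction of mismatching genotypes over SNPs observed in both rows."""
    shared = [(x, y) for x, y in zip(a, b) if x is not MISSING and y is not MISSING]
    if not shared:
        return float("inf")  # no overlap: treat as maximally distant
    return sum(x != y for x, y in shared) / len(shared)


def knn_impute(genotypes, k=3):
    """Return a copy of `genotypes` with missing calls filled by neighbor vote."""
    imputed = [row[:] for row in genotypes]
    for i, row in enumerate(genotypes):
        for j, call in enumerate(row):
            if call is not MISSING:
                continue
            # Candidate donors: other individuals with an observed call at SNP j.
            donors = [
                (distance(row, other), other[j])
                for r, other in enumerate(genotypes)
                if r != i and other[j] is not MISSING
            ]
            if not donors:
                continue  # nothing to vote with; leave the call missing
            donors.sort(key=lambda d: d[0])
            votes = Counter(g for _, g in donors[:k])
            # Majority vote; ties go to the smaller genotype code (an assumption).
            imputed[i][j] = min(votes, key=lambda g: (-votes[g], g))
    return imputed


# Rows are individuals, columns are SNPs (assumed 0/1/2 minor-allele counts).
samples = [
    [0, 1, 2, 0],
    [0, 1, MISSING, 0],
    [0, 1, 2, 0],
    [2, 0, 0, 1],
    [2, 0, 0, 1],
    [2, 0, 0, MISSING],
]
filled = knn_impute(samples, k=2)
```

In this toy data set, the second individual matches the first and third on every shared SNP, so its missing call is taken from them; the last individual is filled from the fourth and fifth in the same way. Distances are computed on the original matrix, so earlier imputed values do not influence later ones.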

Notes

1. The agents of evolutionary change are mutation, genetic drift, sexual selection, natural selection, and gene flow.

Author information

Authors and Affiliations

Mathematical Institute of the Serbian Academy of Sciences and Arts (MISANU), Belgrade, Serbia
Aleksandar R. Mihajlovic

Corresponding author

Correspondence to Aleksandar R. Mihajlovic.

Editor information

Editors and Affiliations

Mathematical Institute, Serbian Academy of Sciences and Arts, Belgrade, Serbia
Goran Rakocevic
Faculty of Engineering, University of Kragujevac, Kragujevac, Serbia
Tijana Djukic
Faculty of Engineering, University of Kragujevac, Kragujevac, Serbia
Nenad Filipovic
School of Electrical Engineering, University of Belgrade, Belgrade, Serbia
Veljko Milutinović

About this chapter

Cite this chapter

Mihajlovic, A.R. (2013). Machine Learning-Based Imputation of Missing SNP Genotypes in SNP Genotype Arrays. In: Rakocevic, G., Djukic, T., Filipovic, N., Milutinović, V. (eds) Computational Medicine in Data Mining and Modeling. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-8785-2_6

DOI: https://doi.org/10.1007/978-1-4614-8785-2_6
Published: 19 September 2013
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-8784-5
Online ISBN: 978-1-4614-8785-2
eBook Packages: Computer Science (R0)
