Tag SNP Selection Based on Multivariate Linear Regression

  • Jingwu He
  • Alex Zelikovsky
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3992)


The search for associations between complex diseases and single nucleotide polymorphisms (SNPs) or haplotypes has recently received great attention. For these studies, it is essential to use a small subset of informative SNPs (tag SNPs) that accurately represents the rest of the SNPs. Tagging can achieve budget savings by genotyping only a limited number of SNPs and computationally inferring all other SNPs, and it can compact extremely long SNP sequences (obtained, e.g., from the Affymetrix Map Array) for further fine genotype analysis. Tagging should first choose tags from the SNPs under consideration and then, knowing the values of the chosen tag SNPs, predict (or statistically cover) the non-tag SNPs. In this paper we propose a new SNP prediction method based on rounding of multivariate linear regression (MLR) analysis in sigma-restricted coding. When predicting a non-tag SNP, the MLR method accumulates information about all tag SNPs, resulting in significantly higher prediction accuracy with the same number of tags than previously known tagging methods. We also show that tag selection strongly depends on how the chosen tags will be used: the advantage of one tag set over another can only be considered with respect to a certain prediction method. Two simple universal tag selection methods have been applied: a (faster) stepwise and a (slower) local-minimization tag selection algorithm. An extensive experimental study on various datasets, including 6 regions from HapMap, shows that MLR prediction combined with stepwise tag selection uses significantly fewer tags (e.g., up to two times fewer tags to reach 90% prediction accuracy) than the state-of-the-art methods of Halperin et al. [8] for genotypes and Halldorsson et al. [7] for haplotypes, respectively. Our stepwise tagging matches the quality of STAMPA [8] while being faster. The code is publicly available at
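The prediction and stepwise selection ideas described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes genotypes are stored as minor-allele counts 0/1/2, assumes a coding that maps them to -1/0/1, fits ordinary least squares without an intercept, and scores tag sets by in-sample accuracy. All function names (`sigma_code`, `mlr_predict`, `accuracy`, `stepwise_tags`) and the toy data are hypothetical.

```python
import numpy as np


def sigma_code(genotypes):
    """Map genotype values 0/1/2 to -1/0/1 (an assumed coding for this sketch)."""
    return np.asarray(genotypes, dtype=float) - 1.0


def mlr_predict(train_tags, train_target, new_tags):
    """Predict coded non-tag SNP values from coded tag SNP values.

    Fits least-squares coefficients on the training sample, applies them to
    the new tag values, and rounds the result to the nearest of -1, 0, 1.
    """
    coef, *_ = np.linalg.lstsq(train_tags, train_target, rcond=None)
    raw = np.atleast_2d(new_tags) @ coef
    return np.clip(np.rint(raw), -1, 1)


def accuracy(data, tags):
    """Fraction of non-tag entries predicted correctly (in-sample, an assumption)."""
    others = [j for j in range(data.shape[1]) if j not in tags]
    if not others:
        return 1.0
    pred = mlr_predict(data[:, tags], data[:, others], data[:, tags])
    return float(np.mean(pred == data[:, others]))


def stepwise_tags(data, k):
    """Greedy stepwise selection: repeatedly add the SNP that most improves accuracy."""
    tags = []
    for _ in range(k):
        candidates = [j for j in range(data.shape[1]) if j not in tags]
        best = max(candidates, key=lambda j: accuracy(data, tags + [j]))
        tags.append(best)
    return tags


# Toy sample: 6 individuals, 4 SNPs. SNP 2 copies SNP 0, and SNP 3 mirrors SNP 1.
raw = np.array([
    [0, 0, 0, 2],
    [1, 0, 1, 2],
    [2, 1, 2, 1],
    [0, 2, 0, 0],
    [1, 2, 1, 0],
    [2, 1, 2, 1],
])
data = sigma_code(raw)
tags = stepwise_tags(data, 2)
print("tags:", tags, "accuracy:", accuracy(data, tags))
```

Because the score is computed by running the predictor itself, a tag set chosen this way is tied to this particular prediction method, which mirrors the abstract's point that the quality of a tag set can only be judged relative to a prediction method.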


Keywords: Prediction Accuracy; Statistical Covering; High Prediction Accuracy; Informative SNPs; Extensive Experimental Study


  1.
  2. Carlson, C.S., Eberle, M.A., Rieder, M.J., Yi, Q., Kruglyak, L., Nickerson, D.A.: Selecting a maximally informative set of single-nucleotide polymorphisms for association analyses using linkage disequilibrium. American Journal of Human Genetics 74(1), 106–120 (2004)
  3. Clark, A., Weiss, K., Nickerson, D., Taylor, S., Buchanan, A., Stengard, J., Salomaa, V., Vartiainen, E., Perola, M., Boerwinkle, E.: Haplotype structure and population genetic inferences from nucleotide-sequence variation in human lipoprotein lipase. American Journal of Human Genetics 63, 595–612 (1998)
  4. Chapman, J.M., Cooper, J.D., Todd, J.A., Clayton, D.G.: Detecting disease associations due to linkage disequilibrium using haplotype tags: a class of tests and the determinants of statistical power. Human Heredity 56, 18–31 (2003)
  5. Daly, M., Rioux, J., Schaffner, S., Hudson, T., Lander, E.: High resolution haplotype structure in the human genome. Nature Genetics 29, 229–232 (2001)
  6. Kimmel, G., Shamir, R.: GERBIL: Genotype resolution and block identification using likelihood. PNAS 102, 158–162 (2004)
  7. Halldorsson, B.V., Bafna, V., Lippert, R., Schwartz, R., de la Vega, F.M., Clark, A.G., Istrail, S.: Optimal haplotype block-free selection of tagging SNPs for genome-wide association studies. Genome Research 14, 1633–1640 (2004)
  8. Halperin, E., Kimmel, G., Shamir, R.: Tag SNP Selection in Genotype Data for Maximizing SNP Prediction Accuracy. Bioinformatics 203, i195–i203 (2005)
  9. He, J., Zelikovsky, A.: Linear Reduction Methods for Tag SNP Selection. In: Proceedings of the International Conference of the IEEE Engineering in Medicine and Biology, pp. 2840–2843 (2004)
  10. He, J., Zelikovsky, A.: Linear Reduction Method for Predictive and Informative Tag SNP Selection. International Journal Bioinformatics Research and Applications 3, 249–260 (2005)
  11. Patil, N., Berno, A., Hinds, D., Barrett, W., Doshi, J., Hacker, C., Kautzer, C., Lee, D., Marjoribanks, C., McDonough, D., Nguyen, B., Norris, M., Sheehan, J., Shen, N., Stern, D., Stokowski, R., Thomas, D., Trulson, M., Vyas, K., Frazer, K., Fodor, S., Cox, D.: Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome. Science 294, 1719–1723 (2001)
  12. StatSoft, Inc. Electronic Statistics Textbook. Tulsa, OK: StatSoft. WEB (1999),
  13. Stram, D., Haiman, C., Hirschhorn, J., Altshuler, D., Kolonel, L., Henderson, B., Pike, M.: Choosing haplotype-tagging SNPs based on unphased genotype data using a preliminary sample of unrelated subjects with an example from the multiethnic cohort study. Human Heredity 55, 27–36 (2003)
  14. Zhang, K., Qin, Z., Liu, J., Chen, T., Waterman, M., Sun, F.: Haplotype block partitioning and tag SNP selection using genotype data and their applications to association studies. Genome Research 14, 908–916 (2004)

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Jingwu He (1)
  • Alex Zelikovsky (1)
  1. Department of Computer Science, Georgia State University, Atlanta