Tag SNP Selection Based on Multivariate Linear Regression
The search for associations between complex diseases and single nucleotide polymorphisms (SNPs) or haplotypes has recently received great attention. For these studies, it is essential to use a small subset of informative SNPs (tag SNPs) that accurately represents the rest of the SNPs. Tagging saves budget by genotyping only a limited number of SNPs and computationally inferring all others, and it compacts extremely long SNP sequences (obtained, e.g., from the Affymetrix Map Array) for further fine genotype analysis. Tagging should first choose tags from the SNPs under consideration and then, knowing the values of the chosen tag SNPs, predict (or statistically cover) the non-tag SNPs. In this paper we propose a new SNP prediction method based on rounding of multivariate linear regression (MLR) analysis in sigma-restricted coding. When predicting a non-tag SNP, the MLR method accumulates information about all tag SNPs, resulting in significantly higher prediction accuracy with the same number of tags than previously known tagging methods. We also show that tag selection strongly depends on how the chosen tags will be used: the advantage of one tag set over another can only be judged with respect to a particular prediction method. Two simple universal tag selection methods have been applied: a (faster) stepwise algorithm and a (slower) local-minimization algorithm. An extensive experimental study on various datasets, including 6 regions from HapMap, shows that MLR prediction combined with stepwise tag selection uses significantly fewer tags (e.g., up to two times fewer tags to reach 90% prediction accuracy) than the state-of-the-art methods of Halperin et al. for genotypes and Halldorsson et al. for haplotypes, respectively. Our stepwise tagging matches the quality of STAMPA while being faster. The code is publicly available at http://alla.cs.gsu.edu/~software.
Keywords: Prediction Accuracy, Statistical Covering, High Prediction Accuracy, Informative SNPs, Extensive Experimental Study
- 3. Clark, A., Weiss, K., Nickerson, D., Taylor, S., Buchanan, A., Stengard, J., Salomaa, V., Vartiainen, E., Perola, M., Boerwinkle, E.: Haplotype structure and population genetic inferences from nucleotide-sequence variation in human lipoprotein lipase. American Journal of Human Genetics 63, 595–612 (1998)
- 9. He, J., Zelikovsky, A.: Linear Reduction Methods for Tag SNP Selection. In: Proceedings of the International Conference of the IEEE Engineering in Medicine and Biology, pp. 2840–2843 (2004)
- 11. Patil, N., Berno, A., Hinds, D., Barrett, W., Doshi, J., Hacker, C., Kautzer, C., Lee, D., Marjoribanks, C., McDonough, D., Nguyen, B., Norris, M., Sheehan, J., Shen, N., Stern, D., Stokowski, R., Thomas, D., Trulson, M., Vyas, K., Frazer, K., Fodor, S., Cox, D.: Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome. Science 294, 1719–1723 (2001)
- 12. StatSoft, Inc. Electronic Statistics Textbook. Tulsa, OK: StatSoft. WEB (1999), http://www.statsoft.com/textbook/stathome.html
- 13. Stram, D., Haiman, C., Hirschhorn, J., Altshuler, D., Kolonel, L., Henderson, B., Pike, M.: Choosing haplotype-tagging SNPs based on unphased genotype data using a preliminary sample of unrelated subjects with an example from the multiethnic cohort study. Human Heredity 55, 27–36 (2003)