ICONIP 2007: Neural Information Processing, pp 789–797
Perfect Population Classification on Hapmap Data with a Small Number of SNPs
Abstract
Single nucleotide polymorphisms (SNPs) are believed to determine human differences and, to some degree, offer biomedical researchers a way to predict the risk of some diseases and to explain patients’ different responses to drug regimens. The Hapmap Project makes millions of SNPs available, but the sheer size of the data also poses a major challenge for SNP research. Inspired by the recent work on population classification by Park et al. (2006), we attempt to find as few SNPs as possible from the original nearly 4 million SNPs to classify the 3 populations in the Hapmap genotype data. In this paper, we propose to first use a modified t-test measure to rank SNPs, and then combine the ranking result with a classifier, e.g., the support vector machine, to find the optimal SNP subset. Compared with Park et al.’s result, our proposed method is more efficient in ranking features and classifying the three populations: we obtained perfect classification using only 11 SNPs, in comparison with the 82 SNPs used by Park et al.
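The rank-then-classify idea can be illustrated with a minimal sketch. It is not the authors' implementation, and several parts are assumptions: the score below is a generic multi-class t-like statistic (largest standardized gap between a class mean and the overall mean), not necessarily the modified t-test measure of the paper; a nearest-centroid classifier stands in for the support vector machine so that the example needs only the Python standard library; and the genotype matrix `X` (coded 0/1/2), the labels `y`, and all function names are hypothetical toy choices. For brevity the ranking is computed once on all samples, which a careful evaluation would repeat inside each validation fold.

```python
from math import sqrt

def t_scores(X, y):
    """Score each feature by its largest standardized class-vs-overall mean gap."""
    classes = sorted(set(y))
    n, p = len(X), len(X[0])
    eps = 1e-9  # guards against zero pooled variance
    scores = []
    for j in range(p):
        col = [row[j] for row in X]
        overall = sum(col) / n
        ss, stats = 0.0, []
        for c in classes:
            vals = [v for v, lab in zip(col, y) if lab == c]
            mean_c = sum(vals) / len(vals)
            ss += sum((v - mean_c) ** 2 for v in vals)
            stats.append((mean_c, len(vals)))
        s = sqrt(ss / (n - len(classes)))  # pooled within-class std. dev.
        scores.append(max(abs(m - overall) / (sqrt(1 / nc + 1 / n) * (s + eps))
                          for m, nc in stats))
    return scores

def rank_features(X, y):
    """Return feature indices ordered from highest to lowest score."""
    s = t_scores(X, y)
    return sorted(range(len(s)), key=lambda j: -s[j])

def centroid_predict(X_train, y_train, x, feats):
    """Assign x to the class whose centroid is nearest on the chosen features."""
    best, best_d = None, float("inf")
    for c in sorted(set(y_train)):
        rows = [r for r, lab in zip(X_train, y_train) if lab == c]
        d = sum((x[j] - sum(r[j] for r in rows) / len(rows)) ** 2 for j in feats)
        if d < best_d:
            best, best_d = c, d
    return best

def loo_accuracy(X, y, feats):
    """Leave-one-out accuracy of the stand-in classifier on a feature subset."""
    hits = 0
    for i in range(len(X)):
        Xt, yt = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        hits += centroid_predict(Xt, yt, X[i], feats) == y[i]
    return hits / len(X)

def smallest_perfect_subset(X, y, max_feats):
    """Grow the subset along the ranking until accuracy reaches 1.0."""
    order = rank_features(X, y)
    for k in range(1, max_feats + 1):
        feats = order[:k]
        acc = loo_accuracy(X, y, feats)
        if acc == 1.0:
            return feats, acc
    return order[:max_feats], acc

# Toy data: 9 samples, 4 SNPs, 3 populations (labels are placeholders).
X = [[2, 1, 0, 0], [2, 0, 0, 1], [2, 2, 0, 2],
     [0, 1, 0, 2], [0, 2, 0, 0], [0, 0, 0, 1],
     [0, 2, 2, 1], [0, 1, 2, 0], [0, 0, 2, 2]]
y = ["A"] * 3 + ["B"] * 3 + ["C"] * 3

feats, acc = smallest_perfect_subset(X, y, max_feats=4)
print(feats, acc)
```

In this toy data no single SNP separates all three groups, so the loop stops only once a second informative SNP has been added, which mirrors the search for a small subset that still classifies perfectly.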
Keywords
Support Vector Machine, Feature Selection, Feature Subset, Ranking List, Hapmap Project
References
- 1. Bafna, V., Halldorsson, B., Schwartz, R., Clark, A., Istrail, S.: Haplotypes and Informative SNP selection: Don’t block out information. In: Proc. of RECOMB, pp. 19–27 (2003)
- 2. Celedon, J.C.: Candidate genes, SNPs, Haplotypes and linkage disequilibrium. Powerpoint presentation (2004), http://innateimmunity.net/files/CANDGENES/siframes.html
- 3. Devore, J., Peck, R.: Statistics: the exploration and analysis of data, 3rd edn. Duxbury Press, Pacific Grove (1997)
- 4. Duerinck, K.F.: (2001), http://www.duerinck.com/snp.html
- 5. Francois, R., Langrognet, F.: Double Cross Validation for Model Based Classification, User (2006), http://www.r-project.org/user-2006/Abstracts/Francois+Langrognet.pdf
- 6. Guyon, I., Elisseeff, A.: An Introduction to Variable and Feature Selection. Journal of Machine Learning Research 3, 1157–1182 (2003)
- 7. Halldórsson, B., Bafna, V., Lippert, R., Schwartz, R., de la Vega, F., Clark, A., Istrail, S.: Optimal haplotype block-free selection of tagging SNPs for genome-wide association studies. Genome Research 14, 1633–1640 (2004)
- 8. Halperin, E., Kimmel, G., Shamir, R.: Tag SNP selection in genotype data for maximizing SNP prediction accuracy. Bioinformatics 199, 195–203 (2005)
- 9. Hsu, C.W., Chang, C.C., Lin, C.J.: A practical guide to support vector classification. Technical report, Department of Computer Science and Information Engineering, National Taiwan University, Taipei (2003)
- 10. Human genome project information (2006), http://www.ornl.gov/sci/techresources/Human_Genome/faq/snps.html
- 11. Jaeger, J., Sengupta, R., Ruzzo, W.L.: Improved Gene Selection For Classification Of Microarrays. Pac. Symp. Biocomput., 53–64 (2003)
- 12. Keerthi, S.S., Lin, C.-J.: Asymptotic behaviors of support vector machines with Gaussian kernel. Neural Computation 15, 1667–1689 (2003)
- 13. Levner, I.: Feature selection and nearest centroid classification for protein mass spectrometry. BMC Bioinformatics 6, 68 (2005)
- 14. Liu, B., Wan, C.R., Wang, L.P.: An efficient semi-unsupervised gene selection method via spectral biclustering. IEEE Trans. on Nano-Bioscience 5, 110–114 (2006)
- 15. Mitra, P., Murthy, C.A., Pal, S.K.: Unsupervised feature selection using feature similarity. IEEE Trans. on Pattern Analysis and Machine Intelligence 3, 301–312 (2002)
- 16. Park, J.S., Hwang, S.H., Lee, Y.S., Kim, S.C.: SNP@Ethnos: a database of ethnically variant single-nucleotide polymorphisms. Nucleic Acids Research 0, D1–D5 (2006)
- 17. Phuong, T.M., Lin, Z., Altman, R.B.: Choosing SNPs using Feature Selection. In: Proc IEEE Comput Syst Bioinform Conf. 2005 (CSB 2005), pp. 301–309 (2005)
- 18. Pritchard, J.K., Przeworski, M.: Linkage disequilibrium in humans: models and data. Am. J. Hum. Genet. 69, 1–14 (2001)
- 19. Rosenberg, N.A., et al.: Informativeness of genetic markers for inference of ancestry. Am. J. Hum. Genet. 73, 1402–1422 (2003)
- 20. Rosenberg, N.A.: Algorithms for selecting informative marker panels for population assignment. Journal of Computational Biology 9, 1183–1201 (2005)
- 21. Su, Y., Murali, T.M., Pavlovic, V., Schaffer, M., Kasif, S.: RankGene: Identification of Diagnostic Genes Based on Expression Data. Bioinformatics 19, 1578–1579 (2003)
- 22. The International HapMap Consortium: The international Hapmap Project. Nature 426, 789–796 (2003), www.hapmap.org/genotypes
- 23. Tibshirani, R., Hastie, T., Narasimhan, B., Chu, G.: Diagnosis of multiple cancer types by shrunken centroids of gene expression. PNAS 99, 6567–6572 (2002)
- 24. Trochim, W.M.: The Research Methods Knowledge Base, 2nd edn. Atomic Dog Publishing (2004), http://www.socialresearchmethods.net/kb/
- 25. Vapnik, V.: Statistical learning theory. Wiley, New York (1998)
- 26. Wang, L.P.: Support Vector Machines: Theory and Applications. Springer, Heidelberg (2005)
- 27. Wang, L.P., Chu, F., Xie, W.: Accurate cancer classification using expressions of very few genes. IEEE Transactions on Bioinformatics and Computational Biology 4, 40–53 (2007)
- 28. Wang, L.P., Fu, X.J.: Data Mining with Computational Intelligence. Springer, Berlin (2005)
- 29. Welch, B.L.: The generalization of ‘Student’s’ problem when several different population variances are involved. Biometrika 34, 28–35 (1947)
- 30. Wright, S.: The interpretation of population structure by F-statistics with special regard to systems of mating. Evolution 19, 395–420 (1965)
- 31. Wu, B., Abbott, T., Fishman, D., McMurray, W., Mor, G., Stone, K., Ward, D., Williams, K., Zhao, H.: Comparison of statistical methods for classification of ovarian cancer using mass spectrometry data. Bioinformatics 19, 1636–1643 (2003)
- 32. Zhen, L., Altman, R.B.: Finding Haplotype Tagging SNPs by Use of Principal Components Analysis. Am. J. Hum. Genet. 75, 850–861 (2004)