ICONIP 2007: Neural Information Processing, pp. 789–797

Perfect Population Classification on Hapmap Data with a Small Number of SNPs

  • Nina Zhou
  • Lipo Wang
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4985)

Abstract

Single nucleotide polymorphisms (SNPs) are believed to determine human differences and, to some degree, give biomedical researchers a means of predicting the risk of some diseases and explaining patients' different responses to drug regimens. The Hapmap Project makes millions of SNPs available, but the sheer size of this data set is itself a major challenge for SNP research. Inspired by recent work on population classification by Park et al. (2006), we attempt to find as few SNPs as possible among the original nearly 4 million SNPs to classify the 3 populations in the Hapmap genotype data. In this paper, we propose to first rank SNPs with a modified t-test measure and then combine the ranking with a classifier, e.g., the support vector machine, to find the optimal SNP subset. Compared with the result of Park et al., our proposed method is more efficient in ranking features and classifying the three populations: we obtained perfect classification using only 11 SNPs, whereas Park et al. used 82 SNPs.
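The rank-then-select idea described in the abstract can be illustrated with a minimal Python sketch. Everything specific in it is an assumption: the abstract does not give the formula of the modified t-test, so `t_scores` uses a generic t-like score (largest standardized gap between a class mean and the overall mean); a nearest-centroid classifier with leave-one-out evaluation is swapped in for the support vector machine to keep the example dependency-free; and the genotype matrix `X` (values 0/1/2), the labels "A", "B", "C", and the names `s0`, `loo_accuracy`, and `select_subset` are hypothetical toy choices, not Hapmap data or the authors' code.

```python
import math

def t_scores(X, y, s0=1e-6):
    """Generic t-like score per feature (an assumed stand-in for the paper's measure)."""
    n, classes = len(X), sorted(set(y))
    scores = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        overall = sum(col) / n
        ss, means, counts = 0.0, {}, {}
        for k in classes:
            vals = [col[i] for i in range(n) if y[i] == k]
            means[k] = sum(vals) / len(vals)
            counts[k] = len(vals)
            ss += sum((v - means[k]) ** 2 for v in vals)
        s = math.sqrt(ss / (n - len(classes)))  # pooled within-class std. dev.
        # s0 guards against division by zero when a feature is constant within classes
        scores.append(max(
            abs(means[k] - overall) / (math.sqrt(1 / counts[k] + 1 / n) * (s + s0))
            for k in classes))
    return scores

def loo_accuracy(X, y, feats):
    """Leave-one-out accuracy of a nearest-centroid classifier on the chosen features."""
    n, correct = len(X), 0
    for i in range(n):
        sums, counts = {}, {}
        for r in range(n):
            if r == i:
                continue  # hold out sample i
            vec = sums.setdefault(y[r], [0.0] * len(feats))
            for a, j in enumerate(feats):
                vec[a] += X[r][j]
            counts[y[r]] = counts.get(y[r], 0) + 1
        best_k, best_d = None, float("inf")
        for k, vec in sums.items():
            d = sum((X[i][j] - vec[a] / counts[k]) ** 2 for a, j in enumerate(feats))
            if d < best_d:
                best_k, best_d = k, d
        correct += best_k == y[i]
    return correct / n

def select_subset(X, y, max_feats=20):
    """Add features in rank order until classification is perfect."""
    scores = t_scores(X, y)
    ranking = sorted(range(len(scores)), key=lambda j: -scores[j])
    for m in range(1, max_feats + 1):
        feats = ranking[:m]
        acc = loo_accuracy(X, y, feats)
        if acc == 1.0:
            break
    return feats, acc

# Toy data: 3 populations x 6 samples; features 0 and 1 are informative, 2..5 are noise.
y = [pop for pop in ("A", "B", "C") for _ in range(6)]
X = [[2 if pop == "A" else 0, 2 if pop == "B" else 0]
     + [(i * j + j) % 3 for j in range(2, 6)]
     for i, pop in enumerate(y)]

feats, acc = select_subset(X, y)
print(feats, acc)  # [0, 1] 1.0
```

On this toy data a single top-ranked feature cannot separate populations "B" and "C", so the loop stops at two features. Note that the sketch ranks features on all samples before the leave-one-out loop, which is optimistic; a stricter evaluation would repeat the ranking inside each fold.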

Keywords

Support Vector Machine; Feature Selection; Feature Subset; Ranking List; Hapmap Project
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

References

  1. Bafna, V., Halldorsson, B., Schwartz, R., Clark, A., Istrail, S.: Haplotypes and Informative SNP selection: Don’t block out information. In: Proc. of RECOMB, pp. 19–27 (2003)
  2. Celedon, J.C.: Candidate genes, SNPs, Haplotypes and linkage disequilibrium. Powerpoint presentation (2004), http://innateimmunity.net/files/CANDGENES/siframes.html
  3. Devore, J., Peck, R.: Statistics: the exploration and analysis of data, 3rd edn. Duxbury Press, Pacific Grove (1997)
  4. Duerinck, K.F.: (2001), http://www.duerinck.com/snp.html
  5. Francois, R., Langrognet, F.: Double Cross Validation for Model Based Classification, User (2006), http://www.r-project.org/user-2006/Abstracts/Francois+Langrognet.pdf
  6. Guyon, I., Elisseeff, A.: An Introduction to Variable and Feature Selection. Journal of Machine Learning Research 3, 1157–1182 (2003)
  7. Halldórsson, B., Bafna, V., Lippert, R., Schwartz, R., de la Vega, F., Clark, A., Istrail, S.: Optimal haplotype block-free selection of tagging SNPs for genome-wide association studies. Genome Research 14, 1633–1640 (2004)
  8. Halperin, E., Kimmel, G., Shamir, R.: Tag SNP selection in genotype data for maximizing SNP prediction accuracy. Bioinformatics 199, 195–203 (2005)
  9. Hsu, C.W., Chang, C.C., Lin, C.J.: A practical guide to support vector classification. Technical report, Department of Computer Science and Information Engineering, National Taiwan University, Taipei (2003)
  10.
  11. Jaeger, J., Sengupta, R., Ruzzo, W.L.: Improved Gene Selection For Classification Of Microarrays. Pac. Symp. Biocomput., 53–64 (2003)
  12. Keerthi, S.S., Lin, C.-J.: Asymptotic behaviors of support vector machines with Gaussian kernel. Neural Computation 15, 1667–1689 (2003)
  13. Levner, I.: Feature selection and nearest centroid classification for protein mass spectrometry. BMC Bioinformatics 6, 68 (2005)
  14. Liu, B., Wan, C.R., Wang, L.P.: An efficient semi-unsupervised gene selection method via spectral biclustering. IEEE Trans. on Nano-Bioscience 5, 110–114 (2006)
  15. Mitra, P., Murthy, C.A., Pal, S.K.: Unsupervised feature selection using feature similarity. IEEE Trans. on Pattern Analysis and Machine Intelligence 3, 301–312 (2002)
  16. Park, J.S., Hwang, S.H., Lee, Y.S., Kim, S.C.: SNP@Ethnos: a database of ethnically variant single-nucleotide polymorphisms. Nucleic Acids Research 0, D1–D5 (2006)
  17. Phuong, T.M., Lin, Z., Altman, R.B.: Choosing SNPs using Feature Selection. In: Proc IEEE Comput Syst Bioinform Conf. 2005 (CSB 2005), pp. 301–309 (2005)
  18. Pritchard, J.K., Przeworski, M.: Linkage disequilibrium in humans: models and data. Am. J. Hum. Genet. 69, 1–14 (2001)
  19. Rosenberg, N.A., et al.: Informativeness of genetic markers for inference of ancestry. Am. J. Hum. Genet. 73, 1402–1422 (2003)
  20. Rosenberg, N.A.: Algorithms for selecting informative marker panels for population assignment. Journal of Computational Biology 9, 1183–1201 (2005)
  21. Su, Y., Murali, T.M., Pavlovic, V., Schaffer, M., Kasif, S.: RankGene: Identification of Diagnostic Genes Based on Expression Data. Bioinformatics 19, 1578–1579 (2003)
  22. The International HapMap Consortium: The international Hapmap Project. Nature 426, 789–796 (2003), www.hapmap.org/genotypes
  23. Tibshirani, R., Hastie, T., Narasimhan, B., Chu, G.: Diagnosis of multiple cancer types by shrunken centroids of gene expression. PNAS 99, 6567–6572 (2002)
  24. Trochim, W.M.: The Research Methods Knowledge Base, 2nd edn. Atomic Dog Publishing (2004), http://www.socialresearchmethods.net/kb/
  25. Vapnik, V.: Statistical learning theory. Wiley, New York (1998)
  26. Wang, L.P.: Support Vector Machines: Theory and Applications. Springer, Heidelberg (2005)
  27. Wang, L.P., Chu, F., Xie, W.: Accurate cancer classification using expressions of very few genes. IEEE Transactions on Bioinformatics and Computational Biology 4, 40–53 (2007)
  28. Wang, L.P., Fu, X.J.: Data Mining with Computational Intelligence. Springer, Berlin (2005)
  29. Welch, B.L.: The generalization of student’s problem when several different population are involved. Biometrika 34, 28–35 (1947)
  30. Wright, S.: The interpretation of population structure by F-statistics with special regard to systems of mating. Evolution 19, 395–420 (1965)
  31. Wu, B., Abbott, T., Fishman, D., McMurray, W., Mor, G., Stone, K., Ward, D., Williams, K., Zhao, H.: Comparison of statistical methods for classification of ovarian cancer using mass spectrometry data. Bioinformatics 19, 1636–1643 (2003)
  32. Zhen, L., Altman, R.B.: Finding Haplotype Tagging SNPs by Use of Principle Components Analysis. Am. J. Hum. Genet. 75, 850–861 (2004)

Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Nina Zhou (1)
  • Lipo Wang (2)
  1. College of Information Engineering, Xiangtan University, Xiangtan, China
  2. Nanyang Technological University, Singapore
