Soft Computing, Volume 13, Issue 12, pp 1143–1151

A hybrid clustering and graph based algorithm for tagSNP selection

  • Mao-Zu Guo
  • Jun Wang
  • Chun-yu Wang
  • Yang Liu


TagSNP selection, which aims to select a small subset of informative single nucleotide polymorphisms (SNPs) to represent a much larger SNP set, plays an important role in current genomic research. It cuts the cost of genotyping by filtering out a large number of redundant SNPs, and it can accelerate genome-wide disease association studies. In this paper, we propose a new hybrid method called CMDStagger that combines ideas from clustering and graph algorithms to find a minimum set of tagSNPs. The proposed algorithm uses information on linkage disequilibrium association and haplotype diversity to reduce the information loss in tagSNP selection, and it is not restricted by block partitioning. The approach is tested on eight benchmark datasets from HapMap and chromosome 5q31. Experimental results show that the algorithm reduces selection time and selects fewer tagSNPs while retaining high prediction accuracy, which indicates that it performs better than previous methods.
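To make the idea of LD-based tagging concrete, the following is a minimal sketch of a generic greedy tagger driven by pairwise r². It is not the CMDStagger algorithm, whose clustering and maximum density subgraph steps are not detailed in this abstract. The function names, the 0/1 haplotype encoding, and the 0.8 threshold are all assumptions made for illustration.

```python
from itertools import combinations


def r_squared(a, b):
    """Pairwise LD (r^2) between two biallelic SNPs.

    a and b are equal-length lists of 0/1 alleles over the same haplotypes.
    """
    n = len(a)
    p_a = sum(a) / n                      # frequency of allele 1 at SNP a
    p_b = sum(b) / n                      # frequency of allele 1 at SNP b
    p_ab = sum(x & y for x, y in zip(a, b)) / n  # frequency of the 1-1 haplotype
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0                        # monomorphic SNP: treat LD as 0 (assumption)
    d = p_ab - p_a * p_b                  # disequilibrium coefficient D
    return d * d / denom


def greedy_tag_snps(haplotypes, threshold=0.8):
    """Greedy tag selection (illustrative, not the paper's method).

    haplotypes: list of rows, each a list of 0/1 alleles, one column per SNP.
    Returns indices of tag SNPs such that every SNP has r^2 >= threshold
    with at least one tag.
    """
    n_snps = len(haplotypes[0])
    cols = [[row[j] for row in haplotypes] for j in range(n_snps)]

    # neighbours[j] holds the SNPs that j can represent, including itself.
    neighbours = {j: {j} for j in range(n_snps)}
    for i, j in combinations(range(n_snps), 2):
        if r_squared(cols[i], cols[j]) >= threshold:
            neighbours[i].add(j)
            neighbours[j].add(i)

    uncovered = set(range(n_snps))
    tags = []
    while uncovered:
        # Pick the uncovered SNP that represents the most uncovered SNPs;
        # ties go to the lowest index.
        best = max(sorted(uncovered), key=lambda j: len(neighbours[j] & uncovered))
        tags.append(best)
        uncovered -= neighbours[best]
    return tags


if __name__ == "__main__":
    # Toy data (made up): SNPs 0, 1, 2 are in perfect LD; SNP 3 is independent.
    toy = [
        [0, 0, 1, 0],
        [0, 0, 1, 1],
        [1, 1, 0, 0],
        [1, 1, 0, 1],
    ]
    print(greedy_tag_snps(toy))
```

On the toy data, one tag stands in for the three SNPs in perfect LD and the independent SNP must be tagged by itself, which is the kind of redundancy filtering the abstract describes.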


TagSNP selection · Clustering algorithm · Maximum density subgraph (MDS) · Linkage disequilibrium (LD) · Haplotype diversity



The work was supported by the Natural Science Foundation of China under Grant No. 60871092, No. 60741001 and No. 60671011, the China National 863 High Tech Program under Grant No. 2007AA01Z171, the Science Fund for Distinguished Young Scholars of Heilongjiang Province in China under Grant No. JC200611, and the Natural Science Foundation of Heilongjiang Province in China under Grant No. ZJG0705.



Copyright information

© Springer-Verlag 2009

Authors and Affiliations

  1. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
