A hybrid clustering and graph based algorithm for tagSNP selection


TagSNP selection, which aims to select a small subset of informative single nucleotide polymorphisms (SNPs) to represent the whole SNP set, plays an important role in current genomic research. It cuts the cost of genotyping by filtering out a large number of redundant SNPs, and it can accelerate genome-wide disease association studies. In this paper, we propose a new hybrid method, CMDStagger, that combines ideas from clustering and graph algorithms to find a minimum set of tagSNPs. The proposed algorithm uses linkage disequilibrium association and haplotype diversity to reduce information loss in tagSNP selection, and it is not restricted by block partitioning. The approach is tested on eight benchmark datasets from HapMap and chromosome 5q31. Experimental results show that the algorithm reduces selection time and obtains fewer tagSNPs while keeping prediction accuracy high, which indicates that it performs better than previous methods.
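
To make the idea of LD-based tagging concrete, the following is a minimal sketch of a generic greedy selection driven by pairwise r² between SNPs. It is not CMDStagger: the abstract does not give the algorithm's details, so the function names `r_squared` and `greedy_tags`, the 0/1 haplotype encoding, and the 0.8 threshold are all assumptions made for illustration only.

```python
from itertools import combinations


def r_squared(a, b):
    """Pairwise LD (r^2) between two biallelic SNPs.

    a and b are equal-length lists of 0/1 alleles, one entry per haplotype.
    """
    n = len(a)
    p_a = sum(a) / n                      # frequency of allele 1 at SNP a
    p_b = sum(b) / n                      # frequency of allele 1 at SNP b
    p_ab = sum(1 for x, y in zip(a, b) if x == 1 and y == 1) / n
    d = p_ab - p_a * p_b                  # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    return 0.0 if denom == 0 else d * d / denom


def greedy_tags(haplotypes, threshold=0.8):
    """Greedy illustration (assumed, not the paper's method).

    haplotypes: list of rows, one per haplotype, each a list of 0/1 alleles.
    Repeatedly picks the untagged SNP that covers the most untagged SNPs,
    where "covers" means r^2 >= threshold (hypothetical default 0.8).
    """
    n_snps = len(haplotypes[0])
    cols = [[row[j] for row in haplotypes] for j in range(n_snps)]

    # Each SNP covers itself plus every SNP in strong LD with it.
    covers = {i: {i} for i in range(n_snps)}
    for i, j in combinations(range(n_snps), 2):
        if r_squared(cols[i], cols[j]) >= threshold:
            covers[i].add(j)
            covers[j].add(i)

    untagged = set(range(n_snps))
    tags = []
    while untagged:
        # Ties are broken in favour of the lowest SNP index.
        best = max(sorted(untagged), key=lambda i: len(covers[i] & untagged))
        tags.append(best)
        untagged -= covers[best]
    return tags


if __name__ == "__main__":
    # Toy data: SNPs 0, 1 and 2 are in perfect LD; SNP 3 is independent.
    toy = [
        [0, 0, 1, 0],
        [0, 0, 1, 1],
        [1, 1, 0, 0],
        [1, 1, 0, 1],
    ]
    print(greedy_tags(toy))
```

In this toy input one tag suffices for the three correlated SNPs, while the independent SNP has to be tagged by itself. A hybrid method such as the one described above would additionally draw on clustering and haplotype diversity, which this sketch does not attempt to model.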





This work was supported by the Natural Science Foundation of China under Grant Nos. 60871092, 60741001 and 60671011, the China National 863 High Tech Program under Grant No. 2007AA01Z171, the Science Fund for Distinguished Young Scholars of Heilongjiang Province in China under Grant No. JC200611, and the Natural Science Foundation of Heilongjiang Province in China under Grant No. ZJG0705.

Corresponding author

Correspondence to Jun Wang.


Guo, M., Wang, J., Wang, C. et al. A hybrid clustering and graph based algorithm for tagSNP selection. Soft Comput 13, 1143–1151 (2009). https://doi.org/10.1007/s00500-009-0419-z



  • TagSNP selection
  • Clustering algorithm
  • Maximum density subgraph (MDS)
  • Linkage disequilibrium (LD)
  • Haplotype diversity