Genotype Tagging with Limited Overfitting

  • Irina Astrovskaya
  • Alex Zelikovsky
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5676)


Because of the high genotyping cost and large data volume in genome-wide association studies, it is desirable to find a small subset of SNPs, referred to as tag SNPs, that covers the genetic variation of the entire data set. To represent the genetic variation of an untagged SNP, existing tagging methods use either a single tag SNP (e.g., Tagger, IdSelect) or several tag SNPs (e.g., MLR, STAMPA). When multiple tags are used to explain the variation of a single SNP, fewer tags are usually needed, but overfitting is higher.

This paper explores the trade-off between the number of tags and overfitting and considers the problem of finding a minimum number of tags when at most two tags can represent the variation of an untagged SNP. We show that this problem is hard to approximate and propose an efficient heuristic, referred to as 2LR. Our experimental results show that 2LR tagging lies between Tagger and MLR in both the number of tags and overfitting. Indeed, 2LR uses slightly more tags than MLR, but its overfitting, measured with 2-fold cross-validation, is practically the same as for Tagger. 2LR tagging tolerates missing data better than Tagger.
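The covering problem stated above can be illustrated with a small Python sketch. This is not the 2LR heuristic, whose details the abstract does not give; it is a plain greedy cover under stated assumptions. The function names and the `covers` mapping are hypothetical: `covers` maps a single candidate tag or a pair of candidate tags to the SNPs that this tag or pair is assumed to predict well enough, and how that relation is obtained from genotype data is left open.

```python
from itertools import combinations


def covered(tags, covers):
    """Return all SNPs represented by `tags`: the tags themselves plus every
    SNP predicted by a chosen single tag or by a chosen pair of tags."""
    result = set(tags)
    for group, targets in covers.items():
        if group <= tags:  # every tag in this single/pair has been chosen
            result |= targets
    return result


def greedy_two_tag_cover(snps, covers):
    """Greedy sketch (an assumption, not the paper's method): add one or two
    tags at a time, preferring the addition that newly covers the most SNPs
    per added tag. Ties are broken by sorted order of the SNP names."""
    snps = set(snps)
    chosen = set()
    while not snps <= covered(chosen, covers):
        have = covered(chosen, covers)
        rest = sorted(snps - chosen)
        candidates = [frozenset([s]) for s in rest]
        candidates += [frozenset(p) for p in combinations(rest, 2)]
        best, best_score = None, 0.0
        for cand in candidates:
            gain = len((covered(chosen | cand, covers) & snps) - have)
            score = gain / len(cand)
            if score > best_score:
                best, best_score = cand, score
        chosen |= best  # an uncovered SNP always covers itself, so best is set
    return chosen


# Hypothetical toy instance: s1 alone predicts s2; s1 and s3 together
# predict s4 and s5.
snps = ["s1", "s2", "s3", "s4", "s5"]
pair_covers = {
    frozenset(["s1"]): {"s2"},
    frozenset(["s1", "s3"]): {"s4", "s5"},
}
# The same instance when only single-tag prediction is allowed.
single_covers = {frozenset(["s1"]): {"s2"}}

print(sorted(greedy_two_tag_cover(snps, pair_covers)))
print(sorted(greedy_two_tag_cover(snps, single_covers)))
```

On this toy instance the pair-aware run selects two tags (s1 and s3), while the single-tag run needs four (s1, s3, s4, s5), which mirrors the observation that allowing several tags per SNP reduces the number of tags. The sketch says nothing about overfitting, which would have to be measured separately, for example by cross-validation.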


genotype tagging, linear programming, minimum dominating set, hypergraph





Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Irina Astrovskaya (1)
  • Alex Zelikovsky (1)
  1. Department of Computer Science, Georgia State University, Atlanta
