Genotype Tagging with Limited Overfitting
Because genotyping is costly and genome-wide association studies produce large volumes of data, it is desirable to find a small subset of SNPs, referred to as tag SNPs, that covers the genetic variation of the entire data set. To represent the genetic variation of an untagged SNP, existing tagging methods use either a single tag SNP (e.g., Tagger, IdSelect) or several tag SNPs (e.g., MLR, STAMPA). When multiple tags are used to explain the variation of a single SNP, fewer tags are usually needed, but overfitting is higher.
This paper explores the trade-off between the number of tags and overfitting, and considers the problem of finding a minimum number of tags when at most two tags can represent the variation of an untagged SNP. We show that this problem is hard to approximate and propose an efficient heuristic, referred to as 2LR. Our experimental results show that 2LR tagging falls between Tagger and MLR both in the number of tags and in overfitting: 2LR uses slightly more tags than MLR, but its overfitting, measured with 2-fold cross-validation, is practically the same as that of Tagger. 2LR tagging also tolerates missing data better than Tagger.
Keywords: genotype tagging, linear programming, minimum dominating set, hypergraph
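To make the problem statement concrete, the following is a minimal sketch of tag selection in which each untagged SNP must be explained by at most two tags. It is not the 2LR heuristic from the paper, whose details are not given here; it is a plain greedy cover written for illustration. The coverage criterion (squared correlation or two-predictor regression R² of at least 0.8), the 0/1/2 genotype encoding, and all function names are assumptions.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def r2_two_tags(y, t1, t2):
    """R^2 of least-squares regression of y on t1 and t2 (with intercept)."""
    r1, r2, r12 = pearson(y, t1), pearson(y, t2), pearson(t1, t2)
    denom = 1 - r12 * r12
    if denom < 1e-12:
        # Collinear tags carry the same information as a single tag.
        return max(r1 * r1, r2 * r2)
    return (r1 * r1 + r2 * r2 - 2 * r1 * r2 * r12) / denom


def is_covered(s, tags, snps, threshold):
    """Assumed criterion: SNP s is a tag, or one or two tags explain it."""
    if s in tags:
        return True
    y = snps[s]
    if any(pearson(y, snps[t]) ** 2 >= threshold for t in tags):
        return True
    return any(
        r2_two_tags(y, snps[a], snps[b]) >= threshold
        for a, b in combinations(tags, 2)
    )


def greedy_two_tag_cover(snps, threshold=0.8):
    """Greedily add the SNP that covers the most SNPs until all are covered."""
    n = len(snps)
    tags = []

    def count(ts):
        return sum(is_covered(s, ts, snps, threshold) for s in range(n))

    while count(tags) < n:
        candidates = [c for c in range(n) if c not in tags]
        tags.append(max(candidates, key=lambda c: count(tags + [c])))
    return tags


# Toy data: three SNPs over four individuals; the third is the sum of the first two.
snps = [[0, 0, 1, 1], [0, 1, 0, 1], [0, 1, 1, 2]]
print(greedy_two_tag_cover(snps))
```

On this toy input two tags suffice, whereas single-tag coverage at the same threshold would need all three SNPs, since no pair of SNPs has squared correlation above 0.5. This illustrates why allowing two tags per SNP can reduce the tag count; it says nothing about overfitting, which the paper measures separately by cross-validation.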