Abstract
Although several model-based methods are promising for identifying influential single factors and multi-factor interactions, few are widely used in real applications, because most model-selection procedures are complex and/or computationally infeasible for high-dimensional data. In particular, the ability of these methods to reveal more true factors and fewer false ones often relies heavily on selecting appropriate values of tuning parameters, which remains a difficult task for practical analysts. This article provides a simple algorithm, modified from stepwise forward regression, for identifying influential factors. Instead of keeping the identified factors in subsequent models for adjustment, as stepwise regression does, we propose to subtract the effects of the identified factors in each run and always fit a single-term model to the effect-subtracted responses. The computation is lighter because the proposed method involves only the calculation of a simple test statistic; it can therefore be applied to screen ultrahigh-dimensional data for important single factors and multi-factor interactions. Most importantly, we propose a novel stopping rule that uses a constant threshold for the simple test statistic, in contrast to conventional stepwise regression with the AIC or BIC criterion. Extensive simulation studies confirm that the new algorithm performs competitively with several methods available in R packages, including the popular group lasso, sure independence screening, Bayesian quantitative trait locus mapping methods and others. Findings from two real data examples, including a genome-wide association study, demonstrate the additional useful information on high-order interactions that can be gained by implementing the proposed algorithm.
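The idea of subtracting identified effects and refitting a single-term model can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the test statistic or the threshold, so the sketch assumes continuous candidate columns, uses the t-statistic of a simple linear regression as the "simple test statistic", and takes a hypothetical cutoff of 4.0. The names `pare_down`, `threshold` and `max_steps` are likewise illustrative. Interaction terms could be screened in the same way by appending product columns to `X`.

```python
import numpy as np

def pare_down(X, y, threshold=4.0, max_steps=20):
    """Illustrative residual-based forward screening (assumed details).

    At each step a single-term regression of the current residual on every
    candidate column is scored by its t-statistic. The best column is kept
    if |t| reaches the constant threshold, and its fitted effect is then
    subtracted from the response before the next step.
    """
    n, p = X.shape
    Xc = X - X.mean(axis=0)            # centre each candidate column
    ss = (Xc ** 2).sum(axis=0)         # per-column sum of squares (assumed > 0)
    resid = y - y.mean()               # centred response
    active = np.ones(p, dtype=bool)    # columns not yet selected
    selected = []
    for _ in range(min(max_steps, p)):
        beta = Xc.T @ resid / ss                    # slope of each single-term fit
        rss = (resid ** 2).sum() - beta ** 2 * ss   # residual sum of squares per fit
        t = beta / np.sqrt(rss / (n - 2) / ss)      # t-statistic per column
        t = np.where(active, t, 0.0)                # skip columns already chosen
        j = int(np.argmax(np.abs(t)))
        if abs(t[j]) < threshold:                   # constant-threshold stopping rule
            break
        selected.append(j)
        active[j] = False
        resid = resid - beta[j] * Xc[:, j]          # subtract the identified effect
    return selected

# Simulated example: columns 3 and 17 carry the true effects.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
y = 3.0 * X[:, 3] - 2.0 * X[:, 17] + rng.standard_normal(200)
selected = pare_down(X, y)
```

Because each step scores single-term fits only, the cost per step is one matrix-vector product, which is what makes this style of screening attractive when the number of candidate terms is very large. The trade-off, compared with ordinary stepwise regression, is that earlier effects are not re-estimated once later terms enter.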
Rights and permissions
Open Access This is an open access article distributed under the terms of the Creative Commons Attribution Noncommercial License (https://creativecommons.org/licenses/by-nc/2.0), which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
About this article
Cite this article
Hwang, JS., Hu, TH. Stepwise Paring down Variation for Identifying Influential Multi-factor Interactions Related to a Continuous Response Variable. Stat Biosci 4, 197–212 (2012). https://doi.org/10.1007/s12561-011-9045-2
DOI: https://doi.org/10.1007/s12561-011-9045-2