Selecting covariates is among the most controversial and difficult tasks in epidemiologic analysis. Whether to include or exclude a covariate depends on the research question posed, the design of the study, and ultimately also on the sample size [1]. The goals of variable selection are mainly twofold. First, in etiologic research, variable selection is used for confounder control to obtain unbiased effect estimates. Second, prediction research depends on variable selection for unbiased estimation of probabilities [1–6].

Prior knowledge from the scientific literature is formally seen as the most important rationale for including or excluding covariates from a statistical analysis, but it is not available for every research question [2, 4, 6]. Statistical science has therefore developed several decision rules and algorithms that select variables based on relations within the data under study: change in the effect estimate, stepwise selection, modern techniques such as shrinkage and penalized regression, and other techniques.

The aim of this commentary is to assess how often different variable selection techniques were applied in contemporary epidemiologic analysis. It was of particular interest to see whether modern methods such as shrinkage or penalized regression were used in recent publications. We screened the methods sections of articles published in four major epidemiologic journals in 2008 (American Journal of Epidemiology, Epidemiology, European Journal of Epidemiology, and the International Journal of Epidemiology) for a description of the technique used to select variables. We present the frequency of these methods and, in addition, cite some articles that give a good example of how these selection techniques can be described in the methods section. The first author of this commentary categorized all articles into one of the following six categories: prior knowledge, change-in-estimate, stepwise selection, modern methods, other methods, not described. The second author drew a random sample of 30 articles and categorized them as well. Agreement between the two authors was 87% before the consensus discussion, and one study was reclassified afterwards. We excluded commentaries, purely descriptive studies, genetic association studies, and meta-analyses. All other publications were included.

Table 1 shows the frequency of the methods used in articles published in these journals in 2008. A total of 300 articles met our inclusion criteria. We did not observe significant differences between the journals (Fisher’s exact test, p = 0.09).

Table 1 Variable selection methods used in major epidemiologic journals in 2008

In 83 (28%) articles, the authors selected the covariates contained in multivariable models based on prior knowledge. Ideally, the selection of covariates should be substantiated with references from the literature. This was the case for only 41 of the 83 publications that relied on this method. The remaining 42 studies described the rationale for including the covariates without explicit references. Prior knowledge can be documented by referring to a study in the same population that identified risk factors for the outcome under study, as in a study on the impact of smoking on thyroid volume [7], or by referring to one or more studies that identify each of the potential confounders. An example of the latter approach is a study examining injury risk in the Swedish population [8].

A total of 59 (20%) of all reviewed publications used stepwise selection procedures, with or without univariate pre-screening of potential covariates. These procedures rely on statistical testing of the covariate-disease association to decide which variables to include in or exclude from the model. They have been criticized extensively in the literature because they require arbitrary thresholds and can lead to bias, overfitting, and exaggerated p values [1–4, 6, 9]. The majority of these studies (66%) explicitly stated the thresholds. Although p values cannot replace prior information in selecting the best set of covariates, an analysis is at least reproducible, and therefore to a certain degree objective, if the exact methods and thresholds used for these procedures are reported [1, 2]. A recent study that derived and validated a mortality index among frail older patients is an example [10].

Another approach, which is often combined with stepwise selection procedures, is the use of a pre-specified change-in-estimate criterion (n = 44, 15%). This approach has been judged more favorable than stepwise procedures, particularly when the change in the interval estimate rather than the point estimate of the effect under study is used [4, 11]. It takes into account not only the covariate-disease association but also the exposure-covariate association, because it evaluates the change in the estimate upon removal of the covariate [4]. Whether a threshold is adequate depends on the context of the study and requires prior knowledge. Reporting the criterion used is essential for this procedure. It is up to the researcher and the audience to decide whether, e.g., a 10% change in the risk measure is reasonable for the association between income and recurrent coronary events [12], whereas a 5% change was seen as more adequate for the association between socioeconomic position and pre-term birth [13]. Together, stepwise procedures and change-in-estimate represent 34% of all the methods used and virtually all of the data-driven statistical methods. Several variants of these procedures are implemented in standard software packages used to analyze epidemiological data. The ease of using these methods and their dominance in the existing literature, despite years of criticism by leading epidemiologists, have probably hindered the breakthrough of other, less controversial methods such as shrinkage.
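The change-in-estimate criterion can be illustrated with a minimal sketch. It assumes a cohort with a binary exposure, a binary outcome, and one categorical candidate confounder, and it compares the crude risk ratio with the Mantel-Haenszel adjusted risk ratio. The counts, the names (Stratum, change_in_estimate), and the choice of measuring the relative change on the ratio scale are illustrative assumptions and are not taken from any of the reviewed studies.

```python
from dataclasses import dataclass


@dataclass
class Stratum:
    """Cohort counts within one level of a candidate confounder (hypothetical data)."""
    exposed_cases: int
    exposed_total: int
    unexposed_cases: int
    unexposed_total: int


def crude_rr(strata):
    """Risk ratio with the covariate removed, i.e. all strata pooled."""
    a = sum(s.exposed_cases for s in strata)
    n1 = sum(s.exposed_total for s in strata)
    c = sum(s.unexposed_cases for s in strata)
    n0 = sum(s.unexposed_total for s in strata)
    return (a / n1) / (c / n0)


def mantel_haenszel_rr(strata):
    """Risk ratio adjusted for the covariate (Mantel-Haenszel summary estimate)."""
    num = den = 0.0
    for s in strata:
        t = s.exposed_total + s.unexposed_total
        num += s.exposed_cases * s.unexposed_total / t
        den += s.unexposed_cases * s.exposed_total / t
    return num / den


def change_in_estimate(strata, threshold=0.10):
    """Return (unadjusted RR, adjusted RR, relative change, keep covariate?)."""
    adjusted = mantel_haenszel_rr(strata)
    unadjusted = crude_rr(strata)
    # Relative change in the estimate when the covariate is removed from the model.
    change = abs(unadjusted - adjusted) / adjusted
    return unadjusted, adjusted, change, change >= threshold


# Hypothetical covariate associated with both exposure and outcome.
confounded = [Stratum(10, 100, 20, 400), Stratum(120, 400, 15, 100)]
# Hypothetical covariate associated with the outcome only.
not_confounded = [Stratum(20, 200, 10, 200), Stratum(60, 200, 30, 200)]

for label, strata in [("confounded", confounded), ("not confounded", not_confounded)]:
    unadj, adj, change, keep = change_in_estimate(strata)
    print(f"{label}: crude RR {unadj:.2f}, adjusted RR {adj:.2f}, "
          f"change {change:.0%}, keep covariate: {keep}")
```

In the second example the covariate is a risk factor for the outcome but is unrelated to the exposure, so removing it leaves the estimate unchanged and the criterion drops it. The threshold itself remains a judgment call, as the two cited studies with 10% and 5% criteria show.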

In 9 articles, other very diverse methods for variable selection were applied: 4 studies used principal components [14–17], 1 study used propensity scores [18], 1 study explicitly included all variables in the regression [19], 2 studies used causal diagrams [20, 21], and 1 study used the Deletion/Substitution/Addition algorithm [22].

Not a single study used shrinkage procedures. Selection by shrinkage, in particular the Least Absolute Shrinkage and Selection Operator (LASSO) [23], was welcomed in the literature [3, 4]. In the case of the Cox proportional hazards model, this algorithm maximizes the partial likelihood of the regression coefficients subject to a constraint imposed on the sum of the absolute values of all regression coefficients in the model. The constraint itself can be estimated via cross-validation [23]. The LASSO technique has been labelled “shrinkage with selection” [24]. It corrects the extremes in the distribution of all variables and thus shrinks very unstable estimates towards zero. This effectively excludes some variables without the need for formal statistical testing [24–26]. Although LASSO and similar methods have been lauded in the epidemiologic literature because of these positive attributes, they were not applied in the selected articles in 2008. Admittedly, implementing LASSO for variable selection in the R program is tedious. Also, there is no consensus on the interpretation of the estimates or on how confidence intervals can be reliably estimated for penalized regression results such as those obtained by LASSO [3]. But as is the case with multiple imputation, which is now implemented as a routine in SAS and SPSS, the implementation of shrinkage and LASSO in commonly applied analysis software may help the dissemination of these modern methods for variable selection.
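How a penalty on the absolute values of the coefficients can exclude variables without any testing is easiest to see in a special case. The sketch below assumes linear regression with orthonormal predictors, where the LASSO estimate is known to equal the soft-thresholded least-squares estimate. This is not the Cox model described above, which has no such closed form, and the coefficient values, variable names, and penalty are hypothetical.

```python
def soft_threshold(coef, penalty):
    """Shrink a coefficient toward zero by `penalty`; return 0.0 if it does not exceed it."""
    if coef > penalty:
        return coef - penalty
    if coef < -penalty:
        return coef + penalty
    return 0.0


def lasso_orthonormal(ols_coefs, penalty):
    """LASSO estimates for orthonormal predictors (assumed special case)."""
    return {name: soft_threshold(b, penalty) for name, b in ols_coefs.items()}


# Hypothetical least-squares coefficients; in practice the penalty would be
# chosen by cross-validation rather than fixed by hand.
ols = {"age": 1.20, "smoking": -0.40, "noise_1": 0.05, "bmi": 0.80, "noise_2": -0.10}
shrunk = lasso_orthonormal(ols, penalty=0.30)
selected = [name for name, b in shrunk.items() if b != 0.0]

for name, b in shrunk.items():
    print(f"{name}: {ols[name]:+.2f} -> {b:+.2f}")
print("selected:", selected)
```

Every coefficient is pulled toward zero by the same amount, and those smaller in magnitude than the penalty become exactly zero. Selection is therefore a by-product of estimation and does not rest on a p value threshold.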

A total of 105 publications did not describe the method in sufficient detail. While it is remarkable that 35% of all selected articles in these epidemiologic journals fell into this category, this does not mean that the research is flawed. It is merely an indication of the quality of information in the methods section. One common reason for categorizing the selection technique as “not described” was the use of vague formulations such as “based on prior knowledge” or “a priori”. When the selection of variables is based on prior knowledge, this knowledge needs to be made explicit with references or explanations; otherwise the selection cannot be judged or discussed.
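The category counts reported in the text can be tallied in a few lines to check that they add up to the 300 included articles and reproduce the stated percentages. The counts are those given above; the variable names and the rounding to whole percentages are assumptions of this sketch.

```python
# Counts per category as reported in the text.
counts = {
    "prior knowledge": 83,
    "stepwise selection": 59,
    "change-in-estimate": 44,
    "other methods": 9,
    "modern methods": 0,
    "not described": 105,
}

total = sum(counts.values())
percent = {name: round(100 * n / total) for name, n in counts.items()}

# Stepwise selection and change-in-estimate combined.
data_driven = counts["stepwise selection"] + counts["change-in-estimate"]
data_driven_percent = round(100 * data_driven / total)

for name, n in counts.items():
    print(f"{name}: {n} ({percent[name]}%)")
print(f"total: {total}")
print(f"stepwise + change-in-estimate: {data_driven} ({data_driven_percent}%)")
```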

We conclude that variable selection methods that have been formally criticized as flawed still prevail in the scientific literature. This may be due to their ease of implementation, slow knowledge transfer, or the fear that editors or reviewers do not appreciate new approaches. We call for more cooperation between academic research into methodology and applied research, and we ask statisticians to work with research groups to demonstrate the usability of new algorithms in real data instead of simulation studies. Journals may wish not only to publish criticism of these methods but also to actively encourage the use of less controversial selection routines. At the very least, more referencing could be required when prior knowledge is used as a selection criterion. In addition, we encourage researchers not to use stepwise regression simply because it is available in standard software packages, but rather to explore the new methods for variable selection in their research.