Random Subspace Method for high-dimensional regression with the R package regRSM
Abstract
Model selection and variable importance assessment in high-dimensional regression are among the most important tasks in contemporary applied statistics. In our procedure, implemented in the package regRSM, the Random Subspace Method (RSM) is used to construct a variable importance measure. The variables are first ordered with respect to these measures and then, from the hierarchical list of models given by the ordering, the final subset of variables is chosen using information criteria or a validation set. Modifications of the original method, such as the weighted Random Subspace Method and the version with initial screening of redundant variables, are discussed. We developed parallel implementations which make it possible to reduce the computation time significantly. In this paper, we give a brief overview of the methodology, demonstrate the package’s functionality and present a comparative study of the proposed algorithm and competing methods such as the lasso or CAR scores. In the performance tests, the computational times of the parallel implementations are compared.
Keywords
Random Subspace Method, High-dimensional regression, Variable importance measure, Generalized Information Criterion, MPI, R
1 Introduction
1.1 Motivation
In recent years considerable attention has been devoted to model selection and variable importance assessment in high-dimensional statistical learning. This is due to the ubiquity of data with a large number of variables in a wide range of research fields. Moreover, nowadays there is a strong need to discover functional relationships in the data and to build predictive models. In many applications the number of variables significantly exceeds the number of observations (“small n large p problem”). However, very often functional relationships are sparse in the sense that among thousands of available variables only a few are informative, and it is crucial to identify them correctly. Examples include microarray data containing gene activities, Quantitative Trait Loci (QTL) data, drug design data, high-resolution images, high-frequency financial data and text data, among others (see e.g. Donoho 2000 for an extensive list of references). In such situations standard methods like ordinary least squares cannot be applied directly. In view of this, a variety of dimension reduction techniques and regression methods tailored to the high-dimensional framework have been developed recently.
1.2 Related work
There are two mainstream methodologies for dimension reduction in regression: variable extraction methods and variable selection methods. The aim of variable extraction methods is to identify functions of variables that can replace the original ones. Examples include Principal Component Regression (see e.g. Jolliffe 1982) and Partial Least Squares Regression (see e.g. Martens 2001; Wold 2001). In contrast, variable selection methods identify an active set of original variables which affect the response. A significant number of such methods used for high-dimensional data consist of two steps: in the first step the variables are ordered using some form of regularization or a variable importance measure (see below for examples). In the second step the final model is chosen from the hierarchical list of models given by the ordering. Usually cross-validation, thresholding or information criteria are used in the second step to determine the active set.
An important and intensively studied line of research is devoted to regularization methods like lasso, SCAD and MCP (see e.g. Tibshirani 1996; Zou and Hastie 2005; Zhang and Zhang 2012). They perform estimation of parameters and variable selection simultaneously. The final model depends on the regularization parameter, which is usually tuned using cross-validation (Friedman et al. 2010). Many algorithms for computing the entire regularization path efficiently have been developed (see e.g. Friedman et al. 2010 for references).
An alternative direction employs ordering of variables based on their appropriately defined importance. Many variable importance measures have been proposed. The R package relaimpo (RELAtive IMPOrtance of regressors in linear models) implements different metrics for assessing the relative importance of variables in linear models (Grömping 2006). Unfortunately, some of them, e.g. pmvd (Proportional Marginal Variance Decomposition, Feldman 1999) and lmg (Latent Model Growth, Lindemann et al. 1980), require substantial computational effort and usually cannot be applied in the high-dimensional setting. Package relaimpo does not provide functions to select the final subset of variables based on their ordering. Another R package, caret (Classification And REgression Training), presented in Kuhn (2008), is convenient for computing variable importance metrics based on popular methods like Random Forests (Breiman 2001) or Multivariate Adaptive Regression Splines (Friedman 1991). It also contains several functions that can be used to assess the performance of classification models and to choose the final subset of variables. We also mention CAR scores (Correlation-Adjusted coRrelation) proposed in Zuber and Strimmer (2011) and implemented in the R package care. Zuber and Strimmer (2011) proposed to use information criteria to select the final subset of variables.
1.3 Contribution
Recently a novel approach based on the Random Subspace Method (RSM) has been developed in Mielniczuk and Teisseyre (2014). Originally, the RSM was proposed by Ho (1998) for classification purposes and independently by Breiman (2001) for the case when a considered prediction method is either a classification or a regression tree. In our algorithm the ranking of variables is based on fitting linear models on small randomly chosen subsets of variables. The method does not impose any conditions on the number of candidate variables.
The aims of this paper are to present novel variants of the RSM (the weighted Random Subspace Method and the version with initial screening of the redundant variables), describe their implementations and discuss the results of extensive experiments on real and artificial data. We also present a new way of choosing the final model, which is based on Generalized Information Criterion (GIC) and does not require an additional validation set as originally proposed in Mielniczuk and Teisseyre (2014). We show that this step can be performed very fast using properties of QR decomposition of the design matrix.
For a discussion of the theoretical properties of the original RSM we refer to Mielniczuk and Teisseyre (2014). The package regRSM, containing an implementation of the discussed procedure, is available from the Comprehensive R Archive Network at http://cran.r-project.org/web/packages/regRSM/index.html.
This paper is organized as follows. In Sect. 2 we briefly describe the methodology and present new variants of the original algorithm, and in Sect. 3 we present the implementation and the package functionality. In Sect. 4 the efficiency experiments are discussed, and Sect. 5 contains the analysis of the experiments on artificial and real datasets obtained using package regRSM.
2 Methodology
Algorithm 1 (RSM procedure)
 1. Input: observed data (Y, X), a number of subset draws B, a size of the subspace \(m<\min (p,n)\). Choice of weights \(w_{n}(i,m)\) is described below.
 2. Repeat the following procedure for \(k=1,\ldots , B\) starting with the counter \(C_{i,0}=0\) for any variable i.
    • Randomly draw a subset of variables \(m^{*}\) (without replacement) from the original variable space with the same probability for each variable.
    • Fit model to data \((Y,X_{m^*})\) and compute weight \(w(i,m^{*})\ge 0\) for each variable \(i\in m^{*}\). Set \(w(i,m^{*})=0\) if \(i\notin m^{*}\).
    • Update the counter \(C_{i,k}=C_{i,k-1}+I\{i\in m^{*}\}.\)
 3. For each variable i compute the final score \(W_i^{*}\) defined as
    $$\begin{aligned}W_i^{*}=\frac{1}{C_{i,B}}\sum \limits _{m^{*}:i\in m^{*}}w(i,m^{*}).\end{aligned}$$
 4. Sort the list of variables according to scores \(W_i^{*}\): \(W_{i_1}^{*}\ge W_{i_2}^{*}\ge \cdots \ge W_{i_{p}}^{*}\).
 5. Output: Ordered list of variables \(\{i_1,\ldots ,i_{p}\}\).
Observe that Algorithm 1 is generic in nature, i.e., other choices of weights \(w(i,m^{*})\) are also possible. Two parameters need to be set in the RSM: the number of selections B and the subspace size m. The smaller the size of the chosen subspace (i.e., of the subset of variables drawn), the larger the chance of missing informative variables or missing dependencies between variables. On the other hand, for large m many spurious variables can be included, adding noisy dimensions to the subspace. Note that if the choice of the weights \(w(i,m^{*})\) is based on a least squares fit, then the subspace size is limited by \(\min (n,p)\). In the following, the value of parameter m is chosen empirically. We concluded from the numerical experiments in Mielniczuk and Teisseyre (2014) that a reasonable choice is \(m=\min (n,p)/2\).
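The counting and averaging steps of Algorithm 1 can be illustrated with a minimal Python sketch. It is not the regRSM implementation: the function name rsm_scores, the callback weight_fn and the toy weights are hypothetical, variables are indexed from 0, and the model fit is replaced by a user-supplied function that returns the weights \(w(i,m^{*})\) for a drawn subset.

```python
import random

def rsm_scores(p, m, B, weight_fn, seed=0):
    """Generic RSM ranking: average the weights each variable receives
    over the random subspaces in which it was drawn."""
    rng = random.Random(seed)
    sums = [0.0] * p    # running sum of w(i, m*) for each variable i
    counts = [0] * p    # counter C_{i,k}: number of subspaces containing i
    for _ in range(B):
        subset = rng.sample(range(p), m)   # uniform draw without replacement
        weights = weight_fn(subset)        # dict {i: w(i, m*) >= 0}
        for i in subset:
            sums[i] += weights[i]
            counts[i] += 1
    # Final score W_i*. Giving score 0 to a variable that was never drawn
    # is an assumption of this sketch.
    scores = [s / c if c > 0 else 0.0 for s, c in zip(sums, counts)]
    order = sorted(range(p), key=lambda i: scores[i], reverse=True)
    return scores, order

def toy_weight_fn(subset):
    """Hypothetical stand-in for the model fit: variables 2 and 5 always
    receive a large weight, all other variables a small one."""
    relevant = {2, 5}
    return {i: (1.0 if i in relevant else 0.1) for i in subset}

scores, order = rsm_scores(p=10, m=4, B=200, weight_fn=toy_weight_fn)
print(order[:2])   # the two variables with the largest scores
```

In a real application weight_fn would fit a linear model on the selected columns and return one nonnegative weight per variable; the sketch only shows how the counters and the final scores interact.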
It follows from the description above that a parallel version of the algorithm is very easy to implement. Two parallel versions are provided in the package (see Sect. 3 for details). Figure 1 shows a block diagram of the algorithm.
In addition to the main algorithm, we consider a weighted version of the RSM (called WRSM). In the WRSM an additional initial step is performed in which we fit a univariate linear model for each variable. Based on the univariate models we compute the individual relevance of each variable. Then, in the main step, the variables are drawn to the random subspaces with probabilities proportional to the relevances determined in the initial step. Thus variables whose individual influence on the response is more significant have a larger probability of being chosen to any of the random subspaces. WRSM implements the heuristic premise that variables more correlated with the response variable should be chosen to the final model with larger probability. On the other hand, variables which are weakly correlated with the response but useful when considered jointly with other variables still have a chance to be drawn. WRSM can therefore be seen as a milder version of the Sure Independence Screening method proposed in Fan and Lv (2008). Since in the WRSM the relevant variables are more likely to be selected, we can limit the number of repetitions B in the main loop and reduce the computational cost of the procedure. The pseudocode of the WRSM is shown below.
Algorithm 2 (WRSM procedure)
 1. Input: observed data (Y, X), number of subset draws B, size of the subspace \(m<\min (p,n)\). Choice of initial weights \(w^{0}(i)\) is described below.
 2. For each variable i fit the univariate regression model and compute a weight of the ith variable \(w^{0}(i)\ge 0\).
 3. For each variable i compute \(\pi _{i}=w^{0}(i)/\sum _{l=1}^{p}w^{0}(l)\).
 4. Perform the RSM procedure in such a way that the probability of choosing the ith variable to the random subspace is equal to \(\pi _i\).
 5. Output: Ordered list of variables \(\{i_1,\ldots ,i_{p}\}\).
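Steps 2 to 4 of the WRSM can be sketched in Python as follows. Two points are assumptions of this sketch rather than details taken from the package: the initial weight \(w^{0}(i)\) is taken to be the squared Pearson correlation between the ith variable and the response, and the subspace is drawn one variable at a time, each time with probabilities proportional to \(\pi _i\) among the variables not yet chosen. All function names are hypothetical.

```python
import random

def squared_correlation(x, y):
    """Squared Pearson correlation, used here as a hypothetical w0(i)."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy * sxy / (sxx * syy)

def selection_probabilities(columns, y):
    """pi_i = w0(i) / sum_l w0(l); assumes at least one positive weight."""
    w0 = [squared_correlation(col, y) for col in columns]
    total = sum(w0)
    return [w / total for w in w0]

def weighted_subspace(pi, m, rng):
    """Draw m distinct variables sequentially, each draw proportional to pi
    among the remaining variables (assumes at least m positive entries)."""
    remaining = list(range(len(pi)))
    chosen = []
    for _ in range(m):
        weights = [pi[i] for i in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)
    return chosen

y = [1.0, 2.0, 3.0, 4.0, 5.0]
columns = [
    [1.0, 2.0, 3.0, 4.0, 5.0],   # perfectly correlated with y
    [2.0, 1.0, 4.0, 3.0, 5.0],   # moderately correlated with y
    [3.0, 1.0, 2.0, 5.0, 4.0],   # more weakly correlated with y
]
pi = selection_probabilities(columns, y)
subspace = weighted_subspace(pi, m=2, rng=random.Random(1))
print(pi, subspace)
```

With sequential draws only the first draw selects variable i with probability exactly \(\pi _i\); other weighted sampling schemes without replacement are possible and may behave slightly differently.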
Finally let us discuss the computational complexity of the whole RSM procedure. The cost of the first step is \(Bnm^2\) as we fit B linear models with m parameters each. The cost of the second step is \(nh^2\) as was discussed above. To reduce the computational cost of the procedure the execution of the first step is parallelized (see Sect. 3 for details).
3 Implementation
3.1 Package functionality

The function regRSM takes the following arguments:
 • y (the response vector of length n),
 • x (the input matrix with n rows and p columns),
 • yval (the optional response vector from the validation set),
 • xval (the optional input matrix from the validation set),
 • m (the size of the random subspace, defaulted to \(\min (n,p)/2\)),
 • B (the number of repetitions in the RSM procedure),
 • parallel (the choice of parallelization method),
 • nslaves (the number of slaves),
 • store_data (to be set to TRUE when function validate.regRSM is used subsequently),
 • screening (percentage of variables to be discarded in the screening procedure),
 • initial_weights (whether or not WRSM should be used),
 • useGIC (indicates whether GIC should be used in the final model selection),
 • thrs [the cutoff level h, see (1)],
 • penalty (the penalty in GIC).

The returned object of class ‘regRSM’ contains the following components:
 • scores (RSM final scores),
 • model (the final model chosen from the list given by the ordering of variables according to the RSM scores),
 • time (computational time),
 • data_transfer (data transfer time),
 • coefficients (coefficients in the selected linear model),
 • input_data (input data x and y; these objects are stored only if store_data=TRUE),
 • control (list containing information about input parameters),
 • informationCriterion (values of GIC calculated for all models from the nested list given by the ordering),
 • predError (prediction errors on the validation set calculated for all models from the nested list given by the ordering).

The following functions are available for objects of class ‘regRSM’:
 • validate: This function selects the final model for another, user-provided validation set based on the original RSM final scores. To use the function, the argument store_data in the ‘regRSM’ object must be TRUE. The function uses the final scores from the ‘regRSM’ object to create a ranking of variables. Then the final model which minimizes the prediction error on the specified validation set is chosen. An object of class ‘regRSM’ is returned. The final scores in the original ‘regRSM’ object and in the new one coincide. However, the final models can be different as they are based on different validation sets.
 • plot: This function produces a plot showing the values of the GIC (or prediction errors on the validation set) against the number of variables included in the model.
 • ImpPlot: This function produces a dot plot showing the final scores from the RSM procedure. Final scores describe the importance of explanatory variables.
 • predict: This function makes a prediction for new observations. The prediction is based on the final model, which is chosen using the validation set or GIC.
 • roc: This function produces a ROC-type curve for the ordering and computes the corresponding area under the curve (AUC). Let \(i_1,\ldots ,i_p\) be the ordering of variables given by the RSM procedure. Let t be the set of relevant variables (i.e., variables whose corresponding coefficients are nonzero), \(|t|\) its cardinality and \(t^c\) its complement. The ROC-type curve for the ordering is defined as
   $$\begin{aligned} \text {ROC}(s):=(FPR(s),TPR(s)),\quad s\in \{1,\ldots ,p\}, \end{aligned}$$
   where
   $$\begin{aligned} FPR(s):=\frac{|\hat{t}(s)\setminus t|}{|t^{c}|}, \end{aligned}$$
   $$\begin{aligned} TPR(s):=\frac{|\hat{t}(s)\cap t|}{|t|} \end{aligned}$$
   and \(\hat{t}(s):=\{i_1,\ldots ,i_s\}\). This function is useful for the evaluation of the ranking produced by the RSM procedure when the set of significant variables t is known (e.g., in simulation experiments on artificial datasets). When AUC is equal to one, all significant variables, supplied by the user in the argument truemodel, are placed ahead of the spurious ones on the ranking list. A similar idea of ranking evaluation is described in Cheng et al. (2014, Section 4).
 • print, summary: These functions print out information about the selection method, screening, initial weights, version (sequential or parallel), size of the random subspace and number of simulations.
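The ROC-type curve described for the roc function can be computed from an ordering with a few lines of Python. This is a sketch under stated assumptions, not the package code: the function name roc_for_ordering is hypothetical, the point (0, 0) is added so that the curve starts at the origin, and the AUC is obtained with the trapezoidal rule, which may differ from what the package does internally.

```python
def roc_for_ordering(order, true_model):
    """Return the points (FPR(s), TPR(s)) for s = 1..p and the AUC.

    Assumes that both the set of relevant variables and its complement
    are nonempty."""
    t = set(true_model)
    n_pos = len(t)               # |t|
    n_neg = len(order) - n_pos   # |t^c|
    tp = fp = 0
    points = [(0.0, 0.0)]        # starting point added by this sketch
    for i in order:              # t_hat(s) grows by one variable per step
        if i in t:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    # Trapezoidal rule over consecutive points of the curve.
    auc = sum((x1 - x0) * (y0 + y1) / 2
              for (x0, y0), (x1, y1) in zip(points, points[1:]))
    return points, auc

# All relevant variables ranked first: the AUC is (numerically) one.
_, auc_perfect = roc_for_ordering([3, 1, 4, 2, 5], true_model=[1, 3])
# One spurious variable ranked ahead of the relevant ones.
_, auc_worse = roc_for_ordering([4, 3, 1, 2, 5], true_model=[1, 3])
print(auc_perfect, auc_worse)
```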
3.2 Package demonstration
3.3 Parallel implementation
The package provides two parallel versions of the algorithm:
 (1) using the MPI framework (based on package Rmpi), option parallel=MPI,
 (2) using process forking (based on package doParallel), option parallel=POSIX.
In order to use the parallel=MPI option, the MPI framework and the Rmpi package (Hao 2002) must be installed. A guideline for installing MPI on multiple machines under Ubuntu is given in the “Appendix”. Rmpi is a wrapper of MPI for the R language. Its main advantage is that writing R programs that use MPI becomes much easier and is possible even for nonprogrammers. The optimal value of nslaves under MPI is the total number of computing cores of all machines configured in the MPI framework. Note that after execution of the regRSM function, it is necessary to close the MPI framework by calling the mpi.close.Rslaves() function. Function regRSM does not close the MPI framework itself for efficiency reasons, as the creation of slaves is usually very time consuming. For example, if one would like to execute regRSM multiple times in a row (e.g., for different datasets or with different settings), the command mpi.close.Rslaves() should be used only after the last execution of regRSM. In this way the MPI initialization is executed only once, and each subsequent parallel call reuses the existing slave processes. In order to change the number of slaves, MPI needs to be terminated (using the command mpi.close.Rslaves()), and only then can function regRSM be called with a new value of the parameter nslaves.
The other parallelization option (parallel=POSIX) does not require any preinstalled software except for the R package doParallel (Revolution Analytics and Weston 2013). The limitation of this parallel mode is that the execution can be performed on a single logical machine only. POSIX uses an OpenMP-like parallel implementation. The parallel execution is handled by the doParallel library. The optimal value of nslaves is the number of processor cores in the machine. In contrast to the MPI option, there is no need to shut down the workers (slaves), because they cease to operate when the master R session is completed (or its process dies).
The internal implementations of the two parallelization options differ in how the parallel processes of variable selection, partial model building and variable evaluation are created and synchronized. The POSIX path delegates the handling of parallelism to the operating system. The MPI path requires an explicit sequence of messages for starting new slaves, assigning tasks to them, collecting the results and reassigning tasks.
In the case of parallel=MPI, Algorithm 1 is replaced by Algorithm 3 below.
Algorithm 3 (RSM procedure, MPI version)
 1. Input (for Master): observed data (Y, X), a number of subset draws B, a size of the subspace \(m<\min (p,n)\).
 2. Master: Send observed data (Y, X) and parameter m to each slave.
 3. Master: Compute task_number=B/nslaves.
 4. Master: Send task_number to each slave as its \(B^{local}\), except for the last one, which gets the remaining number of tasks B-task_number*(nslaves-1) as its \(B^{local}\).
 5. Slave: Repeat the following procedure for \(k=1,\ldots , B^{local}\) starting with \(C^{local}_{i,0}=0\) for any variable i.
    • Randomly draw a subset of variables \(m^{*}\) (without replacement) from the original variable space with the same probability for each variable.
    • Fit model to data \((Y,X_{m^*})\) and compute weight \(w(i,m^{*})\ge 0\) for each variable \(i\in m^{*}\). Set \(w(i,m^{*})=0\) if \(i\notin m^{*}\).
    • Update the counter \(C^{local}_{i,k}= C^{local}_{i,k-1}+I\{i\in m^{*}\}.\)
 6. Slave: For each variable i compute the partial sum
    $$\begin{aligned} S_{i}^{local}=\sum \limits _{m^{*}:i\in m^{*}}w(i,m^{*}). \end{aligned}$$
 7. Slave: Send vectors \((S_{1}^{local},\ldots ,S_{p}^{local})\) and \((C^{local}_{1,B^{local}},\ldots ,C^{local}_{p,B^{local}})\) to the Master.
 8. Master: Compute final scores:
    $$\begin{aligned} W_{i}^{*}=\frac{\sum _{\text {slaves}}S_{i}^{local}}{\sum _{\text {slaves}}C_{i,B^{local}}^{local}}. \end{aligned}$$
 9. Master: Sort the list of variables according to scores \(W_i^{*}\): \(W_{i_1}^{*}\ge W_{i_2}^{*}\ge \cdots \ge W_{i_{p}}^{*}\).
 10. Output: Ordered list of variables \(\{i_1,\ldots ,i_{p}\}\).
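The master-side bookkeeping of the MPI version (splitting the B draws among the slaves and merging their partial results) can be sketched in Python without any message passing. The function names are hypothetical, and reading task_number=B/nslaves as integer division is an assumption of this sketch.

```python
def split_tasks(B, nslaves):
    """Number of subset draws B_local assigned to each slave."""
    task_number = B // nslaves                   # integer division (assumption)
    local = [task_number] * (nslaves - 1)
    # The last slave takes whatever remains so that the total is exactly B.
    local.append(B - task_number * (nslaves - 1))
    return local

def merge_scores(partial_sums, partial_counts):
    """Final scores W_i* from the per-slave sums S_i and counters C_i."""
    p = len(partial_sums[0])
    scores = []
    for i in range(p):
        s = sum(ps[i] for ps in partial_sums)
        c = sum(pc[i] for pc in partial_counts)
        # Score 0 for a variable never drawn by any slave (assumption).
        scores.append(s / c if c > 0 else 0.0)
    return scores

print(split_tasks(10, 4))
print(merge_scores([[1.0, 0.5], [3.0, 0.0]], [[2, 1], [2, 0]]))
```

Because each slave returns sums and counters rather than averages, the merged scores equal those of a sequential run that made the same draws.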
When regRSM is called with the parallel=MPI or parallel=POSIX option, the software checks for the presence of the Rmpi or doParallel package, respectively. If the required package is not installed, regRSM is not executed and an error message is displayed.
4 Efficiency
The experiments were run on two hardware settings:
 (1) one physical computer with 16 cores (4 \(\times \) 4-core processor Intel(R) Xeon(R) CPU X7350 @ 2.93 GHz), 64 GB RAM, Open RTE 1.4.3 mpi, R 3.0.1, Ubuntu 12.04 LTS,
 (2) four physical computers with 4 cores each (4-core processor Intel(R) Core(TM) i7-2600 CPU @ 3.40 GHz), 16 GB RAM (12 GB for VM) each, Open RTE 1.7.3 mpi, R 3.0.2, Ubuntu 12.04 LTS on Oracle VM VirtualBox.
In the first experiment, an artificial dataset containing \(n=400\) cases and \(p=1000\) explanatory variables was generated. We set \(m=\min (n,p)/2=200\). We investigated how the computational time depends on the number of simulations B. Figures 6 and 7 show the execution times and speedups against \(\log (B)\) for hardware setting (1). Observe that in this case the speedups for the POSIX version (Fig. 7b) are larger than for the MPI version (Fig. 6b), which makes the POSIX version faster. Figure 8 shows the execution times and speedups against \(\log (B)\) for hardware setting (2). Observe that the MPI version is faster on hardware setting (2) than on setting (1), and that the gain is larger than the difference in processor frequencies (3.40 vs 2.93 GHz) alone would suggest.
The observed effects of parallelization seem to be satisfactory. The speedup is not linear with respect to the number of slaves because, for MPI, overheads occur due to the transfer of the complete dataset and MPI startups. Moreover, some other tasks are executed sequentially. The MPI version on four PCs with one processor each is faster than on one server with four processors. This gives us a cheaper solution for speeding up our algorithm.
5 Application examples
5.1 Artificial data example
Table 1 Models summary
Model  \(|t|\)  \(t\)  \(\beta _t\) 

1  1  \(\{10\}\)  (0.2) 
2  3  \(\{2,4,5\}\)  (1, 1, 1) 
3  10  \(\{2k+7:k=3,\ldots ,12\}\)  \((1,\ldots ,1)\) 
4  5  \(\{k^2:k=1,\ldots ,5\}\)  \((1,\ldots ,1)\) 
5  15  \(\{1,\ldots ,15\}\)  \((2.5,\ldots ,2.5,1.5,\ldots ,1.5,0.5,\ldots ,0.5)\) 
6  15  \(\{1,\ldots ,5,11,\ldots ,15,21,\ldots ,25\}\)  \((2.5,\ldots ,2.5,1.5,\ldots ,1.5,0.5,\ldots ,0.5)\) 
7  20  \(\{1,\ldots ,20\}\)  \((1.1, 1.2,\ldots ,3)\) 
8  8  \(\{1,\ldots ,8\}\)  (0.7, 0.9, 0.4, 0.3, 1, 0.2, 0.2, 0.1) 
9  50  \(\{1,\ldots ,25,51,\ldots 75\}\)  \((0.5,\ldots ,0.5)\) 
10  50  \(\{1,\ldots ,25,51,\ldots 75\}\)  \((1,\ldots ,1)\) 
Table 2 Mean values of \(100\times \hbox {PE}\)/min(PE) based on 500 simulation trials
Model  \(|t|\)  Lasso  RSM  WRSM  UNI  CAR  Min. PE 

1  1  130.06  112.50  111.79  110.80  111.85  UNI 
(0.634)  (0.523)  (0.836)  (0.472)  (0.496)  
2  3  105.43  100.34  110.61  100.26  100.31  UNI 
(0.27)  (0.073)  (0.565)  (0.068)  (0.069)  
3  10  115.88  100.52  108.03  102.08  101.32  RSM 
(0.543)  (0.101)  (0.535)  (0.291)  (0.258)  
4  5  111.72  100.36  109.63  100.52  100.40  RSM 
(0.462)  (0.054)  (0.53)  (0.089)  (0.069)  
5  15  109.36  114.75  102.03  118.43  117.67  WRSM 
(0.513)  (0.812)  (0.311)  (0.941)  (0.895)  
6  15  112.68  125.33  101.04  129.83  127.94  WRSM 
(0.606)  (1.14)  (0.183)  (1.252)  (1.214)  
7  20  119.14  122.21  100.75  139.89  133.59  WRSM 
(0.629)  (1.908)  (0.228)  (3.258)  (2.637)  
8  8  106.96  100.96  114.95  100.62  100.61  CAR 
(0.362)  (0.154)  (0.525)  (0.101)  (0.099)  
9  50  121.67  115.98  102.85  144.62  132.96  WRSM 
(0.918)  (1.112)  (0.362)  (1.717)  (1.529)  
10  50  130.09  156.73  101.64  225.23  196.56  WRSM 
(1.279)  (3.277)  (0.397)  (4.744)  (4.326) 
Table 3 Mean values of true positive rates (TPR) based on 500 simulation trials
Model  \(|t|\)  Lasso  RSM  WRSM  UNI  CAR  Max. TPR 

1  1  0.785  0.570  0.310  0.575  0.575  Lasso 
(0.03)  (0.035)  (0.033)  (0.035)  (0.035)  
2  3  1.000  1.000  1.000  1.000  1.000  All 
(0)  (0)  (0)  (0)  (0)  
3  10  1.000  1.000  1.000  0.998  0.998  Lasso, RSM, WRSM 
(0)  (0)  (0)  (0.001)  (0.001)  
4  5  1.000  1.000  1.000  1.000  1.000  All 
(0)  (0)  (0)  (0)  (0)  
5  15  0.993  0.811  0.986  0.786  0.791  Lasso 
(0.001)  (0.005)  (0.002)  (0.005)  (0.005)  
6  15  0.989  0.747  0.981  0.727  0.737  Lasso 
(0.002)  (0.005)  (0.002)  (0.005)  (0.005)  
7  20  1.000  0.979  1.000  0.962  0.968  Lasso, WRSM 
(0)  (0.002)  (0)  (0.003)  (0.003)  
8  8  0.858  0.838  0.702  0.831  0.834  Lasso 
(0.006)  (0.005)  (0.007)  (0.005)  (0.005)  
9  50  0.998  0.918  0.954  0.840  0.876  Lasso 
(0.001)  (0.003)  (0.003)  (0.005)  (0.005)  
10  50  1.000  0.951  0.992  0.881  0.910  Lasso 
(0)  (0.003)  (0.001)  (0.006)  (0.005) 
Table 4 Mean values of false discovery rates (FDR) based on 500 simulation trials
Model  \(|t|\)  Lasso  RSM  WRSM  UNI  CAR  Min. FDR 

1  1  0.996  0.935  0.946  0.925  0.930  UNI 
(0)  (0.005)  (0.009)  (0.006)  (0.005)  
2  3  0.223  0.035  0.643  0.059  0.050  RSM 
(0.015)  (0.006)  (0.014)  (0.007)  (0.007)  
3  10  0.374  0.316  0.348  0.396  0.365  RSM 
(0.009)  (0.008)  (0.014)  (0.007)  (0.008)  
4  5  0.344  0.060  0.506  0.135  0.091  RSM 
(0.014)  (0.009)  (0.014)  (0.012)  (0.011)  
5  15  0.202  0.164  0.213  0.181  0.167  RSM 
(0.008)  (0.012)  (0.011)  (0.014)  (0.013)  
6  15  0.247  0.240  0.150  0.246  0.252  WRSM 
(0.008)  (0.015)  (0.011)  (0.015)  (0.014)  
7  20  0.203  0.273  0.026  0.342  0.315  WRSM 
(0.007)  (0.015)  (0.006)  (0.016)  (0.016)  
8  8  0.154  0.065  0.605  0.047  0.053  UNI 
(0.011)  (0.008)  (0.012)  (0.006)  (0.007)  
9  50  0.639  0.222  0.188  0.257  0.243  WRSM 
(0.01)  (0.007)  (0.007)  (0.008)  (0.008)  
10  50  0.463  0.308  0.193  0.333  0.319  WRSM 
(0.01)  (0.008)  (0.008)  (0.008)  (0.01) 
Table 5 Mean values of model sizes based on 500 simulation trials
Model  \(|t|\)  Lasso  RSM  WRSM  UNI  CAR  Min. size 

1  1  221.295  11.540  11.705  9.415  10.375  UNI 
(0.457)  (0.361)  (0.989)  (0.293)  (0.297)  
2  3  4.205  3.215  11.025  3.285  3.255  RSM 
(0.11)  (0.031)  (0.429)  (0.032)  (0.032)  
3  10  16.600  15.135  16.900  17.255  16.190  RSM 
(0.258)  (0.196)  (0.389)  (0.22)  (0.207)  
4  5  8.335  5.420  12.585  6.065  5.685  RSM 
(0.208)  (0.077)  (0.402)  (0.122)  (0.098)  
5  15  19.035  15.540  19.870  15.430  15.190  CAR 
(0.207)  (0.362)  (0.348)  (0.418)  (0.39)  
6  15  20.300  16.145  18.225  16.030  16.400  UNI 
(0.228)  (0.48)  (0.308)  (0.473)  (0.459)  
7  20  25.590  30.290  20.660  33.470  31.935  WRSM 
(0.257)  (0.907)  (0.195)  (1.012)  (1.019)  
8  8  8.405  7.330  16.840  7.105  7.190  UNI 
(0.164)  (0.116)  (0.541)  (0.083)  (0.091)  
9  50  155.550  60.315  59.480  58.655  59.605  UNI 
(2.869)  (0.73)  (0.632)  (1.013)  (0.932)  
10  50  101.120  70.975  62.825  68.955  69.665  WRSM 
(1.944)  (0.991)  (0.73)  (1.143)  (1.218) 
Table 6 Ratio of the performance measure for the default subspace size \(m_{def}\) to the performance measure corresponding to the optimal subspace size
Model  1  2  3  4  5  6  7  8  9  10  
\(\frac{PE(m_{def})}{\min _{m}PE(m)}\)  1.13  1.00  1.00  1.00  1.00  1.00  1.03  1.00  1.02  1.12 
\(\frac{TPR(m_{def})}{\max _{m}TPR(m)}\)  0.92  1.00  1.00  1.00  1.00  1.00  1.00  0.98  0.99  0.99 
\(\frac{FDR(m_{def})}{\min _{m}FDR(m)}\)  1.03  1.33  1.19  2.09  1.26  1.27  1.15  1.00  1.04  1.02 
\(\frac{LEN(m_{def})}{\min _{m}LEN(m)}\)  1.28  1.01  1.08  1.04  1.08  1.12  1.10  1.00  1.03  1.04 

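The methods are compared using the following measures, where \(\hat{t}\) denotes the selected set of variables and t the set of relevant variables:
 • true positive rate (TPR): \(|\hat{t}\cap t|/|t|\),
 • false discovery rate (FDR): \(|\hat{t}\setminus t|/|\hat{t}|\),
 • prediction error (PE), equal to the root mean squared error computed on an independent dataset having the same number of observations as the training set.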
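The three measures translate directly into code. The following Python sketch uses hypothetical function names; returning an FDR of zero for an empty selection is a convention chosen here, not something stated in the text.

```python
import math

def tpr(selected, true_model):
    """True positive rate: share of relevant variables that were selected."""
    s, t = set(selected), set(true_model)
    return len(s & t) / len(t)

def fdr(selected, true_model):
    """False discovery rate: share of selected variables that are spurious."""
    s, t = set(selected), set(true_model)
    return len(s - t) / len(s) if s else 0.0   # empty selection: convention

def prediction_error(y_true, y_pred):
    """Root mean squared error on an independent dataset."""
    n = len(y_true)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / n)

selected = [1, 2, 3, 7]          # hypothetical selected model
true_model = [1, 2, 3, 4, 5]     # hypothetical set of relevant variables
print(tpr(selected, true_model))                             # 3 of 5 relevant found
print(fdr(selected, true_model))                             # 1 of 4 selected is spurious
print(prediction_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```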
Table 2 shows values of \(100\times \text {PE}/\min (\text {PE})\) averaged over 500 simulations, where \(\min (\text {PE})\) is the minimal value of the prediction error among the 5 considered methods. The last column indicates the method for which PE was minimal. For 5 of the 10 models the WRSM outperforms all competitors with respect to PE. Table 3 shows values of TPR; the last column indicates the method for which the maximal TPR is attained. Observe that in the majority of models, TPR for the lasso is close to one, which indicates that the lasso selects most of the relevant variables. However, for most models the differences between the lasso and the WRSM are negligible (models 1 and 8 are the exceptions). Table 4 shows values of FDR; the last column indicates the method for which the minimal FDR is obtained. Here, the WRSM outperforms the other methods in four out of ten cases. Table 5 contains information about the sizes of the chosen models: RSM usually selects much smaller models than the lasso.
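The normalized prediction error reported in Table 2 can be computed as in the Python sketch below. It assumes that the normalization by \(\min (\text {PE})\) is applied within each simulation trial before averaging, a reading that is consistent with no column of Table 2 being exactly 100. The function name and the toy numbers are hypothetical.

```python
def relative_pe(pe_by_trial):
    """Average of 100 * PE / min(PE) over trials.

    pe_by_trial is a list with one dict {method: PE} per simulation trial;
    the minimum is taken over the methods within each trial (assumption)."""
    methods = list(pe_by_trial[0])
    totals = {m: 0.0 for m in methods}
    for trial in pe_by_trial:
        best = min(trial.values())
        for m in methods:
            totals[m] += 100.0 * trial[m] / best
    return {m: totals[m] / len(pe_by_trial) for m in methods}

# Two toy trials with two hypothetical methods A and B.
trials = [{"A": 2.0, "B": 4.0}, {"A": 3.0, "B": 1.5}]
result = relative_pe(trials)
print(result)
```

Under this reading a method attains the value 100 only if it has the smallest prediction error in every single trial.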
The clear advantage of the WRSM over other methods can be seen for models with a large number of relevant variables (e.g., 7, 9, 10). For these models, the significant variables are usually placed at the top of the ranking list when the WRSM is used. This is not necessarily the case for other methods. For example, in the case of model 7, the position of the last relevant variable in the ranking list (averaged over simulation trials) is: 20 (lasso), 64.76 (RSM), 20 (WRSM), 124.88 (UNI), 99.68 (CAR). This indicates that for the lasso and the WRSM, all relevant variables are placed ahead of spurious ones in all simulations. On the other hand, in some situations (usually for models with a small number of significant variables, e.g., 2, 4, 8) the WRSM has quite large FDRs compared to other methods. In these cases, the relevant variables are also placed at the top of the ranking list (TPRs are close to one), but the Bayesian Information Criterion used in the second step selects too many spurious variables to the final model. However, as this behaviour occurs for small values of \(|t|\), the number of false positives is also small in absolute terms.
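The position of the last relevant variable in a ranking list, used above to compare the methods on model 7, is simple to compute. In the Python sketch below the function name is hypothetical and positions are counted from 1; a value equal to the number of relevant variables means that every relevant variable precedes all spurious ones.

```python
def last_relevant_position(order, true_model):
    """1-based position of the lowest-ranked relevant variable."""
    t = set(true_model)
    return max(pos for pos, i in enumerate(order, start=1) if i in t)

# Both relevant variables ranked first: the position equals |t| = 2.
best_case = last_relevant_position([3, 1, 4, 2, 5], true_model=[1, 3])
# A spurious variable ranked first pushes the last relevant one to position 3.
shifted = last_relevant_position([4, 3, 1, 2, 5], true_model=[1, 3])
print(best_case, shifted)
```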
Table 7 Results for 3-fold cross-validation
Method  PE (BIC)  \(|\hat{t}|\) (BIC)  PE (VAL)  \(|\hat{t}|\) (VAL) 

RSM  6.45  10  6.85  44.33 
WRSM  6.59  28.33  6.92  22.33 
CAR  6.55  10.66  6.79  21.33 
UNI  6.69  14.66  7.08  27.33 
Lasso  7.69  189.33  7.69  189.33 
Naive  14.75  1  14.75  1 
5.2 Real data example
Table 8 Top 50 variables for real dataset
Rank  RSM  RL  WRSM  RL  CAR  RL  UNI  RL  Lasso  

1  cg16867657  1  cg16867657  1  cg08097417  6  cg16867657  1  cg16867657 
2  cg08097417  6  cg08097417  6  cg14361627  4  cg06639320  3  cg10501210 
3  cg06639320  3  cg14361627  4  cg16867657  1  cg24724428  cg06639320  
4  cg14361627  4  cg10501210  2  cg22454769  5  cg22454769  5  cg14361627 
5  cg10501210  2  cg06639320  3  cg06639320  3  cg10501210  2  cg22454769 
6  cg22454769  5  cg22454769  5  cg07955995  cg24079702  cg08097417  
7  cg23606718  46  cg07955995  cg09499629  cg07553761  cg19283806  
8  cg07955995  cg24079702  cg24079702  cg21572722  cg14692377  
9  cg24079702  cg02650266  12  cg10501210  2  cg19283806  7  cg16054275  
10  cg24724428  cg10591771  cg04875128  cg06784991  cg08234504  
11  cg00748589  cg24955895  cg19283806  7  cg08234504  10  cg07082267  
12  cg09499629  cg22285878  cg22285878  cg04875128  cg02650266  
13  cg17110586  cg24724428  cg03607117  39  cg14692377  8  cg03399905  
14  cg02650266  12  cg26690592  cg22736354  20  cg01974375  17  cg20426994  
15  cg07850154  cg09499629  cg03032497  21  cg16054275  9  cg23091758  
16  cg04875128  cg23606718  46  cg06493994  cg22736354  20  cg04400972  
17  cg21572722  cg08719712  cg21572722  cg23744638  cg01974375  
18  cg17621438  cg21572722  cg20426994  14  cg07547549  cg08128734  
19  cg20426994  14  cg07850154  cg08719712  cg08160331  cg18618815  
20  cg06493994  cg10804656  cg24724428  cg02650266  12  cg22736354  
21  cg19283806  7  cg00748589  cg01528542  cg08128734  18  cg03032497  
22  cg14692377  8  cg17621438  cg00748589  cg23500537  cg04581938  
23  cg03473532  cg06493994  cg25478614  cg23078123  37  cg22796704  
24  cg22796704  23  cg03399905  13  cg25410668  cg03032497  21  cg21801378  
25  cg07547549  cg17110586  cg23091758  15  cg22796704  23  cg06240854  
26  cg22285878  cg00058879  cg18473521  cg25994988  cg25428494  
27  cg18933331  cg16312514  cg23606718  46  cg15804973  cg26290632  
28  cg03399905  13  cg20426994  14  cg23500537  cg08097417  6  cg22016779  
29  cg10804656  cg08540945  cg21186299  cg16932827  cg15707833  
30  cg07082267  11  cg17802949  cg16008966  40  cg14361627  4  cg16193278  
31  cg17802949  cg01528542  cg07547549  cg08262002  cg01820962  
32  cg07553761  cg00590036  cg14692377  8  cg16419235  cg04604946  
33  cg11693709  cg07553761  cg06419846  cg24466241  cg17183905  
34  cg18468088  cg20149168  cg17110586  cg03259243  cg13221458  
35  cg16008966  40  cg20275558  cg07927379  cg07082267  11  cg11067179  
36  cg01528542  cg18450254  cg17621438  cg01763090  cg22213242  
37  cg13327545  cg03607117  39  cg04581938  22  cg00292135  cg23078123  
38  cg03607117  39  cg10397932  cg04084157  cg03735592  cg18310639  
39  cg26290632  27  cg15707833  29  cg20052760  cg01820374  cg03607117  
40  cg02018902  cg18633600  cg08160331  cg01528542  cg16008966  
41  cg18450254  cg02452500  48  cg00481951  cg10149533  cg05207048  
42  cg02328239  cg26290632  27  cg26290632  27  cg22156456  cg06231995  
43  cg20822990  cg21186299  cg04400972  16  cg17110586  cg15894389  
44  cg08160331  cg17715419  cg02650266  12  cg07080372  cg02481950  
45  cg09310092  cg16008966  40  cg03399905  13  cg09809672  cg02924487  
46  cg09809672  cg06419846  cg01763090  cg24711336  cg23606718  
47  cg16312514  cg04940570  cg07553761  cg26161329  cg15936446  
48  cg08719712  cg05576959  cg11067179  35  cg03431918  cg02452500  
49  cg00094518  cg04875128  cg08128734  18  cg22285878  cg10106965  
50  cg04084157  cg19283806  7  cg10804656  cg05991454  cg01541867 
Table 7 shows the prediction errors and the sizes of the selected models, averaged over 3 cross-validation splits. Bold marks the minimal value in each column. All methods improve significantly on the naive method. For the lasso we obtain larger prediction errors and much larger models than for the other considered methods. When the final model is chosen using BIC, the RSM yields the smallest error (see the first column). When the selection is based on a validation set, the CAR method performs best (see the third column).
Figure 11a shows the prediction errors (averaged over 3 folds) with respect to the number of variables included in the model. Observe that when the number of variables included in the final model is sufficiently large, the prediction errors for the WRSM are smaller than for the competing methods. Figure 11b shows the prediction errors with respect to the subspace size m for the RSM. The vertical line corresponds to the default subspace size.
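The choice of the final model from the nested list given by a ranking, by BIC or by a validation set as discussed above, can be sketched in base R as follows. This is only an illustration under stated assumptions, not the regRSM implementation: the function name select_from_ranking and its arguments are hypothetical, BIC is taken in its common Gaussian form up to an additive constant (the information criterion used by the package may be defined differently), and the ranking in the usage example is a simple placeholder based on absolute marginal correlations.

```r
# Illustrative sketch only; not the regRSM implementation.
# Picks the best model from the nested list defined by a variable ranking `ord`.
select_from_ranking <- function(x, y, ord, kmax, xval = NULL, yval = NULL) {
  n <- nrow(x)
  crit <- sapply(seq_len(kmax), function(k) {
    cols <- ord[seq_len(k)]                       # k top-ranked variables
    fit <- lm.fit(cbind(1, x[, cols, drop = FALSE]), y)
    if (is.null(xval)) {
      # BIC for a Gaussian linear model, up to an additive constant
      n * log(sum(fit$residuals^2) / n) + (k + 1) * log(n)
    } else {
      # mean squared prediction error on the validation set
      pred <- cbind(1, xval[, cols, drop = FALSE]) %*% fit$coefficients
      mean((yval - pred)^2)
    }
  })
  ord[seq_len(which.min(crit))]                   # variables of the chosen model
}

# Simulated data: only variables 1 and 2 are informative
set.seed(2)
n <- 200; p <- 20
x <- matrix(rnorm(n * p), n, p)
y <- 2 * x[, 1] - 3 * x[, 2] + rnorm(n)
train <- 1:150; val <- 151:200

# Placeholder ranking (assumption): absolute marginal correlation with y
ord <- order(abs(cor(x[train, ], y[train])), decreasing = TRUE)

sel_bic <- select_from_ranking(x[train, ], y[train], ord, kmax = 10)
sel_val <- select_from_ranking(x[train, ], y[train], ord, kmax = 10,
                               xval = x[val, ], yval = y[val])
```

Both calls return the indices of the variables in the chosen model. The sketch requires kmax + 1 to be smaller than the number of training observations so that every nested model can be fitted by least squares.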
Table 8 shows the rankings of the top 50 variables obtained using the considered methods. Since in the original paper (Hannum et al. 2013) lasso was used to assess the relevance of the variables, we compare the rankings obtained by RSM, WRSM, CAR and UNI with the one based on lasso. RL in Table 8 denotes the position of the given variable in the ranking based on lasso (an empty space means that the given variable is not ranked among the top 50 variables by lasso). Note that the rankings corresponding to the considered methods are not fully concordant, which may be valuable in biological research, as some new relevant variables can potentially be discovered. The 6 variables recognized as the most significant by lasso also occupy the top 6 positions in the RSM and WRSM rankings. Interestingly, cg08097417 is the second most important variable according to RSM/WRSM (and the most important variable according to CAR), whereas lasso places it in the 6th position. Finally, observe that cg14361627 (4th position according to lasso and RSM) is not recognized as a very relevant variable by UNI, which may suggest that this variable is relatively weakly correlated with the response but becomes useful when considered jointly with other variables.
6 Summary
In this paper we presented novel variants of the RSM together with their implementation in the R package regRSM. The method does not impose any conditions on the number of candidate variables. We discussed the underlying algorithms. The first step of our procedure is based on fitting linear models on small, randomly chosen subsets of variables, and thus it allows for parallelization. Two versions of the parallel implementation were presented. Moreover, other improvements of the original method were introduced, including an initial screening of variables and their weighting in the sampling process. The article also presents an empirical evaluation of our implementation, covering its efficiency in identifying the significant variables, its prediction power, and the speed-up obtained from the parallel implementations. The method and its weighted variant compare well with other methods tailored to the highdimensional setup (like lasso or CAR scores) and are amenable to parallelization under various hardware settings (single and multiple physical machines) and parallel software environments (MPI, POSIX).
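The first step summarized above, fitting linear models on small random subsets of variables, can be illustrated with a minimal base R sketch. It is not the code of regRSM: the function name rsm_scores is hypothetical, and we assume here that the importance of a variable is the average of its squared t-statistics over the random subspaces in which it was drawn. The unweighted variant is shown, with subspaces sampled uniformly.

```r
# Illustrative sketch only; not the regRSM implementation.
# Assumption: the score of a variable is the mean of its squared t-statistics
# over the random subspaces in which the variable was drawn.
rsm_scores <- function(x, y, m, B) {
  p <- ncol(x)
  score_sum <- numeric(p)   # accumulated squared t-statistics
  n_drawn   <- numeric(p)   # number of subspaces containing each variable
  for (b in seq_len(B)) {
    idx <- sample.int(p, m)                  # random subspace of size m
    xs  <- x[, idx, drop = FALSE]
    fit <- lm(y ~ xs)                        # least squares fit on the subspace
    tstat <- coef(summary(fit))[-1, "t value"]   # drop the intercept row
    score_sum[idx] <- score_sum[idx] + tstat^2
    n_drawn[idx]   <- n_drawn[idx] + 1
  }
  ifelse(n_drawn > 0, score_sum / n_drawn, 0)
}

# Simulated data: only variables 1 and 2 are informative
set.seed(1)
n <- 100; p <- 50
x <- matrix(rnorm(n * p), n, p)
y <- 2 * x[, 1] - 3 * x[, 2] + rnorm(n)

scores  <- rsm_scores(x, y, m = 5, B = 500)
ranking <- order(scores, decreasing = TRUE)  # variable indices, most important first
```

Since the B fits are independent of each other, the loop body is the natural unit to distribute across workers. The subspace size m must be smaller than n - 1 so that each model can be fitted and its t-statistics computed.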
Acknowledgments
Research of Paweł Teisseyre and Robert A. Kłopotek was supported by the European Union from resources of the European Social Fund within project ‘Information technologies: research and their interdisciplinary applications’ POKL.04.01.0100051/1000.
References
 Breiman L (2001) Random forests. Mach Learn 45(1):5–32
 Chen J, Chen Z (2008) Extended Bayesian information criteria for model selection with large model spaces. Biometrika 95:759–771
 Cheng J, Levina E, Wang P, Zhu J (2014) A sparse Ising model with covariates. Biometrics 70:943–953
 Donoho DL (2000) Highdimensional data analysis: the curses and blessings of dimensionality. Aidememoire of a lecture at AMS conference on math challenges of the 21st century
 Fan J, Lv J (2008) Sure independence screening for ultrahigh dimensional feature space (with discussion). J R Stat Soc B 70:849–911
 Fan Y, Tang CY (2013) Tuning parameter selection in high dimensional penalized likelihood. J R Stat Soc Ser B (Stat Methodol) 75(3):531–552
 Feldman B (2005) Relative importance and value. http://ssrn.com/abstract=2255827
 Friedman JH (1991) Multivariate adaptive regression splines. Ann Stat 19(1):1–67
 Friedman J, Hastie T, Tibshirani R (2010) Regularization paths for generalized linear models via coordinate descent. J Stat Softw 33(1):1–22
 Gentle JE (2007) Matrix algebra: theory, computations, and applications in statistics. Springer, New York
 Grömping U (2006) Relative importance for linear regression in R: the package relaimpo. J Stat Softw 17(1):1–27
 Hannum G, Guinney J, Zhao L, Zhang L, Hughes G, Sadda S, Klotzle B, Bibikova M, Fan JB, Gao Y, Deconde R, Chen M, Rajapakse I, Friend S, Ideker T, Zhang K (2013) Genomewide methylation profiles reveal quantitative views of human aging rates. Mol Cell 49(2):359–367
 Hao Y (2002) Rmpi: parallel statistical computing in R. R News 2(2):10–14. http://cran.rproject.org/doc/Rnews/Rnews_20022.pdf
 Hastie T, Tibshirani R, Friedman J (2009) The elements of statistical learning: data mining, inference and prediction. Springer. http://wwwstat.stanford.edu/tibs/ElemStatLearn/
 Ho TK (1998) The random subspace method for constructing decision forests. IEEE Trans Pattern Anal Mach Intell 20(8):832–844
 Huang J, Ma S, Zhang CH (2008) Adaptive lasso for highdimensional regression models. Stat Sin 18:1603–1618
 Jolliffe IT (1982) A note on the use of principal components in regression. J R Stat Soc Ser C (Appl Stat) 31(3):300–303
 Kuhn M (2008) Building predictive models in R using the caret package. J Stat Softw 28(5):1–26
 Lichman M (2013) UCI machine learning repository. http://archive.ics.uci.edu/ml
 Lindemann R, Merenda P, Gold R (1980) Introduction to bivariate and multivariate analysis. Scott Foresman, Glenview
 Martens H (2001) Reliable and relevant modelling of real world data: a personal account of the development of PLS regression. Chemom Intell Lab Syst 58(2):85–95
 Mielniczuk J, Teisseyre P (2014) Using random subspace method for prediction and variable importance assessment in regression. Comput Stat Data Anal 71:725–742
 Rencher AC, Schaalje GB (2008) Linear models in statistics. Wiley, Hoboken
 Revolution Analytics, Weston S (2013) doParallel: foreach parallel adaptor for the parallel package. http://CRAN.Rproject.org/package=doParallel. R package version 1.0.6
 Shao J, Deng X (2012) Estimation in highdimensional linear models with deterministic covariates. Ann Stat 40(2):812–831
 Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc B 58:267–288
 Wold S (2001) Personal memories of the early PLS development. Chemom Intell Lab Syst 58(2):83–84
 Zhang CH, Zhang T (2012) A general theory of concave regularization for highdimensional sparse estimation problems. Stat Sci 27(4):576–593
 Zhang Y, Li R, Tsai CL (2012) Regularization parameter selections via generalized information criterion. J Am Stat Assoc 105:312–323
 Zheng X, Loh WY (1997) A consistent variable selection criterion for linear models with highdimensional covariates. Stat Sin 7:311–325
 Zou H, Hastie T (2005) Regularization and variable selection via the elastic net. J R Stat Soc B 67(2):301–320
 Zuber V, Strimmer K (2011) Highdimensional regression and variable selection using car scores. Stat Appl Genet Mol Biol 10(1):301–320
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.