Selective inference via marginal screening for high dimensional classification
Abstract
Post-selection inference is a statistical technique for determining salient variables after model or variable selection. Recently, selective inference, a kind of post-selection inference framework, has garnered attention in the statistics and machine learning communities. By conditioning on a specific variable selection procedure, selective inference can properly control the so-called selective type I error, a type I error conditional on the variable selection procedure, without imposing excessive additional computational costs. While selective inference provides a valid hypothesis testing procedure, the main focus has hitherto been on Gaussian linear regression models. In this paper, we develop a selective inference framework for binary classification problems. We consider a logistic regression model after variable selection based on marginal screening, and derive the high dimensional statistical behavior of the post-selection estimator. This enables us to asymptotically control the selective type I error for the purposes of hypothesis testing after variable selection. We conduct several simulation studies to confirm the statistical power of the test, and compare our proposed method with data splitting and other methods.
Keywords
High dimensional asymptotics · Hypothesis testing · Logistic regression · Post-selection inference · Marginal screening

1 Introduction
Discovering statistically significant variables in high dimensional data is an important problem for many applications, such as bioinformatics, materials informatics, and econometrics, to name a few. To achieve this, in a regression model for example, data analysts often attempt to reduce the dimensionality of the model by utilizing a particular model selection or variable selection method; the Lasso (Tibshirani 1996) and marginal screening (Fan and Lv 2008) are frequently used in model selection contexts. In many applications, data analysts conduct statistical inference based on the selected model as if it were known a priori, but this practice has been referred to as “a quiet scandal in the statistical community” by Breiman (1992). If we select a model based on the available data, then we have to pay heed to the effect of model selection when we conduct statistical inference, because the selected model is no longer deterministic but random, and statistical inference after model selection is affected by selection bias. In hypothesis testing of the selected variables, the validity of the inference is compromised when a test statistic is constructed without taking the model selection effect into account; as a consequence, we can no longer effectively control the type I error, or false-positive rate. This kind of problem falls under the banner of post-selection inference in the statistical community and has recently attracted a lot of attention (see, e.g., Berk et al. 2013; Efron 2014; Barber and Candès 2016; Lee et al. 2016).
Post-selection inference typically consists of the following two steps:

Selection: The analyst chooses a model or a subset of variables and constructs a hypothesis based on the data.

Inference: The analyst tests the hypothesis by using the selected model.
Data splitting is the most common procedure for selection bias correction. In a high dimensional linear regression model, Wasserman and Roeder (2009) and Meinshausen et al. (2009) succeeded in assigning a p value to each selected variable by splitting the data into two subsets. Specifically, they first reduce the dimensionality of the model using the first subset, and then make the final selection using the second subset, assigning a p value based on classical least squares estimation. While such a data splitting method is mathematically valid and straightforward to implement, it leads to low power for extracting truly significant variables because only subsamples, whose size is obviously smaller than that of the full sample, can be used in each of the selection and inference steps.
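The data-splitting recipe just described can be sketched in a few lines for the logistic setting considered later in this paper. This is our own minimal illustration, not the exact procedure of Wasserman and Roeder (2009): the helper names, the marginal score \(|\tilde{{\varvec{x}}}_j^\top ({\varvec{y}}-\bar{y}{\varvec{1}})|\), and the plain Newton solver are all assumptions made for the sketch.

```python
import math
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def marginal_screening(X, y, K):
    # Rank features by the absolute marginal score |x_j^T (y - ybar)|.
    scores = np.abs(X.T @ (y - y.mean()))
    return np.sort(np.argsort(scores)[::-1][:K])

def logistic_mle(X, y, n_iter=50):
    # Plain Newton-Raphson for the (unpenalized) logistic MLE.
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        H = X.T @ (X * (p * (1 - p))[:, None])  # observed information
        beta += np.linalg.solve(H + 1e-8 * np.eye(len(beta)), X.T @ (y - p))
    return beta

def data_splitting_pvalues(X, y, K, seed=0):
    # Screen on one half of the data, then test on the other half.
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    first, second = idx[: n // 2], idx[n // 2:]
    S = marginal_screening(X[first], y[first], K)          # selection step
    X2, y2 = X[second][:, S], y[second]                    # inference step
    beta = logistic_mle(X2, y2)
    p = sigmoid(X2 @ beta)
    cov = np.linalg.inv(X2.T @ (X2 * (p * (1 - p))[:, None]))
    z = beta / np.sqrt(np.diag(cov))
    # Two-sided Wald p-values: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    pvals = np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in z])
    return S, pvals
```

Because the two halves are independent, the Wald test in the second step needs no selection adjustment, which is exactly the appeal (and the power cost) of data splitting.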
As an alternative, simultaneous inference, which takes all possible subsets of variables into account, has been developed for correcting selection bias. Berk et al. (2013) showed that the type I error can be successfully controlled even if the full sample is used in both the selection and inference steps, by adjusting for the multiplicity of model selection. However, since the number of possible subsets of variables increases exponentially, the computational cost associated with this method becomes excessive when the dimension of the parameter is greater than 20.
On the other hand, selective inference, which takes only the selected model into account, is another approach to post-selection inference, and provides a new framework for combining selection and hypothesis testing. Since hypothesis testing is conducted only for the selected model, it makes sense to condition on the event that “a certain model is selected”. This event is referred to as a selection event, and we conduct hypothesis testing conditional on it; thus, we can avoid having to compare coefficients across two different models. Recently, Lee et al. (2016) succeeded in using this approach to conduct hypothesis testing by constructing confidence intervals for variables selected by the Lasso in a linear regression modeling context. When a specific confidence interval is constructed, the corresponding hypothesis test can be successfully conducted. They also showed that the type I error conditioned on the selection event, called the selective type I error, can be appropriately controlled. It is noteworthy that, by conditioning on a selection event in a certain class, we can construct exact p values, in the sense of conditional inference, based on a truncated normal distribution.
Almost all studies following the seminal work of Lee et al. (2016), however, focus on linear regression models; in particular, normality of the noise is crucial for controlling the selective type I error. To relax this assumption, Tian and Taylor (2017) developed an asymptotic theory for selective inference in a generalized linear modeling context. Although their results are applicable to high dimensional and low sample size data, only the global null hypothesis, that is, the hypothesis that all regression coefficients are zero, can be tested, just as with the covariance test (Lockhart et al. 2014). On the other hand, Taylor and Tibshirani (2018) proposed a procedure for testing individual hypotheses in a logistic regression model with the Lasso. By debiasing the Lasso estimator for both the active and inactive variables, they derive a joint asymptotic distribution of the debiased Lasso estimator and conduct hypothesis testing for the regression coefficients individually. However, the method is justified only in low dimensional scenarios since it relies on standard fixed dimensional asymptotics.
Our main contribution is to show that, by utilizing marginal screening as the variable selection method, the selective type I error rate for a logistic regression model can be appropriately controlled even in a high dimensional asymptotic scenario. In addition, our method is applicable not only to testing the global null hypothesis but also to hypotheses pertaining to individual regression coefficients. Specifically, we first utilize marginal screening for the selection step, in a similar way to Lee and Taylor (2014). Then, by considering a logistic regression model for the selected variables, we derive a high dimensional asymptotic property of the maximum likelihood estimator. Using these asymptotic results, we can conduct selective inference for high dimensional logistic regression, i.e., valid hypothesis testing for variables selected from high dimensional data.
The rest of the paper is organized as follows. Section 2 briefly describes the notion of selective inference and introduces several related works. In Sect. 3, the model setting and assumptions are described. An asymptotic property of the maximum likelihood estimator of our model is discussed in Sect. 4. In Sect. 5, we conduct several simulation studies to explore the performance of the proposed method, before applying it to real-world empirical data sets in Sect. 6. Proofs of the theorems are relegated to Sect. 7. Finally, Sect. 8 offers concluding remarks and suggestions for future research in this domain.
1.1 Notation
Throughout the paper, the row and column vectors of \(X\in \mathbb {R}^{n\times d}\) are denoted by \({\varvec{x}}_i~(i=1,\ldots , n)\) and \(\tilde{{\varvec{x}}}_j~(j=1,\ldots , d)\), respectively. The \(n\times n\) identity matrix is denoted by \(I_n\). The \(\ell _2\)-norm of a vector is denoted by \(\Vert \cdot \Vert\) provided there is no confusion. For any subset \(J\subseteq \{1,\ldots , d\}\), its complement is denoted by \(J^\bot =\{1,\ldots ,d\}\backslash J\). We also denote by \({\varvec{v}}_J=(v_i)_{i\in J}\in \mathbb {R}^{|J|}\) and \(X_J=({\varvec{x}}_{J,1},\ldots ,{\varvec{x}}_{J,n})^\top \in \mathbb {R}^{n\times |J|}\) a subvector of \({\varvec{v}}\) and a submatrix of X, respectively. For a differentiable function f, we denote by \(f'\) and \(f''\) its first and second derivatives, and so on.
2 Selective inference and related works
In this section, we give an overview of the fundamental notion of selective inference through a simple linear regression model (Lee et al. 2016). We also review related existing work on selective inference.
2.1 Selective inference in linear regression model
If a subset of variables, i.e., the active set, \(\hat{S}\) is selected by the Lasso or marginal screening, the event \(\{\hat{S}=S\}\) can be written as an affine set with respect to \({\varvec{y}}\), that is, in the form of \(\{{\varvec{y}}; \, A{\varvec{y}}\le {\varvec{b}}\}\) for some non-random matrix A and vector \({\varvec{b}}\) (Lee et al. 2016; Lee and Taylor 2014); such an event \(\{\hat{S}=S\}\) is called a selection event. Lee et al. (2016) showed that if \({\varvec{y}}\) follows a normal distribution and the selection event can be written as an affine set, the following lemma holds:
Lemma 1
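To make the affine representation concrete in the linear regression case, the selection event of marginal screening, i.e., selecting the K features with the largest \(|\tilde{{\varvec{x}}}_j^\top {\varvec{y}}|\), can be encoded as \(A{\varvec{y}}\le {\varvec{0}}\). The sketch below is our own construction in the spirit of Lee and Taylor (2014); the function name and the explicit row layout are illustrative assumptions.

```python
import numpy as np

def screening_affine_set(X, y, K):
    """Build A such that A @ y <= 0 encodes the marginal-screening
    selection event {S_hat = S} together with the score signs."""
    scores = X.T @ y
    order = np.argsort(np.abs(scores))[::-1]
    S, Sc = np.sort(order[:K]), np.sort(order[K:])
    signs = np.sign(scores[S])           # signs of the selected scores
    rows = []
    for j, sj in zip(S, signs):
        aj = X[:, j]
        for k in Sc:
            ak = X[:, k]
            rows.append(ak - sj * aj)    #  x_k^T y <= s_j x_j^T y
            rows.append(-ak - sj * aj)   # -x_k^T y <= s_j x_j^T y
        rows.append(-sj * aj)            # sign consistency: s_j x_j^T y >= 0
    return np.vstack(rows), S
```

Each selected feature contributes two inequalities per discarded feature plus one sign constraint, so A has \(2K(d-K)+K\) rows, and by construction the observed \({\varvec{y}}\) satisfies \(A{\varvec{y}}\le {\varvec{0}}\).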
2.2 Related works
In selective inference, we use the same data for variable selection and statistical inference. Therefore, the selected model is not deterministic, and we cannot apply classical hypothesis testing due to selection bias.
To navigate this problem, data splitting has been commonly utilized. In data splitting, the data are randomly divided into two disjoint sets; one of them is used for variable selection and the other for hypothesis testing. This is a particularly versatile method and is widely applicable as long as we can divide the data randomly (see, e.g., Cox 1975; Wasserman and Roeder 2009; Meinshausen et al. 2009). Since the data are split at random, i.e., independently of the data, hypothesis testing in the inference step can be conducted independently of the selection step, and we need not be concerned about selection bias. It is noteworthy that data splitting can be viewed as a method of selective inference, because inference is conducted only for the variables selected in the selection step. However, a drawback of data splitting is that only part of the data is available for each step, precisely because the essence of this approach involves rendering some data available for the selection step and the remainder for the inference step. Because only a subset of the data can be used in variable selection, the risk of failing to select truly important variables increases. Similarly, the power of hypothesis testing decreases, since inference proceeds on the basis of a subset of the total data. In addition, since data splitting is executed at random, it is quite possible that the final results and conclusions will vary nontrivially depending on exactly how this split is manifested.
Following the seminal work of Lee et al. (2016), selective inference for variable selection has been intensively studied (e.g., Fithian et al. 2014; Lee and Taylor 2014; Taylor et al. 2016; Tian et al. 2018). All these methods, however, rely on the assumption of normality of the data.
2.3 Beyond normality
It is important to relax the assumption of normality in order to apply selective inference to more general cases such as generalized linear models. To the best of our knowledge, there is a dearth of research into selective inference in such generalized settings. Here, we discuss the few studies which do exist in this respect.
Fithian et al. (2014) derived an exact postselection inference for a natural parameter of exponential family, and obtained the uniformly most powerful unbiased test in the framework of selective inference. However, as suggested in their paper, the difficulty in constructing exact inference in generalized linear models emanates from the discreteness of the response distribution.
Focusing on asymptotic behavior in a generalized linear model context with the Lasso penalty, Tian and Taylor (2017) directly considered the asymptotic property of a pivotal quantity. Although their work can be applied in high dimensional scenarios, we can only test a global null, that is, \(\mathrm{H}_0:{\varvec{\beta }}^*={\varvec{0}}\), except in the linear regression model case. This is because, when we conduct selective inference for an individual coefficient, the selection event does not form a simple structure such as an affine set.
On the other hand, Taylor and Tibshirani (2018) proposed a procedure to test individual hypotheses in a logistic regression model context based on the Lasso. Their approach is fundamentally based on solving the Lasso by approximating the log-likelihood up to the second order, and on debiasing the Lasso estimator. Because the objective function then becomes quadratic, as in the linear regression model, the selection event reduces to a relatively simple affine set. After debiasing the Lasso estimator, they derive an asymptotic joint distribution of the active and inactive estimators. However, since they require fixed d dimensional asymptotics, high dimensional scenarios cannot be supported by their theory.
In this paper, we extend selective inference for logistic regression in Taylor and Tibshirani (2018) to high dimensional settings in the case where variable selection is conducted by marginal screening. We do not consider asymptotics for a d dimensional original parameter space, but for a K dimensional selected parameter space. Unfortunately, however, we cannot apply this asymptotic result directly to the polyhedral lemma (Lemma 1) in Lee et al. (2016). To tackle this problem, we consider a score function for constructing a test statistic for our selective inference framework. We first define a function \({\varvec{T}}_n({\varvec{\beta }}_{S}^*)\) based on a score function as a “source” for constructing a test statistic. To apply the polyhedral lemma to \({\varvec{T}}_n({\varvec{\beta }}_{S}^*)\), we need to asymptotically ensure that (i) the selection event is represented by affine constraints with respect to \({\varvec{T}}_n({\varvec{\beta }}_{S}^*)\), and (ii) the function in the form of \({\varvec{\eta }}^\top {\varvec{T}}_n({\varvec{\beta }}_{S}^*)\) is independent of the truncation points. Our main technical contribution herein is that, by carefully analyzing problem configuration and by introducing reasonable additional assumptions, we can show that those two requirements for the polyhedral lemma are satisfied asymptotically.
3 Setting and assumptions
As already noted, our objective herein is to develop a selective inference approach applicable to logistic regression models when the variables are selected by marginal screening. Let \((y_i,{\varvec{x}}_i)\) be the ith pair of response and regressor. We assume that the \(y_i\)’s are mutually independent random variables taking values in \(\{0,1\}\), and that the \({\varvec{x}}_i\)’s are d dimensional vectors of known constants. Further, let \(X=({\varvec{x}}_1,\ldots ,{\varvec{x}}_n)^\top \in \mathbb {R}^{n\times d}\) and \({\varvec{y}}=(y_1,\ldots ,y_n)^\top \in \{0,1\}^n\). Unlike Taylor and Tibshirani (2018), we do not require that the dimension d be fixed; that is, d may increase along with the sample size n.
3.1 Marginal screening and selection event
(C0) For the true active set \(S^*=\{j;\beta _j^*\ne 0\}\), the probability \(\mathrm{P}(\hat{S}\supset S^*)\) converges to 1 as n goes to infinity.
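The screening consistency demanded by (C0) can be checked empirically in a small Monte Carlo experiment. The setup below is our own illustration: the true active set, coefficient values, and the marginal score \(|\tilde{{\varvec{x}}}_j^\top ({\varvec{y}}-\bar{y}{\varvec{1}})|\) are assumptions, and the exact screening statistic of Sect. 3.1 may differ.

```python
import numpy as np

def screening_covers_truth(n, d, K, n_rep=100, seed=0):
    """Monte Carlo estimate of P(S_hat ⊃ S*) for marginal screening
    by |x_j^T (y - ybar)| in a logistic model (illustrative setup)."""
    rng = np.random.default_rng(seed)
    true_S = {0, 1, 2}                     # assumed true active set
    beta = np.zeros(d)
    beta[:3] = 2.0                         # assumed signal strength
    hits = 0
    for _ in range(n_rep):
        X = rng.standard_normal((n, d))
        y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta)))
        scores = np.abs(X.T @ (y - y.mean()))
        S_hat = set(np.argsort(scores)[::-1][:K])
        hits += true_S <= S_hat            # screening covered the truth?
    return hits / n_rep
```

Under (C0), the estimated coverage probability should approach 1 as n grows with the signal fixed.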
3.2 Selective test
Remark 1
(C1) \(\max _{i}\Vert {\varvec{x}}_{S,i}\Vert =\mathrm{O}(\sqrt{K})\). In addition, for the \(K\times K\) dimensional matrix
$$\begin{aligned} \varXi _{S,n}=\frac{1}{n}X_S^\top X_S=\frac{1}{n}\sum _{i=1}^{n}{\varvec{x}}_{S,i}{\varvec{x}}_{S,i}^\top \in \mathbb {R}^{K\times K}, \end{aligned}$$
the following holds:
$$\begin{aligned} 0<C_1<\lambda _\mathrm{min}(\varXi _{S,n})\le \lambda _\mathrm{max}(\varXi _{S,n})<C_2<\infty , \end{aligned}$$
where \(C_1\) and \(C_2\) are constants that depend on neither n nor K.
(C2) There exists a constant \(\xi \;(<\infty )\) such that \(\max _i{\varvec{x}}_{S,i}^\top {\varvec{\beta }}_S^*<\xi\). In addition, the parameter space \(\mathcal{B}\) is
$$\begin{aligned} \mathcal{B}=\{{\varvec{\beta }}_S\in \mathbb {R}^{K};\max _i{\varvec{x}}_{S,i}^\top {\varvec{\beta }}_S<\tilde{\xi }\} \end{aligned}$$
for some constant \(\tilde{\xi }\;(\in (\xi ,\infty ))\).
(C3) \(K^3/n=\mathrm{o}(1)\).
(C4) For any \(p\times q\) dimensional matrix A, we denote the spectral norm of A by \(\Vert A\Vert =\sup _{{\varvec{v}}\ne {\varvec{0}}}\Vert A{\varvec{v}}\Vert /\Vert {\varvec{v}}\Vert\). Then the following holds:
$$\begin{aligned} \left\Vert \frac{1}{\sqrt{n}}X_{S^\bot }^\top X_S\right\Vert =\mathrm{O}(K). \end{aligned}$$
4 Proposed method
In this section, we present the proposed method of selective inference for high dimensional logistic regression with marginal screening. We first consider a subset of features \(\hat{S} = S (\supset S^*)\) as a fixed set, and derive an asymptotic distribution of \(\hat{{\varvec{\beta }}}_S\) under assumptions (C1)–(C3). Then, we introduce the “source” of the test statistic, \({\varvec{T}}_n({\varvec{\beta }}_S^*)\), which is defined by a score function, and apply it to the polyhedral lemma, where we show that the truncation points are independent of \({\varvec{\eta }}^\top {\varvec{T}}_n({\varvec{\beta }}_S^*)\) under assumption (C4).
Theorem 1
Theorem 2
4.1 Computing truncation points
In practice, we need to compute the truncation points in (9). When we utilize marginal screening for variable selection, it becomes difficult to compute \(L_n\) and \(U_n\) because \(\tilde{A}\) becomes a \(\{2K(d-K)+K\}\times K\) dimensional matrix. For example, even when \(d=1000\) and \(K=20\), we need to handle a 39,220 dimensional vector. To reduce the computational burden, we derive a simple form of (9) in this section.
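The row count quoted above is just the number of constraints in the screening selection event: each of the K selected scores must dominate the \(d-K\) discarded scores in absolute value (two inequalities each), plus K sign-consistency constraints. A quick arithmetic check (the function name is ours):

```python
def n_constraints(d, K):
    # 2K(d - K) dominance inequalities plus K sign-consistency rows.
    return 2 * K * (d - K) + K

print(n_constraints(1000, 20))  # -> 39220
```

The count grows linearly in d for fixed K, which is why a simplified form of (9) is worthwhile.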
4.2 Controlling familywise error rate
Since selective test (1) consists of K hypotheses, we may be concerned about multiplicity when \(K>1\). In this case, instead of selective type I error, we control the familywise error rate (FWER) in the sense of selective inference and we term it selective FWER.
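Assuming the Bonferroni adjustment referred to in Sect. 4.2, controlling the selective FWER at level \(\alpha\) amounts to testing each of the K hypotheses at level \(\alpha /K\). A minimal sketch (function name ours):

```python
def bonferroni_reject(pvalues, alpha=0.05):
    # Reject H_j iff p_j <= alpha / K; controls the FWER at level alpha.
    K = len(pvalues)
    return [p <= alpha / K for p in pvalues]

print(bonferroni_reject([0.001, 0.02, 0.5]))  # -> [True, False, False]
```

With K = 3 and alpha = 0.05, the per-test threshold is 0.05/3 ≈ 0.0167, so only the first p value leads to rejection.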
5 Simulation study
Through simulation studies, we explore the performance of the proposed method in Sect. 4, which we term ASICs (Asymptotic Selective Inference for Classification) here.
We first examine whether ASICs can control the selective type I error. We also check the selective type I error when the data splitting (DS) and nominal test (NT) methods are used. In DS, we first randomly divide the data into two disjoint sets. Then, after selecting \(\hat{S}=S\) with \(|S|=K\) by using one of these sets, we construct a test statistic \({\varvec{T}}_n(\hat{{\varvec{\beta }}}_S)\) based on the other set, and reject the jth selective test (1) when \(T_{n,j}/\sigma _n\ge z_{\alpha /2}\), where \(z_{\alpha /2}\) is the upper \(\alpha /2\) percentile of the standard normal distribution. NT cannot control the type I error since it ignores selection bias: it first selects K variables by marginal screening, then rejects the jth selective test (1) when \(T_{n,j}/\sigma _n\ge z_{\alpha /2}\), where the entire data set is used in both the selection and inference steps. Finally, we explore whether ASICs can effectively control the selective FWER and, at the same time, confirm its statistical power by comparing it with that of DS.
The simulation settings are as follows. As the d dimensional regressors \({\varvec{x}}_i\) (\(i=1,\ldots ,n\)), we used vectors drawn from \(\mathrm{N}({\varvec{0}},\varSigma )\), where \(\varSigma\) is a \(d\times d\) dimensional covariance matrix whose (j, k)th element is set to \(\rho ^{|j-k|}\). We set \(\rho =0\) in Case 1 and \(\rho =0.5\) in Case 2. Note that the elements of \({\varvec{x}}_i\) are independent in Case 1 but correlated in Case 2. Then, for each \({\varvec{x}}_i\), we generate \(y_i\) from \(\mathrm{Bi}(\psi '({\varvec{x}}_i^\top {\varvec{\beta }}^*))\), where \({\varvec{\beta }}^*\) is a d dimensional true coefficient vector and \(\mathrm{Bi}(p)\) is the Bernoulli distribution with parameter p. In the following, we conduct simulations using 1000 Monte Carlo runs. We use the glm package in R for parameter estimation.
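The data-generating process of Cases 1 and 2 can be sketched as follows, assuming the AR(1) form \(\rho ^{|j-k|}\) for the (j, k)th element of \(\varSigma\) and the logistic mean function \(\psi '(t)=1/(1+e^{-t})\); the function name is ours.

```python
import numpy as np

def make_data(n, d, rho, beta_star, rng):
    """Simulate one data set: AR(1) Gaussian design, Bernoulli response."""
    idx = np.arange(d)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])  # Sigma_jk = rho^|j-k|
    X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
    prob = 1.0 / (1.0 + np.exp(-X @ beta_star))         # psi'(x^T beta)
    y = rng.binomial(1, prob)
    return X, y
```

Setting rho = 0 recovers the independent design of Case 1, since the off-diagonal entries of \(\varSigma\) vanish and the diagonal is 1.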
5.1 Controlling selective type I error
Method comparison using simulated data based on 1000 Monte Carlo runs

Case 1

| d | Method | n = 50 | n = 100 | n = 200 | n = 500 | n = 1000 | n = 1500 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 200 | ASICs | 0.029 (0.168) | 0.049 (0.216) | 0.038 (0.191) | 0.031 (0.173) | 0.028 (0.165) | 0.033 (0.179) |
| | DS | 0.012 (0.109) | 0.015 (0.122) | 0.004 (0.063) | 0.004 (0.063) | 0.011 (0.104) | 0.011 (0.104) |
| | NT | 0.184 (0.388) | 0.226 (0.418) | 0.219 (0.414) | 0.261 (0.439) | 0.255 (0.436) | 0.256 (0.437) |
| 500 | ASICs | 0.028 (0.165) | 0.043 (0.203) | 0.039 (0.194) | 0.039 (0.194) | 0.032 (0.176) | 0.036 (0.186) |
| | DS | 0.012 (0.109) | 0.006 (0.077) | 0.008 (0.089) | 0.009 (0.094) | 0.005 (0.071) | 0.008 (0.089) |
| | NT | 0.267 (0.442) | 0.273 (0.446) | 0.304 (0.460) | 0.301 (0.459) | 0.326 (0.469) | 0.325 (0.469) |
| 1000 | ASICs | 0.041 (0.198) | 0.044 (0.205) | 0.023 (0.150) | 0.032 (0.176) | 0.038 (0.191) | 0.044 (0.205) |
| | DS | 0.006 (0.077) | 0.011 (0.104) | 0.010 (0.100) | 0.009 (0.094) | 0.013 (0.113) | 0.010 (0.100) |
| | NT | 0.294 (0.456) | 0.345 (0.476) | 0.390 (0.488) | 0.402 (0.491) | 0.411 (0.492) | 0.405 (0.491) |

Case 2

| d | Method | n = 50 | n = 100 | n = 200 | n = 500 | n = 1000 | n = 1500 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 200 | ASICs | 0.038 (0.191) | 0.038 (0.191) | 0.040 (0.196) | 0.032 (0.176) | 0.028 (0.165) | 0.031 (0.173) |
| | DS | 0.012 (0.109) | 0.007 (0.083) | 0.012 (0.109) | 0.010 (0.100) | 0.012 (0.109) | 0.004 (0.063) |
| | NT | 0.177 (0.382) | 0.207 (0.405) | 0.234 (0.424) | 0.211 (0.408) | 0.219 (0.414) | 0.210 (0.408) |
| 500 | ASICs | 0.049 (0.216) | 0.038 (0.191) | 0.030 (0.171) | 0.030 (0.171) | 0.039 (0.194) | 0.034 (0.181) |
| | DS | 0.007 (0.083) | 0.006 (0.077) | 0.010 (0.100) | 0.009 (0.094) | 0.007 (0.083) | 0.007 (0.083) |
| | NT | 0.247 (0.431) | 0.269 (0.443) | 0.291 (0.454) | 0.295 (0.456) | 0.309 (0.462) | 0.318 (0.466) |
| 1000 | ASICs | 0.049 (0.216) | 0.047 (0.212) | 0.031 (0.173) | 0.034 (0.181) | 0.024 (0.153) | 0.046 (0.210) |
| | DS | 0.009 (0.094) | 0.008 (0.089) | 0.013 (0.113) | 0.006 (0.077) | 0.006 (0.077) | 0.010 (0.100) |
| | NT | 0.290 (0.454) | 0.350 (0.477) | 0.375 (0.484) | 0.396 (0.489) | 0.407 (0.492) | 0.414 (0.493) |
5.2 FWER and power
Here, we explore the selective FWER and statistical power of ASICs and DS for K selective tests (1), where we set \(K=5, 10, 15\), and 20. Note that, as discussed in the previous section, NT is disregarded here because it does not adequately control the selective type I error. We adjust for multiplicity by utilizing Bonferroni’s method, as noted in Sect. 4.2.
6 Empirical applications
7 Theoretical analysis
In this section, we provide proofs of the theoretical results derived herein. For \(p,q\in \mathbb {R}\), we use the notation \(p\lesssim q\) to mean that there exists a constant \(r>0\) such that \(p\le rq\); \(p\gtrsim q\) is defined similarly. All proofs are based on a fixed \(S~(\supset S^*)\); thus, we simply denote \(\hat{{\varvec{\beta }}}_S\) and \(X_S\) by \(\hat{{\varvec{\beta }}}\) and X, respectively. This is because we need to verify several asymptotic conditions before selection, in the same way as in Tian and Taylor (2017) and Taylor and Tibshirani (2018).
7.1 Proof of (6)
Remark 2
7.2 Proof of Theorem 1
7.3 Proof of Theorem 2
8 Concluding remarks and future research
Recently, methods for data-driven science, such as selective inference and adaptive data analysis, have become increasingly important, as described by Barber and Candès (2016). Although there are several approaches to post-selection inference, we have developed a selective inference method for high dimensional classification problems based on the work of Lee et al. (2016). As in that seminal work, the polyhedral lemma (Lemma 1) plays an important role in our study. By considering high dimensional asymptotics concerning the sample size and the number of selected variables, we have shown that a result similar to the polyhedral lemma holds even for high dimensional logistic regression problems. As a result, we could construct a pivotal quantity whose sampling distribution is represented by a truncated normal distribution and converges to the standard uniform distribution. In addition, simulation experiments have shown that the performance of our proposed method is, in almost all cases, superior to that of other methods such as data splitting.
As suggested by the results from the simulation experiments, conditions might be relaxed to accommodate more general settings. In terms of future research in this domain, while we considered the logistic model in this paper, it is important to extend the results to other models, for example, generalized linear models. Further, higher order interaction models are also crucial in practice. In this situation, the size of the matrix in the selection event becomes very large, and thus, it is cumbersome to compute truncation points in the polyhedral lemma. Suzumura et al. (2017) have shown that selective inference can be constructed in such a model by utilizing a pruning algorithm. In this respect, it is desirable to extend their result not only to linear regression modeling contexts but also to other models.
Acknowledgements
Funding was provided by the Japan Society for the Promotion of Science (Grant Nos. 18K18010, 16H06538, 17H00758, JPMJCR1502), the RIKEN Center for Advanced Intelligence Project, and the JST support program for starting up innovation hub on materials research by information integration initiative.
References
Barber, R. F., & Candès, E. J. (2016). A knockoff filter for high-dimensional selective inference. arXiv preprint arXiv:1602.03574.
Berk, R., Brown, L., Buja, A., Zhang, K., & Zhao, L. (2013). Valid post-selection inference. The Annals of Statistics, 41, 802–837.
Bickel, P. J., Ritov, Y., & Tsybakov, A. B. (2009). Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37, 1705–1732.
Breiman, L. (1992). The little bootstrap and other methods for dimensionality selection in regression: X-fixed prediction error. Journal of the American Statistical Association, 87, 738–754.
Cox, D. (1975). A note on data-splitting for the evaluation of significance levels. Biometrika, 62, 441–444.
Dasgupta, S., Khare, K., & Ghosh, M. (2014). Asymptotic expansion of the posterior density in high dimensional generalized linear models. Journal of Multivariate Analysis, 131, 126–148.
Dickhaus, T. (2014). Simultaneous statistical inference. With applications in the life sciences. Heidelberg: Springer.
Efron, B. (2014). Estimation and accuracy after model selection. Journal of the American Statistical Association, 109, 991–1007.
Fan, J., & Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B, 70, 849–911.
Fan, J., & Peng, H. (2004). Nonconcave penalized likelihood with a diverging number of parameters. The Annals of Statistics, 32, 928–961.
Fan, J., & Song, R. (2010). Sure independence screening in generalized linear models with NP-dimensionality. The Annals of Statistics, 38, 3567–3604.
Fithian, W., Sun, D., & Taylor, J. (2014). Optimal inference after model selection. arXiv preprint arXiv:1410.2597.
Huang, J., Horowitz, J. L., & Ma, S. (2008). Asymptotic properties of bridge estimators in sparse high-dimensional regression models. The Annals of Statistics, 36, 587–613.
Huber, P. J. (1973). Robust regression: Asymptotics, conjectures and Monte Carlo. The Annals of Statistics, 1, 799–821.
Lee, J. D., Sun, D. L., Sun, Y., & Taylor, J. E. (2016). Exact post-selection inference, with application to the lasso. The Annals of Statistics, 44, 907–927.
Lee, J. D., & Taylor, J. E. (2014). Exact post model selection inference for marginal screening. In Advances in Neural Information Processing Systems, pp. 136–144.
Lockhart, R., Taylor, J., Tibshirani, R. J., & Tibshirani, R. (2014). A significance test for the lasso. The Annals of Statistics, 42, 413.
Meinshausen, N., Meier, L., & Bühlmann, P. (2009). p-values for high-dimensional regression. Journal of the American Statistical Association, 104, 1671–1681.
Suzumura, S., Nakagawa, K., Umezu, Y., Tsuda, K., & Takeuchi, I. (2017). Selective inference for sparse high-order interaction models. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, JMLR.org, pp. 3338–3347.
Taylor, J. E., Loftus, J. R., & Tibshirani, R. J. (2016). Inference in adaptive regression via the Kac-Rice formula. The Annals of Statistics, 44, 743–770.
Taylor, J., & Tibshirani, R. (2018). Post-selection inference for \(\ell _1\)-penalized likelihood models. The Canadian Journal of Statistics, 46, 41–61.
Tian, X., Loftus, J. R., & Taylor, J. E. (2018). Selective inference with unknown variance via the square-root lasso. Biometrika, 105, 755–768.
Tian, X., & Taylor, J. (2017). Asymptotics of selective inference. Scandinavian Journal of Statistics, 44, 480–499.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B, 58, 267–288.
Wasserman, L., & Roeder, K. (2009). High dimensional variable selection. The Annals of Statistics, 37, 2178–2201.