
Japanese Journal of Statistics and Data Science

Volume 2, Issue 2, pp 559–589

Selective inference via marginal screening for high dimensional classification

  • Yuta Umezu
  • Ichiro Takeuchi
Original Paper · Information Theory and Statistics

Abstract

Post-selection inference is a statistical technique for determining salient variables after model or variable selection. Recently, selective inference, a kind of post-selection inference framework, has garnered attention in the statistics and machine learning communities. By conditioning on a specific variable selection procedure, selective inference can properly control the so-called selective type I error, which is a type I error conditional on the variable selection procedure, without imposing excessive additional computational costs. While selective inference can provide a valid hypothesis testing procedure, the main focus has hitherto been on Gaussian linear regression models. In this paper, we develop a selective inference framework for the binary classification problem. We consider a logistic regression model after variable selection based on marginal screening, and derive the high dimensional statistical behavior of the post-selection estimator. This enables us to asymptotically control the selective type I error for the purposes of hypothesis testing after variable selection. We conduct several simulation studies to confirm the statistical power of the test, and compare our proposed method with data splitting and other methods.

Keywords

High dimensional asymptotics · Hypothesis testing · Logistic regression · Post-selection inference · Marginal screening

1 Introduction

Discovering statistically significant variables in high dimensional data is an important problem for many applications such as bioinformatics, materials informatics, and econometrics, to name a few. To achieve this, for example in a regression model, data analysts often attempt to reduce the dimensionality of the model by utilizing a particular model selection or variable selection method. For example, the Lasso (Tibshirani 1996) and marginal screening (Fan and Lv 2008) are frequently used in model selection contexts. In many applications, data analysts conduct statistical inference based on the selected model as if it were known a priori, but this practice has been referred to as “a quiet scandal in the statistical community” by Breiman (1992). If we select a model based on the available data, then we have to pay heed to the effect of model selection when we conduct statistical inference. This is because the selected model is no longer deterministic, i.e., it is itself random, and statistical inference after model selection is affected by selection bias. In hypothesis testing of the selected variables, the validity of the inference is compromised when a test statistic is constructed without taking the model selection effect into account. As a consequence, we can no longer effectively control the type I error or the false-positive rate. This kind of problem falls under the banner of post-selection inference in the statistical community and has recently attracted a lot of attention (see, e.g., Berk et al. 2013; Efron 2014; Barber and Candès 2016; Lee et al. 2016).

Post-selection inference consists of the following two steps:
Selection:

The analyst chooses a model or subset of variables and constructs hypotheses based on the data.

Inference:

The analyst tests the hypothesis by using the selected model.

Broadly speaking, the selection step determines what issue to address, i.e., a hypothesis selected from the data, and the inference step conducts hypothesis testing to enable a conclusion to be drawn about the issue under consideration. To navigate the issue of selection bias, there are several approaches for conducting the inference step.

Data splitting is the most common procedure for selection bias correction. In a high dimensional linear regression model, Wasserman and Roeder (2009) and Meinshausen et al. (2009) succeed in assigning a p value to each selected variable by splitting the data into two subsets. Specifically, they first reduce the dimensionality of the model using the first subset, and then make the final selection using the second subset of the data, assigning a p value based on classical least squares estimation. While such a data splitting method is mathematically valid and straightforward to implement, it leads to low power for extracting truly significant variables because only sub-samples, whose size is obviously smaller than that of the full sample, can be used in each of the selection and inference steps.

As an alternative, simultaneous inference, which takes all possible subsets of variables into account, has been developed for correcting selection bias. Berk et al. (2013) showed that the type I error can be successfully controlled even if the full sample is used in both the selection and inference steps by adjusting for the multiplicity of model selection. Since the number of possible subsets of variables increases exponentially, however, the computational cost associated with this method becomes excessive when the dimension of the parameter exceeds about 20.

On the other hand, selective inference, which takes only the selected model into account, is another approach to post-selection inference, and provides a new framework for combining selection and hypothesis testing. Since hypothesis testing is conducted only for the selected model, it makes sense to condition on the event that “a certain model is selected”. This event is referred to as a selection event, and we conduct hypothesis testing conditional on it. Thus, we can avoid having to compare coefficients across two different models. Recently, Lee et al. (2016) succeeded in using this method to conduct hypothesis testing by constructing confidence intervals for variables selected by the Lasso in a linear regression modeling context. When a specific confidence interval is constructed, the corresponding hypothesis testing can be successfully conducted. They also show that the type I error conditioned on the selection event, called the selective type I error, can be appropriately controlled. It is noteworthy that, by conditioning on a selection event in a certain class, we can construct exact p values, in the sense of conditional inference, based on a truncated normal distribution.

Almost all studies which have followed the seminal work by Lee et al. (2016), however, focus on linear regression models. In particular, normality of the noise is crucial for controlling the selective type I error. To relax this assumption, Tian and Taylor (2017) developed an asymptotic theory for selective inference in a generalized linear modeling context. Although their results are applicable to high dimensional, low sample size data, we can only test a global null hypothesis, that is, the hypothesis that all regression coefficients are zero, just as with the covariance test (Lockhart et al. 2014). On the other hand, Taylor and Tibshirani (2018) proposed a procedure to test individual hypotheses in a logistic regression model with the Lasso. By debiasing the Lasso estimator for both the active and inactive variables, they derive a joint asymptotic distribution of the debiased Lasso estimator and conduct hypothesis testing for the regression coefficients individually. However, the method is justified only for low dimensional scenarios since it relies on standard fixed dimensional asymptotics.

Our main contribution is that, by utilizing marginal screening as the variable selection method, we can show that the selective type I error rate for a logistic regression model is appropriately controlled even in a high dimensional asymptotic scenario. In addition, our method is applicable not only to testing the global null hypothesis but also to hypotheses pertaining to individual regression coefficients. Specifically, we first utilize marginal screening for the selection step in a similar way to Lee and Taylor (2014). Then, by considering a logistic regression model for the selected variables, we derive a high dimensional asymptotic property of the maximum likelihood estimator. Using these asymptotic results, we can conduct selective inference for high dimensional logistic regression, i.e., valid hypothesis testing for the variables selected from high dimensional data.

The rest of the paper is organized as follows. Section 2 briefly describes the notion of selective inference and introduces several related works. In Sect. 3, the model setting and assumptions are described. An asymptotic property of the maximum likelihood estimator of our model is discussed in Sect. 4. In Sect. 5, we conduct several simulation studies to explore the performance of the proposed method before applying it to real world empirical data sets in Sect. 6. Theorem proofs are relegated to Sect. 7. Finally, Sect. 8 offers concluding remarks and suggestions for future research in this domain.

1.1 Notation

Throughout the paper, row and column vectors of \(X\in \mathbb {R}^{n\times d}\) are denoted by \({\varvec{x}}_i~(i=1,\ldots , n)\) and \(\tilde{{\varvec{x}}}_j~(j=1,\ldots , d)\), respectively. An \(n\times n\) identity matrix is denoted by \(I_n\). The \(\ell _2\)-norm of a vector is denoted by \(\Vert \cdot \Vert\) provided there is no confusion. For any subset \(J\subseteq \{1,\ldots , d\}\), its complement is denoted by \(J^\bot =\{1,\ldots ,d\}\backslash J\). We also denote by \({\varvec{v}}_J=(v_i)_{i\in J}\in \mathbb {R}^{|J|}\) and \(X_J=({\varvec{x}}_{J,1},\ldots ,{\varvec{x}}_{J,n})^\top \in \mathbb {R}^{n\times |J|}\) a sub-vector of \({\varvec{v}}\) and a sub-matrix of X, respectively. For a differentiable function f, we denote by \(f'\) and \(f''\) its first and second derivatives, and so on.

2 Selective inference and related works

In this section, we overview the fundamental notions of selective inference through a simple linear regression model (Lee et al. 2016). We also review related existing work on selective inference.

2.1 Selective inference in linear regression model

Let \({\varvec{y}}\in \mathbb {R}^n\) and \(X\in \mathbb {R}^{n\times d}\) be a response and non-random regressor, respectively, and let us consider a linear regression model
$$\begin{aligned} {\varvec{y}}=X{\varvec{\beta }}^*+{\varvec{\varepsilon }}, \end{aligned}$$
where \({\varvec{\beta }}^*\) is the true regression coefficient vector and \({\varvec{\varepsilon }}\) is distributed according to \(\mathrm{N}({\varvec{0}},\sigma ^2 I_n)\) with known variance \(\sigma ^2\). Suppose that a subset of variables S is selected in the selection step (e.g., Lasso or marginal screening as in Lee et al. (2016); Lee and Taylor (2014)) and let us consider hypothesis testing for \(j\in \{1,\ldots , |S|\}\):
$$\begin{aligned} \text {H}_{0,j}:\beta _{S, j}^*=0 \qquad \text {vs.} \qquad \text {H}_{1,j}:\beta _{S, j}^*\ne 0. \end{aligned}$$
(1)
If S is non-random, the maximum likelihood estimator \(\hat{{\varvec{\beta }}}_S=(X_S^\top X_S)^{-1}X_S^\top {\varvec{y}}\) is distributed according to \(\mathrm{N}({\varvec{\beta }}_S^*,\sigma ^2(X_S^\top X_S)^{-1})\), as is well known. However, we cannot use this sampling distribution when S is selected based on the data, because the selected set S is itself random.

If a subset of variables, i.e., the active set, \(\hat{S}\) is selected by the Lasso or marginal screening, the event \(\{\hat{S}=S\}\) can be written as an affine set with respect to \({\varvec{y}}\), that is, in the form of \(\{{\varvec{y}}; \, A{\varvec{y}}\le {\varvec{b}}\}\) for some non-random matrix A and vector \({\varvec{b}}\) (Lee et al. 2016; Lee and Taylor 2014), in which the event \(\{\hat{S}=S\}\) is called a selection event. Lee et al. (2016) showed that if \({\varvec{y}}\) follows a normal distribution and the selection event can be written as an affine set, the following lemma holds:

Lemma 1

(Polyhedral Lemma; Lee et al. 2016) Suppose \({\varvec{y}}\sim \mathrm{N}({\varvec{\mu }},\varSigma )\). Let \({\varvec{c}}=\varSigma {\varvec{\eta }}({\varvec{\eta }}^\top \varSigma {\varvec{\eta }})^{-1}\) for any \({\varvec{\eta }}\in \mathbb {R}^n\), and let \({\varvec{z}}=(I_n-{\varvec{c}}{\varvec{\eta }}^\top ){\varvec{y}}\). Then we have
$$\begin{aligned} \{{\varvec{y}}; \, A{\varvec{y}}\le {\varvec{b}}\}=\{{\varvec{y}}; \, L({\varvec{z}})\le {\varvec{\eta }}^\top {\varvec{y}}\le U({\varvec{z}}),\;N({\varvec{z}})\ge 0\}, \end{aligned}$$
where
$$\begin{aligned} L({\varvec{z}}) =\max _{j:(A{\varvec{c}})_j<0}\frac{b_j-(A{\varvec{z}})_j}{(A{\varvec{c}})_j},~~~~~ U({\varvec{z}}) =\min _{j:(A{\varvec{c}})_j>0}\frac{b_j-(A{\varvec{z}})_j}{(A{\varvec{c}})_j} \end{aligned}$$
and \(N({\varvec{z}})=\max _{j:(A{\varvec{c}})_j=0}b_j-(A{\varvec{z}})_j\). In addition, \((L({\varvec{z}}),U({\varvec{z}}),N({\varvec{z}}))\) is independent of \({\varvec{\eta }}^\top {\varvec{y}}\).
By this lemma, the distribution of the pivotal quantity for \({\varvec{\eta }}^\top {\varvec{\mu }}\) is given by a truncated normal distribution. Specifically, let \(F^{[L,U]}_{\mu ,\sigma ^2}\) be the cumulative distribution function of a truncated normal distribution \(\mathrm{TN}(\mu , \sigma ^2, L, U)\), that is,
$$\begin{aligned} F^{[L,U]}_{\mu ,\sigma ^2}(x) =\frac{\Phi ((x-\mu )/\sigma )-\Phi ((L-\mu )/\sigma )}{\Phi ((U-\mu )/\sigma ) -\Phi ((L-\mu )/\sigma )}, \end{aligned}$$
where \(\Phi\) is a cumulative distribution function of a standard normal distribution. Then, for any value of \({\varvec{z}}\), we have
$$\begin{aligned} \left[ F^{[L({\varvec{z}}),U({\varvec{z}})]}_{{\varvec{\eta }}^\top {\varvec{\mu }},{\varvec{\eta }}^\top \varSigma {\varvec{\eta }}}({\varvec{\eta }}^\top {\varvec{y}}) \mid A{\varvec{y}}\le {\varvec{b}} \right] \sim \mathrm{Unif}(0,1), \end{aligned}$$
where \(L({\varvec{z}})\) and \(U({\varvec{z}})\) are defined in the above lemma. This pivotal quantity allows us to construct a so-called selective p value. Precisely, by choosing \({\varvec{\eta }}=X_S(X_S^\top X_S)^{-1}{\varvec{e}}_j\), we can construct a right-side selective p value as
$$\begin{aligned} P_j =1-F^{[L({\varvec{z}}_0), U({\varvec{z}}_0)]}_{0,{\varvec{\eta }}^\top \varSigma {\varvec{\eta }}}({\varvec{\eta }}^\top {\varvec{y}}), \end{aligned}$$
where \({\varvec{e}}_j\in \mathbb {R}^{|S|}\) is a unit vector whose j-th element is 1 and 0 otherwise, and \({\varvec{z}}_0\) is a realization of \({\varvec{z}}\). Note that the value of \(P_j\) represents a right-side p value conditional on the selection event under the null hypothesis \(\text {H}_{0,j}:\beta _{S,j}^*={\varvec{\eta }}^\top {\varvec{\mu }}=0\) in (1). In addition, for the j-th test in (1), a two-sided selective p value can be defined as
$$\begin{aligned} \tilde{P}_j=2\min \{P_j, 1-P_j \}, \end{aligned}$$
which also follows a standard uniform distribution under the null hypothesis. Therefore, we reject the j-th null hypothesis at level \(\alpha\) when \(\tilde{P}_j\le \alpha\), and the probability
$$\begin{aligned} \text {P}(\text {H}_{0,j}~\text {is falsely rejected}\mid \hat{S}=S) =\text {P}(\tilde{P}_j\le \alpha \mid \hat{S}=S) \end{aligned}$$
(2)
is referred to as a selective type I error.
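In practice, the selective p value above reduces to evaluating the CDF of a truncated normal. The following is a minimal R sketch under our own naming (not the authors' code); the naive ratio of normal CDFs can be numerically unstable far in the tails, so we clip the result for safety.

```r
# Minimal sketch: CDF of TN(mu, sigma^2, L, U) and the two-sided selective
# p value of Sect. 2.1 under H_{0,j}: eta' mu = 0.
ptruncnorm <- function(x, mu, sigma, L, U) {
  num <- pnorm((x - mu) / sigma) - pnorm((L - mu) / sigma)
  den <- pnorm((U - mu) / sigma) - pnorm((L - mu) / sigma)
  pmin(pmax(num / den, 0), 1)      # clip for numerical safety
}

selective_p_value <- function(eta_y, sigma, L, U) {
  P_right <- 1 - ptruncnorm(eta_y, mu = 0, sigma = sigma, L = L, U = U)
  2 * min(P_right, 1 - P_right)    # two-sided selective p value
}
```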

2.2 Related works

In selective inference, we use the same data for variable selection and statistical inference. Therefore, the selected model is not deterministic and we cannot apply classical hypothesis testing due to selection bias.

To navigate this problem, data splitting has been commonly utilized. In data splitting, the data are randomly divided into two disjoint sets, one of which is used for variable selection and the other for hypothesis testing. This is a particularly versatile method and is widely applicable whenever we can divide the data randomly (see, e.g., Cox 1975; Wasserman and Roeder 2009; Meinshausen et al. 2009). Since the data are split randomly, i.e., independently of the data, the hypothesis testing in the inference step is independent of the selection step. Thus, we do not need to be concerned with selection bias. It is noteworthy that data splitting can be viewed as a method of selective inference because the inference is conducted only for the variables selected in the selection step. However, a drawback of data splitting is that only part of the data is available for each step, precisely because the essence of this approach involves reserving some data for the selection step and the remainder for the inference step. Because only a subset of the data can be used in variable selection, the risk of failing to select truly important variables increases. Similarly, the power of hypothesis testing decreases since inference proceeds on the basis of a subset of the total data. In addition, since data splitting is executed at random, it is possible and indeed plausible that the final results and conclusions will vary non-trivially depending on exactly how the data are split.

On the other hand, in the traditional statistical community, simultaneous inference has been developed for correcting selection bias (see, e.g., Berk et al. 2013; Dickhaus 2014). In simultaneous inference, type I error is controlled at level \(\alpha\) by considering all possible subsets of variables. Specifically, let \(\hat{S}\subseteq \{1,\ldots , d\}\) be the set of variables selected by a certain variable selection method and \(P_j(\hat{S})\) be a p value for the j-th selected variable in \(\hat{S}\). Then, in simultaneous inference, the following type I error should be adequately controlled:
$$\begin{aligned} \mathrm{P}(P_j(\hat{S})\le \alpha ~\text {for any}~\hat{S}\subseteq \{1,\ldots ,d\})\le \alpha . \end{aligned}$$
(3)
To examine the relationship between selective inference and simultaneous inference, note that the left-hand side in (3) can be rewritten as
$$\begin{aligned}&\mathrm{P}(P_j(\hat{S})\le \alpha ~\text {for any}~\hat{S}\subseteq \{1,\ldots ,d\}) \\&\quad =\sum _{S\subseteq \{1,\ldots ,d\}}\mathrm{P}(P_j(S)\le \alpha \mid \hat{S}=S)\mathrm{P}(\hat{S}=S). \end{aligned}$$
The right-hand side in the above equality is simply a weighted sum of selective type I errors over all possible subsets of variables. Therefore, if we control selective type I errors for all possible subsets of variables, we also control type I errors in the sense of simultaneous inference. However, because the number of possible subsets of variables is \(2^d\), it becomes overly cumbersome to compute the left-hand side in (3) even for \(d=20\). In contrast, selective inference only considers the selected variables, and thus its computational cost is low compared to simultaneous inference.

Following the seminal work of Lee et al. (2016), selective inference for variable selection has been intensively studied (e.g., Fithian et al. 2014; Lee and Taylor 2014; Taylor et al. 2016; Tian et al. 2018). All these methods, however, rely on the assumption of normality of the data.

2.3 Beyond normality

It is important to relax the normality assumption in order to apply selective inference to more general cases such as generalized linear models. To the best of our knowledge, there is a dearth of research into selective inference in such generalized settings. Here, we discuss the few studies which do exist in this respect.

Fithian et al. (2014) derived an exact post-selection inference procedure for a natural parameter of an exponential family, and obtained the uniformly most powerful unbiased test in the framework of selective inference. However, as suggested in their paper, the difficulty of constructing exact inference in generalized linear models emanates from the discreteness of the response distribution.

Focusing on asymptotic behavior in a generalized linear model context with the Lasso penalty, Tian and Taylor (2017) directly considered the asymptotic property of a pivotal quantity. Although their work can be applied in high dimensional scenarios, we can only test a global null, that is, \(\mathrm{H}_0:{\varvec{\beta }}^*={\varvec{0}}\), except in the linear regression model case. This is because, when we conduct selective inference for an individual coefficient, the selection event does not form a simple structure such as an affine set.

On the other hand, Taylor and Tibshirani (2018) proposed a procedure to test individual hypotheses in a logistic regression model context based on the Lasso. Their approach is fundamentally based on solving the Lasso by approximating the log-likelihood up to the second order, and on debiasing the Lasso estimator. Because the objective function then becomes quadratic, as in the linear regression model, the selection event reduces to a relatively simple affine set. After debiasing the Lasso estimator, they derive an asymptotic joint distribution of the active and inactive estimators. However, since they rely on fixed d dimensional asymptotics, high dimensional scenarios are not supported by their theory.

In this paper, we extend the selective inference for logistic regression of Taylor and Tibshirani (2018) to high dimensional settings in the case where variable selection is conducted by marginal screening. We do not consider asymptotics for the d dimensional original parameter space, but for the K dimensional selected parameter space. Unfortunately, however, we cannot apply this asymptotic result directly to the polyhedral lemma (Lemma 1) of Lee et al. (2016). To tackle this problem, we consider a score function for constructing a test statistic for our selective inference framework. We first define a function \({\varvec{T}}_n({\varvec{\beta }}_{S}^*)\) based on the score function as a “source” for constructing a test statistic. To apply the polyhedral lemma to \({\varvec{T}}_n({\varvec{\beta }}_{S}^*)\), we need to asymptotically ensure that (i) the selection event is represented by affine constraints with respect to \({\varvec{T}}_n({\varvec{\beta }}_{S}^*)\), and (ii) a function of the form \({\varvec{\eta }}^\top {\varvec{T}}_n({\varvec{\beta }}_{S}^*)\) is independent of the truncation points. Our main technical contribution herein is that, by carefully analyzing the problem configuration and by introducing reasonable additional assumptions, we can show that these two requirements of the polyhedral lemma are satisfied asymptotically.

Figure 1 shows the asymptotic distribution of selective p values in our setting and in Taylor and Tibshirani (2018) based on 1000 Monte-Carlo simulations. While the theory in Taylor and Tibshirani (2018) does not support high dimensionality, their selective p value (red solid line) appears to be effective in high dimensional scenarios, although it is slightly more conservative compared to the approach developed in this paper (black solid line). Our high dimensional framework means that the number of selected variables grows with the sample size at an appropriate rate, and the proposed method allows us to test (1) individually even in high dimensional contexts.
Fig. 1

Comparison between empirical distributions of selective p values in (10) (black solid line) and Taylor and Tibshirani (2018) (red solid line). The dashed line shows the cumulative distribution function of the standard uniform distribution. Data were simulated for \(n=50\) and \(d=3000\) under the global null and \(x_{ij}\) was independently generated from a normal distribution \(\text {N}(0, 1)\). Our proposed method appears to offer superior approximation accuracy compared to the extant alternative

3 Setting and assumptions

As already noted, our objective herein is to develop a selective inference approach applicable to logistic regression models when the variables are selected by marginal screening. Let \((y_i,{\varvec{x}}_i)\) be the i-th pair of response and regressor. We assume that the \(y_i\)’s are mutually independent random variables taking values in \(\{0,1\}\), and that each \({\varvec{x}}_i\) is a d dimensional vector of known constants. Further, let \(X=({\varvec{x}}_1,\ldots ,{\varvec{x}}_n)^\top \in \mathbb {R}^{n\times d}\) and \({\varvec{y}}=(y_1,\ldots ,y_n)^\top \in \{0,1\}^n\). Unlike Taylor and Tibshirani (2018), we do not require that the dimension d be fixed, that is, d may increase along with the sample size n.

3.1 Marginal screening and selection event

In this study, we simply select variables based on the score \({\varvec{z}}=X^\top {\varvec{y}}\) between the regressors and the response, as in a linear regression problem. Specifically, we select the top K coordinates of \({\varvec{z}}\) in absolute value, that is,
$$\begin{aligned} \hat{S}=\{j;|z_j|~ \text {is among the first}~ K~ \text {largest of all}\}. \end{aligned}$$
To avoid computational issues, we consider the event \(\{(\hat{S}, {\varvec{u}}_{\hat{S}})=(S, {\varvec{u}}_S)\}\) as the selection event (see, e.g., Lee and Taylor 2014; Tian and Taylor 2017; Lee et al. 2016). Here, \({\varvec{u}}_S\) is the vector of signs of \(z_j~(j\in S)\). Then, the selection event \(\{(\hat{S}, {\varvec{u}}_{\hat{S}})=(S, {\varvec{u}}_S)\}\) can be rewritten as
$$\begin{aligned} |z_j|\ge |z_k|, \quad \forall (j,k)\in S\times S^\bot , \end{aligned}$$
which is equivalent to
$$\begin{aligned} -u_jz_j\le z_k\le u_jz_j, \quad u_j z_j\ge 0, \qquad \forall (j,k)\in S\times S^\bot . \end{aligned}$$
Therefore, \(\{(\hat{S}, {\varvec{u}}_{\hat{S}})=(S, {\varvec{u}}_S)\}\) is reduced to an affine set \(\{{\varvec{z}};\;A{\varvec{z}}\le {\varvec{0}}\}\) for an appropriate \(\{2K(d-K)+K\}\times d\) dimensional matrix A.
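For concreteness, the selection step itself is a short computation. The R sketch below (our own illustration, not the authors' code) returns the selected set \(\hat{S}\) and the sign vector \({\varvec{u}}_{\hat{S}}\) that define the selection event.

```r
# Minimal sketch of the marginal screening selection step: top-K coordinates
# of |z| with z = X'y, together with their signs.
marginal_screening <- function(X, y, K) {
  z <- drop(crossprod(X, y))                    # z = X^T y
  S <- order(abs(z), decreasing = TRUE)[1:K]    # selected indices
  list(S = S, u = sign(z[S]), z = z)
}
```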
In the following, we assume that a sure screening property holds. This is a desirable property for variable selection (see, e.g., Fan and Lv 2008; Fan and Song 2010), and the statement is as follows:
(C0)

For the true active set \(S^*=\{j;\beta _j^*\ne 0\}\), the probability \(\mathrm{P}(\hat{S}\supset S^*)\) converges to 1 as n goes to infinity.

In the above assumption, we denote by \({\varvec{\beta }}^*\in \mathbb {R}^d\) the true value of the coefficient vector. This assumption requires that the set of selected variables contain the set of true active variables with probability tending to 1. In the linear regression model, (C0) holds under some regularity conditions in high dimensional settings (see, e.g., Fan and Lv 2008). A sufficient condition on the dimensionality for (C0) is \(\log d=O(n^\xi )\) for some \(\xi \in (0,1/2)\), and thus we allow d to be exponentially large in n. Because (C0) is not directly related to selective inference, we do not discuss it further.

3.2 Selective test

For a subset of variables \(\hat{S}~(=S)\) selected by marginal screening, we consider K selective tests (1) for each variable \(\beta _j^*,~j\in S\). Let us define the loss function of logistic regression with the selected variables as follows:
$$\begin{aligned} \ell _n({\varvec{\beta }}_S) =\sum _{i=1}^n\{y_i{\varvec{x}}_{S,i}^\top {\varvec{\beta }}_S-\psi ({\varvec{x}}_{S,i}^\top {\varvec{\beta }}_S)\}, \end{aligned}$$
(4)
where \(\psi ({\varvec{x}}_{S,i}^\top {\varvec{\beta }}_S)=\log (1+\exp ({\varvec{x}}_{S,i}^\top {\varvec{\beta }}_S))\) is a cumulant generating function. Observe that \(\ell _n({\varvec{\beta }}_S)\) is concave with respect to \({\varvec{\beta }}_S\). Thus, we can define the maximum likelihood estimator of \({\varvec{\beta }}_S\) as the optimal solution that attains the maximum of the following optimization problem:
$$\begin{aligned} \hat{{\varvec{\beta }}}_S =\mathop {\mathrm{arg~max}}\limits _{{\varvec{\beta }}_S\in \mathcal{B}}\ell _n({\varvec{\beta }}_S), \end{aligned}$$
(5)
where \(\mathcal{B}\subseteq \mathbb {R}^K\) is a parameter space.
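Since the paper later reports using the glm package in R for parameter estimation, a minimal sketch of (5) is simply a logistic fit restricted to the selected columns (with no intercept, matching (4)); the function name below is ours, not from the paper.

```r
# Minimal sketch of (5): maximum likelihood for the logistic model on the
# selected columns X_S only.
fit_selected_logistic <- function(X, y, S) {
  XS  <- X[, S, drop = FALSE]
  fit <- glm.fit(x = XS, y = y, family = binomial())
  fit$coefficients                              # hat{beta}_S
}
```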

Remark 1

Suppose that \(S~(\supset S^*)\) is fixed. Then, it holds that
$$\begin{aligned} \psi '({\varvec{x}}_{S,i}^\top {\varvec{\beta }}_S^*)=\psi '({\varvec{x}}_{S^*,i}^\top {\varvec{\beta }}_{S^*}^*),~~~ \psi ''({\varvec{x}}_{S,i}^\top {\varvec{\beta }}_S^*)=\psi ''({\varvec{x}}_{S^*,i}^\top {\varvec{\beta }}_{S^*}^*), \end{aligned}$$
and thus, we have
$$\begin{aligned} \mathrm{P}(y_i=1)=\mathrm{E}[y_i]=\psi '({\varvec{x}}_{S^*,i}^\top {\varvec{\beta }}_{S^*}^*),~~~ \mathrm{V}[y_i]=\psi ''({\varvec{x}}_{S^*,i}^\top {\varvec{\beta }}_{S^*}^*). \end{aligned}$$
We construct test statistics for (1) by deriving an asymptotic distribution of \(\hat{{\varvec{\beta }}}_S\). To develop our asymptotic theory, we further assume the following conditions in addition to (C0) for a fixed S with \(|S|=K\):
  1. (C1)
    \(\max _{i}\Vert {\varvec{x}}_{S,i}\Vert =\mathrm{O}(\sqrt{K})\). In addition, for a \(K\times K\) dimensional matrix
    $$\begin{aligned} \varXi _{S,n}=\frac{1}{n}X_S^\top X_S=\frac{1}{n}\sum _{i=1}^{n}{\varvec{x}}_{S,i}{\varvec{x}}_{S,i}^\top \in \mathbb {R}^{K\times K}, \end{aligned}$$
    the following holds:
    $$\begin{aligned} 0<C_1<\lambda _\mathrm{min}(\varXi _{S,n})\le \lambda _\mathrm{max}(\varXi _{S,n})<C_2<\infty , \end{aligned}$$
    where \(C_1\) and \(C_2\) are constants that depend on neither n nor K.
     
  2. (C2)
    There exists a constant \(\xi \;(<\infty )\) such that \(\max _i|{\varvec{x}}_{S,i}^\top {\varvec{\beta }}_S^*|<\xi\). In addition, parameter space \(\mathcal{B}\) is
    $$\begin{aligned} \mathcal{B}=\{{\varvec{\beta }}_S\in \mathbb {R}^{K};\max _i|{\varvec{x}}_{S,i}^\top {\varvec{\beta }}_S|<\tilde{\xi }\} \end{aligned}$$
    for some constant \(\tilde{\xi }\;(\in (\xi ,\infty ))\).
     
  3. (C3)

    \(K^3/n=\mathrm{o}(1)\).

     
  4. (C4)
    For any \(p\times q\) dimensional matrix A, we denote the spectral norm of A by \(\Vert A\Vert =\sup _{{\varvec{v}}\ne {\varvec{0}}}\Vert A{\varvec{v}}\Vert /\Vert {\varvec{v}}\Vert\). Then the following holds:
    $$\begin{aligned} \left\| \frac{1}{\sqrt{n}}X_{S^\bot }^\top X_S\right\| =\mathrm{O}(K). \end{aligned}$$
     
The condition (C1) pertains to the design matrix. Note that we only consider a high dimensional and small sample size setting for the original data set, and not for the selected variables. This assumption is reasonable for high dimensional and large sample scenarios. (C2) requires that \(\mathrm{P}(y_i=1)\) not converge to 0 or 1 for any \(i=1,\ldots ,n\). Observe that the parameter space \(\mathcal{B}\) is an open and convex set with respect to \({\varvec{\beta }}_S\). This assumption naturally holds when the space of regressors is compact and \({\varvec{\beta }}_S\) does not diverge. In addition, if the maximum likelihood estimator \(\hat{{\varvec{\beta }}}_S\) is \(\sqrt{n/K}\)-consistent, then \(\hat{{\varvec{\beta }}}_S\) lies in \(\mathcal{B}\) with probability converging to 1. The condition (C3) specifies the relationship between the sample size and the number of selected variables for high dimensional asymptotics in our model. As related conditions, Fan and Peng (2004) employ \(K^5/n\rightarrow 0\), and Dasgupta et al. (2014) employ \(K^{6+\delta }/n\rightarrow 0\) for some \(\delta >0\) to derive an asymptotic expansion of a posterior distribution in a Bayesian setting. Furthermore, Huber (1973) employs the same condition as (C3) in the context of M-estimation. Finally, (C4) requires that regressors of selected variables and those of unselected variables be only weakly correlated. A similar assumption is required in Huang et al. (2008) for deriving an asymptotic distribution for a bridge estimator. This type of assumption, e.g., a restricted eigenvalue condition (Bickel et al. 2009), is essential for handling the high dimensional behavior of the estimator.

4 Proposed method

In this section, we present the proposed method for selective inference in high dimensional logistic regression with marginal screening. We first consider a subset of features \(\hat{S} = S (\supset S^*)\) as a fixed set, and derive an asymptotic distribution of \(\hat{{\varvec{\beta }}}_S\) under the assumptions (C1)–(C3). Then, we introduce the “source” of the test statistic, \({\varvec{T}}_n({\varvec{\beta }}_S^*)\), which is defined through the score function, and apply it to the polyhedral lemma, where we show that the truncation points are asymptotically independent of \({\varvec{\eta }}^\top {\varvec{T}}_n({\varvec{\beta }}_S^*)\) under the assumption (C4).

To extend the selective inference framework to logistic regression, we first consider a subset of variables \(\hat{S}=S~(\supset S^*)\) as a fixed set. From (4), let us define a score function and observed information matrix by
$$\begin{aligned} {\varvec{s}}_n({\varvec{\beta }}_S)&=\frac{1}{\sqrt{n}}\ell '_n({\varvec{\beta }}_S) =\frac{1}{\sqrt{n}}\sum _{i=1}^{n}{\varvec{x}}_{S,i}(y_i-\psi '({\varvec{x}}_{S,i}^\top {\varvec{\beta }}_S)) \end{aligned}$$
and
$$\begin{aligned} \varSigma _n({\varvec{\beta }}_S)&=-\frac{1}{n}\ell ''_n({\varvec{\beta }}_S) =\frac{1}{n}\sum _{i=1}^{n}\psi ''({\varvec{x}}_{S,i}^\top {\varvec{\beta }}_S){\varvec{x}}_{S,i}{\varvec{x}}_{S,i}^\top , \end{aligned}$$
respectively. To simplify the notation, we denote \({\varvec{s}}_n({\varvec{\beta }}_S^*)\) and \(\varSigma _n({\varvec{\beta }}_S^*)\) by \({\varvec{s}}_n\) and \(\varSigma _n\), respectively, for the true value of \({\varvec{\beta }}_S^*\). Because \(\psi ''({\varvec{x}}_{S,i}^\top {\varvec{\beta }}_S^*)\) is uniformly bounded on \(\mathcal{B}\) from (C2), \(\varSigma _n\) is a symmetric and positive definite matrix when (C1) holds. Then, by the same argument as in Fan and Peng (2004), if \(K^2/n\rightarrow 0\), we have
$$\begin{aligned} \Vert \hat{{\varvec{\beta }}}_S-{\varvec{\beta }}_S^*\Vert =\mathrm{O}_{\mathrm{p}}(\sqrt{K/n}). \end{aligned}$$
(6)
By Taylor’s theorem, we have
$$\begin{aligned} {\varvec{0}} =\ell '_n(\hat{{\varvec{\beta }}}_S) \approx \sqrt{n}{\varvec{s}}_n-n\varSigma _n(\hat{{\varvec{\beta }}}_S-{\varvec{\beta }}_S^*), \end{aligned}$$
and thus
$$\begin{aligned} \sqrt{n}(\hat{{\varvec{\beta }}}_S-{\varvec{\beta }}_S^*) \approx \varSigma _n^{-1}{\varvec{s}}_n. \end{aligned}$$
As per Remark 1, \(S\supset S^*\) implies
$$\begin{aligned} \mathrm{E}[{\varvec{s}}_n] =\frac{1}{\sqrt{n}}\sum _{i=1}^{n}{\varvec{x}}_{S,i}(\mathrm{E}[y_i]-\psi '({\varvec{x}}_{S,i}^\top {\varvec{\beta }}_S^*)) ={\varvec{0}}. \end{aligned}$$
In addition, because the \(y_i\)’s are independent of each other, we observe that
$$\begin{aligned} \mathrm{V}[{\varvec{s}}_n] =\frac{1}{n}\sum _{i=1}^{n}\mathrm{V}[y_i]{\varvec{x}}_{S,i}{\varvec{x}}_{S,i}^\top =\varSigma _n. \end{aligned}$$
Therefore, by recalling the asymptotic normality of the score function, we expect that the distribution of \(\varSigma _n^{-1}{\varvec{s}}_n\) can be approximated by a normal distribution with mean \({\varvec{0}}\) and covariance matrix \(\varSigma _n^{-1}\). Observe that S can depend on the sample size n. However, if S is a fixed sequence with respect to n satisfying \(S\supset S^*\), this approximation is valid under the conditions (C1)–(C3):

Theorem 1

Suppose that the conditions (C1)–(C3) hold. Then, for any fixed sequence \(S~(\supset S^*)\) and \({\varvec{\eta }}\in \mathbb {R}^K\) with \(\Vert {\varvec{\eta }}\Vert <\infty\), we have
$$\begin{aligned} \sqrt{n}\sigma _n^{-1}{\varvec{\eta }}^\top (\hat{{\varvec{\beta }}}_S-{\varvec{\beta }}^*_S) =\sigma _n^{-1}{\varvec{\eta }}^\top \varSigma _n^{-1}{\varvec{s}}_n+\mathrm{o}_\mathrm{p}(1) {\mathop {\rightarrow }\limits ^{\mathrm{d}}} \mathrm{N}(0,1), \end{aligned}$$
(7)
where \(\sigma _n^2={\varvec{\eta }}^\top \varSigma _n^{-1}{\varvec{\eta }}\) and \(\mathrm{o}_{\mathrm{p}}(1)\) is a term that converges to 0 in probability uniformly with respect to \({\varvec{\eta }}\) and S.
Note that, under the conditions (C1), (C2) and \(d^3/n\rightarrow 0\), Theorem 1 also holds when we do not enforce variable selection (see e.g., Fan and Peng (2004)). To formulate a selective test, let us consider
$$\begin{aligned} {\varvec{T}}_n({\varvec{\beta }}_S^*) =\varSigma _n^{-1}{\varvec{s}}_n =\varSigma _n^{-1}\left\{ \frac{1}{\sqrt{n}}X_S^\top ({\varvec{y}}-{\varvec{\psi }}'({\varvec{\beta }}_S^*)) \right\} \end{aligned}$$
(8)
as a “source” of a test statistic, where \({\varvec{\psi }}'({\varvec{\beta }}_S^*)=(\psi '({\varvec{x}}_{S,i}^\top {\varvec{\beta }}_S^*))_{i=1,\ldots , n}\). The term “source” means that we cannot use it as a test statistic directly because \({\varvec{T}}_n({\varvec{\beta }}_S^*)\) depends on \({\varvec{\beta }}_S^*\). In the following, for notational simplicity, we denote \({\varvec{T}}_n({\varvec{\beta }}_S^*)\) and \({\varvec{\psi }}'({\varvec{\beta }}_S^*)\) by \({\varvec{T}}_n\) and \({\varvec{\psi }}'\), respectively.
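The quantities \({\varvec{s}}_n\), \(\varSigma _n\) and \({\varvec{T}}_n\) are straightforward to compute once an estimate of \({\varvec{\beta }}_S\) is plugged in. Below is a minimal R sketch under our own naming (an illustration, not the authors' code); plogis() gives \(\psi '\) and \(p(1-p)\) gives \(\psi ''\) for the logistic cumulant.

```r
# Minimal sketch of the score s_n, observed information Sigma_n, and the
# "source" T_n in (8), evaluated at a given beta_S (in practice the MLE (5)).
score_pieces <- function(XS, y, beta_S) {
  eta <- drop(XS %*% beta_S)
  p   <- plogis(eta)                  # psi'(x_{S,i}' beta_S)
  w   <- p * (1 - p)                  # psi''(x_{S,i}' beta_S)
  n   <- nrow(XS)
  s_n     <- drop(crossprod(XS, y - p)) / sqrt(n)
  Sigma_n <- crossprod(XS * w, XS) / n          # X_S' diag(w) X_S / n
  T_n     <- drop(solve(Sigma_n, s_n))          # Sigma_n^{-1} s_n
  list(s_n = s_n, Sigma_n = Sigma_n, T_n = T_n)
}
```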
As noted in Sect. 3.1, using an appropriate non-random matrix \(A\in \mathbb {R}^{K(2d-2K+1)\times d}\), the marginal screening selection event can be expressed as an affine constraint with respect to \({\varvec{z}}=X^\top {\varvec{y}}\), that is, \(\{{\varvec{z}}; \, A{\varvec{z}}\le {\varvec{0}}\}\). Then, by appropriately dividing A and X based on the selected S, we can rewrite it as follows:
$$\begin{aligned} A{\varvec{z}}\le {\varvec{0}} \quad \Leftrightarrow \quad A_SX_S^\top {\varvec{y}}+A_{S^\bot }X_{S^\bot }^\top {\varvec{y}}\le {\varvec{0}} \quad \Leftrightarrow \quad \tilde{A}{\varvec{T}}_n\le \tilde{{\varvec{b}}}. \end{aligned}$$
The last inequality is an affine constraint with respect to \({\varvec{T}}_n\), where
$$\begin{aligned} \tilde{A}=A_S\varSigma _n \qquad \text {and} \qquad \tilde{{\varvec{b}}}=-\frac{1}{\sqrt{n}}(A_SX_S^\top {\varvec{\psi }}'+A_{S^\bot }X_{S^\bot }^\top {\varvec{y}}). \end{aligned}$$
Unlike the polyhedral lemma in Sect. 2.1, \(\tilde{{\varvec{b}}}\) depends on \({\varvec{y}}\) and so is a random vector. By using (C4), we can prove that \(\tilde{{\varvec{b}}}\) is asymptotically independent of \({\varvec{\eta }}^\top {\varvec{T}}_n\), which implies the polyhedral lemma holds asymptotically.

Theorem 2

Suppose that (C1)–(C4) all hold. Let \({\varvec{c}}=\varSigma _n^{-1}{\varvec{\eta }}/\sigma _n^2\) for any \({\varvec{\eta }}\in \mathbb {R}^K\) with \(\Vert {\varvec{\eta }}\Vert <\infty\), and \({\varvec{w}}=(I_K-{\varvec{c}}{\varvec{\eta }}^\top ){\varvec{T}}_n\), where \(\sigma _n^2={\varvec{\eta }}^\top \varSigma _n^{-1}{\varvec{\eta }}\). Then, for any fixed \(S~(\supset S^*)\), the selection event can be expressed as
$$\begin{aligned} \{{\varvec{T}}; \, \tilde{A}{\varvec{T}}\le \tilde{{\varvec{b}}}\} =\{{\varvec{T}}; \, L_n\le {\varvec{\eta }}^\top {\varvec{T}}\le U_n,N_n=0\}, \end{aligned}$$
where
$$\begin{aligned} L_n =\max _{l:(\tilde{A}{\varvec{c}})_l<0}\frac{\tilde{b}_l-(\tilde{A}{\varvec{w}})_l}{(\tilde{A}{\varvec{c}})_l}, \qquad U_n =\min _{l:(\tilde{A}{\varvec{c}})_l>0}\frac{\tilde{b}_l-(\tilde{A}{\varvec{w}})_l}{(\tilde{A}{\varvec{c}})_l}, \end{aligned}$$
(9)
and \(N_n=\max _{l:(\tilde{A}{\varvec{c}})_l=0}\tilde{b}_l-(\tilde{A}{\varvec{w}})_l\). In addition, \((L_n,U_n,N_n)\) is asymptotically independent of \({\varvec{\eta }}^\top {\varvec{T}}_n\).
As a result of Theorems 1, 2 and (C0), we can asymptotically identify a pivotal quantity as a truncated normal distribution, that is, by letting \({\varvec{\eta }}={\varvec{e}}_j\in \mathbb {R}^K\),
$$\begin{aligned} \left[ F^{[L_n,U_n]}_{0,\sigma _n^2}({\varvec{\eta }}^\top {\varvec{T}}_n) \mid \tilde{A}{\varvec{T}}_n\le \tilde{{\varvec{b}}}\right] {\mathop {\rightarrow }\limits ^{\mathrm{d}}}\mathrm{Unif}(0,1) \end{aligned}$$
for any \({\varvec{w}}\), under \(\mathrm{H}_{0,j}\). Therefore, we can define an asymptotic selective p value for selective test (1) under \(\mathrm{H}_{0,j}\) as follows:
$$\begin{aligned} P_{n,j}=2\min \left\{ F^{[L_n,U_n]}_{0,\sigma _n^2}({\varvec{\eta }}^\top {\varvec{T}}_n), 1-F^{[L_n,U_n]}_{0,\sigma _n^2}({\varvec{\eta }}^\top {\varvec{T}}_n) \right\} , \end{aligned}$$
(10)
where \(L_n\) and \(U_n\) are evaluated at the realization \({\varvec{w}}={\varvec{w}}_0\) of \({\varvec{w}}\). Unfortunately, because \({\varvec{T}}_n\), \(\varSigma _n\), \(L_n\) and \(U_n\) still depend on the true value of \({\varvec{\beta }}_S^*\), we construct a test statistic by plugging in the maximum likelihood estimator (5), which is a consistent estimator of \({\varvec{\beta }}_S^*\).
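Given the truncation points of Theorem 2 and the plugged-in MLE, the selective p value (10) is evaluated with the same truncated normal CDF as in Sect. 2.1. A minimal sketch, reusing ptruncnorm() from the earlier sketch (our own naming):

```r
# Minimal sketch of (10) for eta = e_j: Tnj = e_j' T_n, sigma2 = e_j'
# Sigma_n^{-1} e_j, and (L, U) the truncation points evaluated at w = w_0.
selective_p_logistic <- function(Tnj, sigma2, L, U) {
  Fj <- ptruncnorm(Tnj, mu = 0, sigma = sqrt(sigma2), L = L, U = U)
  2 * min(Fj, 1 - Fj)
}
```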

4.1 Computing truncation points

In practice, we need to compute truncation points in (9). When we utilize marginal screening for variable selection, it becomes difficult to compute \(L_n\) and \(U_n\) because \(\tilde{A}\) becomes a \(\{2K(d-K)+K\}\times K\) dimensional matrix. For example, even when \(d=1000\) and \(K=20\), we need to handle a 39,220 dimensional vector. To reduce the computational burden, we derive a simple form of (9) in this section.

We first derive \(A_S\). As noted in Sect. 3.1, selection event \(\{(\hat{S}, {\varvec{u}}_{\hat{S}})=(S, {\varvec{u}}_S)\}\) can be rewritten as
$$\begin{aligned} -u_jz_j\le z_k\le u_jz_j,~u_j z_j\ge 0, \quad \forall (j,k)\in S\times S^\bot , \end{aligned}$$
where \(u_j=\mathrm{sgn}(z_j)\) is the sign of the j-th element of \({\varvec{z}}=X^\top {\varvec{y}}\). Let \(S=\{j_1,\ldots , j_K\}\) and \(q=2(d-K)+1\). Then, by a simple calculation, we have
$$\begin{aligned} A_S =\left( \begin{array}{lll} -u_{j_1}{\varvec{1}}_{q}&{}&{}O \\ &{}\ddots &{} \\ O&{}&{} -u_{j_K}{\varvec{1}}_{q} \end{array} \right) =-J\otimes {\varvec{1}}_{q}, \end{aligned}$$
where J is a \(K\times K\) dimensional diagonal matrix whose k-th diagonal element is \(u_{j_k}\) and \(\otimes\) denotes the Kronecker product. Since \(\tilde{A}=A_S\varSigma _n\) and \({\varvec{c}}=\varSigma _n^{-1}{\varvec{\eta }}/\sigma _n^2\), the denominator in (9) reduces to \(\tilde{A}{\varvec{c}}=A_S{\varvec{\eta }}/\sigma _n^2\). For \({\varvec{\eta }}={\varvec{e}}_j\), we can further evaluate \(A_S{\varvec{\eta }}\) as
$$\begin{aligned} A_S{\varvec{\eta }} =-u_j({\varvec{0}}_{(j-1)q}^\top ,{\varvec{1}}_{q}^\top ,{\varvec{0}}_{(K-j)q}^\top )^\top \in \mathbb {R}^{Kq}. \end{aligned}$$
Further, by the definition of \(\tilde{A},~\tilde{{\varvec{b}}}\), and \({\varvec{w}}\), we have
$$\begin{aligned} \tilde{{\varvec{b}}}-\tilde{A}{\varvec{w}} =\tilde{{\varvec{b}}}-\tilde{A}{\varvec{T}}_n+({\varvec{\eta }}^\top {\varvec{T}}_n)\tilde{A}{\varvec{c}} =-\frac{1}{\sqrt{n}}A{\varvec{z}}+T_{n,j}\tilde{A}{\varvec{c}}. \end{aligned}$$
Because \(\sigma _n^2\), the j-th diagonal element of \(\varSigma _n^{-1}\), is positive, it is straightforward to observe that
$$\begin{aligned} \{l:(\tilde{A}{\varvec{c}})_l<0\} = {\left\{ \begin{array}{ll} \{(j-1)q+1,\ldots ,jq\}, &{} \text {if}~u_j=1 \\ \emptyset , &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
for \(j=1,\ldots ,K\). Note that, for each \(j=1,\ldots ,K\), \((A{\varvec{z}})_{l=(j-1)q+1,\ldots ,jq}\) consists of q elements of \(z_j\) and \(z_j\pm z_k\) for any \(k\in S^\bot\). Therefore, for each \(j=1,\ldots ,K\), we have
$$\begin{aligned} \max _{l=(j-1)q+1,\ldots ,jq}(A{\varvec{z}})_l =\max _{k\in S^\bot }\{z_j,z_j\pm z_k\} =z_j+\max _{k\in S^\bot }|z_k|. \end{aligned}$$
As a consequence, we obtain
$$\begin{aligned} L_n&=\max _{l:(\tilde{A}{\varvec{c}})_l<0}\frac{\tilde{b}_l-(\tilde{A}{\varvec{w}})_l}{(\tilde{A}{\varvec{c}})_l} \nonumber \\&=\max _{l:(\tilde{A}{\varvec{\eta }})_l<0}\frac{-(A{\varvec{z}})_l/\sqrt{n}}{(\tilde{A}{\varvec{\eta }})_l/\sigma _n^2}+T_{n,j} \nonumber \\&=\frac{\sigma _n^2}{\sqrt{n}} \max _{l=(j-1)q+1,\ldots ,jq}(A{\varvec{z}})_l+T_{n,j} \nonumber \\&=\frac{\sigma _n^2}{\sqrt{n}}(|z_j|+\max _{k\in S^\bot }|z_k|)+T_{n,j}, \end{aligned}$$
(11)
if \(u_j=1\), and \(L_n=-\infty\), otherwise. Similarly, we obtain
$$\begin{aligned} U_n&=\min _{l:(\tilde{A}{\varvec{c}})_l>0}\frac{\tilde{b}_l-(\tilde{A}{\varvec{w}})_l}{(\tilde{A}{\varvec{c}})_l} \nonumber \\&=\min _{l:(\tilde{A}{\varvec{\eta }})_l>0}\frac{-(A{\varvec{z}})_l/\sqrt{n}}{(\tilde{A}{\varvec{\eta }})_l/\sigma _n^2}+T_{n,j} \nonumber \\&=-\frac{\sigma _n^2}{\sqrt{n}} \max _{l=(j-1)q+1,\ldots ,jq}(A{\varvec{z}})_l+T_{n,j} \nonumber \\&=\frac{\sigma _n^2}{\sqrt{n}}(|z_j|-\max _{k\in S^\bot }|z_k|)+T_{n,j}, \end{aligned}$$
(12)
if \(u_j=-1\), and \(U_n=\infty\), otherwise. Because of this simple form, we can calculate the truncation points efficiently. We summarize the algorithm for computing selective p values of the K selective tests in Algorithm 1.
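The closed forms (11) and (12) make Algorithm 1 cheap: only \(|z_j|\), \(\max _{k\in S^\bot }|z_k|\), \(\sigma _n^2\) and \(T_{n,j}\) are needed. The R sketch below simply transcribes (11)–(12) under our own naming (here j indexes a position within S); inputs come from the screening and score sketches above.

```r
# Minimal sketch transcribing (11)-(12): truncation points for the j-th
# selective test, given z = X'y, the selected set S with signs u, and
# Sigma_n, T_n from the score sketch.
truncation_points <- function(z, S, u, Sigma_n, T_n, j, n) {
  sigma2 <- solve(Sigma_n)[j, j]       # sigma_n^2 = e_j' Sigma_n^{-1} e_j
  zj     <- abs(z[S[j]])
  zmax   <- max(abs(z[-S]))            # max_{k in S^c} |z_k|
  L <- if (u[j] ==  1) sigma2 / sqrt(n) * (zj + zmax) + T_n[j] else -Inf
  U <- if (u[j] == -1) sigma2 / sqrt(n) * (zj - zmax) + T_n[j] else  Inf
  c(L = L, U = U)
}
```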

4.2 Controlling family-wise error rate

Since the selective test (1) consists of K hypotheses, we may be concerned about multiplicity when \(K>1\). In this case, instead of the selective type I error, we control the family-wise error rate (FWER) in the sense of selective inference, which we term the selective FWER.

For the selected set \(\hat{S}=S\), let us denote the family of true nulls by \(\mathcal{H}=\{\mathrm{H}_{0,j}~(j\in S):\mathrm{H}_{0,j}~\text {is a true null}\}\). Then, let us define the selective FWER by
$$\begin{aligned} \mathrm{sFWER} =\mathrm{P}(\text {at least one}~\mathrm{H}_{0,j}\in \mathcal{H}~\text {is rejected}\mid \hat{S}=S) \end{aligned}$$
(13)
in the same way as the classical FWER. Next, we asymptotically control the selective FWER at level \(\alpha\) by utilizing a Bonferroni correction for the K selective tests. Specifically, we adjust the selective p values (10) as follows. Let us define \(\tilde{\alpha }=\alpha /K\). Since the selective p value \(P_{n,j}\) is asymptotically distributed according to \(\mathrm{Unif}(0,1)\), the limit superior of (13) can be bounded as follows:
$$\begin{aligned} \limsup _{n\rightarrow \infty }\mathrm{P}\left( \bigcup _{j:\mathrm{H}_{0,j}\in \mathcal{H}}\{P_{n,j}\le \tilde{\alpha } \} \mid \hat{S}=S\right)&\le \limsup _{n\rightarrow \infty }\sum _{j:\mathrm{H}_{0,j}\in \mathcal{H}}\mathrm{P}(P_{n,j}\le \tilde{\alpha } \mid \hat{S}=S) \\&\le \sum _{j:\mathrm{H}_{0,j}\in \mathcal{H}}\limsup _{n\rightarrow \infty }\mathrm{P}(P_{n,j}\le \tilde{\alpha } \mid \hat{S}=S) \\&\le |\mathcal{H}|\tilde{\alpha } \le \alpha . \end{aligned}$$
In the last inequality, we simply use \(|\mathcal{H}|\le K\). Accordingly, letting \(p_{n,j}\) be a realization of (10), we reject a null hypothesis when \(p_{n,j}\le \tilde{\alpha }\). In the following, we refer to \(\tilde{p}_{n,j}=\min \{1,Kp_{n,j}\}\) as an adjusted selective p value. Note that we can utilize not only Bonferroni’s method but also other methods for correcting multiplicity, such as Scheffé’s method, Holm’s method, and so on. We use Bonferroni’s method for expository purposes.
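In code, the Bonferroni step amounts to the following one-liner (our own sketch): reporting \(\min \{1,Kp_{n,j}\}\) and comparing it with \(\alpha\) is equivalent to comparing \(p_{n,j}\) with \(\alpha /K\).

```r
# Minimal sketch of Sect. 4.2: adjusted selective p values and the
# FWER-level rejection rule.
adjust_selective_p <- function(p) pmin(1, length(p) * p)
# Example: reject <- adjust_selective_p(p_values) <= 0.05
```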

5 Simulation study

Through simulation studies, we explore the performance of the proposed method in Sect. 4, which we term ASICs (Asymptotic Selective Inference for Classification) here.

We first examine whether ASICs can control the selective type I error. We also check the selective type I error when the data splitting (DS) and nominal test (NT) methods are used. In DS, we first randomly divide the data into two disjoint sets. Then, after selecting \(\hat{S}=S\) with \(|S|=K\) using one of these sets, we construct a test statistic \({\varvec{T}}_n(\hat{{\varvec{\beta }}}_S)\) based on the other set and reject the j-th selective test (1) when \(|T_{n,j}/\sigma _n|\ge z_{\alpha /2}\), where \(z_{\alpha /2}\) is the upper \(\alpha /2\)-percentile of a standard normal distribution. NT cannot control type I errors since selection bias is ignored: it first selects K variables by marginal screening, then rejects the j-th selective test (1) when \(|T_{n,j}/\sigma _n|\ge z_{\alpha /2}\), where the entire data set is used for both the selection and inference steps. Finally, we explore whether ASICs can effectively control the selective FWER, and at the same time, confirm its statistical power by comparing it with that of DS.
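For reference, a minimal R sketch of the DS comparator under our assumptions: selection by marginal screening on one random half, and an ordinary Wald-type logistic test on the other half, used here as a stand-in for the \(|T_{n,j}/\sigma _n|\) rule described above. It reuses marginal_screening() from Sect. 3.1.

```r
# Minimal sketch of the data splitting (DS) comparator.
data_splitting_test <- function(X, y, K, alpha = 0.05) {
  n   <- nrow(X)
  id1 <- sample(n, floor(n / 2))                     # selection half
  S   <- marginal_screening(X[id1, ], y[id1], K)$S
  XS2 <- X[-id1, S, drop = FALSE]                    # inference half
  fit <- glm(y[-id1] ~ XS2 - 1, family = binomial())
  zstat <- summary(fit)$coefficients[, "z value"]
  abs(zstat) >= qnorm(1 - alpha / 2)                 # rejected hypotheses
}
```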

The simulation settings are as follows. As the d dimensional regressor \({\varvec{x}}_i\) (\(i=1,\ldots ,n\)), we used vectors drawn from \(\mathrm{N}({\varvec{0}},\varSigma )\), where \(\varSigma\) is a \(d\times d\) dimensional covariance matrix whose (j, k)-th element is set to \(\rho ^{|j-k|}\). We set \(\rho =0\) in Case 1 and \(\rho =0.5\) in Case 2. Note that the elements of \({\varvec{x}}_i\) are independent in Case 1 but correlated in Case 2. Then, for each \({\varvec{x}}_i\), we generate \(y_i\) from \(\mathrm{Bi}(\psi '({\varvec{x}}_i^\top {\varvec{\beta }}^*))\), where \({\varvec{\beta }}^*\) is a d dimensional true coefficient vector and \(\mathrm{Bi}(p)\) is a Bernoulli distribution with parameter p. In the following, we conduct simulations using 1000 Monte-Carlo runs. We use the glm package in R for parameter estimation.
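A minimal sketch of this data-generating process, assuming the MASS package for multivariate normal sampling (our own illustration):

```r
# Minimal sketch of the simulation design: AR(1)-type covariance with
# parameter rho, regressors from N(0, Sigma), and Bernoulli responses with
# success probability psi'(x_i' beta*).
library(MASS)
simulate_data <- function(n, d, rho, beta_star) {
  Sigma <- rho^abs(outer(1:d, 1:d, "-"))   # (j,k)-th element rho^|j-k|
  X <- mvrnorm(n, mu = rep(0, d), Sigma = Sigma)
  y <- rbinom(n, size = 1, prob = plogis(drop(X %*% beta_star)))
  list(X = X, y = y)
}
```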

5.1 Controlling selective type I error

To check if ASICs can control selective type I error, we consider a selective test (1). Specifically, we first select \(K=1\) variable by marginal screening and then conduct a selective test at the 5% level. By setting \({\varvec{\beta }}^*={\varvec{0}}\in \mathbb {R}^d\), we can confirm selective type I error because the selective null is always true. Therefore, we assess the following index as an estimator of the selective type I error: letting \(\beta\) be the selected variable in each simulation, we evaluate an average and standard deviation of
$$\begin{aligned} I\{\mathrm{H}_0~\text {is rejected}\}, \end{aligned}$$
(14)
where I is an indicator function and \(\mathrm{H}_0:\beta ^*=0\) is a selective null. We construct a selective test at the 5% level in all simulations. In the same manner as classical type I error, it is desirable when the above index is less than or equal to 0.05, with particularly small values indicating that the selective test is overly conservative.
Table 1 presents averages and standard deviations of (14) based on 1000 runs. It is clear that NT cannot control the selective type I error; it becomes larger as the dimension d increases. In addition, NT does not improve even if the sample size becomes large, because there exists selection bias in the selection step. On the other hand, both ASICs and DS adequately control the selective type I error, although the latter appears slightly more conservative than the former. Moreover, unlike NT, these two methods adequately control the selective type I error even when the covariance structure of \({\varvec{x}}_i\) and the number of dimensions change.
Table 1

Method comparison using simulated data based on 1000 Monte-Carlo runs

Case 1

d      Method   n = 50          n = 100         n = 200         n = 500         n = 1000        n = 1500
200    ASICs    0.029 (0.168)   0.049 (0.216)   0.038 (0.191)   0.031 (0.173)   0.028 (0.165)   0.033 (0.179)
       DS       0.012 (0.109)   0.015 (0.122)   0.004 (0.063)   0.004 (0.063)   0.011 (0.104)   0.011 (0.104)
       NT       0.184 (0.388)   0.226 (0.418)   0.219 (0.414)   0.261 (0.439)   0.255 (0.436)   0.256 (0.437)
500    ASICs    0.028 (0.165)   0.043 (0.203)   0.039 (0.194)   0.039 (0.194)   0.032 (0.176)   0.036 (0.186)
       DS       0.012 (0.109)   0.006 (0.077)   0.008 (0.089)   0.009 (0.094)   0.005 (0.071)   0.008 (0.089)
       NT       0.267 (0.044)   0.273 (0.446)   0.304 (0.460)   0.301 (0.459)   0.326 (0.469)   0.325 (0.469)
1000   ASICs    0.041 (0.198)   0.044 (0.205)   0.023 (0.150)   0.032 (0.176)   0.038 (0.191)   0.044 (0.205)
       DS       0.006 (0.077)   0.011 (0.104)   0.010 (0.100)   0.009 (0.094)   0.013 (0.113)   0.010 (0.100)
       NT       0.294 (0.456)   0.345 (0.476)   0.390 (0.488)   0.402 (0.491)   0.411 (0.492)   0.405 (0.491)

Case 2

d      Method   n = 50          n = 100         n = 200         n = 500         n = 1000        n = 1500
200    ASICs    0.038 (0.191)   0.038 (0.191)   0.040 (0.196)   0.032 (0.176)   0.028 (0.165)   0.031 (0.173)
       DS       0.012 (0.109)   0.007 (0.083)   0.012 (0.109)   0.010 (0.100)   0.012 (0.109)   0.004 (0.063)
       NT       0.177 (0.382)   0.207 (0.405)   0.234 (0.424)   0.211 (0.408)   0.219 (0.414)   0.210 (0.408)
500    ASICs    0.049 (0.216)   0.038 (0.191)   0.030 (0.171)   0.030 (0.171)   0.039 (0.194)   0.034 (0.181)
       DS       0.007 (0.083)   0.006 (0.077)   0.010 (0.100)   0.009 (0.094)   0.007 (0.083)   0.007 (0.083)
       NT       0.247 (0.431)   0.269 (0.443)   0.291 (0.454)   0.295 (0.456)   0.309 (0.462)   0.318 (0.466)
1000   ASICs    0.049 (0.216)   0.047 (0.212)   0.031 (0.173)   0.034 (0.181)   0.024 (0.153)   0.046 (0.210)
       DS       0.009 (0.094)   0.008 (0.089)   0.013 (0.113)   0.006 (0.077)   0.006 (0.077)   0.010 (0.100)
       NT       0.290 (0.454)   0.350 (0.477)   0.375 (0.484)   0.396 (0.489)   0.407 (0.492)   0.414 (0.493)

Each cell denotes the average of (14), with standard deviation in parentheses

5.2 FWER and power

Here, we explore the selective FWER and statistical power of ASICs and DS for K selective tests (1), where we set \(K=5, 10, 15\), and 20. Note that, as discussed in the above section, NT is disregarded here because it does not adequately control the selective type I error. We adjust for multiplicity by utilizing Bonferroni’s method as noted in Sect. 4.2.

The true coefficient vector is set to \({\varvec{\beta }}^*=(2\times {\varvec{1}}_5^\top ,{\varvec{0}}_{d-5}^\top )^\top\) in Model 1 and \({\varvec{\beta }}^*=(2\times {\varvec{1}}_5^\top ,-2\times {\varvec{1}}_5^\top ,{\varvec{0}}_{d-10}^\top )^\top\) in Model 2. We then assess the following indices as estimators of the selective FWER and power. Letting \(\hat{S}=S\) be the subset of selected variables in each simulation, we evaluate an average of
$$\begin{aligned} I\{\text {at least one}~\mathrm{H}_{0,j}\in \mathcal{H}~\text {is rejected}\} \end{aligned}$$
(15)
and
$$\begin{aligned} \frac{1}{|S^*|}\sum _{j\in S}I\{\mathrm{H}_{0,j}\not \in \mathcal{H}~\text {is rejected}\}, \end{aligned}$$
(16)
where, for each \(j\in S\), \(\mathrm{H}_{0,j}:\beta _j^*=0\) is the selective null and \(\mathcal{H}\) is the family of true nulls. Note that, by using Bonferroni’s method, we use \(\tilde{\alpha }=\alpha /K\) as the adjusted significance level for \(\alpha =0.05\). Similar to the selective type I error, it is desirable for (15) to be less than or equal to \(\alpha\). In addition, higher values of (16) are desirable, in the same manner as classical power. We evaluate (16) as the proportion of rejected false nulls among the true active variables. We employ this performance index because it is important to identify how many truly active variables are extracted in practice.
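A minimal sketch of the two indices, with rejected a logical vector over the selected set S, S the selected indices, and S_star the true active set (our own naming):

```r
# Minimal sketch of (15) and (16): selective FWER indicator and power index.
sFWER_indicator <- function(rejected, S, S_star) {
  any(rejected & !(S %in% S_star))        # some true null in S is rejected
}
power_index <- function(rejected, S, S_star) {
  sum(rejected & (S %in% S_star)) / length(S_star)
}
```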
Figure 2 shows the average of (15) for each method. ASICs and DS are both evaluated with respect to four values of K; thus, eight lines are plotted in each graph. Because of simulation randomness, some of the ASICs results exceed 0.05, especially for small sample sizes and large variable dimensions. For both methods, it is clear that the selective FWER tends to be controlled at the desired significance level, although DS is more conservative than ASICs. To accord with our asymptotic theory, the number of selected variables must be \(K=\mathrm{o}(n^{1/3})\), which means that the normal approximation is not ensured in the cases \(K=15\) and 20. However, we observe that the selective FWER is correctly controlled even in these cases, which suggests that assumptions (C3) and (C4) can be relaxed.
Fig. 2

Method comparison using simulated data based on 1000 Monte-Carlo runs. The vertical and horizontal axes represent an average of (15) and sample size, respectively. The dotted line shows the significance level (\(\alpha =0.05\))

Figures 3 and 4 show the average of (16) for each method and setting in Model 1 and Model 2, respectively. In Case 1 of Fig. 3, ASICs and DS have almost the same power for each K and d. In addition, ASICs is clearly superior to DS in Case 2. This is reasonable since DS uses only half of the data for inference. On the other hand, in all cases, the power of ASICs becomes higher as the number of selected variables K decreases. This can be explained by the condition (C3), that is, we need a much larger sample size when K becomes large to ensure the asymptotic result in Theorem 2. In Fig. 4, it is clear that the power of ASICs is superior in almost all settings. However, neither ASICs nor DS appears to perform well when \(K=5\). In this case, the power of ASICs and DS cannot exceed \(50\%\), because at most 5 of the 10 true nonzero variables can be selected.
Fig. 3

Method comparison using simulated data based on 1000 Monte-Carlo runs. The vertical and horizontal axes represent an average of (16) and sample size, respectively

Fig. 4

Method comparison using simulated data based on 1000 Monte-Carlo runs. The vertical and horizontal axes represent an average of (16) and sample size, respectively

6 Empirical applications

We further explore the performance of the proposed method by applying it to several empirical data sets, all of which are available from the LIBSVM repository. In all experiments, we standardize the design matrix X so that each variable has the same scale. We report adjusted selective p values for the selected variables. To explore the selection bias, we also report naive adjusted p values: we first compute p values for the selected variables based on NT, and then adjust these p values by multiplying by the number of selected variables. The results are plotted in Figs. 5, 6 and 7. They show that almost all adjusted nominal p values are smaller than those of selective inference, and the difference between these p values is interpreted as the effect of selection bias.
Fig. 5 Comparison between adjusted selective p values and nominal p values. The vertical and horizontal axes represent adjusted p values and indices of selected variables, respectively, and the black dotted line shows the significance level (\(\alpha =0.05\)). In each figure, black circles and red triangles, respectively, indicate adjusted nominal p values and selective p values
Fig. 6 Comparison between adjusted selective p values and nominal p values. The vertical and horizontal axes represent adjusted p values and indices of selected variables, respectively, and the black dotted line shows the significance level (\(\alpha =0.05\)). In each figure, black circles and red triangles, respectively, indicate adjusted nominal p values and selective p values
Fig. 7 Comparison between adjusted selective p values and nominal p values. The vertical and horizontal axes represent adjusted p values and indices of selected variables, respectively, and the black dotted line shows the significance level (\(\alpha =0.05\)). In each figure, black circles and red triangles, respectively, indicate adjusted nominal p values and selective p values

7 Theoretical analysis

In this section, we provide proofs of the theoretical results derived herein. We write \(p\lesssim q\) for \(p,q\in \mathbb {R}\) if there exists a constant \(r>0\) such that \(p\le rq\); \(p\gtrsim q\) is defined analogously. All proofs are carried out for a fixed \(S~(\supset S^*)\); thus we simply write \(\hat{{\varvec{\beta }}}\) and X for \(\hat{{\varvec{\beta }}}_S\) and \(X_S\), respectively. This is because we need to verify several asymptotic conditions before selection, in the same way as in Tian and Taylor (2017) and Taylor and Tibshirani (2018).

7.1 Proof of (6)

Let \(\alpha _n=\sqrt{K/n}\) and define a K dimensional vector \({\varvec{u}}\) satisfying \(\Vert {\varvec{u}}\Vert =C\) for a sufficiently large \(C>0\). The concavity of \(\ell _n\) implies
$$\begin{aligned} \mathrm{P} (\Vert \hat{{\varvec{\beta }}}-{\varvec{\beta }}^*\Vert \le \alpha _nC) \ge \mathrm{P}\left( \sup _{\Vert {\varvec{u}}\Vert =C}\ell _n({\varvec{\beta }}^*+\alpha _n{\varvec{u}})<\ell _n({\varvec{\beta }}^*)\right) , \end{aligned}$$
and thus, we need to show that for any \(\varepsilon >0\), there exists a sufficiently large \(C>0\) such that
$$\begin{aligned} \mathrm{P}\left( \sup _{\Vert {\varvec{u}}\Vert =C}\ell _n({\varvec{\beta }}^*+\alpha _n{\varvec{u}})<\ell _n({\varvec{\beta }}^*)\right) \ge 1-\varepsilon . \end{aligned}$$
(17)
In fact, the above inequality implies that, with probability at least \(1-\varepsilon\), \(\hat{{\varvec{\beta }}}\in \{{\varvec{\beta }}^*+\alpha _n{\varvec{u}};\;\Vert {\varvec{u}}\Vert \le C\}\), that is, \(\Vert \hat{{\varvec{\beta }}}-{\varvec{\beta }}^*\Vert =\mathrm{O}_\mathrm{p}(\alpha _n)\).
Observe that \(|\psi '({\varvec{x}}_i^\top {\varvec{\beta }})|,|\psi ''({\varvec{x}}_i^\top {\varvec{\beta }})|\) and \(|\psi '''({\varvec{x}}_i^\top {\varvec{\beta }})|\) are bounded uniformly with respect to \({\varvec{\beta }}\in \mathcal{B}\) and i. Using Taylor’s theorem, we have
$$\begin{aligned}&\ell _n({\varvec{\beta }}^*+\alpha _n{\varvec{u}})-\ell _n({\varvec{\beta }}^*) \\&\quad =\sum _{i=1}^n\left[ \alpha _ny_i{\varvec{x}}_i^\top {\varvec{u}}-\left\{ \psi ({\varvec{x}}_i^\top ({\varvec{\beta }}^*+\alpha _n{\varvec{u}}))-\psi ({\varvec{x}}_i^\top {\varvec{\beta }}^*)\right\} \right] \\&\quad =\alpha _n\sum _{i=1}^{n}(y_i-\psi '({\varvec{x}}_i^\top {\varvec{\beta }}^*)){\varvec{x}}_i^\top {\varvec{u}} -\frac{\alpha _n^2}{2}\sum _{i=1}^{n}\psi ''({\varvec{x}}_i^\top {\varvec{\beta }}^*)({\varvec{x}}_i^\top {\varvec{u}})^2 -\frac{\alpha _n^3}{6}\sum _{i=1}^{n}\psi '''(\theta _i)({\varvec{x}}_i^\top {\varvec{u}})^3 \\&\quad \equiv I_1+I_2+I_3, \end{aligned}$$
where for \(i=1,2,\ldots ,n\), \(\theta _i\) is in the line segment between \({\varvec{x}}_i^\top {\varvec{\beta }}^*\) and \({\varvec{x}}_i^\top ({\varvec{\beta }}^*+\alpha _n{\varvec{u}})\). From (C1) and (C2), we observe that
$$\begin{aligned} \mathrm{E}\left[ \left\{ \sum _{i=1}^{n}(y_i-\psi '({\varvec{x}}_i^\top {\varvec{\beta }}^*)){\varvec{x}}_i^\top {\varvec{u}}\right\} ^2\right]&=\sum _{i=1}^{n}\mathrm{E}\left[ (y_i-\psi '({\varvec{x}}_i^\top {\varvec{\beta }}^*))^2({\varvec{x}}_i^\top {\varvec{u}})^2\right] \\&=\sum _{i=1}^{n}\psi ''({\varvec{x}}_i^\top {\varvec{\beta }}^*)({\varvec{x}}_{i}^\top {\varvec{u}})^2 \lesssim n{\varvec{u}}^\top \varXi _n{\varvec{u}} \lesssim n\Vert {\varvec{u}}\Vert ^2, \end{aligned}$$
and thus we have \(|I_1|=\mathrm{O}_\mathrm{p}(\alpha _n\sqrt{n}\Vert {\varvec{u}}\Vert )=\mathrm{O}_\mathrm{p}(\sqrt{K}\Vert {\varvec{u}}\Vert )\). Next, using (C1) again, \(I_2\) can be bounded as
$$\begin{aligned} I_2 \lesssim -\alpha _n^2\sum _{i=1}^{n}({\varvec{x}}_i^\top {\varvec{u}})^2 \lesssim -K\Vert {\varvec{u}}\Vert ^2 <0. \end{aligned}$$
Finally, for \(I_3\), we have
$$\begin{aligned} |I_3|&=\left| \frac{\alpha _n^3}{6}\sum _{i=1}^{n}\psi '''(\theta _i)({\varvec{x}}_i^\top {\varvec{u}})^3\right| \lesssim \alpha _n^3\sum _{i=1}^{n}|{\varvec{x}}_i^\top {\varvec{u}}|^3 \le n\alpha _n^3{\varvec{u}}^\top \varXi _n{\varvec{u}}\max _{1\le i\le n}|{\varvec{x}}_i^\top {\varvec{u}}| \\&\lesssim n\alpha _n^3\sqrt{K}\Vert {\varvec{u}}\Vert ^3 =\mathrm{O}\left( \frac{K^2}{\sqrt{n}}\Vert {\varvec{u}}\Vert ^3\right) . \end{aligned}$$
Combining all the above, if \(K^2/n\rightarrow 0\) is satisfied, we observe that for sufficiently large C, both \(I_1\) and \(I_3\) are dominated by \(I_2~(<0)\). As a result, we obtain (17).
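Spelling out the orders involved (a restatement of the bounds above, not a new result), we have
$$\begin{aligned} |I_1|=\mathrm{O}_\mathrm{p}(\sqrt{K}\,C),\qquad |I_2|\gtrsim KC^2,\qquad |I_3|=\mathrm{O}\!\left( \frac{K^2}{\sqrt{n}}\,C^3\right) , \end{aligned}$$
so that \(|I_1|/|I_2|=\mathrm{O}_\mathrm{p}(1/(\sqrt{K}C))\) is arbitrarily small for large C, while \(|I_3|/|I_2|=\mathrm{O}(KC/\sqrt{n})\rightarrow 0\) whenever \(K^2/n\rightarrow 0\).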

Remark 2

From (6) and (2), we have
$$\begin{aligned} |{\varvec{x}}_i^\top \hat{{\varvec{\beta }}}| \le |{\varvec{x}}_i^\top (\hat{{\varvec{\beta }}}-{\varvec{\beta }}^*)|+|{\varvec{x}}_i^\top {\varvec{\beta }}^*| =\mathrm{O}_{\mathrm{p}}(K/\sqrt{n})+\xi , \end{aligned}$$
and thus, with probability tending to 1, \(\hat{{\varvec{\beta }}}\in \mathcal{B}\) holds.

7.2 Proof of Theorem 1

First, we prove that \(\sqrt{n}(\hat{{\varvec{\beta }}}-{\varvec{\beta }}^*)\) is asymptotically equivalent to \(\varSigma _n^{-1}{\varvec{s}}_n\). Using Taylor’s theorem, we have
$$\begin{aligned} {\varvec{0}} =\ell _n'(\hat{{\varvec{\beta }}}) =\ell _n'({\varvec{\beta }}^*)+\ell ''_n({\varvec{\beta }}^*)(\hat{{\varvec{\beta }}}-{\varvec{\beta }}^*) +\frac{1}{2}\sum _{i=1}^n\psi '''(\tilde{\theta }_i){\varvec{x}}_i\{{\varvec{x}}_i^\top (\hat{{\varvec{\beta }}}-{\varvec{\beta }}^*)\}^2, \end{aligned}$$
(18)
where for \(i=1,2,\ldots ,n\), \(\tilde{\theta }_i\) is in the line segment between \({\varvec{x}}_i^\top {\varvec{\beta }}^*\) and \({\varvec{x}}_i^\top \hat{{\varvec{\beta }}}\). In addition, (18) can be rewritten as
$$\begin{aligned} \sqrt{n}(\hat{{\varvec{\beta }}}-{\varvec{\beta }}^*) =\varSigma _n^{-1}{\varvec{s}}_n+R_n, \end{aligned}$$
where
$$\begin{aligned} R_n =-\frac{1}{2\sqrt{n}}\varSigma _n^{-1}\sum _{i=1}^n\psi '''(\tilde{\theta }_i) {\varvec{x}}_i\{{\varvec{x}}_i^\top (\hat{{\varvec{\beta }}}-{\varvec{\beta }}^*)\}^2. \end{aligned}$$
Noting that, from (C1),
$$\begin{aligned} \lambda _{\mathrm{min}}(\varSigma _n) \gtrsim \lambda _{\mathrm{min}}(\varXi _n)>C_1 >0, \end{aligned}$$
(C1), (C3) and (6) imply
$$\begin{aligned} \Vert R_n\Vert&\lesssim \frac{1}{\sqrt{n}}\max _{1\le i\le n}|{\varvec{x}}_i^\top (\hat{{\varvec{\beta }}} -{\varvec{\beta }}^*)| \times \left\| \sum _{i=1}^{n}\varSigma _n^{-1}\psi '''(\tilde{\theta }_i){\varvec{x}}_i {\varvec{x}}_i^\top (\hat{{\varvec{\beta }}}-{\varvec{\beta }}^*)\right\| \\&\lesssim \frac{1}{\sqrt{n}}\max _{1\le i\le n}\Vert {\varvec{x}}_i\Vert \Vert \hat{{\varvec{\beta }}} -{\varvec{\beta }}^*\Vert \times n\lambda _{\mathrm{max}}(\varXi _n)\Vert \hat{{\varvec{\beta }}}-{\varvec{\beta }}^*\Vert \\&=\mathrm{O}_{\mathrm{p}}\left( \frac{K\sqrt{K}}{\sqrt{n}} \right) =\mathrm{o}_{\mathrm{p}}(1). \end{aligned}$$
Now we can prove the asymptotic normality of \(\sigma _n^{-1}{\varvec{\eta }}^\top \varSigma _n^{-1}{\varvec{s}}_n\). For any K dimensional vector \({\varvec{\eta }}\) with \(\Vert {\varvec{\eta }}\Vert <\infty\), define \(\sigma _n^2={\varvec{\eta }}^\top \varSigma _n^{-1}{\varvec{\eta }}\) and \(\omega _{ni}\) such that
$$\begin{aligned} {\varvec{\eta }}^\top \varSigma _n^{-1}{\varvec{s}}_n =\frac{1}{\sqrt{n}}\sum _{i=1}^{n}{\varvec{\eta }}^\top \varSigma _n^{-1}{\varvec{x}}_i (y_i-\psi '({\varvec{x}}_i^\top {\varvec{\beta }}^*)) =\sum _{i=1}^{n}\omega _{ni}. \end{aligned}$$
Then, since \(S\supset S^*\), we observe that
$$\begin{aligned} \sum _{i=1}^{n}\mathrm{E}[\omega _{ni}] =\sum _{i=1}^{n}\frac{1}{\sqrt{n}}{\varvec{\eta }}^\top \varSigma _n^{-1}{\varvec{x}}_i\mathrm{E}[y_i-\psi '_i] =0, \end{aligned}$$
and
$$\begin{aligned} \sum _{i=1}^{n}\mathrm{V}[\omega _{ni}] =\frac{1}{n}\sum _{i=1}^n{\varvec{\eta }}^\top \varSigma _n^{-1}{\varvec{x}}_i\mathrm{V}[y_i] {\varvec{x}}_i^\top \varSigma _n^{-1}{\varvec{\eta }} =\sigma _n^2. \end{aligned}$$
To establish the asymptotic normality of \(\sigma _n^{-1}{\varvec{\eta }}^\top \varSigma _n^{-1}{\varvec{s}}_n\), we check the Lindeberg condition for \(\{\omega _{ni}\}\): for any \(\varepsilon >0\),
$$\begin{aligned} \frac{1}{\sigma _n^2}\sum _{i=1}^{n}\mathrm{E}[\omega _{ni}^2I(|\omega _{ni}|>\sigma _n\varepsilon )] =\mathrm{o}(1). \end{aligned}$$
(19)
For any \(\varepsilon >0\), we have
$$\begin{aligned}&\frac{1}{\sigma _n^2}\sum _{i=1}^{n}\mathrm{E}[\omega _{ni}^2I(|\omega _{ni}|>\sigma _n\varepsilon )] \\&\quad =\frac{1}{\sigma _n^2}\cdot \frac{1}{n}\sum _{i=1}^{n}({\varvec{\eta }}^\top \varSigma _n^{-1}{\varvec{x}}_i)^2\mathrm{E}[(y_i-\psi '_i)^2I(|\omega _{ni}|>\sigma _n\varepsilon )] \\&\quad \le \frac{1}{\sigma _n^2}\max _{1\le i\le n}\mathrm{E}[(y_i-\psi '_i)^2I(|\omega _{ni}|>\sigma _n\varepsilon )] \times \frac{1}{n}\sum _{i=1}^{n}({\varvec{\eta }}^\top \varSigma _n^{-1}{\varvec{x}}_i)^2. \end{aligned}$$
Using the Cauchy–Schwarz inequality and (C1),
$$\begin{aligned} \frac{1}{n}\sum _{i=1}^{n}({\varvec{\eta }}^\top \varSigma _n^{-1}{\varvec{x}}_i)^2 \le \frac{1}{n}\sum _{i=1}^{n}({\varvec{\eta }}^\top \varSigma _n^{-1}{\varvec{\eta }})({\varvec{x}}_i^\top \varSigma _n^{-1}{\varvec{x}}_i) \lesssim \frac{1}{n}\sum _{i=1}^{n}\Vert {\varvec{x}}_i\Vert ^2 =\mathrm{O}(K). \end{aligned}$$
Noting that each \(y_i\) follows a Bernoulli distribution with parameter \(\psi '_i\), a simple calculation shows that \(\mathrm{E}[(y_i-\psi '_i)^4]\) is uniformly bounded on \(\mathcal{B}\) for all \(i=1,\ldots ,n\). Thus, using the Cauchy–Schwarz inequality and Chebyshev’s inequality, we have
$$\begin{aligned} \max _{1\le i\le n}\mathrm{E}[(y_i-\psi '_i)^2I(|\omega _{ni}|>\sigma _n\varepsilon )]&\le \max _{1\le i\le n}\mathrm{E}[(y_i-\psi '_i)^4]^{1/2}\mathrm{P}(|\omega _{ni}|>\sigma _n\varepsilon )^{1/2} \\&\lesssim \frac{1}{\sigma _n}\max _{1\le i\le n}\mathrm{E}[\omega _{ni}^2]^{1/2} \\&=\frac{1}{\sigma _n\sqrt{n}}\max _{1\le i\le n}|{\varvec{\eta }}^\top \varSigma _n^{-1}{\varvec{x}}_i|\sqrt{\psi '_i(1-\psi '_i)} \\&\lesssim \frac{1}{\sqrt{n}}\max _{1\le i\le n}\Vert {\varvec{x}}_i\Vert =\mathrm{O}\left( \frac{\sqrt{K}}{\sqrt{n}}\right) . \end{aligned}$$
Finally, noting that (C1) yields
$$\begin{aligned} \sigma _n^2 ={\varvec{\eta }}^\top \varSigma _n^{-1}{\varvec{\eta }} \ge \lambda _{\mathrm{min}}(\varSigma _n^{-1})\Vert {\varvec{\eta }}\Vert ^2 =\frac{\Vert {\varvec{\eta }}\Vert ^2}{\lambda _{\mathrm{max}}(\varSigma _n)} \gtrsim \Vert {\varvec{\eta }}\Vert ^2, \end{aligned}$$
so that \(\sigma _n^{-2}=\mathrm{O}(1)\),
we have
$$\begin{aligned} \frac{1}{\sigma _n^2}\sum _{i=1}^{n}\mathrm{E}[\omega _{ni}^2I(|\omega _{ni}|>\sigma _n\varepsilon )] =\mathrm{O}\left( \frac{\sqrt{K}}{\sqrt{n}}\cdot K \right) . \end{aligned}$$
From (C3), this implies the Lindeberg condition (19).
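As an informal numerical illustration of Theorem 1 (not part of the paper's experiments), one can simulate a correctly specified logistic model with a small fixed K, fit the MLE by Newton–Raphson, and check that \(\sqrt{n}\,{\varvec{\eta }}^\top (\hat{{\varvec{\beta }}}-{\varvec{\beta }}^*)/\sigma _n\) is approximately standard normal. The following Python sketch uses illustrative settings only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def logistic_mle(X, y, n_iter=30):
    """Newton-Raphson iterations for the (unpenalized) logistic MLE."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))        # psi'(x_i^T beta)
        W = p * (1.0 - p)                          # psi''(x_i^T beta)
        beta += np.linalg.solve(X.T @ (X * W[:, None]), X.T @ (y - p))
    return beta

n, K = 2000, 5
beta_star = np.full(K, 0.3)
eta = np.ones(K)

z = []
for _ in range(500):
    X = rng.normal(size=(n, K))
    p_star = 1.0 / (1.0 + np.exp(-X @ beta_star))
    y = rng.binomial(1, p_star)
    W_star = p_star * (1.0 - p_star)
    Sigma_n = (X.T @ (X * W_star[:, None])) / n          # Sigma_n evaluated at beta*
    sigma2 = eta @ np.linalg.solve(Sigma_n, eta)         # sigma_n^2 = eta^T Sigma_n^{-1} eta
    z.append(np.sqrt(n) * eta @ (logistic_mle(X, y) - beta_star) / np.sqrt(sigma2))

print(stats.kstest(z, "norm"))   # the KS p value should not be small if N(0,1) is adequate
```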

7.3 Proof of Theorem 2

First, we prove that, for any K dimensional vector \({\varvec{\eta }}\), the selection event can be expressed as an inequality with respect to \({\varvec{\eta }}^\top {\varvec{T}}_n\). Let us define \({\varvec{w}}=(I_K-{\varvec{c}}{\varvec{\eta }}^\top ){\varvec{T}}_n\), where \({\varvec{c}}=\varSigma _n^{-1}{\varvec{\eta }}/\sigma _n^2\). Then, since \({\varvec{T}}_n=({\varvec{\eta }}^\top {\varvec{T}}_n){\varvec{c}}+{\varvec{w}}\), we have
$$\begin{aligned} \tilde{A}{\varvec{T}}_n\le \tilde{{\varvec{b}}}&\Leftrightarrow ({\varvec{\eta }}^\top {\varvec{T}}_n)\tilde{A}{\varvec{c}}\le \tilde{{\varvec{b}}}-\tilde{A}{\varvec{w}} \\&\Leftrightarrow ({\varvec{\eta }}^\top {\varvec{T}}_n)(\tilde{A}{\varvec{c}})_j\le (\tilde{{\varvec{b}}}-\tilde{A}{\varvec{w}})_j,~\forall j \\&\Leftrightarrow {\left\{ \begin{array}{ll} {\varvec{\eta }}^\top {\varvec{T}}_n\le (\tilde{{\varvec{b}}}-\tilde{A}{\varvec{w}})_j/(\tilde{A}{\varvec{c}})_j,&{} j:(\tilde{A}{\varvec{c}})_j>0 \\ {\varvec{\eta }}^\top {\varvec{T}}_n\ge (\tilde{{\varvec{b}}}-\tilde{A}{\varvec{w}})_j/(\tilde{A}{\varvec{c}})_j,&{} j:(\tilde{A}{\varvec{c}})_j<0 \\ 0=(\tilde{{\varvec{b}}}-\tilde{A}{\varvec{w}})_j,&{} j:(\tilde{A}{\varvec{c}})_j=0 \\ \end{array}\right. } \end{aligned}$$
and this implies the former result in Theorem 2.
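For concreteness, the case analysis above translates directly into a computation of the truncation points \((L_n, U_n)\) and of the resulting truncated normal p value. The following is a minimal sketch, not the authors' implementation: it assumes that the selection-event matrix \(\tilde{A}\), the vector \(\tilde{{\varvec{b}}}\), the statistic \({\varvec{T}}_n\), the contrast \({\varvec{\eta }}\), and \(\varSigma _n\) are already available as NumPy arrays, and it ignores the degenerate case \((\tilde{A}{\varvec{c}})_j=0\).

```python
import numpy as np
from scipy.stats import norm

def truncation_interval(A, b, T, eta, Sigma):
    """Truncation points (L_n, U_n) for eta^T T_n implied by the event A T_n <= b,
    following the case analysis in the proof of Theorem 2."""
    Sigma_inv_eta = np.linalg.solve(Sigma, eta)
    sigma2 = eta @ Sigma_inv_eta              # sigma_n^2 = eta^T Sigma_n^{-1} eta
    c = Sigma_inv_eta / sigma2
    v = eta @ T                               # observed value of eta^T T_n
    w = T - v * c                             # component held fixed by the conditioning
    Ac = A @ c
    resid = b - A @ w
    with np.errstate(divide="ignore", invalid="ignore"):
        ratios = resid / Ac
    U = np.min(ratios[Ac > 0], initial=np.inf)    # U_n
    L = np.max(ratios[Ac < 0], initial=-np.inf)   # L_n
    return L, U, v, sigma2

def selective_pvalue(v, L, U, sigma2):
    """Two-sided p value from the truncated N(0, sigma2) pivot on [L, U]."""
    s = np.sqrt(sigma2)
    F = (norm.cdf(v / s) - norm.cdf(L / s)) / (norm.cdf(U / s) - norm.cdf(L / s))
    return 2 * min(F, 1 - F)
```

In practice, the difference of normal CDFs should be evaluated in a numerically stable way when the truncation interval lies far in the tails.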
To prove the theorem, we need to verify the asymptotic independence between \((L_n, U_n, N_n)\) and \({\varvec{\eta }}^\top {\varvec{T}}_n\). To this end, it is enough to show the independence between (a) \({\varvec{\eta }}_1^\top {\varvec{w}}\) and \({\varvec{\eta }}^\top {\varvec{T}}_n\) for any \({\varvec{\eta }}_1\) with \(\Vert {\varvec{\eta }}_1\Vert <\infty\), and (b) \({\varvec{\eta }}_2^\top \tilde{{\varvec{b}}}\) and \({\varvec{\eta }}^\top {\varvec{T}}_n\) for any \({\varvec{\eta }}_2\) with \(\Vert {\varvec{\eta }}_2\Vert <\infty\). By the definition of \({\varvec{w}}\) and Theorem 1, we see that
$$\begin{aligned} {\varvec{\eta }}_1^\top {\varvec{w}} ={\varvec{\eta }}_1^\top (I_K-{\varvec{c}}{\varvec{\eta }}^\top ) {\varvec{T}}_n \end{aligned}$$
is asymptotically distributed according to a Gaussian distribution. Thus, \({\varvec{\eta }}_1^\top {\varvec{w}}\) and \({\varvec{\eta }}^\top {\varvec{T}}_n\) are asymptotically independent since
$$\begin{aligned} \mathrm{Cov}[{\varvec{\eta }}_1^\top {\varvec{w}},{\varvec{\eta }}^\top {\varvec{T}}_n] ={\varvec{\eta }}_1^\top (I_K-{\varvec{c}}{\varvec{\eta }}^\top )\mathrm{E}[{\varvec{T}}_n{\varvec{T}}_n^\top ]{\varvec{\eta }} ={\varvec{\eta }}_1^\top (I_K-{\varvec{c}}{\varvec{\eta }}^\top )\varSigma _n^{-1}{\varvec{\eta }} ={\varvec{0}}. \end{aligned}$$
As for the latter, letting \({\varvec{\psi }}'={\varvec{\psi }}'({\varvec{\beta }}^*)\), the definitions of \({\varvec{T}}_n\) and \(\varSigma _n\) imply
$$\begin{aligned} X_S^\top \left\{ ({\varvec{y}}-{\varvec{\psi }}')-\frac{1}{\sqrt{n}}\varPsi X_S{\varvec{T}}_n\right\} ={\varvec{0}}, \end{aligned}$$
and thus
$$\begin{aligned} {\varvec{y}}={\varvec{\psi }}'+\frac{1}{\sqrt{n}}\varPsi X_S{\varvec{T}}_n, \end{aligned}$$
where \(\varPsi \in \mathbb {R}^{n\times n}\) is a diagonal matrix whose i-th diagonal element is \(\psi ''({\varvec{x}}_{S, i}^\top {\varvec{\beta }}_S^*)=\psi '({\varvec{x}}_{S, i}^\top {\varvec{\beta }}_S^*)(1-\psi '({\varvec{x}}_{S, i}^\top {\varvec{\beta }}_S^*))\). Then, we observe that
$$\begin{aligned} \tilde{{\varvec{b}}}&=-\frac{1}{\sqrt{n}}A_SX_S^\top {\varvec{\psi }}'-\frac{1}{\sqrt{n}}A_{S^\bot }X_{S^\bot }^\top {\varvec{y}} \\&=-\frac{1}{\sqrt{n}}A_SX_S^\top {\varvec{\psi }}'-\frac{1}{\sqrt{n}}A_{S^\bot }X_{S^\bot }^\top \left( {\varvec{\psi }}'+\frac{1}{\sqrt{n}}\varPsi X_S{\varvec{T}}_n\right) \\&=-\frac{1}{\sqrt{n}}AX^\top {\varvec{\psi }}'-\frac{1}{n}A_{S^\bot }X_{S^\bot }^\top \varPsi X_S{\varvec{T}}_n. \end{aligned}$$
Since \(\tilde{{\varvec{b}}}\) can be expressed as a linear combination of \({\varvec{T}}_n\) as well as \({\varvec{w}}\), the theorem holds when the covariance between \({\varvec{\eta }}_2^\top \tilde{{\varvec{b}}}\) and \({\varvec{\eta }}^\top {\varvec{T}}_n\) converges to 0 as n goes to infinity. By noting that \(\varSigma _n=X_S^\top \varPsi X_S/n\), we have
$$\begin{aligned} \mathrm{Cov}[{\varvec{\eta }}_2^\top \tilde{{\varvec{b}}},{\varvec{\eta }}^\top {\varvec{T}}_n]&=-\frac{1}{n}{\varvec{\eta }}_2^\top A_{S^\bot }X_{S^\bot }^\top \varPsi X_S\mathrm{E}[{\varvec{T}}_n{\varvec{T}}_n^\top ]{\varvec{\eta }} \\&=-{\varvec{\eta }}_2^\top A_{S^\bot }(X_{S^\bot }^\top \varPsi X_S)(X_S^\top \varPsi X_S)^{-1}{\varvec{\eta }}. \end{aligned}$$
In addition, letting \({\varvec{a}}=(1,-1)^\top\), it is straightforward that
$$\begin{aligned} A_{S^\bot } ={\varvec{1}}_{K}\otimes \left( \begin{array}{ccc} 0&{}\cdots &{}0 \\ {\varvec{a}}&{}&{}O \\ &{}\ddots &{} \\ O&{}&{}{\varvec{a}} \end{array} \right) ={\varvec{1}}_{K}\otimes \tilde{J} \end{aligned}$$
by the definition of the selection event, where \(\tilde{J}=({\varvec{0}}_{d-K}, I_{d-K}\otimes {\varvec{a}}^\top )^\top\). This implies \(A_{S^\bot }^\top A_{S^\bot }=2KI_{d-K}\). Finally, (C1), (C3), and (C4) imply
$$\begin{aligned} \Vert \mathrm{Cov}[{\varvec{\eta }}_2^\top \tilde{{\varvec{b}}},{\varvec{\eta }}^\top {\varvec{T}}_n]\Vert ^2&=2K\{{\varvec{\eta }}_2^\top (X_{S^\bot }^\top \varPsi X_S)(X_S^\top \varPsi X_S)^{-1}{\varvec{\eta }}\}^2 \\&\lesssim K\left\| \frac{1}{n}X_{S^\bot }^\top X_S\right\| ^2 =\mathrm{O}(K^3/n), \end{aligned}$$
and this proves (b).
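As a quick sanity check of the identity \(A_{S^\bot }^\top A_{S^\bot }=2KI_{d-K}\) used above, the structure of \(A_{S^\bot }\) can be built explicitly for small, illustrative values of d and K:

```python
import numpy as np

d, K = 8, 3                                    # illustrative sizes with d > K
a = np.array([1.0, -1.0])
J_tilde = np.hstack([np.zeros((d - K, 1)),
                     np.kron(np.eye(d - K), a[None, :])]).T   # (0_{d-K}, I (x) a^T)^T
A_perp = np.kron(np.ones((K, 1)), J_tilde)     # A_{S^perp} = 1_K (x) J_tilde
print(np.allclose(A_perp.T @ A_perp, 2 * K * np.eye(d - K)))  # prints True
```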

8 Concluding remarks and future research

Recently, methods for data driven science, such as selective inference and adaptive data analysis, have become increasingly important, as described by Barber and Candès (2016). Although there are several approaches for carrying out post-selection inference, we have developed a selective inference method for high dimensional classification problems based on the work of Lee et al. (2016). As in that seminal work, the polyhedral lemma (Lemma 1) plays an important role in our study. By considering high dimensional asymptotics in which both the sample size and the number of selected variables diverge, we have shown that a result similar to the polyhedral lemma holds even for high dimensional logistic regression. As a result, we can construct a pivotal quantity, based on a truncated normal distribution, that asymptotically follows the standard uniform distribution. In addition, the simulation experiments show that the performance of the proposed method is, in almost all cases, superior to that of other methods such as data splitting.

As suggested by the simulation results, our conditions might be relaxed to accommodate more general settings. In terms of future research, while we considered the logistic model in this paper, it is important to extend the results to other models, for example, generalized linear models. Higher order interaction models are also important in practice. In that situation, the matrix defining the selection event becomes very large, and computing the truncation points in the polyhedral lemma becomes cumbersome. Suzumura et al. (2017) have shown that selective inference can be constructed for such models by utilizing a pruning algorithm, and it would be desirable to extend their result from linear regression to other models.

Acknowledgements

Funding was provided by Japan Society for the Promotion of Science (Grant nos. 18K18010, 16H06538, 17H00758, JPMJCR1502), RIKEN Center for Advanced Intelligence Project and JST support program for starting up innovation-hub on materials research by information integration initiative.

References

  1. Barber, R. F., & Candès, E. J. (2016). A knockoff filter for high-dimensional selective inference. arXiv preprint arXiv:1602.03574.
  2. Berk, R., Brown, L., Buja, A., Zhang, K., & Zhao, L. (2013). Valid post-selection inference. The Annals of Statistics, 41, 802–837.
  3. Bickel, P. J., Ritov, Y., & Tsybakov, A. B. (2009). Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37, 1705–1732.
  4. Breiman, L. (1992). The little bootstrap and other methods for dimensionality selection in regression: X-fixed prediction error. Journal of the American Statistical Association, 87, 738–754.
  5. Cox, D. (1975). A note on data-splitting for the evaluation of significance levels. Biometrika, 62, 441–444.
  6. Dasgupta, S., Khare, K., & Ghosh, M. (2014). Asymptotic expansion of the posterior density in high dimensional generalized linear models. Journal of Multivariate Analysis, 131, 126–148.
  7. Dickhaus, T. (2014). Simultaneous statistical inference. With applications in the life sciences. Heidelberg: Springer.
  8. Efron, B. (2014). Estimation and accuracy after model selection. Journal of the American Statistical Association, 109, 991–1007.
  9. Fan, J., & Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B, 70, 849–911.
  10. Fan, J., & Peng, H. (2004). Nonconcave penalized likelihood with a diverging number of parameters. The Annals of Statistics, 32, 928–961.
  11. Fan, J., & Song, R. (2010). Sure independence screening in generalized linear models with NP-dimensionality. The Annals of Statistics, 38, 3567–3604.
  12. Fithian, W., Sun, D., & Taylor, J. (2014). Optimal inference after model selection. arXiv preprint arXiv:1410.2597.
  13. Huang, J., Horowitz, J. L., & Ma, S. (2008). Asymptotic properties of bridge estimators in sparse high-dimensional regression models. The Annals of Statistics, 36, 587–613.
  14. Huber, P. J. (1973). Robust regression: Asymptotics, conjectures and Monte Carlo. The Annals of Statistics, 1, 799–821.
  15. Lee, J. D., Sun, D. L., Sun, Y., & Taylor, J. E. (2016). Exact post-selection inference, with application to the lasso. The Annals of Statistics, 44, 907–927.
  16. Lee, J. D., & Taylor, J. E. (2014). Exact post model selection inference for marginal screening. In Advances in neural information processing systems, pp. 136–144.
  17. Lockhart, R., Taylor, J., Tibshirani, R. J., & Tibshirani, R. (2014). A significance test for the lasso. The Annals of Statistics, 42, 413.
  18. Meinshausen, N., Meier, L., & Bühlmann, P. (2009). \(p\)-values for high-dimensional regression. Journal of the American Statistical Association, 104, 1671–1681.
  19. Suzumura, S., Nakagawa, K., Umezu, Y., Tsuda, K., & Takeuchi, I. (2017). Selective inference for sparse high-order interaction models. In Proceedings of the 34th International Conference on Machine Learning, pp. 3338–3347.
  20. Taylor, J. E., Loftus, J. R., & Tibshirani, R. J. (2016). Inference in adaptive regression via the Kac-Rice formula. The Annals of Statistics, 44, 743–770.
  21. Taylor, J., & Tibshirani, R. (2018). Post-selection inference for \(\ell _1\)-penalized likelihood models. The Canadian Journal of Statistics, 46, 41–61.
  22. Tian, X., Loftus, J. R., & Taylor, J. E. (2018). Selective inference with unknown variance via the square-root lasso. Biometrika, 105, 755–768.
  23. Tian, X., & Taylor, J. (2017). Asymptotics of selective inference. Scandinavian Journal of Statistics, 44, 480–499.
  24. Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B, 58, 267–288.
  25. Wasserman, L., & Roeder, K. (2009). High dimensional variable selection. The Annals of Statistics, 37, 2178–2201.

Copyright information

© Japanese Federation of Statistical Science Associations 2019

Authors and Affiliations

  1. Nagoya Institute of Technology, Nagoya, Japan
  2. RIKEN Center for Advanced Intelligence Project, Tokyo, Japan
  3. Center for Materials Research by Information Integration, National Institute for Materials Science, Tsukuba, Japan
