Abstract
Using the conditional characteristic function as a screening index, a new model-free screening procedure is proposed to deal with variable screening problems in large-scale high-dimensional imbalanced data analysis. For a binary response, our results show that the screening index under the full data is proportional to the screening index under case–control sampling, an important sampling scheme for imbalanced data. This conclusion implies that the screening method can be applied directly to imbalanced data. Most appealingly, the screening index can be expressed as a simple linear combination of two first-order moments, so it is computationally simple. In addition, we extend the method to multiple responses. The theoretical properties are established under regularity conditions. To compare the performance of our method with its competitors, extensive simulations are conducted, which show that the proposed procedure performs well in both linear and nonlinear models. Finally, a real data analysis further illustrates the effectiveness of the new method.
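The proof in the Appendix works with the mean pairwise distance \(E|X_{ik}-X_{jk}|\) and its within-class analogue \(E(|X_k-X_k'|\,|\,Y=Y'=y)\), which suggests an energy-distance-style index contrasting the two. The sketch below illustrates such a moment-based marginal index; it is an assumption-laden illustration built from those two quantities, not the authors' exact statistic:

```python
import numpy as np

def screening_index(x, y):
    """Illustrative moment-based screening index for one covariate.

    Contrasts the overall mean pairwise distance E|X - X'| with its
    class-weighted within-class analogue; larger values suggest a
    stronger marginal association between the covariate and the response.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y)
    d = np.abs(x[:, None] - x[None, :])          # all pairwise |x_i - x_j|
    overall = d.mean()                           # estimates E|X - X'|
    within = 0.0
    for cls in np.unique(y):
        mask = y == cls
        p_hat = mask.mean()                      # estimated class probability
        within += p_hat * d[np.ix_(mask, mask)].mean()
    return overall - within

# toy example: x1 shifts with the class label, x2 is pure noise
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=400)
x1 = rng.normal(loc=2.0 * y)                     # informative covariate
x2 = rng.normal(size=400)                        # irrelevant covariate
```

In this toy example the informative covariate receives a markedly larger index than the noise covariate, which is how a screening rule would rank it ahead when retaining the top covariates.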
References
Battey H, Fan J, Liu H, Lu J, Zhu Z (2018) Distributed testing and estimation under sparse high-dimensional models. Ann Stat 46:1352–1382
Cai T, Wei H (2019) Transfer learning for nonparametric classification: minimax rate and adaptive classifier. https://arxiv.org/pdf/1906.02903.pdf
Chang J, Tang C, Wu Y (2013) Marginal empirical likelihood and sure independence feature screening. Ann Stat 41:2123–2148
Chen K (2001) Parametric models for response-biased sampling. J R Stat Soc Ser B 63:775–789
Chen X, Xie M (2014) A split-and-conquer approach for analysis of extraordinarily large data. Stat Sin 24:1655–1684
Chen K, Lin Y, Yao Y, Zhou C (2017) Regression analysis with response-biased sampling. Stat Sin 27:1699–1714
Cui H, Li R, Zhong W (2015) Model-free feature screening for ultrahigh dimensional discriminant analysis. J Am Stat Assoc 110:630–641
Fan J, Lv J (2008) Sure independence screening for ultrahigh dimensional feature space. J R Stat Soc Ser B 70:849–911
Fan J, Song R (2010) Sure independence screening in generalized linear models with np-dimensionality. Ann Stat 38:3567–3604
Fan J, Feng Y, Song R (2011) Nonparametric independence screening in sparse ultrahigh dimensional additive models. J Am Stat Assoc 106:544–557
Fithian W, Hastie T (2014) Local case-control sampling: efficient subsampling in imbalanced data sets. Ann Stat 42:1693–1724
He X, Wang L, Hong H (2013) Quantile adaptive model-free variable screening for high-dimensional heterogeneous data. Ann Stat 41:342–369
Kang J, Hong H, Li Y (2017) Partition-based ultrahigh dimensional variable screening. Biometrika 104:785–800
Li G, Peng H, Zhang J, Zhu L (2012) Robust rank correlation based screening. Ann Stat 40:1846–1877
Li R, Zhong W, Zhu L (2012) Feature screening via distance correlation learning. J Am Stat Assoc 107:1129–1139
Li X, Li R, Xia Z, Xu C (2020) Distributed feature screening via componentwise debiasing. J Mach Learn Res 21:1–32
Lin N, Xi R (2011) Aggregated estimating equation estimation. Stat Interface 4:73–83
Lu J, Lin L (2018) Feature screening for multi-response varying coefficient models with ultrahigh dimensional predictors. Comput Stat Data Anal 128:242–254
Lu J, Lin L (2018) Model-free sure independence screening in the context of ultrahigh dimensional covariate together with labeled response. Manuscript
Luo S, Chen Z (2020) Feature selection by canonical correlation search in high-dimensional multi-response models with complex group structures. J Am Stat Assoc 115:1227–1235
Ma P, Mahoney M, Yu B (2015) A statistical perspective on algorithmic leveraging. J Mach Learn Res 16:861–911
Mai Q, Zou H (2012) The Kolmogorov filter for variable screening in high-dimensional binary classification. Biometrika 100:229–234
Mai Q, Zou H (2015) The fused Kolmogorov filter: a nonparametric model-free screening method. Ann Stat 43:1471–1497
Manski C (1993) The selection problem in econometrics and statistics. Handb Stat 11:73–84
Pan R, Wang H, Li R (2016) Ultrahigh dimensional multi-class linear discriminant analysis by pairwise sure independence screening. J Am Stat Assoc 111:169–179
Schifano E, Wu J, Wang C, Yan J, Chen M (2016) Online updating of statistical inference in the big data setting. Technometrics 58:393–403
Serfling R (2009) Approximation theorems of mathematical statistics. Wiley, New York
Song R, Lu W, Ma S, Jeng X (2014) Censored rank independence screening for high-dimensional survival data. Biometrika 101:799–814
Székely G, Rizzo M, Bakirov N (2007) Measuring and testing dependence by correlation of distances. Ann Stat 35:2769–2794
Vapnik V (1998) Statistical learning theory. Wiley, New York
Wang X, Leng C (2016) High dimensional ordinary least squares projection for screening variables. J R Stat Soc Ser B 78:589–611
Wang H, Zhu R, Ma P (2018) Optimal subsampling for large sample logistic regression. J Am Stat Assoc 113:829–844
Xie J, Lin Y, Yan X, Tang N (2019) Category-adaptive variable screening for ultrahigh dimensional heterogeneous categorical data. J Am Stat Assoc 115:747–760
Xie J, Hao M, Liu W, Lin Y (2020) Fused variable screening for massive imbalanced data. Comput Stat Data Anal 141:94–108
Zhou T, Zhu L (2017) Model-free feature screening for ultrahigh dimensional censored regression. Stat Comput 27:947–961
The research was supported by NNSF project of China (11971265) and National Key R &D Program of China (2018YFA0703900).
Appendix: Proof of main results
We first prove Lemma 2, which is the basis of our approach.
First, we prove formula (a) of Lemma 2.
Similarly,
Second, we prove formula (b) of Lemma 2.
\(\square \)
Proof of Theorem 1
where \(a=\frac{P_0^*P_1^*}{P_0P_1}\), and \(P_0^*\) and \(P_1^*\) are the probabilities that the response variable \(Y^*\) takes the values \(Y^*=0\) and \(Y^*=1\), respectively, under the case–control sampling setting. The second equality above holds due to the assumption of the case–control sampling design. When the population and the sampling are fixed, a is a constant. \(\square \)
Before proving Theorem 2, we introduce two important inequalities that will be used repeatedly in its proof.
Inequality 1. Let \(\mu =E(Y)\). If \(P(a \le Y\le b)=1\), then
Inequality 2. Suppose that \(h(Y_1,\cdots ,Y_m)\) is the kernel of the U-statistic \(U_n\), and let \(\theta =E\{h(Y_1,\cdots ,Y_m)\}\). If \(a\le h(Y_1,\cdots ,Y_m) \le b\), then, for any \(t>0\) and \(n\ge m\),
where \([n/m]\) denotes the integer part of \(n/m\). Due to the symmetry of U-statistics, we further have that
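For the reader's convenience, the standard statements of these two Hoeffding inequalities (Serfling 2009, Sect. 5.6), which the display above refers to, read as follows, with \(\bar{Y}_n\) denoting the sample mean of \(n\) i.i.d. copies of \(Y\):

```latex
% Inequality 1: Hoeffding's inequality for a mean of bounded i.i.d. variables
P\left( \left| \bar{Y}_n - \mu \right| \ge \varepsilon \right)
  \le 2 \exp\left\{ - \frac{2 n \varepsilon^2}{(b-a)^2} \right\}.

% Inequality 2: Hoeffding's inequality for a U-statistic with bounded kernel,
% one-sided form and, by symmetry, the two-sided form
P\left( U_n - \theta \ge t \right)
  \le \exp\left\{ - \frac{2 [n/m]\, t^2}{(b-a)^2} \right\},
\qquad
P\left( \left| U_n - \theta \right| \ge t \right)
  \le 2 \exp\left\{ - \frac{2 [n/m]\, t^2}{(b-a)^2} \right\}.
```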
Proof of Theorem 2
The events satisfy \(\{\left| {\tilde{\tau }}_{k}^*-\tau _{k}^*\right| >\varepsilon \} \subseteq \{\left| {\hat{\tau }}_{k}^*-\tau _{k}^*\right| >\varepsilon \}\). Hence, to prove the sure screening property of \({\tilde{\tau }}_{k}^*\), it suffices to prove that \({\hat{\tau }}_{k}^*\) has the same property. The ranking statistic \({\widehat{\tau }}_k^*\) can be written as
where
Correspondingly, the population form \(\tau _k^*\) can also be written as \(\tau _{k}^*=I_{k, 1}+I_{k, 2}\), where
We aim to establish the uniform consistency of \({\widehat{\tau }}_{k}^*\) under the regularity conditions.
Firstly, we deal with the term \({\hat{I}}_{k, 1}\). Define
where \(h\left( X_{i k},X_{j k}\right) =\frac{1}{2}\left| X_{i k}-X_{j k}\right| +\frac{1}{2}\left| X_{j k}-X_{i k}\right| \). Then \({\hat{I}}_{k 1}^{\star }(n-1) / n={\hat{I}}_{k 1}\). Since \({\hat{I}}_{k 1}^{\star }\) is a standard U-statistic with kernel h, we shall establish its uniform consistency by invoking the theory of U-statistics (Serfling 2009, chap. 5). Condition (C1) states that \(I_{k 1}\) is uniformly bounded in p, that is, \(\sup _{p} \max _{1 \le k \le p} I_{k 1}<\infty .\) For any given \(\varepsilon >0\), take n large enough that \(I_{k 1} / n<\varepsilon \); then it can be shown that
The U-statistic \({\hat{I}}_{k 1}^{\star }\) can be written as
where M will be specified later.
Accordingly, we decompose \(I_{k 1}\) into two parts:
Obviously, \({\hat{I}}_{k 11}^{\star }\) and \({\hat{I}}_{k 12}^{\star }\) are unbiased estimators of \(I_{k 11}\) and \(I_{k 12}\), respectively. We first deal with the consistency of \({\hat{I}}_{k 11}^{\star }\). By Markov's inequality, for any \(t>0\), we obtain that
Serfling (2009) showed that any U-statistic can be represented as an average of averages of i.i.d. random variables, i.e., \({\hat{I}}_{k 11}^{\star }=(n !)^{-1} \sum _{n !} \Omega _{1}\left( X_{1 k},\cdots ,X_{n k}\right) \), where \(\sum _{n !}\) denotes the summation over all possible permutations of \((1, \cdots , n)\), and each \(\Omega _{1}\left( X_{1 k},\cdots ,X_{n k} \right) \) is an average of \(m=[n / 2]\) i.i.d. random variables, i.e., \(\Omega _{1}=m^{-1} \sum _{l} h^{(l)} {\mathbf {1}}\left\{ h^{(l)} \le M\right\} \). Since the exponential function is convex, it follows from Jensen's inequality that, for \(0<t \le 2 s_{0}\),
which together with Inequality 2, entails immediately that
By choosing \(t=4 \varepsilon m / M^{2}\), we have \(\mathrm {P}\left( {\hat{I}}_{k 11}^{\star }-I_{k 11} \ge \varepsilon \right) \le \exp \left( -2 \varepsilon ^{2} m / M^{2}\right) \). Therefore, by the symmetry of U-statistics, we easily obtain that
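For reference, the choice \(t=4 \varepsilon m / M^{2}\) is the minimizer in the standard exponential-bound chain; since the truncated kernel lies in \([0, M]\), a sketch of the argument combines Markov's inequality, the Jensen step above, and Hoeffding's lemma:

```latex
% sketch: Markov + Jensen (over permutations) + Hoeffding's lemma
P\left( \hat{I}_{k 11}^{\star} - I_{k 11} \ge \varepsilon \right)
  \le e^{-t\varepsilon}\, E\, e^{t(\hat{I}_{k 11}^{\star} - I_{k 11})}
  \le e^{-t\varepsilon}\, E\, e^{t(\Omega_{1} - I_{k 11})}
  \le \exp\left( -t\varepsilon + \frac{t^{2} M^{2}}{8 m} \right),
```

and minimizing the exponent over \(t>0\) gives \(t=4 \varepsilon m / M^{2}\) and the stated bound \(\exp (-2 \varepsilon ^{2} m / M^{2})\).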
Next, we show the consistency of \({\hat{I}}_{k 12}^{\star }\). By the Cauchy–Schwarz and Markov inequalities, for any \(s^{\prime }>0\),
We have
which, together with the Cauchy–Schwarz inequality, yields that
If we choose \(M=c n^{\gamma }\) for \(0<\gamma <1 / 2-\kappa \), then by condition \((\mathrm {C} 1), I_{k 12} \le \varepsilon / 2\) when n is sufficiently large. Consequently,
It remains to bound the probability \(\mathrm {P}\left( \left| {\hat{I}}_{k 12}^{\star }\right| >\varepsilon / 2\right) \). Observe that
By invoking condition (C1) and Markov’s inequality for \(s>0\), there must be some constant C such that
Consequently,
Combining the results (a.1), (a.2), (a.3) and (a.5) with \(M=c n^{\gamma }\), we easily obtain that
Thus, we have
It remains to prove the uniform consistency of \({\hat{I}}_{k, 2}\). We first deal with the term \(J_{y, k 2}=E\left| X_{k}-X_{k}^{\prime }\right| {\mathbf {1}}(Y=y) {\mathbf {1}}\left( Y^{\prime }=y\right) / p_{y}\). Let \({\hat{J}}_{y, k 2}=\hat{{\bar{J}}}_{y, k 2} / {\hat{p}}_{y}\), where
Accordingly, \(J_{y, k 2}\) can be decomposed into \(J_{y, k 2}={\bar{J}}_{y, k 2} / p_{y}\) where \({\bar{J}}_{y, k 2}=\) \(E\left| X_{k}-X_{k}^{\prime }\right| {\mathbf {1}}(Y=y) {\mathbf {1}}\left( Y^{\prime }=y\right) \). Following the argument for proving (a.7), we get that
As a result, it has that
We first deal with the first term in (a.9). Under condition (C2), we have
The last inequality holds because of (a.8) and Inequality 1. We next deal with the second term in (a.9),
This, together with (a.10), yields that
Combining (a.7), (a.11) and Bonferroni's inequality, we easily obtain that
for some positive constants \(C_{1}\) and \(C_{2}.\) The convergence rate of \({\widehat{\tau }}_{k}^*\) is now achieved by
Let \(\varepsilon =c n^{-\varsigma }\), where \(\varsigma \) satisfies that \(0 \le \varsigma <1 / 2\). We thus have
Now, we deal with the second part of Theorem 2. If \({\mathcal {A}} \not \subseteq \hat{{\mathcal {A}}}\), then there must be some \(k \in {\mathcal {A}}\) such that \({\widehat{\tau }}_{k}^*<c n^{-\varsigma }\). It follows from Condition (C3) that \(\left| {\widehat{\tau }}_{k}^*-\tau _{k}\right| >c n^{-\varsigma }\) for some \(k \in {\mathcal {A}}\), indicating that \(\{{\mathcal {A}} \not \subseteq \hat{{\mathcal {A}}}\} \subseteq \left\{ \left| {\widehat{\tau }}_{k}^*-\tau _{k}\right| >c n^{-\varsigma } \text { for some } k \in {\mathcal {A}}\right\} \). Hence \(D_{n}=\left\{ \max _{k \in {\mathcal {A}}}\left| {\widehat{\tau }}_{k}^*-\tau _{k}\right| \le c n^{-\varsigma }\right\} \subseteq \{{\mathcal {A}} \subseteq \hat{{\mathcal {A}}}\}\). Consequently,
where \(s_{n}\) is the cardinality of \({\mathcal {A}}\). This proves the second result in Theorem 2. \(\square \)
Proof of Theorem 3
Define
Then, we have
The last term goes to 0 under condition (C2) as \(n \rightarrow \infty \). \(\square \)
Wang, P., Lin, L. Conditional characteristic feature screening for massive imbalanced data. Stat Papers 64, 807–834 (2023). https://doi.org/10.1007/s00362-022-01342-8