
Conditional characteristic feature screening for massive imbalanced data


Abstract

Using the conditional characteristic function as a screening index, a new model-free screening procedure is proposed for variable screening in large-scale, high-dimensional imbalanced data analysis. For a binary response, our results show that the screening index under the full data is proportional to the screening index under case–control sampling, an important sampling property for imbalanced data. This conclusion implies that the screening method can be applied directly to imbalanced data. Notably, the most appealing feature of the screening index is that it can be expressed as a simple linear combination of two first-order moments, so it is computationally cheap. In addition, we extend the method to multiple responses. The theoretical properties are established under regularity conditions. To compare the performance of our method with that of its competitors, extensive simulations are conducted; they show that the proposed procedure performs well in both linear and nonlinear models. Finally, a real data analysis further illustrates the effectiveness of the new method.
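Concretely, in the multi-response form studied in the appendix (with \((X_k^{\prime },Y^{\prime })\) an independent copy of \((X_k,Y)\) and \(Y\) taking values \(1,\cdots ,R\)), the index admits the moment representation

$$\begin{aligned} \tau _{k}^*=E\left| X_{k}-X_{k}^{\prime }\right| -\sum _{y=1}^{R} \frac{E\left| X_{k}-X_{k}^{\prime }\right| {\mathbf {1}}(Y=y) {\mathbf {1}}\left( Y^{\prime }=y\right) }{E\, {\mathbf {1}}(Y=y)}, \end{aligned}$$

so only first-order moments of pairwise differences enter the computation.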


References

  • Battey H, Fan J, Liu H, Lu J, Zhu Z (2018) Distributed testing and estimation under sparse high-dimensional models. Ann Stat 46:1352–1382
  • Cai T, Wei H (2019) Transfer learning for nonparametric classification: minimax rate and adaptive classifier. https://arxiv.org/pdf/1906.02903.pdf
  • Chang J, Tang C, Wu Y (2013) Marginal empirical likelihood and sure independence feature screening. Ann Stat 41:2123–2148
  • Chen K (2001) Parametric models for response-biased sampling. J R Stat Soc Ser B 63:775–789
  • Chen X, Xie M (2014) A split-and-conquer approach for analysis of extraordinarily large data. Stat Sin 24:1655–1684
  • Chen K, Lin Y, Yao Y, Zhou C (2017) Regression analysis with response-biased sampling. Stat Sin 27:1699–1714
  • Cui H, Li R, Zhong W (2015) Model-free feature screening for ultrahigh dimensional discriminant analysis. J Am Stat Assoc 110:630–641
  • Fan J, Lv J (2008) Sure independence screening for ultrahigh dimensional feature space. J R Stat Soc Ser B 70:849–911
  • Fan J, Song R (2010) Sure independence screening in generalized linear models with NP-dimensionality. Ann Stat 38:3567–3604
  • Fan J, Feng Y, Song R (2011) Nonparametric independence screening in sparse ultra-high-dimensional additive models. J Am Stat Assoc 106:544–557
  • Fithian W, Hastie T (2014) Local case-control sampling: efficient subsampling in imbalanced data sets. Ann Stat 42:1693–1724
  • He X, Wang L, Hong H (2013) Quantile-adaptive model-free variable screening for high-dimensional heterogeneous data. Ann Stat 41:342–369
  • Kang J, Hong H, Li Y (2017) Partition-based ultrahigh dimensional variable screening. Biometrika 104:785–800
  • Li G, Peng H, Zhang J, Zhu L (2012) Robust rank correlation based screening. Ann Stat 40:1846–1877
  • Li R, Zhong W, Zhu L (2012) Feature screening via distance correlation learning. J Am Stat Assoc 107:1129–1139
  • Li X, Li R, Xia Z, Xu C (2020) Distributed feature screening via componentwise debiasing. J Mach Learn Res 21:1–32
  • Lin N, Xi R (2011) Aggregated estimating equation estimation. Stat Interface 4:73–83
  • Lu J, Lin L (2018) Feature screening for multi-response varying coefficient models with ultrahigh dimensional predictors. Comput Stat Data Anal 128:242–254
  • Lu J, Lin L (2018) Model-free sure independence screening in the context of ultrahigh dimensional covariate together with labeled response. Manuscript
  • Luo S, Chen Z (2020) Feature selection by canonical correlation search in high-dimensional multi-response models with complex group structures. J Am Stat Assoc 115:1227–1235
  • Ma P, Mahoney M, Yu B (2015) A statistical perspective on algorithmic leveraging. J Mach Learn Res 16:861–911
  • Mai Q, Zou H (2012) The Kolmogorov filter for variable screening in high-dimensional binary classification. Biometrika 100:229–234
  • Mai Q, Zou H (2015) The fused Kolmogorov filter: a nonparametric model-free screening method. Ann Stat 43:1471–1497
  • Manski C (1993) The selection problem in econometrics and statistics. Handb Stat 11:73–84
  • Pan R, Wang H, Li R (2016) Ultrahigh dimensional multi-class linear discriminant analysis by pairwise sure independence screening. J Am Stat Assoc 111:169–179
  • Schifano E, Wu J, Wang C, Yan J, Chen M (2016) Online updating of statistical inference in the big data setting. Technometrics 58:393–403
  • Serfling R (2009) Approximation theorems of mathematical statistics. Wiley, New York
  • Song R, Lu W, Ma S, Jeng X (2014) Censored rank independence screening for high-dimensional survival data. Biometrika 101:799–814
  • Székely G, Rizzo M, Bakirov N (2007) Measuring and testing dependence by correlation of distances. Ann Stat 35:2769–2794
  • Vapnik V (1998) Statistical learning theory. Wiley, New York
  • Wang X, Leng C (2016) High dimensional ordinary least squares projection for screening variables. J R Stat Soc Ser B 78:589–611
  • Wang H, Zhu R, Ma P (2018) Optimal subsampling for large sample logistic regression. J Am Stat Assoc 113:829–844
  • Xie J, Lin Y, Yan X, Tang N (2019) Category-adaptive variable screening for ultrahigh dimensional heterogeneous categorical data. J Am Stat Assoc 115:747–760
  • Xie J, Hao M, Liu W, Lin Y (2020) Fused variable screening for massive imbalanced data. Comput Stat Data Anal 141:94–108
  • Zhou T, Zhu L (2017) Model-free feature screening for ultrahigh dimensional censored regression. Stat Comput 27:947–961


Author information

Corresponding author

Correspondence to Lu Lin.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The research was supported by the NNSF project of China (11971265) and the National Key R&D Program of China (2018YFA0703900).

Appendix: Proof of main results

We first prove Lemma 2, which is the basis of our approach. We begin with formula (a):

$$\begin{aligned}&\phi _{{X_k}|Y=1}(t)-\phi _{X_k}(t) \\&\quad =E(e^{itX_k}| Y=1)-E(E(e^{itX_k}|Y))\\&\quad =E(e^{itX_k}| Y=1)-(P_1E(e^{itX_k}|Y=1)+P_0E(e^{itX_k}| Y=0))\\&\quad =P_0(\phi _{{X_k}|Y=1}(t)-\phi _{{X_k}|Y=0}(t)). \end{aligned}$$

Similarly,

$$\begin{aligned} \phi _{{X_k}|Y=0}(t)-\phi _{X_k}(t)=P_1(\phi _{{X_k}|Y=0}(t)-\phi _{{X_k}|Y=1}(t)). \end{aligned}$$

Next, we prove formula (b) of Lemma 2:

$$\begin{aligned} \tau _k= & {} \sum _{y=0,1} P_y\int _{\mathbb {R}}\Vert \phi _{{X_k}|Y=y}(t)-\phi _{X_k}(t)\Vert ^2w(t) dt\\= & {} P_1P_0^2\int _{\mathbb {R}}\Vert \phi _{{X_k}|Y=1}(t)-\phi _{{X_k}|Y=0}(t)\Vert ^2w(t)dt\\&\quad +P_0P_1^2\int _{\mathbb {R}}\Vert \phi _{{X_k}|Y=0}(t)-\phi _{{X_k}|Y=1}(t)\Vert ^2w(t)dt\\= & {} P_0P_1\int _{\mathbb {R}}\Vert \phi _{{X_k}|Y=1}(t)-\phi _{{X_k}|Y=0}(t)\Vert ^2w(t)dt. \end{aligned}$$

\(\square \)

Proof of Theorem 1

$$\begin{aligned} \tau _k^*&=P_0^*P_1^*\int _{\mathbb {R}}\Vert \phi _{{X_k}^*|Y^*=1}(t)-\phi _{{X_k}^*|Y^*=0}(t)\Vert ^2w(t)dt \nonumber \\&=P_0^*P_1^*\int _{\mathbb {R}}\Vert \phi _{{X_k}|Y=1}(t)-\phi _{{X_k}|Y=0}(t)\Vert ^2w(t)dt \nonumber \\&=\frac{P_0^*P_1^*}{P_0P_1}P_0P_1\int _{\mathbb {R}}\Vert \phi _{{X_k}|Y=1}(t)-\phi _{{X_k}|Y=0}(t)\Vert ^2w(t)dt \nonumber \\&=a\tau _k, \nonumber \end{aligned}$$

where \(a=\frac{P_0^*P_1^*}{P_0P_1}\), and \(P_0^*\) and \(P_1^*\) are the probabilities that the response \(Y^*\) takes the values 0 and 1, respectively, under the case–control sampling setting. The second equality above holds because, under the case–control sampling design, the conditional distribution of \(X_k\) given the response is the same as in the full population, so \(\phi _{{X_k}^*|Y^*=y}(t)=\phi _{{X_k}|Y=y}(t)\). Once the population and the sampling scheme are fixed, \(a\) is a constant. \(\square \)
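To make the proportionality concrete, the following minimal simulation sketch (ours, not from the paper; the function name and the choice of the weight \(w\) as the standard normal density are our assumptions) estimates the binary-response index on the full sample and on a case–control subsample. The ratio is close to the constant \(a\) and does not depend on the feature, so rankings are preserved.

```python
import numpy as np

def tau_hat(x, y, t_grid):
    """Estimate tau = P0 * P1 * integral over t of |phi_{X|Y=1} - phi_{X|Y=0}|^2 * w,
    using empirical characteristic functions on a grid and a standard normal
    weight w (our choice of w; the paper's w may differ)."""
    p1 = y.mean()
    x1, x0 = x[y == 1], x[y == 0]
    phi1 = np.array([np.exp(1j * t * x1).mean() for t in t_grid])
    phi0 = np.array([np.exp(1j * t * x0).mean() for t in t_grid])
    w = np.exp(-t_grid**2 / 2) / np.sqrt(2 * np.pi)
    dt = t_grid[1] - t_grid[0]
    return (1 - p1) * p1 * np.sum(np.abs(phi1 - phi0) ** 2 * w) * dt

rng = np.random.default_rng(0)
n = 100_000
y = (rng.random(n) < 0.01).astype(int)   # ~1% cases: heavily imbalanced
x = rng.normal(loc=1.0 * y)              # informative feature: mean shift with Y

# case-control subsample: keep all cases, draw an equal number of controls
cases = np.flatnonzero(y == 1)
controls = rng.choice(np.flatnonzero(y == 0), size=cases.size, replace=False)
sub = np.concatenate([cases, controls])

t_grid = np.linspace(-6.0, 6.0, 241)
tau_full = tau_hat(x, y, t_grid)
tau_cc = tau_hat(x[sub], y[sub], t_grid)
# Theorem 1: tau_cc / tau_full ~ a = (0.5 * 0.5) / (P1 * P0), roughly 25 here;
# a is the same for every feature k, so rankings survive the subsampling.
print(tau_full, tau_cc, tau_cc / tau_full)
```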

Before proving Theorem 2, we introduce two important inequalities that will be used repeatedly in its proof.

Inequality 1 (Hoeffding's lemma). Let \(\mu =E(Y)\). If \(P(a \le Y\le b)=1\), then

$$\begin{aligned} E[\exp \{s(Y-\mu )\}] \le \exp \{s^2(b-a)^2/8\} \quad \text {for any} \quad s>0. \end{aligned}$$

Inequality 2 (Hoeffding's inequality for U statistics). Suppose that \(h(Y_1,\cdots ,Y_m)\) is the kernel of a U statistic \(U_n\), and let \(\theta =E\{h(Y_1,\cdots ,Y_m)\}\). If \(a\le h(Y_1,\cdots ,Y_m) \le b\), then, for any \(t>0\) and \(n\ge m\),

$$\begin{aligned} P(U_n-\theta \ge t) \le \exp \{-2[n/m] t^2/(b-a)^2\}, \end{aligned}$$

where [n/m] denotes the integer part of n/m. Applying the same bound to \(-h\), we further have

$$\begin{aligned} P(\left| U_n-\theta \right| \ge t) \le 2\exp \{-2[n/m] t^2/(b-a)^2\}. \end{aligned}$$
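Inequality 2 also covers the sample proportions used throughout the proof: taking \(m=1\) and the kernel \(h(Y_i)={\mathbf {1}}(Y_i=y)\), so that \(U_n={\hat{p}}_y\), \(\theta =p_y\) and \(b-a=1\), it gives

$$\begin{aligned} \mathrm {P}\left( \left| {\hat{p}}_{y}-p_{y}\right| \ge t\right) \le 2 \exp \left( -2 n t^{2}\right) , \end{aligned}$$

which is exactly the form of the bounds on \({\hat{p}}_y\) invoked in (a.10) below.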

Proof of Theorem 2

Note that the events satisfy \(\{\left| {\tilde{\tau }}_{k}^*-\tau _{k}^*\right| >\varepsilon \}\) \(\subseteq \) \( \{\left| {\hat{\tau }}_{k}^*-\tau _{k}^*\right| >\varepsilon \} \). Hence, to prove the sure screening property of \({\tilde{\tau }}_{k}^*\), it suffices to prove the same property for \({\hat{\tau }}_{k}^*\). The ranking statistic \({\widehat{\tau }}_k^*\) can be written as

$$\begin{aligned} {\widehat{\tau }}_k^*={\hat{I}}_{k, 1}-{\hat{I}}_{k, 2}, \end{aligned}$$

where

$$\begin{aligned} \begin{array}{l} {\hat{I}}_{k, 1}=n^{-2} \sum _{i, j=1}^{n}\left| X_{ik}-X_{jk}\right| , \\ {\hat{I}}_{k, 2}=\sum _{y=1}^{R} {\hat{J}}_{y, k 2} ~ \text{ for } {\hat{J}}_{y, k 2}=\frac{n^{-2} \sum _{i, j=1}^{n}\left| X_{ik}-X_{jk}\right| {\mathbf {1}}\left( Y_{i}=y\right) {\mathbf {1}}\left( Y_{j}=y\right) }{n^{-1} \sum _{l=1}^{n} {\mathbf {1}}\left( Y_{l}=y\right) }. \end{array} \end{aligned}$$

Correspondingly, the population version \(\tau _k^*\) can be written as \(\tau _{k}^*=I_{k, 1}-I_{k, 2}\), where

$$\begin{aligned} \begin{array}{l} I_{k, 1}=E\left| X_{k}-X_{k}^{\prime }\right| , \\ I_{k, 2}=\sum _{y=1}^{R} J_{y, k 2} ~ \text{ for } J_{y, k 2}=\frac{E\left| X_{k}-X_{k}^{\prime }\right| {\mathbf {1}}(Y=y) {\mathbf {1}}\left( Y^{\prime }=y\right) }{E {\mathbf {1}}(Y=y)}. \end{array} \end{aligned}$$
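For concreteness, a direct implementation of this decomposition might look as follows; this is our brute-force sketch (the function name and the \(O(n^2p)\) pairwise computation are ours), not code from the paper.

```python
import numpy as np

def tau_hat_star(X, y):
    """Ranking statistic tau_k* = I_hat_{k,1} - I_hat_{k,2} for every column
    of X, computed exactly as in the decomposition above (brute force)."""
    n, p = X.shape
    levels = np.unique(y)
    tau = np.empty(p)
    for k in range(p):
        d = np.abs(X[:, k, None] - X[None, :, k])      # all |X_ik - X_jk|
        i1 = d.sum() / n**2                            # I_hat_{k,1}
        i2 = 0.0                                       # I_hat_{k,2} = sum_y J_hat_{y,k2}
        for lv in levels:
            m = y == lv
            i2 += d[m][:, m].sum() / n**2 / m.mean()   # numerator / p_hat_y
        tau[k] = i1 - i2
    return tau

# toy check: only the first feature depends on the (imbalanced) response
rng = np.random.default_rng(1)
n = 500
y = (rng.random(n) < 0.1).astype(int)
X = np.column_stack([rng.normal(2.0 * y), rng.normal(size=n)])
print(tau_hat_star(X, y))   # the first entry should clearly dominate
```

Since \(\sum _{i<j}|x_i-x_j|\) can be computed in a single pass over sorted values, each feature can in fact be screened in \(O(n\log n)\) time, which is what makes the index attractive for massive data.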

We aim to show the uniform consistency of \({\widehat{\tau }}_{k}^*\) under the regularity conditions.

First, we deal with the term \({\hat{I}}_{k, 1}\). Define

$$\begin{aligned} {\hat{I}}_{k 1}^{\star }=\frac{2}{n(n-1)} \sum _{i<j} h\left( X_{i k}, X_{j k}\right) , \end{aligned}$$

where \(h\left( X_{i k},X_{j k}\right) =\frac{1}{2}\left| X_{i k}-X_{j k}\right| +\frac{1}{2}\left| X_{j k}-X_{i k}\right| \). Then \({\hat{I}}_{k 1}^{\star }(n-1) / n={\hat{I}}_{k 1}\), and \({\hat{I}}_{k 1}^{\star }\) is a standard U statistic with kernel h; thus we establish the uniform consistency of \({\hat{I}}_{k 1}^{\star }\) by invoking the theory of U statistics (Serfling 2009, chap. 5). Condition (C1) states that \(I_{k 1}\) is uniformly bounded in p, that is, \(\sup _{p} \max _{1 \le k \le p} I_{k 1}<\infty .\) For any given \(\varepsilon >0\), take n large enough that \(I_{k 1} / n<\varepsilon \); then it can be shown that

$$\begin{aligned} \begin{aligned}&\mathrm {P}\left( \left| {\hat{I}}_{k 1}-I_{k 1}\right| \ge 2 \varepsilon \right) \\&\quad \le ~ \mathrm {P}\left\{ \left| {\hat{I}}_{k 1}^{\star }-I_{k 1}\right| (n-1) / n \ge 2 \varepsilon -I_{k 1} / n\right\} \\&\quad \le ~ \mathrm {P}\left( \left| {\hat{I}}_{k 1}^{\star }-I_{k 1}\right| \ge \varepsilon \right) . \end{aligned} \end{aligned}$$
(a.1)

The U statistics \({\hat{I}}_{k 1}^{\star }\) can be written as

$$\begin{aligned} \begin{aligned} {\hat{I}}_{k 1}^{\star }=&~ \frac{2}{n(n-1)} \sum _{i<j} h\left( X_{i k}, X_{j k}\right) {\mathbf {1}}\left\{ h\left( X_{i k},X_{j k}\right) \le M\right\} \\&+\frac{2}{n(n-1)} \sum _{i<j} h\left( X_{i k},X_{j k}\right) {\mathbf {1}}\left\{ h\left( X_{i k},X_{j k}\right) >M\right\} \\ =&~ {\hat{I}}_{k 11}^{\star }+{\hat{I}}_{k 12}^{\star }, \end{aligned} \end{aligned}$$

where M will be specified later.

Accordingly, we decompose \(I_{k 1}\) into two parts:

$$\begin{aligned} \begin{aligned} I_{k 1}=&~ E\left[ h\left( X_{i k},X_{j k}\right) {\mathbf {1}}\left\{ h\left( X_{i k},X_{j k}\right) \le M\right\} \right] \\&+E\left[ h\left( X_{i k},X_{j k}\right) {\mathbf {1}}\left\{ h\left( X_{i k},X_{j k}\right) >M\right\} \right] \\ =&~ I_{k 11}+I_{k 12}. \end{aligned} \end{aligned}$$

Obviously, \({\hat{I}}_{k 11}^{\star }\) and \({\hat{I}}_{k 12}^{\star }\) are unbiased estimators of \(I_{k 11}\) and \(I_{k 12}\), respectively. We deal with the consistency of \({\hat{I}}_{k 11}^{\star }\) first. By Markov's inequality, for any \(t>0\), we obtain

$$\begin{aligned} \begin{aligned}&\mathrm {P}\left( {\hat{I}}_{k 11}^{\star }-I_{k 11} \ge \varepsilon \right) \\&\quad = ~ \mathrm {P}\left( \exp (t{\hat{I}}_{k 11}^{\star }) \ge \exp (tI_{k 11}) \exp (t\varepsilon )\right) \\&\quad \le ~ \exp (-t \varepsilon ) \exp \left( -t I_{k 11}\right) E\left[ \exp \left( t {\hat{I}}_{k 11}^{\star }\right) \right] . \end{aligned} \end{aligned}$$

Serfling (2009) showed that any U statistic can be represented as an average of i.i.d. random variables, i.e., \({\hat{I}}_{k 11}^{\star }=(n !)^{-1} \sum _{n !} \Omega _{1}\left( X_{1 k},\cdots ,X_{n k}\right) \), where \(\sum _{n !}\) denotes the summation over all possible permutations of \((1, \cdots , n)\), and each \(\Omega _{1}\left( X_{1 k},\cdots ,X_{n k} \right) \) is an average of \(m=[n / 2]\) i.i.d. random variables, i.e., \(\Omega _{1}=m^{-1} \sum _{l} h^{(l)} {\mathbf {1}}\left\{ h^{(l)} \le M\right\} \). Since the exponential function is convex, it follows from Jensen's inequality that, for \(0<t \le 2 s_{0}\),

$$\begin{aligned} \begin{aligned} E\left\{ \exp \left( t {\hat{I}}_{k 11}^{\star }\right) \right\}&=E\left[ \exp \left\{ t(n !)^{-1} \sum _{n !} \Omega _{1}\left( X_{1 k},\cdots ,X_{n k}\right) \right\} \right] \\&\le (n !)^{-1} \sum _{n !} E\left[ \exp \left\{ t \Omega _{1}\left( X_{1 k},\cdots ,X_{n k}\right) \right\} \right] \\&=E^{m}\left\{ \exp \left( m^{-1} t h^{(l)} 1\left\{ h^{(l)} \le M\right\} \right) \right\} , \end{aligned} \end{aligned}$$

which, together with Inequality 1, immediately entails that

$$\begin{aligned} \begin{aligned}&\mathrm {P}\left( {\hat{I}}_{k 11}^{\star }-I_{k 11} \ge \varepsilon \right) \\&\quad \le \exp (-t \varepsilon ) E^{m}\left\{ \exp \left( m^{-1} t\left[ h^{(l)} {\mathbf {1}}\left\{ h^{(l)} \le M\right\} -I_{k 11}\right] \right) \right\} \\&\quad \le \exp \left\{ -t \varepsilon +M^{2} t^{2} /(8 m)\right\} . \end{aligned} \end{aligned}$$

By choosing \(t=4 \varepsilon m / M^{2}\), we have \(\mathrm {P}\left( {\hat{I}}_{k 11}^{\star }-I_{k 11} \ge \varepsilon \right) \le \exp \left( -2 \varepsilon ^{2} m / M^{2}\right) \). Applying the same argument to the other tail, we easily obtain

$$\begin{aligned} \mathrm {P}\left( \left| {\hat{I}}_{k 11}^{\star }-I_{k 11}\right| \ge \varepsilon \right) \le 2 \exp \left( -2 \varepsilon ^{2} m / M^{2}\right) . \end{aligned}$$
(a.2)
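For completeness, the choice \(t=4 \varepsilon m / M^{2}\) above is simply the minimizer of the quadratic exponent:

$$\begin{aligned} \frac{d}{dt}\left\{ -t \varepsilon +\frac{M^{2} t^{2}}{8 m}\right\} =-\varepsilon +\frac{M^{2} t}{4 m}=0 \Longleftrightarrow t=\frac{4 \varepsilon m}{M^{2}}, \quad \text {giving} \quad -\frac{4 \varepsilon ^{2} m}{M^{2}}+\frac{2 \varepsilon ^{2} m}{M^{2}}=-\frac{2 \varepsilon ^{2} m}{M^{2}}. \end{aligned}$$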

Next, we show the consistency of \({\hat{I}}_{k 12}^{\star }\). By the Cauchy–Schwarz and Markov inequalities, for any \(s^{\prime }>0\),

$$\begin{aligned} \begin{aligned} I_{k 12}^{2}&\le E\left\{ h^{2}\left( X_{i k},X_{j k}\right) \right\} \mathrm {P}\left\{ h\left( X_{i k},X_{j k}\right) >M\right\} \\&\le E\left\{ h^{2}\left( X_{i k},X_{j k}\right) \right\} E\left\{ \exp \left[ s^{\prime } h\left( X_{i k},X_{j k}\right) \right] \right\} \exp \left( -s^{\prime } M\right) . \end{aligned} \end{aligned}$$

We have

$$\begin{aligned} \begin{aligned} h\left( X_{i k},X_{j k}\right) \le \left| X_{i k}\right| +\left| X_{j k}\right| , \end{aligned} \end{aligned}$$

which yields

$$\begin{aligned} E\left\{ \exp \left[ s^{\prime }h\left( X_{i k},X_{j k}\right) \right] \right\} \le E\left\{ \exp \left[ s^{\prime }\left( \left| X_{i k}\right| +\left| X_{j k}\right| \right) \right] \right\} . \end{aligned}$$

If we choose \(M=c n^{\gamma }\) for \(0<\gamma <1 / 2-\kappa \), then by condition (C1), \(I_{k 12} \le \varepsilon / 2\) when n is sufficiently large. Consequently,

$$\begin{aligned} \mathrm {P}\left( \left| {\hat{I}}_{k 12}^{\star }-I_{k 12}\right| \ge \varepsilon \right) \le \mathrm {P}\left( \left| {\hat{I}}_{k 12}^{\star }\right| >\varepsilon / 2\right) . \end{aligned}$$
(a.3)

It remains to bound the probability \(\mathrm {P}\left( \left| {\hat{I}}_{k 12}^{\star }\right| >\varepsilon / 2\right) \). Observe that

$$\begin{aligned} \left\{ \left| {\hat{I}}_{k 12}^{\star }\right|>\varepsilon / 2\right\} \subseteq \left\{ \left| X_{i k}\right| >M / 2, \text{ for } \text{ some } 1 \le i \le n\right\} . \end{aligned}$$
(a.4)

By condition (C1) and Markov's inequality with \(s>0\), there exists some constant C such that

$$\begin{aligned} \mathrm {P}\left( \left| X_{i k}\right| >M / 2\right) \le 2 C \exp (-s M / 2). \end{aligned}$$

Consequently,

$$\begin{aligned} \max _{1 \le k \le p} \mathrm {P}\left( \left| {\hat{I}}_{k 12}^{\star }\right|>\varepsilon / 2\right) \le n \max _{1 \le k \le p} \mathrm {P}\left( \left| X_{i k}\right| >M / 2\right) \le 2 n C \exp (-s M / 2). \end{aligned}$$
(a.5)

Combining the results (a.1), (a.2), (a.3) and (a.5) with \(M=c n^{\gamma }\), it is easily obtained that

$$\begin{aligned} \max _{1 \le k \le p} \mathrm {P}\left( \left| {\hat{I}}_{k 1}^{\star }-I_{k 1}\right| >4 \varepsilon \right) \le 2 \exp \left( -\varepsilon ^{2} n^{1-2 \gamma }\right) +2 n C \exp \left( -s n^{\gamma } / 2\right) . \end{aligned}$$
(a.6)

Thus, we have

$$\begin{aligned} \mathrm {P}\left( \left| {\hat{I}}_{k 1}-I_{k 1}\right| >4 \varepsilon \right) \le 2 \exp \left( -\varepsilon ^{2} n^{1-2 \gamma }\right) +2 n C \exp \left( -s n^{\gamma } / 2\right) . \end{aligned}$$
(a.7)

It remains to prove the uniform consistency of \({\hat{I}}_{k, 2}\). As before, we first deal with the term \(J_{y, k 2}=E\left| X_{k}-X_{k}^{\prime }\right| {\mathbf {1}}(Y=y) {\mathbf {1}}\left( Y^{\prime }=y\right) / p_{y}\). Let \({\hat{J}}_{y, k 2}=\hat{{\bar{J}}}_{y, k 2} / {\hat{p}}_{y}\), where

$$\begin{aligned} \hat{{\bar{J}}}_{y, k 2}=n^{-2} \sum _{i \ne j}\left| X_{i k}-X_{j k}\right| {\mathbf {1}}\left( Y_{i}=y\right) {\mathbf {1}}\left( Y_{j}=y\right) \text{ and } {\hat{p}}_{y}=n^{-1} \sum _{i=1}^{n} {\mathbf {1}}\left( Y_{i}=y\right) . \end{aligned}$$

Accordingly, \(J_{y, k 2}\) can be written as \(J_{y, k 2}={\bar{J}}_{y, k 2} / p_{y}\), where \({\bar{J}}_{y, k 2}=\) \(E\left| X_{k}-X_{k}^{\prime }\right| {\mathbf {1}}(Y=y) {\mathbf {1}}\left( Y^{\prime }=y\right) \). Following the argument used to prove (a.7), we get

$$\begin{aligned} \mathrm {P}\left( \left| \hat{{\bar{J}}}_{y, k 2}-{\bar{J}}_{y, k 2}\right| >4 \varepsilon \right) \le 2 \exp \left( -\varepsilon ^{2} n^{1-2 \gamma }\right) +2 n C \exp \left( -s n^{\gamma } / 2\right) . \end{aligned}$$
(a.8)

As a result, we have

$$\begin{aligned} \begin{aligned}&\mathrm {P}\left( \left| {\hat{J}}_{y, k 2}-J_{y, k 2}\right| \ge 2 \varepsilon \right) \\&\quad = ~ \mathrm {P}\left( \left| \frac{\hat{{\bar{J}}}_{y, k 2}-{\bar{J}}_{y, k 2}}{{\hat{p}}_{y}}+\frac{{\bar{J}}_{y, k 2}\left( p_{y}-{\hat{p}}_{y}\right) }{{\hat{p}}_{y} p_{y}}\right| \ge 2 \varepsilon \right) \\&\quad \le ~ \mathrm {P}\left( \left| \frac{\hat{{\bar{J}}}_{y, k 2}-{\bar{J}}_{y, k 2}}{{\hat{p}}_{y}}\right| \ge \varepsilon \right) +\mathrm {P}\left( \left| \frac{{\bar{J}}_{y, k 2}\left( p_{y}-{\hat{p}}_{y}\right) }{{\hat{p}}_{y} p_{y}}\right| \ge \varepsilon \right) . \end{aligned} \end{aligned}$$
(a.9)

We first deal with the first term in (a.9). Under condition (C2), we have

$$\begin{aligned} \begin{aligned}&\mathrm {P}\left( \left| \frac{\hat{{\bar{J}}}_{y, k 2}-{\bar{J}}_{y, k 2}}{{\hat{p}}_{y}}\right| \ge \varepsilon \right) \\&\quad \le ~ \mathrm {P}\left( \left| \frac{\hat{{\bar{J}}}_{y, k 2}-{\bar{J}}_{y, k 2}}{{\hat{p}}_{y}}\right| \ge \varepsilon , {\hat{p}}_{y} \ge \frac{c_{1}}{2 R}\right) +\mathrm {P}\left( {\hat{p}}_{y}<\frac{c_{1}}{2 R}\right) \\&\quad \le ~ \mathrm {P}\left( \left| \hat{{\bar{J}}}_{y, k 2}-{\bar{J}}_{y, k 2}\right| \ge \frac{c_{1} \varepsilon }{2 R}\right) +\mathrm {P}\left( \left| {\hat{p}}_{y}-p_{y}\right| \ge \frac{c_{1} \varepsilon }{2 R}\right) \\&\quad \le ~ 2 \exp \left( -\frac{c_{1}^{2} \varepsilon ^{2}}{32 R^{2}} n^{1-2 \gamma }\right) +2 n C \exp \left( -\frac{s n^{\gamma }}{2}\right) +2 \exp \left( -2 n \frac{c_{1}^{2} \varepsilon ^{2}}{4 R^{2}}\right) . \end{aligned} \end{aligned}$$
(a.10)

The last inequality holds because of (a.8) and Inequality 2. We next deal with the second term in (a.9):

$$\begin{aligned} \begin{aligned}&\mathrm {P}\left( \left| \frac{{\bar{J}}_{y, k 2}\left( p_{y}-{\hat{p}}_{y}\right) }{{\hat{p}}_{y} p_{y}}\right| \ge \varepsilon \right) \\&\quad \le ~\mathrm {P}\left( \left| \frac{{\bar{J}}_{y, k 2}\left( p_{y}-{\hat{p}}_{y}\right) }{{\hat{p}}_{y} p_{y}}\right| \ge \varepsilon , {\hat{p}}_{y} \ge \frac{c_{1}}{2 R}\right) +\mathrm {P}\left( {\hat{p}}_{y}<\frac{c_{1}}{2 R}\right) \\&\quad \le ~ \mathrm {P}\left( \left| {\bar{J}}_{y, k 2}\left( p_{y}-{\hat{p}}_{y}\right) \right| \ge \frac{c_{1}^{2} \varepsilon }{2 R^{2}}\right) +\mathrm {P}\left( \left| {\hat{p}}_{y}-p_{y}\right| \ge \frac{c_{1} \varepsilon }{2 R}\right) \\&\quad \le ~ \mathrm {P}\left( 2 E\left| X_{k}\right| \cdot \left| p_{y}-{\hat{p}}_{y}\right| \ge \frac{c_{1}^{2} \varepsilon }{2 R^{2}}\right) +\mathrm {P}\left( \left| {\hat{p}}_{y}-p_{y}\right| \ge \frac{c_{1} \varepsilon }{2 R}\right) \\&\quad \le ~ \mathrm {P}\left( \left| p_{y}-{\hat{p}}_{y}\right| \ge \frac{c_{1}^{2} \varepsilon }{4 C_{0} R^{2}}\right) +2 \exp \left( -2 n \frac{c_{1}^{2} \varepsilon ^{2}}{4 R^{2}}\right) \\&\quad \le ~ 2 \exp \left( -2 n \frac{c_{1}^{4} \varepsilon ^{2}}{16 C_{0}^{2} R^{4}}\right) +2 \exp \left( -2 n \frac{c_{1}^{2} \varepsilon ^{2}}{4 R^{2}}\right) . \end{aligned} \end{aligned}$$

Combining this with (a.10), we have

$$\begin{aligned} \begin{aligned} \mathrm {P}\left( \left| {\hat{I}}_{k 2}-I_{k 2}\right| >4 \varepsilon \right) \le&~R\left\{ 2\exp \left( -\frac{c_{1}^{2} \varepsilon ^{2}}{32 R^{2}} n^{1-2 \gamma }\right) +2 n C \exp \left( -\frac{s n^{\gamma }}{2}\right) \right. \\&\left. +4 \exp \left( -2 n \frac{c_{1}^{2} \varepsilon ^{2}}{4 R^{2}}\right) +2 \exp \left( -2 n \frac{c_{1}^{4} \varepsilon ^{2}}{16 C_{0}^{2} R^{4}}\right) \right\} . \end{aligned} \end{aligned}$$
(a.11)

Combining (a.7), (a.11) and Bonferroni's inequality, we easily get

$$\begin{aligned} \begin{aligned}&\mathrm {P}\left( \left| {\widehat{\tau }}_{k}^*-\tau _{k}^*\right| \ge \varepsilon \right) = \mathrm {P}\left( |({\hat{I}}_{k, 1}-{\hat{I}}_{k, 2})-(I_{k, 1}-I_{k, 2})| \ge \varepsilon \right) \\&\quad \le ~ \mathrm {P}\left( \left| {\hat{I}}_{k, 1}-I_{k, 1}\right| \ge \varepsilon / 2\right) +\mathrm {P}\left( \left| {\hat{I}}_{k, 2}-I_{k, 2}\right| \ge \varepsilon / 2\right) \\&\quad = ~ O\left\{ n^{\kappa } \exp \left( -C_{1} \varepsilon ^{2} n^{1-2 \gamma -\kappa }\right) +n^{1+\kappa } \exp \left( -C_{2} n^{\gamma }\right) \right\} , \end{aligned} \end{aligned}$$

for some positive constants \(C_{1}\) and \(C_{2}\). The convergence rate of \({\widehat{\tau }}_{k}^*\) is now obtained from

$$\begin{aligned} \begin{aligned}&\mathrm {P}\left\{ \max _{1 \le k \le p}\left| {\widehat{\tau }}_{k}^*-\tau _{k}^*\right| \ge \varepsilon \right\} \\&\quad \le ~ p \mathrm {P}\left\{ \left| {\widehat{\tau }}_{k}^*-\tau _{k}^*\right| \ge \varepsilon \right\} \\&\quad = ~ p O\left\{ n^{\kappa } \exp \left( -C_{1} \varepsilon ^{2} n^{1-2 \gamma -2 \kappa } / 2\right) +n^{1+\kappa } \exp \left( -C_{2} n^{\gamma }\right) \right\} . \end{aligned} \end{aligned}$$

Let \(\varepsilon =c n^{-\varsigma }\), where \(\varsigma \) satisfies \(0 \le \varsigma <1 / 2\). We thus have

$$\begin{aligned} \mathrm {P}\left\{ \max _{1 \le k \le p}\left| {\widehat{\tau }}_{k}^*-\tau _{k}^*\right| \ge c n^{-\varsigma }\right\} =p O\left\{ n^{\kappa } \exp \left( -C_{1} n^{1-2 \gamma -2 \kappa -2 \varsigma }\right) +n^{1+\kappa } \exp \left( -C_{2} n^{\gamma }\right) \right\} . \end{aligned}$$
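In particular (our reading of this rate; the paper's formal dimensionality condition may be stated elsewhere), the right-hand side vanishes whenever

$$\begin{aligned} \log p=o\left( \min \left\{ n^{1-2 \gamma -2 \kappa -2 \varsigma }, ~ n^{\gamma }\right\} \right) , \end{aligned}$$

so the dimension p may grow exponentially fast in a power of the sample size.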

Now, we deal with the second part of Theorem 2. If \({\mathcal {A}} \not \subseteq \hat{{\mathcal {A}}}\), then there must be some \(k \in {\mathcal {A}}\) such that \({\widehat{\tau }}_{k}^*<c n^{-\varsigma }\). It follows from Condition (C3) that \(\left| {\widehat{\tau }}_{k}^*-\tau _{k}^*\right| >c n^{-\varsigma }\) for some \(k \in {\mathcal {A}}\), indicating that the events satisfy \(\{{\mathcal {A}} \not \subseteq \hat{{\mathcal {A}}}\} \subseteq \left\{ \left| {\widehat{\tau }}_{k}^*-\tau _{k}^*\right| >c n^{-\varsigma } \text{ for } \text{ some } k \in {\mathcal {A}}\right\} .\) Hence \(D_{n}=\left\{ \max _{k \in {\mathcal {A}}} \left| {\widehat{\tau }}_{k}^*-\tau _{k}^*\right| \le c n^{-\varsigma }\right\} \subseteq \{{\mathcal {A}} \subseteq \hat{{\mathcal {A}}}\}.\) Consequently,

$$\begin{aligned} \begin{aligned} \mathrm {P}\{{\mathcal {A}} \subseteq \hat{{\mathcal {A}}}\}&\ge \mathrm {P}\left\{ D_{n}\right\} =1-\mathrm {P}\left\{ D_{n}^{c}\right\} =1-\mathrm {P}\left\{ \max _{k \in {\mathcal {A}}}\left| {\widehat{\tau }}_{k}^*-\tau _{k}^*\right| > c n^{-\varsigma }\right\} \\&\ge 1-s_{n} \mathrm {P}\left\{ \left| {\widehat{\tau }}_{k}^*-\tau _{k}^*\right| \ge c n^{-\varsigma }\right\} \\&\ge 1-s_{n} O\left\{ n^{\kappa } \exp \left( -C_{1} n^{1-2 \gamma -2 \kappa -2 \varsigma }\right) +n^{1+\kappa } \exp \left( -C_{2} n^{\gamma }\right) \right\} , \end{aligned} \end{aligned}$$

where \(s_{n}\) is the cardinality of \({\mathcal {A}}\). This proves the second result in Theorem 2. \(\square \)

Proof of Theorem 3

Define

$$\begin{aligned} \delta =\min _{k \in {\mathcal {A}}} \tau _{k}^*-\max _{k \in {\mathcal {I}}} \tau _{k}^*. \end{aligned}$$

Then, we have

$$\begin{aligned} \begin{aligned}&\mathrm {P}\left( \min _{k \in {\mathcal {A}}} {\tilde{\tau }}_{k}^* \le \max _{k \in {\mathcal {I}}} {\tilde{\tau }}_{k}^*\right) \\&\quad = ~\mathrm {P}\left( \min _{k \in {\mathcal {A}}} {\tilde{\tau }}_{k}^*-\min _{k \in {\mathcal {A}}} \tau _{k}^*+\delta \le \max _{k \in {\mathcal {I}}} {\tilde{\tau }}_{k}^*-\max _{k \in {\mathcal {I}}} \tau _{k}^*\right) \\&\quad \le ~ \mathrm {P}\left( \max _{k \in {\mathcal {I}}}\left| {\tilde{\tau }}_{k}^*-\tau _{k}^*\right| +\max _{k \in {\mathcal {A}}}\left| {\tilde{\tau }}_{k}^*-\tau _{k}^*\right| \ge \delta \right) \\&\quad \le ~ \mathrm {P}\left( \max _{1 \le k \le p}\left| {\tilde{\tau }}_{k}^*-\tau _{k}^*\right| \ge \delta / 2\right) \le \sum _{k=1}^{p} \mathrm {P}\left( \left| {\tilde{\tau }}_{k}^*-\tau _{k}^*\right| \ge \delta / 2\right) \\&\quad \le ~ p O\left\{ n^{\kappa } \exp \left( -C_{1} \delta ^{2} n^{1-2 \gamma -2 \kappa } / 8\right) +n^{1+\kappa } \exp \left( -C_{2} n^{\gamma }\right) \right\} . \end{aligned} \end{aligned}$$

The last term goes to 0 under condition (C2) as \(n \rightarrow \infty \). \(\square \)

About this article

Cite this article

Wang, P., Lin, L. Conditional characteristic feature screening for massive imbalanced data. Stat Papers 64, 807–834 (2023). https://doi.org/10.1007/s00362-022-01342-8
