Conditional characteristic feature screening for massive imbalanced data

Wang, Ping; Lin, Lu

doi:10.1007/s00362-022-01342-8

Conditional characteristic feature screening for massive imbalanced data

Regular Article
Published: 25 July 2022

Volume 64, pages 807–834, (2023)
Cite this article

Statistical Papers Aims and scope Submit manuscript

314 Accesses
Explore all metrics

Abstract

Using conditional characteristic function as a screening index, a new model-free screening procedure is proposed to deal with variable screening problems in large-scale high-dimensional imbalanced data analysis. For binary response, our results show that the screening index under full data is proportional to the screening index under case–control sampling, an important sampling property for imbalanced data. This conclusion implies that we can apply this screening method to imbalanced data. Surely, the most appealing feature of the screening index is that it can be expressed as a simple linear combination of two first-order moments, so it is computationally simple. In addition, we successfully extend this method to multiple response. The theoretical properties are established under regularity conditions. To compare the performance of our method with its competitors, extensive simulations are conducted, which shows that the proposed procedure performs well in both the linear and nonlinear models. Finally, a real data analysis is investigated to further illustrate the effectiveness of the new method.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Imbalanced data preprocessing techniques for machine learning: a systematic mapping study

Article 09 November 2022

Estimating the sample mean and standard deviation from the sample size, median, range and/or interquartile range

Article Open access 19 December 2014

Feature selection techniques for machine learning: a survey of more than two decades of research

Article 01 December 2023

References

Battey H, Fan J, Liu H, Lu J, Zhu Z (2018) Distributed testing and estimation under sparse highdimensional models. Ann Stat 46:1352–1382
Article MATH Google Scholar
Cai T, Wei H (2019) Transfer learning for nonparametric classification: minimax rate and adaptive classifier. https://arxiv.org/pdf/1906.02903.pdf
Chang J, Tang C, Wu Y (2013) Marginal empirical likelihood and sure independence feature screening. Ann Stat 41:2123–2148
Article MathSciNet MATH Google Scholar
Chen K (2001) Parametric models for response-biased sampling. J R Stat Soc Ser B 63:775–789
Article MathSciNet MATH Google Scholar
Chen X, Xie M (2014) A split-and-conquer approach for analysis of extraordinarily large data. Stat Sin 24:1655–1684
MathSciNet MATH Google Scholar
Chen K, Lin Y, Yao Y, Zhou C (2017) Regression analysis with response-biased sampling. Stat Sin 27:1699–1714
MATH Google Scholar
Cui H, Li R, Zhong W (2015) Model-free feature screening for ultrahigh dimensional discriminant analysis. J Am Stat Assoc 110:630–641
Article MathSciNet MATH Google Scholar
Fan J, Lv J (2008) Sure independence screening for ultrahigh dimensional feature space. J Roy Stat Soc B 70:849–911
Article MathSciNet MATH Google Scholar
Fan J, Song R (2010) Sure independence screening in generalized linear models with np-dimensionality. Ann Stat 38:3567–3604
Article MathSciNet MATH Google Scholar
Fan J, Feng Y, Song R (2011) Nonparametric independence screening in sparse ultrahigh dimensional additive models. J Am Stat Assoc 106:544–557
Article MATH Google Scholar
Fithian W, Hastie T (2014) Local case-control sampling: efficient subsampling in imblanced data sets. Ann Stat 42:1693–1724
Article MATH Google Scholar
He X, Wang L, Hong H (2013) Quantile adaptive model-free variable screening for high-dimensional heterogeneous data. Ann Stat 41:342–369
MathSciNet MATH Google Scholar
Kang J, Hong H, Li Y (2017) Partition-based ultrahigh dimensional variable screening. Biometrika 104:785–800
Article MathSciNet MATH Google Scholar
Li G, Peng H, Zhang J, Zhu L (2012) Robust rank correlation based screening. Ann Stat 40:1846–1877
Article MathSciNet MATH Google Scholar
Li R, Zhong W, Zhu L (2012) Feature screening via distance correlation learning. J Am Stat Assoc 107:1129–1139
Article MathSciNet MATH Google Scholar
Li X, Li R, Xia Z, Xu C (2020) Distributed feature screening via componentwise debiasing. J Mach Learn Res 21:1–32
MathSciNet MATH Google Scholar
Lin N, Xi R (2011) Aggregated estimating equation estimation. Stat Interface 4:73–83
Article MathSciNet MATH Google Scholar
Lu J, Lin L (2018) Feature screening for multi-response varying coefficient models with ultrahigh dimensional predictors. Comput Stat Data Anal 128:242–254
Article MathSciNet MATH Google Scholar
Lu J, Lin L (2018) Model-free sure independence screening in the context of ultrahigh dimensional covariate together with labeled response. Manuscript
Luo S, Chen Z (2020) Feature selection by canonical correlation search in high-dimensional multi-response models with complex group structures. J Am Stat Assoc 115:1227–1235
Article MATH Google Scholar
Ma P, Mahoney M, Yu B (2015) A statistical perspective on algorithmic leveraging. J Mach Learn Res 16:861–911
MathSciNet MATH Google Scholar
Mai Q, Zou H (2012) The kolmogorov filter for variable screening in high-dimensional binary classification. Biometrika 100:229–234
Article MathSciNet MATH Google Scholar
Mai Q, Zou H (2015) The fused kolmogorov filter: a nonparametric model-free screening method. Ann Stat 43:1471–1497
Article MathSciNet MATH Google Scholar
Manski C (1993) The selection problem in econometrics and statistics. Handb Stat 11:73–84
Article MathSciNet Google Scholar
Pan R, Wang H, Li R (2016) Ultrahigh dimensional multi-class linear discriminant analysis by pairwise sure independence screening. J Am Stat Assoc 111:169–179
Article Google Scholar
Schifano E, Wu J, Wang C, Yan J, Chen M (2016) Online updating of statistical inference in the big data setting. Technometrics 58:393–403
Article MathSciNet Google Scholar
Serfling R (2009) Approximation theorems of mathematical statistics. Wiley, New York
MATH Google Scholar
Song R, Lu W, Ma S, Jeng X (2014) Censored rank independence screening for high-dimensional survival data. Biometrika 101:799–814
Article MathSciNet MATH Google Scholar
Székely G, Rizzo M, Bakirov N (2007) Measuring and testing dependence by correlation of distances. Ann Stat 35:2769–2794
Article MathSciNet MATH Google Scholar
Vapnik V (1998) Statistical learning theory. Wiley, New York
MATH Google Scholar
Wang X, Leng C (2016) High dimensional ordinary least squares projection for screening variables. J R Stat Soc Ser B 78:589–611
Article MathSciNet MATH Google Scholar
Wang H, Zhu R, Ma P (2018) Optimal subsampling for large sample logistic regression. J Am Stat Assoc 113:829–844
Article MathSciNet MATH Google Scholar
Xie J, Lin Y, Yan X, Tang N (2019) Category-adaptive variable screening for ultrahigh dimensional heterogeneous categorical data. J Am Stat Assoc 115:747–760
Article MATH Google Scholar
Xie J, Hao M, Liu W, Lin Y (2020) Fused variable screening for massive imbalanced data. Comput Stat Data Anal 141:94–108
Article MathSciNet MATH Google Scholar
Zhou T, Zhu L (2017) Model-free feature screening for ultrahigh dimensional censored regression. Stat Comput 27:947–961
Article MathSciNet MATH Google Scholar

Download references

Author information

Authors and Affiliations

Zhongtai Securities Institute for Financial Studies, Shandong University, Jinan, China
Ping Wang & Lu Lin

Authors

Ping Wang
View author publications
You can also search for this author in PubMed Google Scholar
Lu Lin
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Lu Lin.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The research was supported by NNSF project of China (11971265) and National Key R &D Program of China (2018YFA0703900).

Appendix: Proof of main results

We first prove Lemma 2, which is the basis of our ideas.

Firstly, the proof of formula (a) of Lemma 2.

$$\begin{aligned}&\phi _{{X_k}|Y=1}(t)-\phi _{X_k}(t) \\&\quad =E(e^{itX_k}| Y=1)-E(E(e^{itX_k}|Y))\\&\quad =E(e^{itX_k}| Y=1)-(P_1E(e^{itX_k}|Y=1)+P_0E(e^{itX_k}| Y=0))\\&\quad =P_0(\phi _{{X_k}|Y=1}(t)-\phi _{{X_k}|Y=0}(t)). \end{aligned}$$

Similarly,

$$\begin{aligned} \phi _{{X_k}|Y=0}(t)-\phi _{X_k}(t)=P_1(\phi _{{X_k}|Y=0}(t)-\phi _{{X_k}|Y=1}(t)). \end{aligned}$$

Secondly, the proof of the formula (b) of Lemma 2.

$$\begin{aligned} \tau _k= & {} \sum _{y=0,1} p_y\int _{\mathbb {R}}\Vert \phi _{{X_k}|Y=y}(t)-\phi _{X_k}(t)\Vert ^2w(t) dt\\= & {} P_1P_0^2\int _{\mathbb {R}}\Vert \phi _{{X_k}|Y=1}(t)-\phi _{{X_k}|Y=0}(t)\Vert ^2w(t)dt\\&\quad +P_0P_1^2\int _{\mathbb {R}}\Vert \phi _{{X_k}|Y=0}(t)-\phi _{{X_k}|Y=0}(t)\Vert ^2w(t)dt\\= & {} P_0P_1\int _{\mathbb {R}}\Vert \phi _{{X_k}|Y=1}(t)-\phi _{{X_k}|Y=0}(t)\Vert ^2w(t)dt. \end{aligned}$$

$\square $

Proof of Theorem 1

$$\begin{aligned} \tau _k^*&=P_0^*P_1^*\int _{\mathbb {R}}\Vert \phi _{{X_k}^*|Y^*=1}(t)-\phi _{{X_k}^*|Y^*=0}(t))\Vert ^2w(t)dt \nonumber \\&=P_0^*P_1^*\int _{\mathbb {R}}\Vert \phi _{{X_k}|Y=1}(t)-\phi _{{X_k}|Y=0}(t))\Vert ^2w(t)dt \nonumber \\&=\frac{P_0^*P_1^*}{P_0P_1}P_0P_1\int _{\mathbb {R}}\Vert \phi _{{X_k}|Y=1}(t)-\phi _{{X_k}|Y=0}(t))\Vert ^2w(t)dt \nonumber \\&=a\tau _k ,\nonumber \end{aligned}$$

where $a=\frac{P_0^*P_1^*}{P_0P_1}$, $P_0^*$ and $P_1^*$ are the probabilities of response variable $Y^*$ respectively valued at $Y^*=0$ and $Y^*=1$ under case–control sampling setting. The second equality above holds due to the assumption of case–control sampling design. When the population and sampling are fixed, a is a constant. $\square $

Before proving Theorem 2, we introduce two important inequalities will be used repeatedly in the proof of Theorem 2.

Inequality 1. Let $\mu =E(Y)$. If $P(a \le Y\le b)=1$, then

$$\begin{aligned} E[exp\{s(Y-\mu )\}] \le exp\{s^2(b-a)^2/8\} \quad \text {for any} \quad s>0. \end{aligned}$$

Inequality 2. Suppose that $h(Y_1,\cdots ,Y_m)$ is a kernel of the U statistics $U_n$, and let $\theta =E\{h(Y_1,\cdots ,Y_m)\}$. If $a\le h(Y_1,\cdots ,Y_m) \le b$, then, for any $t>0$ and $n\ge m$,

$$\begin{aligned} P(U_n-\theta \ge t) \le exp\{-2[n/m] t^2/(b-a)^2\}, \end{aligned}$$

where [n/m] denotes the integer part of n/m. Due to the symmetry of U statistics, we further have that

$$\begin{aligned} P(\left| U_n-\theta \ge t\right| ) \le exp\{-2[n/m] t^2/(b-a)^2\}, \end{aligned}$$

Proof of Theorem 2

The events satisfy $\{\left| {\tilde{\tau }}_{k}^*-\tau _{k}^*\right| >\varepsilon \}$ $\subseteq $ $ \{\left| {\hat{\tau }}_{k}^*-\tau _{k}^*\right| >\varepsilon \} $. To prove sure screening property of ${\tilde{\tau }}_{k}^*$, we just need to prove that ${\hat{\tau }}_{k}^*$ also has the same properties. The ranking statistics ${\widehat{\tau }}_k^*$ can be written as

$$\begin{aligned} {\widehat{\tau }}_k^*={\hat{I}}_{k, 1}+{\hat{I}}_{k, 2}, \end{aligned}$$

where

$$\begin{aligned} \begin{array}{l} {\hat{I}}_{k, 1}=n^{-2} \sum _{i, j=1}^{n}\left| X_{ik}-X_{jk}\right| , \\ {\hat{I}}_{k, 2}=\sum _{y=1}^{R} {\hat{J}}_{y, k 2} ~ \text{ for } {\hat{J}}_{y, k 2}=\frac{n^{-2} \sum _{i, j=1}^{n}\left| X_{ik}-X_{jk}\right| {\mathbf {1}}\left( Y_{i}=y\right) {\mathbf {1}}\left( Y_{j}=y\right) }{n^{-1} \sum _{l=1}^{n} {\mathbf {1}}\left( Y_{l}=y\right) }. \end{array} \end{aligned}$$

Correspondingly, the population form $\tau _k^*$ can also be written as $\tau _{k}^*=I_{k, 1}+$ $I_{k, 2}$, where

$$\begin{aligned} \begin{array}{l} I_{k, 1}=E\left| X_{k}-X_{k}^{\prime }\right| , \\ I_{k, 2}=\sum _{\nu =1}^{R} J_{y, k 2} ~ \text{ for } J_{y, k 2}=\frac{E\left| X_{k}-X_{k}^{\prime }\right| {\mathbf {1}}(Y=y) {\mathbf {1}}\left( Y^{\prime }=y\right) }{E {\mathbf {1}}(Y=y)}. \end{array} \end{aligned}$$

We aim to show the uniform consistency of the ${\widehat{\tau }}_{k}^*$ under regularity conditions.

Firstly, we deal with the term ${\hat{I}}_{k, 1}$. Define

$$\begin{aligned} {\hat{I}}_{k 1}^{\star }=\frac{2}{n(n-1)} \sum _{i<j} h\left( X_{i k}, X_{j k}\right) , \end{aligned}$$

where $h\left( X_{i k},X_{j k}\right) =\frac{1}{2}\left| X_{i k}-X_{j k}\right| +\frac{1}{2}\left| X_{j k}-X_{i k}\right| $. Then it has that ${\hat{I}}_{k 1}^{\star }(n-1) / n={\hat{I}}_{k 1}$. ${\hat{I}}_{k 1}^{\star }$ is a usual U statistics with kernel h, thus we shall establish the uniform consistency of ${\hat{I}}_{k 1}^{\star }$ by invoking the theory of U statistics (Serfling, 1980, chap. 5). Condition (C1) states that $I_{k 1}$ is uniformly bounded in p, that is, $\sup _{p} \max _{1 \le k \le p} I_{k 1}<\infty .$ For any give $\varepsilon >0$, take n large enough such that $I_{k 1} / n<\varepsilon $, then it can be showed that

$$\begin{aligned} \begin{aligned}&\mathrm {P}\left( \left| {\hat{I}}_{k 1}-I_{k 1}\right| \ge 2 \varepsilon \right) \\&\quad \le ~ \mathrm {P}\left\{ \left| {\hat{I}}_{k 1}^{\star }-I_{k 1}\right| (n-1) / n \ge 2 \varepsilon -I_{k 1} / n\right\} \\&\quad \le ~ \mathrm {P}\left( \left| {\hat{I}}_{k 1}^{\star }-I_{k 1}\right| \ge \varepsilon \right) . \end{aligned} \end{aligned}$$

(a.1)

The U statistics ${\hat{I}}_{k 1}^{\star }$ can be written as

$$\begin{aligned} \begin{aligned} {\hat{I}}_{k 1}^{\star }=&~ \frac{1}{2} \frac{2}{n(n-1)} \sum _{i<j} h\left( X_{i k}, X_{j k}\right) {\mathbf {1}}\left\{ h\left( X_{i k},X_{j k}\right) \le M\right\} \\&+\frac{1}{2} \frac{2}{n(n-1)} \sum _{i<j} h\left( X_{i k},X_{j k}\right) {\mathbf {1}}\left\{ h\left( X_{i k},X_{j k}\right) >M\right\} \\ =&~ {\hat{I}}_{k 11}^{\star }+{\hat{I}}_{k 12}^{\star }, \end{aligned} \end{aligned}$$

where M will be specified later.

Accordingly, we decompose $I_{k 1}$ into two parts:

$$\begin{aligned} \begin{aligned} I_{k 1}=&~ E\left[ h\left( X_{i k},X_{j k}\right) {\mathbf {1}}\left\{ h\left( X_{i k},X_{j k}\right) \le M\right\} \right] \\&+E\left[ h\left( X_{i k},X_{j k}\right) {\mathbf {1}}\left\{ h\left( X_{i k},X_{j k}\right) >M\right\} \right] \\ =&~ I_{k 11}+I_{k 12}. \end{aligned} \end{aligned}$$

Obviously, ${\hat{I}}_{k 11}^{\star }$ and ${\hat{I}}_{k 12}^{\star }$ are unbiased estimators of $I_{k 11}$ and $I_{k 12}$. We deal with the consistency of ${\hat{I}}_{k 11}^{\star }$ first. With the Markov’s inequality,for any $t>0$, we obtain that

$$\begin{aligned} \begin{aligned}&\mathrm {P}\left( {\hat{I}}_{k 11}^{\star }-I_{k 11} \ge \varepsilon \right) \\&\quad = ~ \mathrm {P}\left( exp(t{\hat{I}}_{k 11}^{\star }) \ge exp(tI_{k 11}) exp(t\varepsilon )\right) \\&\quad \le ~ \exp (-t \varepsilon ) \exp \left( -t I_{k 11}\right) E\left[ \exp \left( t {\hat{I}}_{k 11}^{\star }\right) \right] . \end{aligned} \end{aligned}$$

Serfling (2009) showed that any U statistics can be represented as an average of of i.i.d random variables, i.e., ${\hat{I}}_{k 11}^{\star }=(n !)^{-1} \sum _{n !} \Omega _{1}\left( X_{1 k},\cdots ,X_{n k}\right) $ where $\sum _{n !}$ denotes the summation over all possible permutations of $(1, \cdots , n)$, and each $\Omega _{1}\left( X_{1 k},X_{n k} \right) $ is an average of $m=[n / 2]$ i.i.d. random variables (i.e., $\Omega _{1}=m^{-1} \sum _{l} h^{(l)} {\mathbf {1}}\left\{ h^{(l)} \le M\right\} .$ Since the exponential function is convex, it follows from the Jensen’s inequality that, for $0<t \le 2 s_{0}$,

$$\begin{aligned} \begin{aligned} E\left\{ \exp \left( t {\hat{I}}_{k 11}^{\star }\right) \right\}&=E\left[ \exp \left\{ t(n !)^{-1} \sum _{n !} \Omega _{1}\left( X_{1 k},\cdots ,X_{n k}\right) \right\} \right] \\&\le (n !)^{-1} \sum _{n !} E\left[ \exp \left\{ t \Omega _{1}\left( X_{1 k},\cdots ,X_{n k}\right) \right\} \right] \\&=E^{m}\left\{ \exp \left( m^{-1} t h^{(l)} 1\left\{ h^{(l)} \le M\right\} \right) \right\} , \end{aligned} \end{aligned}$$

which together with Inequality 2, entails immediately that

$$\begin{aligned} \begin{aligned}&\mathrm {P}\left( {\hat{I}}_{k 11}^{\star }-I_{k 11} \ge \varepsilon \right) \\&\quad \le \exp (-t \varepsilon ) E^{m}\left\{ \exp \left( m^{-1} t\left[ h^{(l)} {\mathbf {1}}\left\{ h^{(l)} \le M\right\} -I_{k 11}\right] \right) \right\} \\&\quad \le \exp \left\{ -t \varepsilon +M^{2} t^{2} /(8 m)\right\} . \end{aligned} \end{aligned}$$

By choosing $t=4 \varepsilon m / M^{2}$, we have $\mathrm {P}\left( {\hat{I}}_{k 11}^{\star }-I_{k 11} \ge \varepsilon \right) \le \exp \left( -2 \varepsilon ^{2} m / M^{2}\right) $. Therefore, by the symmetry of U statistics, we can obtain easily that

$$\begin{aligned} \mathrm {P}\left( \left| {\hat{I}}_{k 11}^{\star }-I_{k 11}\right| \ge \varepsilon \right) \le 2 \exp \left( -2 \varepsilon ^{2} m / M^{2}\right) . \end{aligned}$$

(a.2)

Next, we show the consistency of ${\hat{I}}_{k 12}^{\star }.$ With Cauchy-Schwartz and Markov’s inequalities, for any $s^{\prime }>0$,

$$\begin{aligned} \begin{aligned} I_{k 12}^{2}&\le E\left\{ h^{2}\left( X_{i k},X_{j k}\right) \right\} \mathrm {P}\left\{ h\left( X_{i k},X_{j k}\right) >M\right\} \\&\le E\left\{ h^{2}\left( X_{i k},X_{j k}\right) \right\} E\left\{ \exp \left[ s^{\prime } h\left( X_{i k},X_{j k}\right) \right] \right\} \exp \left( -s^{\prime } M\right) . \end{aligned} \end{aligned}$$

We have

$$\begin{aligned} \begin{aligned} h\left( X_{i k},X_{j k}\right) \le \left| X_{i k}\right| +\left| X_{j k}\right| , \end{aligned} \end{aligned}$$

together with the Cauchy-Schwartz inequality, yields that

$$\begin{aligned} E\left\{ \exp \left[ sh\left( X_{i k},X_{j k}\right) \right] \right\} \le E\left\{ \exp \left[ s\left( \left| X_{i k}\right| +\left| X_{j k}\right| \right) \right] \right\} . \end{aligned}$$

If we choose $M=c n^{\gamma }$ for $0<\gamma <1 / 2-\kappa $, then by condition $(\mathrm {C} 1), I_{k 12} \le \varepsilon / 2$ when n is sufficiently large. Consequently,

$$\begin{aligned} \mathrm {P}\left( \left| {\hat{I}}_{k 12}^{\star }-I_{k 12}\right| \ge \varepsilon \right) \le \mathrm {P}\left( \left| {\hat{I}}_{k 12}^{\star }\right| >\varepsilon / 2\right) . \end{aligned}$$

(a.3)

It remains to bound the probability $\mathrm {P}\left( \left| {\hat{I}}_{k 12}^{\star }\right| >\varepsilon / 2\right) $. Observe that

$$\begin{aligned} \left\{ \left| {\hat{I}}_{k 12}^{\star }\right|>\varepsilon / 2\right\} \subseteq \left\{ \left| X_{i k}\right| >M / 2, \text{ for } \text{ some } 1 \le i \le n\right\} . \end{aligned}$$

(a.4)

By invoking condition (C1) and Markov’s inequality for $s>0$, there must be some constant C such that

$$\begin{aligned} \mathrm {P}\left( \left| X_{i k}\right| >M / 2\right) \le 2 C \exp (-s M / 2). \end{aligned}$$

Consequently,

$$\begin{aligned} \max _{1 \le k \le p} \mathrm {P}\left( \left| {\hat{I}}_{k 12}^{\star }\right|>\varepsilon / 2\right) \le n \max _{1 \le k \le p} \mathrm {P}\left( \left| X_{i k}\right| >M / 2\right) \le 2 n C \exp (-s M / 2). \end{aligned}$$

(a.5)

Combing the results (a.1), (a.2), (a.3) and (a.5) and $M=c n^{\gamma }$, it is easily obtained that

$$\begin{aligned} \max _{1 \le k \le p} \mathrm {P}\left( \left| {\hat{I}}_{k 1}^{\star }-I_{k 1}\right| >4 \varepsilon \right) \le 2 \exp \left( -\varepsilon ^{2} n^{1-2 \gamma }\right) +2 n C \exp \left( -s n^{\gamma } / 2\right) . \end{aligned}$$

(a.6)

Thus, we have

$$\begin{aligned} \mathrm {P}\left( \left| {\hat{I}}_{k 1}-I_{k 1}\right| >4 \varepsilon \right) \le 2 \exp \left( -\varepsilon ^{2} n^{1-2 \gamma }\right) +2 n C \exp \left( -s n^{\gamma } / 2\right) . \end{aligned}$$

(a.7)

It remains to prove the uniform consistency of ${\hat{I}}_{k, 2} .$ We still first deal with the term $J_{y, k 2}=E\left| X_{k}-X_{k}^{\prime }\right| {\mathbf {1}}(Y=y) {\mathbf {1}}\left( Y^{\prime }=y\right) / p_{y} .$ Let ${\hat{J}}_{y, k 2}=\hat{{\bar{J}}}_{y, k 2} / {\hat{p}}_{y}$, where

$$\begin{aligned} \hat{{\bar{J}}}_{y, k 2}=n^{-2} \sum _{i \ne j}\left| X_{i k}-X_{j k}\right| {\mathbf {1}}\left( Y_{i}=y\right) {\mathbf {1}}\left( Y_{j}=y\right) \text{ and } {\hat{p}}_{y}=n^{-1} \sum _{i=1}^{n} {\mathbf {1}}\left( Y_{i}=y\right) . \end{aligned}$$

Accordingly, $J_{y, k 2}$ can be decomposed into $J_{y, k 2}={\bar{J}}_{y, k 2} / p_{y}$ where ${\bar{J}}_{y, k 2}=$ $E\left| X_{k}-X_{k}^{\prime }\right| {\mathbf {1}}(Y=y) {\mathbf {1}}\left( Y^{\prime }=y\right) $. Following the argument for proving (a.7), we get that

$$\begin{aligned} \mathrm {P}\left( \left| \hat{{\bar{J}}}_{y, k 2}-{\bar{J}}_{y, k 2}\right| >4 \varepsilon \right) \le 2 \exp \left( -\varepsilon ^{2} n^{1-2 \gamma }\right) +2 n C \exp \left( -s n^{\gamma } / 2\right) . \end{aligned}$$

(a.8)

As a result, it has that

$$\begin{aligned} \begin{aligned}&\mathrm {P}\left( \left| {\hat{J}}_{y, k 2}-J_{y, k 2}\right| \ge 2 \varepsilon \right) \\&\quad \le ~ \mathrm {P}\left( \left| \frac{\hat{{\bar{J}}}_{y, k 2}-{\bar{J}}_{y, k 2}}{{\hat{p}}_{y}}-\frac{{\bar{J}}_{y, k 2}\left( p_{y}-{\hat{p}}_{y}\right) }{{\hat{p}}_{y} p_{y}}\right| \ge 2 \varepsilon \right) \\&\quad \le ~ \mathrm {P}\left( \left| \frac{\hat{{\bar{J}}}_{y, k 2}-{\bar{J}}_{y, k 2}}{{\hat{p}}_{y}}\right| \ge \varepsilon \right) +\mathrm {P}\left( \left| \frac{{\bar{J}}_{y, k 2}\left( p_{y}-{\hat{p}}_{y}\right) }{{\hat{p}}_{y} p_{y}}\right| \ge \varepsilon \right) . \end{aligned} \end{aligned}$$

(a.9)

We first deal with the first term in (a.9). Under condition (C2), we have

$$\begin{aligned} \begin{aligned}&\mathrm {P}\left( \left| \frac{\hat{{\bar{J}}}_{y, k 2}-{\bar{J}}_{y, k 2}}{{\hat{p}}_{y}}\right| \ge \varepsilon \right) \\&\quad \le ~ \mathrm {P}\left( \left| \frac{\hat{{\bar{J}}}_{y, k 2}-{\bar{J}}_{y, k 2}}{{\hat{p}}_{y}}\right| \ge \varepsilon , {\hat{p}}_{y} \ge \frac{c_{1}}{2 R}\right) +\mathrm {P}\left( {\hat{p}}_{y}<\frac{c_{1}}{2 R}\right) \\&\quad \le ~ \mathrm {P}\left( \left| \hat{{\bar{J}}}_{y, k 2}-{\bar{J}}_{y, k 2}\right| \ge \frac{c_{1} \varepsilon }{2 R}\right) +\mathrm {P}\left( \left| {\hat{p}}_{y}-p_{y}\right| \ge \frac{c_{1} \varepsilon }{2 R}\right) \\&\quad \le ~ 2 \exp \left( -\frac{c_{1}^{2} \varepsilon ^{2}}{32 R^{2}} n^{1-2 \gamma }\right) +2 n C \exp \left( -\frac{s n^{\gamma }}{2}\right) +2 \exp \left( -2 n \frac{c_{1}^{2} \varepsilon ^{2}}{4 R^{2}}\right) . \end{aligned} \end{aligned}$$

(a.10)

The last inequality holds because of (a.8) and Inequality 1. We next deal with the second term in (a.9),

$$\begin{aligned} \begin{aligned}&\mathrm {P}\left( \left| \frac{{\bar{J}}_{y, k 2}\left( p_{y}-{\hat{p}}_{y}\right) }{{\hat{p}}_{y} p_{y}}\right| \ge \varepsilon \right) \\&\quad \le ~\mathrm {P}\left( \left| \frac{{\bar{J}}_{y, k 2}\left( p_{y}-{\hat{p}}_{y}\right) }{{\hat{p}}_{y} p_{y}}\right| \ge \varepsilon , {\hat{p}}_{y} \ge \frac{c_{1}}{2 R}\right) +\mathrm {P}\left( {\hat{p}}_{y}<\frac{c_{1}}{2 R}\right) \\&\quad \le ~ \mathrm {P}\left( \left| {\bar{J}}_{y, k 2}\left( p_{y}-{\hat{p}}_{y}\right) \right| \ge \frac{c_{1}^{2} \varepsilon }{2 R^{2}}\right) +\mathrm {P}\left( \left| {\hat{p}}_{y}-p_{y}\right| \ge \frac{c_{1} \varepsilon }{2 R}\right) \\&\quad \le ~ \mathrm {P}\left( 2 E\left| X_{k}\right| \cdot \left| p_{y}-{\hat{p}}_{y}\right| \ge \frac{c_{1}^{2} \varepsilon }{2 R^{2}}\right) +\mathrm {P}\left( \left| {\hat{p}}_{y}-p_{y}\right| \ge \frac{c_{1} \varepsilon }{2 R}\right) \\&\quad \le ~ \mathrm {P}\left( \left| p_{y}-{\hat{p}}_{y}\right| \ge \frac{c_{1}^{2} \varepsilon }{4 C_{0} R^{2}}\right) +2 \exp \left( -2 n \frac{c_{1}^{2} \varepsilon ^{2}}{4 R^{2}}\right) \\&\quad \le ~ 2 \exp \left( -2 n \frac{c_{1}^{4} \varepsilon ^{2}}{16 C_{0}^{2} R^{4}}\right) +2 \exp \left( -2 n \frac{c_{1}^{2} \varepsilon ^{2}}{4 R^{2}}\right) \end{aligned}. \end{aligned}$$

This, together with (a.10), we have that

$$\begin{aligned} \begin{aligned} \mathrm {P}\left( \left| {\hat{I}}_{k 2}-I_{k 2}\right| >4 \varepsilon \right) \le&~R\left\{ \exp \left( -\frac{c_{1}^{2} \varepsilon ^{2}}{22 R^{2}} n^{1-2 \gamma }\right) +2 n C \exp \left( -\frac{s n^{\gamma }}{2}\right) \right. \\&\left. +4 \exp \left( -2 n \frac{c_{1}^{2} \varepsilon ^{2}}{4 R^{2}}\right) +2 \exp \left( -2 n \frac{c_{1}^{4} \varepsilon ^{2}}{16 C_{0}^{2} R^{4}}\right) \right\} . \end{aligned} \end{aligned}$$

(a.11)

Combining (a.7), (a.11) and the Bonferroni’s inequality, we can easily get that

$$\begin{aligned} \begin{aligned}&\mathrm {P}\left( |({\hat{I}}_{k, 1}+{\hat{I}}_{k, 2})-I_{k, 1}-I_{k, 2}| \ge \varepsilon \right) \\&\quad \le ~ \mathrm {P}\left( \left| {\hat{I}}_{k, 1}-I_{k, 1}\right| \ge \varepsilon / 2\right) +\mathrm {P}\left( \left| {\hat{I}}_{k, 2}-I_{k, 2}\right| \ge \varepsilon / 2\right) \\&\quad = ~ O\left\{ n^{\kappa } \exp \left( -C_{1} \varepsilon ^{2} n^{1-2 \gamma -\kappa }\right) +n^{1+\kappa } \exp \left( -C_{2} n^{\gamma }\right) \right\} , \end{aligned} \end{aligned}$$

for some positive constants $C_{1}$ and $C_{2}.$ The convergence rate of ${\widehat{\tau }}_{k}^*$ is now achieved by

$$\begin{aligned} \begin{aligned}&\mathrm {P}\left\{ \max _{1 \le k \le p}\left| {\widehat{\tau }}_{k}^*-\tau _{k}\right| \ge \varepsilon \right\} \\&\quad \le ~ p \mathrm {P}\left\{ \left| {\widehat{\tau }}_{k}^*-\tau _{k}\right| \ge \varepsilon \right\} \\&\quad = ~ p O\left\{ n^{\kappa } \exp \left( -c_{1} \varepsilon ^{2} n^{1-2 \gamma -2 \kappa } / 2\right) +n^{1+\kappa } \exp \left( -C_{2} n^{\gamma }\right) \right\} . \end{aligned} \end{aligned}$$

Let $\varepsilon =c n^{-\varsigma }$, where $\varsigma $ satisfies that $0 \le \varsigma <1 / 2$. We thus have

$$\begin{aligned} \mathrm {P}\left\{ \max _{1 \le k \le p}\left| {\widehat{\tau }}_{k}^*-\tau _{k}\right| \ge c n^{-\varsigma }\right\} =p O\left\{ n^{\kappa } \exp \left( -C_{1} n^{1-2 \gamma -2 \kappa -2 \varsigma }\right) +n^{1+\kappa } \exp \left( -C_{2} n^{\gamma }\right) \right\} . \end{aligned}$$

Now, we deal with the second part of Theorem 2. If ${\mathcal {A}} \subsetneq \hat{{\mathcal {A}}}$, then there must be some $k \in {\mathcal {A}}$ such that ${\widehat{\tau }}_{k}^*<c n^{-\varsigma }$. It follows from Condition (C3) that $\left| {\widehat{\tau }}_{k}^*-\tau _{k}\right| >c n^{-\varsigma }$, for some $k \in {\mathcal {A}}$, indicating that the events satisfy $\{{\mathcal {A}} \subsetneq \hat{{\mathcal {A}}}\} \subseteq \left\{ \left| {\widehat{\tau }}_{k}^*-\tau _{k}\right| >c n^{-\varsigma }\right. $, for some $\left. k \in {\mathcal {A}}\right\} .$ Hence $D_{n}=\left\{ \max _{k \in {\mathcal {A}}} \mid {\widehat{\tau }}_{k}^*-\right. $ $\left. \tau _{k} \mid \le c n^{-\varsigma }\right\} \subset \{{\mathcal {A}} \subset \hat{{\mathcal {A}}}\}.$ Consequently,

$$\begin{aligned} \begin{aligned} \mathrm {P}\{{\mathcal {A}} \subset \hat{{\mathcal {A}}}\}&=\mathrm {P}\left\{ D_{n}\right\} =1-\mathrm {P}\left\{ D_{n}^{c}\right\} =1-\mathrm {P}\left\{ \min _{k \in {\mathcal {A}}}\left| {\hat{\tau }}_{k}^*-\tau _{k}\right| \ge c n^{\varsigma }\right\} \\&=1-s_{n} \mathrm {P}\left\{ \left| {\widehat{\tau }}_{k}^*-\tau _{k}\right| \ge c n^{-\varsigma }\right\} \\&\ge 1-s_{n} O\left\{ n^{\kappa } \exp \left( -C_{1} n^{1-2 \gamma -2 \kappa -2 \varsigma }\right) +n^{1+\kappa } \exp \left( -C_{2} n^{\gamma }\right) \right\} . \end{aligned} \end{aligned}$$

where $s_{n}$ is the cardinality of ${\mathcal {A}}$. This proves the second result in Theorem 2. $\square $

Proof of Theorem 3

Define

$$\begin{aligned} \delta =\min _{j \in {\mathcal {A}}} \tau _{k}^*-\max _{j \in {\mathcal {I}}} \tau _{k}^*. \end{aligned}$$

Then, we have

$$\begin{aligned} \begin{aligned}&\mathrm {P}\left( \min _{k \in {\mathcal {A}}} {\tilde{\tau }}_{k}^* \le \max _{k \in {\mathcal {I}}} {\tilde{\tau }}_{k}^*\right) \\&\quad = ~\mathrm {P}\left( \min _{k \in {\mathcal {A}}} {\tilde{\tau }}_{k}^*-\min _{k \in {\mathcal {A}}} \tau _{k}^*+\delta \le \max _{k \in {\mathcal {I}}} {\tilde{\tau }}_{k}^*-\max _{j \in {\mathcal {I}}} \tau _{k}^*\right) \\&\quad \le ~ \mathrm {P}\left( \max _{j \in {\mathcal {I}}}\left| {\tilde{\tau }}_{k}^*-\tau _{k}^*\right| +\max _{k \in {\mathcal {A}}}\left| {\tilde{\tau }}_{k}^*-\tau _{k}^*\right| \ge \delta \right) \\&\quad \le ~ \mathrm {P}\left( \max _{1 \le k \le p}\left| {\tilde{\tau }}_{k}^*-\tau _{k}^*\right| \ge \delta / 2\right) \le \sum _{k=1}^{p} \mathrm {P}\left( \left| {\tilde{\tau }}_{k}^*-\tau _{k}^*\right| \ge \delta / 2\right) \\&\quad \le ~ p O\left\{ n^{\kappa } \exp \left( -c_{1} \delta ^{2} n^{1-2 \gamma -2 \kappa } / 8\right) +n^{1+\kappa } \exp \left( -C_{2} n^{\gamma }\right) \right\} . \end{aligned} \end{aligned}$$

The last term goes to 0 under condition (C2) as $n \rightarrow \infty $. $\square $

Rights and permissions

Reprints and permissions

About this article

Cite this article

Wang, P., Lin, L. Conditional characteristic feature screening for massive imbalanced data. Stat Papers 64, 807–834 (2023). https://doi.org/10.1007/s00362-022-01342-8

Download citation

Received: 27 September 2021
Revised: 30 April 2022
Accepted: 08 July 2022
Published: 25 July 2022
Issue Date: June 2023
DOI: https://doi.org/10.1007/s00362-022-01342-8

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Conditional characteristic feature screening for massive imbalanced data

Abstract

Access this article

Similar content being viewed by others

Imbalanced data preprocessing techniques for machine learning: a systematic mapping study

Estimating the sample mean and standard deviation from the sample size, median, range and/or interquartile range

Feature selection techniques for machine learning: a survey of more than two decades of research

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Appendix: Proof of main results

Proof of Theorem 1

Proof of Theorem 2

Proof of Theorem 3

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Conditional characteristic feature screening for massive imbalanced data

Abstract

Access this article

Similar content being viewed by others

Imbalanced data preprocessing techniques for machine learning: a systematic mapping study

Estimating the sample mean and standard deviation from the sample size, median, range and/or interquartile range

Feature selection techniques for machine learning: a survey of more than two decades of research

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Appendix: Proof of main results

Appendix: Proof of main results

Proof of Theorem 1

Proof of Theorem 2

Proof of Theorem 3

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation