Abstract
Huang et al. (J Bus Econ Stat 32:237–244, 2014) first proposed a Pearson Chi-Square based feature screening procedure tailored to the multi-classification problem with ultrahigh dimensional categorical covariates, a problem that is common in practice but has seldom been discussed in the literature. However, their work establishes the sure screening property only in a limited setting. Moreover, their p-value based adjustment for covariates involving different numbers of categories does not work well in several practical situations. In this paper, we propose an adjusted Pearson Chi-Square feature screening procedure and a modified method for tuning parameter selection. Theoretically, we establish the sure screening property of the proposed method in general settings. Empirically, the proposed method is more successful than Pearson Chi-Square feature screening in handling unequal numbers of covariate categories in finite samples. Results of three simulation studies and one real data analysis are presented. Our work, together with Huang et al. (J Bus Econ Stat 32:237–244, 2014), establishes a solid theoretical foundation and empirical evidence for the family of Pearson Chi-Square based feature screening methods.
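To make the procedure concrete, here is a minimal sketch (in Python, and entirely ours rather than the authors' code) of a Pearson Chi-Square screening step for categorical covariates. The function names are ours; the division by \(\log J_k\) is only an illustrative way to adjust for unequal category counts, suggested by the \(\log J_k\) terms appearing in the Appendix, and is not taken verbatim from the paper.

```python
import numpy as np

def chisq_stat(x, y):
    """Pearson Chi-Square association measure between a categorical
    covariate x and the class label y; the usual factor n is dropped
    since it does not affect the ranking across covariates."""
    stat = 0.0
    for r in np.unique(y):
        p_r = np.mean(y == r)                      # \hat{p}_r
        for j in np.unique(x):
            w_j = np.mean(x == j)                  # \hat{w}_j^{(k)}
            pi_rj = np.mean((y == r) & (x == j))   # \hat{\pi}_{r,j}^{(k)}
            stat += (pi_rj - p_r * w_j) ** 2 / (p_r * w_j)
    return stat

def screen(X, y, d):
    """Keep the d covariates with the largest adjusted statistics.
    Dividing by log J_k is an illustrative adjustment for unequal
    category counts, not the paper's exact formula; assumes each
    covariate has J_k >= 2 categories, as in the paper."""
    stats = np.array([
        chisq_stat(X[:, k], y) / np.log(len(np.unique(X[:, k])))
        for k in range(X.shape[1])
    ])
    return np.argsort(stats)[::-1][:d]
```

Here `X` is an \(n\times p\) array of categorical covariates, `y` the class label, and `d` the number of covariates retained; in the paper the size of the retained set is governed by a tuning parameter, whose selection is the subject of the modified method mentioned above.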
References
Cui HJ, Li RZ, Zhong W (2015) Model-free feature screening for ultrahigh dimensional discriminant analysis. J Am Stat Assoc 110:630–641
Fan JQ, Fan YY (2008) High dimensional classification using features annealed independence rules. Ann Stat 36:2605–2637
Fan JQ, Lv JC (2008) Sure independence screening for ultra-high dimensional feature space (with discussion). J R Stat Soc Ser B 70:849–911
Fan JQ, Song R (2010) Sure independent screening in generalized linear models with NP-dimensionality. Ann Stat 38:3567–3604
Fan JQ, Ma YB, Dai W (2014) Nonparametric independence screening in sparse ultra-high dimensional varying coefficient models. J Am Stat Assoc 109:1270–1284
He XM, Wang L, Hong HG (2013) Quantile-adaptive model-free variable screening for high-dimensional heterogeneous data. Ann Stat 41:342–369
Huang DY, Li RZ, Wang HS (2014) Feature screening for ultrahigh dimensional categorical data with applications. J Bus Econ Stat 32:237–244
Li RZ, Zhong W, Zhu LP (2012) Feature screening via distance correlation learning. J Am Stat Assoc 107:1129–1139
Mai Q, Zou H (2013) The Kolmogorov filter for variable screening in high-dimensional binary classification. Biometrika 100:229–234
Mai Q, Zou H (2015) The fused Kolmogorov filter: a nonparametric model-free screening method. Ann Stat 43:1471–1497
Ni L, Fang F (2016) Entropy-based model-free feature screening for ultrahigh-dimensional multiclass classification. J Nonparametr Stat 28:515–530
Pan R, Wang HS, Li RZ (2016) Ultrahigh dimensional multi-class linear discriminant analysis by pairwise sure independence screening. J Am Stat Assoc 111:169–179
Quinlan JR (1993) C4.5: programs for machine learning. Morgan Kaufmann, San Mateo
Wang HS (2009) Forward regression for ultra-high dimensional variable screening. J Am Stat Assoc 104:1512–1524
Zhu LP, Li LX, Li RZ, Zhu LX (2011) Model-free feature screening for ultrahigh dimensional data. J Am Stat Assoc 106:1464–1475
Acknowledgements
The authors would like to thank an anonymous reviewer and the Associate Editor for their helpful comments and suggestions. Lyu Ni’s research was partially supported by ECNU Fund for Short Term Overseas Academic Visit (40600-511232-16204/010/001). Fang Fang’s research was partially supported by Shanghai Natural Science Foundation (15ZR1410300), Shanghai Rising-Star Program (16QA1401700), National Natural Science Foundation of China (11601156), and the 111 Project (B14019).
Appendix
Lemma 1
For categorical \(X_k\), under conditions (C1) and (C3), we have
for any \(0<\varepsilon <1\), where \(c_{5}\) is a positive constant.
Proof
Note that
Since \(\log J_k \ge \log 2>0.5\), we have
Further,
where inequality (5) holds because \(\left| {\hat{p}}_r {\hat{w}}_j^{(k)}\right| \), \(\left| {\hat{\pi }}_{r,j}^{(k)}\right| \), \(\left| p_r w_j^{(k)}\right| \) and \(\left| \pi _{r,j}^{(k)}\right| \) all have upper bounds, inequality (6) holds because \(p_r\ge c_1/R\) and \(w_j^{(k)}\ge c_1/J_k\), and inequality (7) holds because \( \left| {\hat{p}}_r {\hat{w}}_j^{(k)}-p_r w_j^{(k)}\right| \le \left| {\hat{w}}_j^{(k)}- w_j^{(k)}\right| +\left| {\hat{p}}_r -p_r\right| \).
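For completeness, the bound used in inequality (7) is the standard add-and-subtract argument: since \(0\le {\hat{p}}_r\le 1\) and \(0\le w_j^{(k)}\le 1\),
\[
\left| {\hat{p}}_r {\hat{w}}_j^{(k)}-p_r w_j^{(k)}\right| \le {\hat{p}}_r\left| {\hat{w}}_j^{(k)}-w_j^{(k)}\right| +w_j^{(k)}\left| {\hat{p}}_r-p_r\right| \le \left| {\hat{w}}_j^{(k)}-w_j^{(k)}\right| +\left| {\hat{p}}_r-p_r\right| .
\]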
Similarly, since \(\left| \left( {\hat{p}}_r{\hat{w}}_j^{(k)}-{\hat{\pi }}_{r,j}^{(k)}\right) ^2\right| \le 2\left( {\hat{p}}_r^2 \left( {\hat{w}}_j^{(k)}\right) ^2+\left( {\hat{\pi }}_{r,j}^{(k)}\right) ^2\right) \le 4 \),
Therefore,
We deal with \(I_{1.3}\) first.
where inequality (13) holds due to Bernstein’s inequality. Similarly,
and
where inequalities (14), (16), (18) and (19) hold due to Bernstein’s inequality.
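For reference, the form of Bernstein’s inequality invoked throughout is the following: if \(Z_1,\ldots ,Z_n\) are i.i.d. with \(EZ_i=0\), \(\left| Z_i\right| \le M\) and \(\mathrm{Var}(Z_i)\le \sigma ^2\), then for any \(\varepsilon >0\),
\[
P\left( \left| \frac{1}{n}\sum _{i=1}^n Z_i\right| >\varepsilon \right) \le 2\exp \left\{ -\frac{n\varepsilon ^2}{2\sigma ^2+2M\varepsilon /3}\right\} .
\]
It is applied to centered indicators such as \(I(Y_i=r)-p_r\), which are bounded by 1, yielding bounds of the form \(\exp \{-cn\varepsilon ^2\}\) for \(0<\varepsilon <1\).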
Finally, combining inequalities (4), (8), (9), (10), (11), (13), (14), (16), (18) and (19), we have
for all \(k=1,\ldots , p\). \(\square \)
Proof of Theorem 1
By Lemma 1 and conditions (C2) and (C3), we have
where b is a positive constant. \(\square \)
Lemma 2
For any continuous covariate \(X_k\) satisfying conditions (C4) and (C5), let \(F_k(y,x)\) be the cumulative distribution function of \((Y,X_k)\) and \({\hat{F}}_{k}(y,x)\) be the empirical cumulative distribution function. We have
for any \(\varepsilon >0\), \(1\le r\le R\) and \(1\le j\le J_k\), where \({\hat{q}}_{k,(j)}\) and \(q_{k,(j)}\) are the sample and population \(j/J_k\) quantiles of \(X_k\), and \(c_6\) and \(c_7\) are two positive constants.
Proof
Details can be found in Ni and Fang (2016). \(\square \)
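For intuition, Lemma 2 controls both the empirical cumulative distribution function and the sample quantiles at which a continuous \(X_k\) is sliced into \(J_k\) categories before the categorical statistic is applied. A minimal sketch of this slicing step (again ours, not the authors' code, with `np.quantile` standing in for whatever quantile definition the paper uses):

```python
import numpy as np

def slice_by_quantiles(x, J):
    """Discretize a continuous covariate at its sample j/J quantiles
    \hat{q}_{k,(j)}, j = 1, ..., J-1, returning labels 0, ..., J-1 so
    that the categorical screening statistic can be applied."""
    cuts = np.quantile(x, np.arange(1, J) / J)   # sorted cut points
    return np.searchsorted(cuts, x, side="right")
```

Lemma 2 bounds how far \({\hat{q}}_{k,(j)}\) can drift from \(q_{k,(j)}\), which is what allows the categorical argument of Lemma 1 to carry over to sliced continuous covariates.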
Lemma 3
For any continuous \(X_k\), under conditions (C1), (C4) and (C5), we have
for any \(0<\varepsilon <1\), where \(c_{9}\) is a positive constant.
Proof
Since
we have
So
Further,
where inequality (20) holds by Lemma 2.
Similarly, \(P\left( I_{3.2}>\varepsilon /4\right) \), \(P\left( I_{3.3}>\varepsilon /4\right) \) and \(P\left( I_{3.4}>\varepsilon /4\right) \) are each bounded above by \(c_6R\cdot \exp \left\{ -c_7n^{1-2\rho }\varepsilon ^2/{(16R^2)}\right\} \). Then
Similarly,
Combining inequalities (12), (15) and (17) with inequalities (21) and (22), we have
and meanwhile the arguments (3), (4), (8), (9), (10), (11), (14) and (19) still hold. Therefore,
\(\square \)
Proof of Theorem 2
With Lemma 3 and condition (C2), the proof of Theorem 2 is identical to that of Theorem 1 and hence is omitted. \(\square \)
Cite this article
Ni, L., Fang, F. & Wan, F. Adjusted Pearson Chi-Square feature screening for multi-classification with ultrahigh dimensional data. Metrika 80, 805–828 (2017). https://doi.org/10.1007/s00184-017-0629-9