Abstract
Feature screening is commonly used to handle ultrahigh-dimensional data prior to conducting a formal data analysis. While various feature screening methods have been developed in the literature, research gaps still exist. The existing methods usually make an implicit assumption that data are accurately measured. This requirement, however, is frequently violated in applications. In this chapter, we consider error-prone ultrahigh-dimensional survival data and propose a robust feature screening method. We develop an iteration algorithm to improve the performance of retaining all informative covariates, and we establish theoretical results for the proposed method. Simulation studies assess its performance, and an application to a mantle cell lymphoma microarray dataset illustrates its use.
References
Carroll, R. J., Ruppert, D., Stefanski, L. A., & Crainiceanu, C. M. (2006). Measurement Error in Nonlinear Models. New York: CRC Press.
Chen, L.-P. (2019). Iterated feature screening based on distance correlation for ultrahigh-dimensional censored data with covariates measurement error. arXiv:1901.01610.
Chen, J., & Chen, Z. (2008). Extended Bayesian information criteria for model selection with large model spaces. Biometrika, 95, 759–771.
Chen, L.-P., & Yi, G. Y. (2020). Model selection and model averaging for analysis of truncated and censored data with measurement error. Electronic Journal of Statistics, 14, 4054–4109.
Chen, L.-P., & Yi, G. Y. (2021a). Analysis of noisy survival data with graphical proportional hazards measurement error models. Biometrics. https://doi.org/10.1111/biom.13331
Chen, L.-P., & Yi, G. Y. (2021b). Semiparametric methods for left-truncated and right-censored survival data with covariate measurement error. Annals of the Institute of Statistical Mathematics, 73, 481–517. https://doi.org/10.1007/s10463-020-00755-2
Chen, X., Chen, X., & Wang, H. (2018). Robust feature screening for ultra-high dimensional right censored data via distance correlation. Computational Statistics and Data Analysis, 119, 118–138.
Cui, H., Li, R., & Zhong, W. (2015). Model-free feature screening for ultrahigh dimensional discriminant analysis. Journal of the American Statistical Association, 110, 630–641.
Dreier, I., & Kotz, S. (2002). A note on the characteristic function of the t-distribution. Statistics and Probability Letters, 57, 221–224.
Fan, J., & Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space (with discussion). Journal of the Royal Statistical Society, Series B, 70, 849–911.
Fan, J., & Song, R. (2010). Sure independence screening in generalized linear models with NP-dimensionality. The Annals of Statistics, 38, 3567–3604.
Fan, J., Samworth, R., & Wu, Y. (2009). Ultrahigh dimensional feature selection: beyond the linear model. Journal of Machine Learning Research, 10, 1829–1853.
Fan, J., Feng, Y., & Wu, Y. (2010). Ultrahigh dimensional variable selection for Cox’s proportional hazards model. IMS Collect, 6, 70–86.
Földes, A., & Rejtö, L. (1981). A LIL type result for the product limit estimator. Z. Wahrscheinlichkeitstheorie verw. Gebiete, 56, 75–86.
Hall, P., & Miller, H. (2009). Using generalized correlation to effect variable selection in very high dimensional problems. Journal of Computational and Graphical Statistics, 18, 533–550.
Hao, M., Lin, Y., Liu, X., & Tang, W. (2019). Robust feature screening for high-dimensional survival data. Journal of Applied Statistics, 46, 979–994.
Isaev, M., & McKay, B. D. (2016). On a bound of Hoeffding in the complex case. Electronic Communications in Probability, 21, 1–7.
Li, R., Zhong, W., & Zhu, L. (2012). Feature screening via distance correlation learning. Journal of the American Statistical Association, 107, 1129–1139.
Marsden, J. E., & Hoffman, M. J. (1999). Basic complex analysis. New York: W. H. Freeman.
Rosenwald, A., Wright, G., Chan, W. C., Connors, J. M., Campo, E., Fisher, R. I., Gascoyne, R. D., Muller-Hermelink, H. K., Smeland, E. B., & Staudt, L. M. (2003). The proliferation gene expression signature is a quantitative integrator of oncogenic events that predicts survival in mantle cell lymphoma. Cancer Cell, 3, 185–197.
Song, R., Lu, W., Ma, S., & Jeng, X. (2014). Censored rank independence screening for high-dimensional survival data. Biometrika, 101, 799–814.
Székely, G. J., Rizzo, M. L., & Bakirov, N. K. (2007). Measuring and testing dependence by correlation of distances. The Annals of Statistics, 35, 2769–2794.
Wand, M. P., & Jones, M. C. (1995). Kernel Smoothing. London: Chapman & Hall.
Xue, J., & Liang, F. (2017). A robust model free feature screening method for ultrahigh dimensional data. Journal of Computational and Graphical Statistics, 26, 803–813.
Yan, X., Tang, N., & Zhao, X. (2017). The Spearman rank correlation screening for ultrahigh dimensional censored data. arXiv:1702.02708v1.
Yi, G. Y. (2017). Statistical Analysis with Measurement Error and Misclassification: Strategy, Method and Application. Springer.
Yi, G. Y., He, W., & Carroll, R. J. (2021). Feature screening with large-scale and high-dimensional survival data. Biometrics. https://doi.org/10.1111/biom.13479
Yi, G. Y., Ma, Y., Spiegelman, D., & Carroll, R. J. (2015). Functional and structural methods with mixed measurement error and misclassification in covariates. Journal of the American Statistical Association, 110, 681–696.
Zhang, J., Liu, Y., & Cui, H. (2020). Model-free feature screening via distance correlation for ultrahigh dimensional survival data. Statistical Papers. https://doi.org/10.1007/s00362-020-01210-3
Zhong, W., & Zhu, L. (2015). An iterative approach to distance correlation-based sure independence screening. Journal of Statistical Computation and Simulation, 85, 2331–2345.
Zhu, L., Li, L., Li, R., & Zhu, L. (2011). Model-free feature screening for ultrahigh-dimensional data. Journal of the American Statistical Association, 106, 1464–1475.
Acknowledgements
The authors thank the co-editors and a referee for their helpful comments on the initial version. This research was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC). Yi is Canada Research Chair in Data Science (Tier 1). Her research was undertaken, in part, thanks to funding from the Canada Research Chairs program.
Appendix
1.1 A. Technical Lemmas
In this appendix, we provide some lemmas that are useful to derive the main theorems. The first lemma is the probabilistic bound of the estimated survivor function.
Lemma 1
Let \(H(t) = P(Y_i > t)\) denote the survivor function of \(Y_i\), where \(Y_i = \min \{T_i, C_i\}\). Suppose that there is a finite time point τ such that H(τ) > η for a positive constant η. Then for \(\xi > 27 n^{-1} \eta^{-2}\), there exist positive constants \(\kappa_1\) and \(\kappa_2\) such that
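The display (A.1) did not survive this rendering. For orientation, exponential bounds of the Földes–Rejtö type for the product-limit estimator take the shape sketched below; the exact constants and form here are our hedged reconstruction, not necessarily the authors' statement:

```latex
P\left( \sup_{t \le \tau} \left| \widehat{H}(t) - H(t) \right| > \xi \right)
  \le \kappa_1 \exp\!\left( - \kappa_2 \, n \xi^2 \right).
```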
This lemma is Theorem 2 of Földes & Rejtö (1981). The second lemma is about the probabilistic bound of the estimator (16).
Lemma 2
Under regularity conditions (C1) and (C2), for any \(\xi^\ast > 0\), we have
for some positive constants G and \(\kappa_3\).
Proof
We first write
because \(f_{adj,j}(x)\) is merely the inverse-Fourier-transform representation of the density \(f_{X_{(j)}}(x)\), i.e., \(f_{adj,j}(x) - f_{X_{(j)}}(x) = 0\). Therefore, the remaining task is to examine \(\widehat {f}_{adj,j}(x) - f_{adj,j}(x)\). By (11) and (15), we have
Note that \(\phi _{X_{(j)}^\ast }(u) = E\left \{ \exp \left ( \mathbf {i} u X_{i(j)}^\ast \right ) \right \}\) and \(\widehat {\phi }_{X_{(j)}^\ast }(u)\) is given by (14); then
By Conditions (C1) and (C2) and the finiteness of \(\int _{-\infty }^\infty u^r K(u) du\) for all \(r \in \mathbb {N}\), applying the Taylor series expansion of the exponential function gives that \(\int _{-\infty }^\infty \exp \left ( \mathbf {i} u h z \right ) K(u) du = 1 + o( n^{-\frac {1}{5}} )\). Combining with (A.5) gives
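To spell out the Taylor-expansion step: assuming the kernel K integrates to one, is symmetric (so its odd moments vanish), and the bandwidth satisfies \(h \asymp n^{-1/5}\) (assumptions we read into Conditions (C1) and (C2)), expanding the exponential termwise gives

```latex
\int_{-\infty}^{\infty} \exp(\mathbf{i} u h z) K(u)\, du
  = \sum_{r=0}^{\infty} \frac{(\mathbf{i} h z)^{r}}{r!} \int_{-\infty}^{\infty} u^{r} K(u)\, du
  = 1 - \frac{h^{2} z^{2}}{2} \int_{-\infty}^{\infty} u^{2} K(u)\, du + \cdots
  = 1 + O(h^{2})
  = 1 + o\!\left( n^{-\frac{1}{5}} \right).
```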
Let \(Z_i = \exp \left ( \mathbf {i} u X_{i(j)}^\ast \right )\), which is a complex random variable. By Theorem 1.2 of Isaev and McKay (2016), we have
where G is some constant with \(G > \operatorname{diam} Z \triangleq \inf \left \{c \in \mathbb {R}^+ : P\left ( |Z_1 - Z_2| > c \right ) = 0 \right \}\). Note that \(\widehat {\phi }_{X_{(j)}^\ast }(u) = \frac {1}{n} \sum \limits _{i=1}^n Z_i\); then by (A.6), for any \(\xi_2 > 0\) and ν > 0,
where the third step is due to Markov’s inequality, the fourth step is by the independence of the \(X_i^\ast \), and the last step comes from (A.7) with \(Z_i\) replaced by \(\nu Z_i\), so that with a constant νG satisfying \(\nu G > \inf \{\nu c : P\left ( \nu |Z_1 - Z_2| > \nu c \right ) = 0 \}\), we have \( E \left \{ \exp \left (\nu | Z_i - E(Z_i) | \right ) \right \} \leq \exp \left ( \frac {\nu ^2 G^2}{8} \right )\).
To get the best upper bound, we take the right-hand side of (A.8) as a function of ν and then minimize it. Specifically, let \(\varphi (\nu ) = \frac {n \nu ^2 G^2}{8} -\nu n \xi _2 + o(n^{-\frac {1}{5}})\). Since φ(ν) is a quadratic function, it is easy to check that \(\nu ^\ast \triangleq \mathop {\operatorname {argmin}} \limits _\nu \varphi (\nu ) = \frac {4\xi _2}{G^2}\). Then replacing ν by ν ∗ in (A.8) yields
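The minimization can be checked directly: φ is a convex quadratic in ν, so setting its derivative to zero gives

```latex
\varphi'(\nu) = \frac{n \nu G^{2}}{4} - n \xi_2 = 0
  \;\Longrightarrow\;
  \nu^\ast = \frac{4 \xi_2}{G^{2}},
\qquad
\varphi(\nu^\ast)
  = \frac{n G^{2}}{8} \cdot \frac{16 \xi_2^{2}}{G^{4}}
    - \frac{4 \xi_2}{G^{2}}\, n \xi_2 + o\!\left( n^{-\frac{1}{5}} \right)
  = - \frac{2 n \xi_2^{2}}{G^{2}} + o\!\left( n^{-\frac{1}{5}} \right),
```

which matches the exponent \(-\frac{2n\xi_2^2}{G^2}\) appearing in the probability statement that follows.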
Moreover, by (A.4) and (A.9), we observe that with probability greater than \(1-\exp \left \{- \frac {2n \xi _2^2}{G^2} + o(n^{-\frac {1}{5}}) \right \}\),
where the first equality is due to (A.3), the last step comes from (A.9), and the improper integral \(\int _{-\infty }^\infty \frac {\exp \left ( -\mathbf {i}u x \right )}{\phi _{\epsilon _{(j)}}(u)} du\) is shown to converge to a finite value (e.g., Marsden & Hoffman 1999, Proposition 4.3.9).
In other words,
Specifying \(\xi ^\ast = \left ( \sup \limits _x \frac {1}{2\pi } \int _{-\infty }^\infty \frac {\exp \left ( -\mathbf {i}u x \right )}{\phi _{\epsilon _{(j)}}(u)} du \right ) \xi _2\) gives that
where \(\kappa _3 \triangleq \exp \left \{ \frac {2n \xi ^{\ast 2}}{G^2} - \frac {2n \xi ^2}{G^2} \right \}\), which is positive. Thus, by the definition of the cumulative distribution function and (16), we conclude the desired result (A.2). □
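The estimator analyzed in Lemma 2 is a deconvoluting kernel density estimator built from the empirical characteristic function of the error-prone observations. As a concrete illustration of this construction (not the authors' code: the Laplace error law, the kernel with Fourier transform \((1-t^2)^3\) on \([-1,1]\), and the bandwidth are our assumptions), a minimal sketch:

```python
import numpy as np

def deconv_density(w, x_grid, h, b):
    """Deconvoluting kernel density estimate of f_X from W = X + eps,
    where eps ~ Laplace(0, b) has characteristic function 1 / (1 + b^2 u^2).
    The kernel is specified through its Fourier transform (1 - t^2)^3 on
    [-1, 1], so the inversion integral is effectively over |u| <= 1/h."""
    u = np.linspace(-1.0 / h, 1.0 / h, 801)
    du = u[1] - u[0]
    phi_K = (1.0 - (h * u) ** 2) ** 3                    # FT of the kernel
    phi_hat = np.exp(1j * np.outer(u, w)).mean(axis=1)   # empirical char. function
    phi_eps = 1.0 / (1.0 + (b * u) ** 2)                 # Laplace error char. function
    integrand = np.exp(-1j * np.outer(x_grid, u)) * (phi_K * phi_hat / phi_eps)
    return np.real(integrand.sum(axis=1)) * du / (2.0 * np.pi)

rng = np.random.default_rng(1)
n, b, h = 500, 0.3, 0.4
x_true = rng.normal(size=n)                  # latent covariate
w = x_true + rng.laplace(scale=b, size=n)    # error-prone surrogate
grid = np.linspace(-6.0, 6.0, 241)
f_hat = deconv_density(w, grid, h, b)        # smoothed estimate of the N(0,1) density
```

Integrating f_hat over the grid should be close to one, since the estimator inherits ∫K = 1; the corresponding estimated distribution function, as in (16), follows by cumulative integration.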
1.2 B. Proofs of Main Theorems
1.2.1 B.1 Proof of Theorem 1
Part 1
We prove (19).
Since ω j and \(\widehat {\omega }_j\) are formulated in terms of dcov(⋅, ⋅) and the associated estimates, to show the desired result, it suffices to examine dcov(⋅, ⋅) and its estimates.
Let \(\omega _j^\ast \triangleq \widehat {\text{dcov}}(F_{X_{(j)}}(X_{(j)}),F(Y)) = \widetilde {M}_{j,1} + \widetilde {M}_{j,2} - 2 \widetilde {M}_{j,3}\), where \(\widetilde {M}_{j,k}\) with k = 1, 2, 3 has the same form as \(\widehat {M}_{j,k}\) in (4) with \(\widehat {F}_{X_{(j)}}(X_{(j)})\) and \(\widehat {F}(Y)\) replaced by \(F_{X_{(j)}}(X_{(j)})\) and F(Y ), respectively. Therefore, the difference between \(\widehat {\omega }_j\) and ω j can be expressed as
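For intuition about the three moment terms, the sample distance covariance of Székely et al. (2007) can be computed directly. The sketch below is a generic univariate implementation with placeholder inputs u and v (standing in for \(F_{X_{(j)}}(X_{(j)})\) and F(Y)); the function name and simulated data are ours, not the authors'.

```python
import numpy as np

def dcov2_moments(u, v):
    """Squared sample distance covariance written as M1 + M2 - 2*M3,
    the three moment terms of Szekely, Rizzo & Bakirov (2007)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    a = np.abs(u[:, None] - u[None, :])               # pairwise |u_i - u_k|
    b = np.abs(v[:, None] - v[None, :])               # pairwise |v_i - v_k|
    m1 = (a * b).mean()                               # M_{j,1}
    m2 = a.mean() * b.mean()                          # M_{j,2}
    m3 = (a.mean(axis=1) * b.mean(axis=1)).mean()     # M_{j,3}
    return m1 + m2 - 2.0 * m3

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)       # y depends on x
omega = dcov2_moments(x, y)              # large values flag an informative feature
```

The same quantity equals the mean of the elementwise product of the two doubly centered distance matrices, which provides a convenient correctness check on the moment decomposition.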
Similar to the derivation of Li et al. (2012), we can show that
for some positive constants \(\widetilde {c}_1\) and ξ.
On the other hand, we examine \( \widehat {\omega }_j - \omega _j^\ast \) by writing
Since the derivations of \( \widehat {M}_{j,2} - \widetilde {M}_{j,2}\) and \( \widehat {M}_{j,3} - \widetilde {M}_{j,3}\) are similar to those of \( \widehat {M}_{j,1} - \widetilde {M}_{j,1}\), we present the argument only for \( \widehat {M}_{j,1} - \widetilde {M}_{j,1}\).
By adding and subtracting \(\frac {1}{n^2} \sum \limits _{i=1}^n \sum \limits _{k=1}^n \Big \{\left | \widehat {F}_{adj,j}(X_{i(j)}) - \widehat {F}_{adj,j}(X_{k(j)}) \right | \left |F(Y_i) - \right .\) \(\left . F(Y_k) \right | \Big \}\), we obtain that
First, we examine \(S_1\). Since \(\widehat {F}_{adj,j}(x)\) is the estimated cumulative distribution function with \(0 \leq \widehat {F}_{adj,j}(x) \leq 1\), for any i and k we have that
By the triangle inequality, we have that
Then by (B.6), we have that
where the last step is due to Lemma 1 with ξ in the right-hand side of (A.1) replaced by \(\frac {\xi }{2}\). Therefore, combining (B.5) and (B.7) gives
Next, we examine \(S_2\) in the same manner. Parallel to (B.5), we have that
Similar to the arguments for (B.7), we obtain that
Therefore, combining (B.9) and (B.10) yields
Finally, combining (B.4), (B.8), and (B.11), the probabilistic bound of \(\widehat {M}_{j,1} - \widetilde {M}_{j,1}\) is given by
Furthermore, similar derivations show that
and
Noting that the upper bounds in (B.12)–(B.14) are dominated by \( \exp \left (- c^\ast n \xi ^2 \right )\) for some constant \(c^\ast\), we apply (B.12)–(B.14) to (B.3) and obtain that
for some \(\widetilde {c}_2 >0\). Thus, combining (B.2) and (B.15) with (B.1) and specifying \(\xi = c n^{-\zeta}\) for the constants c and ζ described in Condition (C4) yield the desired result.
Part 2
We prove (20).
Let \(J = \min \limits _{j \in \mathcal {I}} \big |\omega _j\big | - \max \limits _{j \in \mathcal {I}^c} \big |\omega _j \big | \). The left-hand side of (20) can be expressed as
where the last step comes from the result in Part 1 and Condition (C5). \(\hfill \square \)
1.2.2 B.2 Proof of Theorem 2
Similar to the derivations of Li et al. (2012), one can obtain that
It gives
where the last step comes from Theorem 1. □
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this chapter
Cite this chapter
Chen, LP., Yi, G.Y. (2022). Robust Feature Screening for Ultrahigh-Dimensional Censored Data Subject to Measurement Error. In: He, W., Wang, L., Chen, J., Lin, C.D. (eds) Advances and Innovations in Statistics and Data Science. ICSA Book Series in Statistics. Springer, Cham. https://doi.org/10.1007/978-3-031-08329-7_2
DOI: https://doi.org/10.1007/978-3-031-08329-7_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-08328-0
Online ISBN: 978-3-031-08329-7
eBook Packages: Mathematics and Statistics (R0)