Abstract
This paper is concerned with the selection of variables in two-group discriminant analysis with a common covariance matrix. We propose a test-based method (TM) that draws on the significance of each variable. Sufficient conditions for the test-based method to be consistent are provided when both the dimension and the sample size are large. For the case in which the dimension is larger than the sample size, a ridge-type method is proposed. Our results, and the tendencies they suggest, are explored numerically through a Monte Carlo simulation. We point out that our selection method can be applied to high-dimensional data.
References
Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In B. N. Petrov & F. Csáki (Eds.), 2nd International Symposium on Information Theory (pp. 267–281). Budapest: Akadémiai Kiadó.
Clemmensen, L., Hastie, T., Witten, D. M., & Ersbøll, B. (2011). Sparse discriminant analysis. Technometrics, 53, 406–413.
Fujikoshi, Y. (1985). Selection of variables in two-group discriminant analysis by error rate and Akaike’s information criteria. Journal of Multivariate Analysis, 17, 27–37.
Fujikoshi, Y. (2000). Error bounds for asymptotic approximations of the linear discriminant function when the sample size and dimensionality are large. Journal of Multivariate Analysis, 73, 1–17.
Fujikoshi, Y., & Sakurai, T. (2016). High-dimensional consistency of rank estimation criteria in multivariate linear model. Journal of Multivariate Analysis, 149, 199–212.
Fujikoshi, Y., Ulyanov, V. V., & Shimizu, R. (2010). Multivariate statistics: high-dimensional and large-sample approximations. Hoboken, NJ: Wiley.
Fujikoshi, Y., Sakurai, T., & Yanagihara, H. (2014). Consistency of high-dimensional AIC-type and \(C_p\)-type criteria in multivariate linear regression. Journal of Multivariate Analysis, 144, 184–200.
Hao, N., Dong, B., & Fan, J. (2015). Sparsifying the Fisher linear discriminant by rotation. Journal of the Royal Statistical Society: Series B, 77, 827–851.
Hyodo, M., & Kubokawa, T. (2014). A variable selection criterion for linear discriminant rule and its optimality in high dimensional and large sample data. Journal of Multivariate Analysis, 123, 364–379.
Ito, T., & Kubokawa, T. (2015). Linear ridge estimator of high-dimensional precision matrix using random matrix theory. Discussion Paper Series, CIRJE-F-995.
Kubokawa, T., & Srivastava, M. S. (2012). Selection of variables in multivariate regression models for large dimensions. Communications in Statistics - Theory and Methods, 41, 2465–2489.
McLachlan, G. J. (1976). A criterion for selecting variables for the linear discriminant function. Biometrics, 32, 529–534.
Nishii, R., Bai, Z. D., & Krishnaiah, P. R. (1988). Strong consistency of the information criterion for model selection in multivariate analysis. Hiroshima Mathematical Journal, 18, 451–462.
Rao, C. R. (1973). Linear statistical inference and its applications (2nd ed.). New York: Wiley.
Sakurai, T., Nakada, T., & Fujikoshi, Y. (2013). High-dimensional AICs for selection of variables in discriminant analysis. Sankhya, Series A, 75, 1–25.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461–464.
Tiku, M. (1985). Noncentral chi-square distribution. In S. Kotz & N. L. Johnson (Eds.), Encyclopedia of Statistical Sciences, vol. 6 (pp. 276–280). New York: Wiley.
Van Wieringen, W. N., & Peeters, C. F. (2016). Ridge estimation of inverse covariance matrices from high-dimensional data. Computational Statistics & Data Analysis, 103, 284–303.
Witten, D. M., & Tibshirani, R. (2011). Penalized classification using Fisher’s linear discriminant. Journal of the Royal Statistical Society: Series B, 73, 753–772.
Yamada, T., Sakurai, T., & Fujikoshi, Y. (2017). High-dimensional asymptotic results for EPMCs of W- and Z-rules. Hiroshima Statistical Research Group, 17–12.
Yanagihara, H., Wakaki, H., & Fujikoshi, Y. (2015). A consistency property of the AIC for multivariate linear models when the dimension and the sample size are large. Electronic Journal of Statistics, 9, 869–897.
Zhao, L. C., Krishnaiah, P. R., & Bai, Z. D. (1986). On determination of the number of signals in presence of white noise. Journal of Multivariate Analysis, 20, 1–25.
Acknowledgements
We thank two referees for careful reading of our manuscript and many helpful comments which improved the presentation of this paper. The first author’s research is partially supported by the Ministry of Education, Science, Sports, and Culture, a Grant-in-Aid for Scientific Research (C), 16K00047, 2016–2018.
Appendix: Proofs of Theorems 1, 2 and 3
1.1 Preliminary lemmas
First, we study distributional results related to the test statistics \(\mathrm{T}_{d,i}\) in (5). For notational simplicity, consider a decomposition of \({\varvec{y}}=({\varvec{y}}_1', {\varvec{y}}_2')'\), where \({\varvec{y}}_1\) is \(p_1\times 1\) and \({\varvec{y}}_2\) is \(p_2 \times 1\). Similarly, decompose \(\varvec{\beta }=(\varvec{\beta }_1', \varvec{\beta }_2')'\), and
Let \(\lambda\) be the likelihood ratio criterion for testing the hypothesis \(\varvec{\beta }_2={\varvec{0}}\). Then
where \(g=\left\{ (n_1n_2)/n\right\} ^{1/2}\). The following lemma (see, e.g., Fujikoshi et al. 2010) is used.
Lemma 1
Let \(D_1\) and \(D\) be the sample Mahalanobis distances based on \({\varvec{y}}_1\) and \({\varvec{y}}\), respectively. Let \(D_{2\cdot 1}^2=D^2-D_1^2\). Similarly, the corresponding population quantities are expressed as \(\varDelta _1\), \(\varDelta\) and \(\varDelta _{2\cdot 1}^2\). Then, it holds that
Here, \(\chi _{p_1}^2(\cdot )\), \(\chi _{n-p_1-1}^2\), \(\chi _{p_2}^2(\cdot )\), and \(\chi _{n-p-1}^2\) are independent Chi-square variates.
Related to the conditional distribution of the right-hand side of (3) with \(p_2=1\) and \(m=n-p-1\) in Lemma 1, consider the random variable defined by
where \(\chi _1^2(\lambda ^2)\) and \(\chi _m^2\) are independent. We can express V as
in terms of the centralized variables \(U_1\) and \(U_2\) defined by
It is well known (see, e.g., Tiku 1985) that
Furthermore
These give the first four moments of V. In particular, we use the following results.
Lemma 2
Let V be the random variable defined by (14). Suppose that \(\lambda ^2=\mathrm{O}(m)\). Then
1.2 Proof of Theorem 1
First, we show “\(\mathrm{[F1]} \rightarrow 0\)”. Let \(i \in j_*\). Then, \((-i) \notin {{{\mathcal {F}}}}_+\), and hence
where \(R_i = \chi _{p-1}^2 (g^2 \varDelta _{(-i)}^2)\left\{ \chi _{n-p}^2\right\} ^{-1}\). Here, since \(j_*\) is finite, by showing
we obtain \(P (\mathrm{T}_{d, i} \le 0) \rightarrow 0\), and hence, “\(\mathrm{[F1]} \rightarrow 0\)”. It is easily seen that
where \(g=\{(n_1n_2)/n\}^{1/2}\) and “\(\sim\)” denotes asymptotic equivalence, and hence
Therefore, we obtain
which implies our assertion.
Next, we show “\(\mathrm{[F2]} \rightarrow 0\)”. For any \(i \notin j_*\), \(\varDelta ^2=\varDelta _{(-i)}^2\). Therefore, using Lemma 1(3), we have
whose distribution does not depend on i. Here, \(\chi _1^2\) and \(\chi _{n-p-1}^2\) are independent Chi-square variates with 1 and \(n-p-1\) degrees of freedom. This implies that
Noting that \(\mathrm{E}[ \chi _1^2/ \chi _{n-p-1}^2 ] = (n-p-3)^{-1}\), let
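The expectation quoted above follows from the independence of the two variates, \(\mathrm{E}[\chi _1^2]=1\) and \(\mathrm{E}[1/\chi _m^2]=(m-2)^{-1}\) for \(m>2\), with \(m=n-p-1\). A minimal Monte Carlo sketch of this identity is given below; the function names, the sample sizes \(n=50\), \(p=29\), the number of draws, and the seed are illustrative assumptions and do not come from the paper.

```python
import random


def mc_mean_ratio(m, n_draws=200_000, seed=0):
    """Monte Carlo estimate of E[chi2_1 / chi2_m] for independent variates."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        # chi2_1 is the square of a standard normal variate.
        num = rng.gauss(0.0, 1.0) ** 2
        # chi2_m is Gamma distributed with shape m/2 and scale 2.
        den = rng.gammavariate(m / 2.0, 2.0)
        total += num / den
    return total / n_draws


def exact_mean_ratio(n, p):
    """Closed form (n - p - 3)^{-1}; requires n - p - 1 > 2."""
    return 1.0 / (n - p - 3)


# Hypothetical sizes chosen only for illustration.
n, p = 50, 29
estimate = mc_mean_ratio(n - p - 1)
exact = exact_mean_ratio(n, p)
print(f"Monte Carlo: {estimate:.4f}, closed form: {exact:.4f}")
```

The simulation uses \(m=20\) so that the ratio has a finite variance (which needs \(m>4\)); for smaller \(m\) the Monte Carlo average converges much more slowly.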
Then, since \(e^{d/n} - 1 - \frac{1}{n-p-3}>h\)
Furthermore, using Markov's inequality, we have
Furthermore, it is easily seen that
using, e.g., Theorem 16.2.2 in Fujikoshi et al. (2010). When \(h = O(n^{-a})\)
Choosing \(\ell\) such that \(\ell > (1-a)^{-1}\), we have “\(\mathrm{[F2]} \rightarrow 0\)”.
1.3 Proof of Theorem 2
First, note that in the proof of “\(\mathrm{[F2]} \rightarrow 0\)” in Theorem 1, Assumption A3 is not used. This implies the assertion “\(\mathrm{[F2]} \rightarrow 0\)” in Theorem 2.
Now, we show “\(\mathrm{[F1]} \rightarrow 0\)” when \(p_*=\mathrm{O}(p)\) and \(\varDelta ^2=\mathrm{O}(p)\). In this case, \(p_*\) tends to \(\infty\). Based on the proof of Theorem 1, we can express \(\mathrm{T}_{d, i}\) for \(i \in j_*\) as
where \({\widehat{\lambda }}_i^2=g^2\varDelta _{\{i\} \cdot (-i)}^2 (1+R_i)^{-1}\) and \(R_i = \chi _{p-1}^2 (g^2 \varDelta _{(-i)}^2)\left\{ \chi _{n-p}^2\right\} ^{-1}\). Note that \(\chi _1^2\) and \(\chi _{n-p-1}^2\) are independent of \(R_i\), and hence of \({\widehat{\lambda }}_i^2\). Then, we have
where
Considering the conditional distribution of the right-hand side in (17), we have
where
Here
Using Assumption A6, it can be seen that
Now, we consider the probability \(P({\widetilde{V}} \le {\widetilde{h}})\) when \(\lambda _i^2=\lambda _{i0}^2\). From the assumption \(r < b\), for large n, \({\widetilde{h}} < 0\). Therefore, for large n, we have
From Lemma 2, \(\mathrm{E}({\widetilde{V}}^4)=\mathrm{O}(n^{-2})\). Noting that \({\widetilde{h}}=\mathrm{O}(n^{-(1-b)})\), we have
whose order is \(\mathrm{O}(n^{-(1+3\delta )})\) if we choose b as \(b > (3/4)(1+\delta )\). Therefore, we have \(P(\mathrm{T}_{d,i} \le 0)=\mathrm{O}(n^{-(1+3\delta )})\), which implies “\(\mathrm{[F1]} \rightarrow 0\)”.
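The stated order can be checked by direct arithmetic. The following is only a sketch: it assumes that the bound comes from Markov's inequality applied to the fourth moment (valid here because \({\widetilde{h}}<0\) for large n) and that \(|{\widetilde{h}}|\) is of exact order \(n^{-(1-b)}\).

\[
P({\widetilde{V}} \le {\widetilde{h}}) \le P(|{\widetilde{V}}| \ge |{\widetilde{h}}|) \le \frac{\mathrm{E}({\widetilde{V}}^4)}{{\widetilde{h}}^4} = \mathrm{O}(n^{-2}) \cdot \mathrm{O}(n^{4(1-b)}) = \mathrm{O}(n^{-(4b-2)}),
\]

\[
b > \tfrac{3}{4}(1+\delta ) \ \Longrightarrow \ 4b-2 > 1+3\delta .
\]

If, in addition, the number of true variables satisfies \(p_*=\mathrm{O}(n)\) (an assumption made here for illustration), summing the bound over \(i \in j_*\) gives a total of order \(\mathrm{O}(n \cdot n^{-(1+3\delta )})=\mathrm{O}(n^{-3\delta })\), which tends to zero for \(\delta >0\).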
1.4 Proof of Theorem 3
The assertion “\(\mathrm{[F1]} \rightarrow 0\)” follows from the proof of “\(\mathrm{[F1]} \rightarrow 0\)” in Theorem 1. For a proof of “\(\mathrm{[F2]} \rightarrow 0\)”, it is enough to show that
since p has been fixed. From (16), the limiting distribution of \(\mathrm{T}_{d,i}\) is “\(\chi _1^2-d\)”. This implies “\(\mathrm{[F2]} \rightarrow 0\)”.
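For a fixed threshold \(d\), the limiting distribution “\(\chi _1^2-d\)” gives \(P(\mathrm{T}_{d,i}>0) \rightarrow P(\chi _1^2>d)\), which is positive; the overspecification probability can therefore vanish only if \(d\) grows with n, which is presumably what the theorem's conditions on \(d\) ensure (they are not restated in this appendix). The sketch below evaluates this tail probability exactly through \(P(\chi _1^2>d)=\mathrm{erfc}(\sqrt{d/2})\); the function name and the particular values of \(d\) are illustrative assumptions.

```python
import math


def chi2_1_tail(d):
    """P(chi2_1 > d) for d >= 0, using chi2_1 = Z^2 with Z standard normal."""
    return math.erfc(math.sqrt(d / 2.0))


# Hypothetical thresholds: a constant d = 2 and two log-type choices d = log(n).
thresholds = [
    ("d = 2", 2.0),
    ("d = log(100)", math.log(100)),
    ("d = log(10000)", math.log(10000)),
]
for label, d in thresholds:
    print(f"{label:>15}: P(chi2_1 > d) = {chi2_1_tail(d):.4f}")
```

With a constant \(d\) the tail probability stays bounded away from zero, whereas it decreases as \(d\) increases, in line with the requirement that \(d \rightarrow \infty\) for “\(\mathrm{[F2]} \rightarrow 0\)”.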
Cite this article
Fujikoshi, Y., Sakurai, T. Consistency of test-based method for selection of variables in high-dimensional two-group discriminant analysis. Jpn J Stat Data Sci 2, 155–171 (2019). https://doi.org/10.1007/s42081-019-00032-4
DOI: https://doi.org/10.1007/s42081-019-00032-4