Consistency of test-based method for selection of variables in high-dimensional two-group discriminant analysis


This paper is concerned with the selection of variables in two-group discriminant analysis with a common covariance matrix. We propose a test-based method (TM) drawing on the significance of each variable. Sufficient conditions for the test-based method to be consistent are provided when the dimension and the sample size are large. For the case that the dimension is larger than the sample size, a ridge-type method is proposed. Our results and the tendencies therein are explored numerically through a Monte Carlo simulation. It is pointed out that our selection method can be applied to high-dimensional data.



  1. Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In B. N. Petrov & F. Csáki (Eds.), 2nd International Symposium on Information Theory (pp. 267–281). Budapest: Akadémiai Kiadó.

  2. Clemmensen, L., Hastie, T., Witten, D. M., & Ersbøll, B. (2011). Sparse discriminant analysis. Technometrics, 53, 406–413.

  3. Fujikoshi, Y. (1985). Selection of variables in two-group discriminant analysis by error rate and Akaike’s information criteria. Journal of Multivariate Analysis, 17, 27–37.

  4. Fujikoshi, Y. (2000). Error bounds for asymptotic approximations of the linear discriminant function when the sample size and dimensionality are large. Journal of Multivariate Analysis, 73, 1–17.

  5. Fujikoshi, Y., & Sakurai, T. (2016). High-dimensional consistency of rank estimation criteria in multivariate linear model. Journal of Multivariate Analysis, 149, 199–212.

  6. Fujikoshi, Y., Ulyanov, V. V., & Shimizu, R. (2010). Multivariate statistics: High-dimensional and large-sample approximations. Hoboken, NJ: Wiley.

  7. Fujikoshi, Y., Sakurai, T., & Yanagihara, H. (2014). Consistency of high-dimensional AIC-type and \(C_p\)-type criteria in multivariate linear regression. Journal of Multivariate Analysis, 144, 184–200.

  8. Hao, N., Dong, B., & Fan, J. (2015). Sparsifying the Fisher linear discriminant by rotation. Journal of the Royal Statistical Society: Series B, 77, 827–851.

  9. Hyodo, M., & Kubokawa, T. (2014). A variable selection criterion for linear discriminant rule and its optimality in high dimensional and large sample data. Journal of Multivariate Analysis, 123, 364–379.

  10. Ito, T., & Kubokawa, T. (2015). Linear ridge estimator of high-dimensional precision matrix using random matrix theory. Discussion Paper Series, CIRJE-F-995.

  11. Kubokawa, T., & Srivastava, M. S. (2012). Selection of variables in multivariate regression models for large dimensions. Communications in Statistics - Theory and Methods, 41, 2465–2489.

  12. McLachlan, G. J. (1976). A criterion for selecting variables for the linear discriminant function. Biometrics, 32, 529–534.

  13. Nishii, R., Bai, Z. D., & Krishnaiah, P. R. (1988). Strong consistency of the information criterion for model selection in multivariate analysis. Hiroshima Mathematical Journal, 18, 451–462.

  14. Rao, C. R. (1973). Linear statistical inference and its applications (2nd ed.). New York: Wiley.

  15. Sakurai, T., Nakada, T., & Fujikoshi, Y. (2013). High-dimensional AICs for selection of variables in discriminant analysis. Sankhya, Series A, 75, 1–25.

  16. Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461–464.

  17. Tiku, M. (1985). Noncentral chi-square distribution. In S. Kotz & N. L. Johnson (Eds.), Encyclopedia of Statistical Sciences, vol. 6 (pp. 276–280). New York: Wiley.

  18. Van Wieringen, W. N., & Peeters, C. F. (2016). Ridge estimation of inverse covariance matrices from high-dimensional data. Computational Statistics & Data Analysis, 103, 284–303.

  19. Witten, D. M., & Tibshirani, R. (2011). Penalized classification using Fisher’s linear discriminant. Journal of the Royal Statistical Society: Series B, 73, 753–772.

  20. Yamada, T., Sakurai, T., & Fujikoshi, Y. (2017). High-dimensional asymptotic results for EPMCs of W- and Z-rules. Hiroshima Statistical Research Group, 17–12.

  21. Yanagihara, H., Wakaki, H., & Fujikoshi, Y. (2015). A consistency property of the AIC for multivariate linear models when the dimension and the sample size are large. Electronic Journal of Statistics, 9, 869–897.

  22. Zhao, L. C., Krishnaiah, P. R., & Bai, Z. D. (1986). On determination of the number of signals in presence of white noise. Journal of Multivariate Analysis, 20, 1–25.



Acknowledgements

We thank two referees for careful reading of our manuscript and many helpful comments which improved the presentation of this paper. The first author’s research is partially supported by the Ministry of Education, Science, Sports, and Culture, a Grant-in-Aid for Scientific Research (C), 16K00047, 2016–2018.

Author information



Corresponding author

Correspondence to Yasunori Fujikoshi.


Appendix: Proofs of Theorems 1, 2 and 3


Preliminary lemmas

First, we study distributional results related to the test statistics \(\mathrm{T}_{d,i}\) in (5). For notational simplicity, consider a decomposition of \({\varvec{y}}=({\varvec{y}}_1', {\varvec{y}}_2')', \ {\varvec{y}}_1; \ p_1\times 1, \ {\varvec{y}}_2; \ p_2 \times 1\). Similarly, decompose \(\varvec{\beta }=(\varvec{\beta }_1', \varvec{\beta }_2')'\), and

$$\begin{aligned} {\mathsf {S}}= \left( \begin{array}{cc} {\mathsf {S}}_{11} &{} {\mathsf {S}}_{12} \\ {\mathsf {S}}_{21} &{} {\mathsf {S}}_{22} \end{array} \right) , \quad {\mathsf {S}}_{12}; \ p_1 \times p_2. \end{aligned}$$

Let \(\lambda\) be the likelihood ratio criterion for testing the hypothesis \(\varvec{\beta }_2={\varvec{0}}\); then

$$\begin{aligned} -2 \log \lambda = n \log \left\{ 1 + \frac{g^2 (D^2 - D_1^2)}{n-2 + g^2 D_1^2} \right\} , \end{aligned}$$

where \(g=\left\{ (n_1n_2)/n\right\} ^{1/2}\). The following lemma (see, e.g., Fujikoshi et al. 2010) is used.

Lemma 1

Let \(D_1\) and \(D\) be the sample Mahalanobis distances based on \({\varvec{y}}_1\) and \({\varvec{y}}\), respectively. Let \(D_{2\cdot 1}^2=D^2-D_1^2\). Similarly, the corresponding population quantities are expressed as \(\varDelta _1\), \(\varDelta\) and \(\varDelta _{2\cdot 1}^2\). Then, it holds that

$$\begin{aligned}&\mathrm{(1)} \ D_1^2=(n-2)g^{-2}R, \quad R=\chi _{p_1}^2(g^2\varDelta _1^2)\left\{ \chi _{n-p_1-1}^2\right\} ^{-1}. \\&\mathrm{(2)} \ D_{2\cdot 1}^2 = (n-2) g^{-2} \chi _{p_2}^2 \left( g^2 \varDelta _{2\cdot 1}^2 \cdot \frac{1}{1+R} \right) \left\{ \chi _{n-p-1}^2\right\} ^{-1} (1+R).\\&\mathrm{(3)} \ \frac{g^2 (D^2 - D_1^2)}{n-2 + g^2 D_1^2} = \chi _{p_2}^2 ( g^2 \varDelta _{2\cdot 1}^2 (1+R)^{-1})\{\chi _{n-p-1}^2\}^{-1} \end{aligned}$$

Here, \(\chi _{p_1}^2(\cdot )\), \(\chi _{n-p_1-1}^2\), \(\chi _{p_2}^2(\cdot )\), and \(\chi _{n-p-1}^2\) are independent chi-square variates.
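As an informal numerical check of Lemma 1(1), the following Monte Carlo sketch (our illustration, not part of the paper; it takes \(\varSigma =I\), so that \(\varDelta _1^2=\Vert \varvec{\mu }_1-\varvec{\mu }_2\Vert ^2\)) compares the empirical mean of \(D_1^2\) with the mean implied by the chi-square representation, namely \(\mathrm{E}(D_1^2)=(n-2)g^{-2}(p_1+g^2\varDelta _1^2)/(n-p_1-3)\):

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, p1 = 30, 30, 5
n = n1 + n2
g2 = n1 * n2 / n                      # g^2 = n1 n2 / n
mu = np.zeros(p1); mu[0] = 1.0        # so Delta_1^2 = 1 under Sigma = I
delta2 = 1.0
reps = 3000

d2 = np.empty(reps)
for r in range(reps):
    x1 = rng.standard_normal((n1, p1)) + mu   # group 1 ~ N(mu, I)
    x2 = rng.standard_normal((n2, p1))        # group 2 ~ N(0, I)
    diff = x1.mean(0) - x2.mean(0)
    # pooled sample covariance with divisor n - 2
    s = ((n1 - 1) * np.cov(x1.T) + (n2 - 1) * np.cov(x2.T)) / (n - 2)
    d2[r] = diff @ np.linalg.solve(s, diff)   # sample Mahalanobis D_1^2

# mean implied by Lemma 1(1): E[R] = (p1 + g^2 Delta_1^2)/(n - p1 - 3)
theory = (n - 2) / g2 * (p1 + g2 * delta2) / (n - p1 - 3)
print(d2.mean(), theory)
```

For moderate group sizes, the empirical mean agrees with the theoretical value to within Monte Carlo error.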

Related to the conditional distribution of the right-hand side of (3) with \(p_2=1\) and \(m=n-p-1\) in Lemma 1, consider the random variable defined by

$$\begin{aligned} V=\frac{\chi _1^2(\lambda ^2)}{\chi _m^2}-\frac{1+\lambda ^2}{m-2}, \end{aligned}$$

where \(\chi _1^2(\lambda ^2)\) and \(\chi _m^2\) are independent. We can express V as

$$\begin{aligned} V=U_1U_2+(m-2)^{-1}U_1+(1+\lambda ^2)U_2, \end{aligned}$$

in terms of the centralized variables \(U_1\) and \(U_2\) defined by

$$\begin{aligned} U_1=\chi _1^2(\lambda ^2) -(1+\lambda ^2), \quad U_2=\frac{1}{\chi _m^2}-\frac{1}{m-2}. \end{aligned}$$
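The decomposition of \(V\) can be verified numerically for arbitrary positive realizations of the two chi-square variates (the numeric values below are illustrative placeholders, not draws from the actual distributions):

```python
m, lam2 = 12.0, 3.0
x, y = 4.7, 10.3  # stand-ins for realizations of chi_1^2(lam2) and chi_m^2

# direct definition of V in (14)
v_direct = x / y - (1 + lam2) / (m - 2)

# decomposition V = U1*U2 + U1/(m-2) + (1+lam2)*U2
u1 = x - (1 + lam2)
u2 = 1 / y - 1 / (m - 2)
v_decomp = u1 * u2 + u1 / (m - 2) + (1 + lam2) * u2

assert abs(v_direct - v_decomp) < 1e-12
```

Since the identity is algebraic, it holds for every pair of positive values, not only for these placeholders.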

It is well known (see, e.g., Tiku 1985) that

$$\begin{aligned}&\mathrm{E}(U_1)=0,\\&\mathrm{E}(U_1^2)=2(1+2\lambda ^2), \\&\mathrm{E}(U_1^3)=8(1+3\lambda ^2), \\&\mathrm{E}(U_1^4)=48(1+4\lambda ^2)+12(1+2\lambda ^2)^2, \end{aligned}$$

and
$$\begin{aligned} \mathrm{E}\left( U_2^k\right) =&\sum _{i=0}^k {}_k C_i \mathrm{E}\left\{ \left( \frac{1}{\chi _m^2}\right) ^i\right\} \left( -\frac{1}{m-2}\right) ^{k-i}\\ =&\sum _{i=1}^k {}_kC_i\frac{1}{(m-2) \cdots (m-2i)}\left( -\frac{1}{m-2}\right) ^{k-i} +\left( -\frac{1}{m-2}\right) ^k. \end{aligned}$$
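The expansion above can be implemented exactly with rational arithmetic; as a check, it returns \(\mathrm{E}(U_2)=0\) and \(\mathrm{E}(U_2^2)=2/\{(m-2)^2(m-4)\}\). The sketch below uses only the standard library; the function names are ours:

```python
from fractions import Fraction
from math import comb

def inv_chi2_moment(m, i):
    # E[(1/chi^2_m)^i] = 1 / ((m-2)(m-4)...(m-2i)), valid for m > 2i
    if i == 0:
        return Fraction(1)
    prod = Fraction(1)
    for j in range(1, i + 1):
        prod *= (m - 2 * j)
    return Fraction(1) / prod

def u2_moment(m, k):
    # E(U_2^k) via the binomial expansion displayed in the text
    c = Fraction(-1, m - 2)
    total = Fraction(0)
    for i in range(0, k + 1):
        total += comb(k, i) * inv_chi2_moment(m, i) * c ** (k - i)
    return total

m = 20
assert u2_moment(m, 1) == 0
assert u2_moment(m, 2) == Fraction(2, (m - 2) ** 2 * (m - 4))
```

Exact fractions avoid the cancellation error that floating point would introduce for large \(m\).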

These give the first four moments of V. In particular, we use the following results.

Lemma 2

Let \(V\) be the random variable defined by (14). Suppose that \(\lambda ^2=\mathrm{O}(m)\). Then

$$\begin{aligned}&\mathrm{E}(V)=0,\quad \mathrm{E}(V^2)=\frac{2\left\{ (m-1)(1+2\lambda ^2)+\lambda ^4\right\} }{(m-2)^2(m-4)}=\mathrm{O}(m^{-1}), \\&\mathrm{E}(V^3)=\mathrm{O}(m^{-2}), \quad \mathrm{E}(V^4)=\mathrm{O}(m^{-2}). \end{aligned}$$
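Since \(U_1\) and \(U_2\) are independent with \(\mathrm{E}(U_1)=\mathrm{E}(U_2)=0\), the decomposition gives \(\mathrm{E}(V^2)=\mathrm{E}(U_1^2)\mathrm{E}(U_2^2)+(m-2)^{-2}\mathrm{E}(U_1^2)+(1+\lambda ^2)^2\mathrm{E}(U_2^2)\). The sketch below (our illustration; the value \(\mathrm{E}(U_2^2)=2/\{(m-2)^2(m-4)\}\) is the \(k=2\) case of the expansion above) checks this moment against a direct simulation of \(V\):

```python
import numpy as np

rng = np.random.default_rng(0)
m, lam2 = 50, 5.0
reps = 200_000

# simulate V = chi_1^2(lam2)/chi_m^2 - (1+lam2)/(m-2) directly
x = rng.noncentral_chisquare(df=1, nonc=lam2, size=reps)
y = rng.chisquare(df=m, size=reps)
v = x / y - (1 + lam2) / (m - 2)

# exact E(V^2) via the moments of U1 and U2:
# E(U1^2) = 2(1 + 2 lam2), E(U2^2) = 2/((m-2)^2 (m-4))
e_u1sq = 2 * (1 + 2 * lam2)
e_u2sq = 2 / ((m - 2) ** 2 * (m - 4))
e_v2 = e_u1sq * e_u2sq + e_u1sq / (m - 2) ** 2 + (1 + lam2) ** 2 * e_u2sq
print(v.mean(), (v ** 2).mean(), e_v2)
```

With this many replications, the simulated mean of \(V\) is close to 0 and the simulated second moment agrees with the exact value to within sampling error.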

Proof of Theorem 1

First, we show “\(\mathrm{[F1]} \rightarrow 0\)”. Let \(i \in j_*\). Then, \((-i) \notin {{{\mathcal {F}}}}_+\), and hence

$$\begin{aligned} \varDelta _{(-i)}^2 < \varDelta ^2, \quad \varDelta _{ \{i\} \cdot (-i)}^2 > 0. \end{aligned}$$

Using (12) and Lemma 1(3), we have

$$\begin{aligned} \mathrm{T}_{d, i}= n \log \left\{ 1 + \frac{\chi _1^2 ( g^2 \varDelta _{\{i\} \cdot (-i)}^2 (1+R_i)^{-1})}{\chi _{n-p-1}^2} \right\} - d, \end{aligned}$$

where \(R_i = \chi _{p-1}^2 (g^2 \varDelta _{(-i)}^2)\left\{ \chi _{n-p}^2\right\} ^{-1}\). Here, since \(j_*\) is finite, by showing

$$\begin{aligned} \mathrm{T}_{d, i} \overset{p}{\rightarrow } t_i > 0 \quad \text {or} \quad \mathrm{T}_{d, i} \overset{p}{\rightarrow } \infty , \end{aligned}$$

we obtain \(P (\mathrm{T}_{d, i} \le 0) \rightarrow 0\), and hence, “\(\mathrm{[F1]} \rightarrow 0\)”. It is easily seen that

$$\begin{aligned} R_i \sim \frac{p+g^2 \varDelta _{(-i)}^2}{n-p}, \end{aligned}$$

where \(g=\{(n_1n_2)/n\}^{1/2}\) and “\(\sim\)” denotes asymptotic equivalence, and hence

$$\begin{aligned} (1+R_i)^{-1} \sim \frac{n-p}{n+g^2\varDelta _{(-i)}^2}. \end{aligned}$$

Therefore, we obtain

$$\begin{aligned} \frac{1}{n} \mathrm{T}_{d, i} \rightarrow \lim \log \left( 1 + \frac{g^2 \varDelta _{\{i\} \cdot (-i)}^2}{n + g^2 \varDelta _{(-i)}^2} \right) > 0, \end{aligned}$$

which implies our assertion.

Next, we show “\(\mathrm{[F2]} \rightarrow 0\)”. For any \(i \notin j_*\), \(\varDelta ^2=\varDelta _{(-i)}^2\). Therefore, using Lemma 1(3), we have

$$\begin{aligned} \mathrm{T}_{d, i}= n \log \left( 1 + \frac{\chi _1^2}{\chi _{n-p-1}^2} \right) - d, \end{aligned}$$

whose distribution does not depend on i. Here, \(\chi _1^2\) and \(\chi _{n-p-1}^2\) are independent Chi-square variates with 1 and \(n-p-1\) degrees of freedom. This implies that

$$\begin{aligned} \mathrm{T}_{d,i}> 0 \Leftrightarrow \frac{\chi _1^2}{\chi _{n-p-1}^2} > e^{d/n} - 1. \end{aligned}$$

Noting that \(\mathrm{E}[ \chi _1^2/ \chi _{n-p-1}^2 ] = (n-p-3)^{-1}\), let

$$\begin{aligned} U = \frac{\chi _1^2}{\chi _{n-p-1}^2} - \frac{1}{n-p-3}. \end{aligned}$$

Then, since \(e^{d/n} - 1 - \frac{1}{n-p-3}>h\), we have

$$\begin{aligned} P ( \mathrm{T}_{d,i}> 0 )&= P \left( U> e^{d/n} - 1 - \frac{1}{n-p-3} \right) \\&\le P \left( U > h \right) . \end{aligned}$$

Furthermore, using the Markov inequality, we have

$$\begin{aligned} P( \mathrm{T}_{d,i}> 0 )&\le P(|U| > h)\\&\le h^{-2\ell } \mathrm{E}(U^{2\ell }), \quad \ell = 1, 2, \ldots \end{aligned}$$

Moreover, it is easily seen that

$$\begin{aligned} \mathrm{E}( U^{2\ell } ) = \mathrm{O}(n^{-2\ell }), \end{aligned}$$

using, e.g., Theorem 16.2.2 in Fujikoshi et al. (2010). When \(h = \mathrm{O}(n^{-a})\),

$$\begin{aligned} h^{-2\ell } \mathrm{E}( U^{2\ell } ) = \mathrm{O}(n^{-2(1-a)\ell }). \end{aligned}$$

Choosing \(\ell\) such that \(\ell > (1-a)^{-1}\), we have “\(\mathrm{[F2]} \rightarrow 0\)”.
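The decay of \(P(\mathrm{T}_{d,i}>0)\) for \(i \notin j_*\) can be illustrated by simulating the two independent chi-square variates in the representation above directly; the growth rates \(p=n/2\) and \(d=\sqrt{n}\) below are illustrative choices of ours, not the paper's assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
reps = 50_000
probs = []
for n in (100, 400, 1600):
    p = n // 2                 # illustrative: dimension growing with n
    d = np.sqrt(n)             # illustrative threshold growth
    chi1 = rng.chisquare(df=1, size=reps)
    chirest = rng.chisquare(df=n - p - 1, size=reps)
    # T_{d,i} = n log(1 + chi_1^2 / chi_{n-p-1}^2) - d for i not in j_*
    t = n * np.log1p(chi1 / chirest) - d
    probs.append((t > 0).mean())
print(probs)
```

The estimated probabilities shrink rapidly as \(n\) grows, consistent with the Markov-inequality bound in the proof.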

Proof of Theorem 2

First, note that in the proof of “\(\mathrm{[F2]} \rightarrow 0\)” in Theorem 1, Assumption A3 is not used. This implies the assertion “\(\mathrm{[F2]} \rightarrow 0\)” in Theorem 2.

Now we show “\(\mathrm{[F1]} \rightarrow 0\)” when \(p_*=\mathrm{O}(p)\) and \(\varDelta ^2=\mathrm{O}(p)\). In this case, \(p_*\) tends to \(\infty\). Based on the proof of Theorem 1, we can express \(\mathrm{T}_{d, i}\) for \(i \in j_*\) as

$$\begin{aligned} \mathrm{T}_{d, i}= n \log \left\{ 1 + \frac{\chi _1^2 ({\widehat{\lambda }}_i^2)}{\chi _{n-p-1}^2} \right\} - d, \end{aligned}$$

where \({\widehat{\lambda }}_i^2=g^2\varDelta _{\{i\} \cdot (-i)}^2 (1+R_i)^{-1}\) and \(R_i = \chi _{p-1}^2 (g^2 \varDelta _{(-i)}^2)\left\{ \chi _{n-p}^2\right\} ^{-1}\). Note that \(\chi _1^2\) and \(\chi _{n-p-1}^2\) are independent of \(R_i\), and hence of \({\widehat{\lambda }}_i^2\). Then, we have

$$\begin{aligned} P(\mathrm{T}_{d, i} \le 0)=P({\widehat{V}} \le {\widehat{h}}), \end{aligned}$$

where

$$\begin{aligned} {\widehat{V}}&= \frac{\chi _1^2 ({\widehat{\lambda }}_i^2)}{\chi _{n-p-1}^2}- \frac{1+{\widehat{\lambda }}_i^2}{n-p-3}, \\ {\widehat{h}}&=e^{d/n}-1-(1+{\widehat{\lambda }}_i^2)/(n-p-3). \end{aligned}$$

Considering the conditional distribution of the right-hand side in (17), we have

$$\begin{aligned} P({\widehat{V}} \le {\widehat{h}})= \mathrm{E}_{{\widehat{\lambda }}^2_i}\left\{ Q({\widehat{\lambda }}^2_i)\right\} , \end{aligned}$$

where

$$\begin{aligned} Q(\lambda ^2_i)&=P({\widehat{V}} \le {\widehat{h}} \ | \ {\widehat{\lambda }}_i^2=\lambda _i^2) \\&=P({\widetilde{V}} \le {\widetilde{h}}), \end{aligned}$$

with

$$\begin{aligned} {\widetilde{V}}&= \frac{\chi _1^2 (\lambda _i^2)}{\chi _{n-p-1}^2}- \frac{1+\lambda _i^2}{n-p-3}, \\ {\widetilde{h}}&=e^{d/n}-1-(1+\lambda _i^2)/(n-p-3). \end{aligned}$$

Using Assumption A6, it can be seen that

$$\begin{aligned} {\widehat{\lambda }}_i^2 \sim (1-c)c^{-1}\theta _i^2p \equiv \lambda _{i0}^2,\quad \mathrm{and} \quad {\widehat{\lambda }}_i^2 = \mathrm{O}(p^b). \end{aligned}$$

Now, we consider the probability \(P({\widetilde{V}} \le {\widetilde{h}})\) when \(\lambda _i^2=\lambda _{i0}^2\). From the assumption \(r < b\), \({\widetilde{h}} < 0\) for large n. Therefore, for large n, we have

$$\begin{aligned} P({\widetilde{V}} \le {\widetilde{h}})&\le P(|{\widetilde{V}}| \ge |{\widetilde{h}}|) \\&\le |{\widetilde{h}}|^{-4} \mathrm{E}({\widetilde{V}}^4). \end{aligned}$$

From Lemma 2, \(\mathrm{E}({\widetilde{V}}^4)=\mathrm{O}(n^{-2})\). Noting that \({\widetilde{h}}=\mathrm{O}(n^{-(1-b)})\), we have

$$\begin{aligned} |{\widetilde{h}}|^{-4} \mathrm{E}({\widetilde{V}}^4)=\mathrm{O}(n^{4(1-b)-2}), \end{aligned}$$

whose order is \(\mathrm{O}(n^{-(1+3\delta )})\) if we choose b as \(b > (3/4)(1+\delta )\). Therefore, we have \(P(\mathrm{T}_{d,i} \le 0)=\mathrm{O}(n^{-(1+3\delta )})\), which implies “\(\mathrm{[F1]} \rightarrow 0\)”.

Proof of Theorem 3

The assertion “\(\mathrm{[F1]} \rightarrow 0\)” follows from the proof of “\(\mathrm{[F1]} \rightarrow 0\)” in Theorem 1. For a proof of “\(\mathrm{[F2]} \rightarrow 0\)”, it is enough to show that

$$\begin{aligned} \mathrm{T}_{d,i} \rightarrow \ -\infty \quad \text {for} \ i \notin j_*, \end{aligned}$$

since p is fixed. From (16), the limiting distribution of \(\mathrm{T}_{d,i}\) is “\(\chi _1^2-d\)”. This implies “\(\mathrm{[F2]} \rightarrow 0\)”.


Cite this article

Fujikoshi, Y., Sakurai, T. Consistency of test-based method for selection of variables in high-dimensional two-group discriminant analysis. Jpn J Stat Data Sci 2, 155–171 (2019).



Keywords

  • Consistency
  • Discriminant analysis
  • High-dimensional framework
  • Selection of variables
  • Test-based method