Consistency of test-based method for selection of variables in high-dimensional two-group discriminant analysis
Abstract
This paper is concerned with the selection of variables in two-group discriminant analysis with a common covariance matrix. We propose a test-based method (TM) drawing on the significance of each variable. Sufficient conditions for the test-based method to be consistent are provided when the dimension and the sample size are large. For the case that the dimension is larger than the sample size, a ridge-type method is proposed. Our results and the tendencies therein are explored numerically through a Monte Carlo simulation. It is pointed out that our selection method can be applied to high-dimensional data.
Keywords
Consistency · Discriminant analysis · High-dimensional framework · Selection of variables · Test-based method

1 Introduction
This paper is concerned with the variable selection problem in two-group discriminant analysis with a common covariance matrix. In a variable selection problem under such a discriminant model, one of the goals is to find the subset of variables whose coefficients in the linear discriminant function are not zero. Several methods have been developed, including the model selection criteria \(\mathrm{AIC}\) (Akaike information criterion, Akaike 1973) and \(\mathrm{BIC}\) (Bayesian information criterion, Schwarz 1978). It is known (see, e.g., Fujikoshi 1985; Nishii et al. 1988) that in a large-sample framework \(\mathrm{AIC}\) is not consistent, but \(\mathrm{BIC}\) is consistent. On the other hand, in the multivariate regression model, it has been shown (Fujikoshi et al. 2014; Yanagihara et al. 2015) that in a high-dimensional framework \(\mathrm{AIC}\) has consistency properties under some conditions, but \(\mathrm{BIC}\) is not necessarily consistent. In our discriminant model, there are methods based on misclassification errors by McLachlan (1976), Fujikoshi (1985), Hyodo and Kubokawa (2014), and Yamada et al. (2017) for the high-dimensional case as well as the large-sample case. It is known (see, e.g., Fujikoshi 1985) that the methods based on the misclassification error rate and on AIC are asymptotically equivalent under a large-sample framework. In our discriminant model, Sakurai et al. (2013) derived an asymptotically unbiased estimator of the risk function for the high-dimensional case. On the other hand, these selection methods are based on the minimization of criteria over all subsets of variables, and become computationally onerous when the dimension is large. Though some stepwise methods have been proposed, their optimality is not known. For high-dimensional data, the Lasso and other regularization methods have been extended to discriminant analysis; see, e.g., Clemmensen et al. (2011), Witten and Tibshirani (2011), and Hao et al. (2015).
In this paper, we propose a test-based method based on a significance test of each variable, which is useful for high-dimensional data as well as large-sample data. The idea is essentially the same as in Zhao et al. (1986). Our criterion involves a constant term that should be determined from the point of view of some optimality. We propose a class of constants satisfying consistency when the dimension and the sample size are large. For the case when the dimension is larger than the sample size, a regularized method is numerically examined. Our results and the tendencies therein are explored numerically through a Monte Carlo simulation.
The remainder of the present paper is organized as follows. In Sect. 2, we present the relevant notation and the test-based method. In Sect. 3, we derive sufficient conditions for the test-based criterion to be consistent under a high-dimensional case. In Sect. 4, we study the test-based criterion through a Monte Carlo simulation. In Sect. 5, we propose the ridge-type criteria, whose consistency properties are numerically examined. In Sect. 6, conclusions are offered. All proofs of our results are provided in the Appendix.
2 Test-based method
In general, if d is large, a small number of variables is selected; if d is small, a large number of variables is selected. Ideally, we want to select only the true variables, whose discriminant coefficients are not zero. For a test-based method, an important problem is how to choose the constant term d. Nishii et al. (1988) used the special case \(d=\log n\), and noted that under a large-sample framework \(\mathrm{TM}_{\log n}\) is consistent. However, we show through a simulation experiment that \(\mathrm{TM}_{\log n}\) will not be consistent in a high-dimensional case. In Sect. 3 we propose a class of d, including \(d= \sqrt{n}\), satisfying high-dimensional consistency.
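As a concrete illustration, the selection rule can be sketched in Python. Since the exact test statistic is not reproduced in this section, the sketch uses the standard partial-F statistic for the additional information of one variable in two-group discriminant analysis (see, e.g., Rao 1973) as a stand-in; the function names and the identity of the statistic are assumptions of this sketch, not the paper's definition.

```python
import numpy as np

def mahalanobis_d2(X1, X2):
    """Squared sample Mahalanobis distance based on the pooled covariance."""
    n1, n2 = len(X1), len(X2)
    diff = X1.mean(axis=0) - X2.mean(axis=0)
    S = ((n1 - 1) * np.cov(X1, rowvar=False) +
         (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    S = np.atleast_2d(S)
    return float(diff @ np.linalg.solve(S, diff))

def tm_select(X1, X2, d):
    """Test-based method TM_d (sketch): keep variable i when the partial-F
    statistic for its additional information exceeds the constant d."""
    n1, n2 = len(X1), len(X2)
    p = X1.shape[1]
    d2_full = mahalanobis_d2(X1, X2)
    selected = []
    for i in range(p):
        rest = [j for j in range(p) if j != i]        # all variables but i
        d2_rest = mahalanobis_d2(X1[:, rest], X2[:, rest])
        # Partial F for adding variable i to the remaining p - 1 variables
        num = (n1 + n2 - p - 1) * n1 * n2 * (d2_full - d2_rest)
        den = (n1 + n2) * (n1 + n2 - 2) + n1 * n2 * d2_rest
        if num / den > d:
            selected.append(i)
    return selected
```

With \(d=\sqrt{n}\) this mimics \(\mathrm{TM}_{\sqrt{n}}\) under the sketch's assumptions; it requires \(p \ge 2\) and \(n_1+n_2 > p+1\), and examines only the p one-variable-deleted subsets rather than all \(2^p\) subsets.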
3 Consistency of \(\mathrm{TM}_d\) under a high-dimensional framework
A1 (The true model): \(M_{j_*} \in {{\mathcal {F}}}\).

A2 (The high-dimensional asymptotic framework): \(p \rightarrow \infty\), \(n \rightarrow \infty\), \(p/n \rightarrow c \in (0, 1)\), \(n_i/n \rightarrow k_i>0 \ (i=1, 2)\).

A3: \(p_{*}\) is finite, and \(\varDelta ^2 = \mathrm{O}(1)\).

B1: \(d/n \rightarrow 0\).

B2: \(h \equiv d/n - 1/(n-p-3) > 0\), and \(h=\mathrm{O}(n^{-a})\), where \(0<a<1\).

A4: For \(i \in j_*\),
$$\begin{aligned} \lim (n_1n_2/n^{2})\varDelta _{\{i\} \cdot (i)}^2\left\{ 1+(n_1n_2/n^{2})\varDelta _{(i)}^2\right\} ^{-1} > 0. \end{aligned}$$
Theorem 1
Suppose that assumptions A1, A2, A3 and A4 are satisfied. Then, the test-based method \(\mathrm{TM}_{d}\) is consistent if B1 and B2 are satisfied.
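Conditions B1 and B2 are easy to verify numerically for a given choice of d. The short check below, a sketch assuming the high-dimensional framework A2 with \(p = cn\), confirms that \(d=\sqrt{n}\) keeps \(d/n \rightarrow 0\) and \(h > 0\); the function name and the choice \(c = 0.5\) are illustrative.

```python
import math

def check_b1_b2(n, c=0.5):
    """For d = sqrt(n), return (d/n, h) with h = d/n - 1/(n - p - 3),
    taking p = c*n as in the high-dimensional framework A2."""
    p = int(c * n)
    d = math.sqrt(n)
    h = d / n - 1.0 / (n - p - 3)
    return d / n, h

for n in (100, 1000, 10000):
    ratio, h = check_b1_b2(n)
    print(f"n={n:5d}: d/n={ratio:.4f}, h={h:.4f}")
```

Here \(d/n\) tends to 0 while h stays positive and decays roughly like \(n^{-1/2}\), consistent with B2 for \(a = 1/2\).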
A5: \(p_{*}=\mathrm{O}(p)\), and \(\varDelta ^2 = \mathrm{O}(p)\).

A6: For \(i \in j_*\),
$$\begin{aligned} \theta _i^2=\varDelta _{\{i\} \cdot (i)}^2\left\{ \varDelta _{(i)}^2\right\} ^{-1}=\mathrm{O}(p^{b-1}), \quad 0< b < 1. \end{aligned}$$
Theorem 2
Suppose that assumptions A1, A2, A5 and A6 are satisfied. Then, the test-based criterion \(\mathrm{TM}_{d}\) is consistent if "\(d=n^r, \ 0< r < 1\)" and "\(r< b\), \((3/4)(1+\delta ) < b\)" are satisfied, where \(\delta\) is any small positive number.
From our proof, it is conjectured that the condition "\(3/4 < b\)" can be replaced by "\(1/2 < b\)". From a practical point of view, it is natural that b is small, and then the sufficient condition requires \(r \le 1/2\).
Theorem 3
From Theorem 3, we can see that \(\mathrm{TM}_{\log n}\) and \(\mathrm{TM}_{\sqrt{n}}\) are consistent under a large-sample framework. However, \(\mathrm{TM}_{2}\) does not satisfy the sufficient condition.
4 Numerical study
In this section, we numerically explore the validity of our claims through three test-based criteria, \(\mathrm{TM}_{2}\), \(\mathrm{TM}_{\log n}\), and \(\mathrm{TM}_{\sqrt{n}}\). Note that \(\mathrm{TM}_{\sqrt{n}}\) satisfies sufficient conditions B1 and B2 for its consistency, but \(\mathrm{TM}_{2}\) and \(\mathrm{TM}_{\log n}\) do not satisfy them.
The selection rates associated with these criteria are given in Tables 1, 2, and 3. "Under", "True", and "Over" denote the underspecified models, the true model, and the overspecified models, respectively. The selection rates are based on \(10^3\) replications in Tables 1 and 2, and on \(10^2\) replications in Table 3.

The selection probabilities of the true model by \(\mathrm{TM}_{2}\) are relatively large when the dimension is small, as in the case \(p=5\). However, the values do not approach 1 as n increases, and it seems that \(\mathrm{TM}_{2}\) is not consistent even in the large-sample case.

The selection probabilities of the true model by \(\mathrm{TM}_{\log n}\) are near 1, and the criterion appears consistent in the large-sample case. However, the probabilities decrease as p increases, and so it will not be consistent in a high-dimensional case.

The selection probabilities of the true model by \(\mathrm{TM}_{\sqrt{n}}\) approach 1 as n increases, even if p is small. Furthermore, if p is large but n is also large under a high-dimensional framework, the criterion appears consistent. However, the probabilities decrease as the ratio p/n approaches 1.

As the quantity \(\alpha\), representing the distance between the two groups, becomes large, the selection probabilities of the true model by \(\mathrm{TM}_{2}\) and \(\mathrm{TM}_{\log n}\) increase for large sample sizes, as in the case \(p=5\). However, the effect becomes small when p is large. On the other hand, the selection probabilities of the true model by \(\mathrm{TM}_{\sqrt{n}}\) increase to a certain extent in both large-sample and high-dimensional cases.

As is the case with Table 1, \(\mathrm{TM}_{2}\) and \(\mathrm{TM}_{\log n}\) are not consistent as p increases, but \(\mathrm{TM}_{\sqrt{n}}\) is consistent. In general, the probability of selecting the true model decreases as the dimension of the true model increases.

When \(p_*=p/4\), neither \(\mathrm{TM}_{2}\) nor \(\mathrm{TM}_{\log n}\) shows consistency. The consistency of \(\mathrm{TM}_{\sqrt{n}}\) can be seen when n is large.
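A minimal version of such a Monte Carlo experiment can be sketched as follows. To keep the sketch self-contained, it thresholds per-variable squared two-sample t-statistics at d rather than the paper's exact test statistic, and uses an identity covariance with signal \(\alpha\) in the first \(p_*\) coordinates; these are simplifying assumptions, not the paper's exact design.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_rates(n1, n2, p, p_star, alpha, d, reps=200):
    """Monte Carlo rate at which thresholding per-variable squared
    t-statistics at d selects exactly the first p_star variables."""
    mu = np.zeros(p)
    mu[:p_star] = alpha                   # signal only in the true variables
    hits = 0
    for _ in range(reps):
        X1 = rng.standard_normal((n1, p)) + mu
        X2 = rng.standard_normal((n2, p))
        se = np.sqrt(X1.var(axis=0, ddof=1) / n1 + X2.var(axis=0, ddof=1) / n2)
        t2 = ((X1.mean(axis=0) - X2.mean(axis=0)) / se) ** 2
        if set(np.flatnonzero(t2 > d)) == set(range(p_star)):
            hits += 1
    return hits / reps

# d = sqrt(n) with n = n1 + n2, mirroring TM_{sqrt(n)}
rate = simulate_rates(100, 100, p=10, p_star=3, alpha=1.0, d=np.sqrt(200))
print(f"true-model selection rate: {rate:.2f}")
```

Varying d here (e.g., \(d=2\), \(\log n\), \(\sqrt{n}\)) reproduces the qualitative pattern of the tables: small constant thresholds overselect, while \(d=\sqrt{n}\) recovers the true model with high probability once n is large.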
Selection rates of \(\mathrm{TM}_{2}\), \(\mathrm{TM}_{\log n}\) and \(\mathrm{TM}_{\sqrt{n}}\) for \(p_*=3\)
\(n_1\)  \(n_2\)  p  \(\mathrm{TM}_{2}\) (Under / True / Over)  \(\mathrm{TM}_{\log n}\) (Under / True / Over)  \(\mathrm{TM}_{\sqrt{n}}\) (Under / True / Over)
\(p_* = 3\), \(\alpha = 1\)  
50  50  5  0.00  0.66  0.34  0.00  0.93  0.07  0.04  0.96  0.00 
100  100  5  0.00  0.70  0.30  0.00  0.95  0.05  0.00  1.00  0.00 
200  200  5  0.00  0.71  0.29  0.00  0.95  0.05  0.00  1.00  0.00 
50  50  25  0.00  0.01  0.98  0.01  0.28  0.71  0.07  0.80  0.13 
100  100  50  0.00  0.00  1.00  0.00  0.13  0.87  0.00  0.94  0.06 
200  200  100  0.00  0.00  1.00  0.00  0.06  0.94  0.00  0.99  0.01 
50  50  50  0.01  0.00  0.99  0.04  0.01  0.95  0.15  0.34  0.52 
100  100  100  0.00  0.00  1.00  0.00  0.00  1.00  0.01  0.53  0.46 
200  200  200  0.00  0.00  1.00  0.00  0.00  1.00  0.00  0.74  0.26 
\(p_* = 3\), \(\alpha = 2\)  
50  50  5  0.00  0.27  0.73  0.00  0.85  0.15  0.00  1.00  0.00 
100  100  5  0.00  0.67  0.33  0.00  0.95  0.05  0.00  1.00  0.00 
200  200  5  0.00  0.72  0.28  0.00  0.97  0.04  0.00  1.00  0.00 
50  50  25  0.00  0.00  1.00  0.00  0.27  0.72  0.01  0.86  0.13 
100  100  50  0.00  0.00  1.00  0.00  0.16  0.85  0.00  0.95  0.05 
200  200  100  0.00  0.00  1.00  0.00  0.05  0.96  0.00  0.99  0.01 
50  50  50  0.00  0.00  1.00  0.00  0.01  0.98  0.03  0.37  0.59 
100  100  100  0.00  0.00  1.00  0.00  0.00  1.00  0.00  0.53  0.47 
200  200  200  0.00  0.00  1.00  0.00  0.00  1.00  0.00  0.75  0.25 
\(p_* = 3\), \(\alpha = 3\)  
50  50  5  0.00  0.70  0.31  0.00  0.92  0.08  0.00  0.99  0.01 
100  100  5  0.00  0.71  0.29  0.00  0.95  0.05  0.00  1.00  0.00 
200  200  5  0.00  0.70  0.30  0.00  0.97  0.03  0.00  1.00  0.00 
50  50  25  0.00  0.01  0.99  0.00  0.26  0.74  0.01  0.87  0.12 
100  100  50  0.00  0.00  1.00  0.00  0.13  0.87  0.00  0.94  0.06 
200  200  100  0.00  0.00  1.00  0.00  0.04  0.96  0.00  0.99  0.01 
50  50  50  0.00  0.00  1.00  0.00  0.01  0.99  0.03  0.35  0.62 
100  100  100  0.00  0.00  1.00  0.00  0.00  1.00  0.00  0.51  0.49 
200  200  200  0.00  0.00  1.00  0.00  0.00  1.00  0.00  0.73  0.27 
Selection rates of \(\mathrm{TM}_{2}\), \(\mathrm{TM}_{\log n}\) and \(\mathrm{TM}_{\sqrt{n}}\) for \(p_*=6\)
\(n_1\)  \(n_2\)  p  \(\mathrm{TM}_{2}\) (Under / True / Over)  \(\mathrm{TM}_{\log n}\) (Under / True / Over)  \(\mathrm{TM}_{\sqrt{n}}\) (Under / True / Over)
\(p_* = 6\), \(\alpha = 1\)  
50  50  25  0.08  0.01  0.92  0.30  0.24  0.46  0.86  0.12  0.02 
100  100  50  0.00  0.00  1.00  0.01  0.16  0.83  0.30  0.66  0.04 
200  200  100  0.00  0.00  1.00  0.00  0.06  0.94  0.01  0.98  0.01 
50  50  50  0.19  0.00  0.81  0.49  0.00  0.51  0.90  0.03  0.07 
100  100  100  0.00  0.00  1.00  0.05  0.00  0.95  0.48  0.28  0.25 
200  200  200  0.00  0.00  1.00  0.00  0.00  1.00  0.03  0.73  0.24 
\(p_* = 6\), \(\alpha = 2\)  
50  50  25  0.06  0.01  0.93  0.23  0.24  0.53  0.75  0.21  0.04 
100  100  50  0.00  0.00  1.00  0.01  0.16  0.83  0.14  0.81  0.05 
200  200  100  0.00  0.00  1.00  0.00  0.06  0.94  0.00  0.99  0.01 
50  50  50  0.12  0.00  0.88  0.36  0.01  0.63  0.82  0.07  0.11 
100  100  100  0.00  0.00  1.00  0.02  0.00  0.98  0.33  0.32  0.35 
200  200  200  0.00  0.00  1.00  0.00  0.00  1.00  0.02  0.74  0.24 
\(p_* = 6\), \(\alpha = 3\)  
50  50  25  0.03  0.01  0.96  0.17  0.24  0.59  0.71  0.25  0.05 
100  100  50  0.00  0.00  1.00  0.00  0.17  0.83  0.13  0.83  0.04 
200  200  100  0.00  0.00  1.00  0.00  0.07  0.93  0.00  0.99  0.01 
50  50  50  0.13  0.00  0.87  0.35  0.01  0.65  0.82  0.05  0.13 
100  100  100  0.00  0.00  1.00  0.03  0.00  0.97  0.30  0.37  0.33 
200  200  200  0.00  0.00  1.00  0.00  0.00  1.00  0.01  0.75  0.24 
Selection rates of \(\mathrm{TM}_{2}\), \(\mathrm{TM}_{\log n}\) and \(\mathrm{TM}_{\sqrt{n}}\) for \(p_*=p/4\)
\(n_1\)  \(n_2\)  \(\alpha\)  \(\mathrm{TM}_{2}\) (Under / True / Over)  \(\mathrm{TM}_{\log n}\) (Under / True / Over)  \(\mathrm{TM}_{\sqrt{n}}\) (Under / True / Over)
\(p=100\), \(p_*=25\)  
100  100  1  0.99  0.00  0.01  1.00  0.00  0.00  1.00  0.00  0.00 
200  200  1  0.28  0.00  0.72  0.94  0.01  0.05  1.00  0.00  0.00 
500  500  1  0.00  0.00  1.00  0.01  0.36  0.63  1.00  0.00  0.00 
1000  1000  1  0.00  0.00  1.00  0.00  0.58  0.42  0.43  0.57  0.00 
2000  2000  1  0.00  0.00  1.00  0.00  0.73  0.27  0.00  1.00  0.00 
\(p=200\), \(p_*=50\)  
200  200  1  1.00  0.00  0.00  1.00  0.00  0.00  1.00  0.00  0.00 
500  500  1  0.19  0.00  0.81  0.96  0.01  0.03  1.00  0.00  0.00 
1000  1000  1  0.00  0.00  1.00  0.02  0.17  0.81  1.00  0.00  0.00 
2000  2000  1  0.00  0.00  1.00  0.00  0.45  0.55  1.00  0.00  0.00 
5000  5000  1  0.00  0.00  1.00  0.00  0.70  0.30  0.00  1.00  0.00 
10000  10000  1  0.00  0.00  1.00  0.00  0.75  0.25  0.00  1.00  0.00 
5 Ridge-type methods

\(\mathrm{TM}_{2}\) is not consistent. On the other hand, it seems that \(\mathrm{TM}_{\log n}\) and \(\mathrm{TM}_{\sqrt{n}}\) are consistent when the dimension p and the total sample size n are well separated.
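Although the ridge-type criteria are only examined numerically here, their key ingredient can be sketched: when p exceeds the sample size, the pooled covariance matrix S is singular, so \(S^{-1}\) is replaced by \((S + \lambda I_p)^{-1}\) for some \(\lambda > 0\). The function below is a sketch under that assumption; the function name and the choice of \(\lambda\) are illustrative.

```python
import numpy as np

def ridge_mahalanobis_d2(X1, X2, lam):
    """Squared Mahalanobis distance with a ridge-regularized pooled
    covariance, computable even when p exceeds n1 + n2 - 2."""
    n1, n2 = len(X1), len(X2)
    diff = X1.mean(axis=0) - X2.mean(axis=0)
    S = ((n1 - 1) * np.cov(X1, rowvar=False) +
         (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    S_ridge = S + lam * np.eye(S.shape[0])  # lam > 0 makes S_ridge invertible
    return float(diff @ np.linalg.solve(S_ridge, diff))
```

A test statistic built from such ridge-regularized distances can then be thresholded at d in the same way as \(\mathrm{TM}_d\).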
Selection rates of \(\mathrm{TM}_{2}\), \(\mathrm{TM}_{\log n}\) and \(\mathrm{TM}_{\sqrt{n}}\) for \(p_*=p/4\)
\(n_1\)  \(n_2\)  p  \(\mathrm{TM}_{2}\) (Under / True / Over)  \(\mathrm{TM}_{\log n}\) (Under / True / Over)  \(\mathrm{TM}_{\sqrt{n}}\) (Under / True / Over)   (all rows: \(\alpha = 1\))
15  15  40  0.31  0.00  0.69  0.48  0.00  0.52  0.73  0.04  0.23 
25  25  60  0.15  0.00  0.85  0.30  0.00  0.70  0.52  0.00  0.48 
50  50  110  0.07  0.00  0.93  0.14  0.00  0.87  0.35  0.00  0.65 
15  15  90  0.14  0.29  0.57  0.48  0.48  0.04  0.91  0.09  0.00 
25  25  150  0.01  0.13  0.86  0.10  0.83  0.07  0.60  0.40  0.00 
50  50  300  0.00  0.02  0.98  0.00  0.95  0.05  0.08  0.93  0.00 
6 Concluding remarks
In this paper, we have proposed a test-based method (TM) for the variable selection problem, drawing on the significance of each variable. The method involves a constant term d and is denoted by \(\mathrm{TM}_d\). When \(d=2\) and \(d=\log n\), the corresponding \(\mathrm{TM}\)'s are related to \(\mathrm{AIC}\) and \(\mathrm{BIC}\), respectively. The usual model selection criteria such as \(\mathrm{AIC}\) and \(\mathrm{BIC}\) need to examine all subsets of variables, whereas \(\mathrm{TM}_{d}\) needs to examine only the p subsets \((i), \ i=1, \ldots , p\). This circumvents the computational complexity associated with \(\mathrm{AIC}\) and \(\mathrm{BIC}\). Furthermore, it was shown that \(\mathrm{TM}_{d}\) has a high-dimensional consistency property for some d, including \(d=\sqrt{n}\), when (i) \(p_*\) is finite and \(\varDelta ^2=\mathrm{O}(1)\), and (ii) \(p_*\) is infinite and \(\varDelta ^2=\mathrm{O}(p)\). When the dimension is larger than the sample size, we have proposed ridge-type methods. However, a study of their theoretical properties is left as future work.
Acknowledgements
We thank two referees for careful reading of our manuscript and many helpful comments which improved the presentation of this paper. The first author's research is partially supported by the Ministry of Education, Science, Sports, and Culture, a Grant-in-Aid for Scientific Research (C), 16K00047, 2016–2018.
References
 Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In B. N. Petrov & F. Csáki (Eds.), 2nd International Symposium on Information Theory (pp. 267–281). Budapest: Akadémiai Kiadó.
 Clemmensen, L., Hastie, T., Witten, D. M., & Ersbøll, B. (2011). Sparse discriminant analysis. Technometrics, 53, 406–413.
 Fujikoshi, Y. (1985). Selection of variables in two-group discriminant analysis by error rate and Akaike's information criteria. Journal of Multivariate Analysis, 17, 27–37.
 Fujikoshi, Y. (2000). Error bounds for asymptotic approximations of the linear discriminant function when the sample size and dimensionality are large. Journal of Multivariate Analysis, 73, 1–17.
 Fujikoshi, Y., & Sakurai, T. (2016). High-dimensional consistency of rank estimation criteria in multivariate linear model. Journal of Multivariate Analysis, 149, 199–212.
 Fujikoshi, Y., Ulyanov, V. V., & Shimizu, R. (2010). Multivariate statistics: High-dimensional and large-sample approximations. Hoboken, NJ: Wiley.
 Fujikoshi, Y., Sakurai, T., & Yanagihara, H. (2014). Consistency of high-dimensional AIC-type and \(\text{ C }_p\)-type criteria in multivariate linear regression. Journal of Multivariate Analysis, 144, 184–200.
 Hao, N., Dong, B., & Fan, J. (2015). Sparsifying the Fisher linear discriminant by rotation. Journal of the Royal Statistical Society: Series B, 77, 827–851.
 Hyodo, M., & Kubokawa, T. (2014). A variable selection criterion for linear discriminant rule and its optimality in high dimensional and large sample data. Journal of Multivariate Analysis, 123, 364–379.
 Ito, T., & Kubokawa, T. (2015). Linear ridge estimator of high-dimensional precision matrix using random matrix theory. Discussion Paper Series, CIRJE-F-995.
 Kubokawa, T., & Srivastava, M. S. (2012). Selection of variables in multivariate regression models for large dimensions. Communications in Statistics - Theory and Methods, 41, 2465–2489.
 McLachlan, G. J. (1976). A criterion for selecting variables for the linear discriminant function. Biometrics, 32, 529–534.
 Nishii, R., Bai, Z. D., & Krishnaiah, P. R. (1988). Strong consistency of the information criterion for model selection in multivariate analysis. Hiroshima Mathematical Journal, 18, 451–462.
 Rao, C. R. (1973). Linear statistical inference and its applications (2nd ed.). New York: Wiley.
 Sakurai, T., Nakada, T., & Fujikoshi, Y. (2013). High-dimensional AICs for selection of variables in discriminant analysis. Sankhyā, Series A, 75, 1–25.
 Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461–464.
 Tiku, M. (1985). Noncentral chi-square distribution. In S. Kotz & N. L. Johnson (Eds.), Encyclopedia of Statistical Sciences, Vol. 6 (pp. 276–280). New York: Wiley.
 Van Wieringen, W. N., & Peeters, C. F. (2016). Ridge estimation of inverse covariance matrices from high-dimensional data. Computational Statistics & Data Analysis, 103, 284–303.
 Witten, D. M., & Tibshirani, R. (2011). Penalized classification using Fisher's linear discriminant. Journal of the Royal Statistical Society: Series B, 73, 753–772.
 Yamada, T., Sakurai, T., & Fujikoshi, Y. (2017). High-dimensional asymptotic results for EPMCs of W and Z rules. Hiroshima Statistical Research Group, 17-12.
 Yanagihara, H., Wakaki, H., & Fujikoshi, Y. (2015). A consistency property of the AIC for multivariate linear models when the dimension and the sample size are large. Electronic Journal of Statistics, 9, 869–897.
 Zhao, L. C., Krishnaiah, P. R., & Bai, Z. D. (1986). On determination of the number of signals in presence of white noise. Journal of Multivariate Analysis, 20, 1–25.