
Consistency of test-based method for selection of variables in high-dimensional two-group discriminant analysis

  • Yasunori Fujikoshi
  • Tetsuro Sakurai

Original Paper

Abstract

This paper is concerned with the selection of variables in two-group discriminant analysis with the same covariance matrix. We propose a test-based method (TM) drawing on the significance of each variable. Sufficient conditions for the test-based method to be consistent are provided when the dimension and the sample size are large. For the case where the dimension is larger than the sample size, a ridge-type method is proposed. Our results and the tendencies therein are explored numerically through a Monte Carlo simulation. It is pointed out that our selection method can be applied to high-dimensional data.

Keywords

Consistency · Discriminant analysis · High-dimensional framework · Selection of variables · Test-based method

1 Introduction

This paper is concerned with the variable selection problem in two-group discriminant analysis with the same covariance matrix. In a variable selection problem under such a discriminant model, one of the goals is to find the subset of variables whose coefficients in the linear discriminant function are not zero. Several methods have been developed, including the model selection criteria \(\mathrm{AIC}\) (Akaike information criterion, Akaike 1973) and \(\mathrm{BIC}\) (Bayesian information criterion, Schwarz 1978). It is known (see, e.g., Fujikoshi 1985; Nishii et al. 1988) that in a large-sample framework, \(\mathrm{AIC}\) is not consistent, but \(\mathrm{BIC}\) is consistent. On the other hand, in the multivariate regression model, it has been shown (Fujikoshi et al. 2014; Yanagihara et al. 2015) that in a high-dimensional framework, \(\mathrm{AIC}\) has consistency properties under some conditions, but \(\mathrm{BIC}\) is not necessarily consistent. In our discriminant model, there are methods based on misclassification errors by McLachlan (1976), Fujikoshi (1985), Hyodo and Kubokawa (2014), and Yamada et al. (2017) for the high-dimensional case as well as the large-sample case. It is known (see, e.g., Fujikoshi 1985) that the two methods based on the misclassification error rate and on AIC are asymptotically equivalent under a large-sample framework. In our discriminant model, Sakurai et al. (2013) derived an asymptotically unbiased estimator of the risk function for the high-dimensional case. On the other hand, these selection methods are based on minimizing the criteria over all subsets and become computationally onerous when the dimension is large. Though some stepwise methods have been proposed, their optimality is not known. For high-dimensional data, the Lasso and other regularization methods have also been extended to discriminant analysis; see, e.g., Clemmensen et al. (2011), Witten and Tibshirani (2011), and Hao et al. (2015).

In this paper, we propose a test-based method based on a significance test of each variable, which is useful for high-dimensional data as well as large-sample data. The idea is essentially the same as in Zhao et al. (1986). Our criterion involves a constant term which should be determined from the viewpoint of some optimality. We propose a class of constants for which the method is consistent when the dimension and the sample size are large. For the case where the dimension is larger than the sample size, a regularized method is numerically examined. Our results and the tendencies therein are explored numerically through a Monte Carlo simulation.

The remainder of the present paper is organized as follows. In Sect. 2, we present the relevant notation and the test-based method. In Sect. 3, we derive sufficient conditions for the test-based criterion to be consistent in a high-dimensional setting. In Sect. 4, we study the test-based criterion through a Monte Carlo simulation. In Sect. 5, we propose ridge-type criteria, whose consistency properties are numerically examined. In Sect. 6, conclusions are offered. All proofs of our results are provided in the Appendix.

2 Test-based method

In two-group discriminant analysis, suppose that we have independent samples \({\varvec{y}}^{(i)}_1, \ldots , {\varvec{y}}^{(i)}_{n_i}\) of \({\varvec{y}}= (y_1, \ldots , y_p)'\) from the p-dimensional normal distributions \(\varPi _i: \mathrm{N}_p(\varvec{\mu }^{(i)}, {\varvec{\Sigma }})\), \(i=1, 2\). Let \({\mathsf {Y}}\) be the total sample matrix defined by
$$\begin{aligned} {\mathsf {Y}}=({\varvec{y}}^{(1)}_1, \ldots , {\varvec{y}}^{(1)}_{n_1}, {\varvec{y}}^{(2)}_1, \ldots , {\varvec{y}}^{(2)}_{n_2})'. \end{aligned}$$
The coefficients of the population discriminant function are given by
$$\begin{aligned} \varvec{\beta }={\varvec{\Sigma }}^{-1}(\varvec{\mu }^{(1)}-\varvec{\mu }^{(2)})=(\beta _1, \ldots , \beta _p)'. \end{aligned}$$
(1)
Let \(\varDelta\) and D be the population and the sample Mahalanobis distances defined by \(\varDelta =\left\{ (\varvec{\mu }^{(1)}-\varvec{\mu }^{(2)})'{\varvec{\Sigma }}^{-1}(\varvec{\mu }^{(1)}-\varvec{\mu }^{(2)})\right\} ^{1/2}\), and
$$\begin{aligned} D=\left\{ ({\bar{{\varvec{y}}}}^{(1)}-{\bar{{\varvec{y}}}}^{(2)})'{\mathsf {S}}^{-1}({\bar{{\varvec{y}}}}^{(1)}-{\bar{{\varvec{y}}}}^{(2)})\right\} ^{1/2}, \end{aligned}$$
respectively. Here, \({\bar{{\varvec{y}}}}^{(1)}\) and \({\bar{{\varvec{y}}}}^{(2)}\) are the sample mean vectors, and \({\mathsf {S}}\) is the pooled sample covariance matrix based on \(n=n_1+n_2\) observations.
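To fix ideas, the following is a minimal numerical sketch of these quantities. It is our own illustration, not code from the paper; the helper names pooled_cov and mahalanobis_D are hypothetical.

```python
import numpy as np

def pooled_cov(Y1, Y2):
    """Pooled sample covariance matrix S based on n = n1 + n2 observations."""
    n1, n2 = Y1.shape[0], Y2.shape[0]
    S1 = np.cov(Y1, rowvar=False)   # divisor n1 - 1
    S2 = np.cov(Y2, rowvar=False)   # divisor n2 - 1
    return ((n1 - 1) * S1 + (n2 - 1) * S2) / (n1 + n2 - 2)

def mahalanobis_D(Y1, Y2):
    """Sample Mahalanobis distance D between the two group mean vectors."""
    diff = Y1.mean(axis=0) - Y2.mean(axis=0)
    S = pooled_cov(Y1, Y2)
    return float(np.sqrt(diff @ np.linalg.solve(S, diff)))
```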
Suppose that j denotes a subset of \(\omega =\{1, \ldots , p\}\) containing \(p_j\) elements, and that \({\varvec{y}}_j\) denotes the \(p_j\)-dimensional vector consisting of the elements of \({\varvec{y}}\) indexed by the elements of j. We use the notation \(D_j\) and \(D_{\omega }\) for D based on \({\varvec{y}}_j\) and \({\varvec{y}}_{\omega }(={\varvec{y}})\), respectively. Let \(M_j\) be a variable selection model defined by
$$\begin{aligned} M_j: \ \beta _i \ne 0 \ \mathrm{if} \ i \in j, \ \mathrm{and} \ \beta _i = 0 \ \mathrm{if} \ i \not \in j. \end{aligned}$$
(2)
The model \(M_j\) is equivalent to \(\varDelta _j=\varDelta _{\omega }\), i.e., the Mahalanobis distance based on \({\varvec{y}}_j\) is the same as the one based on the full set of variables \({\varvec{y}}\). We identify the selection of \(M_j\) with the selection of \({\varvec{y}}_j\). Let \(\mathrm{AIC}_j\) be the \(\mathrm{AIC}\) for \(M_j\). Then, it is known (see, e.g., Fujikoshi 1985) that
$$\begin{aligned} \mathrm{A}_j&= \mathrm{AIC}_{j}- \mathrm{AIC}_{\omega } \nonumber \\&=n \log \left\{ 1 + \frac{g^2(D^2_{\omega }-D^2_j)}{n-2+g^2D_j^2}\right\} - 2(p-p_j), \end{aligned}$$
(3)
where \(g=\sqrt{(n_1n_2)/n}\). Similarly, let \(\mathrm{BIC}_j\) be the \(\mathrm{BIC}\) for \(M_j\), and we have
$$\begin{aligned} \mathrm{B}_j&= \mathrm{BIC}_{j}- \mathrm{BIC}_{\omega } \nonumber \\&=n \log \left\{ 1 + \frac{g^2(D^2_{\omega }-D^2_j)}{n-2+g^2D_j^2}\right\} - (\log n)(p-p_j). \end{aligned}$$
(4)
In a large-sample framework, it is known (see Fujikoshi 1985; Nishii et al. 1988) that \(\mathrm{AIC}\) is not consistent, but \(\mathrm{BIC}\) is consistent. The variable selection methods based on \(\mathrm{AIC}\) and \(\mathrm{BIC}\) are given as \(\min _j \mathrm{AIC}_j\) and \(\min _j \mathrm{BIC}_j\), respectively, and therefore such criteria become computationally onerous when p is large. To circumvent this issue, we consider a test-based method (TM) drawing on the significance of each variable. A critical region for “\(\beta _i=0\)” based on the likelihood ratio principle is expressed (see, e.g., Rao 1973; Fujikoshi et al. 2010) as
$$\begin{aligned} \mathrm{T}_{d,i}=n \log \left\{ 1 + \frac{g^2(D^2_{\omega }-D^2_{(-i)})}{n-2+g^2D_{(-i)}^2}\right\} - d > 0, \end{aligned}$$
(5)
where \((-i)\), \(i=1, \ldots , p\), denotes the subset of \(\omega =\{1, \ldots , p\}\) obtained by omitting i from \(\omega\), and d is a positive constant which may depend on p and n. Note that
$$\begin{aligned} \mathrm{T}_{2,i}> 0 \ \Longleftrightarrow \ \mathrm{AIC}_{(-i)}-\mathrm{AIC}_{\omega }> 0. \end{aligned}$$
We consider a test-based method for the selection of variables defined by selecting the set of suffixes or the set of variables given by
$$\begin{aligned} \mathrm{TM}_{d}=\{ i \in \omega \ | \ { \mathrm T}_{d, i} > 0 \}, \end{aligned}$$
(6)
or \(\{ y_i \in \{ y_1, \ldots , y_p\} \ | \ { \mathrm T}_{d, i} > 0 \}\). The notation \({\widehat{j}}_{\mathrm{TM}_d}\) is also used for \(\mathrm{TM}_{d}\). This idea is closely related to an approach considered by Zhao et al. (1986) and Nishii et al. (1988), who used a selection method based on a comparison of \(\mathrm{IC}_{(-i)}\) with \(\mathrm{IC}_{\omega }\) for a general information criterion.
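As a concrete reading of (5) and (6), the following sketch computes \(\mathrm{T}_{d,i}\) for each variable and returns \(\mathrm{TM}_d\). It is our own illustrative code rather than the authors' implementation; the helper mahalanobis_D2 (squared sample Mahalanobis distance for a column subset) is hypothetical.

```python
import numpy as np

def mahalanobis_D2(Y1, Y2, cols):
    """Squared sample Mahalanobis distance based on the variables in `cols`."""
    X1, X2 = Y1[:, cols], Y2[:, cols]
    n1, n2 = X1.shape[0], X2.shape[0]
    S = ((n1 - 1) * np.cov(X1, rowvar=False) +
         (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    diff = X1.mean(axis=0) - X2.mean(axis=0)
    return float(diff @ np.linalg.solve(np.atleast_2d(S), diff))

def tm_select(Y1, Y2, d):
    """Return TM_d = {i : T_{d,i} > 0}, the set of selected variable indices."""
    n1, n2 = Y1.shape[0], Y2.shape[0]
    n, p = n1 + n2, Y1.shape[1]
    g2 = n1 * n2 / n                      # g^2 = n1 n2 / n
    D2_full = mahalanobis_D2(Y1, Y2, list(range(p)))
    selected = []
    for i in range(p):
        cols = [j for j in range(p) if j != i]   # the subset (-i)
        D2_i = mahalanobis_D2(Y1, Y2, cols)
        T = n * np.log(1 + g2 * (D2_full - D2_i) / (n - 2 + g2 * D2_i)) - d
        if T > 0:
            selected.append(i)
    return selected
```

For example, \(\mathrm{TM}_{\sqrt{n}}\) corresponds to calling tm_select(Y1, Y2, d=np.sqrt(n1 + n2)).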

In general, if d is large, a small number of variables is selected; if d is small, a large number of variables is selected. Ideally, we want to select only the true variables, whose discriminant coefficients are not zero. For a test-based method, an important problem is how to choose the constant d. Nishii et al. (1988) used the special case \(d=\log n\), and noted that under a large-sample framework, \(\mathrm{TM}_{\log n}\) is consistent. However, we observe through a simulation experiment that \(\mathrm{TM}_{\log n}\) is not consistent in a high-dimensional case. In Sect. 3, we propose a class of d, including \(d= \sqrt{n}\), satisfying a high-dimensional consistency.

3 Consistency of \(\mathrm{TM}_d\) under high-dimensional framework

For studying consistency of the variable selection criterion \(\mathrm{TM}_{d}\), it is assumed that the true model \(M_{*}\) is included in the full model. Let \(M_{j_*}\) be the minimum model including \(M_*\). For notational simplicity, we regard the true model \(M_*\) as \(M_{j_*}\). Let \({{\mathcal {F}}}\) be the set of all candidate models, that is,
$$\begin{aligned} {{\mathcal {F}}}=\left\{ \{1 \}, \ldots , \{p \}, \{1,2\}, \ldots , \{1,\ldots ,p\} \right\} . \end{aligned}$$
A subset j of \(\omega\) is called an overspecified model if j includes the true model; otherwise, it is called an underspecified model. Then, \({{\mathcal {F}}}\) is separated into two sets: the set of overspecified models, \({{\mathcal {F}}}_+=\{j\in {{\mathcal {F}}}\,|\, j_* \subseteq j\}\), and the set of underspecified models, \({{\mathcal {F}}}_-={{\mathcal {F}}}_+^c\cap {{\mathcal {F}}}\). A model selection criterion \({\widehat{j}}\) is said to have high-dimensional consistency if
$$\begin{aligned} \lim _{p/n \rightarrow c \in (0, 1)} \Pr ({\widehat{j}}=j_*)=1. \end{aligned}$$
Here, we list some of our main assumptions:

A1 (The true model): \(M_{j_*} \in {{\mathcal {F}}}\).

A2 (The high-dimensional asymptotic framework): \(p \rightarrow \infty , \ n \rightarrow \infty , \ p/n \rightarrow c \in (0, 1)\), \(n_i/n \rightarrow k_i>0 \ (i=1, 2)\).
For the dimensionality \(p_*\) of the true model and the Mahalanobis distance \(\varDelta\), we first consider the following case:

A3: \(p_{*}\) is finite, and \(\varDelta ^2 = \mathrm{O}(1)\).
For the constant d of the test-based statistic \(\mathrm{T}_{d,i}\) in (5), we consider the following assumptions:

B1: \(d/n \rightarrow 0\).

B2: \(h \equiv d/n-1/(n-p-3) > 0\), and \(h=\mathrm{O}(n^{-a})\), where \(0<a<1\).
The consistency of \(\mathrm{TM}_d\) in (6) for some \(d > 0\) is shown along the following outline. In general, we have
$$\begin{aligned} \mathrm{TM}_d = j_* \ \Longleftrightarrow \ \mathrm{T}_{d,i} > 0 \ \text {for} \ i \in j_*, \ \text {and} \ \mathrm{T}_{d,i} \le 0 \ \text {for} \ i \notin j_*. \end{aligned}$$
Therefore,
$$\begin{aligned} P(\mathrm{TM}_{d} = j_*)& = P\left( \bigcap _{i \in j_*} \{\mathrm{T}_{d,i} > 0\} \cap \bigcap _{i \notin j_*} \{\mathrm{T}_{d,i} \le 0\} \right) \\& = 1- P\left( \bigcup _{i \in j_*} \{\mathrm{T}_{d,i} \le 0\} \cup \bigcup _{i \notin j_*} \{\mathrm{T}_{d,i} > 0\} \right) \\&\ge 1 - \sum _{i \in j_*} P(\mathrm{T}_{d,i} \le 0) - \sum _{i \notin j_*} P(\mathrm{T}_{d,i} \ge 0) . \end{aligned}$$
We shall show that
$$\begin{aligned} \mathrm{[F1]}&\equiv \sum _{i \in j_*} P(\mathrm{T}_{d,i} \le 0) \rightarrow 0. \end{aligned}$$
(7)
$$\begin{aligned} \mathrm{[F2]}&\equiv \sum _{i \notin j_*} P(\mathrm{T}_{d,i} \ge 0) \rightarrow 0, \end{aligned}$$
(8)
Here, \(\mathrm{[F1]}\) is the probability that some true variable is not selected, and \(\mathrm{[F2]}\) is the probability that some non-true variable is selected.
The squared Mahalanobis distance of \({\varvec{y}}\) is decomposed as the sum of the squared Mahalanobis distance of \({\varvec{y}}_{(-i)}\) and the squared conditional Mahalanobis distance of \({\varvec{y}}_{\{i \}}\) given \({\varvec{y}}_{(-i)}\) as follows:
$$\begin{aligned} \varDelta ^2=\varDelta _{(-i)}^2+\varDelta _{\{i\} \cdot (-i)}^2. \end{aligned}$$
(9)
When \(i \in j_*\), \((-i) \notin {{\mathcal {F}}}_+\) and hence
$$\begin{aligned} \varDelta _{\{i\} \cdot (-i)}^2=\varDelta ^2-\varDelta _{(-i)}^2 > 0. \end{aligned}$$
Related to consistency of \(\mathrm{TM}_d\) under assumption A3, we consider the following assumption:
A4: For \(i \in j_*\),
$$\begin{aligned} \lim (n_1n_2/n^{2})\varDelta _{\{i\} \cdot (-i)}^2\left\{ 1+(n_1n_2/n^{2})\varDelta _{(-i)}^2\right\} ^{-1} > 0. \end{aligned}$$
Note that A4 is satisfied if \(\lim \varDelta _{\{i\} \cdot (-i)}^2> 0 \ \mathrm{and} \ \lim \varDelta _{(-i)}^2 > 0,\) under \(\varDelta ^{2}=\mathrm{O}(1)\).

Theorem 1

Suppose that assumptions A1, A2, A3 and A4 are satisfied. Then, the test-based method \(\mathrm{TM}_{d}\) is consistent if B1 and B2 are satisfied.

Let \(d=n^r\), where \(0<r<1\). Then, \(h=\mathrm{O}(n^{-(1-r)})\), and condition B2 is satisfied. Therefore, the test-based criteria with
$$\begin{aligned} d=n^{3/4}, \quad n^{2/3}, \quad n^{1/2}, \quad n^{1/3}\quad \mathrm{or } \quad n^{1/4} \end{aligned}$$
have a high-dimensional consistency. Among them, we have numerically observed that the one with \(d=\sqrt{n}\) behaves well. Note that \(\mathrm{TM}_{2}\) and \(\mathrm{TM}_{\log n}\) do not satisfy B2.
As a special case, under the assumptions of Theorem 1, the probability of selecting overspecified models tends to zero, since
$$\begin{aligned} \sum _{i \notin j_*} P(i \in \mathrm{TM}_d ) = \sum _{i \notin j_*} P(\mathrm{T}_{d,i} \ge 0) \rightarrow 0. \end{aligned}$$
The proof is applicable also to the case where assumption A3 is replaced by assumption A5:
A5: \(p_{*}=\mathrm{O}(p)\), and \(\varDelta ^2 = \mathrm{O}(p)\).
In other words, such a property holds regardless of whether the dimension of \(j_*\) is finite or not. Furthermore, it does not depend on the order of the Mahalanobis distance.
Related to consistency of \(\mathrm{TM}_d\) under assumption A5, we consider the following assumption:
A6: For \(i \in j_*\),
$$\begin{aligned} \theta _i^2=\varDelta _{\{i\} \cdot (-i)}^2\left\{ \varDelta _{(-i)}^2\right\} ^{-1}=\mathrm{O}(p^{b-1}), \quad 0< b < 1. \end{aligned}$$

Theorem 2

Suppose that assumptions A1, A2, A5 and A6 are satisfied. Then, the test-based criterion \(\mathrm{TM}_{d}\) is consistent if \(d=n^r\), \(0< r < 1\), and \(r< b\), \((3/4)(1+\delta ) < b\) are satisfied, where \(\delta\) is any small positive number.

From our proof, it is conjectured that the condition “\(3/4 < b\)” can be replaced by “\(1/2 < b\)”. From a practical point of view, it is natural for b to be small, and then the sufficient condition requires \(r \le 1/2\).

Theorem 3

Suppose that assumption A1 and
$$\begin{aligned} p \ \text {fixed}, \quad n_i/n \rightarrow k_i>0 \ (i=1, 2), \quad d \rightarrow \infty , \quad d/n \rightarrow 0 \end{aligned}$$
are satisfied. Then, the test-based criterion \(\mathrm{TM}_{d}\) is consistent as n tends to infinity.

From Theorem 3, we can see that \(\mathrm{TM}_{\log n}\) and \(\mathrm{TM}_{\sqrt{n}}\) are consistent under a large-sample framework. However, \(\mathrm{TM}_{2}\) does not satisfy the sufficient condition.

4 Numerical study

In this section, we numerically explore the validity of our claims through three test-based criteria, \(\mathrm{TM}_{2}\), \(\mathrm{TM}_{\log n}\), and \(\mathrm{TM}_{\sqrt{n}}\). Note that \(\mathrm{TM}_{\sqrt{n}}\) satisfies sufficient conditions B1 and B2 for its consistency, but \(\mathrm{TM}_{2}\) and \(\mathrm{TM}_{\log n}\) do not satisfy them.

The true model was specified as follows: the true dimension is \(p_*=3\) or 6, and the true mean vectors are
$$\begin{aligned} \varvec{\mu }_1 = \alpha (1, \ldots , 1, 0, \ldots , 0)', \quad \varvec{\mu }_2 = \alpha (-1, \ldots , -1, 0, \ldots , 0)', \end{aligned}$$
where \(\alpha =1, 2, 3\), and the true covariance matrix is \({\varvec{\Sigma }}_*=\varvec{{I}}_p\).

The selection rates of these criteria are given in Tables 1, 2, and 3. “Under”, “True”, and “Over” denote the underspecified models, the true model, and the overspecified models, respectively. The rates are based on \(10^3\) replications in Tables 1 and 2, and on \(10^2\) replications in Table 3.
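A sketch of how such selection rates could be reproduced is given below. It reflects our reading of the simulation design (normal data with identity covariance and the mean vectors above), reuses the hypothetical tm_select helper sketched in Sect. 2, and the seed and replication count are arbitrary choices of ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def selection_rates(n1, n2, p, p_star, alpha, d, reps=1000):
    """Monte Carlo rates of selecting an under-, true or overspecified model."""
    mu1 = np.concatenate([alpha * np.ones(p_star), np.zeros(p - p_star)])
    mu2 = -mu1
    true_set = set(range(p_star))
    under = true = over = 0
    for _ in range(reps):
        Y1 = rng.standard_normal((n1, p)) + mu1   # sample from N_p(mu1, I_p)
        Y2 = rng.standard_normal((n2, p)) + mu2   # sample from N_p(mu2, I_p)
        j_hat = set(tm_select(Y1, Y2, d))
        if not true_set <= j_hat:
            under += 1        # misses at least one true variable
        elif j_hat == true_set:
            true += 1         # exactly the true model
        else:
            over += 1         # true model plus extra variables
    return under / reps, true / reps, over / reps

# e.g. the (n1, n2, p) = (100, 100, 50) row of Table 1 with alpha = 1:
# selection_rates(100, 100, 50, p_star=3, alpha=1, d=np.sqrt(200))
```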

From Table 1, we can identify the following tendencies.
  • The selection probabilities of the true model by \(\mathrm{TM}_{2}\) are relatively large when the dimension is small, as in the case \(p=5\). However, the values do not approach 1 as n increases, and it seems that \(\mathrm{TM}_{2}\) is not consistent even in the large-sample case.

  • The selection probabilities of the true model by \(\mathrm{TM}_{\log n}\) are close to 1, indicating consistency in the large-sample case. However, the probabilities decrease as p increases, so \(\mathrm{TM}_{\log n}\) will not be consistent in a high-dimensional case.

  • The selection probabilities of the true model by \(\mathrm{TM}_{\sqrt{n}}\) approach 1 as n becomes large, even if p is small. Furthermore, if p is large but n is also large under the high-dimensional framework, the criterion appears consistent. However, the probabilities decrease as the ratio p/n approaches 1.

  • As the quantity \(\alpha\), representing the distance between the two groups, becomes large, the selection probabilities of the true model by \(\mathrm{TM}_{2}\) and \(\mathrm{TM}_{\log n}\) increase for large sample sizes, as in the case \(p=5\). However, the effect becomes small when p is large. On the other hand, the selection probabilities of the true model by \(\mathrm{TM}_{\sqrt{n}}\) increase to a certain extent in both the large-sample and high-dimensional cases.

In Table 2, we examine the case where the dimension \(p_*\) of the true model is larger than in Table 1. The following tendencies can be identified.
  • As in Table 1, \(\mathrm{TM}_{2}\) and \(\mathrm{TM}_{\log n}\) are not consistent as p increases, but \(\mathrm{TM}_{\sqrt{n}}\) is consistent. In general, the probability of selecting the true model decreases as the dimension of the true model increases.

In Table 3, we examine the case where the dimension \(p_*\) of the true model is relatively large, specifically \(p_*=p/4\). The following tendencies can be identified.
  • When \(p_*=p/4\), no consistency of \(\mathrm{TM}_{2}\) and \(\mathrm{TM}_{\log n}\) can be seen, whereas the consistency of \(\mathrm{TM}_{\sqrt{n}}\) can be seen when n is large.

Table 1

Selection rates of \(\mathrm{TM}_{2}\), \(\mathrm{TM}_{\log n}\) and \(\mathrm{TM}_{\sqrt{n}}\) for \(p_*=3\)

                         TM_2                  TM_{log n}            TM_{sqrt(n)}
  n_1    n_2      p      Under  True  Over    Under  True  Over     Under  True  Over

\(p_* = 3\), \(\alpha = 1\)
   50     50      5      0.00   0.66  0.34    0.00   0.93  0.07     0.04   0.96  0.00
  100    100      5      0.00   0.70  0.30    0.00   0.95  0.05     0.00   1.00  0.00
  200    200      5      0.00   0.71  0.29    0.00   0.95  0.05     0.00   1.00  0.00
   50     50     25      0.00   0.01  0.98    0.01   0.28  0.71     0.07   0.80  0.13
  100    100     50      0.00   0.00  1.00    0.00   0.13  0.87     0.00   0.94  0.06
  200    200    100      0.00   0.00  1.00    0.00   0.06  0.94     0.00   0.99  0.01
   50     50     50      0.01   0.00  0.99    0.04   0.01  0.95     0.15   0.34  0.52
  100    100    100      0.00   0.00  1.00    0.00   0.00  1.00     0.01   0.53  0.46
  200    200    200      0.00   0.00  1.00    0.00   0.00  1.00     0.00   0.74  0.26

\(p_* = 3\), \(\alpha = 2\)
   50     50      5      0.00   0.27  0.73    0.00   0.85  0.15     0.00   1.00  0.00
  100    100      5      0.00   0.67  0.33    0.00   0.95  0.05     0.00   1.00  0.00
  200    200      5      0.00   0.72  0.28    0.00   0.97  0.04     0.00   1.00  0.00
   50     50     25      0.00   0.00  1.00    0.00   0.27  0.72     0.01   0.86  0.13
  100    100     50      0.00   0.00  1.00    0.00   0.16  0.85     0.00   0.95  0.05
  200    200    100      0.00   0.00  1.00    0.00   0.05  0.96     0.00   0.99  0.01
   50     50     50      0.00   0.00  1.00    0.00   0.01  0.98     0.03   0.37  0.59
  100    100    100      0.00   0.00  1.00    0.00   0.00  1.00     0.00   0.53  0.47
  200    200    200      0.00   0.00  1.00    0.00   0.00  1.00     0.00   0.75  0.25

\(p_* = 3\), \(\alpha = 3\)
   50     50      5      0.00   0.70  0.31    0.00   0.92  0.08     0.00   0.99  0.01
  100    100      5      0.00   0.71  0.29    0.00   0.95  0.05     0.00   1.00  0.00
  200    200      5      0.00   0.70  0.30    0.00   0.97  0.03     0.00   1.00  0.00
   50     50     25      0.00   0.01  0.99    0.00   0.26  0.74     0.01   0.87  0.12
  100    100     50      0.00   0.00  1.00    0.00   0.13  0.87     0.00   0.94  0.06
  200    200    100      0.00   0.00  1.00    0.00   0.04  0.96     0.00   0.99  0.01
   50     50     50      0.00   0.00  1.00    0.00   0.01  0.99     0.03   0.35  0.62
  100    100    100      0.00   0.00  1.00    0.00   0.00  1.00     0.00   0.51  0.49
  200    200    200      0.00   0.00  1.00    0.00   0.00  1.00     0.00   0.73  0.27

Table 2

Selection rates of \(\mathrm{TM}_{2}\), \(\mathrm{TM}_{\log n}\) and \(\mathrm{TM}_{\sqrt{n}}\) for \(p_*=6\)

                         TM_2                  TM_{log n}            TM_{sqrt(n)}
  n_1    n_2      p      Under  True  Over    Under  True  Over     Under  True  Over

\(p_* = 6\), \(\alpha = 1\)
   50     50     25      0.08   0.01  0.92    0.30   0.24  0.46     0.86   0.12  0.02
  100    100     50      0.00   0.00  1.00    0.01   0.16  0.83     0.30   0.66  0.04
  200    200    100      0.00   0.00  1.00    0.00   0.06  0.94     0.01   0.98  0.01
   50     50     50      0.19   0.00  0.81    0.49   0.00  0.51     0.90   0.03  0.07
  100    100    100      0.00   0.00  1.00    0.05   0.00  0.95     0.48   0.28  0.25
  200    200    200      0.00   0.00  1.00    0.00   0.00  1.00     0.03   0.73  0.24

\(p_* = 6\), \(\alpha = 2\)
   50     50     25      0.06   0.01  0.93    0.23   0.24  0.53     0.75   0.21  0.04
  100    100     50      0.00   0.00  1.00    0.01   0.16  0.83     0.14   0.81  0.05
  200    200    100      0.00   0.00  1.00    0.00   0.06  0.94     0.00   0.99  0.01
   50     50     50      0.12   0.00  0.88    0.36   0.01  0.63     0.82   0.07  0.11
  100    100    100      0.00   0.00  1.00    0.02   0.00  0.98     0.33   0.32  0.35
  200    200    200      0.00   0.00  1.00    0.00   0.00  1.00     0.02   0.74  0.24

\(p_* = 6\), \(\alpha = 3\)
   50     50     25      0.03   0.01  0.96    0.17   0.24  0.59     0.71   0.25  0.05
  100    100     50      0.00   0.00  1.00    0.00   0.17  0.83     0.13   0.83  0.04
  200    200    100      0.00   0.00  1.00    0.00   0.07  0.93     0.00   0.99  0.01
   50     50     50      0.13   0.00  0.87    0.35   0.01  0.65     0.82   0.05  0.13
  100    100    100      0.00   0.00  1.00    0.03   0.00  0.97     0.30   0.37  0.33
  200    200    200      0.00   0.00  1.00    0.00   0.00  1.00     0.01   0.75  0.24

Table 3

Selection rates of \(\mathrm{TM}_{2}\), \(\mathrm{TM}_{\log n}\) and \(\mathrm{TM}_{\sqrt{n}}\) for \(p_*=p/4\)

                                  TM_2                  TM_{log n}            TM_{sqrt(n)}
    n_1      n_2   \(\alpha\)     Under  True  Over    Under  True  Over     Under  True  Over

\(p=100\), \(p_*=25\)
    100      100   1              0.99   0.00  0.01    1.00   0.00  0.00     1.00   0.00  0.00
    200      200   1              0.28   0.00  0.72    0.94   0.01  0.05     1.00   0.00  0.00
    500      500   1              0.00   0.00  1.00    0.01   0.36  0.63     1.00   0.00  0.00
   1000     1000   1              0.00   0.00  1.00    0.00   0.58  0.42     0.43   0.57  0.00
   2000     2000   1              0.00   0.00  1.00    0.00   0.73  0.27     0.00   1.00  0.00

\(p=200\), \(p_*=50\)
    200      200   1              1.00   0.00  0.00    1.00   0.00  0.00     1.00   0.00  0.00
    500      500   1              0.19   0.00  0.81    0.96   0.01  0.03     1.00   0.00  0.00
   1000     1000   1              0.00   0.00  1.00    0.02   0.17  0.81     1.00   0.00  0.00
   2000     2000   1              0.00   0.00  1.00    0.00   0.45  0.55     1.00   0.00  0.00
   5000     5000   1              0.00   0.00  1.00    0.00   0.70  0.30     0.00   1.00  0.00
  10000    10000   1              0.00   0.00  1.00    0.00   0.75  0.25     0.00   1.00  0.00

5 Ridge-type methods

When \(p > n-2\), \({\mathsf {S}}\) becomes singular, and so we cannot use \(\mathrm{TM}_{d}\). One way to overcome this problem is to use the ridge-type estimator of \({\varvec{\Sigma }}\) defined by
$$\begin{aligned} {\hat{{\varvec{\Sigma }}}}_{\lambda } = \frac{1}{n}\left\{ (n-2) {\mathsf {S}}+ \lambda \varvec{{I}}_p \right\} , \end{aligned}$$
(10)
where \(\lambda = (n-2)(np)^{-1} \mathrm{tr}\,{\mathsf {S}}\). Such an estimator was used in the multivariate regression model by Kubokawa and Srivastava (2012) and Fujikoshi and Sakurai (2016), among others.
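A minimal sketch of (10) is given below. It is our own illustration; we assume that \(\mathrm{TM}_d\) would then be applied with \({\hat{{\varvec{\Sigma }}}}_{\lambda }\) in place of \({\mathsf {S}}\) when computing the Mahalanobis distances.

```python
import numpy as np

def ridge_sigma(Y1, Y2):
    """Ridge-type estimator {(n-2)S + lambda I}/n with lambda = (n-2) tr(S)/(n p)."""
    n1, n2 = Y1.shape[0], Y2.shape[0]
    n, p = n1 + n2, Y1.shape[1]
    S = ((n1 - 1) * np.cov(Y1, rowvar=False) +
         (n2 - 1) * np.cov(Y2, rowvar=False)) / (n - 2)   # pooled covariance
    lam = (n - 2) * np.trace(S) / (n * p)
    return ((n - 2) * S + lam * np.eye(p)) / n
```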
The numerical experiment was done for \(p_*=3\), \(\varvec{\mu }_1 = \alpha (1, 1, 1, 0, \ldots , 0)'\), \(\varvec{\mu }_2 = \alpha (-1, -1, -1, 0, \ldots , 0)'\), and \({\varvec{\Sigma }}=\varvec{{I}}_p\), where \(\alpha =1, 2, 3\). Selection rates based on \(10^2\) replications are reported in Table 4. From Table 4, we can identify the following tendencies.
  • \(\mathrm{TM}_{2}\) does not have consistency. On the other hand, it seems that \(\mathrm{TM}_{\log n}\) and \(\mathrm{TM}_{\sqrt{n}}\) have consistency when the dimension p and the total sample size n are well separated.

Recently, ridge-type estimators of \({\varvec{\Sigma }}^{-1}\) have been proposed by Ito and Kubokawa (2015) and Van Wieringen and Peeters (2016). By the use of such estimators, it is expected that the consistency property will be improved, but this is left as future work.
Table 4

Selection rates of \(\mathrm{TM}_{2}\), \(\mathrm{TM}_{\log n}\) and \(\mathrm{TM}_{\sqrt{n}}\) for \(p_*=3\), \(\alpha = 1\) (ridge-type method)

                         TM_2                  TM_{log n}            TM_{sqrt(n)}
  n_1    n_2      p      Under  True  Over    Under  True  Over     Under  True  Over

   15     15     40      0.31   0.00  0.69    0.48   0.00  0.52     0.73   0.04  0.23
   25     25     60      0.15   0.00  0.85    0.30   0.00  0.70     0.52   0.00  0.48
   50     50    110      0.07   0.00  0.93    0.14   0.00  0.87     0.35   0.00  0.65
   15     15     90      0.14   0.29  0.57    0.48   0.48  0.04     0.91   0.09  0.00
   25     25    150      0.01   0.13  0.86    0.10   0.83  0.07     0.60   0.40  0.00
   50     50    300      0.00   0.02  0.98    0.00   0.95  0.05     0.08   0.93  0.00

6 Concluding remarks

In this paper, we proposed a test-based method (TM) for the variable selection problem, drawing on the significance of each variable. The method involves a constant term d and is denoted by \(\mathrm{TM}_d\). When \(d=2\) and \(d=\log n\), the corresponding \(\mathrm{TM}\)'s are related to \(\mathrm{AIC}\) and \(\mathrm{BIC}\), respectively. The usual model selection criteria such as \(\mathrm{AIC}\) and \(\mathrm{BIC}\) need to examine all the subsets, whereas \(\mathrm{TM}_{d}\) needs to examine only the p subsets \((-i), \ i=1, \ldots , p\). This circumvents the computational complexity associated with \(\mathrm{AIC}\) and \(\mathrm{BIC}\). Furthermore, it was shown that \(\mathrm{TM}_{d}\) has a high-dimensional consistency property for some d, including \(d=\sqrt{n}\), when (i) \(p_*\) is finite and \(\varDelta ^2=\mathrm{O}(1)\), and (ii) \(p_*\) is infinite and \(\varDelta ^2=\mathrm{O}(p)\). When the dimension is larger than the sample size, we propose ridge-type methods. However, a study of their theoretical properties is left as future work.

In discriminant analysis, it is important to select a set of variables that minimizes the expected probability of misclassification. When we use the linear discriminant function with cutoff point 0, its expected probability of misclassification is
$$\begin{aligned} R=\rho _1 P(2 | 1) + \rho _2 P(1 | 2), \end{aligned}$$
where P(2|1) is the probability of putting a new observation into \(\varPi _2\), although it is known from \(\varPi _1\). Similarly, P(1|2) is the probability of putting a new observation into \(\varPi _1\), although it is known from \(\varPi _2\). An estimator is used as a measure of selection of variables. Usually, the prior probabilities \(\rho _i\) are unknown, so we use an estimator with \(\rho _1=\rho _2=1/2\) as a conventional approach. Then, we have a variable selection method based on an estimator when a subset of variables, \({\varvec{y}}_j\) is used. For example, under a large-sample asymptotic framework is used, which is given (see, e.g., Fujikoshi 1985) by
$$\begin{aligned} {\hat{R}}_{j}=\varPhi (Q_j), \end{aligned}$$
where \(\varPhi\) is the distribution function of the standard normal distribution, and
$$\begin{aligned} Q_j= -\frac{1}{2}D_j+\frac{1}{2}(p-1)\left\{ n_1^{-1}+n_2^{-1}\right\} D_j^{-1} +(32m)^{-1}D_j\left\{ 4(4p-1)-D_j^2\right\} , \end{aligned}$$
\(m=n_1+n_2-2\), and \(D_j\) is the sample Mahalanobis distance based on \({\varvec{y}}_j\). For an estimator under a high-dimensional asymptotic framework, see, e.g., Fujikoshi (2000), Hyodo and Kubokawa (2014), and Yamada et al. (2017). A usual variable selection method is to select the subset minimizing \(Q_j\). However, such a method has no consistency, as in Fujikoshi (1985), and becomes computationally onerous when the dimension is large. On the other hand, along the same line as the test-based method, we can propose a variable selection method defined by selecting the set of suffixes given by
$$\begin{aligned} \mathrm{MM}_{d}=\left\{ i \in \omega \ | \ Q_i-Q_{\omega }> d \right\} , \end{aligned}$$
(11)
where d is a constant; the problem of finding an optimal d is left as a future subject. Furthermore, as pointed out by one of the reviewers, the variable selection method introduced in this paper should be compared with methods based on estimators of the misclassification probability.
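For reference, a sketch of the estimator \({\hat{R}}_{j}=\varPhi (Q_j)\) displayed above is given below. This is our own illustrative code; the argument p mirrors the dimension appearing in the expansion as written in the text, and D_j denotes the sample Mahalanobis distance for the subset j (cf. Sect. 2). The rule (11) would then compare such estimates across the subsets, with the choice of d left open as noted above.

```python
import numpy as np
from scipy.stats import norm

def R_hat(D_j, p, n1, n2):
    """Large-sample estimator Phi(Q_j) of the expected misclassification probability."""
    m = n1 + n2 - 2
    Q_j = (-0.5 * D_j
           + 0.5 * (p - 1) * (1.0 / n1 + 1.0 / n2) / D_j
           + D_j * (4 * (4 * p - 1) - D_j ** 2) / (32.0 * m))
    return norm.cdf(Q_j)
```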


Acknowledgements

We thank two referees for careful reading of our manuscript and many helpful comments which improved the presentation of this paper. The first author’s research is partially supported by the Ministry of Education, Science, Sports, and Culture, a Grant-in-Aid for Scientific Research (C), 16K00047, 2016–2018.

References

  1. Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In B. N. Petrov & F. Csáki (Eds.), 2nd International Symposium on Information Theory (pp. 267–281). Budapest: Akadémiai Kiadó.
  2. Clemmensen, L., Hastie, T., Witten, D. M., & Ersbøll, B. (2011). Sparse discriminant analysis. Technometrics, 53, 406–413.
  3. Fujikoshi, Y. (1985). Selection of variables in two-group discriminant analysis by error rate and Akaike's information criteria. Journal of Multivariate Analysis, 17, 27–37.
  4. Fujikoshi, Y. (2000). Error bounds for asymptotic approximations of the linear discriminant function when the sample size and dimensionality are large. Journal of Multivariate Analysis, 73, 1–17.
  5. Fujikoshi, Y., & Sakurai, T. (2016). High-dimensional consistency of rank estimation criteria in multivariate linear model. Journal of Multivariate Analysis, 149, 199–212.
  6. Fujikoshi, Y., Ulyanov, V. V., & Shimizu, R. (2010). Multivariate statistics: High-dimensional and large-sample approximations. Hoboken, NJ: Wiley.
  7. Fujikoshi, Y., Sakurai, T., & Yanagihara, H. (2014). Consistency of high-dimensional AIC-type and \(\text{ C }_p\)-type criteria in multivariate linear regression. Journal of Multivariate Analysis, 144, 184–200.
  8. Hao, N., Dong, B., & Fan, J. (2015). Sparsifying the Fisher linear discriminant by rotation. Journal of the Royal Statistical Society: Series B, 77, 827–851.
  9. Hyodo, M., & Kubokawa, T. (2014). A variable selection criterion for linear discriminant rule and its optimality in high dimensional and large sample data. Journal of Multivariate Analysis, 123, 364–379.
  10. Ito, T., & Kubokawa, T. (2015). Linear ridge estimator of high-dimensional precision matrix using random matrix theory. Discussion Paper Series, CIRJE-F-995.
  11. Kubokawa, T., & Srivastava, M. S. (2012). Selection of variables in multivariate regression models for large dimensions. Communications in Statistics - Theory and Methods, 41, 2465–2489.
  12. McLachlan, G. J. (1976). A criterion for selecting variables for the linear discriminant function. Biometrics, 32, 529–534.
  13. Nishii, R., Bai, Z. D., & Krishnaiah, P. R. (1988). Strong consistency of the information criterion for model selection in multivariate analysis. Hiroshima Mathematical Journal, 18, 451–462.
  14. Rao, C. R. (1973). Linear statistical inference and its applications (2nd ed.). New York: Wiley.
  15. Sakurai, T., Nakada, T., & Fujikoshi, Y. (2013). High-dimensional AICs for selection of variables in discriminant analysis. Sankhya, Series A, 75, 1–25.
  16. Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461–464.
  17. Tiku, M. (1985). Noncentral chi-square distribution. In S. Kotz & N. L. Johnson (Eds.), Encyclopedia of Statistical Sciences, Vol. 6 (pp. 276–280). New York: Wiley.
  18. Van Wieringen, W. N., & Peeters, C. F. (2016). Ridge estimation of inverse covariance matrices from high-dimensional data. Computational Statistics & Data Analysis, 103, 284–303.
  19. Witten, D. M., & Tibshirani, R. (2011). Penalized classification using Fisher's linear discriminant. Journal of the Royal Statistical Society: Series B, 73, 753–772.
  20. Yamada, T., Sakurai, T., & Fujikoshi, Y. (2017). High-dimensional asymptotic results for EPMCs of W- and Z-rules. Hiroshima Statistical Research Group, 17–12.
  21. Yanagihara, H., Wakaki, H., & Fujikoshi, Y. (2015). A consistency property of the AIC for multivariate linear models when the dimension and the sample size are large. Electronic Journal of Statistics, 9, 869–897.
  22. Zhao, L. C., Krishnaiah, P. R., & Bai, Z. D. (1986). On determination of the number of signals in presence of white noise. Journal of Multivariate Analysis, 20, 1–25.

Copyright information

© Japanese Federation of Statistical Science Associations 2019

Authors and Affiliations

  1. Department of Mathematics, Graduate School of Science, Hiroshima University, Higashi-Hiroshima, Japan
  2. School of General and Management Studies, Suwa University of Science, Chino, Japan
