## Introduction

With the development of modern measurement technology, we increasingly encounter high-dimension, low-sample-size (HDLSS) data. In this paper, we consider two-class linear discriminant analysis for HDLSS data. Suppose we have two independent d-variate populations, $$\Pi _i$$, $$i=1,2$$, where $$\Pi _i$$ has an unknown mean vector $$\varvec{\mu }_i$$ and an unknown covariance matrix $${\varvec{\Sigma }}_i$$. We have independent and identically distributed (i.i.d.) observations, $$\varvec{x}_{i1},\ldots ,\varvec{x}_{in_i}$$, from each $$\Pi _i$$, and assume $$n_i\ge 2$$, $$i=1,2$$. We also assume that

\begin{aligned} \limsup _{d\rightarrow \infty } \frac{\Vert \varvec{\mu }_i\Vert ^2}{d}<\infty , \quad \liminf _{d\rightarrow \infty } \frac{\text { tr }(\varvec{\Sigma }_i)}{d}>0, \quad \text { and } \quad \limsup _{d\rightarrow \infty } \frac{\text { tr }(\varvec{\Sigma }_i)}{d}<\infty \end{aligned}

for $$i=1,2$$, where $$\Vert \cdot \Vert$$ denotes the Euclidean norm. Let $$N=n_1+n_2$$. We simply write $$(\varvec{x}_1,\ldots ,\varvec{x}_N)=(\varvec{x}_{11},\ldots ,\varvec{x}_{1n_1},\varvec{x}_{21},\ldots ,\varvec{x}_{2n_2})$$. We denote the class label of $$\varvec{x}_j$$ by $$t_j$$, where $$t_j=-1$$ for $$j=1,\ldots ,n_1$$ and $$t_j=+1$$ for $$j=n_1+1,\ldots ,N$$. Let $$\varvec{x}_0$$ be an observation vector of an individual belonging to one of the $$\Pi _i$$s. We assume that $$\varvec{x}_0$$ and the $$\varvec{x}_{ij}$$s are independent.

In the HDLSS context, Hall et al. (2008), Chan and Hall (2009), and Aoshima and Yata (2014) considered distance-based classifiers. Aoshima and Yata (2019a) considered a distance-based classifier built on a data transformation technique. Aoshima and Yata (2011, 2015) considered geometric classifiers based on a geometric representation of HDLSS data. Aoshima and Yata (2019b) considered quadratic classifiers in general and discussed their optimality under high-dimension, non-sparse settings. In the field of machine learning, there are many studies on classification (supervised learning). A typical method is the support vector machine (SVM) developed by Vapnik (2000). Hall et al. (2005), Chan and Hall (2009), and Nakayama et al. (2017, 2020) investigated asymptotic properties of the SVM in the HDLSS context. Nakayama et al. (2017, 2020) pointed out the strong inconsistency of the SVM when the $$n_i$$s are imbalanced; they proposed bias-corrected SVMs and showed their superiority over the original SVM. On the other hand, Marron et al. (2007) pointed out that the SVM causes data piling in the HDLSS context. Data piling is the phenomenon that the projections of the training data onto the normal direction vector of a separating hyperplane coincide within each class. See Fig. 1 in Sect. 2. To avoid the data piling problem of the SVM, Marron et al. (2007) proposed the distance-weighted discrimination (DWD). Whereas the SVM finds the optimal hyperplane by maximizing the minimum distance from each class to the hyperplane, the DWD finds a proper hyperplane by minimizing the sum of the reciprocals of the distances from the data points to the hyperplane. Thus, the DWD takes account of all the data vectors, not only those used by the SVM. Unfortunately, the DWD is designed for balanced training data sets. See Qiao et al. (2010) and Qiao and Zhang (2015). For imbalanced training data sets, Qiao et al. (2010) developed the weighted DWD (WDWD), which imposes different weights on the two classes. However, the WDWD is sensitive to the choice of weights.

In this paper, we investigate the DWD and the WDWD theoretically in the HDLSS context where $$d\rightarrow \infty$$ while N is fixed. In Sect. 2, we review the DWD. In Sect. 3, we give asymptotic properties of the DWD. We show that the DWD incurs a large bias caused by heterogeneity of the covariance matrices as well as by sample imbalance. We propose a bias-corrected DWD (BC-DWD) and show that the BC-DWD enjoys consistency properties for the misclassification rates. In Sect. 4, we give asymptotic properties of the WDWD and propose an optimal choice of the weights in the WDWD. Finally, in Sects. 5 and 6, we examine the performance of the BC-DWD and the WDWD with the optimal weights in numerical simulations and real data analyses.

## Formulation of the DWD

In this section, we give a formulation of the DWD along the lines of Marron et al. (2007).

Let $$\varvec{w}\in {\mathbb {R}}^d$$ be a normal vector and $$b\in {\mathbb {R}}$$ an intercept term. Let $$r_j=t_j(\varvec{w}^{\mathrm{T}}\varvec{x}_j+ b)$$ for $$j=1,\ldots ,N$$. When the training data sets are linearly separable, the DWD is defined by minimizing the sum of $$1/r_j$$ over all observations. Note that HDLSS data are linearly separable by a hyperplane. Thus, the optimization problem of the DWD is as follows:

\begin{aligned} \min _{{\varvec{w}},b} \sum _{j=1}^{N}\frac{1}{r_j} \text{ subject } \text{ to } \ r_j=t_j(\varvec{w}^{\mathrm{T}}\varvec{x}_j+ b)>0,\quad j=1,\ldots ,N, \text{ and } \Vert \varvec{w}\Vert {\le }1. \end{aligned}

The dual problem of the above optimization problem can be written as

\begin{aligned} \max _{{\varvec{\alpha }},\lambda }\min _{{\varvec{w}},b, {\varvec{r}}} \left\{ \sum _{j=1}^{N}\frac{1}{r_j}+\sum _{j=1}^{N}\alpha _j(r_j-t_j(\varvec{w}^{\mathrm{T}}\varvec{x}_j+ b))+\lambda (\Vert \varvec{w}\Vert ^2-1)\right\} , \end{aligned}

subject to $$\alpha _j> 0$$, $$j=1,\ldots ,N$$, and $$\lambda > 0$$, where $$\varvec{\alpha }=(\alpha _1,\ldots ,\alpha _N)^{\mathrm{T}}$$, $$\varvec{r}=(r_1,\ldots , r_N)^{\mathrm{T}}$$, and $$\lambda$$ and $$\alpha _j$$s are Lagrange multipliers. Let

\begin{aligned} A=\sum _{j=1}^{N}\frac{1}{r_j}+\sum _{j=1}^{N}\alpha _j(r_j-t_j(\varvec{w}^{\mathrm{T}}\varvec{x}_j+ b))+\lambda (\Vert \varvec{w}\Vert ^2-1). \end{aligned}

Then, we have that

\begin{aligned}&\frac{\partial A}{\partial \varvec{r}}=\varvec{0}\Leftrightarrow r_j^2=\frac{1}{\alpha _j},\quad j=1,\ldots ,N; \nonumber \\&\frac{\partial A}{\partial \varvec{w}}=\varvec{0}\Leftrightarrow \varvec{w}=\frac{1}{2\lambda }\sum _{j=1}^{N}\alpha _j t_j\varvec{x}_j; \nonumber \\&\quad \text{ and }\quad \frac{\partial A}{\partial b}=0 \Leftrightarrow \sum _{j=1}^{N}\alpha _j t_j=0. \end{aligned}
(1)
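As a brief check (this substitution is left implicit in the original derivation), plugging the stationarity conditions (1) back into A eliminates $$\varvec{w}$$, b, and $$\varvec{r}$$:

\begin{aligned} A&=\sum _{j=1}^{N}\frac{1}{r_j}+\sum _{j=1}^{N}\alpha _j r_j-\varvec{w}^{\mathrm{T}}\sum _{j=1}^{N}\alpha _j t_j\varvec{x}_j -b\sum _{j=1}^{N}\alpha _j t_j +\lambda (\Vert \varvec{w}\Vert ^2-1)\\&=2\sum _{j=1}^{N}\sqrt{\alpha _j} -\frac{1}{2\lambda }\bigg \Vert \sum _{j=1}^{N}\alpha _j t_j\varvec{x}_j\bigg \Vert ^2 +\frac{1}{4\lambda }\bigg \Vert \sum _{j=1}^{N}\alpha _j t_j\varvec{x}_j\bigg \Vert ^2 -\lambda \\&=2\sum _{j=1}^{N}\sqrt{\alpha _j}-\frac{1}{4\lambda }\bigg \Vert \sum _{j=1}^{N}\alpha _j t_j\varvec{x}_j\bigg \Vert ^2-\lambda , \end{aligned}

since $$r_j=1/\sqrt{\alpha _j}$$ gives $$\sum _{j}1/r_j=\sum _{j}\alpha _j r_j=\sum _{j}\sqrt{\alpha _j}$$, $$\varvec{w}=\sum _{j}\alpha _jt_j\varvec{x}_j/(2\lambda )$$ gives $$\lambda \Vert \varvec{w}\Vert ^2=\Vert \sum _{j}\alpha _jt_j\varvec{x}_j\Vert ^2/(4\lambda )$$, and $$\sum _{j}\alpha _jt_j=0$$. This yields the objective in (2) below.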

Let

\begin{aligned} K(\varvec{\alpha })=\bigg \Vert \sum _{j=1}^{N} \alpha _j t_j\varvec{x}_j\bigg \Vert ^2. \end{aligned}

The optimization problem can be transformed into the following:

\begin{aligned} {\max _{{\varvec{\alpha }}, \lambda } } \left\{ 2\sum _{j=1}^{N}\sqrt{\alpha _j}-\frac{1}{4\lambda }K(\varvec{\alpha })-\lambda \right\} \end{aligned}
(2)

subject to $$\alpha _j> 0$$, $$j=1,\ldots ,N$$, $$\lambda > 0$$, and $$\sum _{j=1}^N\alpha _jt_j=0$$. Let

\begin{aligned} A^*=2\sum _{j=1}^{N}\sqrt{\alpha _j}-\frac{1}{4\lambda }K(\varvec{\alpha })-\lambda . \end{aligned}

Then, by noting that

\begin{aligned} \frac{\partial A^*}{\partial \lambda }=0 \Leftrightarrow \lambda =\frac{1}{2}\sqrt{K(\varvec{\alpha })} , \end{aligned}
(3)

we can rewrite the optimization problem (2) as follows:

\begin{aligned} \max _{ {\varvec{\alpha }}}\ \left\{ 2\sum _{j=1}^N\sqrt{\alpha _j} -\sqrt{K(\varvec{\alpha })} \right\} \end{aligned}

subject to

\begin{aligned} \alpha _j> 0,\ j=1,\ldots ,N, \quad \text{ and } \quad \sum _{j=1}^N\alpha _jt_j=0. \end{aligned}
(4)
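As a one-line check of this step: substituting (3) into the objective of (2) gives

\begin{aligned} 2\sum _{j=1}^N\sqrt{\alpha _j}-\frac{K(\varvec{\alpha })}{4\lambda }-\lambda =2\sum _{j=1}^N\sqrt{\alpha _j}-\frac{\sqrt{K(\varvec{\alpha })}}{2}-\frac{\sqrt{K(\varvec{\alpha })}}{2} =2\sum _{j=1}^N\sqrt{\alpha _j}-\sqrt{K(\varvec{\alpha })}. \end{aligned}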

Let

\begin{aligned} L(\varvec{\alpha }) =2\sum _{j=1}^N\sqrt{\alpha _j} -\sqrt{K(\varvec{\alpha })} \end{aligned}

and

\begin{aligned} {\hat{\varvec{\alpha }}} =({\hat{\alpha }}_1,\ldots ,{\hat{\alpha }}_N)^{\mathrm{T}}=\mathop {\hbox {argmax}}\limits _{ {\varvec{\alpha }}}L(\varvec{\alpha })\ \text{ subject } \text{ to } (4). \end{aligned}

Then, from (1) and (3), we write that

\begin{aligned} {\hat{\varvec{w}}} =\frac{1}{ \sqrt{K({\hat{\varvec{\alpha }}})} } \sum _{j=1}^N{\hat{\alpha }}_jt_j\varvec{x}_j. \end{aligned}

From the first equation in (1), the intercept term b satisfies

\begin{aligned} \frac{t_j}{\sqrt{{\hat{\alpha }}_j}}-{\hat{\varvec{w}}}^{\mathrm{T}}\varvec{x}_j \quad \text{ for } \text{ all } j=1,\ldots ,N. \end{aligned}

Thus, we consider estimating b by the average:

\begin{aligned} {\hat{b}} =\frac{1}{N}\sum _{j=1}^N \left( \frac{t_j}{\sqrt{{\hat{\alpha }}_j}} -{\hat{\varvec{w}}}^{\mathrm{T}}\varvec{x}_j\right) . \end{aligned}

Then, the classifier function of the DWD is defined by

\begin{aligned} {y}(\varvec{x}_0) ={\hat{\varvec{w}}}^{\mathrm{T}}\varvec{x}_0+{\hat{b}}. \end{aligned}

One classifies $$\varvec{x}_0$$ into $$\Pi _1$$ if $${y}(\varvec{x}_0 )<0$$ and into $$\Pi _2$$ otherwise.
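To make the pipeline above concrete, here is an illustrative sketch (the function names `dwd_fit`, `dwd_objective`, and `dwd_classify` are ours, and a simple projected-gradient ascent stands in for the second-order cone solvers usually used for the DWD). It maximizes $$L(\varvec{\alpha })$$ under (4) and then forms $${\hat{\varvec{w}}}$$, $${\hat{b}}$$, and $${y}(\varvec{x}_0)$$:

```python
import numpy as np

def dwd_fit(X, t, n_iter=5000, lr=5e-4, eps=1e-6):
    """Illustrative solver for the DWD dual (4) by projected gradient ascent.

    X: (N, d) training data; t: (N,) labels in {-1, +1}.
    Maximizes L(alpha) = 2*sum(sqrt(alpha)) - sqrt(K(alpha))
    subject to alpha_j > 0 and sum_j alpha_j t_j = 0.
    """
    t = np.asarray(t, dtype=float)
    N = len(t)
    n1, n2 = int(np.sum(t < 0)), int(np.sum(t > 0))
    # feasible start: constant within each class, so sum(alpha * t) = 0
    alpha = np.where(t < 0, 1.0 / (2 * n1), 1.0 / (2 * n2))
    for _ in range(n_iter):
        u = X.T @ (alpha * t)                        # sum_j alpha_j t_j x_j
        root_K = max(np.linalg.norm(u), eps)         # sqrt(K(alpha))
        grad = 1.0 / np.sqrt(alpha) - t * (X @ u) / root_K
        grad -= (grad @ t) / N * t                   # project onto {v : v.t = 0}
        alpha = np.clip(alpha + lr * grad, eps, None)
        alpha -= (alpha @ t) / N * t                 # re-project after clipping
    u = X.T @ (alpha * t)
    w_hat = u / np.linalg.norm(u)                    # hat{w}
    b_hat = np.mean(t / np.sqrt(alpha) - X @ w_hat)  # hat{b}
    return w_hat, b_hat, alpha

def dwd_objective(X, t, alpha):
    # L(alpha) = 2*sum(sqrt(alpha)) - sqrt(K(alpha))
    return 2.0 * np.sum(np.sqrt(alpha)) - np.linalg.norm(X.T @ (alpha * t))

def dwd_classify(x0, w_hat, b_hat):
    # class Pi_1 if y(x0) < 0, Pi_2 otherwise
    return 1 if w_hat @ x0 + b_hat < 0 else 2
```

Because $$L(\varvec{\alpha })$$ is concave on the feasible set, small ascent steps suffice for illustration; a production implementation would use a dedicated SOCP solver.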

Now, let us use the following toy example to illustrate data piling. We set $$n_1=n_2=25$$ and $$d=2^s$$, $$s=5,\ldots ,8$$. Independent pseudo-random observations were generated from $$\Pi _i:N_d(\varvec{\mu }_i,\varvec{\Sigma }_i)$$. We set $$\varvec{\mu }_1=\varvec{0}$$, $$\varvec{\mu }_2=(1,\ldots ,1,0,\ldots ,0)^{\mathrm{T}}$$ whose first $$\lceil d^{2/3} \rceil$$ elements are 1, and $$\varvec{\Sigma }_1=\varvec{\Sigma }_2={\varvec{I}}_d$$, where $$\lceil x \rceil$$ denotes the smallest integer $$\ge x$$ and $$\varvec{I}_d$$ denotes the d-dimensional identity matrix. Let $${y}_{{\mathrm{SVM}}}(\cdot )$$ be the classifier function of the (linear) SVM. In Fig. 1, we give histograms of the $${y}_{{\mathrm{SVM}}}(\varvec{x}_j)$$s and the normalized $${y}(\varvec{x}_j)$$s.

We observed that, as d increases, the training data points for the SVM concentrate at $$-1$$ when $$\varvec{x}_j \in \Pi _1$$ and at 1 when $$\varvec{x}_j \in \Pi _2$$. This phenomenon is data piling. See Nakayama et al. (2017) for the theoretical reason. On the other hand, the training data points for the DWD did not exhibit this phenomenon. We emphasize that the DWD takes account of all the data vectors, not only those used by the SVM. However, in the next section, we show that the DWD incurs a large bias caused by heterogeneity of the covariance matrices as well as by sample imbalance.

## Asymptotic properties of the DWD and its bias correction

In this section, we first give asymptotic properties of the DWD in the HDLSS context. We show that the DWD incurs a large bias caused by heterogeneity of the covariance matrices as well as by sample imbalance. To overcome these difficulties, we propose a bias-corrected DWD.

### Asymptotic properties of the DWD

Let $$\Delta ={\Vert \varvec{\mu }_1-\varvec{\mu }_2\Vert }$$. For $$K(\varvec{\alpha })$$, we have the following result.

### Lemma 1

Assume

(C-i):
$$\displaystyle \frac{ \text{ Var } (\Vert {\varvec{x}}_{ij}-{\varvec{\mu }}_i\Vert ^2)}{\Delta ^4}\rightarrow 0 \text{ as } d\rightarrow \infty \text{ for } i=1,2; \text{ and }$$
(C-ii):
$$\displaystyle \frac{ \text{ tr } (\varvec{\Sigma }_i^2)}{\Delta ^4}\rightarrow 0 \text{ as } d\rightarrow \infty \text{ for } i=1,2.$$

Under (4), it holds that as $$d\rightarrow \infty$$

\begin{aligned} K(\varvec{\alpha })=&\frac{\Delta ^2}{4}\left( \sum _{j=1}^N\alpha _j\right) ^2\{1+o_p(1)\} \\&+\left( \text { tr }(\varvec{\Sigma }_1)\sum _{j=1}^{n_1}\alpha _j^2+ \text { tr }(\varvec{\Sigma }_2)\sum _{j=n_1+1}^{N}\alpha _j^2\right) . \end{aligned}

### Remark 1

The conditions (C-i) and (C-ii) are equivalent when $$\Pi _i$$s are Gaussian because $$\text { Var }(\Vert {\varvec{x}}_{ij}-{\varvec{\mu }}_i\Vert ^2)=2\text { tr }(\varvec{\Sigma }_i^2)$$. If $$\text { tr }(\varvec{\Sigma }_i^2)=O(d),\ i=1,2,$$ and $$\Delta ^2/d^{1/2}\rightarrow \infty$$ as $$d\rightarrow \infty$$, (C-ii) is satisfied. See Sect. 5 for several models satisfying (C-ii).

We consider maximizing $$L(\varvec{\alpha })$$. From Jensen’s inequality, we note that $$\sum _{j=1}^{n_1} \sqrt{\alpha _j}/n_1\le \sqrt{ \sum _{j=1}^{n_1}\alpha _j/n_1 }$$, $$\sum _{j=n_1+1}^{N} \sqrt{\alpha _j}/n_2\le \sqrt{ \sum _{j=n_1+1}^{N}\alpha _j/n_2 }$$, $$\sum _{j=1}^{n_1}$$ $$\alpha _j^2/n_1\ge ( \sum _{j=1}^{n_1}\alpha _j/n_1 )^2$$ and $$\sum _{j=n_1+1}^{N} \alpha _j^2/n_2 \ge ( \sum _{j=n_1+1}^{N}\alpha _j/n_2 )^2$$. In addition, note that $$\sum _{j=1}^{n_1}\alpha _j=\sum _{j=n_1+1}^{N}\alpha _j=\sum _{j=1}^N\alpha _j/2$$ under (4). Then, under (4) and the constraint that $$\sum _{j=1}^N\alpha _j=B$$ for a given positive constant B, we can claim that

\begin{aligned}&\max _{{{\varvec{\alpha }}}}\sum _{j=1}^N \sqrt{\alpha _j}=\sqrt{B} (\sqrt{n_1/2}+ \sqrt{n_2 /2}) \ \text { and } \nonumber \\&\quad \min _{ {{\varvec{\alpha }}}} \left( \text { tr }(\varvec{\Sigma }_1)\sum _{j=1}^{n_1}\alpha _j^2+\text { tr }(\varvec{\Sigma }_2)\sum _{j=n_1+1}^{N}\alpha _j^2\right) \nonumber \\&\qquad =\frac{B^2}{4}\left( \frac{\text { tr }(\varvec{\Sigma }_1)}{n_1}+\frac{\text { tr }(\varvec{\Sigma }_2)}{n_2}\right) \end{aligned}
(5)

when $$\alpha _1=\cdots =\alpha _{n_1}=B/(2n_1)$$ and $$\alpha _{n_1+1}=\cdots =\alpha _{N}=B/(2n_2)$$. Thus, from Lemma 1, under (C-i) and (C-ii), it holds that

\begin{aligned} \max _{ {\varvec{\alpha }}} L(\varvec{\alpha }) =&-\frac{1}{2}\Delta _{*} \Bigg (\sqrt{B}-\frac{\sqrt{2n_1}+\sqrt{2n_2}}{\Delta _{*} }\{1+o_p(\zeta )\} \Bigg )^2\{1+o_p(\zeta )\} \nonumber \\&+\frac{(\sqrt{n_1}+\sqrt{n_2})^2}{\Delta _{*}}\{1+o_p(\zeta )\}, \end{aligned}
(6)

where

\begin{aligned} \Delta _{*}=\sqrt{ \Delta ^{2}+\text { tr }({\varvec{\Sigma }}_1)/n_1+\text { tr }({\varvec{\Sigma }}_2)/n_2} \quad \text { and } \quad \zeta =\Delta ^{2}/\Delta _*^{2}. \end{aligned}

Hence, by choosing $$B\approx 2(\sqrt{n_1}+\sqrt{n_2})^2/\Delta _{*}^2$$, the maximum of $$L(\varvec{\alpha })$$ is attained asymptotically.
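As a small numerical check of (5) (the values of $$n_i$$, B, and $$\text { tr }(\varvec{\Sigma }_i)$$ below are arbitrary stand-ins), the within-class-constant allocation attains the stated extremes among feasible $$\varvec{\alpha }$$s:

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2, B = 5, 15, 2.0
tr1, tr2 = 120.0, 80.0           # stand-ins for tr(Sigma_1), tr(Sigma_2)

# the within-class-constant allocation attaining (5)
alpha_eq = np.r_[np.full(n1, B / (2 * n1)), np.full(n2, B / (2 * n2))]
sqrt_max = np.sqrt(B) * (np.sqrt(n1 / 2) + np.sqrt(n2 / 2))
quad_min = (B ** 2 / 4) * (tr1 / n1 + tr2 / n2)

# compare against random feasible alphas (positive, each class summing to B/2)
ok = True
for _ in range(1000):
    a1 = rng.random(n1); a1 *= (B / 2) / a1.sum()
    a2 = rng.random(n2); a2 *= (B / 2) / a2.sum()
    ok &= np.sum(np.sqrt(np.r_[a1, a2])) <= sqrt_max + 1e-9
    ok &= tr1 * np.sum(a1 ** 2) + tr2 * np.sum(a2 ** 2) >= quad_min - 1e-9
```

No random feasible allocation beats the constant one on either criterion, in line with the Jensen-inequality argument.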

### Lemma 2

Under (C-i) and (C-ii), it holds that as $$d\rightarrow \infty$$

\begin{aligned}&{\hat{\alpha }}_{j} =\frac{(\sqrt{n_1}+\sqrt{n_2})^2}{n_1\Delta _{*}^{2}}\{1+o_p(\zeta )\}\quad \text{ for } j=1,\ldots ,n_1;\quad \text{ and } \\&{\hat{\alpha }}_{j} =\frac{(\sqrt{n_1}+\sqrt{n_2})^2}{n_2\Delta _{*}^{2}}\{1+o_p(\zeta )\}\quad \text{ for } j=n_1+1,\ldots ,N. \end{aligned}

Furthermore,  it holds that as $$d\rightarrow \infty$$

\begin{aligned} {y}(\varvec{x}_0) =\frac{\Delta ^2}{ \Delta _*} \left( \frac{(-1)^i}{2}+\delta +o_p(1) \right) \end{aligned}
(7)

when $$\varvec{x}_0\in \Pi _i$$ for $$i=1,2,$$ where

\begin{aligned} \delta =&-\frac{(n_1^{3/2}-n_2^{3/2})\Delta _*^2}{N(\sqrt{n_1}+\sqrt{n_2}) \Delta ^2} +\frac{n_1-n_2}{2N} \\&+\frac{\text{ tr }(\varvec{\Sigma }_1)-\text{ tr }(\varvec{\Sigma }_2)}{N \Delta ^2 }. \end{aligned}

The quantity $$\delta$$ vanishes if $$n_1=n_2$$ and $$\varvec{\Sigma }_1=\varvec{\Sigma }_2$$. We consider the following assumption:

(C-iii):

$$\limsup _{d\rightarrow \infty } |\delta | <\frac{1}{2}$$.

Let e(i) denote the error rate of misclassifying an individual from $$\Pi _i$$ into the other class for $$i=1,2$$. Then, we have the following results.

### Theorem 1

Under (C-i), (C-ii), and (C-iii), it holds for the DWD that

\begin{aligned} e(1)\rightarrow 0 \quad \text{ and } \quad e(2)\rightarrow 0 \qquad \text{ as } d\rightarrow \infty . \end{aligned}
(8)

However, without (C-iii), we have the following results.

### Corollary 1

Under (C-i) and (C-ii), it holds for the DWD that

\begin{aligned}&e(1)\rightarrow 1 \ \text{ and } \ e(2)\rightarrow 0 \quad \text{ as } d\rightarrow \infty \text{ if } \liminf _{d\rightarrow \infty }\delta >\frac{1}{2}; \quad \text{ and } \\&e(1)\rightarrow 0 \ \text{ and } \ e(2)\rightarrow 1 \quad \text{ as } d\rightarrow \infty \text{ if } \limsup _{d\rightarrow \infty } \delta <-\frac{1}{2}. \end{aligned}

### Remark 2

For the DWD, Hall et al. (2005) and Qiao et al. (2010) established a consistency property as in Theorem 1 and inconsistency properties as in Corollary 1 under different conditions. However, we claim that (C-i), (C-ii), and (C-iii) are milder than their conditions.

From Corollary 1, the DWD suffers strong inconsistency because of the large bias caused by the heterogeneity of covariance matrices as well as sample imbalance. For example, if $$\text{ tr }(\varvec{\Sigma }_i)/ \Delta ^2 \rightarrow \infty$$ as $$d\rightarrow \infty$$ for some i, then $$|\delta |$$ tends to become large as d increases when $$\text{ tr }(\varvec{\Sigma }_1)\ne \text{ tr }(\varvec{\Sigma }_2)$$ or $$n_1 \ne n_2$$. To overcome these difficulties, we propose a bias-corrected DWD.

### Bias-corrected DWD

We consider an unbiased estimator of $$\Delta ^2$$ as follows:

\begin{aligned} {\hat{\Delta }}^2=\Vert {\overline{\varvec{x}}}_{1}-{\overline{\varvec{x}}}_{2} \Vert ^2-\text{ tr }(\varvec{S}_{1})/n_1-\text{ tr }(\varvec{S}_2)/n_2, \end{aligned}

where $${\overline{\varvec{x}}}_{i}=\sum _{j=1}^{n_i}{\varvec{x}_{ij}}/{n_i}$$ and $$\varvec{S}_{i}=\sum _{j=1}^{n_i}(\varvec{x}_{ij}-{\overline{\varvec{x}}}_{i})(\varvec{x}_{ij}-{\overline{\varvec{x}}}_{i})^{\mathrm{T}}/(n_i-1)$$ for $$i=1,2$$. Note that $$E({\hat{\Delta }}^2)=\Delta ^2$$. From $$\text { tr }(\varvec{\Sigma }_1\varvec{\Sigma }_2)\le \sqrt{\text { tr }(\varvec{\Sigma }_1^2)\text { tr }(\varvec{\Sigma }_2^2)}$$ and $$({\varvec{\mu }}_1-{\varvec{\mu }}_2)^{\mathrm {T}}\varvec{\Sigma }_i({\varvec{\mu }}_1-{\varvec{\mu }}_2)\le {{\Delta }} ^2 \sqrt{ \text { tr }({\varvec{\Sigma }}_i^2)}$$, under (C-ii), it holds that as $$d\rightarrow \infty$$

\begin{aligned} \text{ Var }({\hat{\Delta }}^2/\Delta ^2)&=\sum _{i=1}^2 \frac{2\text{ tr }(\varvec{\Sigma }_i^2)}{n_i(n_i-1)\Delta ^4} + \frac{4\text{ tr }(\varvec{\Sigma }_1\varvec{\Sigma }_2)}{n_1n_2\Delta ^4}\\&\quad +\sum _{i=1}^2\frac{4(\varvec{\mu }_1-\varvec{\mu }_2)^{\mathrm{T}}\varvec{\Sigma }_i(\varvec{\mu }_1-\varvec{\mu }_2)}{n_i\Delta ^4} \\&=o(1). \end{aligned}

Thus, under (C-ii), from Chebyshev’s inequality, it holds that as $$d\rightarrow \infty$$

\begin{aligned} {\hat{\Delta }}^2=\Delta ^2+o_p(\Delta ^2). \end{aligned}
(9)

See also Aoshima and Yata (2018) for asymptotic properties of $${\hat{\Delta }}^2$$. On the other hand, we consider an unbiased estimator of $$\Delta _*^2$$ as follows:

\begin{aligned} {\hat{\Delta }}_*^2=\Vert {\overline{\varvec{x}}}_{1}-{\overline{\varvec{x}}}_{2} \Vert ^2. \end{aligned}

Note that $$E({\hat{\Delta }}_*^2)=\Delta _*^2$$. Here, we write that

\begin{aligned} \text{ tr }(\varvec{S}_i)-\text{ tr }(\varvec{\Sigma }_i)=&\sum _{j=1}^{n_i}\frac{\Vert \varvec{x}_{ij}-\varvec{\mu }_i\Vert ^2-\text{ tr }(\varvec{\Sigma }_i) }{n_i}\\&-\sum _{j\ne j'}^{n_i}\frac{(\varvec{x}_{ij}-\varvec{\mu }_i)^{\mathrm{T}}(\varvec{x}_{ij'}-\varvec{\mu }_i)}{n_i(n_i-1)}. \end{aligned}

Then, by noting that $$\text{ Var }\{ \sum _{j\ne j'}^{n_i} (\varvec{x}_{ij}-\varvec{\mu }_i)^{\mathrm{T}}(\varvec{x}_{ij'}-\varvec{\mu }_i)/n_i^2 \} =O\{ \text{ tr }(\varvec{\Sigma }_i^2)\}$$ and $$\text{ Var }\{\Vert \varvec{x}_{ij}-\varvec{\mu }_i\Vert ^2-\text{ tr }(\varvec{\Sigma }_i)\}=\text{ Var }(\Vert \varvec{x}_{ij}-\varvec{\mu }_i\Vert ^2)$$, under (C-i) and (C-ii), from (9), it holds that as $$d\rightarrow \infty$$

\begin{aligned} \text{ tr }(\varvec{S}_{i})&= \text{ tr }(\varvec{\Sigma }_i)+o_p(\Delta ^2),\ i=1,2,\quad \text{ and }\nonumber \\&{\hat{\Delta }}_*^2=\Delta _*^2+o_p(\Delta ^2). \end{aligned}
(10)
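A minimal sketch of these estimators (the function name `estimators` is ours) computes $${\hat{\Delta }}^2$$, $${\hat{\Delta }}_*^2$$, and $$\text{ tr }(\varvec{S}_i)$$ directly from the two samples:

```python
import numpy as np

def estimators(X1, X2):
    """Compute hat{Delta}^2, hat{Delta}_*^2, tr(S_1), tr(S_2).

    X_i: (n_i, d) observations from Pi_i.
    """
    n1, n2 = X1.shape[0], X2.shape[0]
    diff = X1.mean(axis=0) - X2.mean(axis=0)         # xbar_1 - xbar_2
    tr_S1 = X1.var(axis=0, ddof=1).sum()             # tr(S_1)
    tr_S2 = X2.var(axis=0, ddof=1).sum()             # tr(S_2)
    d_star_sq = diff @ diff                          # hat{Delta}_*^2
    d_sq = d_star_sq - tr_S1 / n1 - tr_S2 / n2       # bias-corrected hat{Delta}^2
    return d_sq, d_star_sq, tr_S1, tr_S2
```

By construction, $${\hat{\Delta }}_*^2={\hat{\Delta }}^2+\text{ tr }(\varvec{S}_1)/n_1+\text{ tr }(\varvec{S}_2)/n_2$$ holds exactly in finite samples.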

Let $$\delta _*=\Delta ^2 \delta /\Delta _*$$ and $${{\hat{\Delta }}_*=\sqrt{{\hat{\Delta }}_*^2}}$$. We consider an estimator of $$\delta _*$$ as follows:

\begin{aligned} {\hat{\delta }}_*= -\frac{(n_1^{3/2}-n_2^{3/2}){\hat{\Delta }}_*}{N(\sqrt{n_1}+\sqrt{n_2})} +\frac{(n_1-n_2){\hat{\Delta }}^2}{2N{\hat{\Delta }}_*} +\frac{\text{ tr }(\varvec{S}_1)-\text{ tr }(\varvec{S}_2)}{N {\hat{\Delta }}_*}. \end{aligned}

Then, under (C-i) and (C-ii), from (9) and (10), it holds that as $$d\rightarrow \infty$$

\begin{aligned} {\hat{\delta }}_*=\delta _*+o_p( \Delta ^2/\Delta _*). \end{aligned}
(11)

Now, we define the bias-corrected DWD (BC-DWD) by

\begin{aligned} {y}_{{\mathrm{BC}}}(\varvec{x}_0)=y(\varvec{x}_0)-{\hat{\delta }}_*. \end{aligned}

One classifies $$\varvec{x}_0$$ into $$\Pi _1$$ if $${y}_{{\mathrm{BC}}}(\varvec{x}_0)<0$$ and into $$\Pi _2$$ otherwise. Then, from Lemma 2 and (11), we have the following result.

### Theorem 2

For the BC-DWD,  (8) holds under (C-i) and (C-ii).

We emphasize that the BC-DWD enjoys the asymptotic consistency without assuming (C-iii). See Sect. 5 for numerical comparisons.
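For concreteness, a sketch of the correction step (function names ours) computes $${\hat{\delta }}_*$$ from the sample quantities above and subtracts it from the DWD score:

```python
import numpy as np

def delta_star_hat(n1, n2, d_sq, d_star_sq, tr_S1, tr_S2):
    """Plug-in estimator hat{delta}_* used by the BC-DWD.

    Inputs are hat{Delta}^2, hat{Delta}_*^2, tr(S_1), tr(S_2).
    """
    N = n1 + n2
    d_star = np.sqrt(d_star_sq)
    return (-(n1 ** 1.5 - n2 ** 1.5) * d_star / (N * (np.sqrt(n1) + np.sqrt(n2)))
            + (n1 - n2) * d_sq / (2 * N * d_star)
            + (tr_S1 - tr_S2) / (N * d_star))

def y_bc(y_value, correction):
    # BC-DWD score: subtract the estimated bias from the DWD score y(x0)
    return y_value - correction
```

As claimed in the text, the correction vanishes in the balanced, homogeneous case ($$n_1=n_2$$ and $$\text{ tr }(\varvec{S}_1)=\text{ tr }(\varvec{S}_2)$$).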

## WDWD and its asymptotic properties

Qiao et al. (2010) developed the WDWD to overcome the weakness of the DWD for sample imbalance. The optimization problem of the WDWD is as follows:

\begin{aligned}&\min _{{{\varvec{w}}},b} \left\{ W(-1)\sum _{j=1}^{n_1}\frac{1}{r_j} +W(+1)\sum _{j=n_1+1}^{N}\frac{1}{r_j} \right\} \\&\text{ subject } \text{ to } r_j=t_j(\varvec{w}^{\mathrm{T}}\varvec{x}_j+ b)>0,\quad j=1,\ldots ,N, \text{ and } \Vert \varvec{w}\Vert {\le }1 \end{aligned}

where $$W(-1)\ (>0)$$ and $$W(+1)\ (>0)$$ are weights. In this paper, we assume $$W(-1)=1$$ without loss of generality. We also assume that

\begin{aligned} \liminf _{d\rightarrow \infty }W(+1)>0 \quad \text{ and } \quad \limsup _{d\rightarrow \infty }W(+1)<\infty . \end{aligned}

Then, similarly to the DWD, the dual optimization problem can be written as follows:

\begin{aligned} \max _{ {\varvec{\alpha }}}L_{{\mathrm{W}}}(\varvec{\alpha }) \end{aligned}

subject to (4), where

\begin{aligned} L_{{\mathrm{W}}}(\varvec{\alpha })=2\sum _{j=1}^{n_1}\sqrt{\alpha _j} +2\sum _{j=n_1+1}^{N}\sqrt{W(+1)\alpha _j} -\sqrt{K(\varvec{\alpha })}. \end{aligned}

Let us write that

\begin{aligned} {\hat{\varvec{\alpha }}}_{{\mathrm{W}}}=({\hat{\alpha }}_{1{\mathrm{W}}},\ldots ,{\hat{\alpha }}_{N{\mathrm{W}}})^{\mathrm{T}}=\mathop {\hbox {argmax}}\limits _{ {\varvec{\alpha }}}L_{{\mathrm{W}}}(\varvec{\alpha })\ \text{ subject } \text{ to } (4). \end{aligned}

Similarly to the DWD, we obtain the classifier function of the WDWD:

\begin{aligned} {y}_{{\mathrm{W}}}(\varvec{x}_0) ={\hat{\varvec{w}}}_{{\mathrm{W}}}^{\mathrm{T}}\varvec{x}_0+{\hat{b}}_{{\mathrm{W}}}, \end{aligned}

where

\begin{aligned} {\hat{\varvec{w}}}_{{\mathrm{W}}} =\frac{1}{\sqrt{K({\hat{\varvec{\alpha }}}_{{\mathrm{W}}})}} \sum _{j=1}^N{\hat{\alpha }}_{j{\mathrm{W}}} t_j\varvec{x}_j \end{aligned}

and

\begin{aligned} {\hat{b}}_{{\mathrm{W}}}= & {} \frac{1}{N}\sum _{j=1}^{n_1} \left( -\sqrt{\frac{1}{{\hat{\alpha }}_{j{\mathrm{W}}}}}-{\hat{\varvec{w}}}_{{\mathrm{W}}}^{\mathrm{T}}\varvec{x}_j\right) \nonumber \\&+\frac{1}{N}\sum _{j=n_1+1}^{N} \left( \sqrt{\frac{W(+1)}{{\hat{\alpha }}_{j{\mathrm{W}}}}}-{\hat{\varvec{w}}}_{{\mathrm{W}}}^{\mathrm{T}}\varvec{x}_j\right) . \end{aligned}

Then, one classifies $$\varvec{x}_0$$ into $$\Pi _1$$ if $${y}_{{\mathrm{W}}}(\varvec{x}_0)<0$$ and into $$\Pi _2$$ otherwise.

As with the DWD, we have the following result.

### Lemma 3

Under (C-i) and (C-ii), it holds that as $$d\rightarrow \infty$$

\begin{aligned}&{\hat{\alpha }}_{j{\mathrm{W}}} =\frac{(\sqrt{n_1}+\sqrt{n_2}\sqrt{W(+1)})^2}{n_1\Delta _{*}^2}\{1+o_p(\zeta )\}\quad \text{ for } j=1,\ldots ,n_1;\quad \text{ and } \\&{\hat{\alpha }}_{j{\mathrm{W}}} =\frac{(\sqrt{n_1}+\sqrt{n_2}\sqrt{W(+1)})^2}{n_2\Delta _{*}^2}\{1+o_p(\zeta )\}\quad \text{ for } j=n_1+1,\ldots ,N. \end{aligned}

Furthermore,  it holds that as $$d\rightarrow \infty$$

\begin{aligned} {y}_{{\mathrm{W}}}(\varvec{x}_0) =\frac{\Delta ^2}{{\Delta }_*} \left( \frac{(-1)^i}{2}+\delta _{{\mathrm{W}}}+o_p(1) \right) \end{aligned}
(12)

when $$\varvec{x}_0\in \Pi _i$$ for $$i=1,2,$$ where

\begin{aligned} \delta _{{\mathrm{W}}}=&-\frac{(n_1^{3/2}-n_2^{3/2}\sqrt{W(+1)})\Delta _*^2}{N(\sqrt{n_1}+\sqrt{n_2}\sqrt{W(+1)} ) \Delta ^2} +\frac{n_1-n_2}{2N} \\&+\frac{\text{ tr }(\varvec{\Sigma }_1)-\text{ tr }(\varvec{\Sigma }_2)}{N \Delta ^2 }. \end{aligned}

We consider the following assumption:

(C-iv):

$$\limsup _{d\rightarrow \infty }|\delta _{{\mathrm{W}}}|<\frac{1}{2}$$.

Then, we have the following results.

### Theorem 3

For the WDWD,  (8) holds under (C-i), (C-ii), and (C-iv).

### Corollary 2

Under (C-i) and (C-ii), it holds for the WDWD that

\begin{aligned}&e(1)\rightarrow 1 \ \text{ and } \ e(2)\rightarrow 0 \ \text{ as } d\rightarrow \infty \text{ if } \liminf _{d\rightarrow \infty }\delta _{{\mathrm{W}}} >\frac{1}{2}; \quad \text{ and } \\&e(1)\rightarrow 0 \ \text{ and } \ e(2)\rightarrow 1 \ \text{ as } d\rightarrow \infty \text{ if } \limsup _{d\rightarrow \infty } \delta _{{\mathrm{W}}} <-\frac{1}{2}. \end{aligned}

For the WDWD, Qiao et al. (2010) recommended using $$W(+1)=n_1/n_2$$ in the case of equal costs. See Table 3 in Qiao et al. (2010). However, if $$\text{ tr }(\varvec{\Sigma }_i)/ \Delta ^2 \rightarrow \infty$$ as $$d\rightarrow \infty$$ for some i, then $$|\delta _{{\mathrm{W}}}|$$ with $$W(+1)=n_1/n_2$$ tends to become large as d increases when $$\text{ tr }(\varvec{\Sigma }_1)\ne \text{ tr }(\varvec{\Sigma }_2)$$ or $$n_1 \ne n_2$$. Thus, from Corollary 2, the WDWD still suffers strong inconsistency because of the large bias caused by the heterogeneity of covariance matrices as well as sample imbalance.

To overcome such difficulties, we propose an optimal choice of $$W(+1)$$ in the WDWD. Let

\begin{aligned} W_0=\frac{n_1\{\Delta ^2+2\text{ tr }(\varvec{\Sigma }_2)/n_2\}^2}{n_2\{\Delta ^2+2\text{ tr }(\varvec{\Sigma }_1)/n_1\}^2}. \end{aligned}

We claim that

\begin{aligned} \delta _{{\mathrm{W}}}=0 \quad \text{ when } W(+1)=W_0. \end{aligned}

Thus, we consider the estimator of $$W_0$$ as follows:

\begin{aligned} {\hat{W}}_0=\frac{n_1\{{\hat{\Delta }}^2+2\text{ tr }(\varvec{S}_2)/n_2\}^2}{n_2\{{\hat{\Delta }}^2+2\text{ tr }(\varvec{S}_1)/n_1\}^2}. \end{aligned}

Then, under (C-i) and (C-ii), from (9) and (10), it holds that as $$d\rightarrow \infty$$

\begin{aligned} {\hat{W}}_0={W}_0\{1+o_p(\zeta )\}, \end{aligned}

so that

\begin{aligned} \delta _{{\mathrm{W}}}=o_p(1) \quad \text{ when } W(+1)= {\hat{W}}_0. \end{aligned}
(13)

Then, from Lemma 3 and (13), we have the following result.

### Theorem 4

For the WDWD with $$W(+1)= {\hat{W}}_0$$, (8) holds under (C-i) and (C-ii).

From Theorem 4, we recommend using $$W(+1)= {\hat{W}}_0$$ (and $$W(-1)=1$$) in the WDWD.

Hereafter, we call the classifier based on $${y}_{{\mathrm{W}}}(\varvec{x}_0)$$ with $$W(+1)={\hat{W}}_0$$ “the optimal WDWD (OWDWD)”. See Sect. 5 for numerical comparisons.
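The weight estimator is a one-line plug-in; the sketch below (the function name `w0_hat` is ours) computes $${\hat{W}}_0$$ from $${\hat{\Delta }}^2$$ and $$\text{ tr }(\varvec{S}_i)$$:

```python
import numpy as np

def w0_hat(n1, n2, d_sq, tr_S1, tr_S2):
    """Estimated optimal weight hat{W}_0 for the OWDWD (with W(-1) = 1).

    d_sq is hat{Delta}^2; tr_S1, tr_S2 are tr(S_1), tr(S_2).
    """
    num = n1 * (d_sq + 2 * tr_S2 / n2) ** 2
    den = n2 * (d_sq + 2 * tr_S1 / n1) ** 2
    return num / den
```

When $$n_1=n_2$$ and $$\text{ tr }(\varvec{S}_1)=\text{ tr }(\varvec{S}_2)$$, $${\hat{W}}_0=1$$, so the OWDWD reduces to the DWD.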

## Comparison in high-dimensional setting

We used computer simulations to compare the performance of the classifiers: the DWD, the BC-DWD, the WDWD, and the OWDWD. We set $$W(+1)=n_1/n_2$$ for the WDWD. Note that the WDWD is equivalent to the DWD when $$n_1=n_2$$.

As for $$\Pi _i\ (i=1,2)$$, we considered the following three cases:

1. (i) $$\varvec{x}_{ij}$$ is distributed as $$N_{d}(\varvec{\mu }_i, \varvec{\Sigma }_i)$$;

2. (ii) $$\varvec{x}_{ij}-\varvec{\mu }_i=\varvec{\Sigma }_i^{1/2}(z_{ij1},\ldots ,z_{ijd})^{\mathrm{T}}$$ for all j, where $$z_{ij\ell }=(v_{ij\ell }-5)/{10}^{1/2}$$ $$(\ell =1,\ldots ,d)$$ and the $$v_{ij\ell }$$s are i.i.d. as the chi-squared distribution with 5 degrees of freedom; and

3. (iii) $$\varvec{x}_{ij}-\varvec{\mu }_i$$, $$j=1,\ldots ,n_i,$$ are i.i.d. as a d-variate t-distribution, $$t_d(\varvec{0}, \varvec{\Sigma }_i, 10)$$, $$i=1,2,$$ with mean zero, covariance matrix $$\varvec{\Sigma }_i$$, and 10 degrees of freedom.

Note that the conditions (C-i) and (C-ii) are equivalent for (i) and (ii) because $$\text{ Var }(\Vert \varvec{x}_{ij}-\varvec{\mu }_i \Vert ^2)=O\{\text{ tr }(\varvec{\Sigma }_i^2)\}$$. We set $$d=2^s,\ s=5,\ldots ,11$$, $$\varvec{\mu }_1=\varvec{0}$$, and $$\varvec{\mu }_2=(1,\ldots ,1,0,\ldots ,0)^{\mathrm{T}}$$ whose first $$\lceil d^{2/3} \rceil$$ elements are 1. Note that $$\Delta ^2\approx d^{2/3}$$. Let $$\varvec{\Phi }=\varvec{C}( 0.3^{|i-j|^{1/3}})\varvec{C}$$, where $$\varvec{C}=\text{ diag }(\{0.5+1/(d+1)\}^{1/2},\ldots ,\{0.5+d/(d+1)\}^{1/2})$$. Note that $$\text{ tr }(\varvec{\Phi })=d$$. We considered four cases:

1. (a) $$(n_1,n_2)=(10,10)$$ and $$\varvec{\Sigma }_1=\varvec{\Sigma }_2=\varvec{\Phi }$$ for (i);

2. (b) $$(n_1,n_2)=(5,15)$$ and $$\varvec{\Sigma }_1=\varvec{\Sigma }_2=\varvec{\Phi }$$ for (i);

3. (c) $$(n_1,n_2)=(10,10)$$, $$\varvec{\Sigma }_1=(2/3)\varvec{\Phi }$$, and $$\varvec{\Sigma }_2=(4/3)\varvec{\Phi }$$ for (ii); and

4. (d) $$(n_1,n_2)=(8,12)$$, $$\varvec{\Sigma }_1=(6/5)\varvec{\Phi }$$, and $$\varvec{\Sigma }_2=(4/5)\varvec{\Phi }$$ for (iii).
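The covariance matrix $$\varvec{\Phi }$$ can be constructed as follows (a sketch under our naming; note $$\text{ tr }(\varvec{\Phi })=d$$ because the correlation factor has unit diagonal and $$\sum _{i=1}^d\{0.5+i/(d+1)\}=d$$):

```python
import numpy as np

def make_phi(d, rho=0.3):
    """Covariance matrix Phi = C (rho^{|i-j|^{1/3}}) C from the simulation setup."""
    idx = np.arange(1, d + 1)
    # correlation factor with entries rho^{|i-j|^{1/3}} (ones on the diagonal)
    R = rho ** (np.abs(idx[:, None] - idx[None, :]) ** (1.0 / 3.0))
    # diagonal of C: sqrt(0.5 + i/(d+1)), i = 1, ..., d
    c = np.sqrt(0.5 + idx / (d + 1.0))
    return (c[:, None] * R) * c[None, :]     # C R C for diagonal C
```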

Note that $$\text{ tr }(\varvec{\Sigma }_i^2)=O(d),\ i=1,2$$, for (a) to (d), so that (C-ii) holds in all cases. In addition, note that $$\text{ tr }(\varvec{\Sigma }_1)-\text{ tr }(\varvec{\Sigma }_2)=-(2/3)d$$ for (c) and $$\text{ tr }(\varvec{\Sigma }_1)-\text{ tr }(\varvec{\Sigma }_2)=(2/5)d$$ for (d), so that $$|\text{ tr }(\varvec{\Sigma }_1)-\text{ tr }(\varvec{\Sigma }_2)|/\Delta ^2\rightarrow \infty$$ as $$d\rightarrow \infty$$ for (c) and (d). We repeated the experiment 2000 times, checking each time whether the classifier classified $$\varvec{x}_0 \in \Pi _i$$ correctly, and defined $$P_{ir}=0\ (\text{ or }\ 1)$$ accordingly for each $$\Pi _i\ (i=1,2)$$. We calculated the error rates $${\overline{e}}(i)= \sum _{r=1}^{2000}P_{ir}/2000$$, $$i=1,2$$, and the average error rate $${\overline{e}}=\{{\overline{e}}(1)+{\overline{e}}(2)\}/2$$. Their standard deviations are less than 0.011. In Figs. 2 and 3, we plot $${\overline{e}}(1)$$, $${\overline{e}}(2)$$, and $${\overline{e}}$$ for (a) to (d).

We observed that the DWD and the WDWD performed quite poorly for (b) to (d). This is a natural consequence of the bias in the DWD and the WDWD. Note that $$|\delta | \rightarrow \infty$$ and $$|\delta _{{\mathrm{W}}}| \rightarrow \infty$$ as $$d\rightarrow \infty$$ for (b) to (d), where $$W(+1)=n_1/n_2$$ in $$\delta _{{\mathrm{W}}}$$. Thus, from Corollaries 1 and 2, the DWD and the WDWD suffer strong inconsistency. On the other hand, the BC-DWD and the OWDWD performed well in all cases. We emphasize that the BC-DWD and the OWDWD enjoy the consistency property without (C-iii) or (C-iv). See Sects. 3.2 and 4 for the details.

## Real data analysis

In this section, we analyze gene expression data using the DWD, the BC-DWD, the WDWD, the OWDWD, the (linear) SVM, and the bias-corrected SVM (BC-SVM) by Nakayama et al. (2017). We set $$W(+1)=n_1/n_2$$ for the WDWD. Note that the WDWD is equivalent to the DWD when $$n_1=n_2$$. We used the colon cancer data with $$2000\ (=d)$$ genes in Alon et al. (1999), which consist of $$\Pi _1:$$ colon tumor (40 samples) and $$\Pi _2:$$ normal colon (22 samples).

We randomly split the data from $$(\Pi _1,\Pi _2)$$ into training data sets of sizes $$(n_1,n_2)$$ and test data sets of sizes $$(40-n_1,22-n_2)$$. We considered eight cases: $$(n_1,n_2)=$$ (5,5), (5,15), (15,5), (15,15), (25,5), (25,15), (35,5), and (35,15). We constructed the DWD, the BC-DWD, the WDWD, the OWDWD, the SVM, and the BC-SVM using the training data sets and checked their accuracy using the test data set for each $$\Pi _i$$, denoting the misclassification rates by $${\widehat{e}}(1)_r$$ and $${\widehat{e}}(2)_r$$. We repeated this procedure 100 times, obtaining $${\widehat{e}}(1)_r$$ and $${\widehat{e}}(2)_r$$, $$r=1,\ldots ,100$$. We calculated the average misclassification rates, $${\overline{e}}(1)\ (=\sum _{r=1}^{100}{\widehat{e}}(1)_r/100)$$, $${\overline{e}}(2)\ (=\sum _{r=1}^{100}{\widehat{e}}(2)_r/100)$$, and $${\overline{e}}\ (=\{{\overline{e}}(1)+{\overline{e}}(2) \}/2)$$, for the classifiers in various combinations of $$(n_1,n_2)$$ in Table 1.

We observed that the BC-DWD and the OWDWD performed well compared with the DWD, the WDWD, and the SVM, especially when $$n_1$$ and $$n_2$$ are imbalanced. See Sects. 3.2 and 4 for the theoretical reasons. The BC-SVM also performed adequately even when $$n_1$$ and $$n_2$$ are imbalanced. This is an acceptable consequence because the BC-SVM has the consistency (8) under (C-i) and (C-ii). See Section 3 in Nakayama et al. (2017) for the details. However, the BC-DWD (or the OWDWD) seems to perform better than the BC-SVM. This is a natural consequence because the DWD takes account of all the data vectors, not only those used by the SVM. See Fig. 1. A theoretical study of the relation between the BC-SVM and the BC-DWD is left as future work.

## Proofs

Throughout this section, let $$\varvec{\mu }_{12}=\varvec{\mu }_1-\varvec{\mu }_2$$ and $$\varvec{\mu }=(\varvec{\mu }_1+\varvec{\mu }_2)/2$$.

### Proof of Lemma 1

Under (C-ii), we have that as $$d\rightarrow \infty$$

\begin{aligned} \frac{\text{ tr }(\varvec{\varSigma }_1\varvec{\varSigma }_2)}{\Delta ^4}\le \frac{\sqrt{ \text{ tr }(\varvec{\varSigma }_1^2) \text{ tr }(\varvec{\varSigma }_2^{2})}}{\Delta ^4}=o(1) \quad \text{ and } \quad \frac{\varvec{\mu }_{12}^{\mathrm{T}}\varvec{\varSigma }_i\varvec{\mu }_{12}}{\Delta ^4} \le \frac{\sqrt{ \text{ tr }(\varvec{\varSigma }_i^2)}}{\Delta ^2}=o(1) \end{aligned}
(14)

for $$i=1,2$$. Then, using Chebyshev’s inequality, for any $$\tau >0$$, under (C-i) and (C-ii), we have that

\begin{aligned}&P(|(\varvec{x}_j-\varvec{\mu })^{\mathrm{T}}(\varvec{x}_{j'}-\varvec{\mu })-\Delta ^2/4 |\ge \tau \Delta ^2 )\nonumber \\&\quad \le {E[\{(\varvec{x}_j-\varvec{\mu })^{\mathrm{T}}(\varvec{x}_{j'}-\varvec{\mu })-\Delta ^2/4 \}^2]}/{(\tau \Delta ^2)^2} \nonumber \\&\qquad =O\{\text{ tr }(\varvec{\varSigma }_1^2)+\varvec{\mu }_{12}^{\mathrm{T}}\varvec{\varSigma }_1\varvec{\mu }_{12}\}/\Delta ^4=o(1) \nonumber \\&\qquad \qquad \text{ for } 1\le j< j'\le n_1; \nonumber \\&P(|(\varvec{x}_j-\varvec{\mu })^{\mathrm{T}}(\varvec{x}_{j'}-\varvec{\mu })-\Delta ^2/4 |\ge \tau \Delta ^2 ) \nonumber \\&\qquad =O\{\text{ tr }(\varvec{\varSigma }_2^2)+\varvec{\mu }_{12}^{\mathrm{T}}\varvec{\varSigma }_2\varvec{\mu }_{12}\}/\Delta ^4=o(1)\nonumber \\&\qquad \qquad \text{ for } n_1+1 \le j< j' \le N;\quad \text{ and } \nonumber \\&P(|(\varvec{x}_j-\varvec{\mu })^{\mathrm{T}}(\varvec{x}_{j'}-\varvec{\mu })+\Delta ^2/4 |\ge \tau \Delta ^2 )\nonumber \\&\qquad =O\{\text{ tr }(\varvec{\varSigma }_1\varvec{\varSigma }_2)+\varvec{\mu }_{12}^{\mathrm{T}}(\varvec{\varSigma }_1+\varvec{\varSigma }_2)\varvec{\mu }_{12}\}/\Delta ^4=o(1)\nonumber \\&\qquad \qquad \hbox {for } 1 \le j \le n_1 \hbox { and } n_1+1\le j' \le N. \end{aligned}
(15)

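For reference, the two bounds in (14) are, respectively, an instance of the Cauchy–Schwarz inequality for the trace (Frobenius) inner product and a largest-eigenvalue bound. A sketch, writing $$\Delta ^2=\Vert \varvec{\mu }_{12}\Vert ^2$$:

\begin{aligned} \text{ tr }(\varvec{\varSigma }_1\varvec{\varSigma }_2)&\le \Vert \varvec{\varSigma }_1\Vert _F\Vert \varvec{\varSigma }_2\Vert _F =\sqrt{ \text{ tr }(\varvec{\varSigma }_1^2) \text{ tr }(\varvec{\varSigma }_2^{2})} \quad \text{ and }\\ \varvec{\mu }_{12}^{\mathrm{T}}\varvec{\varSigma }_i\varvec{\mu }_{12}&\le \lambda _{\max }(\varvec{\varSigma }_i)\Vert \varvec{\mu }_{12}\Vert ^2 \le \sqrt{ \text{ tr }(\varvec{\varSigma }_i^2)}\,\Delta ^2, \end{aligned}

since $$\lambda _{\max }(\varvec{\varSigma }_i)\le \sqrt{ \text{ tr }(\varvec{\varSigma }_i^2)}$$ for the nonnegative-definite $$\varvec{\varSigma }_i$$.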
From (14), for any $$\tau >0$$, under (C-i) and (C-ii), we have that

\begin{aligned}&P\Big (\Big | \Vert \varvec{x}_j-\varvec{\mu }\Vert ^2-\Delta ^2/4-\text{ tr }(\varvec{\varSigma }_1) \Big |\ge \tau \Delta ^2 \Big )\nonumber \\&\qquad =O\{\text{ Var }(\Vert \varvec{x}_{1j}-\varvec{\mu }_1\Vert ^2) +\varvec{\mu }_{12}^{\mathrm{T}}\varvec{\varSigma }_1\varvec{\mu }_{12}\}/\Delta ^4=o(1) \nonumber \\&\qquad \qquad \text{ for } j=1,\ldots ,n_1;\quad \text{ and } \nonumber \\&P\Big (\Big |\Vert \varvec{x}_j-\varvec{\mu }\Vert ^2-\Delta ^2/4-\text{ tr }(\varvec{\varSigma }_2) \Big |\ge \tau \Delta ^2 \Big )=o(1) \nonumber \\&\qquad \qquad \text{ for } j=n_1+1,\ldots ,N. \end{aligned}
(16)

Here, under (4), we can write that

\begin{aligned} K(\varvec{\alpha }) =\sum _{j=1}^N\sum _{j'=1}^N \alpha _j\alpha _{j'}t_jt_{j'}(\varvec{x}_j-\varvec{\mu })^{\mathrm{T}}(\varvec{x}_{j'}-\varvec{\mu }). \end{aligned}

Then, by noting that $$\alpha _j> 0$$ for all j under (4), from (15) and (16), we have that

\begin{aligned} K(\varvec{\alpha })&= \frac{\Delta ^2}{4}\left( \sum _{j=1}^N\alpha _j\right) ^2\{1+o_p(1)\}\\&\quad +\left( \text{ tr }(\varvec{\varSigma }_1)\sum _{j=1}^{n_1}\alpha _j^2+ \text{ tr }(\varvec{\varSigma }_2)\sum _{j=n_1+1}^{N}\alpha _j^2\right) \end{aligned}

under (C-i) and (C-ii). This concludes the result.
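As a numerical sanity check of the approximation above, the following sketch simulates a toy setting with $$\varvec{\varSigma }_1=\varvec{\varSigma }_2=\varvec{I}_d$$ and $$\alpha _j=1$$ for all j, and compares $$K(\varvec{\alpha })$$ with its leading-term approximation $$(\Delta ^2/4)(\sum _j\alpha _j)^2+\text{ tr }(\varvec{\varSigma }_1)\sum _{j\le n_1}\alpha _j^2+\text{ tr }(\varvec{\varSigma }_2)\sum _{j>n_1}\alpha _j^2$$. The dimension, mean shift, and Gaussian populations are illustrative assumptions, not part of the proof.

```python
import random

def k_alpha(xs, ts, alphas, mu):
    # K(alpha) = sum_{j,j'} a_j a_{j'} t_j t_{j'} (x_j - mu)^T (x_{j'} - mu)
    zs = [[a - b for a, b in zip(x, mu)] for x in xs]
    return sum(
        alphas[j] * alphas[k] * ts[j] * ts[k]
        * sum(u * v for u, v in zip(zs[j], zs[k]))
        for j in range(len(xs)) for k in range(len(xs))
    )

# Toy HDLSS setting: Sigma_1 = Sigma_2 = I_d, so tr(Sigma_i^2)/Delta^4
# = d/Delta^4 is small when Delta^2 >> sqrt(d), as (C-ii) requires.
d, n1, n2 = 10_000, 3, 3
rng = random.Random(0)
shift = 0.2                      # mu_12 = (0.2, ..., 0.2), Delta^2 = 400
mu1 = [shift / 2] * d            # take mu = (mu_1 + mu_2)/2 = 0
mu2 = [-shift / 2] * d
xs = ([[m + rng.gauss(0, 1) for m in mu1] for _ in range(n1)]
      + [[m + rng.gauss(0, 1) for m in mu2] for _ in range(n2)])
ts = [-1] * n1 + [1] * n2        # class labels t_j
alphas = [1.0] * (n1 + n2)
mu = [0.0] * d

delta2 = shift ** 2 * d          # Delta^2 = ||mu_12||^2
approx = ((delta2 / 4) * sum(alphas) ** 2
          + d * sum(a ** 2 for a in alphas[:n1])     # tr(Sigma_1) = d
          + d * sum(a ** 2 for a in alphas[n1:]))    # tr(Sigma_2) = d
exact = k_alpha(xs, ts, alphas, mu)
```

In this configuration the leading term is dominated by $$\text{ tr }(\varvec{\varSigma }_i)\ (=d)$$, and the simulated $$K(\varvec{\alpha })$$ concentrates around the approximation as the lemma indicates.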

### Proof of Lemma 2

From (5) and (6), we can claim the first result of Lemma 2. Next, we consider the second result of Lemma 2. By noting that $$\sum _{j=1}^N{\hat{\alpha }}_{j}t_j\varvec{x}_j=\sum _{j=1}^N{\hat{\alpha }}_{j}t_j(\varvec{x}_j-\varvec{\mu })$$, from the first result of Lemma 2, (15) and (16), under (C-i) and (C-ii), it holds that as $$d\rightarrow \infty$$:

\begin{aligned}&K({\hat{\varvec{\alpha }}})=\frac{\big ( \sqrt{n_1}+\sqrt{n_2}\big )^4 }{\Delta _*^2}\{1+o_p(\zeta )\}; \\&\sum _{j=1}^N\sum _{j'=1}^N\frac{{\hat{\alpha }}_jt_j(\varvec{x}_j-\varvec{\mu })^{\mathrm{T}}(\varvec{x}_{j'}-\varvec{\mu })}{\sqrt{K({\hat{\varvec{\alpha }}})}} \\&\quad = \frac{ \Delta ^2(n_2-n_1)-{2}\text{ tr }(\varvec{\varSigma }_1)+{2}\text{ tr }(\varvec{\varSigma }_2)+o_p(\Delta ^2) }{2 \Delta _*}; \\&\qquad \qquad \qquad \text{ and } \\&\sum _{j=1}^N \frac{{\hat{\alpha }}_jt_j(\varvec{x}_j-\varvec{\mu })^{\mathrm{T}}(\varvec{x}_0-\varvec{\mu })}{\sqrt{K({\hat{\varvec{\alpha }}})}} =\frac{(-1)^{i}\Delta ^2 }{2\Delta _*} \{1+o_p(1)\} \\&\quad \text{ when } \varvec{x}_0 \in \Pi _i, i=1,2. \end{aligned}

Then, from the first result of Lemma 2, we can claim the second result of Lemma 2. This concludes the results of Lemma 2.

### Proofs of Theorem 1 and Corollary 1

Using (7), the results are obtained straightforwardly.

### Proof of Theorem 2

By combining (7) with (11), we can conclude the result.

### Proofs of Lemma 3, Corollary 2, and Theorems 3 and 4

Similarly to (6), the first result of Lemma 3 follows from Lemma 1 and (5). The second result of Lemma 3 can be claimed in a way similar to the proof of Lemma 2. The results of Corollary 2 and Theorems 3 and 4 follow by combining (12) with (13).