## Introduction

With the development of modern measurement technology, we increasingly encounter high-dimension, low-sample-size (HDLSS) data. In this paper, we consider two-class linear discriminant analysis for HDLSS data. Suppose we have two independent d-variate populations, $$\Pi _i$$, $$i=1,2$$, where $$\Pi _i$$ has an unknown mean vector $$\varvec{\mu }_i$$ and an unknown covariance matrix $${\varvec{\Sigma }}_i$$. We have independent and identically distributed (i.i.d.) observations, $$\varvec{x}_{i1},\ldots ,\varvec{x}_{in_i}$$, from each $$\Pi _i$$, and assume $$n_i\ge 2$$, $$i=1,2$$. We also assume that

\begin{aligned} \limsup _{d\rightarrow \infty } \frac{\Vert \varvec{\mu }_i\Vert ^2}{d}<\infty , \quad \liminf _{d\rightarrow \infty } \frac{\text { tr }(\varvec{\Sigma }_i)}{d}>0, \quad \text { and } \quad \limsup _{d\rightarrow \infty } \frac{\text { tr }(\varvec{\Sigma }_i)}{d}<\infty \end{aligned}

for $$i=1,2$$, where $$\Vert \cdot \Vert$$ denotes the Euclidean norm. Let $$N=n_1+n_2$$. We simply write $$(\varvec{x}_1,\ldots ,\varvec{x}_N)=(\varvec{x}_{11},\ldots ,\varvec{x}_{1n_1},\varvec{x}_{21},\ldots ,\varvec{x}_{2n_2})$$. We denote the class label of $$\varvec{x}_j$$ by $$t_j$$, where $$t_j=-1$$ for $$j=1,\ldots ,n_1$$ and $$t_j=+1$$ for $$j=n_1+1,\ldots ,N$$. Let $$\varvec{x}_0$$ be an observation vector of an individual belonging to one of the $$\Pi _i$$s. We assume that $$\varvec{x}_0$$ and the $$\varvec{x}_{ij}$$s are independent.

In the HDLSS context, Hall et al. (2008), Chan and Hall (2009), and Aoshima and Yata (2014) considered distance-based classifiers. Aoshima and Yata (2019a) considered a distance-based classifier built on a data transformation technique. Aoshima and Yata (2011, 2015) considered geometric classifiers based on a geometric representation of HDLSS data. Aoshima and Yata (2019b) considered quadratic classifiers in general and discussed their optimality under high-dimension, non-sparse settings. In the field of machine learning, there are many studies on classification (supervised learning). A typical method is the support vector machine (SVM) developed by Vapnik (2000). Hall et al. (2005), Chan and Hall (2009), and Nakayama et al. (2017, 2020) investigated asymptotic properties of the SVM in the HDLSS context. Nakayama et al. (2017, 2020) pointed out the strong inconsistency of the SVM when the $$n_i$$s are imbalanced; they proposed bias-corrected SVMs and showed their superiority over the original SVM. On the other hand, Marron et al. (2007) pointed out that the SVM causes data piling in the HDLSS context. Data piling is the phenomenon that the projections of the training data onto the normal direction vector of a separating hyperplane coincide within each class. See Fig. 1 in Sect. 2. To avoid the data piling problem of the SVM, Marron et al. (2007) proposed the distance-weighted discrimination (DWD). Whereas the SVM finds the optimal hyperplane by maximizing the minimum distance from each class to the hyperplane, the DWD finds a proper hyperplane by minimizing the sum of the reciprocals of the distances from the data points to the hyperplane. Thus, the DWD takes account of all the data vectors, not only those used by the SVM. Unfortunately, the DWD is designed for balanced training data sets. See Qiao et al. (2010) and Qiao and Zhang (2015). For imbalanced training data sets, Qiao et al. (2010) developed the weighted DWD (WDWD), which imposes different weights on the two classes. However, the WDWD is sensitive to the choice of weights.

In this paper, we investigate the DWD and the WDWD theoretically in the HDLSS context where $$d\rightarrow \infty$$ while N is fixed. In Sect. 2, we review the DWD. In Sect. 3, we give asymptotic properties of the DWD. We show that the DWD incurs a large bias caused by heterogeneity of the covariance matrices as well as by sample imbalance. We propose a bias-corrected DWD (BC-DWD) and show that the BC-DWD enjoys consistency properties for the misclassification rates. In Sect. 4, we give asymptotic properties of the WDWD and propose an optimal choice of the weights in the WDWD. Finally, in Sects. 5 and 6, we examine the performance of the BC-DWD and the WDWD with the optimal weights in numerical simulations and real data analyses.

## Formulation of the DWD

In this section, we give a formulation of the DWD along the lines of Marron et al. (2007).

Let $$\varvec{w}\in {\mathbb {R}}^d$$ be a normal vector and $$b\in {\mathbb {R}}$$ an intercept term. Let $$r_j=t_j(\varvec{w}^{\mathrm{T}}\varvec{x}_j+ b)$$ for $$j=1,\ldots ,N$$. When the training data sets are linearly separable, the DWD is defined by minimizing the sum of $$1/r_j$$ over all observations. Note that HDLSS data are linearly separable by a hyperplane. Thus, the optimization problem of the DWD is as follows:

\begin{aligned} \min _{{\varvec{w}},b} \sum _{j=1}^{N}\frac{1}{r_j} \text{ subject } \text{ to } \ r_j=t_j(\varvec{w}^{\mathrm{T}}\varvec{x}_j+ b)>0,\quad j=1,\ldots ,N, \text{ and } \Vert \varvec{w}\Vert {\le }1. \end{aligned}

The dual problem of the above optimization problem can be written as

\begin{aligned} \max _{{\varvec{\alpha }},\lambda }\min _{{\varvec{w}},b, {\varvec{r}}} \left\{ \sum _{j=1}^{N}\frac{1}{r_j}+\sum _{j=1}^{N}\alpha _j(r_j-t_j(\varvec{w}^{\mathrm{T}}\varvec{x}_j+ b))+\lambda (\Vert \varvec{w}\Vert ^2-1)\right\} , \end{aligned}

subject to $$\alpha _j> 0$$, $$j=1,\ldots ,N$$, and $$\lambda > 0$$, where $$\varvec{\alpha }=(\alpha _1,\ldots ,\alpha _N)^{\mathrm{T}}$$, $$\varvec{r}=(r_1,\ldots , r_N)^{\mathrm{T}}$$, and $$\lambda$$ and $$\alpha _j$$s are Lagrange multipliers. Let

\begin{aligned} A=\sum _{j=1}^{N}\frac{1}{r_j}+\sum _{j=1}^{N}\alpha _j(r_j-t_j(\varvec{w}^{\mathrm{T}}\varvec{x}_j+ b))+\lambda (\Vert \varvec{w}\Vert ^2-1). \end{aligned}

Then, we have that

\begin{aligned}&\frac{\partial A}{\partial \varvec{r}}=\varvec{0}\Leftrightarrow r_j^2=\frac{1}{\alpha _j},\quad j=1,\ldots ,N; \nonumber \\&\frac{\partial A}{\partial \varvec{w}}=\varvec{0}\Leftrightarrow \varvec{w}=\frac{1}{2\lambda }\sum _{j=1}^{N}\alpha _j t_j\varvec{x}_j; \nonumber \\&\quad \text{ and }\quad \frac{\partial A}{\partial b}=0 \Leftrightarrow \sum _{j=1}^{N}\alpha _j t_j=0. \end{aligned}
(1)
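As a brief check (this substitution is left implicit in the original derivation), plugging the stationarity conditions (1) back into A eliminates $$\varvec{w}$$, b, and $$\varvec{r}$$:

\begin{aligned} A&=\sum _{j=1}^{N}\frac{1}{r_j}+\sum _{j=1}^{N}\alpha _j r_j-\varvec{w}^{\mathrm{T}}\sum _{j=1}^{N}\alpha _j t_j\varvec{x}_j -b\sum _{j=1}^{N}\alpha _j t_j +\lambda (\Vert \varvec{w}\Vert ^2-1)\\&=2\sum _{j=1}^{N}\sqrt{\alpha _j} -\frac{1}{2\lambda }\bigg \Vert \sum _{j=1}^{N}\alpha _j t_j\varvec{x}_j\bigg \Vert ^2 +\frac{1}{4\lambda }\bigg \Vert \sum _{j=1}^{N}\alpha _j t_j\varvec{x}_j\bigg \Vert ^2 -\lambda \\&=2\sum _{j=1}^{N}\sqrt{\alpha _j}-\frac{1}{4\lambda }\bigg \Vert \sum _{j=1}^{N}\alpha _j t_j\varvec{x}_j\bigg \Vert ^2-\lambda , \end{aligned}

since $$r_j=1/\sqrt{\alpha _j}$$ gives $$\sum _{j}1/r_j=\sum _{j}\alpha _j r_j=\sum _{j}\sqrt{\alpha _j}$$, $$\varvec{w}=\sum _{j}\alpha _jt_j\varvec{x}_j/(2\lambda )$$ gives $$\lambda \Vert \varvec{w}\Vert ^2=\Vert \sum _{j}\alpha _jt_j\varvec{x}_j\Vert ^2/(4\lambda )$$, and $$\sum _{j}\alpha _jt_j=0$$. This yields the objective in (2) below.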

Let

\begin{aligned} K(\varvec{\alpha })=\bigg \Vert \sum _{j=1}^{N} \alpha _j t_j\varvec{x}_j\bigg \Vert ^2. \end{aligned}

The optimization problem can be transformed into the following:

\begin{aligned} {\max _{{\varvec{\alpha }}, \lambda } } \left\{ 2\sum _{j=1}^{N}\sqrt{\alpha _j}-\frac{1}{4\lambda }K(\varvec{\alpha })-\lambda \right\} \end{aligned}
(2)

subject to $$\alpha _j> 0$$, $$j=1,\ldots ,N$$, $$\lambda > 0$$, and $$\sum _{j=1}^N\alpha _jt_j=0$$. Let

\begin{aligned} A^*=2\sum _{j=1}^{N}\sqrt{\alpha _j}-\frac{1}{4\lambda }K(\varvec{\alpha })-\lambda . \end{aligned}

Then, by noting that

\begin{aligned} \frac{\partial A^*}{\partial \lambda }=0 \Leftrightarrow \lambda =\frac{1}{2}\sqrt{K(\varvec{\alpha })} , \end{aligned}
(3)

we can rewrite the optimization problem (2) as follows:

\begin{aligned} \max _{ {\varvec{\alpha }}}\ \left\{ 2\sum _{j=1}^N\sqrt{\alpha _j} -\sqrt{K(\varvec{\alpha })} \right\} \end{aligned}

subject to

\begin{aligned} \alpha _j> 0,\ j=1,\ldots ,N, \quad \text{ and } \quad \sum _{j=1}^N\alpha _jt_j=0. \end{aligned}
(4)
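As a one-line check of this step: substituting (3) into the objective of (2) gives

\begin{aligned} 2\sum _{j=1}^N\sqrt{\alpha _j}-\frac{K(\varvec{\alpha })}{4\lambda }-\lambda =2\sum _{j=1}^N\sqrt{\alpha _j}-\frac{\sqrt{K(\varvec{\alpha })}}{2}-\frac{\sqrt{K(\varvec{\alpha })}}{2} =2\sum _{j=1}^N\sqrt{\alpha _j}-\sqrt{K(\varvec{\alpha })}. \end{aligned}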

Let

\begin{aligned} L(\varvec{\alpha }) =2\sum _{j=1}^N\sqrt{\alpha _j} -\sqrt{K(\varvec{\alpha })} \end{aligned}

and

\begin{aligned} {\hat{\varvec{\alpha }}} =({\hat{\alpha }}_1,\ldots ,{\hat{\alpha }}_N)^{\mathrm{T}}=\mathop {\hbox {argmax}}\limits _{ {\varvec{\alpha }}}L(\varvec{\alpha })\ \text{ subject } \text{ to } (4). \end{aligned}

Then, from (1) and (3), we write that

\begin{aligned} {\hat{\varvec{w}}} =\frac{1}{ \sqrt{K({\hat{\varvec{\alpha }}})} } \sum _{j=1}^N{\hat{\alpha }}_jt_j\varvec{x}_j. \end{aligned}

From the first equation in (1), the intercept term b satisfies

\begin{aligned} \frac{t_j}{\sqrt{{\hat{\alpha }}_j}}-{\hat{\varvec{w}}}^{\mathrm{T}}\varvec{x}_j \quad \text{ for } \text{ all } j=1,\ldots ,N. \end{aligned}

Thus, we consider estimating b by the average:

\begin{aligned} {\hat{b}} =\frac{1}{N}\sum _{j=1}^N \left( \frac{t_j}{\sqrt{{\hat{\alpha }}_j}} -{\hat{\varvec{w}}}^{\mathrm{T}}\varvec{x}_j\right) . \end{aligned}

Then, the classifier function of the DWD is defined by

\begin{aligned} {y}(\varvec{x}_0) ={\hat{\varvec{w}}}^{\mathrm{T}}\varvec{x}_0+{\hat{b}}. \end{aligned}

One classifies $$\varvec{x}_0$$ into $$\Pi _1$$ if $${y}(\varvec{x}_0 )<0$$ and into $$\Pi _2$$ otherwise.
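To make the pipeline above concrete, here is an illustrative sketch (the function names `dwd_fit`, `dwd_objective`, and `dwd_classify` are ours, and a simple projected-gradient ascent stands in for the second-order cone solvers usually used for the DWD). It maximizes $$L(\varvec{\alpha })$$ under (4) and then forms $${\hat{\varvec{w}}}$$, $${\hat{b}}$$, and $${y}(\varvec{x}_0)$$:

```python
import numpy as np

def dwd_fit(X, t, n_iter=5000, lr=5e-4, eps=1e-6):
    """Illustrative solver for the DWD dual (4) by projected gradient ascent.

    X: (N, d) training data; t: (N,) labels in {-1, +1}.
    Maximizes L(alpha) = 2*sum(sqrt(alpha)) - sqrt(K(alpha))
    subject to alpha_j > 0 and sum_j alpha_j t_j = 0.
    """
    t = np.asarray(t, dtype=float)
    N = len(t)
    n1, n2 = int(np.sum(t < 0)), int(np.sum(t > 0))
    # feasible start: constant within each class, so sum(alpha * t) = 0
    alpha = np.where(t < 0, 1.0 / (2 * n1), 1.0 / (2 * n2))
    for _ in range(n_iter):
        u = X.T @ (alpha * t)                        # sum_j alpha_j t_j x_j
        root_K = max(np.linalg.norm(u), eps)         # sqrt(K(alpha))
        grad = 1.0 / np.sqrt(alpha) - t * (X @ u) / root_K
        grad -= (grad @ t) / N * t                   # project onto {v : v.t = 0}
        alpha = np.clip(alpha + lr * grad, eps, None)
        alpha -= (alpha @ t) / N * t                 # re-project after clipping
    u = X.T @ (alpha * t)
    w_hat = u / np.linalg.norm(u)                    # hat{w}
    b_hat = np.mean(t / np.sqrt(alpha) - X @ w_hat)  # hat{b}
    return w_hat, b_hat, alpha

def dwd_objective(X, t, alpha):
    # L(alpha) = 2*sum(sqrt(alpha)) - sqrt(K(alpha))
    return 2.0 * np.sum(np.sqrt(alpha)) - np.linalg.norm(X.T @ (alpha * t))

def dwd_classify(x0, w_hat, b_hat):
    # class Pi_1 if y(x0) < 0, Pi_2 otherwise
    return 1 if w_hat @ x0 + b_hat < 0 else 2
```

Because $$L(\varvec{\alpha })$$ is concave on the feasible set, small ascent steps suffice for illustration; a production implementation would use a dedicated SOCP solver.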

Now, let us use the following toy example to illustrate data piling. We set $$n_1=n_2=25$$ and $$d=2^s$$, $$s=5,\ldots ,8$$. Independent pseudo-random observations were generated from $$\Pi _i:N_d(\varvec{\mu }_i,\varvec{\Sigma }_i)$$. We set $$\varvec{\mu }_1=\varvec{0}$$, $$\varvec{\mu }_2=(1,\ldots ,1,0,\ldots ,0)^{\mathrm{T}}$$ whose first $$\lceil d^{2/3} \rceil$$ elements are 1, and $$\varvec{\Sigma }_1=\varvec{\Sigma }_2={\varvec{I}}_d$$, where $$\lceil x \rceil$$ denotes the smallest integer $$\ge x$$ and $$\varvec{I}_d$$ denotes the d-dimensional identity matrix. Let $${y}_{{\mathrm{SVM}}}(\cdot )$$ be the classifier function of the (linear) SVM. In Fig. 1, we give histograms of the $${y}_{{\mathrm{SVM}}}(\varvec{x}_j)$$s and the normalized $${y}(\varvec{x}_j)$$s.

We observed that, as d increases, the training data points for the SVM concentrate at $$-1$$ when $$\varvec{x}_j \in \Pi _1$$ and at 1 when $$\varvec{x}_j \in \Pi _2$$. This phenomenon is data piling. See Nakayama et al. (2017) for the theoretical reason. On the other hand, the training data points for the DWD did not exhibit this phenomenon. We emphasize that the DWD takes account of all the data vectors, not only those used by the SVM. However, in the next section, we show that the DWD incurs a large bias caused by heterogeneity of the covariance matrices as well as by sample imbalance.

## Asymptotic properties of the DWD and its bias correction

In this section, we first give asymptotic properties of the DWD in the HDLSS context. We show that the DWD incurs a large bias caused by heterogeneity of the covariance matrices as well as by sample imbalance. To overcome these difficulties, we propose a bias-corrected DWD.

### Asymptotic properties of the DWD

Let $$\Delta ={\Vert \varvec{\mu }_1-\varvec{\mu }_2\Vert }$$. For $$K(\varvec{\alpha })$$, we have the following result.

### Lemma 1

Assume

(C-i):
$$\displaystyle \frac{ \text{ Var } (\Vert {\varvec{x}}_{ij}-{\varvec{\mu }}_i\Vert ^2)}{\Delta ^4}\rightarrow 0 \text{ as } d\rightarrow \infty \text{ for } i=1,2; \text{ and }$$
(C-ii):
$$\displaystyle \frac{ \text{ tr } (\varvec{\Sigma }_i^2)}{\Delta ^4}\rightarrow 0 \text{ as } d\rightarrow \infty \text{ for } i=1,2.$$

Under (4), it holds that as $$d\rightarrow \infty$$

\begin{aligned} K(\varvec{\alpha })=&\frac{\Delta ^2}{4}\left( \sum _{j=1}^N\alpha _j\right) ^2\{1+o_p(1)\} \\&+\left( \text { tr }(\varvec{\Sigma }_1)\sum _{j=1}^{n_1}\alpha _j^2+ \text { tr }(\varvec{\Sigma }_2)\sum _{j=n_1+1}^{N}\alpha _j^2\right) . \end{aligned}

### Remark 1

The conditions (C-i) and (C-ii) are equivalent when $$\Pi _i$$s are Gaussian because $$\text { Var }(\Vert {\varvec{x}}_{ij}-{\varvec{\mu }}_i\Vert ^2)=2\text { tr }(\varvec{\Sigma }_i^2)$$. If $$\text { tr }(\varvec{\Sigma }_i^2)=O(d),\ i=1,2,$$ and $$\Delta ^2/d^{1/2}\rightarrow \infty$$ as $$d\rightarrow \infty$$, (C-ii) is satisfied. See Sect. 5 for several models satisfying (C-ii).

We consider maximizing $$L(\varvec{\alpha })$$. From Jensen’s inequality, we note that $$\sum _{j=1}^{n_1} \sqrt{\alpha _j}/n_1\le \sqrt{ \sum _{j=1}^{n_1}\alpha _j/n_1 }$$, $$\sum _{j=n_1+1}^{N} \sqrt{\alpha _j}/n_2\le \sqrt{ \sum _{j=n_1+1}^{N}\alpha _j/n_2 }$$, $$\sum _{j=1}^{n_1}$$ $$\alpha _j^2/n_1\ge ( \sum _{j=1}^{n_1}\alpha _j/n_1 )^2$$ and $$\sum _{j=n_1+1}^{N} \alpha _j^2/n_2 \ge ( \sum _{j=n_1+1}^{N}\alpha _j/n_2 )^2$$. In addition, note that $$\sum _{j=1}^{n_1}\alpha _j=\sum _{j=n_1+1}^{N}\alpha _j=\sum _{j=1}^N\alpha _j/2$$ under (4). Then, under (4) and the constraint that $$\sum _{j=1}^N\alpha _j=B$$ for a given positive constant B, we can claim that

\begin{aligned}&\max _{{{\varvec{\alpha }}}}\sum _{j=1}^N \sqrt{\alpha _j}=\sqrt{B} (\sqrt{n_1/2}+ \sqrt{n_2 /2}) \ \text { and } \nonumber \\&\quad \min _{ {{\varvec{\alpha }}}} \left( \text { tr }(\varvec{\Sigma }_1)\sum _{j=1}^{n_1}\alpha _j^2+\text { tr }(\varvec{\Sigma }_2)\sum _{j=n_1+1}^{N}\alpha _j^2\right) \nonumber \\&\qquad =\frac{B^2}{4}\left( \frac{\text { tr }(\varvec{\Sigma }_1)}{n_1}+\frac{\text { tr }(\varvec{\Sigma }_2)}{n_2}\right) \end{aligned}
(5)

when $$\alpha _1=\cdots =\alpha _{n_1}=B/(2n_1)$$ and $$\alpha _{n_1+1}=\cdots =\alpha _{N}=B/(2n_2)$$. Thus, from Lemma 1, under (C-i) and (C-ii), it holds that

\begin{aligned} \max _{ {\varvec{\alpha }}} L(\varvec{\alpha }) =&-\frac{1}{2}\Delta _{*} \Bigg (\sqrt{B}-\frac{\sqrt{2n_1}+\sqrt{2n_2}}{\Delta _{*} }\{1+o_p(\zeta )\} \Bigg )^2\{1+o_p(\zeta )\} \nonumber \\&+\frac{(\sqrt{n_1}+\sqrt{n_2})^2}{\Delta _{*}}\{1+o_p(\zeta )\}, \end{aligned}
(6)

where

\begin{aligned} \Delta _{*}=\sqrt{ \Delta ^{2}+\text { tr }({\varvec{\Sigma }}_1)/n_1+\text { tr }({\varvec{\Sigma }}_2)/n_2} \quad \text { and } \quad \zeta =\Delta ^{2}/\Delta _*^{2}. \end{aligned}

Hence, by choosing $$B\approx 2(\sqrt{n_1}+\sqrt{n_2})^2/\Delta _{*}^2$$, the maximum of $$L(\varvec{\alpha })$$ is attained asymptotically.
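As a small numerical check of (5) (the values of $$n_i$$, B, and $$\text { tr }(\varvec{\Sigma }_i)$$ below are arbitrary stand-ins), the within-class-constant allocation attains the stated extremes among feasible $$\varvec{\alpha }$$s:

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2, B = 5, 15, 2.0
tr1, tr2 = 120.0, 80.0           # stand-ins for tr(Sigma_1), tr(Sigma_2)

# the within-class-constant allocation attaining (5)
alpha_eq = np.r_[np.full(n1, B / (2 * n1)), np.full(n2, B / (2 * n2))]
sqrt_max = np.sqrt(B) * (np.sqrt(n1 / 2) + np.sqrt(n2 / 2))
quad_min = (B ** 2 / 4) * (tr1 / n1 + tr2 / n2)

# compare against random feasible alphas (positive, each class summing to B/2)
ok = True
for _ in range(1000):
    a1 = rng.random(n1); a1 *= (B / 2) / a1.sum()
    a2 = rng.random(n2); a2 *= (B / 2) / a2.sum()
    ok &= np.sum(np.sqrt(np.r_[a1, a2])) <= sqrt_max + 1e-9
    ok &= tr1 * np.sum(a1 ** 2) + tr2 * np.sum(a2 ** 2) >= quad_min - 1e-9
```

No random feasible allocation beats the constant one on either criterion, in line with the Jensen-inequality argument.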

### Lemma 2

Under (C-i) and (C-ii), it holds that as $$d\rightarrow \infty$$

\begin{aligned}&{\hat{\alpha }}_{j} =\frac{(\sqrt{n_1}+\sqrt{n_2})^2}{n_1\Delta _{*}^{2}}\{1+o_p(\zeta )\}\quad \text{ for } j=1,\ldots ,n_1;\quad \text{ and } \\&{\hat{\alpha }}_{j} =\frac{(\sqrt{n_1}+\sqrt{n_2})^2}{n_2\Delta _{*}^{2}}\{1+o_p(\zeta )\}\quad \text{ for } j=n_1+1,\ldots ,N. \end{aligned}

Furthermore,  it holds that as $$d\rightarrow \infty$$

\begin{aligned} {y}(\varvec{x}_0) =\frac{\Delta ^2}{ \Delta _*} \left( \frac{(-1)^i}{2}+\delta +o_p(1) \right) \end{aligned}
(7)

when $$\varvec{x}_0\in \Pi _i$$ for $$i=1,2,$$ where

\begin{aligned} \delta =&-\frac{(n_1^{3/2}-n_2^{3/2})\Delta _*^2}{N(\sqrt{n_1}+\sqrt{n_2}) \Delta ^2} +\frac{n_1-n_2}{2N} \\&+\frac{\text{ tr }(\varvec{\Sigma }_1)-\text{ tr }(\varvec{\Sigma }_2)}{N \Delta ^2 }. \end{aligned}

The quantity $$\delta$$ vanishes if $$n_1=n_2$$ and $$\varvec{\Sigma }_1=\varvec{\Sigma }_2$$. We consider the following assumption:

(C-iii):

$$\limsup _{d\rightarrow \infty } |\delta | <\frac{1}{2}$$.

Let e(i) denote the error rate of misclassifying an individual from $$\Pi _i$$ into the other class for $$i=1,2$$. Then, we have the following results.

### Theorem 1

Under (C-i), (C-ii), and (C-iii), it holds for the DWD that

\begin{aligned} e(1)\rightarrow 0 \quad \text{ and } \quad e(2)\rightarrow 0 \qquad \text{ as } d\rightarrow \infty . \end{aligned}
(8)

However, without (C-iii), we have the following results.

### Corollary 1

Under (C-i) and (C-ii), it holds for the DWD that

\begin{aligned}&e(1)\rightarrow 1 \ \text{ and } \ e(2)\rightarrow 0 \quad \text{ as } d\rightarrow \infty \text{ if } \liminf _{d\rightarrow \infty }\delta >\frac{1}{2}; \quad \text{ and } \\&e(1)\rightarrow 0 \ \text{ and } \ e(2)\rightarrow 1 \quad \text{ as } d\rightarrow \infty \text{ if } \limsup _{d\rightarrow \infty } \delta <-\frac{1}{2}. \end{aligned}

### Remark 2

For the DWD, Hall et al. (2005) and Qiao et al. (2010) established a consistency property as in Theorem 1 and inconsistency properties as in Corollary 1 under different conditions. However, we claim that (C-i), (C-ii), and (C-iii) are milder than their conditions.

From Corollary 1, the DWD suffers strong inconsistency because of the large bias caused by the heterogeneity of covariance matrices as well as sample imbalance. For example, if $$\text{ tr }(\varvec{\Sigma }_i)/ \Delta ^2 \rightarrow \infty$$ as $$d\rightarrow \infty$$ for some i, then $$|\delta |$$ tends to become large as d increases when $$\text{ tr }(\varvec{\Sigma }_1)\ne \text{ tr }(\varvec{\Sigma }_2)$$ or $$n_1 \ne n_2$$. To overcome these difficulties, we propose a bias-corrected DWD.

### Bias-corrected DWD

We consider an unbiased estimator of $$\Delta ^2$$ as follows:

\begin{aligned} {\hat{\Delta }}^2=\Vert {\overline{\varvec{x}}}_{1}-{\overline{\varvec{x}}}_{2} \Vert ^2-\text{ tr }(\varvec{S}_{1})/n_1-\text{ tr }(\varvec{S}_2)/n_2, \end{aligned}

where $${\overline{\varvec{x}}}_{i}=\sum _{j=1}^{n_i}{\varvec{x}_{ij}}/{n_i}$$ and $$\varvec{S}_{i}=\sum _{j=1}^{n_i}(\varvec{x}_{ij}-{\overline{\varvec{x}}}_{i})(\varvec{x}_{ij}-{\overline{\varvec{x}}}_{i})^{\mathrm{T}}/(n_i-1)$$ for $$i=1,2$$. Note that $$E({\hat{\Delta }}^2)=\Delta ^2$$. From $$\text { tr }(\varvec{\Sigma }_1\varvec{\Sigma }_2)\le \sqrt{\text { tr }(\varvec{\Sigma }_1^2)\text { tr }(\varvec{\Sigma }_2^2)}$$ and $$({\varvec{\mu }}_1-{\varvec{\mu }}_2)^{\mathrm {T}}\varvec{\Sigma }_i({\varvec{\mu }}_1-{\varvec{\mu }}_2)\le {{\Delta }} ^2 \sqrt{ \text { tr }({\varvec{\Sigma }}_i^2)}$$, under (C-ii), it holds that as $$d\rightarrow \infty$$

\begin{aligned} \text{ Var }({\hat{\Delta }}^2/\Delta ^2)&=\sum _{i=1}^2 \frac{2\text{ tr }(\varvec{\Sigma }_i^2)}{n_i(n_i-1)\Delta ^4} + \frac{4\text{ tr }(\varvec{\Sigma }_1\varvec{\Sigma }_2)}{n_1n_2\Delta ^4}\\&\quad +\sum _{i=1}^2\frac{4(\varvec{\mu }_1-\varvec{\mu }_2)^{\mathrm{T}}\varvec{\Sigma }_i(\varvec{\mu }_1-\varvec{\mu }_2)}{n_i\Delta ^4} \\&=o(1). \end{aligned}

Thus, under (C-ii), from Chebyshev’s inequality, it holds that as $$d\rightarrow \infty$$

\begin{aligned} {\hat{\Delta }}^2=\Delta ^2+o_p(\Delta ^2). \end{aligned}
(9)

See also Aoshima and Yata (2018) for asymptotic properties of $${\hat{\Delta }}^2$$. On the other hand, we consider an unbiased estimator of $$\Delta _*^2$$ as follows:

\begin{aligned} {\hat{\Delta }}_*^2=\Vert {\overline{\varvec{x}}}_{1}-{\overline{\varvec{x}}}_{2} \Vert ^2. \end{aligned}

Note that $$E({\hat{\Delta }}_*^2)=\Delta _*^2$$. Here, we write that

\begin{aligned} \text{ tr }(\varvec{S}_i)-\text{ tr }(\varvec{\Sigma }_i)=&\sum _{j=1}^{n_i}\frac{\Vert \varvec{x}_{ij}-\varvec{\mu }_i\Vert ^2-\text{ tr }(\varvec{\Sigma }_i) }{n_i}\\&-\sum _{j\ne j'}^{n_i}\frac{(\varvec{x}_{ij}-\varvec{\mu }_i)^{\mathrm{T}}(\varvec{x}_{ij'}-\varvec{\mu }_i)}{n_i(n_i-1)}. \end{aligned}

Then, by noting that $$\text{ Var }\{ \sum _{j\ne j'}^{n_i} (\varvec{x}_{ij}-\varvec{\mu }_i)^{\mathrm{T}}(\varvec{x}_{ij'}-\varvec{\mu }_i)/n_i^2 \} =O\{ \text{ tr }(\varvec{\Sigma }_i^2)\}$$ and $$\text{ Var }\{\Vert \varvec{x}_{ij}-\varvec{\mu }_i\Vert ^2-\text{ tr }(\varvec{\Sigma }_i)\}=\text{ Var }(\Vert \varvec{x}_{ij}-\varvec{\mu }_i\Vert ^2)$$, under (C-i) and (C-ii), from (9), it holds that as $$d\rightarrow \infty$$

\begin{aligned} \text{ tr }(\varvec{S}_{i})&= \text{ tr }(\varvec{\Sigma }_i)+o_p(\Delta ^2),\ i=1,2,\quad \text{ and }\nonumber \\&{\hat{\Delta }}_*^2=\Delta _*^2+o_p(\Delta ^2). \end{aligned}
(10)
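A minimal sketch of these estimators (the function name `estimators` is ours) computes $${\hat{\Delta }}^2$$, $${\hat{\Delta }}_*^2$$, and $$\text{ tr }(\varvec{S}_i)$$ directly from the two samples:

```python
import numpy as np

def estimators(X1, X2):
    """Compute hat{Delta}^2, hat{Delta}_*^2, tr(S_1), tr(S_2).

    X_i: (n_i, d) observations from Pi_i.
    """
    n1, n2 = X1.shape[0], X2.shape[0]
    diff = X1.mean(axis=0) - X2.mean(axis=0)         # xbar_1 - xbar_2
    tr_S1 = X1.var(axis=0, ddof=1).sum()             # tr(S_1)
    tr_S2 = X2.var(axis=0, ddof=1).sum()             # tr(S_2)
    d_star_sq = diff @ diff                          # hat{Delta}_*^2
    d_sq = d_star_sq - tr_S1 / n1 - tr_S2 / n2       # bias-corrected hat{Delta}^2
    return d_sq, d_star_sq, tr_S1, tr_S2
```

By construction, $${\hat{\Delta }}_*^2={\hat{\Delta }}^2+\text{ tr }(\varvec{S}_1)/n_1+\text{ tr }(\varvec{S}_2)/n_2$$ holds exactly in finite samples.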

Let $$\delta _*=\Delta ^2 \delta /\Delta _*$$ and $${{\hat{\Delta }}_*=\sqrt{{\hat{\Delta }}_*^2}}$$. We consider an estimator of $$\delta _*$$ as follows:

\begin{aligned} {\hat{\delta }}_*= -\frac{(n_1^{3/2}-n_2^{3/2}){\hat{\Delta }}_*}{N(\sqrt{n_1}+\sqrt{n_2})} +\frac{(n_1-n_2){\hat{\Delta }}^2}{2N{\hat{\Delta }}_*} +\frac{\text{ tr }(\varvec{S}_1)-\text{ tr }(\varvec{S}_2)}{N {\hat{\Delta }}_*}. \end{aligned}

Then, under (C-i) and (C-ii), from (9) and (10), it holds that as $$d\rightarrow \infty$$

\begin{aligned} {\hat{\delta }}_*=\delta _*+o_p( \Delta ^2/\Delta _*). \end{aligned}
(11)

Now, we define the bias-corrected DWD (BC-DWD) by

\begin{aligned} {y}_{{\mathrm{BC}}}(\varvec{x}_0)=y(\varvec{x}_0)-{\hat{\delta }}_*. \end{aligned}

One classifies $$\varvec{x}_0$$ into $$\Pi _1$$ if $${y}_{{\mathrm{BC}}}(\varvec{x}_0)<0$$ and into $$\Pi _2$$ otherwise. Then, from Lemma 2 and (11), we have the following result.

### Theorem 2

For the BC-DWD,  (8) holds under (C-i) and (C-ii).

We emphasize that the BC-DWD enjoys the asymptotic consistency without assuming (C-iii). See Sect. 5 for numerical comparisons.
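For concreteness, a sketch of the correction step (function names ours) computes $${\hat{\delta }}_*$$ from the sample quantities above and subtracts it from the DWD score:

```python
import numpy as np

def delta_star_hat(n1, n2, d_sq, d_star_sq, tr_S1, tr_S2):
    """Plug-in estimator hat{delta}_* used by the BC-DWD.

    Inputs are hat{Delta}^2, hat{Delta}_*^2, tr(S_1), tr(S_2).
    """
    N = n1 + n2
    d_star = np.sqrt(d_star_sq)
    return (-(n1 ** 1.5 - n2 ** 1.5) * d_star / (N * (np.sqrt(n1) + np.sqrt(n2)))
            + (n1 - n2) * d_sq / (2 * N * d_star)
            + (tr_S1 - tr_S2) / (N * d_star))

def y_bc(y_value, correction):
    # BC-DWD score: subtract the estimated bias from the DWD score y(x0)
    return y_value - correction
```

As claimed in the text, the correction vanishes in the balanced, homogeneous case ($$n_1=n_2$$ and $$\text{ tr }(\varvec{S}_1)=\text{ tr }(\varvec{S}_2)$$).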

## WDWD and its asymptotic properties

Qiao et al. (2010) developed the WDWD to overcome the weakness of the DWD for sample imbalance. The optimization problem of the WDWD is as follows:

\begin{aligned}&\min _{{{\varvec{w}}},b} \left\{ W(-1)\sum _{j=1}^{n_1}\frac{1}{r_j} +W(+1)\sum _{j=n_1+1}^{N}\frac{1}{r_j} \right\} \\&\text{ subject } \text{ to } r_j=t_j(\varvec{w}^{\mathrm{T}}\varvec{x}_j+ b)>0,\quad j=1,\ldots ,N, \text{ and } \Vert \varvec{w}\Vert {\le }1 \end{aligned}

where $$W(-1)\ (>0)$$ and $$W(+1)\ (>0)$$ are weights. In this paper, we assume $$W(-1)=1$$ without loss of generality. We also assume that

\begin{aligned} \liminf _{d\rightarrow \infty }W(+1)>0 \quad \text{ and } \quad \limsup _{d\rightarrow \infty }W(+1)<\infty . \end{aligned}

Then, similarly to the DWD, the dual optimization problem can be written as follows:

\begin{aligned} \max _{ {\varvec{\alpha }}}L_{{\mathrm{W}}}(\varvec{\alpha }) \end{aligned}

subject to (4), where

\begin{aligned} L_{{\mathrm{W}}}(\varvec{\alpha })=2\sum _{j=1}^{n_1}\sqrt{\alpha _j} +2\sum _{j=n_1+1}^{N}\sqrt{W(+1)\alpha _j} -\sqrt{K(\varvec{\alpha })}. \end{aligned}

Let us write that

\begin{aligned} {\hat{\varvec{\alpha }}}_{{\mathrm{W}}}=({\hat{\alpha }}_{1{\mathrm{W}}},\ldots ,{\hat{\alpha }}_{N{\mathrm{W}}})^{\mathrm{T}}=\mathop {\hbox {argmax}}\limits _{ {\varvec{\alpha }}}L_{{\mathrm{W}}}(\varvec{\alpha })\ \text{ subject } \text{ to } (4). \end{aligned}

Similarly to the DWD, we obtain the classifier function of the WDWD:

\begin{aligned} {y}_{{\mathrm{W}}}(\varvec{x}_0) ={\hat{\varvec{w}}}_{{\mathrm{W}}}^{\mathrm{T}}\varvec{x}_0+{\hat{b}}_{{\mathrm{W}}}, \end{aligned}

where

\begin{aligned} {\hat{\varvec{w}}}_{{\mathrm{W}}} =\frac{1}{\sqrt{K({\hat{\varvec{\alpha }}}_{{\mathrm{W}}})}} \sum _{j=1}^N{\hat{\alpha }}_{j{\mathrm{W}}} t_j\varvec{x}_j \end{aligned}

and

\begin{aligned} {\hat{b}}_{{\mathrm{W}}}= & {} \frac{1}{N}\sum _{j=1}^{n_1} \left( -\sqrt{\frac{1}{{\hat{\alpha }}_{j{\mathrm{W}}}}}-{\hat{\varvec{w}}}_{{\mathrm{W}}}^{\mathrm{T}}\varvec{x}_j\right) \nonumber \\&+\frac{1}{N}\sum _{j=n_1+1}^{N} \left( \sqrt{\frac{W(+1)}{{\hat{\alpha }}_{j{\mathrm{W}}}}}-{\hat{\varvec{w}}}_{{\mathrm{W}}}^{\mathrm{T}}\varvec{x}_j\right) . \end{aligned}

Then, one classifies $$\varvec{x}_0$$ into $$\Pi _1$$ if $${y}_{{\mathrm{W}}}(\varvec{x}_0)<0$$ and into $$\Pi _2$$ otherwise.

As with the DWD, we have the following result.

### Lemma 3

Under (C-i) and (C-ii), it holds that as $$d\rightarrow \infty$$

\begin{aligned}&{\hat{\alpha }}_{j{\mathrm{W}}} =\frac{(\sqrt{n_1}+\sqrt{n_2}\sqrt{W(+1)})^2}{n_1\Delta _{*}^2}\{1+o_p(\zeta )\}\quad \text{ for } j=1,\ldots ,n_1;\quad \text{ and } \\&{\hat{\alpha }}_{j{\mathrm{W}}} =\frac{(\sqrt{n_1}+\sqrt{n_2}\sqrt{W(+1)})^2}{n_2\Delta _{*}^2}\{1+o_p(\zeta )\}\quad \text{ for } j=n_1+1,\ldots ,N. \end{aligned}

Furthermore,  it holds that as $$d\rightarrow \infty$$

\begin{aligned} {y}_{{\mathrm{W}}}(\varvec{x}_0) =\frac{\Delta ^2}{{\Delta }_*} \left( \frac{(-1)^i}{2}+\delta _{{\mathrm{W}}}+o_p(1) \right) \end{aligned}
(12)

when $$\varvec{x}_0\in \Pi _i$$ for $$i=1,2,$$ where

\begin{aligned} \delta _{{\mathrm{W}}}=&-\frac{(n_1^{3/2}-n_2^{3/2}\sqrt{W(+1)})\Delta _*^2}{N(\sqrt{n_1}+\sqrt{n_2}\sqrt{W(+1)} ) \Delta ^2} +\frac{n_1-n_2}{2N} \\&+\frac{\text{ tr }(\varvec{\Sigma }_1)-\text{ tr }(\varvec{\Sigma }_2)}{N \Delta ^2 }. \end{aligned}

We consider the following assumption:

(C-iv):

$$\limsup _{d\rightarrow \infty }|\delta _{{\mathrm{W}}}|<\frac{1}{2}$$.

Then, we have the following results.

### Theorem 3

For the WDWD,  (8) holds under (C-i), (C-ii), and (C-iv).

### Corollary 2

Under (C-i) and (C-ii), it holds for the WDWD that

\begin{aligned}&e(1)\rightarrow 1 \ \text{ and } \ e(2)\rightarrow 0 \ \text{ as } d\rightarrow \infty \text{ if } \liminf _{d\rightarrow \infty }\delta _{{\mathrm{W}}} >\frac{1}{2}; \quad \text{ and } \\&e(1)\rightarrow 0 \ \text{ and } \ e(2)\rightarrow 1 \ \text{ as } d\rightarrow \infty \text{ if } \limsup _{d\rightarrow \infty } \delta _{{\mathrm{W}}} <-\frac{1}{2}. \end{aligned}

For the WDWD, Qiao et al. (2010) recommended using $$W(+1)=n_1/n_2$$ in the case of equal costs. See Table 3 in Qiao et al. (2010). However, if $$\text{ tr }(\varvec{\Sigma }_i)/ \Delta ^2 \rightarrow \infty$$ as $$d\rightarrow \infty$$ for some i, then $$|\delta _{{\mathrm{W}}}|$$ with $$W(+1)=n_1/n_2$$ tends to become large as d increases when $$\text{ tr }(\varvec{\Sigma }_1)\ne \text{ tr }(\varvec{\Sigma }_2)$$ or $$n_1 \ne n_2$$. Thus, from Corollary 2, the WDWD still suffers strong inconsistency because of the large bias caused by the heterogeneity of covariance matrices as well as sample imbalance.

To overcome such difficulties, we propose an optimal choice of $$W(+1)$$ in the WDWD. Let

\begin{aligned} W_0=\frac{n_1\{\Delta ^2+2\text{ tr }(\varvec{\Sigma }_2)/n_2\}^2}{n_2\{\Delta ^2+2\text{ tr }(\varvec{\Sigma }_1)/n_1\}^2}. \end{aligned}

We claim that

\begin{aligned} \delta _{{\mathrm{W}}}=0 \quad \text{ when } W(+1)=W_0. \end{aligned}

Thus, we consider the estimator of $$W_0$$ as follows:

\begin{aligned} {\hat{W}}_0=\frac{n_1\{{\hat{\Delta }}^2+2\text{ tr }(\varvec{S}_2)/n_2\}^2}{n_2\{{\hat{\Delta }}^2+2\text{ tr }(\varvec{S}_1)/n_1\}^2}. \end{aligned}

Then, under (C-i) and (C-ii), from (9) and (10), it holds that as $$d\rightarrow \infty$$

\begin{aligned} {\hat{W}}_0={W}_0\{1+o_p(\zeta )\}, \end{aligned}

so that

\begin{aligned} \delta _{{\mathrm{W}}}=o_p(1) \quad \text{ when } W(+1)= {\hat{W}}_0. \end{aligned}
(13)

Then, from Lemma 3 and (13), we have the following result.

### Theorem 4

For the WDWD with $$W(+1)= {\hat{W}}_0$$, (8) holds under (C-i) and (C-ii).

From Theorem 4, we recommend using $$W(+1)= {\hat{W}}_0$$ (and $$W(-1)=1$$) in the WDWD.

Hereafter, we call the classifier based on $${y}_{{\mathrm{W}}}(\varvec{x}_0)$$ with $$W(+1)={\hat{W}}_0$$ “the optimal WDWD (OWDWD)”. See Sect. 5 for numerical comparisons.
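The weight estimator is a one-line plug-in; the sketch below (the function name `w0_hat` is ours) computes $${\hat{W}}_0$$ from $${\hat{\Delta }}^2$$ and $$\text{ tr }(\varvec{S}_i)$$:

```python
import numpy as np

def w0_hat(n1, n2, d_sq, tr_S1, tr_S2):
    """Estimated optimal weight hat{W}_0 for the OWDWD (with W(-1) = 1).

    d_sq is hat{Delta}^2; tr_S1, tr_S2 are tr(S_1), tr(S_2).
    """
    num = n1 * (d_sq + 2 * tr_S2 / n2) ** 2
    den = n2 * (d_sq + 2 * tr_S1 / n1) ** 2
    return num / den
```

When $$n_1=n_2$$ and $$\text{ tr }(\varvec{S}_1)=\text{ tr }(\varvec{S}_2)$$, $${\hat{W}}_0=1$$, so the OWDWD reduces to the DWD.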

## Comparison in high-dimensional setting

We used computer simulations to compare the performance of the classifiers: the DWD, the BC-DWD, the WDWD, and the OWDWD. We set $$W(+1)=n_1/n_2$$ for the WDWD. Note that the WDWD is equivalent to the DWD when $$n_1=n_2$$.

As for $$\Pi _i\ (i=1,2)$$, we considered the following three cases:

1. (i) $$\varvec{x}_{ij}$$ is distributed as $$N_{d}(\varvec{\mu }_i, \varvec{\Sigma }_i)$$;

2. (ii) $$\varvec{x}_{ij}-\varvec{\mu }_i=\varvec{\Sigma }_i^{1/2}(z_{ij1},\ldots ,z_{ijd})^{\mathrm{T}}$$ for all j, where $$z_{ij\ell }=(v_{ij\ell }-5)/{10}^{1/2}$$ $$(\ell =1,\ldots ,d)$$ and the $$v_{ij\ell }$$s are i.i.d. as the chi-squared distribution with 5 degrees of freedom; and

3. (iii) $$\varvec{x}_{ij}-\varvec{\mu }_i$$, $$j=1,\ldots ,n_i,$$ are i.i.d. as a d-variate t-distribution, $$t_d(\varvec{0}, \varvec{\Sigma }_i, 10)$$, $$i=1,2,$$ with mean zero, covariance matrix $$\varvec{\Sigma }_i$$, and 10 degrees of freedom.

Note that the conditions (C-i) and (C-ii) are equivalent for (i) and (ii) because $$\text{ Var }(\Vert \varvec{x}_{ij}-\varvec{\mu }_i \Vert ^2)=O\{\text{ tr }(\varvec{\Sigma }_i^2)\}$$. We set $$d=2^s,\ s=5,\ldots ,11$$, $$\varvec{\mu }_1=\varvec{0}$$, and $$\varvec{\mu }_2=(1,\ldots ,1,0,\ldots ,0)^{\mathrm{T}}$$ whose first $$\lceil d^{2/3} \rceil$$ elements are 1. Note that $$\Delta ^2\approx d^{2/3}$$. Let $$\varvec{\Phi }=\varvec{C}( 0.3^{|i-j|^{1/3}})\varvec{C}$$, where $$\varvec{C}=\text{ diag }(\{0.5+1/(d+1)\}^{1/2},\ldots ,\{0.5+d/(d+1)\}^{1/2})$$. Note that $$\text{ tr }(\varvec{\Phi })=d$$. We considered four cases:

1. (a) $$(n_1,n_2)=(10,10)$$ and $$\varvec{\Sigma }_1=\varvec{\Sigma }_2=\varvec{\Phi }$$ for (i);

2. (b) $$(n_1,n_2)=(5,15)$$ and $$\varvec{\Sigma }_1=\varvec{\Sigma }_2=\varvec{\Phi }$$ for (i);

3. (c) $$(n_1,n_2)=(10,10)$$, $$\varvec{\Sigma }_1=(2/3)\varvec{\Phi }$$, and $$\varvec{\Sigma }_2=(4/3)\varvec{\Phi }$$ for (ii); and

4. (d) $$(n_1,n_2)=(8,12)$$, $$\varvec{\Sigma }_1=(6/5)\varvec{\Phi }$$, and $$\varvec{\Sigma }_2=(4/5)\varvec{\Phi }$$ for (iii).
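The covariance matrix $$\varvec{\Phi }$$ can be constructed as follows (a sketch under our naming; note $$\text{ tr }(\varvec{\Phi })=d$$ because the correlation factor has unit diagonal and $$\sum _{i=1}^d\{0.5+i/(d+1)\}=d$$):

```python
import numpy as np

def make_phi(d, rho=0.3):
    """Covariance matrix Phi = C (rho^{|i-j|^{1/3}}) C from the simulation setup."""
    idx = np.arange(1, d + 1)
    # correlation factor with entries rho^{|i-j|^{1/3}} (ones on the diagonal)
    R = rho ** (np.abs(idx[:, None] - idx[None, :]) ** (1.0 / 3.0))
    # diagonal of C: sqrt(0.5 + i/(d+1)), i = 1, ..., d
    c = np.sqrt(0.5 + idx / (d + 1.0))
    return (c[:, None] * R) * c[None, :]     # C R C for diagonal C
```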

Note that $$\text{ tr }(\varvec{\Sigma }_i^2)=O(d),\ i=1,2$$, for (a) to (d), so that (C-ii) holds in all cases. In addition, note that $$\text{ tr }(\varvec{\Sigma }_1)-\text{ tr }(\varvec{\Sigma }_2)=-(2/3)d$$ for (c) and $$\text{ tr }(\varvec{\Sigma }_1)-\text{ tr }(\varvec{\Sigma }_2)=(2/5)d$$ for (d), so that $$|\text{ tr }(\varvec{\Sigma }_1)-\text{ tr }(\varvec{\Sigma }_2)|/\Delta ^2\rightarrow \infty$$ as $$d\rightarrow \infty$$ for (c) and (d). We repeated the experiment 2000 times, checking each time whether the classifier classified $$\varvec{x}_0 \in \Pi _i$$ correctly, and defined $$P_{ir}=0\ (\text{ or }\ 1)$$ accordingly for each $$\Pi _i\ (i=1,2)$$. We calculated the error rates $${\overline{e}}(i)= \sum _{r=1}^{2000}P_{ir}/2000$$, $$i=1,2$$, and the average error rate $${\overline{e}}=\{{\overline{e}}(1)+{\overline{e}}(2)\}/2$$. Their standard deviations are less than 0.011. In Figs. 2 and 3, we plot $${\overline{e}}(1)$$, $${\overline{e}}(2)$$, and $${\overline{e}}$$ for (a) to (d).

We observed that the DWD and the WDWD performed quite poorly for (b) to (d). This is a natural consequence of the bias in the DWD and the WDWD. Note that $$|\delta | \rightarrow \infty$$ and $$|\delta _{{\mathrm{W}}}| \rightarrow \infty$$ as $$d\rightarrow \infty$$ for (b) to (d), where $$W(+1)=n_1/n_2$$ in $$\delta _{{\mathrm{W}}}$$. Thus, from Corollaries 1 and 2, the DWD and the WDWD suffer strong inconsistency. On the other hand, the BC-DWD and the OWDWD performed well in all cases. We emphasize that the BC-DWD and the OWDWD enjoy the consistency property without (C-iii) or (C-iv). See Sects. 3.2 and 4 for the details.

## Real data analysis

In this section, we analyze gene expression data using the DWD, the BC-DWD, the WDWD, the OWDWD, the (linear) SVM, and the bias-corrected SVM (BC-SVM) by Nakayama et al. (2017). We set $$W(+1)=n_1/n_2$$ for the WDWD. Note that the WDWD is equivalent to the DWD when $$n_1=n_2$$. We used the colon cancer data with $$2000\ (=d)$$ genes in Alon et al. (1999), which consist of $$\Pi _1:$$ colon tumor (40 samples) and $$\Pi _2:$$ normal colon (22 samples).

We randomly split the data from $$(\Pi _1,\Pi _2)$$ into training data sets of sizes $$(n_1,n_2)$$ and test data sets of sizes $$(40-n_1,22-n_2)$$. We considered eight cases: $$(n_1,n_2)=$$ (5,5), (5,15), (15,5), (15,15), (25,5), (25,15), (35,5), and (35,15). We constructed the DWD, the BC-DWD, the WDWD, the OWDWD, the SVM, and the BC-SVM using the training data sets and checked their accuracy using the test data set for each $$\Pi _i$$, denoting the misclassification rates by $${\widehat{e}}(1)_r$$ and $${\widehat{e}}(2)_r$$. We repeated this procedure 100 times, obtaining $${\widehat{e}}(1)_r$$ and $${\widehat{e}}(2)_r$$, $$r=1,\ldots ,100$$. We calculated the average misclassification rates, $${\overline{e}}(1)\ (=\sum _{r=1}^{100}{\widehat{e}}(1)_r/100)$$, $${\overline{e}}(2)\ (=\sum _{r=1}^{100}{\widehat{e}}(2)_r/100)$$, and $${\overline{e}}\ (=\{{\overline{e}}(1)+{\overline{e}}(2) \}/2)$$, for the classifiers in various combinations of $$(n_1,n_2)$$ in Table 1.

We observed that the BC-DWD and the OWDWD performed well compared with the DWD, the WDWD, and the SVM, especially when $$n_1$$ and $$n_2$$ are imbalanced. See Sects. 3.2 and 4 for the theoretical reasons. The BC-SVM also performed adequately even when $$n_1$$ and $$n_2$$ are imbalanced. This is an acceptable consequence because the BC-SVM has the consistency (8) under (C-i) and (C-ii). See Section 3 in Nakayama et al. (2017) for the details. However, the BC-DWD (or the OWDWD) seems to perform better than the BC-SVM. This is a natural consequence because the DWD takes account of all the data vectors, not only those used by the SVM. See Fig. 1. A theoretical study of the relation between the BC-SVM and the BC-DWD is left as future work.

## Proofs

Throughout this section, let $$\varvec{\mu }_{12}=\varvec{\mu }_1-\varvec{\mu }_2$$ and $$\varvec{\mu }=(\varvec{\mu }_1+\varvec{\mu }_2)/2$$.

### Proof of Lemma 1

Under (C-ii), we have that as $$d\rightarrow \infty$$

\begin{aligned} \frac{\text{ tr }(\varvec{\varSigma }_1\varvec{\varSigma }_2)}{\Delta ^4}\le \frac{\sqrt{ \text{ tr }(\varvec{\varSigma }_1^2) \text{ tr }(\varvec{\varSigma }_2^{2})}}{\Delta ^4}=o(1) \quad \text{ and } \quad \frac{\varvec{\mu }_{12}^{\mathrm{T}}\varvec{\varSigma }_i\varvec{\mu }_{12}}{\Delta ^4} \le \frac{\sqrt{ \text{ tr }(\varvec{\varSigma }_i^2)}}{\Delta ^2}=o(1) \end{aligned}
(14)

for $$i=1,2$$. Then, using Chebyshev’s inequality, for any $$\tau >0$$, under (C-i) and (C-ii), we have that

\begin{aligned}&P(|(\varvec{x}_j-\varvec{\mu })^{\mathrm{T}}(\varvec{x}_{j'}-\varvec{\mu })-\Delta ^2/4 |\ge \tau \Delta ^2 )\nonumber \\&\quad \le {E[\{(\varvec{x}_j-\varvec{\mu })^{\mathrm{T}}(\varvec{x}_{j'}-\varvec{\mu })-\Delta ^2/4 \}^2]}/{(\tau \Delta ^2)^2} \nonumber \\&\qquad =O\{\text{ tr }(\varvec{\varSigma }_1^2)+\varvec{\mu }_{12}^{\mathrm{T}}\varvec{\varSigma }_1\varvec{\mu }_{12}\}/\Delta ^4=o(1) \nonumber \\&\qquad \qquad \text{ for } 1\le j< j'\le n_1; \nonumber \\&P(|(\varvec{x}_j-\varvec{\mu })^{\mathrm{T}}(\varvec{x}_{j'}-\varvec{\mu })-\Delta ^2/4 |\ge \tau \Delta ^2 ) \nonumber \\&\qquad =O\{\text{ tr }(\varvec{\varSigma }_2^2)+\varvec{\mu }_{12}^{\mathrm{T}}\varvec{\varSigma }_2\varvec{\mu }_{12}\}/\Delta ^4=o(1)\nonumber \\&\qquad \qquad \text{ for } n_1+1 \le j< j' \le N;\quad \text{ and } \nonumber \\&P(|(\varvec{x}_j-\varvec{\mu })^{\mathrm{T}}(\varvec{x}_{j'}-\varvec{\mu })+\Delta ^2/4 |\ge \tau \Delta ^2 )\nonumber \\&\qquad =O\{\text{ tr }(\varvec{\varSigma }_1\varvec{\varSigma }_2)+\varvec{\mu }_{12}^{\mathrm{T}}(\varvec{\varSigma }_1+\varvec{\varSigma }_2)\varvec{\mu }_{12}\}/\Delta ^4=o(1)\nonumber \\&\qquad \qquad \hbox {for } 1 \le j \le n_1 \hbox { and } n_1+1\le j' \le N. \end{aligned}
(15)

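For reference, the two bounds in (14) are, respectively, an instance of the Cauchy–Schwarz inequality for the trace (Frobenius) inner product and a largest-eigenvalue bound. A sketch, writing $$\Delta ^2=\Vert \varvec{\mu }_{12}\Vert ^2$$:

\begin{aligned} \text{ tr }(\varvec{\varSigma }_1\varvec{\varSigma }_2)&\le \Vert \varvec{\varSigma }_1\Vert _F\Vert \varvec{\varSigma }_2\Vert _F =\sqrt{ \text{ tr }(\varvec{\varSigma }_1^2) \text{ tr }(\varvec{\varSigma }_2^{2})} \quad \text{ and }\\ \varvec{\mu }_{12}^{\mathrm{T}}\varvec{\varSigma }_i\varvec{\mu }_{12}&\le \lambda _{\max }(\varvec{\varSigma }_i)\Vert \varvec{\mu }_{12}\Vert ^2 \le \sqrt{ \text{ tr }(\varvec{\varSigma }_i^2)}\,\Delta ^2, \end{aligned}

since $$\lambda _{\max }(\varvec{\varSigma }_i)\le \sqrt{ \text{ tr }(\varvec{\varSigma }_i^2)}$$ for the nonnegative-definite $$\varvec{\varSigma }_i$$.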
From (14), for any $$\tau >0$$, under (C-i) and (C-ii), we have that

\begin{aligned}&P\Big (\Big | \Vert \varvec{x}_j-\varvec{\mu }\Vert ^2-\Delta ^2/4-\text{ tr }(\varvec{\varSigma }_1) \Big |\ge \tau \Delta ^2 \Big )\nonumber \\&\qquad =O\{\text{ Var }(\Vert \varvec{x}_{1j}-\varvec{\mu }_1\Vert ^2) +\varvec{\mu }_{12}^{\mathrm{T}}\varvec{\varSigma }_1\varvec{\mu }_{12}\}/\Delta ^4=o(1) \nonumber \\&\qquad \qquad \text{ for } j=1,\ldots ,n_1;\quad \text{ and } \nonumber \\&P\Big (\Big |\Vert \varvec{x}_j-\varvec{\mu }\Vert ^2-\Delta ^2/4-\text{ tr }(\varvec{\varSigma }_2) \Big |\ge \tau \Delta ^2 \Big )=o(1) \nonumber \\&\qquad \qquad \text{ for } j=n_1+1,\ldots ,N. \end{aligned}
(16)

Here, under (4), we can write that

\begin{aligned} K(\varvec{\alpha }) =\sum _{j=1}^N\sum _{j'=1}^N \alpha _j\alpha _{j'}t_jt_{j'}(\varvec{x}_j-\varvec{\mu })^{\mathrm{T}}(\varvec{x}_{j'}-\varvec{\mu }). \end{aligned}

Then, by noting that $$\alpha _j> 0$$ for all j under (4), from (15) and (16), we have that

\begin{aligned} K(\varvec{\alpha })&= \frac{\Delta ^2}{4}\left( \sum _{j=1}^N\alpha _j\right) ^2\{1+o_p(1)\}\\&\quad +\left( \text{ tr }(\varvec{\varSigma }_1)\sum _{j=1}^{n_1}\alpha _j^2+ \text{ tr }(\varvec{\varSigma }_2)\sum _{j=n_1+1}^{N}\alpha _j^2\right) \end{aligned}

under (C-i) and (C-ii). This concludes the result.
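As a numerical sanity check of the approximation above, the following sketch simulates a toy setting with $$\varvec{\varSigma }_1=\varvec{\varSigma }_2=\varvec{I}_d$$ and $$\alpha _j=1$$ for all j, and compares $$K(\varvec{\alpha })$$ with its leading-term approximation $$(\Delta ^2/4)(\sum _j\alpha _j)^2+\text{ tr }(\varvec{\varSigma }_1)\sum _{j\le n_1}\alpha _j^2+\text{ tr }(\varvec{\varSigma }_2)\sum _{j>n_1}\alpha _j^2$$. The dimension, mean shift, and Gaussian populations are illustrative assumptions, not part of the proof.

```python
import random

def k_alpha(xs, ts, alphas, mu):
    # K(alpha) = sum_{j,j'} a_j a_{j'} t_j t_{j'} (x_j - mu)^T (x_{j'} - mu)
    zs = [[a - b for a, b in zip(x, mu)] for x in xs]
    return sum(
        alphas[j] * alphas[k] * ts[j] * ts[k]
        * sum(u * v for u, v in zip(zs[j], zs[k]))
        for j in range(len(xs)) for k in range(len(xs))
    )

# Toy HDLSS setting: Sigma_1 = Sigma_2 = I_d, so tr(Sigma_i^2)/Delta^4
# = d/Delta^4 is small when Delta^2 >> sqrt(d), as (C-ii) requires.
d, n1, n2 = 10_000, 3, 3
rng = random.Random(0)
shift = 0.2                      # mu_12 = (0.2, ..., 0.2), Delta^2 = 400
mu1 = [shift / 2] * d            # take mu = (mu_1 + mu_2)/2 = 0
mu2 = [-shift / 2] * d
xs = ([[m + rng.gauss(0, 1) for m in mu1] for _ in range(n1)]
      + [[m + rng.gauss(0, 1) for m in mu2] for _ in range(n2)])
ts = [-1] * n1 + [1] * n2        # class labels t_j
alphas = [1.0] * (n1 + n2)
mu = [0.0] * d

delta2 = shift ** 2 * d          # Delta^2 = ||mu_12||^2
approx = ((delta2 / 4) * sum(alphas) ** 2
          + d * sum(a ** 2 for a in alphas[:n1])     # tr(Sigma_1) = d
          + d * sum(a ** 2 for a in alphas[n1:]))    # tr(Sigma_2) = d
exact = k_alpha(xs, ts, alphas, mu)
```

In this configuration the leading term is dominated by $$\text{ tr }(\varvec{\varSigma }_i)\ (=d)$$, and the simulated $$K(\varvec{\alpha })$$ concentrates around the approximation as the lemma indicates.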

### Proof of Lemma 2

From (5) and (6), we can claim the first result of Lemma 2. Next, we consider the second result of Lemma 2. By noting that $$\sum _{j=1}^N{\hat{\alpha }}_{j}t_j\varvec{x}_j=\sum _{j=1}^N{\hat{\alpha }}_{j}t_j(\varvec{x}_j-\varvec{\mu })$$, from the first result of Lemma 2, (15) and (16), under (C-i) and (C-ii), it holds that as $$d\rightarrow \infty$$:

\begin{aligned}&K({\hat{\varvec{\alpha }}})=\frac{\big ( \sqrt{n_1}+\sqrt{n_2}\big )^4 }{\Delta _*^2}\{1+o_p(\zeta )\}; \\&\sum _{j=1}^N\sum _{j'=1}^N\frac{{\hat{\alpha }}_jt_j(\varvec{x}_j-\varvec{\mu })^{\mathrm{T}}(\varvec{x}_{j'}-\varvec{\mu })}{\sqrt{K({\hat{\varvec{\alpha }}})}} \\&\quad = \frac{ \Delta ^2(n_2-n_1)-{2}\text{ tr }(\varvec{\varSigma }_1)+{2}\text{ tr }(\varvec{\varSigma }_2)+o_p(\Delta ^2) }{2 \Delta _*}; \\&\qquad \qquad \qquad \text{ and } \\&\sum _{j=1}^N \frac{{\hat{\alpha }}_jt_j(\varvec{x}_j-\varvec{\mu })^{\mathrm{T}}(\varvec{x}_0-\varvec{\mu })}{\sqrt{K({\hat{\varvec{\alpha }}})}} =\frac{(-1)^{i}\Delta ^2 }{2\Delta _*} \{1+o_p(1)\} \\&\quad \text{ when } \varvec{x}_0 \in \Pi _i, i=1,2. \end{aligned}

Then, from the first result of Lemma 2, we can claim the second result of Lemma 2. This concludes the results of Lemma 2.

### Proofs of Theorem 1 and Corollary 1

Using (7), the results are obtained straightforwardly.

### Proof of Theorem 2

By combining (7) with (11), we can conclude the result.

### Proofs of Lemma 3, Corollary 2, and Theorems 3 and 4

Similarly to (6), the first result of Lemma 3 follows from Lemma 1 and (5). The second result of Lemma 3 can be claimed in a way similar to the proof of Lemma 2. The results of Corollary 2 and Theorems 3 and 4 follow by combining (12) with (13).