Abstract
While distance-weighted discrimination (DWD) was proposed to improve the support vector machine in high-dimensional settings, it is known that the DWD is quite sensitive to imbalance in sample sizes. In this paper, we study asymptotic properties of the DWD in high-dimension, low-sample-size (HDLSS) settings. We show that the DWD has a large bias caused by heterogeneity of covariance matrices as well as sample imbalance. We propose a bias-corrected DWD (BC-DWD) and show that the BC-DWD enjoys consistency properties for misclassification rates. We also consider the weighted DWD (WDWD) and propose an optimal choice of weights in the WDWD. Finally, we discuss the performance of the BC-DWD and the WDWD with the optimal weights in numerical simulations and actual data analyses.
1 Introduction
Along with the development of technology, we often encounter high-dimension, low-sample-size (HDLSS) data. In this paper, we consider two-class linear discriminant analysis for HDLSS data. Suppose we have two independent d-variate populations, \(\Pi _i\), \(i=1,2\), where \(\Pi _i\) has an unknown mean vector \(\varvec{\mu }_i\) and an unknown covariance matrix \({\varvec{\Sigma }}_i\). We have independent and identically distributed (i.i.d.) observations, \(\varvec{x}_{i1},\ldots ,\varvec{x}_{in_i}\), from each \(\Pi _i\). We assume \(n_i\ge 2\), \(i=1,2\). We also assume that
for \(i=1,2\), where \(\Vert \cdot \Vert \) denotes the Euclidean norm. Let \(N=n_1+n_2\). We simply write that \((\varvec{x}_1,\ldots ,\varvec{x}_N)=(\varvec{x}_{11},\ldots ,\varvec{x}_{1n_1},\varvec{x}_{21},\ldots ,\varvec{x}_{2n_2})\). We denote the class label of \(\varvec{x}_j\) by \(t_j\), where \(t_j=-1\) for \(j=1,\ldots ,n_1\) and \(t_j=+1\) for \(j=n_1+1,\ldots ,N\). Let \(\varvec{x}_0\) be an observation vector of an individual belonging to one of the \(\Pi _i\)s. We assume that \(\varvec{x}_0\) and the \(\varvec{x}_{ij}\)s are independent.
In the HDLSS context, Hall et al. (2008), Chan and Hall (2009), and Aoshima and Yata (2014) considered distance-based classifiers. Aoshima and Yata (2019a) considered a distance-based classifier based on a data transformation technique. Aoshima and Yata (2011, 2015) considered geometric classifiers based on a geometric representation of HDLSS data. Aoshima and Yata (2019b) considered quadratic classifiers in general and discussed optimality of the classifiers under high-dimension, non-sparse settings. In the field of machine learning, there are many studies on classification (supervised learning). A typical method is the support vector machine (SVM) developed by Vapnik (2000). Hall et al. (2005), Chan and Hall (2009), and Nakayama et al. (2017, 2020) investigated asymptotic properties of the SVM in the HDLSS context. Nakayama et al. (2017, 2020) pointed out the strong inconsistency of the SVM when the \(n_i\)s are imbalanced. They proposed bias-corrected SVMs and showed their superiority to the original SVMs. On the other hand, Marron et al. (2007) pointed out that the SVM causes data piling in the HDLSS context. Data piling is the phenomenon in which the projections of the training data onto the normal direction vector of a separating hyperplane coincide within each class. See Fig. 1 in Sect. 2. To avoid the data piling problem of the SVM, Marron et al. (2007) proposed the distance-weighted discrimination (DWD). Whereas the SVM finds the optimal hyperplane by maximizing the minimum distance from each class to the hyperplane, the DWD finds a proper hyperplane by minimizing the sum of the reciprocals of the distances from the data points to the hyperplane. The DWD thus takes account of all the data vectors, which are not always used by the SVM. Unfortunately, the DWD is designed for balanced training data sets. See Qiao et al. (2010) and Qiao and Zhang (2015). For imbalanced training data sets, Qiao et al. (2010) developed the weighted DWD (WDWD), which imposes different weights on the two classes.
However, the WDWD is sensitive to the choice of weights.
In this paper, we investigate the DWD and the WDWD theoretically in the HDLSS context where \(d\rightarrow \infty \) while N is fixed. In Sect. 2, we review the DWD. In Sect. 3, we give asymptotic properties of the DWD. We show that the DWD has a large bias caused by heterogeneity of covariance matrices as well as sample imbalance. We propose a bias-corrected DWD (BC-DWD) and show that the BC-DWD enjoys consistency properties for misclassification rates. In Sect. 4, we give asymptotic properties of the WDWD. We propose an optimal choice of the weights in the WDWD. Finally, in Sects. 5 and 6, we discuss the performance of the BC-DWD and the WDWD with the optimal weights in numerical simulations and actual data analyses.
2 Formulation of the DWD
In this section, we give a formulation of the DWD following Marron et al. (2007).
Let \(\varvec{w}\in {\mathbb {R}}^d\) be a normal vector and \( b\in {\mathbb {R}}\) an intercept term. Let \(r_j=t_j(\varvec{w}^{\mathrm{T}}\varvec{x}_j+ b)\) for \(j=1,\ldots ,N\). When the training data sets are linearly separable, the DWD is defined by minimizing the sum of \(1/r_j\) over all observations. Note that HDLSS data are linearly separable by a hyperplane. Thus, the optimization problem of the DWD is as follows:
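A simplified form of this problem, following Marron et al. (2007) but omitting the penalty term they add for non-separable data, is

$$\min _{\varvec{w},\, b}\ \sum _{j=1}^{N}\frac{1}{r_j}\quad \text{ subject } \text{ to }\quad r_j=t_j(\varvec{w}^{\mathrm{T}}\varvec{x}_j+b)>0,\ j=1,\ldots ,N,\ \text{ and }\ \Vert \varvec{w}\Vert \le 1.$$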
The dual problem of the above optimization problem can be written as
subject to \(\alpha _j> 0\), \(j=1,\ldots ,N\), and \(\lambda > 0\), where \(\varvec{\alpha }=(\alpha _1,\ldots ,\alpha _N)^{\mathrm{T}}\), \(\varvec{r}=(r_1,\ldots , r_N)^{\mathrm{T}}\), and \(\lambda \) and \(\alpha _j\)s are Lagrange multipliers. Let
Then, we have that
Let
The optimization problem can be transformed into the following:
subject to \(\alpha _j> 0\), \(j=1,\ldots ,N\), \(\lambda > 0\), and \(\sum _{j=1}^N\alpha _jt_j=0\). Let
Then, by noting that
we can rewrite the optimization problem (2) as follows:
subject to
Let
and
Then, from (1) and (3), we write that
The intercept term b is given by
Thus, we consider estimating b by the average:
Then, the classifier function of the DWD is defined by
One classifies \(\varvec{x}_0\) into \(\Pi _1\) if \({y}(\varvec{x}_0 )<0\) and into \( \Pi _2\) otherwise.
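In code, this decision rule is a sign check on the classifier score; the following is a minimal sketch assuming the linear form \(\hat{\varvec{w}}^{\mathrm{T}}\varvec{x}_0+\hat{b}\), where `w_hat` and `b_hat` are hypothetical names for the solution obtained from the optimization problem above:

```python
import numpy as np

def dwd_classify(x0, w_hat, b_hat):
    """Classify x0 by the sign of the score y(x0) = w_hat^T x0 + b_hat.

    Returns 1 (class Pi_1) when y(x0) < 0 and 2 (class Pi_2) otherwise,
    matching the decision rule stated in the text.
    """
    y = float(np.dot(w_hat, x0) + b_hat)
    return 1 if y < 0 else 2
```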
Now, let us use the following toy example to see data piling. We set \(n_1=n_2=25\) and \(d=2^s\), \(s=5,\ldots ,8\). Independent pseudo-random observations were generated from \(\Pi _i:N_d(\varvec{\mu }_i,\varvec{\Sigma }_i)\). We set \(\varvec{\mu }_1=\varvec{0}\), \(\varvec{\mu }_2=(1,\ldots ,1,0,\ldots ,0)^{\mathrm{T}}\) whose first \(\lceil d^{2/3} \rceil \) elements are 1, and \(\varvec{\Sigma }_1=\varvec{\Sigma }_2={\varvec{I}}_d\), where \(\lceil x \rceil \) denotes the smallest integer \(\ge x\) and \(\varvec{I}_d\) denotes the d-dimensional identity matrix. Let \({y}_{{\mathrm{SVM}}}(\cdot )\) be the classifier function of the (linear) SVM. In Fig. 1, we give the histograms of the \({y}_{{\mathrm{SVM}}}(\varvec{x}_j)\)s and the normalized \({y}(\varvec{x}_j)\)s.
We observed that, as d increases, the projected training data points for the SVM concentrate at \(-1\) when \(\varvec{x}_j \in \Pi _1\) and at 1 when \(\varvec{x}_j \in \Pi _2\). This phenomenon is data piling. See Nakayama et al. (2017) for the theoretical reason. On the other hand, the training data points for the DWD did not show this phenomenon. We emphasize that the DWD takes account of all the data vectors, which are not always used by the SVM. However, in the next section, we show that the DWD has a large bias caused by heterogeneity of covariance matrices as well as sample imbalance.
3 Asymptotic properties of the DWD and its bias correction
In this section, we first give asymptotic properties of the DWD in the HDLSS context. We show that the DWD has a large bias caused by heterogeneity of covariance matrices as well as sample imbalance. To overcome these difficulties, we propose a bias-corrected DWD.
3.1 Asymptotic properties of the DWD
Let \(\Delta ={\Vert \varvec{\mu }_1-\varvec{\mu }_2\Vert }\). For \(K(\varvec{\alpha })\), we have the following result.
Lemma 1
Assume
- (C-i):
-
$$ \displaystyle \frac{ \text{ Var } (\Vert {\varvec{x}}_{ij}-{\varvec{\mu }}_i\Vert ^2)}{\varDelta ^4}\rightarrow 0 \text{ as } d\rightarrow \infty \text{ for } i=1,2; \text{ and }$$
- (C-ii):
-
$$\displaystyle \frac{ \text{ tr } (\varvec{ \Sigma }_i^2)}{\varDelta ^4}\rightarrow 0 \text{ as } d\rightarrow \infty \text{ for } i=1,2.$$
Under (4), it holds that as \(d\rightarrow \infty \)
Remark 1
The conditions (C-i) and (C-ii) are equivalent when \(\Pi _i\)s are Gaussian because \(\text { Var }(\Vert {\varvec{x}}_{ij}-{\varvec{\mu }}_i\Vert ^2)=2\text { tr }(\varvec{\Sigma }_i^2)\). If \(\text { tr }(\varvec{\Sigma }_i^2)=O(d),\ i=1,2,\) and \(\Delta ^2/d^{1/2}\rightarrow \infty \) as \(d\rightarrow \infty \), (C-ii) is satisfied. See Sect. 5 for several models satisfying (C-ii).
We consider maximizing \(L(\varvec{\alpha })\). From Jensen’s inequality, we note that \(\sum _{j=1}^{n_1} \sqrt{\alpha _j}/n_1\le \sqrt{ \sum _{j=1}^{n_1}\alpha _j/n_1 }\), \(\sum _{j=n_1+1}^{N} \sqrt{\alpha _j}/n_2\le \sqrt{ \sum _{j=n_1+1}^{N}\alpha _j/n_2 }\), \(\sum _{j=1}^{n_1}\) \( \alpha _j^2/n_1\ge ( \sum _{j=1}^{n_1}\alpha _j/n_1 )^2\) and \(\sum _{j=n_1+1}^{N} \alpha _j^2/n_2 \ge ( \sum _{j=n_1+1}^{N}\alpha _j/n_2 )^2\). In addition, note that \(\sum _{j=1}^{n_1}\alpha _j=\sum _{j=n_1+1}^{N}\alpha _j=\sum _{j=1}^N\alpha _j/2\) under (4). Then, under (4) and the constraint that \(\sum _{j=1}^N\alpha _j=B\) for a given positive constant B, we can claim that
when \(\alpha _1=\cdots =\alpha _{n_1}=B/(2n_1)\) and \(\alpha _{n_1+1}=\cdots =\alpha _{N}=B/(2n_2)\). Thus, from Lemma 1, under (C-i) and (C-ii), it holds that
where
Hence, by choosing \(B\approx 2(\sqrt{n_1}+\sqrt{n_2})^2/\Delta _{*}^2\), \(L(\varvec{\alpha })\) attains its maximum asymptotically.
Lemma 2
Under (C-i) and (C-ii), it holds that as \(d\rightarrow \infty \)
Furthermore, it holds that as \(d\rightarrow \infty \)
when \(\varvec{x}_0\in \Pi _i\) for \(i=1,2,\) where
The quantity \(\delta \) vanishes if \(n_1=n_2\) and \(\varvec{\varSigma }_1=\varvec{\varSigma }_2\). We consider the following assumption:
- (C-iii):
-
\( \limsup |\delta | <\frac{1}{2}\).
Let e(i) denote the error rate of misclassifying an individual from \(\Pi _i\) into the other class for \(i=1,2\). Then, we have the following results.
Theorem 1
Under (C-i), (C-ii), and (C-iii), the following holds for the DWD:
However, without (C-iii), we have the following results.
Corollary 1
Under (C-i) and (C-ii), the following holds for the DWD:
Remark 2
For the DWD, Hall et al. (2005) and Qiao et al. (2010) showed the consistency property in Theorem 1 and the inconsistency properties in Corollary 1 under different conditions. However, we claim that (C-i), (C-ii) and (C-iii) are milder than their conditions.
From Corollary 1, the DWD suffers the strong inconsistency because of the large bias caused by the heterogeneity of covariance matrices as well as sample imbalance. For example, if \(\text{ tr }(\varvec{\varSigma }_i)/ {{\Delta }}^2 \rightarrow \infty \) as \(d\rightarrow \infty \) for some i, \(|\delta |\) tends to become large as d increases when \(\text{ tr }(\varvec{\varSigma }_1)\ne \text{ tr }(\varvec{\varSigma }_2)\) or \(n_1 \ne n_2\). To overcome these difficulties, we propose a bias-corrected DWD.
3.2 Bias-corrected DWD
We consider an unbiased estimator of \(\Delta ^2\) as follows:
where \({\overline{\varvec{x}}}_{i}=\sum _{j=1}^{n_i}{\varvec{x}_{ij}}/{n_i}\) and \(\varvec{S}_{i}=\sum _{j=1}^{n_i}(\varvec{x}_{ij}-{\overline{\varvec{x}}}_{i})(\varvec{x}_{ij}-{\overline{\varvec{x}}}_{i})^{\mathrm{T}}/(n_i-1)\) for \(i=1,2\). Note that \(E({\hat{\Delta }}^2)=\Delta ^2\). From \(\text { tr }(\varvec{\Sigma }_1\varvec{\Sigma }_2)\le \sqrt{\text { tr }(\varvec{\Sigma }_1^2)\text { tr }(\varvec{\Sigma }_2^2)}\) and \(({\varvec{\mu }}_1-{\varvec{\mu }}_2)^{\mathrm {T}}\varvec{\Sigma }_i({\varvec{\mu }}_1-{\varvec{\mu }}_2)\le {{\Delta }} ^2 \sqrt{ \text { tr }({\varvec{\Sigma }}_i^2)}\), under (C-ii), it holds that as \(d\rightarrow \infty \)
Thus, under (C-ii), from Chebyshev’s inequality, it holds that as \(d\rightarrow \infty \)
See also Aoshima and Yata (2018) for asymptotic properties of \({\widehat{{\Delta }}}^2\). On the other hand, we consider an unbiased estimator of \(\Delta _*^2\) as follows:
Note that \(E({\hat{\Delta }}_*^2)=\Delta _*^2\). Here, we write that
Then, by noting that \(\text{ Var }\{ \sum _{j\ne j'}^{n_i} (\varvec{x}_{ij}-\varvec{\mu }_i)^{\mathrm{T}}(\varvec{x}_{ij'}-\varvec{\mu }_i)/n_i^2 \} =O\{ \text{ tr }(\varvec{\varSigma }_i^2)\}\) and \(\text{ Var }\{\Vert \varvec{x}_{ij}-\varvec{\mu }_i\Vert ^2-\text{ tr }(\varvec{\varSigma }_i)\}=\text{ Var }(\Vert \varvec{x}_{ij}-\varvec{\mu }_i\Vert ^2)\), under (C-i) and (C-ii), from (9), it holds that as \(d\rightarrow \infty \)
Let \(\delta _*=\Delta ^2 \delta /\Delta _*\) and \({{\hat{\Delta }}_*=\sqrt{{\hat{\Delta }}_*^2}}\). We consider an estimator of \(\delta _*\) as follows:
Then, under (C-i) and (C-ii), from (9) and (10), it holds that as \(d\rightarrow \infty \)
Now, we define the bias-corrected DWD (BC-DWD) by
One classifies \(\varvec{x}_0\) into \(\Pi _1\) if \({y}_{{\mathrm{BC}}}(\varvec{x}_0)<0\) and into \( \Pi _2\) otherwise. Then, from Lemma 2 and (11), we have the following result.
Theorem 2
For the BC-DWD, (8) holds under (C-i) and (C-ii).
We emphasize that the BC-DWD enjoys the asymptotic consistency without assuming (C-iii). See Sect. 5 for numerical comparisons.
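The estimator \({\hat{\Delta }}^2\) in this subsection is straightforward to compute; the numpy sketch below assumes the standard unbiased form \({\hat{\Delta }}^2=\Vert {\overline{\varvec{x}}}_{1}-{\overline{\varvec{x}}}_{2}\Vert ^2-\text{ tr }(\varvec{S}_{1})/n_1-\text{ tr }(\varvec{S}_{2})/n_2\), which removes the expected sampling-noise contribution of each sample mean from the squared distance between the sample means:

```python
import numpy as np

def delta2_hat(X1, X2):
    """Unbiased estimator of Delta^2 = ||mu_1 - mu_2||^2.

    X1, X2 : (n_i, d) arrays of observations from Pi_1 and Pi_2.
    Since E||xbar_1 - xbar_2||^2 = Delta^2 + tr(Sigma_1)/n_1 + tr(Sigma_2)/n_2,
    subtracting tr(S_i)/n_i makes the estimator unbiased.
    """
    n1, n2 = len(X1), len(X2)
    xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)
    # tr(S_i) = sum of coordinatewise sample variances (ddof=1)
    tr_S1 = X1.var(axis=0, ddof=1).sum()
    tr_S2 = X2.var(axis=0, ddof=1).sum()
    return float(np.sum((xbar1 - xbar2) ** 2) - tr_S1 / n1 - tr_S2 / n2)
```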
4 WDWD and its asymptotic properties
Qiao et al. (2010) developed the WDWD to overcome the weakness of the DWD for sample imbalance. The optimization problem of the WDWD is as follows:
where \(W(-1)\ (>0)\) and \(W(+1)\ (>0)\) are weights. In this paper, we assume \(W(-1)=1\) without loss of generality. We also assume that
Then, similar to the DWD, the dual optimization problem is written as follows:
subject to (4), where
Let us write that
Similar to the DWD, we obtain the classifier function of the WDWD:
where
and
Then, one classifies \(\varvec{x}_0\) into \(\Pi _1\) if \({y}_{{\mathrm{W}}}(\varvec{x}_0)<0\) and into \( \Pi _2\) otherwise.
As with the DWD, we have the following result.
Lemma 3
Under (C-i) and (C-ii), it holds that as \(d\rightarrow \infty \)
Furthermore, it holds that as \(d\rightarrow \infty \)
when \(\varvec{x}_0\in \Pi _i\) for \(i=1,2,\) where
We consider the following assumption:
- (C-iv):
-
\( \limsup |\delta _{{\mathrm{W}}}|<\frac{1}{2}\).
Then, we have the following results.
Theorem 3
For the WDWD, (8) holds under (C-i), (C-ii), and (C-iv).
Corollary 2
Under (C-i) and (C-ii), the following holds for the WDWD:
For the WDWD, Qiao et al. (2010) recommended using \(W(+1)=n_1/n_2\) in the case of equal costs. See Table 3 in Qiao et al. (2010). However, if \(\text{ tr }(\varvec{\varSigma }_i)/ \Delta ^2 \rightarrow \infty \) as \(d\rightarrow \infty \) for some i, \(|\delta _{{\mathrm{W}}}|\) with \(W(+1)=n_1/n_2\) tends to become large as d increases when \(\text{ tr }(\varvec{\varSigma }_1)\ne \text{ tr }(\varvec{\varSigma }_2)\) or \(n_1 \ne n_2\). Thus, from Corollary 2, the WDWD still suffers the strong inconsistency because of the large bias caused by the heterogeneity of covariance matrices as well as sample imbalance.
To overcome such difficulties, we propose an optimal choice of \(W(+1)\) in the WDWD. Let
We claim that
Thus, we consider the estimator of \(W_0\) as follows:
Then, under (C-i) and (C-ii), from (9) and (10), it holds that as \(d\rightarrow \infty \)
so that
Then, from Lemma 3 and (13), we have the following result.
Theorem 4
For the WDWD with \(W(+1)= {\hat{W}}_0\), (8) holds under (C-i) and (C-ii).
From Theorem 4, we recommend using \(W(+1)= {\hat{W}}_0\) (and \(W(-1)=1\)) in the WDWD.
Hereafter, we call the classifier \({y}_{{\mathrm{W}}}(\varvec{x}_0)\) with \(W(+1)={\hat{W}}_0\) “the optimal WDWD (OWDWD)”. See Sect. 5 for numerical comparisons.
5 Comparison in high-dimensional setting
We used computer simulations to compare the performance of the classifiers: the DWD, the BC-DWD, the WDWD, and the OWDWD. We set \(W(+1)=n_1/n_2\) for the WDWD. Note that the WDWD is equivalent to the DWD when \(n_1=n_2\).
As for \(\Pi _i\ (i=1,2)\), we considered the following three cases:
-
(i)
\(\varvec{x}_{ij}\) is \(N_{d}(\varvec{\mu }_i, \varvec{\varSigma }_i)\);
-
(ii)
Let \(\varvec{x}_{ij}-\varvec{\mu }_i=\varvec{\varSigma }_i^{1/2}(z_{ij1},\ldots ,z_{ijd})^{\mathrm{T}}\) for all j. Here, \(z_{ij\ell }=(v_{ij\ell }-5)/{10}^{1/2}\) \((\ell =1,\ldots ,d)\) in which \(v_{ij\ell }\)s are i.i.d. as the chi-squared distribution with 5 degrees of freedom; and
-
(iii)
\(\varvec{x}_{ij}-\varvec{\mu }_i\), \(j=1,\ldots ,n_i,\) are i.i.d. as a d-variate t-distribution, \(t_d(\varvec{0}, \varvec{\varSigma }_i, 10)\), \(i=1,2,\) with mean zero, covariance matrix \(\varvec{\varSigma }_i\) and degrees of freedom 10.
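In case (ii), \(v_{ij\ell }\sim \chi ^2_5\) has mean 5 and variance 10, so the standardized \(z_{ij\ell }=(v_{ij\ell }-5)/10^{1/2}\) has mean 0 and variance 1; a quick numpy check:

```python
import numpy as np

rng = np.random.default_rng(1)
# chi-squared with 5 degrees of freedom has mean 5 and variance 10
v = rng.chisquare(df=5, size=200_000)
# standardization used in case (ii): z has mean 0 and variance 1
z = (v - 5.0) / 10.0 ** 0.5
```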
Note that the conditions (C-i) and (C-ii) are equivalent for (i) and (ii) because \(\text{ Var }(\Vert \varvec{x}_{ij}-\varvec{\mu }_i \Vert ^2)=O\{\text{ tr }(\varvec{\varSigma }_i^2)\}\). We set \(d=2^s,\ s=5,\ldots ,11\), \(\varvec{\mu }_1=\varvec{0}\), and \(\varvec{\mu }_2=(1,\ldots ,1,0,\ldots ,0)^{\mathrm{T}}\) whose first \(\lceil d^{2/3} \rceil \) elements are 1. Note that \(\Delta ^2\approx d^{2/3}\). Let \(\varvec{\Phi }=\varvec{C}( 0.3^{|i-j|^{1/3}})\varvec{C}\), where \(\varvec{C}=\text{ diag }(\{0.5+1/(d+1)\}^{1/2},\ldots ,\{0.5+d/(d+1)\}^{1/2})\). Note that \(\text{ tr }(\varvec{\Phi })=d\). We considered four cases:
-
(a)
\((n_1,n_2)=(10,10)\) and \(\varvec{\varSigma }_1=\varvec{\varSigma }_2=\varvec{\Phi }\) for (i);
-
(b)
\((n_1,n_2)=(5,15)\) and \(\varvec{\varSigma }_1=\varvec{\varSigma }_2=\varvec{\Phi }\) for (i);
-
(c)
\((n_1,n_2)=(10,10)\), \(\varvec{\varSigma }_1=(2/3)\varvec{\Phi }\), and \(\varvec{\varSigma }_2=(4/3)\varvec{\Phi }\) for (ii); and
-
(d)
\((n_1,n_2)=(8,12)\), \(\varvec{\varSigma }_1=(6/5)\varvec{\Phi }\), and \(\varvec{\varSigma }_2=(4/5)\varvec{\Phi }\) for (iii).
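The mean vector \(\varvec{\mu }_2\) and the covariance structure \(\varvec{\Phi }=\varvec{C}( 0.3^{|i-j|^{1/3}})\varvec{C}\) above are fully specified and can be constructed directly; a minimal numpy sketch:

```python
import numpy as np

def make_setup(d):
    """Build mu_2 and Phi = C (0.3^{|i-j|^{1/3}}) C as specified in Sect. 5."""
    # mu_2: first ceil(d^{2/3}) elements equal to 1, the rest 0
    mu2 = np.zeros(d)
    mu2[: int(np.ceil(d ** (2.0 / 3.0)))] = 1.0
    # C = diag({0.5 + k/(d+1)}^{1/2}), k = 1, ..., d
    c = np.sqrt(0.5 + np.arange(1, d + 1) / (d + 1.0))
    idx = np.arange(d)
    R = 0.3 ** (np.abs(idx[:, None] - idx[None, :]) ** (1.0 / 3.0))
    Phi = c[:, None] * R * c[None, :]   # elementwise C R C for diagonal C
    return mu2, Phi
```

Note that \(\text{ tr }(\varvec{\Phi })=\sum _{k=1}^d\{0.5+k/(d+1)\}=d\), which the sketch reproduces.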
Note that \(\text{ tr }(\varvec{\varSigma }_i^2)=O(d),\ i=1,2\), for (a) to (d), so that (C-ii) holds for (a) to (d). In addition, note that \(\text{ tr }(\varvec{\varSigma }_1)-\text{ tr }(\varvec{\varSigma }_2)=-(2/3)d\) for (c) and \(\text{ tr }(\varvec{\varSigma }_1)-\text{ tr }(\varvec{\varSigma }_2)=(2/5)d\) for (d), so that \(|\text{ tr }(\varvec{\varSigma }_1)-\text{ tr }(\varvec{\varSigma }_2)|/\Delta ^2\rightarrow \infty \) as \(d\rightarrow \infty \) for (c) and (d). We repeated the experiment 2000 times, checking whether the classifier classifies \(\varvec{x}_0 \in \Pi _i\) correctly, and defined \(P_{ir}=0\) (correct) or \(P_{ir}=1\) (incorrect) accordingly for each \(\Pi _i\ (i=1,2)\). We calculated the error rates, \({\overline{e}}(i)= \sum _{r=1}^{2000}P_{ir}/2000\), \(i=1,2\). In addition, we calculated the average error rate, \({\overline{e}}=\{{\overline{e}}(1)+{\overline{e}}(2)\}/2\). Their standard deviations are less than 0.011. In Figs. 2 and 3, we plotted \({\overline{e}}(1)\), \({\overline{e}}(2)\), and \({\overline{e}}\) for (a) to (d).
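The error-rate computation above can be scaffolded as follows; this is a simplified sketch in which `classify` is a placeholder for any trained classifier returning a label 1 or 2, and a fresh test point is drawn from each class in every repetition:

```python
import numpy as np

def error_rates(classify, draw_pi1, draw_pi2, reps=2000, seed=0):
    """Monte Carlo error rates e(1), e(2) and their average e-bar.

    classify           : maps a test point x0 to a label 1 or 2
    draw_pi1, draw_pi2 : draw one test point from Pi_1 / Pi_2
    """
    rng = np.random.default_rng(seed)
    p1 = p2 = 0
    for _ in range(reps):
        # P_ir = 1 when x0 from Pi_i is misclassified, 0 otherwise
        p1 += classify(draw_pi1(rng)) != 1
        p2 += classify(draw_pi2(rng)) != 2
    e1, e2 = p1 / reps, p2 / reps
    return e1, e2, (e1 + e2) / 2
```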
We observed that the DWD and the WDWD performed quite badly for (b) to (d). This is a natural consequence of the bias in the DWD and the WDWD. Note that \(|\delta | \rightarrow \infty \) and \(|\delta _{{\mathrm{W}}}| \rightarrow \infty \) as \(d\rightarrow \infty \) for (b) to (d), where \(W(+1)=n_1/n_2\) in \(\delta _{{\mathrm{W}}}\). Thus, from Corollaries 1 and 2, the DWD and the WDWD suffer the strong inconsistency. On the other hand, the BC-DWD and the OWDWD performed well in all cases. We emphasize that the BC-DWD and the OWDWD hold the consistency property without (C-iii) or (C-iv). See Sects. 3.2 and 4 for the details.
6 Real data analysis
In this section, we analyze gene expression data using the DWD, the BC-DWD, the WDWD, the OWDWD, the (linear) SVM, and the bias-corrected SVM (BC-SVM) by Nakayama et al. (2017). We set \(W(+1)=n_1/n_2\) for the WDWD. Note that the WDWD is equivalent to the DWD when \(n_1=n_2\). We used the colon cancer data with \(2000\ (=d)\) genes in Alon et al. (1999), which consist of \(\Pi _1:\) colon tumor (40 samples) and \(\Pi _2:\) normal colon (22 samples).
We randomly split the data sets from \((\Pi _1,\Pi _2)\) into training data sets of sizes \((n_1,n_2)\) and test data sets of sizes \((40-n_1,22-n_2)\). We considered eight cases: \((n_1,n_2)=\) (5,5), (5,15), (15,5), (15,15), (25,5), (25,15), (35,5), and (35,15). We constructed the DWD, the BC-DWD, the WDWD, the OWDWD, the SVM, and the BC-SVM using the training data sets. We checked accuracy on the test data set for each \(\Pi _i\). We repeated this procedure 100 times and obtained the misclassification rates \({\widehat{e}}(1)_r\) and \({\widehat{e}}(2)_r\), \(r=1,\ldots ,100\). We calculated the average misclassification rates, \({\overline{e}}(1)\ (=\sum _{r=1}^{100}{\widehat{e}}(1)_r/100)\), \({\overline{e}}(2)\ (=\sum _{r=1}^{100}{\widehat{e}}(2)_r/100)\), and \({\overline{e}}\ (=\{{\overline{e}}(1)+{\overline{e}}(2) \}/2)\), for the classifiers in various combinations of \((n_1,n_2)\) in Table 1.
We observed that the BC-DWD and the OWDWD performed well compared with the DWD, the WDWD, and the SVM, especially when \(n_1\) and \(n_2\) are unbalanced. See Sects. 3.2 and 4 for the theoretical reasons. The BC-SVM also performed well even when \(n_1\) and \(n_2\) are unbalanced. This is an acceptable consequence because the BC-SVM has the consistency (8) under (C-i) and (C-ii). See Section 3 in Nakayama et al. (2017) for the details. However, the BC-DWD (or the OWDWD) seems to perform better than the BC-SVM. This is a natural consequence because the DWD takes account of all the data vectors, which are not always used by the SVM. See Fig. 1. A theoretical study of the relation between the BC-SVM and the BC-DWD is left for future work.
7 Proofs
Throughout this section, let \(\varvec{\mu }_{12}=\varvec{\mu }_1-\varvec{\mu }_2\) and \(\varvec{\mu }=(\varvec{\mu }_1+\varvec{\mu }_2)/2\).
7.1 Proof of Lemma 1
Under (C-ii), we have that as \(d\rightarrow \infty \)
for \(i=1,2\). Then, using Chebyshev’s inequality, for any \(\tau >0\), under (C-i) and (C-ii), we have that
From (14), for any \(\tau >0\), under (C-i) and (C-ii), we have that
Here, under (4), we can write that
Then, by noting that \(\alpha _j> 0\) for all j under (4), from (15) and (16), we have that
under (C-i) and (C-ii). This concludes the result.
7.2 Proof of Lemma 2
From (5) and (6), we can claim the first result of Lemma 2. Next, we consider the second result of Lemma 2. By noting that \(\sum _{j=1}^N{\hat{\alpha }}_{j}t_j\varvec{x}_j=\sum _{j=1}^N{\hat{\alpha }}_{j}t_j(\varvec{x}_j-\varvec{\mu })\), from the first result of Lemma 2, (15) and (16), under (C-i) and (C-ii), it holds that as \(d\rightarrow \infty \):
Then, from the first result of Lemma 2, we can claim the second result of Lemma 2. This concludes the results of Lemma 2.
7.3 Proofs of Theorem 1 and Corollary 1
Using (7), the results are obtained straightforwardly.
7.4 Proof of Theorem 2
By combining (7) with (11), we can conclude the result.
7.5 Proofs of Lemma 3, Corollary 2, Theorems 3 and 4
Similarly to (6), the first result of Lemma 3 follows from Lemma 1 and (5). The second result of Lemma 3 can be shown in a way similar to the proof of Lemma 2. The results of Corollary 2 and Theorems 3 and 4 follow by combining (12) with (13).
Change history
06 July 2022
A Correction to this paper has been published: https://doi.org/10.1007/s42081-022-00167-x
References
Alon, U., Barkai, N., Notterman, D. A., Gish, K., Ybarra, S., Mack, D., et al. (1999). Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences of the United States of America, 96, 6745–6750.
Aoshima, M., & Yata, K. (2011). Two-stage procedures for high-dimensional data. Sequential Analysis (Editor’s Special Invited Paper), 30, 356–399.
Aoshima, M., & Yata, K. (2014). A distance-based, misclassification rate adjusted classifier for multiclass, high-dimensional data. Annals of the Institute of Statistical Mathematics, 66, 983–1010.
Aoshima, M., & Yata, K. (2015). Geometric classifier for multiclass, high-dimensional data. Sequential Analysis, 34, 279–294.
Aoshima, M., & Yata, K. (2018). Two-sample tests for high-dimension, strongly spiked eigenvalue models. Statistica Sinica, 28, 43–62.
Aoshima, M., & Yata, K. (2019a). Distance-based classifier by data transformation for high-dimension, strongly spiked eigenvalue models. Annals of the Institute of Statistical Mathematics, 71, 473–503.
Aoshima, M., & Yata, K. (2019b). High-dimensional quadratic classifiers in non-sparse settings. Methodology and Computing in Applied Probability, 21, 663–682.
Chan, Y.-B., & Hall, P. (2009). Scale adjustments for classifiers in high-dimensional, low sample size settings. Biometrika, 96, 469–478.
Hall, P., Marron, J. S., & Neeman, A. (2005). Geometric representation of high dimension, low sample size data. Journal of the Royal Statistical Society, Series B, 67, 427–444.
Hall, P., Pittelkow, Y., & Ghosh, M. (2008). Theoretical measures of relative performance of classifiers for high dimensional data with small sample sizes. Journal of the Royal Statistical Society, Series B, 70, 159–173.
Marron, J. S., Todd, M. J., & Ahn, J. (2007). Distance-weighted discrimination. Journal of the American Statistical Association, 102, 1267–1271.
Nakayama, Y., Yata, K., & Aoshima, M. (2017). Support vector machine and its bias correction in high-dimension, low-sample-size settings. Journal of Statistical Planning and Inference, 191, 88–100.
Nakayama, Y., Yata, K., & Aoshima, M. (2020). Bias-corrected support vector machine with Gaussian kernel in high-dimension, low-sample-size settings. Annals of the Institute of Statistical Mathematics, 72, 1257–1286.
Qiao, X., Zhang, H. H., Liu, Y., Todd, M. J., & Marron, J. S. (2010). Weighted distance weighted discrimination and its asymptotic properties. Journal of the American Statistical Association, 105, 401–414.
Qiao, X., & Zhang, L. (2015). Flexible high-dimensional classification machines and their asymptotic properties. Journal of Machine Learning Research, 16, 1547–1572.
Vapnik, V. N. (2000). The nature of statistical learning theory (2nd ed.). Springer.
Acknowledgements
We would like to thank two anonymous referees for their constructive comments.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
The research of K. Yata was partially supported by Grant-in-Aid for Scientific Research (C), JSPS, under Contract number 18K03409. The research of M. Aoshima was partially supported by Grants-in-Aid for Scientific Research, (A) and (S), and Challenging Research (Exploratory), JSPS, under Contract numbers 20H00576, 18H05290 and 19K22837.
About this article
Cite this article
Egashira, K., Yata, K. & Aoshima, M. Asymptotic properties of distance-weighted discrimination and its bias correction for high-dimension, low-sample-size data. Jpn J Stat Data Sci 4, 821–840 (2021). https://doi.org/10.1007/s42081-021-00135-x