Asymptotic properties of distance-weighted discrimination and its bias correction for high-dimension, low-sample-size data

While distance-weighted discrimination (DWD) was proposed to improve the support vector machine in high-dimensional settings, it is known that the DWD is quite sensitive to imbalance of the sample sizes. In this paper, we study asymptotic properties of the DWD in high-dimension, low-sample-size (HDLSS) settings. We show that the DWD includes a huge bias caused by heterogeneity of the covariance matrices as well as sample imbalance. We propose a bias-corrected DWD (BC-DWD) and show that the BC-DWD enjoys consistency properties for the misclassification rates. We also consider the weighted DWD (WDWD) and propose an optimal choice of the weights in the WDWD. Finally, we discuss the performance of the BC-DWD and the WDWD with the optimal weights in numerical simulations and actual data analyses.


Introduction
Along with the development of technology, we often encounter high-dimension, low-sample-size (HDLSS) data. In this paper, we consider two-class linear discriminant analysis for HDLSS data. Suppose we have two independent, d-variate populations, Π_i, i = 1, 2, where Π_i has an unknown mean vector μ_i and an unknown covariance matrix Σ_i for each i = 1, 2. We have independent and identically distributed (i.i.d.) observations, x_{i1}, ..., x_{in_i}, from each Π_i. We assume n_i ≥ 2, i = 1, 2. We also assume that lim sup_{d→∞} ‖μ_i‖²/d < ∞ for i = 1, 2, where ‖·‖ denotes the Euclidean norm. Let N = n_1 + n_2. We simply write (x_1, ..., x_N) = (x_{11}, ..., x_{1n_1}, x_{21}, ..., x_{2n_2}). We denote the class label of x_j by t_j, where t_j = −1 for j = 1, ..., n_1 and t_j = +1 for j = n_1 + 1, ..., N. Let x_0 be an observation vector of an individual belonging to one of the Π_i's. We assume that x_0 and the x_{ij}'s are independent.
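For readers who prefer code, the pooled-sample notation above can be written down directly. The following minimal numpy sketch (the array names and the toy distributions are ours, purely for illustration) builds the pooled sample (x_1, ..., x_N) and the label vector t used throughout:

```python
import numpy as np

# Hypothetical two-class HDLSS sample: d is large, n1 and n2 are small.
rng = np.random.default_rng(0)
d, n1, n2 = 512, 10, 5
X1 = rng.standard_normal((n1, d))        # rows are x_{11}, ..., x_{1 n1} from Pi_1
X2 = rng.standard_normal((n2, d)) + 1.0  # rows are x_{21}, ..., x_{2 n2} from Pi_2

# Pooled sample (x_1, ..., x_N) and class labels:
# t_j = -1 for j = 1, ..., n1 and t_j = +1 for j = n1 + 1, ..., N.
X = np.vstack([X1, X2])                  # shape (N, d), N = n1 + n2
t = np.concatenate([-np.ones(n1), np.ones(n2)])
```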
In the HDLSS context, Hall et al. (2008), Chan and Hall (2009), and Aoshima and Yata (2014) considered distance-based classifiers. Aoshima and Yata (2019a) considered a distance-based classifier based on a data transformation technique. Aoshima and Yata (2011, 2015) considered geometric classifiers based on a geometric representation of HDLSS data. Aoshima and Yata (2019b) considered quadratic classifiers in general and discussed optimality of the classifiers under high-dimension, non-sparse settings. In the field of machine learning, there are many studies of classification (supervised learning). A typical method is the support vector machine (SVM) developed by Vapnik (2000). Hall et al. (2005), Chan and Hall (2009), and Nakayama et al. (2017, 2020) investigated asymptotic properties of the SVM in the HDLSS context. Nakayama et al. (2017, 2020) pointed out the strong inconsistency of the SVM when the n_i's are imbalanced. They proposed bias-corrected SVMs and showed their superiority to the SVM. On the other hand, Marron et al. (2007) pointed out that the SVM causes data piling in the HDLSS context. Data piling is the phenomenon in which the projections of the training data onto the normal direction vector of the separating hyperplane coincide within each class. See Fig. 1 in Sect. 2. To avoid the data piling problem of the SVM, Marron et al. (2007) proposed distance-weighted discrimination (DWD). Whereas the SVM finds the optimal hyperplane by maximizing the minimum distance from each class to the hyperplane, the DWD finds a proper hyperplane by minimizing the sum of the reciprocals of the distances from the data points to the hyperplane. Thus the DWD makes use of all the data vectors, whereas the SVM does not always use all of them. Unfortunately, the DWD is designed for balanced training data sets. See Qiao et al. (2010) and Qiao and Zhang (2015). For imbalanced training data sets, Qiao et al. (2010) developed the weighted DWD (WDWD), which imposes different weights on the two classes. However, the WDWD is sensitive to the choice of weights.
In this paper, we investigate the DWD and the WDWD theoretically in the HDLSS context where d → ∞ while N is fixed. In Sect. 2, we review the DWD. In Sect. 3, we give asymptotic properties of the DWD. We show that the DWD includes a huge bias caused by heterogeneity of the covariance matrices as well as sample imbalance. We propose a bias-corrected DWD (BC-DWD) and show that the BC-DWD enjoys consistency properties for the misclassification rates. In Sect. 4, we give asymptotic properties of the WDWD and propose an optimal choice of the weights in the WDWD. Finally, in Sects. 5 and 6, we discuss the performance of the BC-DWD and the WDWD with the optimal weights in numerical simulations and actual data analyses.

Formulation of the DWD
In this section, we give a formulation of the DWD along the line of Marron et al. (2007).
Let w ∈ R^d be a normal vector and b ∈ R an intercept term. Let r_j = t_j(w^T x_j + b) for j = 1, ..., N. When the training data sets are linearly separable, the DWD is defined by minimizing the sum of 1/r_j over all observations. Note that HDLSS data are linearly separable by a hyperplane. Thus, the optimization problem of the DWD is as follows:

  min_{w,b} Σ_{j=1}^N 1/r_j subject to r_j = t_j(w^T x_j + b) > 0 (j = 1, ..., N) and ‖w‖ ≤ 1. (1)

The dual problem of the above optimization problem can be written as

  max_{α,λ} { 2 Σ_{j=1}^N α_j^{1/2} − (4λ)^{−1} ‖Σ_{j=1}^N α_j t_j x_j‖² − λ } (2)

subject to α_j > 0, j = 1, ..., N, Σ_{j=1}^N α_j t_j = 0 and λ > 0, where α = (α_1, ..., α_N)^T and λ and the α_j's are Lagrange multipliers for the constraints in (1); at the optimum, r = (r_1, ..., r_N)^T satisfies r_j = α_j^{−1/2}. Then, by noting that

  (4λ)^{−1} ‖Σ_{j=1}^N α_j t_j x_j‖² + λ ≥ ‖Σ_{j=1}^N α_j t_j x_j‖, with equality when λ = ‖Σ_{j=1}^N α_j t_j x_j‖/2,

we can rewrite the optimization problem (2) as follows:

  max_{α} { 2 Σ_{j=1}^N α_j^{1/2} − ‖Σ_{j=1}^N α_j t_j x_j‖ } subject to α_j > 0 (j = 1, ..., N) and Σ_{j=1}^N α_j t_j = 0. (3)

Let (α̂_1, ..., α̂_N) be the solution of (3). Then, from (1) and (3), we write that

  ŵ = Σ_{j=1}^N α̂_j t_j x_j / ‖Σ_{j=1}^N α̂_j t_j x_j‖.

The intercept term b is given by b = t_j α̂_j^{−1/2} − ŵ^T x_j for each j. Thus, we consider estimating b by the average:

  b̂ = N^{−1} Σ_{j=1}^N (t_j α̂_j^{−1/2} − ŵ^T x_j).

Then, the classifier function of the DWD is defined by

  y(x) = ŵ^T x + b̂.

One classifies x_0 into Π_1 if y(x_0) < 0 and into Π_2 otherwise. Now, let us use the following toy example to see data piling. We set n_1 = n_2 = 25 and d = 2^s, s = 5, ..., 8. Independent pseudo-random observations were generated from Π_i : N_d(μ_i, Σ_i). We set μ_1 = 0, μ_2 = (1, ..., 1, 0, ..., 0)^T whose first ⌈d^{2/3}⌉ elements are 1, and Σ_1 = Σ_2 = I_d, where ⌈x⌉ denotes the smallest integer ≥ x and I_d denotes the d-dimensional identity matrix. Let y_SVM(·) be the classifier function of the (linear) SVM. In Fig. 1, we give histograms of the y_SVM(x_j)'s and the normalized y(x_j)'s, respectively.
We observed that the training data points for the SVM concentrate at −1 when x_j ∈ Π_1 and at 1 when x_j ∈ Π_2 as d increases. This phenomenon is data piling. See Nakayama et al. (2017) for the theoretical reason. On the other hand, the training data points for the DWD did not exhibit this phenomenon. We emphasize that the DWD makes use of all the data vectors, whereas the SVM does not always use all of them. However, in the next section, we show that the DWD includes a huge bias caused by heterogeneity of the covariance matrices as well as sample imbalance.
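For readers who want to reproduce such experiments, the optimization problem (1) can be handed directly to an off-the-shelf convex solver. The following is a minimal sketch using the cvxpy package; it is our illustration, not the implementation of Marron et al. (2007), and for simplicity it returns the solver's intercept instead of the averaged estimate b̂ above.

```python
import cvxpy as cp
import numpy as np

def dwd_fit(X, t):
    """Hard-margin DWD, problem (1):
    minimize sum_j 1/r_j with r_j = t_j (w^T x_j + b) > 0 and ||w|| <= 1.
    Assumes linear separability, which holds for HDLSS data."""
    N, d = X.shape
    w = cp.Variable(d)
    b = cp.Variable()
    r = cp.multiply(t, X @ w + b)                   # r_j = t_j (w^T x_j + b)
    objective = cp.Minimize(cp.sum(cp.inv_pos(r)))  # inv_pos enforces r_j > 0
    prob = cp.Problem(objective, [cp.norm(w, 2) <= 1])
    prob.solve()
    return np.asarray(w.value), float(b.value)

def dwd_classify(x0, w, b):
    """Classify x0 into Pi_1 (label 1) if y(x0) < 0 and into Pi_2 otherwise."""
    return 1 if float(x0 @ w + b) < 0 else 2
```

Applied to the toy example above (with X and t built as in the snippet of Sect. 1), histograms of the fitted scores y(x_j) over the training data can be used to reproduce the absence of data piling seen in Fig. 1.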

Lemma 2 Under (C-i) and (C-ii), it holds that as d → ∞
The quantity δ vanishes if n_1 = n_2 and Σ_1 = Σ_2. We consider the following assumption:

(C-iii) lim sup_{d→∞} |δ| < 1/2.

Let e(i) denote the error rate of misclassifying an individual from Π_i into the other class for i = 1, 2. Then, we have the following results.
Theorem 1 Under (C-i), (C-ii) and (C-iii), the DWD holds:

  e(i) → 0 as d → ∞ for i = 1, 2. (8)

However, without (C-iii), we have the following results.
Corollary 1 Under (C-i) and (C-ii), the DWD holds:

Remark 2 For the DWD, Hall et al. (2005) and Qiao et al. (2010) showed the consistency property in Theorem 1 and the inconsistency properties in Corollary 1 under different conditions. However, we claim that (C-i), (C-ii) and (C-iii) are milder than their conditions. From Corollary 1, the DWD suffers from the strong inconsistency because of the huge bias caused by the heterogeneity of the covariance matrices as well as sample imbalance. For example, if tr(Σ_i)/‖μ_1 − μ_2‖² → ∞ as d → ∞ for some i, |δ| tends to become large as d increases when tr(Σ_1) ≠ tr(Σ_2) or n_1 ≠ n_2. To overcome such difficulties, we propose a bias-corrected DWD.

Theorem 2 For the BC-DWD, (8) holds under (C-i) and (C-ii).
We emphasize that the BC-DWD enjoys the asymptotic consistency without assuming (C-iii). See Sect. 5 for numerical comparisons.
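As a rough illustration of the idea of bias correction, and emphatically not the paper's exact BC-DWD, the following sketch shifts the DWD score by a plug-in estimate built from the quantities that Remark 2 identifies as driving δ, namely the per-class variance terms tr(Σ_i)/n_i and the mean separation. Both the functional form and the normalization of delta_hat below are our assumptions.

```python
import numpy as np

def delta_hat_illustrative(X1, X2):
    """Plug-in bias estimate (our illustrative guess, NOT the paper's delta):
    combines tr(S_i)/n_i and the mean separation; the normalization putting
    the correction on the score scale is also part of the guess."""
    n1, n2 = len(X1), len(X2)
    tr1 = np.trace(np.cov(X1, rowvar=False))   # tr(S_1)
    tr2 = np.trace(np.cov(X2, rowvar=False))   # tr(S_2)
    gap = X2.mean(axis=0) - X1.mean(axis=0)    # estimates mu_2 - mu_1
    return (tr1 / n1 - tr2 / n2) / (2.0 * np.linalg.norm(gap))

def bc_dwd_classify(x0, w, b, delta):
    # Shift the DWD score by the estimated bias before taking its sign.
    return 1 if float(x0 @ w + b) - delta < 0 else 2
```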

WDWD and its asymptotic properties
Qiao et al. (2010) developed the WDWD to overcome the weakness of the DWD against sample imbalance. The optimization problem of the WDWD is as follows:

  min_{w,b} Σ_{j=1}^N W(t_j)/r_j subject to r_j = t_j(w^T x_j + b) > 0 (j = 1, ..., N) and ‖w‖ ≤ 1,

with weights W(−1) (> 0) and W(+1) (> 0). In this paper, we assume W(−1) = 1 without loss of generality. We also assume that lim sup_{d→∞} W(+1) < ∞. Then, similar to the DWD, the dual optimization problem is written as follows:

  max_{α} { 2 Σ_{j=1}^N {W(t_j) α_j}^{1/2} − ‖Σ_{j=1}^N α_j t_j x_j‖ } subject to α_j > 0 (j = 1, ..., N) and Σ_{j=1}^N α_j t_j = 0.

Let us write that, with (α̂_1, ..., α̂_N) the solution of the dual problem,

  ŵ_W = Σ_{j=1}^N α̂_j t_j x_j / ‖Σ_{j=1}^N α̂_j t_j x_j‖ and b̂_W = N^{−1} Σ_{j=1}^N {t_j (W(t_j)/α̂_j)^{1/2} − ŵ_W^T x_j}.

Similar to the DWD, we obtain the classifier function of the WDWD:

  y_W(x) = ŵ_W^T x + b̂_W.

Then, one classifies x_0 into Π_1 if y_W(x_0) < 0 and into Π_2 otherwise. As with the DWD, we have the following result.

Lemma 3 Under (C-i) and (C-ii), it holds that as d → ∞
Furthermore, it holds that as d → ∞ We consider the following assumption:

(C-iv) lim sup_{d→∞} |δ_W| < 1/2.

Then, we have the following results. For the WDWD, Qiao et al. (2010) recommended using W(+1) = n_1/n_2 in the case of equal costs. See Table 3 in Qiao et al. (2010). However, if tr(Σ_i)/‖μ_1 − μ_2‖² → ∞ as d → ∞ for some i, |δ_W| with W(+1) = n_1/n_2 tends to become large as d increases when tr(Σ_1) ≠ tr(Σ_2) or n_1 ≠ n_2. Thus, from Corollary 2, the WDWD still suffers from the strong inconsistency because of the huge bias caused by the heterogeneity of the covariance matrices as well as sample imbalance.
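In code, the weighted objective changes a single line of the DWD sketch in Sect. 2. The following cvxpy variant (again our sketch, not the authors' implementation) exposes W(+1) as a parameter:

```python
import cvxpy as cp
import numpy as np

def wdwd_fit(X, t, w_plus, w_minus=1.0):
    """Weighted DWD: minimize sum_j W(t_j)/r_j with r_j = t_j (w^T x_j + b) > 0
    and ||w|| <= 1, where W(-1) = w_minus (taken as 1 in the paper) and
    W(+1) = w_plus (e.g. n1/n2, the choice recommended by Qiao et al. (2010))."""
    N, d = X.shape
    weights = np.where(t > 0, w_plus, w_minus)   # W(t_j) for each observation
    w = cp.Variable(d)
    b = cp.Variable()
    r = cp.multiply(t, X @ w + b)
    objective = cp.Minimize(cp.sum(cp.multiply(weights, cp.inv_pos(r))))
    prob = cp.Problem(objective, [cp.norm(w, 2) <= 1])
    prob.solve()
    return np.asarray(w.value), float(b.value)
```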
To overcome such difficulties, we propose an optimal choice of W(+1) in the WDWD. Let W_0 denote this choice, defined in (9). We claim that the bias term δ_W vanishes asymptotically at W(+1) = W_0; see (10). Thus, we consider the estimator Ŵ_0 of W_0 given in (12). Then, under (C-i) and (C-ii), from (9) and (10), it holds that Ŵ_0 is consistent in the sense of (13) as d → ∞. Then, from Lemma 3 and (13), we have the following result, which shows that the WDWD with Ŵ_0 (hereafter called the OWDWD) enjoys the consistency (8).
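Purely for illustration, one heuristic in the same spirit chooses W(+1) so that the two per-class variance terms tr(S_i)/n_i highlighted in Remark 2 balance each other. The following is our guess at such a plug-in rule, not the paper's estimator (12):

```python
import numpy as np

def w0_hat_heuristic(X1, X2):
    """Hypothetical plug-in weight (our heuristic, NOT the paper's (12)):
    choose W(+1) so that W(+1) tr(S_2)/n_2 = tr(S_1)/n_1, i.e. so that the
    two per-class variance terms driving delta_W cancel heuristically."""
    n1, n2 = len(X1), len(X2)
    tr1 = np.trace(np.cov(X1, rowvar=False))   # tr(S_1)
    tr2 = np.trace(np.cov(X2, rowvar=False))   # tr(S_2)
    return (n2 * tr1) / (n1 * tr2)
```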

Comparison in high-dimensional setting
We used computer simulations to compare the performance of the classifiers: the DWD, the BC-DWD, the WDWD, and the OWDWD. We set W (+1) = n 1 /n 2 for the WDWD. Note that the WDWD is equivalent to the DWD when n 1 = n 2 .
As for Π_i (i = 1, 2), we considered the following three cases: (i) x_{ij}, j = 1, ..., n_i, are i.i.d. as N_d(μ_i, Σ_i); (ii) x_{ij} = Σ_i^{1/2} z_{ij} + μ_i, j = 1, ..., n_i, where the elements of z_{ij} are i.i.d. as the standardized chi-squared distribution with 5 degrees of freedom; and (iii) x_{ij} − μ_i, j = 1, ..., n_i, are i.i.d. as the d-variate t-distribution, t_d(0, Σ_i, 10), i = 1, 2, with mean zero, covariance matrix Σ_i and 10 degrees of freedom.
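The following numpy sketch generates data for the three cases under one concrete choice of (d, n_1, n_2, μ_i, Σ_i) borrowed from the toy example of Sect. 2; the exact parameter settings of the simulation study, and the precise standardization used in case (ii), are assumptions on our part.

```python
import numpy as np

rng = np.random.default_rng(1)

def rmvt(n, mu, Sigma, df, rng):
    """d-variate t with location mu, covariance Sigma and df degrees of
    freedom, via the normal / chi-squared mixture representation."""
    L = np.linalg.cholesky(Sigma * (df - 2) / df)  # rescale so Cov = Sigma
    z = rng.standard_normal((n, len(mu))) @ L.T
    g = rng.chisquare(df, size=(n, 1)) / df
    return mu + z / np.sqrt(g)

def rchisq5(n, mu, Sigma, rng):
    """Case (ii) under our assumed standardization: components of z are
    i.i.d. standardized chi-squared(5), then colored by Sigma^{1/2}."""
    L = np.linalg.cholesky(Sigma)
    z = (rng.chisquare(5, size=(n, len(mu))) - 5.0) / np.sqrt(10.0)  # mean 0, var 1
    return mu + z @ L.T

# One concrete setting (our illustrative choice of parameters).
d, n1, n2 = 256, 20, 10
mu1 = np.zeros(d)
mu2 = np.zeros(d)
mu2[: int(np.ceil(d ** (2 / 3)))] = 1.0      # first ceil(d^{2/3}) elements are 1
Sigma1, Sigma2 = np.eye(d), 1.5 * np.eye(d)  # tr(Sigma_1) != tr(Sigma_2)

X1 = rng.multivariate_normal(mu1, Sigma1, size=n1)  # case (i) for Pi_1
X2 = rng.multivariate_normal(mu2, Sigma2, size=n2)  # case (i) for Pi_2
X1_t = rmvt(n1, mu1, Sigma1, df=10, rng=rng)        # case (iii)
X2_c = rchisq5(n2, mu2, Sigma2, rng=rng)            # case (ii)
```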
We observed that the DWD and the WDWD gave quite bad performances for (b) to (d). This can be regarded as a natural consequence of the bias in the DWD and the WDWD. Note that |δ| → ∞ and |δ_W| → ∞ as d → ∞ for (b) to (d), where W(+1) = n_1/n_2 in δ_W. Thus, from Corollaries 1 and 2, the DWD and the WDWD suffer from the strong inconsistency. On the other hand, the BC-DWD and the OWDWD gave preferable performances for all the cases. We emphasize that the BC-DWD and the OWDWD hold the consistency property without (C-iii) or (C-iv). See Sects. 3.2 and 4 for the details.

Fig. 2 The error rates of the DWD, the BC-DWD, the WDWD, and the OWDWD for (a) n_1 = n_2 and tr(Σ_1) = tr(Σ_2) for (i); (b) n_1 ≠ n_2 and tr(Σ_1) ≠ tr(Σ_2) for (i); (c) n_1 ≠ n_2 and tr(Σ_1) ≠ tr(Σ_2) for (ii); and (d) n_1 ≠ n_2 and tr(Σ_1) ≠ tr(Σ_2) for (iii)

Real data analysis
In this section, we analyze gene expression data using the DWD, the BC-DWD, the WDWD, the OWDWD, the (linear) SVM, and the bias-corrected SVM (BC-SVM) of Nakayama et al. (2017). We set W(+1) = n_1/n_2 for the WDWD. Note that the WDWD is equivalent to the DWD when n_1 = n_2. We used the colon cancer data with 2000 (= d) genes given in Alon et al. (1999), which consist of Π_1: colon tumor (40 samples) and Π_2: normal colon (22 samples).
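As one way to compare the classifiers on such data, class-wise error rates can be estimated by leave-one-out cross-validation. The sketch below is our evaluation protocol, not necessarily the one used in this section, and load_colon() is a placeholder for whatever loader provides the Alon et al. (1999) matrix.

```python
import numpy as np

def loo_error_rates(X, t, fit, classify):
    """Leave-one-out estimates of e(1) and e(2): the rates of misclassifying
    held-out observations from Pi_1 (t_j = -1) and Pi_2 (t_j = +1)."""
    idx = np.arange(len(t))
    wrong = {1: 0, 2: 0}
    total = {1: int(np.sum(t < 0)), 2: int(np.sum(t > 0))}
    for j in idx:
        w, b = fit(X[idx != j], t[idx != j])   # refit without observation j
        true_class = 1 if t[j] < 0 else 2
        if classify(X[j], w, b) != true_class:
            wrong[true_class] += 1
    return wrong[1] / total[1], wrong[2] / total[2]

# Usage (load_colon is hypothetical; X is 62 x 2000, t in {-1, +1}):
# X, t = load_colon()
# e1, e2 = loo_error_rates(X, t, dwd_fit, dwd_classify)
```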
We observed that the BC-DWD and the OWDWD give adequate performances compared to the DWD, the WDWD, and the SVM, especially when n_1 and n_2 are imbalanced. See Sects. 3.2 and 4 for the theoretical reasons. The BC-SVM also gave adequate performances even when n_1 and n_2 are imbalanced. This can be regarded as an acceptable consequence because the BC-SVM has the consistency (8) under (C-i) and (C-ii). See Section 3 in Nakayama et al. (2017) for the details. However, the BC-DWD (or the OWDWD) seems to give better performances than the BC-SVM. This can be regarded as a natural consequence because the DWD makes use of all the data vectors, whereas the SVM does not. See Fig. 1. A theoretical study of the relationship between the BC-SVM and the BC-DWD is left for future work.

Proof of Lemma 2
From (5) and (6), we can claim the first result of Lemma 2. Next, we consider the second result of Lemma 2. Note that Σ_{j=1}^N α̂_j t_j x_j = Σ_{j=1}^N α̂_j t_j (x_j − μ). Then, from the first result of Lemma 2, (15) and (16), under (C-i) and (C-ii), we can claim the second result of Lemma 2 as d → ∞. This concludes the proof of Lemma 2.
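The recentering identity used above follows from the dual constraint Σ_{j=1}^N α̂_j t_j = 0 (stationarity of the Lagrangian in b, as in (2)): for any fixed vector μ,

```latex
\sum_{j=1}^{N} \hat{\alpha}_j t_j x_j
  = \sum_{j=1}^{N} \hat{\alpha}_j t_j (x_j - \mu)
    + \Big( \sum_{j=1}^{N} \hat{\alpha}_j t_j \Big) \mu
  = \sum_{j=1}^{N} \hat{\alpha}_j t_j (x_j - \mu).
```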

Proofs of Theorem 1 and Corollary 1
Using (7), the results are obtained straightforwardly.

Proof of Theorem 2
By combining (7) with (11), we can conclude the result.

Proofs of Lemma 3, Corollary 2, Theorems 3 and 4
Similar to (6), from Lemma 1 and (5), we obtain the first result of Lemma 3. The second result of Lemma 3 follows in a way similar to the proof of Lemma 2. The results of Corollary 2 and Theorems 3 and 4 follow by combining (12) with (13).