Abstract
In this manuscript we study the asymptotic behavior of the following binary classification methods: Support Vector Machine, Mean Difference, Distance Weighted Discrimination and Maximal Data Piling, when the dimension of the data increases and the sample sizes of the classes are fixed. We consider multivariate data with the asymptotic geometric structure of an n-simplex, such as data from the multivariate standard Gaussian distribution, as the dimension increases and the sample size n is fixed. We derive the asymptotic behavior of the four methods in terms of the angle between the normal vector of the separating hyperplane of the method and the optimal direction for classification, under more general conditions than those of Bolivar-Cime and Cordova-Rodriguez (Commun Stat Theory Methods 47(11):2720–2740, 2018). We also analyze the asymptotic behavior of the misclassification probabilities of the methods. A simulation study illustrates the theoretical results.
References
Ahn, J., Marron, J.S.: The maximal data piling direction for discrimination. Biometrika 97(1), 254–259 (2010)
Ahn, J., Marron, J.S., Muller, K.M., Chi, Y.: The high-dimension, low-sample-size geometric representation holds under mild conditions. Biometrika 94(3), 760–766 (2007)
Bolivar-Cime, A., Cordova-Rodriguez, L.M.: Binary discrimination methods for high dimensional data with a geometric representation. Commun. Stat. Theory Methods 47(11), 2720–2740 (2018)
Bolivar-Cime, A., Marron, J.S.: Comparison of binary discrimination methods for high dimension low sample size data. J. Multivar. Anal. 115, 108–121 (2013)
Hall, P., Marron, J.S., Neeman, A.: Geometric representation of high dimension, low sample size data. J. R. Stat. Soc. B 67(3), 427–444 (2005)
Jung, S., Marron, J.S.: PCA consistency in high dimension, low sample size context. Ann. Stat. 37(6B), 4104–4130 (2009)
Marron, J.S.: Distance-weighted discrimination. WIREs Comput. Stat. 7, 109–114 (2015)
Marron, J.S., Todd, M.J., Ahn, J.: Distance-weighted discrimination. J. Am. Stat. Assoc. 102(480), 1267–1271 (2007)
Qiao, X., Zhang, H.H., Liu, Y., Todd, M.J., Marron, J.S.: Weighted distance weighted discrimination and its asymptotic properties. J. Am. Stat. Assoc. 105(489), 401–414 (2010)
Qiao, X., Zhang, L.: Flexible high-dimensional classification machines and their asymptotic properties. J. Mach. Learn. Res. 16, 1547–1572 (2015)
Schölkopf, B., Smola, A.J.: Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge (2002)
Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer, New York (1995)
Yata, K., Aoshima, M.: Effective PCA for high-dimension, low-sample-size data with noise reduction via geometric representations. J. Multivar. Anal. 105(1), 193–215 (2012)
Appendix
1.1 Proof of Theorem 2.1
Case 1: The vector \(v\) is the normal vector of the MD, SVM or DWD hyperplane. Let \(\widetilde{v}\) be the vector given in Lemma 2.1. Let \(X_i^{\prime}=X_i-\mu_+\) and \(Y_j^{\prime}=Y_j-\mu_-\), for i = 1, 2, …, m and j = 1, 2, …, n. We denote by \(\langle x,y\rangle\) the dot product between the d-dimensional vectors x and y. Note that
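The display (8) itself is lost in this version. Its structure can be illustrated in the mean-difference case \(\widetilde{v}=\overline{X}-\overline{Y}\) (a sketch only; Lemma 2.1 covers the SVM and DWD directions as well): writing \(\overline{X}-\overline{Y}=\overline{X}^{\prime}-\overline{Y}^{\prime}+v_d\) with \(v_d=\mu_+-\mu_-\) and expanding,
\[
\parallel\widetilde{v}\parallel^2=\frac{1}{m^2}\sum_{i=1}^m\parallel X_i^{\prime}\parallel^2+\frac{1}{n^2}\sum_{j=1}^n\parallel Y_j^{\prime}\parallel^2+\parallel v_d\parallel^2+\frac{1}{m^2}\sum_{i\neq j}\langle X_i^{\prime},X_j^{\prime}\rangle+\frac{1}{n^2}\sum_{i\neq j}\langle Y_i^{\prime},Y_j^{\prime}\rangle-\frac{2}{mn}\sum_{i,j}\langle X_i^{\prime},Y_j^{\prime}\rangle+\frac{2}{m}\sum_{i=1}^m\langle X_i^{\prime},v_d\rangle-\frac{2}{n}\sum_{j=1}^n\langle Y_j^{\prime},v_d\rangle,
\]
with three leading terms and five trailing terms, matching the accounting that follows.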
Dividing both sides of (8) by d, we have by Lemma 2.1 and (1)–(3) that the sum of the first three terms of the right side converges in probability to \(\sigma^2/m+\tau^2/n+c^2\) as d →∞. Now we will see that the sum of the last five terms converges in probability to zero as d →∞. By (1) we have that
as d →∞, for i ≠ j. Analogously
Observe that the sum of the last three terms of the right side of (8) is equal to
we have that
Therefore, by Lemma 2.1 and (9)–(11), we have that the sum of the last five terms of the right side of (8) divided by d converges in probability to zero as d →∞. Thus,
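The display lost after "Thus" is presumably the resulting limit
\[
\frac{\parallel\widetilde{v}\parallel^2}{d}\xrightarrow{P}\frac{\sigma^2}{m}+\frac{\tau^2}{n}+c^2\qquad\text{as } d\to\infty,
\]
whose square root is the denominator of the limiting cosine obtained at the end of Case 1.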
From the results of [5], under the asymptotic geometric structure of the data, if \(Y^*_1,Y^*_2,\dots , Y^*_k\) are independent and identically distributed d-dimensional random vectors with the same distribution as the vectors of the class C − and \(\overline {Y}^*_k=\sum _{j=1}^kY^*_j/k\), we have
Since \(\overline {Y}^*_k\) converges in probability to μ − as k →∞, we have that
as d →∞. Thus, by the Pythagorean theorem, after rescaling by \(d^{-1/2}\), the segments \(X_i\mu_-\), \(X_i\mu_+\) and \(\mu_+\mu_-\) tend to form a right triangle as d →∞, with hypotenuse \(X_i\mu_-\). Therefore, \(X^{\prime}_i/d^{1/2}=(X_i-\mu_+)/d^{1/2}\) and \(v_d/d^{1/2}=(\mu_+-\mu_-)/d^{1/2}\) tend to be orthogonal as d →∞, and hence
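In symbols, the displays lost around this step can be reconstructed, reading (1) and (3) as the norm conditions \(\parallel X_i^{\prime}\parallel^2/d\xrightarrow{P}\sigma^2\) and \(\parallel v_d\parallel^2/d\to c^2\): letting \(k\to\infty\) above gives
\[
\frac{\parallel X_i-\mu_-\parallel^2}{d}\xrightarrow{P}\sigma^2+c^2\qquad\text{as } d\to\infty,
\]
and since \(\parallel X_i-\mu_-\parallel^2=\parallel X_i^{\prime}\parallel^2+2\langle X_i^{\prime},v_d\rangle+\parallel v_d\parallel^2\), it follows that \(\langle X_i^{\prime},v_d\rangle/d\xrightarrow{P}0\), which is the orthogonality just described.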
We also have that
Analogously,
Note that
Therefore, dividing both sides of (16) by \(d^{1/2}\parallel v_d\parallel\), from Lemma 2.1, (3), (14) and (15) we have
as d →∞. Then
as d →∞.
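Assembling the pieces of Case 1 (a sketch of how the lost final display reads):
\[
\cos\mathrm{Angle}(\widetilde{v},v_d)=\frac{\langle\widetilde{v},v_d\rangle}{\parallel\widetilde{v}\parallel\,\parallel v_d\parallel}\xrightarrow{P}\frac{c^2}{\left(\sigma^2/m+\tau^2/n+c^2\right)^{1/2}c}=\frac{c}{\left(\sigma^2/m+\tau^2/n+c^2\right)^{1/2}},
\]
so \(\mathrm{Angle}(\widetilde{v},v_d)\) converges in probability to \(\arccos\left[c/(\sigma^2/m+\tau^2/n+c^2)^{1/2}\right]\), the limit quoted at the end of Case 2.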
Case 2: The vector \(v\) is the normal vector of the MDP hyperplane. Let \(X^{\prime}_i\) and \(Y^{\prime}_j\) be as in Case 1, for i = 1, 2, …, m and j = 1, 2, …, n. From (1) and the results of [5] we have
as d →∞. By (11), (13) and (15) we have
Note that
Dividing both sides of the last equality by \(d^{1/2}\parallel v_d\parallel\), from (1), (9), (13), (18) and (20) we have
as d →∞. Furthermore, by (3) and (12) we have
Thus, by (19), (21) and (22) it follows that
as d →∞, for all i. Analogously
As shown in the proof of Theorem 3.1 of [3], (23) and (24) imply that when d is large, the normal vector v of the MDP hyperplane is approximately in the same direction as \((\overline{X}-\overline{Y})/\parallel\overline{X}-\overline{Y}\parallel\). Hence, \(\mathrm{Angle}(v,v_d)=\arccos(\langle v,v_d\rangle/(\parallel v\parallel\parallel v_d\parallel))\) is approximately
when d is large, which by case 1 converges in probability to \(\arccos \left [\frac {c}{(\sigma ^2/m+\tau ^2/n+c^2)^{1/2}}\right ]\) as d →∞.
1.2 The Data in the Simulations Satisfy Conditions (1)–(4)
We have that \(X_i\) is equal in distribution to \(rZ_i+\beta\mathbf{1}_d\), for i = 1, 2, …, m, and \(Y_j\) is equal in distribution to \(Z_{m+j}\), for j = 1, 2, …, n, where \(Z_1,Z_2,\dots,Z_{m+n}\) are independent and identically distributed with the same distribution as the random vector Z given at the beginning of Sect. 3. Therefore, by (7)
as d →∞. Thus conditions (1) and (2) hold with \(\sigma =\sqrt {2}r\) and \(\tau =\sqrt {2}\). We also have
therefore condition (3) holds with c = β.
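Explicitly, assuming Z has mean zero so that \(\mu_+=\beta\mathbf{1}_d\) and \(\mu_-=0\) (an assumption consistent with the constants above):
\[
\frac{\parallel v_d\parallel^2}{d}=\frac{\parallel\mu_+-\mu_-\parallel^2}{d}=\frac{\beta^2 d}{d}=\beta^2=c^2.
\]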
Now we will see that condition (4) holds. Observe that
for all i, j. From the properties of Z given in [3], we have that
Therefore, by (7) and (26)–(28) we have
Then condition (4) holds.
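The following is a minimal sketch, not the paper's actual simulation code, that checks conditions (1)–(3) numerically for this construction. It assumes, for illustration only, that Z has i.i.d. N(0, 2) entries, so that \(\parallel Z\parallel^2/d\to 2\); the Z of Sect. 3 may have a different distribution.

```python
import numpy as np

# Minimal sketch (not the paper's code): empirically check conditions (1)-(3)
# for the simulated classes X_i = r Z_i + beta 1_d (class C+) and Y_j = Z_{m+j}
# (class C-). The entries of Z are ASSUMED i.i.d. N(0, 2) here.
rng = np.random.default_rng(0)
d, m, n = 50_000, 5, 5        # large dimension, small fixed sample sizes
r, beta = 1.5, 0.8            # then sigma^2 = 2 r^2, tau^2 = 2, c = beta

Z = rng.normal(0.0, np.sqrt(2.0), size=(m + n, d))
X = r * Z[:m] + beta          # adding the scalar beta implements beta * 1_d
Y = Z[m:]

mu_plus = beta * np.ones(d)   # population means under the assumed Z
mu_minus = np.zeros(d)

# Condition (1): ||X_i - mu_+||^2 / d  ->  sigma^2 = 2 r^2
print(np.sum((X - mu_plus) ** 2, axis=1) / d, "vs", 2 * r**2)
# Condition (2): ||Y_j - mu_-||^2 / d  ->  tau^2 = 2
print(np.sum((Y - mu_minus) ** 2, axis=1) / d, "vs", 2.0)
# Condition (3): ||mu_+ - mu_-||^2 / d  =  c^2 = beta^2
print(np.sum((mu_plus - mu_minus) ** 2) / d, "vs", beta**2)
```

For d of this size each printed ratio should sit within a few percent of its limit, since the coordinatewise averages concentrate at rate \(d^{-1/2}\).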
Copyright information
© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Bolivar-Cime, A. (2021). More About Asymptotic Properties of Some Binary Classification Methods for High Dimensional Data. In: Hernández‐Hernández, D., Leonardi, F., Mena, R.H., Pardo Millán, J.C. (eds) Advances in Probability and Mathematical Statistics. Progress in Probability, vol 79. Birkhäuser, Cham. https://doi.org/10.1007/978-3-030-85325-9_3
DOI: https://doi.org/10.1007/978-3-030-85325-9_3
Publisher Name: Birkhäuser, Cham
Print ISBN: 978-3-030-85324-2
Online ISBN: 978-3-030-85325-9