
Bias-corrected support vector machine with Gaussian kernel in high-dimension, low-sample-size settings

Abstract

In this paper, we study asymptotic properties of nonlinear support vector machines (SVM) in high-dimension, low-sample-size settings. We propose a bias-corrected SVM (BC-SVM) which is robust against imbalanced data in a general framework. In particular, we investigate asymptotic properties of the BC-SVM having the Gaussian kernel and compare them with the ones having the linear kernel. We show that the performance of the BC-SVM is influenced by the scale parameter involved in the Gaussian kernel. We discuss a choice of the scale parameter yielding a high performance and examine the validity of the choice by numerical simulations and actual data analyses.





Author information


Corresponding author

Correspondence to Makoto Aoshima.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

We are very grateful to the associate editor and the reviewer for their constructive comments. The research of the second author was partially supported by Grant-in-Aid for Scientific Research (C), Japan Society for the Promotion of Science (JSPS), under Contract Number 18K03409. The research of the third author was partially supported by Grants-in-Aid for Scientific Research (A) and Challenging Research (Exploratory), JSPS, under Contract Numbers 15H01678 and 17K19956.

Appendices

Appendix A: soft-margin SVM

In Sects. 2–5, we discussed asymptotic properties and the performance of the hard-margin SVM (hmSVM). In this section, we consider the soft-margin SVM (smSVM). The smSVM is given by \({\hat{y}}({{{\varvec{x}}}})\) after replacing (4) with

$$\begin{aligned} 0\le \alpha _j \le C,\ j=1,\ldots ,N, \ \hbox { and } \ \sum _{j=1}^N\alpha _jt_j=0, \end{aligned}$$
(36)

where \(C(>0)\) is a regularization parameter. Let \(n_{\min }=\min \{n_1, n_2\}\). From (11) in Sect. 2, we can asymptotically claim that \({\hat{\alpha }}_j \le 2/(\varDelta _* n_{\min })\) for all j. Thus, we consider the following condition for C:

$$\begin{aligned} \liminf _{d\rightarrow \infty } \frac{C\varDelta _* n_{\min } }{2}>1 . \end{aligned}$$
(37)

Let \({\hat{y}}_{(S)}({{{\varvec{x}}}}_0)\) and \({\hat{y}}_{BC(S)}({{{\varvec{x}}}}_0)\) denote \({\hat{y}}({{{\varvec{x}}}}_0 )\) and \({\hat{y}}_{BC}({{{\varvec{x}}}}_0)\) after replacing (4) with (36), respectively. Then, we have the following result.

Proposition 7

Assume (A-i), (A-i’) and (8). Under (37), it holds that when \({{{\varvec{x}}}}_0\in \varPi _i\) for \(i=1,2\)

$$\begin{aligned} {\hat{y}}_{(S)}({{{\varvec{x}}}}_0)=\frac{\varDelta }{\varDelta _*}\Big ((-1)^i+\frac{\delta }{\varDelta } +o_P(1)\Big ) \ \hbox { and } \ {\hat{y}}_{BC(S)}({{{\varvec{x}}}}_0)=\frac{\varDelta }{\varDelta _*}\{(-1)^i+o_P(1)\}. \end{aligned}$$

From Proposition 7, the bias-corrected smSVM (BC-smSVM) holds the consistency (6) even when \(|\delta /\varDelta |\rightarrow \infty \). Hence, for smSVMs, we recommend using the BC-smSVM.

For the settings (a)–(c) in Sect. 2.4, we examined the performance of the BC-smSVM and smSVM, together with the hmSVM and the bias-corrected hmSVM (BC-hmSVM), for the kernel function (II). We set \((n_1,n_2)=(20,10)\), \(d=1024\ (=2^{10})\) and \(\gamma =d/4\). We set \(C= 2^{-5+t}/(n_{\min }\varDelta _*), \ t=1,\ldots ,10\), for the smSVMs. Similar to Fig. 3, we calculated \({\overline{e}}\) by 2000 replications and plotted the results in Fig. 10. We observed that the smSVMs perform poorly when \(C<2/(n_{\min }\varDelta _*)\). As expected, the smSVMs are close to the hmSVMs when \(C>2/(n_{\min }\varDelta _*)\).

Fig. 10

The average error rate, \({\overline{e}}\), of the BC-smSVM, smSVM, BC-hmSVM and hmSVM with (II) for (a)–(c) when \(d=1024\) and \(C=2^{-5+t}/(n_{\min }\varDelta _{*}),\ t=1,\ldots ,10\). The average error rates of the BC-smSVM and smSVM are drawn with dashed lines, and those of the BC-hmSVM and hmSVM with solid lines
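As a numerical companion to condition (37), the following sketch computes the threshold \(2/(n_{\min }\varDelta _*)\) that \(C\) should exceed. The mean shift and identity covariances are hypothetical choices, not the paper's settings (a)–(c); the sample sizes and dimension follow the simulation above, and \(\eta _i={\mathrm{tr}}({{\varvec{\varSigma }}}_i)\) for the linear kernel.

```python
import numpy as np

# Dimensions from the simulation in Appendix A; covariances are assumed
# to be identity here (a hypothetical choice, not the paper's settings).
d, n1, n2 = 1024, 20, 10
mu1 = np.zeros(d)
mu1[:10] = 1.0                              # hypothetical sparse mean shift
mu2 = np.zeros(d)

Delta = float(np.sum((mu1 - mu2) ** 2))     # Delta = ||mu1 - mu2||^2 (linear kernel)
eta1 = eta2 = float(d)                      # eta_i = tr(Sigma_i) = d for Sigma_i = I_d
Delta_star = Delta + eta1 / n1 + eta2 / n2  # Delta* = Delta + eta1/n1 + eta2/n2
n_min = min(n1, n2)
C_threshold = 2.0 / (n_min * Delta_star)    # (37) asks liminf C * Delta* * n_min / 2 > 1
print(C_threshold)
```

Any regularization parameter \(C\) well above `C_threshold` satisfies (37) in this setting, matching the behavior seen in Fig. 10.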

Appendix B: Polynomial kernel SVM

In this section, we consider the polynomial kernel SVM; that is, the classifier (5) has the kernel function (III). We give some asymptotic properties of the polynomial kernel SVM. We consider the following conditions for \(\zeta \) and r:

$$\begin{aligned} \zeta /d \in (0,\infty ) \ \hbox { and } \ r \in (0,\infty ) \ \hbox { as } d\rightarrow \infty . \end{aligned}$$
(38)

We set \(\kappa _1=(\zeta + \Vert {{\varvec{\mu }}}_1\Vert ^2)^r\), \(\kappa _2=(\zeta +{\mathrm{tr}}({{\varvec{\varSigma }}}_1)+\Vert {{\varvec{\mu }}}_1\Vert ^2)^r\), \(\kappa _3=(\zeta +\Vert {{\varvec{\mu }}}_2\Vert ^2)^r\), \(\kappa _4=(\zeta +{\mathrm{tr}}({{\varvec{\varSigma }}}_2)+\Vert {{\varvec{\mu }}}_2\Vert ^2)^r\) and \(\kappa _5=(\zeta +{{\varvec{\mu }}}_1^T{{\varvec{\mu }}}_2)^r\). Then, we have the following result.

Proposition 8

Assume (1), (38) and (A-ii). Assume that N is fixed and

$$\begin{aligned} \liminf _{d\rightarrow \infty } \Big | \frac{ \ \Vert {{\varvec{\mu }}}_1 \Vert ^2-\Vert {{\varvec{\mu }}}_2\Vert ^2}{d}\Big | >0. \end{aligned}$$
(39)

Then, the assumptions (A-i) and (A-i’) are met for the polynomial kernel (III). Furthermore, the BC-SVM (17) with the polynomial kernel (III) holds the consistency (6).

See Fig. 5 for the performance of the BC-SVM with the polynomial kernel (III).
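To see conditions (38) and (39) at work, here is a small Monte Carlo sketch (assumed Gaussian data with identity covariance and hypothetical choices of \(\zeta \), \(r\) and the means) showing that the polynomial kernel value between two samples from \(\varPi _1\) concentrates at \(\kappa _1=(\zeta +\Vert {{\varvec{\mu }}}_1\Vert ^2)^r\):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 2000
zeta, r = float(d), 2                  # zeta/d in (0, inf) and fixed r, as in (38)
mu1 = np.full(d, 1.0)                  # ||mu1||^2 = d; with mu2 = 0, (39) holds
x = mu1 + rng.normal(0.0, 1.0, d)      # two independent samples from Pi_1
x_prime = mu1 + rng.normal(0.0, 1.0, d)
kappa1 = (zeta + mu1 @ mu1) ** r       # kappa1 = (zeta + ||mu1||^2)^r
k_val = (zeta + x @ x_prime) ** r      # polynomial kernel (III) between samples
rel_err = abs(k_val - kappa1) / kappa1 # small: k = kappa1 + o_P(d^r)
print(rel_err)
```

The relative error is of order \(d^{-1/2}\) here, consistent with the concentration used in the proof of Proposition 8.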

Remark 5

For the Laplace kernel (IV), it is difficult to provide asymptotic properties of the kernel SVM unless \(\varPi _i\)s are Gaussian. Detailed study of the BC-SVM with the Laplace kernel is left to a future work.

Appendix C: proofs

Proof of Lemma 1

Note that \(L({\varvec{\alpha }})=\sum _{j=1}^N\alpha _j-\acute{{\varvec{\alpha }}}^\mathrm{T}{\varvec{K}}\acute{{\varvec{\alpha }}}/2\). The result is obtained from (7) straightforwardly. \(\square \)

Proofs of Propositions 1 and 2

We assume (A-i) and (A-i’). From Lemma 1, it holds that under (8) and (D)

$$\begin{aligned} \eta _1 \sum _{j=1}^{n_1}{\hat{\alpha }}_j^2/{\hat{\alpha }}_{\star }^2 = \eta _1/n_1+o_P(\varDelta ) \quad \hbox {and} \quad \eta _2 \sum _{j=n_1+1}^{N}{\hat{\alpha }}_j^2/ {\hat{\alpha }}_{\star }^2= \eta _2/n_2+o_P(\varDelta ), \end{aligned}$$
(40)

so that \(L({\hat{{\varvec{\alpha }}}})=2{\hat{\alpha }}_{\star }-\varDelta _* {\hat{\alpha }}_{\star }^2\{1+o_P(\varDelta /\varDelta _*) \}/2\). Then, it holds that

$$\begin{aligned} {\hat{\alpha }}_{\star }=(2/{\varDelta _*})\{1+o_P(\varDelta /\varDelta _*) \}. \end{aligned}$$
(41)

Also, from (40) we have (9) under (8).

Next, we consider the second result of Proposition 1. Let \({\hat{S}}_1=\{j|{\hat{\alpha }}_j\ne 0,\ j=1,\ldots ,n_1\}\), \({\hat{S}}_2=\{j|{\hat{\alpha }}_j\ne 0,\ j=n_1+1,\ldots ,N\}\), \({\hat{n}}_{1}=\# {\hat{S}}_1\) and \({\hat{n}}_{2}=\# {\hat{S}}_2\). Then, we have that when \({{{\varvec{x}}}}_0 \in \varPi _i\) for \(i=1,2\),

$$\begin{aligned}&\sum _{j=1}^N {\hat{\alpha }}_jt_jk({{{\varvec{x}}}}_0,{{{\varvec{x}}}}_j)+ \frac{1}{N_{{\hat{S}}}}\sum _{j\in {\hat{S}}} \Big (t_j- \sum _{j'\in {\hat{S}}} {\hat{\alpha }}_{j'}t_{j'}k({{{\varvec{x}}}}_j,{{{\varvec{x}}}}_{j'})\Big )\nonumber \\&=(-1)^i{\hat{\alpha }}_{\star }(\kappa _{2i-1}-\kappa _5) +\frac{{\hat{n}}_2-{\hat{n}}_1}{N_{{\hat{S}}}} \nonumber \\&\quad -\,{\hat{\alpha }}_{\star } \Big (\frac{ -\kappa _1 {\hat{n}}_1-\eta _1 +\kappa _3 {\hat{n}}_2+\eta _2+({\hat{n}}_1-{\hat{n}}_2)\kappa _5 }{N_{{\hat{S}}}} \Big ) +o_P( \varDelta {\hat{\alpha }}_{\star } ) \nonumber \\&=(-1)^i{\hat{\alpha }}_{\star }(\kappa _{2i-1}-\kappa _5) +\frac{({\hat{n}}_2-{\hat{n}}_1)(1-{\hat{\alpha }}_{\star }\varDelta _*/2 )}{N_{{\hat{S}}}}+\frac{{\hat{\alpha }}_{\star }(\kappa _1-\kappa _3) }{2} \nonumber \\&\quad +\,{\hat{\alpha }}_{\star } \frac{\eta _1/n_1-\eta _2/n_2}{2}+{\hat{\alpha }}_{\star } \frac{\eta _1(1-{\hat{n}}_1/n_1)-\eta _2(1-{\hat{n}}_2/n_2) }{N_{{\hat{S}}}}+o_P( \varDelta {\hat{\alpha }}_{\star }). \end{aligned}$$
(42)

Here, we note that \(\eta _1 \sum _{j=1}^{n_1} {\hat{\alpha }}_j^2/{\hat{\alpha }}_{\star }^2 \ge \eta _1/ {\hat{n}}_1\). Thus, from (40) it holds that

$$\begin{aligned} {\hat{n}}_1 (\eta _1/{\hat{n}}_1-\eta _1/{n}_1)=\eta _1(1-{\hat{n}}_1/n_1)=o_P({\hat{n}}_1 \varDelta ) \end{aligned}$$
(43)

under (8). Similarly, we have \( \eta _2(1-{\hat{n}}_2/n_2)=o_P({\hat{n}}_2 \varDelta )\) under (8). Then, from (41) and (42), we have that when \({{{\varvec{x}}}}_0 \in \varPi _i\) for \(i=1,2\),

$$\begin{aligned} {\hat{y}}({{{\varvec{x}}}}_0)&=2(-1)^i\frac{\kappa _{2i-1}-\kappa _5}{\varDelta _{*}}+\frac{\kappa _1 -\kappa _3}{\varDelta _{*}} +\frac{\eta _1/n_1-\eta _2/n_2}{\varDelta _{*}} +o_P\Big (\frac{\varDelta }{\varDelta _{*}} \Big ) \nonumber \\&=(-1)^i {\varDelta }/{\varDelta _{*}} +{\delta }/{\varDelta _{*}} +o_P({\varDelta }/{\varDelta _{*}} ) \end{aligned}$$
(44)

under (8). Hence, we conclude the second result of Proposition 1.

Finally, we consider the proof of Proposition 2. In view of (13), we obtain the first result. By noting that \(\varDelta _{*}/\varDelta \rightarrow 1\) and \(\delta /\varDelta =o(1)\) under (12) and (D), it holds from (42) that \({\hat{y}}({{{\varvec{x}}}}_0)=(-1)^i+o_P(1)\) under (12) and (D). We conclude the second result. \(\square \)

Proofs of Theorem 1 and Corollary 1

We assume (A-i) and (A-i’). We consider the following conditions under (D):

$$\begin{aligned} \liminf _{d\rightarrow \infty } {\eta _{2}}/({n_{2} \varDelta })>0 \ \hbox { and } \ {\eta _{1}}/({n_{1} \varDelta })=o(1). \end{aligned}$$
(45)

Let \(\varDelta _{*2}=\varDelta +\eta _{2}/n_{2}\). Note that \(\eta _1 \sum _{j=1}^{n_1}{\hat{\alpha }}_j^2/{\hat{\alpha }}_{\star }^2=o_P(\varDelta )\) under (45). Similar to (41), it holds from (42) and (43) that \({\hat{\alpha }}_{\star }=(2/{\varDelta _{*2}})\{1+o_P(\varDelta /\varDelta _{*2}) \}\) and

$$\begin{aligned} {\hat{y}}({{{\varvec{x}}}}_0) =(-1)^i\varDelta /\varDelta _{*2}+\delta /\varDelta _{*2} +o_P(\varDelta /\varDelta _{*2}) \end{aligned}$$
(46)

under (45) when \({{{\varvec{x}}}}_0 \in \varPi _i\) for \(i=1,2\). Note that \(\varDelta _{*}/\varDelta \rightarrow 1\) and \(\delta /\varDelta _{*} \rightarrow 0\) under (12) and \(\varDelta _{*}/ \varDelta _{*2}\rightarrow 1\) under (45). From Propositions 1, 2 and (46), we obtain (44) under (D) without (8). Thus, from (44), we conclude the results of Theorem 1 and Corollary 1. \(\square \)

Proofs of Lemma 2 and Theorem 2

Under (A-i) and (D), it holds that \({\hat{\varDelta }}_*=\varDelta _*+o_P(\varDelta )\) and \({\hat{\eta }}_i=\eta _i+o_P(\varDelta )\) for \(i=1,2\). Thus, we can conclude the result of Lemma 2. From the proofs of Theorem 1 and Corollary 1, we obtain (44) under (A-i) and (D). By combining (44) with Lemma 2, we conclude the result of Theorem 2. \(\square \)

Proofs of Lemma 3, Corollaries 2 and 3

We assume (A-ii) and (C-ii). Assume also \({{\varvec{\mu }}}_2={\varvec{0}}\) without loss of generality. Note that \(\kappa _1=\Vert {{\varvec{\mu }}}_1 \Vert ^2\), \(\kappa _2=\Vert {{\varvec{\mu }}}_1 \Vert ^2+{\mathrm{tr}}({{\varvec{\varSigma }}}_1)\), \(\kappa _3=\kappa _5=0\), \(\kappa _4=\eta _{2(I)}\) and \(\varDelta _{(I)}=\Vert {{\varvec{\mu }}}_1\Vert ^2\). Also, note that

$$\begin{aligned} {{\varvec{\mu }}}_1^\mathrm{T} {{\varvec{\varSigma }}}_i{{\varvec{\mu }}}_1\le \varDelta _{(I)} \lambda _{\max }({{\varvec{\varSigma }}}_i)\le \varDelta _{(I)}{\mathrm{tr}}({{\varvec{\varSigma }}}_i^2)^{1/2}. \end{aligned}$$
(47)

Then, by using Chebyshev’s inequality, for any \(\tau >0\) we have that

$$\begin{aligned}&\sum _{j=1 }^{n_1} P\left( | {{\varvec{\mu }}}_1^\mathrm{T} ({{{\varvec{x}}}}_{1j}-{{\varvec{\mu }}}_1)| \ge \tau \varDelta _{(I)}\right) \le n_1 \left( \tau \varDelta _{(I)}\right) ^{-4} E\left[ \left\{ {{\varvec{\mu }}}_1^\mathrm{T} ({{{\varvec{x}}}}_{1j}-{{\varvec{\mu }}}_1) \right\} ^4\right] \nonumber \\&\quad =O\left\{ n_1 \left( \left( {{\varvec{\mu }}}_1^\mathrm{T} {{\varvec{\varSigma }}}_i{{\varvec{\mu }}}_1\right) ^2+\sum _{r=1}^{p_1} \left( {\varvec{\gamma }}_{r}^\mathrm{T} {{\varvec{\mu }}}_1\right) ^4 \right) /\varDelta _{(I)}^4 \right\} =O\left( n_1 {\mathrm{tr}}\left( {{\varvec{\varSigma }}}_i^2\right) / \varDelta _{(I)}^2 \right) \rightarrow 0 \end{aligned}$$
(48)

from the fact that \(\sum _{r=1}^{p_1} ({\varvec{\gamma }}_{r}^\mathrm{T} {{\varvec{\mu }}}_1)^4\le ({{\varvec{\mu }}}_1^\mathrm{T}{{\varvec{\varSigma }}}_i{{\varvec{\mu }}}_1)^2\), where \({\varvec{\varGamma }}_1=[{\varvec{\gamma }}_1,\ldots ,{\varvec{\gamma }}_{p_1}]\). On the other hand, we have that

$$\begin{aligned}&\sum _{j<j' }^{n_i} P(|({{{\varvec{x}}}}_{ij}-{{\varvec{\mu }}}_i )^\mathrm{T}({{{\varvec{x}}}}_{ij'}-{{\varvec{\mu }}}_i)|\ge \tau \varDelta _{(I)} ) \nonumber \\&\quad \le \sum _{j<j' }^{n_i} (\tau \varDelta _{(I)})^{-4} E\left[ \{ ({{{\varvec{x}}}}_{1j}-{{\varvec{\mu }}}_1 )^\mathrm{T} ({{{\varvec{x}}}}_{1j'}-{{\varvec{\mu }}}_1) \}^4\right] =O\left( n_i^2 {\mathrm{tr}}\left( {{\varvec{\varSigma }}}_i^2\right) ^2 /\varDelta _{(I)}^4 \right) \rightarrow 0. \end{aligned}$$
(49)

Note that \({{{\varvec{x}}}}_{1j}^\mathrm{T}{{{\varvec{x}}}}_{1j'}-\kappa _1=({{{\varvec{x}}}}_{1j}-{{\varvec{\mu }}}_1 )^\mathrm{T} ({{{\varvec{x}}}}_{1j'}-{{\varvec{\mu }}}_1)+{{\varvec{\mu }}}_1^\mathrm{T}({{{\varvec{x}}}}_{1j}-{{\varvec{\mu }}}_1 +{{{\varvec{x}}}}_{1j'}-{{\varvec{\mu }}}_1 )\). Thus, from (48) and (49), it holds that

$$\begin{aligned} {{{\varvec{x}}}}_{1j}^\mathrm{T}{{{\varvec{x}}}}_{1j'}=\kappa _1+o_P(\varDelta _{(I)})\ \hbox { for all } j<j' \le n_1 . \end{aligned}$$
(50)

Note that

$$\begin{aligned}&\sum _{j=1}^{n_1}\sum _{j'=1}^{n_2} P(|({{{\varvec{x}}}}_{1j}-{{\varvec{\mu }}}_1 )^\mathrm{T}({{{\varvec{x}}}}_{2j'}-{{\varvec{\mu }}}_2)|\ge \tau \varDelta _{(I)} ) \nonumber \\&\quad =O\Big (n_1 n_2 \{{\mathrm{tr}}({{\varvec{\varSigma }}}_1{{\varvec{\varSigma }}}_2)\}^2+{\mathrm{tr}}({{\varvec{\varSigma }}}_1{{\varvec{\varSigma }}}_2{{\varvec{\varSigma }}}_1{{\varvec{\varSigma }}}_2) \} /\varDelta _{(I)}^4 \Big )\rightarrow 0 \end{aligned}$$
(51)

from the fact that \({\mathrm{tr}}({{\varvec{\varSigma }}}_1{{\varvec{\varSigma }}}_2{{\varvec{\varSigma }}}_1{{\varvec{\varSigma }}}_2) \le \{{\mathrm{tr}}({{\varvec{\varSigma }}}_1{{\varvec{\varSigma }}}_2)\}^2\). Then, similar to (50), we have that

$$\begin{aligned}&{{{\varvec{x}}}}_{2j}^\mathrm{T}{{{\varvec{x}}}}_{2j'}=\kappa _3+o_P(\varDelta _{(I)}) \hbox { for all } j<j' \le n_2,\\&{{{\varvec{x}}}}_{1j}^\mathrm{T}{{{\varvec{x}}}}_{2j'}=\kappa _5+o_P(\varDelta _{(I)}) \hbox { for all } j=1,\ldots ,n_1;\ j'=1,\ldots ,n_2,\\&{{{\varvec{x}}}}_0^\mathrm{T}{{{\varvec{x}}}}_{ij}=\kappa _{2i-1}+o_P(\varDelta _{(I)}) \hbox { for all } 1 \le j \le n_i, i=1,2, \hbox { when } {{{\varvec{x}}}}_0 \in \varPi _i\\&\hbox {and } {{{\varvec{x}}}}_0^\mathrm{T}{{{\varvec{x}}}}_{i'j}=\kappa _{5}+o_P(\varDelta _{(I)}) \hbox { for all } 1 \le j \le n_i, i=1,2\ (i'\ne i) \hbox { when } {{{\varvec{x}}}}_0 \in \varPi _i. \end{aligned}$$

In addition, for any \(\tau >0\) we have that

$$\begin{aligned} \sum _{j=1 }^{n_i} P\big (\big | \Vert {{{\varvec{x}}}}_{ij}-{{\varvec{\mu }}}_i \Vert ^2-{\mathrm{tr}}({{\varvec{\varSigma }}}_i) \big |\ge \tau \varDelta _{(I)} \big ) =O\Big (n_i {\mathrm{tr}}\left( {{\varvec{\varSigma }}}_i^2\right) /\varDelta _{(I)}^2 \Big )\rightarrow 0 \end{aligned}$$
(52)

for \(i=1,2\). Thus, from (48) and (52), it holds that for all \(j=1,\ldots , n_i;\ i=1,2 \)

$$\begin{aligned} {{{\varvec{x}}}}_{ij}^\mathrm{T}{{{\varvec{x}}}}_{ij}=\kappa _{2i}+o_P(\varDelta _{(I)}). \end{aligned}$$

It concludes Lemma 3.
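The concentration (50) is easy to see numerically. The sketch below uses standard Gaussian data with \({{\varvec{\varSigma }}}_1={{\varvec{I}}}_d\) and a hypothetical mean vector (so \(n_1 {\mathrm{tr}}({{\varvec{\varSigma }}}_1^2)/\varDelta _{(I)}^2\) is small, as the proof requires) and checks that all off-diagonal entries of the within-class Gram matrix are close to \(\kappa _1\) relative to \(\varDelta _{(I)}\):

```python
import numpy as np

rng = np.random.default_rng(2)
d, n1 = 10_000, 20
mu1 = np.zeros(d)
mu1[:5000] = 1.0                           # hypothetical mean: Delta_(I) = ||mu1||^2 = 5000 (mu2 = 0)
X1 = mu1 + rng.normal(0.0, 1.0, (n1, d))   # samples from Pi_1 with Sigma_1 = I_d, tr(Sigma_1^2) = d
kappa1 = float(mu1 @ mu1)                  # kappa1 = ||mu1||^2 since mu2 = 0
Delta_I = kappa1
G = X1 @ X1.T                              # Gram matrix of inner products
off = G[np.triu_indices(n1, k=1)]          # x_{1j}^T x_{1j'} for j < j'
max_dev = float(np.max(np.abs(off - kappa1)) / Delta_I)  # o_P(1) by (50)
print(max_dev)
```

Even the worst of the \(n_1(n_1-1)/2\) pairs deviates from \(\kappa _1\) by only a few percent of \(\varDelta _{(I)}\) in this setting.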

For the proofs of Corollaries 2 and 3, from Theorems 1 and 2 and Corollary 1, we conclude the results. \(\square \)

Proofs of Lemma 4, Corollaries 4 and 5

We assume (A-ii). Let \(\varOmega = \min \{\gamma \varDelta _{(II)}/\psi ,\ \gamma \} \). Similar to (48), for any \(\tau >0\), we have that under (C-iii) and (D)

$$\begin{aligned} \sum _{j=1 }^{n_i} P(| ({{\varvec{\mu }}}_1-{{\varvec{\mu }}}_2)^\mathrm{T}({{{\varvec{x}}}}_{ij}-{{\varvec{\mu }}}_i)| \ge \tau \varOmega ) \rightarrow 0 \end{aligned}$$

for \(i=1,2\), so that \(({{\varvec{\mu }}}_1-{{\varvec{\mu }}}_2)^\mathrm{T}({{{\varvec{x}}}}_{ij}-{{\varvec{\mu }}}_i)=o_P(\varOmega )\) for all \(j=1,\ldots ,n_i;\ i=1,2\). Similarly, \(\Vert {{{\varvec{x}}}}_{ij}-{{\varvec{\mu }}}_i\Vert ^2={\mathrm{tr}}({{\varvec{\varSigma }}}_i)+ o_P(\varOmega )\) for all \(j=1,\ldots ,n_i;\ i=1,2\), and \( ({{{\varvec{x}}}}_{1j}-{{\varvec{\mu }}}_1)^\mathrm{T}({{{\varvec{x}}}}_{2j'}-{{\varvec{\mu }}}_2)=o_P(\varOmega )\) for all \(j=1,\ldots ,n_1;\ j'=1,\ldots ,n_2\). Then, under (C-iii), we have that for all \(j=1,\ldots ,n_1;\ j'=1,\ldots ,n_2\)

$$\begin{aligned} \exp ( -\Vert {{{\varvec{x}}}}_{1j}-{{{\varvec{x}}}}_{2j'} \Vert ^2/\gamma )&= \exp ( -\Vert ({{{\varvec{x}}}}_{1j}-{{\varvec{\mu }}}_1)-({{{\varvec{x}}}}_{2j'}-{{\varvec{\mu }}}_2)+{{\varvec{\mu }}}_1-{{\varvec{\mu }}}_2 \Vert ^2/\gamma ) \nonumber \\&=\kappa _{5(II)}+o_P(\kappa _{5(II)} \varOmega /\gamma )=\kappa _{5(II)}+o_P(\varDelta _{(II)}) \end{aligned}$$
(53)

from the fact that \(\kappa _{5(II)} \le \psi \). Similar to (53), we can conclude that the assumptions (A-i) and (A-i’) are met. It concludes Lemma 4.
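A quick numerical illustration of (53): with assumed Gaussian classes, identity covariances, hypothetical means and the hypothetical scale choice \(\gamma =d\), the between-class Gaussian kernel value concentrates at \(\exp \{-(\Vert {{\varvec{\mu }}}_1-{{\varvec{\mu }}}_2\Vert ^2+{\mathrm{tr}}({{\varvec{\varSigma }}}_1)+{\mathrm{tr}}({{\varvec{\varSigma }}}_2))/\gamma \}\), the \(\kappa _{5(II)}\)-type limit in this simplified setting:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 10_000
gamma = float(d)                       # hypothetical scale choice: gamma of order d
mu1 = np.zeros(d)
mu1[:2500] = 1.0                       # hypothetical means with mu2 = 0
x1 = mu1 + rng.normal(0.0, 1.0, d)     # x1 from Pi_1, Sigma_1 = I_d
x2 = rng.normal(0.0, 1.0, d)           # x2 from Pi_2, Sigma_2 = I_d
k_val = float(np.exp(-np.sum((x1 - x2) ** 2) / gamma))
# Here ||x1 - x2||^2 concentrates at ||mu1 - mu2||^2 + tr(Sigma_1) + tr(Sigma_2),
# so the between-class kernel value concentrates at the limit below.
limit = float(np.exp(-(mu1 @ mu1 + 2 * d) / gamma))
rel_err = abs(k_val - limit) / limit
print(rel_err)
```

The fluctuation of the exponent is \(O_P(\sqrt{d})/\gamma =O_P(d^{-1/2})\) here, which is what drives (53).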

For the proofs of Corollaries 4 and 5, from Theorems 1 and 2 and Corollary 1, we conclude the results. \(\square \)

Proofs of Propositions 3 and 4

From (23), (C-iii) holds under (C-ii) and (C-v). Thus, from (44) and Lemmas 2 to 4, we conclude Proposition 4. For the proof of Proposition 3, we note that \({\mathrm{tr}}({{\varvec{\varSigma }}}_i)/\gamma \rightarrow 0\) for \(i=1,2\), under (C-iv) from the fact that \(\varDelta _{(I)}=O(d)\). Thus, it holds that \(\psi \rightarrow 1\) and \(\gamma \eta _{i(II)}=2{\mathrm{tr}}({{\varvec{\varSigma }}}_i)+O(d^2/\gamma )\) for \(i=1,2\), under (C-iv). In addition, from (22) it holds that \(\delta _{(II)}/\varDelta _{(II)}=\delta _{(I)}\{1+o(1)\}/\varDelta _{(I)}+o(1)\) under (C-iv). Thus, from (44) and Lemmas 3 and 4, we conclude Proposition 3. \(\square \)

Proof of Proposition 5

We assume (24). Note that \(1/\omega \rightarrow 0\) under (24). First, we consider the case when \(\limsup _{d\rightarrow \infty } \gamma _{\star }<\infty \). Then, it holds that \(F (\gamma _{\star }) = \{1+o(1)\}/\gamma _{ \star }\), so that \(\liminf _{d\rightarrow \infty } F (\gamma _{\star })>0\). Next, we consider the case when \(\gamma _{\star } \rightarrow \infty \). Let \(\nu =\omega /\gamma _{\star }\ (>0)\). Note that \(\nu =\varDelta _{(I)}/\gamma \). Then, it holds that

$$\begin{aligned} \omega F (\gamma _{\star })=\nu +\frac{2\nu \exp (-\nu ) \{1+o(\nu )\}}{ \{1-\exp (-\nu ) \}+o(\nu )}. \end{aligned}$$

Let \(g(\nu )=\nu + 2\nu \exp (-\nu )/\{1-\exp (-\nu ) \}\). Note that \(g(\nu )\) is a monotonically increasing function and \(g(\nu )\rightarrow 2\) as \(\nu \rightarrow 0\), so that \( F (\gamma _{\star })=2\{1+o(1)\}/\omega =o(1)\) when \(\nu \rightarrow 0\). We can conclude the result. \(\square \)
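The two facts used about \(g\) admit a direct numerical check; a minimal sketch confirming that \(g\) is increasing and that \(g(\nu )\rightarrow 2\) as \(\nu \rightarrow 0\):

```python
import math

def g(nu: float) -> float:
    # g(nu) = nu + 2*nu*exp(-nu) / (1 - exp(-nu)), from the proof of Proposition 5
    return nu + 2 * nu * math.exp(-nu) / (1 - math.exp(-nu))

vals = [g(nu) for nu in (1e-6, 0.1, 1.0, 5.0)]
print(vals)
```

The values increase along the grid and the first is numerically indistinguishable from the limit 2.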

Proof of Proposition 6

When \( \omega \le 1\), it holds that \( F (\gamma _{\star })=2\{1+o(1)\}/\omega \) under \(\gamma _\star \rightarrow \infty \). When \(\omega \le 1\) and \(\gamma _\star =1\), it holds that

$$\begin{aligned} F (\gamma _{\star })=1+\frac{4}{\exp (\omega +1)+\exp (\omega -1) -2}< 1+1/\omega \le 2/\omega \end{aligned}$$

from the facts that \(\exp (\omega +1)>1+(\omega +1)+(\omega +1)^2/2\ge 2+3\omega \) and \(\exp (\omega -1) \ge \omega \). Hence, when \(\omega \le 1\), we have that \(\varDelta _{\varSigma }/\gamma _0 \in (0, \infty )\) as \(d\rightarrow \infty \). It concludes the result. \(\square \)
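The chain of bounds in this proof can also be checked directly; a minimal numerical sketch over a few hypothetical values of \(\omega \in (0,1]\):

```python
import math

def F_at_gamma_star_1(omega: float) -> float:
    # F(gamma_star) evaluated at gamma_star = 1, as displayed in the proof of Proposition 6
    return 1 + 4 / (math.exp(omega + 1) + math.exp(omega - 1) - 2)

checks = [(om, F_at_gamma_star_1(om)) for om in (0.1, 0.5, 1.0)]
print(checks)
```

Each value satisfies \(F(\gamma _{\star })<1+1/\omega \le 2/\omega \), as the displayed inequality asserts.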

Proof of Proposition 7

From Proposition 1, Lemma 2 and (11), we can conclude the results. \(\square \)

Proof of Proposition 8

Recall that \(\kappa _1=(\zeta + \Vert {{\varvec{\mu }}}_1\Vert ^2)^r\), \(\kappa _2=(\zeta +{\mathrm{tr}}({{\varvec{\varSigma }}}_1)+\Vert {{\varvec{\mu }}}_1\Vert ^2)^r\), \(\kappa _3=(\zeta +\Vert {{\varvec{\mu }}}_2\Vert ^2)^r\), \(\kappa _4=(\zeta +{\mathrm{tr}}({{\varvec{\varSigma }}}_2)+\Vert {{\varvec{\mu }}}_2\Vert ^2)^r\) and \(\kappa _5=(\zeta +{{\varvec{\mu }}}_1^T{{\varvec{\mu }}}_2)^r\). From (1), we note that \({{\varvec{\mu }}}_i^T{{\varvec{\varSigma }}}_{i'}{{\varvec{\mu }}}_i\le \Vert {{\varvec{\mu }}}_i \Vert ^2 \lambda _{\max }({{\varvec{\varSigma }}}_i)=o(d^2)\) as \(d\rightarrow \infty \) for \(i,i'=1,2\). Then, similar to (50)–(52), for the polynomial kernel, we have that \({{{\varvec{x}}}}_{ij}^\mathrm{T}{{{\varvec{x}}}}_{ij'}=\Vert {{\varvec{\mu }}}_i\Vert ^2+o_P(d)\) for all \(j<j',\ i=1,2\), \({{{\varvec{x}}}}_{ij}^\mathrm{T}{{{\varvec{x}}}}_{ij}={\mathrm{tr}}({{\varvec{\varSigma }}}_i)+\Vert {{\varvec{\mu }}}_i\Vert ^2+o_P(d)\) for all ij, and \({{{\varvec{x}}}}_{1j}^\mathrm{T}{{{\varvec{x}}}}_{2j'}={{\varvec{\mu }}}_1^T{{\varvec{\mu }}}_2+o_P(d)\) for all \(j,j'\), so that \(k({{{\varvec{x}}}}_{ij},{{{\varvec{x}}}}_{ij'})=\kappa _{2i-1}+o_P(d^r)\) for all \(j<j',\ i=1,2\), \(k({{{\varvec{x}}}}_{ij},{{{\varvec{x}}}}_{ij})=\kappa _{2i}+o_P(d^r)\) for all ij, and \(k({{{\varvec{x}}}}_{1j},{{{\varvec{x}}}}_{2j'})=\kappa _{5}+o_P(d^r)\) for all \(j,j'\). Here, note that

$$\begin{aligned}&(\zeta + \Vert {{\varvec{\mu }}}_1\Vert ^2)^r +(\zeta + \Vert {{\varvec{\mu }}}_2\Vert ^2)^r -2(\zeta +{{\varvec{\mu }}}_1^T{{\varvec{\mu }}}_2)^r\\&\quad \ge \{(\zeta + \Vert {{\varvec{\mu }}}_1\Vert ^2)^{r/2}-(\zeta + \Vert {{\varvec{\mu }}}_2\Vert ^2)^{r/2} \}^2 \end{aligned}$$

from the fact that \((\zeta +{{\varvec{\mu }}}_1^T{{\varvec{\mu }}}_2)^r\le (\zeta + \Vert {{\varvec{\mu }}}_1\Vert ^2)^{r/2}(\zeta + \Vert {{\varvec{\mu }}}_2\Vert ^2)^{r/2} \). Then, it holds that \(\liminf _{d\rightarrow \infty }\varDelta /d^r>0\) from (39). Thus, we have (A-i). Similarly, we can conclude (A-i’). From Theorem 2, the BC-SVM (17) holds (6) for the polynomial kernel. It concludes Proposition 8. \(\square \)
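The key inequality above is the Cauchy–Schwarz bound applied to the augmented vectors \((\sqrt{\zeta },\ {{\varvec{\mu }}}_i)\); a quick numerical check with hypothetical random means and hypothetical \(\zeta \), \(r\):

```python
import numpy as np

rng = np.random.default_rng(4)
d, zeta, r = 200, 200.0, 3                  # hypothetical dimension, zeta and r
mu1, mu2 = rng.normal(0, 1, d), rng.normal(0, 1, d)
# Cauchy-Schwarz on the augmented vectors (sqrt(zeta), mu_i):
#   (zeta + mu1.mu2)^r <= (zeta + ||mu1||^2)^(r/2) * (zeta + ||mu2||^2)^(r/2)
lhs = (zeta + mu1 @ mu2) ** r
rhs = (zeta + mu1 @ mu1) ** (r / 2) * (zeta + mu2 @ mu2) ** (r / 2)
print(lhs <= rhs)
```

Combined with (39), this lower-bounds \(\varDelta \) by a quantity of order \(d^r\), which is exactly how (A-i) is obtained.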

About this article


Cite this article

Nakayama, Y., Yata, K. & Aoshima, M. Bias-corrected support vector machine with Gaussian kernel in high-dimension, low-sample-size settings. Ann Inst Stat Math 72, 1257–1286 (2020). https://doi.org/10.1007/s10463-019-00727-1


Keywords

  • Geometric representation
  • HDLSS
  • Imbalanced data
  • Radial basis function kernel