Abstract
Multicategory classification problems are frequently encountered in practice. Since massive data sets are increasingly common and often stored locally, we first propose a distributed estimator in the multicategory angle-based classification framework and derive its excess risk under general conditions. Further, under various contamination settings, we develop two robust distributed algorithms for estimating the multicategory classifier. The first algorithm exploits the median-of-means (MOM) principle and is built on a MOM-based gradient estimate. The second algorithm is implemented via a weighted gradient estimate. Theoretical guarantees for both algorithms are established through non-asymptotic error bounds on the iterative estimates. Numerical simulations demonstrate that our methods effectively reduce the impact of outliers.
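The median-of-means device underlying the first robust algorithm can be illustrated with a minimal sketch (purely illustrative Python, not the paper's implementation): split the sample into disjoint blocks, average within each block, and take the coordinatewise median of the block means, which blunts the influence of a minority of contaminated blocks.

```python
import numpy as np

def median_of_means(x, n_blocks):
    """Coordinatewise median-of-means estimate of E[x].

    x        : (n, d) array of samples.
    n_blocks : number L of disjoint blocks.

    Averages within each block, then takes the coordinatewise
    median of the L block means; a few contaminated blocks
    cannot move the median far.
    """
    blocks = np.array_split(np.arange(x.shape[0]), n_blocks)
    block_means = np.stack([x[idx].mean(axis=0) for idx in blocks])
    return np.median(block_means, axis=0)
```

With 20 of 500 samples shifted far away, the plain mean is dragged off target while the MOM estimate stays near the true mean, since the outliers corrupt only one of the blocks.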
References
Allwein EL, Schapire RE, Singer Y (2000) Reducing multiclass to binary: a unifying approach for margin classifiers. J Mach Learn Res 1:113–141
Alon N, Matias Y, Szegedy M (1999) The space complexity of approximating the frequency moments. J Comput Syst Sci 58(1):137–147
Bubeck S (2015) Convex optimization: algorithms and complexity. Found Trends Mach Learn 8:231–357
Chang CC, Lin CJ (2011) LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol 2(3):1–27
Chen Y, Su L, Xu J (2017) Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. Proc ACM Meas Anal Comput Syst 1(2):1–25
Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3):273–297
Dobriban E, Sheng Y (2021) Distributed linear regression by averaging. Ann Stat 49(2):918–943
Hastie T, Tibshirani R, Friedman JH (2009) The elements of statistical learning: data mining, inference, and prediction. Springer, New York
Hill SI, Doucet A (2007) A framework for kernel-based multi-category classification. J Artif Intell Res 30:525–564
Holland MJ, Ikeda K (2019) Efficient learning with robust gradient descent. Mach Learn 108(8–9):1523–1560
Huber PJ, Ronchetti EM (2009) Robust statistics. Wiley, Hoboken
Jordan MI, Lee JD, Yang Y (2019) Communication-efficient distributed statistical inference. J Am Stat Assoc 114(526):668–681
Lange K, Wu T (2008) An MM algorithm for multicategory vertex discriminant analysis. J Comput Graph Stat 17(3):527–544
Lee Y, Lin Y, Wahba G (2004) Multicategory support vector machines, theory, and application to the classification of microarray data and satellite radiance data. J Am Stat Assoc 99(465):67–81
Li T, Sahu AK, Talwalkar A et al (2020) Federated learning: challenges, methods, and future directions. IEEE Signal Process Mag 37(3):50–60
Li K, Bao H, Zhang L (2021) Robust covariance estimation for distributed principal component analysis. Metrika. https://doi.org/10.1007/s00184-021-00848-9
Lian H, Fan Z (2018) Divide-and-conquer for debiased l1-norm support vector machine in ultra-high dimensions. J Mach Learn Res 18(182):1–26
Lin SB, Guo X, Zhou DX (2017) Distributed learning with regularized least squares. J Mach Learn Res 18(92):1–31
Liu Y, Shen X (2006) Multicategory \(\psi \)-learning. J Am Stat Assoc 101(474):500–509
Liu Y, Yuan M (2011) Reinforced multicategory support vector machines. J Comput Graph Stat 20(4):901–919
Luo J, Sun Q, Zhou W (2022) Distributed adaptive Huber regression. Comput Stat Data Anal 169:107419
Minsker S (2015) Geometric median and robust estimation in Banach spaces. Bernoulli 21(4):2308–2335
Minsker S (2019) Distributed statistical estimation and rates of convergence in normal approximation. Electron J Stat 13(2):5213–5252
Minsker S, Ndaoud M (2021) Robust and efficient mean estimation: an approach based on the properties of self-normalized sums. Electron J Stat 15(2):6036–6070
Prasad A, Suggala AS, Balakrishnan S et al (2020) Robust estimation via robust gradient estimation. J R Stat Soc Ser B Stat Methodol 82(3):601–627
Rosenblatt JD, Nadler B (2016) On the optimality of averaging in distributed statistical learning. Inf Inference 5(4):379–404
Sun H, Craig BA, Zhang L (2017) Angle-based multicategory distance-weighted SVM. J Mach Learn Res 18(1):2981–3001
Tu J, Liu W, Mao X et al (2021) Variance reduced median-of-means estimator for Byzantine-robust distributed inference. J Mach Learn Res 22(84):1–67
Wang L, Lian H (2020) Communication-efficient estimation of high-dimensional quantile regression. Anal Appl 18(6):1057–1075
Yang Y, Guo Y, Chang X (2021) Angle-based cost-sensitive multicategory classification. Comput Stat Data Anal 156:107107
Yin D, Chen Y, Ramchandran K et al (2018) Byzantine-robust distributed learning: towards optimal statistical rates. In: Proceedings of the 35th international conference on machine learning vol 80, pp 5650–5659
Yin D, Chen Y, Ramchandran K et al (2019) Defending against saddle point attack in Byzantine-robust distributed learning. In: Proceedings of the 36th international conference on machine learning, vol 97, pp 7074–7084
Zhang C, Liu Y (2014) Multicategory angle-based large-margin classification. Biometrika 101(3):625–640
Zhang C, Liu Y, Wang J et al (2016) Reinforced angle-based multicategory support vector machines. J Comput Graph Stat 25(3):806–825
Zhang Y, Duchi JC, Wainwright MJ (2013) Communication-efficient algorithms for statistical optimization. J Mach Learn Res 14(68):3321–3363
Zhang C, Pham M, Fu S et al (2018) Robust multicategory support vector machines using difference convex algorithm. Math Program 169(1):277–305
Zhang Y, Duchi JC, Wainwright MJ (2015) Divide and conquer kernel ridge regression: a distributed algorithm with minimax optimal rates. J Mach Learn Res 16(102):3299–3340
Zhao T, Cheng G, Liu H (2016) A partially linear framework for massive heterogeneous data. Ann Stat 44(4):1400–1437
Zhou WX, Bose K, Fan J et al (2018) A new perspective on robust m-estimation: finite sample theory and applications to dependence-adjusted multiple testing. Ann Stat 46(5):1904–1931
Acknowledgements
Xiaozhou Wang and Riquan Zhang are the co-corresponding authors. We thank the editor in chief and the two referees for their comments and suggestions which we believe led to an improved manuscript.
Funding
Xiaozhou Wang’s research is supported by the National Natural Science Foundation of China (12101240), the Chenguang Program of Shanghai Education Development Foundation and Shanghai Municipal Education Commission (20CG29), and Shanghai Sailing Program (21YF1410500). Riquan Zhang’s research is supported by the National Natural Science Foundation of China (11971171, 11831008, 12171310), and the Basic Research Project of Shanghai Science and Technology Commission (22JC1400800).
Ethics declarations
Conflict of interest
No potential conflict of interest was reported by the authors.
Appendix A: Proof of theoretical results
A.1 Proof of Theorem 1
Given a compact convex set \({\mathcal {B}}=\{\varvec{\beta }:\Vert \varvec{\beta }\Vert _2^2\le c_0\}\), by Lagrangian duality for the convex loss \(\ell \) and the penalty \(\Vert \varvec{\beta }\Vert _2^2\), minimizing the optimization model (5) is equivalent to solving the optimization problem
subject to \(\Vert \varvec{\beta }\Vert _2^2\le c_0\). Here, \(c_0\) is a positive constant depending on \(\lambda _m>0\) for \(m=1,\ldots ,M\). That is, it suffices to consider the optimization problem (A1) on the compact convex set \({\mathcal {B}}\subset {\mathbb {R}}^{(k-1)p}\).
According to the condition of
in Assumption 3, we have
where I denotes the \((k-1)p\times (k-1)p\) identity matrix.
Based on the above argument under Assumptions 1–3, and by Corollary 2 in Zhang et al. (2013), we directly obtain that
where \(C_0\) is a numerical constant.
By the Lipschitz continuity of the loss \(\ell \) in Assumption 1, we get that
Further, applying the Cauchy-Schwarz inequality yields that
Combining (A3) and (A4), we have
Substituting (A2) into (A5), we finally obtain that
where C is a numerical constant.
A.2 Proof of Proposition 3 and Theorem 4
Proof of Proposition 3
By the definition of the MOM-based gradient in (10), for given \(\widetilde{\varvec{\beta }}\) with \(\Vert \widetilde{\varvec{\beta }}-\varvec{\beta }^0\Vert \le C_0\), we first observe the event
which implies that at least L/2 of the \(({\varvec{g}}_m^{(l)}(\widetilde{\varvec{\beta }})-\lambda _m\widetilde{\varvec{\beta }})_{r}\) have to lie at distance greater than \(\varepsilon _{m,r}\) from \((\nabla {\mathcal {R}}(\widetilde{\varvec{\beta }}))_{r}\). Namely,
Define \({\mathcal {L}}_m=\{l\in \{1,\ldots ,L\}:{\mathcal {G}}_m^{(l)}\cap {\mathcal {O}}_m=\emptyset \}\), and let \(T_{m,r}^{(l)}={\varvec{1}}\{\vert ({\varvec{g}}_m^{(l)}(\widetilde{\varvec{\beta }}) -\lambda _m\widetilde{\varvec{\beta }}-\nabla {\mathcal {R}}(\widetilde{\varvec{\beta }}))_{r}\vert >\varepsilon _{m,r}\}\) and \(p_{m,r}^{(l)}=E(T_{m,r}^{(l)})\), then the above (A6) implies that
The last inequality follows from the one-sided Hoeffding inequality.
Note that
for some \(l\in {\mathcal {L}}\). Choosing \(\varepsilon _{m,r}=\sqrt{\frac{(2+c)(Cov(\nabla \ell (\widetilde{\varvec{\beta }})))_{rr}L}{n(1-2C_1)}}\) for any \(c>0\), we obtain that
This means that
with probability at least \(1-e^{-\frac{L(1-2C_1)^2c^2}{2(2+c)^2}}\).
Thus, for given \(\widetilde{\varvec{\beta }}\) with \(\Vert \widetilde{\varvec{\beta }}-\varvec{\beta }^0\Vert \le C_0\), we get that
with probability at least \(1-e^{-\frac{L(1-2C_1)^2c^2}{2(2+c)^2}}\), where C is a numerical constant. \(\square \)
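For concreteness, the MOM-based gradient estimation analysed in Proposition 3 might be sketched as follows (an illustrative Python fragment; `grad_loss` is a hypothetical stand-in for the gradient of the smooth classification loss, and `lam` plays the role of the ridge-type penalty parameter \(\lambda _m\)):

```python
import numpy as np

def mom_gradient(beta, X, y, lam, n_blocks, grad_loss):
    """Sketch of a MOM-based gradient estimate on one local machine.

    grad_loss(beta, X_block, y_block) returns the average loss
    gradient over one data block (a stand-in for the angle-based
    multicategory loss gradient of the paper). The coordinatewise
    median over disjoint blocks is robust to a minority of
    contaminated blocks; the deterministic penalty term lam * beta
    is added outside the median since it is identical in every block.
    """
    blocks = np.array_split(np.arange(X.shape[0]), n_blocks)
    block_grads = np.stack([grad_loss(beta, X[i], y[i]) for i in blocks])
    return np.median(block_grads, axis=0) + lam * beta
```

At a minimizer of the unpenalized risk on clean data, every block gradient vanishes, so the median does as well.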
Proof of Theorem 4
Let \(\varvec{\delta }^{m,t}=\varvec{{\widehat{g}}}_m^{\textrm{MOM}}(\varvec{\beta }^{m,t}) -\nabla {\mathcal {R}}(\varvec{\beta }^{m,t})\in {\mathbb {R}}^{(k-1)p}\). Using the fact that \(\nabla {\mathcal {R}}(\varvec{\beta }^{*})=0\) together with the update rule of the MOM-based gradient estimation, i.e., \(\varvec{\beta }^{m,t+1}=\varvec{\beta }^{m,t} -\eta \varvec{{\widehat{g}}}_m^{\textrm{MOM}}(\varvec{\beta }^{m,t})\), we have
Write \({\varvec{G}}(\varvec{\beta }^{m,t},\varvec{\beta }^{*})=\nabla {\mathcal {R}}(\varvec{\beta }^{m,t}) -\nabla {\mathcal {R}}(\varvec{\beta }^{*})\) for brevity; then, by Lemma 3.11 of Bubeck (2015), we directly get that
Choosing \(\eta =\frac{2}{\sqrt{M}(\gamma _1+\gamma _2)}\), which depends on the number of local machines M, as the step size, we have
Thus, substituting (A8) into (A7), we have
Using Proposition 3 and the condition that \(Cov(\nabla \ell )\) is finite, for given \(\varvec{\beta }^{m,t}\) with \(\Vert \varvec{\beta }^{m,t}-\varvec{\beta }^0\Vert \le C_0\), we further have
with probability at least \(1-e^{-\frac{L(1-2C_1)^2c^2}{2(2+c)^2}}\) for all \(c>0\), where
and C is a numerical constant.
Writing \(\kappa =\sqrt{1-\frac{2\eta \gamma _1\gamma _2}{\gamma _1+\gamma _2}}\), since \(\kappa <1\) for \(0<\gamma _1<\gamma _2\), we obtain that
with probability at least \(1-Te^{-\frac{L(1-2C_1)^2c^2}{2(2+c)^2}}\) for all \(c>0\), and C is a numerical constant.
Finally, by the triangle inequality \(\Vert \varvec{\beta }^{T}-\varvec{\beta }^{*}\Vert _2 \le \frac{1}{M}\sum _{m=1}^M\Vert \varvec{\beta }^{m,T}-\varvec{\beta }^{*}\Vert _2\), we have
with probability at least \(1-Te^{-\frac{L(1-2C_1)^2c^2}{2(2+c)^2}}\). \(\square \)
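Putting the pieces together, the scheme analysed in Theorem 4 — each machine running T gradient-descent steps with its MOM-based gradient estimate from a common initial value, followed by averaging the M local iterates — might be sketched as below (illustrative Python; `grad_loss`, `eta`, and the data layout are assumptions for the sketch, and the paper's step size is \(\eta =\frac{2}{\sqrt{M}(\gamma _1+\gamma _2)}\)):

```python
import numpy as np

def mom_block_gradient(beta, X, y, n_blocks, grad_loss):
    # Coordinatewise median of per-block average gradients.
    blocks = np.array_split(np.arange(X.shape[0]), n_blocks)
    g = np.stack([grad_loss(beta, X[i], y[i]) for i in blocks])
    return np.median(g, axis=0)

def robust_distributed_gd(beta0, machines, n_blocks, eta, T, grad_loss):
    """Sketch of the MOM-based robust distributed iteration.

    machines : list of (X, y) pairs, one per local machine.
    Each machine runs T MOM-gradient descent steps from beta0;
    the server then averages the M local iterates, mirroring
    beta^T = (1/M) sum_m beta^{m,T} in the proof of Theorem 4.
    """
    local_iterates = []
    for X, y in machines:
        beta = beta0.copy()
        for _ in range(T):
            beta = beta - eta * mom_block_gradient(beta, X, y,
                                                   n_blocks, grad_loss)
        local_iterates.append(beta)
    return np.mean(local_iterates, axis=0)
```

On a noiseless strongly convex toy problem every local iterate contracts toward the common minimizer, so the averaged output recovers it.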
A.3 Proof of Proposition 5 and Theorem 6
Before the proof of Proposition 5, we restate Theorem 3.2 of Minsker and Ndaoud (2021) in the following form to provide the error bounds of the weighted gradient estimation.
Lemma 7
Suppose \(\max _{1\le m\le M}\vert {\mathcal {O}}_m\vert \le C_2L\) for \(0\le C_2<1\) and event \(\Omega _m^\nu \) holds. Then, we have
with probability at least \(1-2e^{-c_1}-Le^{-c_2n/L}\) for \(m=1,2,\ldots ,M\) and \(r=1,2,\ldots ,(k-1)p\). Here, \(C_\nu >0\), \(c_1>0\), \(c_2>0\), and
Thus, for given \(\widetilde{\varvec{\beta }}\) with \(\Vert \widetilde{\varvec{\beta }}-\varvec{\beta }^0\Vert \le C_0\), Lemma 7 directly gives that
with probability at least \(1-(2e^{-c_1}+Le^{-c_2n/L})(k-1)p\).
The proof of Theorem 6 is similar to that of Theorem 4 above and is therefore omitted. \(\square \)
Cite this article
Sun, G., Wang, X., Yan, Y. et al. Robust distributed multicategory angle-based classification for massive data. Metrika 87, 299–323 (2024). https://doi.org/10.1007/s00184-023-00915-3