Robust distributed multicategory angle-based classification for massive data

Sun, Gaoming; Wang, Xiaozhou; Yan, Yibo; Zhang, Riquan

doi:10.1007/s00184-023-00915-3

Robust distributed multicategory angle-based classification for massive data

Published: 28 June 2023

Volume 87, pages 299–323, (2024)
Cite this article

Metrika Aims and scope Submit manuscript

Gaoming Sun¹,
Xiaozhou Wang ORCID: orcid.org/0000-0002-8783-7246^1,2^na1,
Yibo Yan¹^na1 &
…
Riquan Zhang³^na1

296 Accesses
Explore all metrics

Abstract

Multicategory classification problems are frequently encountered in practice. Considering that the massive data sets are increasingly common and often stored locally, we first provide a distributed estimation in the multicategory angle-based classification framework and obtain its excess risk under general conditions. Further, under varied robustness settings, we develop two robust distributed algorithms to provide robust estimations of the multicategory classification. The first robust distributed algorithm takes advantage of median-of-means (MOM) and is designed by the MOM-based gradient estimation. The second robust distributed algorithm is implemented by constructing the weighted-based gradient estimation. The theoretical guarantees of our algorithms are established via the non-asymptotic error bounds of the iterative estimations. Some numerical simulations demonstrate that our methods can effectively reduce the impact of outliers.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Robust multicategory support vector machines using difference convex algorithm

Article 29 November 2017

Robust and sparse multinomial regression in high dimensions

Article 16 April 2023

Dataset Weighting via Intrinsic Data Characteristics for Pairwise Statistical Comparisons in Classification

References

Allwein EL, Schapire RE, Singer Y (2000) Reducing multiclass to binary: a unifying approach for margin classifiers. J Mach Learn Res 1:113–141
MathSciNet Google Scholar
Alon N, Matias Y, Szegedy M (1999) The space complexity of approximating the frequency moments. J Comput Syst Sci 58(1):137–147
Article MathSciNet Google Scholar
Bubeck S (2015) Convex optimization: algorithms and complexity. Found Trends® Mach Learn 8:231–357
Chang CC, Lin CJ (2011) LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol 2(3):1–27
Article Google Scholar
Chen Y, Su L, Xu J (2017) Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. Proc ACM Meas Anal Comput Syst 1(2):1–25
CAS Google Scholar
Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3):273–297
Article Google Scholar
Dobriban E, Sheng Y (2021) Distributed linear regression by averaging. Ann Stat 49(2):918–943
Article MathSciNet Google Scholar
Hastie T, Tibshirani R, Friedman JH (2009) The elements of statistical learning: data mining, inference, and prediction. Springer, New York
Book Google Scholar
Hill SI, Doucet A (2007) A framework for kernel-based multi-category classification. J Artif Intell Res 30:525–564
Article MathSciNet Google Scholar
Holland MJ, Ikeda K (2019) Efficient learning with robust gradient descent. Mach Learn 108(8–9):1523–1560
Article MathSciNet Google Scholar
Huber PJ, Ronchetti EM (2009) Robust statistics. Wiley, Hoboken
Book Google Scholar
Jordan MI, Lee JD, Yang Y (2019) Communication-efficient distributed statistical inference. J Am Stat Assoc 114(526):668–681
Article MathSciNet CAS Google Scholar
Lange K, Wu T (2008) An mm algorithm for multicategory vertex discriminant analysis. J Comput Graph Stat 17(3):527–544
Article MathSciNet Google Scholar
Lee Y, Lin Y, Wahba G (2004) Multicategory support vector machines, theory, and application to the classification of microarray data and satellite radiance data. J Am Stat Assoc 99(465):67–81
Article MathSciNet Google Scholar
Li T, Sahu AK, Talwalkar A et al (2020) Federated learning: challenges, methods, and future directions. IEEE Signal Process Mag 37(3):50–60
Article CAS Google Scholar
Li K, Bao H, Zhang L (2021) Robust covariance estimation for distributed principal component analysis. Metrika. https://doi.org/10.1007/s00184-021-00848-9
Article Google Scholar
Lian H, Fan Z (2018) Divide-and-conquer for debiased l1-norm support vector machine in ultra-high dimensions. J Mach Learn Res 18(182):1–26
ADS Google Scholar
Lin SB, Guo X, Zhou DX (2017) Distributed learning with regularized least squares. J Mach Learn Res 18(92):1–31
MathSciNet Google Scholar
Liu Y, Shen X (2006) Multicategory $\psi $-learning. J Am Stat Assoc 101(474):500–509
Article MathSciNet CAS Google Scholar
Liu Y, Yuan M (2011) Reinforced multicategory support vector machines. J Comput Graph Stat 20(4):901–919
Article MathSciNet Google Scholar
Luo J, Sun Q, Zhou W (2022) Distributed adaptive Huber regression. Comput Stat Data Anal 169(107):419
MathSciNet Google Scholar
Minsker S (2015) Geometric median and robust estimation in Banach spaces. Bernoulli 21(4):2308–2335
Article MathSciNet Google Scholar
Minsker S (2019) Distributed statistical estimation and rates of convergence in normal approximation. Electron J Stat 13(2):5213–5252
Article MathSciNet Google Scholar
Minsker S, Ndaoud M (2021) Robust and efficient mean estimation: an approach based on the properties of self-normalized sums. Electron J Stat 15(2):6036–6070
Article MathSciNet Google Scholar
Prasad A, Suggala AS, Balakrishnan S et al (2020) Robust estimation via robust gradient estimation. J R Stat Soc Ser B Stat Methodol 82(3):601–627
Article MathSciNet Google Scholar
Rosenblatt JD, Nadler B (2016) On the optimality of averaging in distributed statistical learning. Inf Inference 5(4):379–404
Article MathSciNet Google Scholar
Sun H, Craig BA, Zhang L (2017) Angle-based multicategory distance-weighted SVM. J Mach Learn Res 18(1):2981–3001
MathSciNet Google Scholar
Tu J, Liu W, Mao X et al (2021) Variance reduced median-of-means estimator for Byzantine-robust distributed inference. J Mach Learn Res 22(84):1–67
MathSciNet Google Scholar
Wang L, Lian H (2020) Communication-efficient estimation of high-dimensional quantile regression. Anal Appl 18(6):1057–1075
Article MathSciNet CAS Google Scholar
Yang Y, Guo Y, Chang X (2021) Angle-based cost-sensitive multicategory classification. Comput Stat Data Anal 156(107):107
MathSciNet Google Scholar
Yin D, Chen Y, Ramchandran K et al (2018) Byzantine-robust distributed learning: towards optimal statistical rates. In: Proceedings of the 35th international conference on machine learning vol 80, pp 5650–5659
Yin D, Chen Y, Ramchandran K et al (2019) Defending against saddle point attack in Byzantine-robust distributed learning. J Am Stat Assoc 97:7074–7084
Google Scholar
Zhang C, Liu Y (2014) Multicategory angle-based large-margin classification. Biometrika 101(3):625–640
Article MathSciNet PubMed Google Scholar
Zhang C, Liu Y, Wang J et al (2016) Reinforced angle-based multicategory support vector machines. J Comput Graph Stat 25(3):806–825
Article MathSciNet PubMed PubMed Central Google Scholar
Zhang Y, Duchi JC, Wainwright MJ (2013) Communication-efficient algorithms for statistical optimization. J Mach Learn Res 14(68):3321–3363
MathSciNet Google Scholar
Zhang C, Pham M, Fu S et al (2018) Robust multicategory support vector machines using difference convex algorithm. Math Program 169(1):277–305
Article MathSciNet PubMed Google Scholar
Zhang Y, Duchi JC, Wainwright MJ (2015) Divide and conquer kernel ridge regression: a distributed algorithm with minimax optimal rates. J Mach Learn Res 16(102):3299–3340
MathSciNet Google Scholar
Zhao T, Cheng G, Liu H (2016) A partially linear framework for massive heterogeneous data. Ann Stat 44(4):1400–1437
Article MathSciNet PubMed PubMed Central Google Scholar
Zhou WX, Bose K, Fan J et al (2018) A new perspective on robust m-estimation: finite sample theory and applications to dependence-adjusted multiple testing. Ann Stat 46(5):1904–1931
Article MathSciNet PubMed PubMed Central Google Scholar

Download references

Acknowledgements

Xiaozhou Wang and Riquan Zhang are the co-corresponding authors. We thank the editor in chief and the two referees for their comments and suggestions which we believe led to an improved manuscript.

Funding

Xiaozhou Wang’s research is supported by the National Natural Science Foundation of China (12101240), the Chenguang Program of Shanghai Education Development Foundation and Shanghai Municipal Education Commission (20CG29), and Shanghai Sailing Program (21YF1410500). Riquan Zhang’s research is supported by the National Natural Science Foundation of China (11971171, 11831008, 12171310), and the Basic Research Project of Shanghai Science and Technology Commission (22JC1400800).

Author information

Xiaozhou Wang, Yibo Yan, Riquan Zhang are contributed equally to this work.

Authors and Affiliations

School of Statistics, East China Normal University, Shanghai, 200062, China
Gaoming Sun, Xiaozhou Wang & Yibo Yan
Key Laboratory of Advanced Theory and Application in Statistics and Data Science-MOE, East China Normal University, Shanghai, 200062, China
Xiaozhou Wang
School of Statistics and Information, Shanghai University of International Business and Economics, Shanghai, 200062, China
Riquan Zhang

Authors

Gaoming Sun
View author publications
You can also search for this author in PubMed Google Scholar
Xiaozhou Wang
View author publications
You can also search for this author in PubMed Google Scholar
Yibo Yan
View author publications
You can also search for this author in PubMed Google Scholar
Riquan Zhang
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding authors

Correspondence to Xiaozhou Wang or Riquan Zhang.

Ethics declarations

Conflict of interest

No potential conflict of interest was reported by the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A proof of theoretical results

1.1 A.1 Proof of Theorem 1

Given a compact convex set ${\mathcal {B}}=\{\varvec{\beta }:\Vert \varvec{\beta }\Vert _2^2\le c_0\}$, by duality with convex $\ell $ and $\Vert \varvec{\beta }\Vert _2^2$, minimizing the optimization model (5) is equivalent to solving the optimization problem

$$\begin{aligned} \min _{\varvec{\beta }}\frac{1}{\vert {\mathcal {D}}_m\vert } \sum _{i=1}^{\vert {\mathcal {D}}_m\vert } \ell (({\varvec{x}}_i\otimes {\varvec{W}}_{y_i})^{\top }\varvec{\beta }) \end{aligned}$$

(A1)

subject to $\Vert \varvec{\beta }\Vert _2^2\le c_0$. Here, $c_0$ is a positive constant with respect to $\lambda _m>0$ for $m=1,\ldots ,M$. That is to say, it is sufficient to consider the optimization problem (A1) on the compact convex set ${\mathcal {B}}\in {\mathbb {R}}^{(k-1)p}$.

According to the condition of

$$\begin{aligned} \frac{\gamma _1}{2}\Vert \varvec{\beta }-\varvec{\beta }^\prime \Vert _2^2\le {\mathcal {R}}(\varvec{\beta })-{\mathcal {R}}(\varvec{\beta }^\prime ) -\langle \nabla {\mathcal {R}}(\varvec{\beta }^\prime ),\varvec{\beta } -\varvec{\beta }^\prime \rangle \end{aligned}$$

in Assumption 3, we have

$$\begin{aligned} \nabla ^2{\mathcal {R}}(\varvec{\beta })\succeq \gamma _1I, \end{aligned}$$

where I denotes the $(k-1)p\times (k-1)p$ identity matrix.

Based on the above argument under Assumption 1-3, according to Corollary 2 in Zhang et al. (2013), we can directly obtain that

$$\begin{aligned} E\big [\Vert \varvec{{\widehat{\beta }}}-\varvec{\beta }^{*}\Vert _2^2\big ]\le C_0\bigg (\frac{1}{nM\gamma _1^2}+\frac{1}{n^2}\Big (\frac{\log (k-1)p}{\gamma _1^4}+\frac{1}{\gamma _1^6}\Big ) +\frac{1}{n^2M}+\frac{1}{n^3}\bigg ), \nonumber \\ \end{aligned}$$

(A2)

where $C_0$ is a numerical constant.

By the Lipschitz continuity of the loss $\ell $ in Assumption 1, we get that

$$\begin{aligned} {\mathcal {R}}(\varvec{{\widehat{\beta }}})-{\mathcal {R}}(\varvec{\beta }^*) =E[L_0({\varvec{X}},Y)\Vert \varvec{{\widehat{\beta }}}-\varvec{\beta }^*\Vert _2]. \end{aligned}$$

(A3)

Further, applying the Cauchy-Schwarz inequality yields that

$$\begin{aligned} E[L_0({\varvec{X}},Y)\Vert \varvec{{\widehat{\beta }}}-\varvec{\beta }^*\Vert _2] \le E[L_0^2({\varvec{X}},Y)]^{\frac{1}{2}}E[\Vert \varvec{{\widehat{\beta }}}-\varvec{\beta }^*\Vert _2^2]^{\frac{1}{2}}. \end{aligned}$$

(A4)

Combine (A3) and (A4), we have

$$\begin{aligned} {\mathcal {E}}_{\ell }(\varvec{{\widehat{\beta }}}, \varvec{\beta }^{*}) \le L_0E[\Vert \varvec{{\widehat{\beta }}}-\varvec{\beta }^*\Vert _2^2]^{\frac{1}{2}}. \end{aligned}$$

(A5)

Substitute (A2) into (A5), we finally obtain that

$$\begin{aligned} {\mathcal {E}}_{\ell }(\varvec{{\widehat{\beta }}}, \varvec{\beta }^{*}) \le C\bigg (\frac{1}{\sqrt{nM}\gamma _1}+\frac{1}{n}\Big (\frac{\sqrt{\log (k-1)p}}{\gamma _1^2} +\frac{1}{\gamma _1^3}\Big )\bigg ), \end{aligned}$$

where C is a numerical constant.

1.2 A.2 Proof of Proposition 3 and Theorem 4

.

Proof of Proposition 3

According to the nature of the MOM-based gradient defined in (10), for given $\widetilde{\varvec{\beta }}$ with $\Vert \widetilde{\varvec{\beta }}-\varvec{\beta }^0\Vert \le C_0$, we first observe that the event

$$\begin{aligned} \Big \{\vert (\varvec{{\widehat{g}}}_m^{\textrm{MOM}}(\widetilde{\varvec{\beta }}) -\lambda _m\widetilde{\varvec{\beta }}-\nabla {\mathcal {R}}(\widetilde{\varvec{\beta }}))_r\vert >\varepsilon _{m,r}\Big \} \end{aligned}$$

which implies that at least L/2 of $({\varvec{g}}_m^{(l)}(\widetilde{\varvec{\beta }})-\lambda _m\widetilde{\varvec{\beta }})_{r}$ has to be outside $\varepsilon _{m,r}$ distance to $(\nabla {\mathcal {R}}(\widetilde{\varvec{\beta }}))_{r}$. Namely,

$$\begin{aligned}{} & {} \Bigg \{\vert (\varvec{{\widehat{g}}}_m^{\textrm{MOM}}(\widetilde{\varvec{\beta }}) -\lambda _m\widetilde{\varvec{\beta }}-\nabla {\mathcal {R}}(\widetilde{\varvec{\beta }})) _r\vert>\varepsilon _{m,r}\Big \}\nonumber \\{} & {} \quad \subset \left\{ \sum _{l=1}^L{\varvec{1}}\{\vert ({\varvec{g}}_m^{(l)}(\widetilde{\varvec{\beta }}) -\lambda _m\widetilde{\varvec{\beta }}-\nabla {\mathcal {R}}(\widetilde{\varvec{\beta }}))_r\vert >\varepsilon _{m,r}\}\ge \frac{L}{2}\right\} . \end{aligned}$$

(A6)

Define ${\mathcal {L}}_m=\{l\in \{1,\ldots ,L\}:{\mathcal {G}}_m^{(l)}\cap {\mathcal {O}}_m=\emptyset \}$, and let $T_{m,r}^{(l)}={\varvec{1}}\{\vert ({\varvec{g}}_m^{(l)}(\widetilde{\varvec{\beta }}) -\lambda _m\widetilde{\varvec{\beta }}-\nabla {\mathcal {R}}(\widetilde{\varvec{\beta }}))_{r}\vert >\varepsilon _{m,r}\}$ and $p_{m,r}^{(l)}=E(T_{m,r}^{(l)})$, then the above (A6) implies that

$$\begin{aligned}{} & {} P\Big (\vert (\varvec{{\widehat{g}}}_m^{\textrm{MOM}}(\widetilde{\varvec{\beta }}) -\lambda _m\widetilde{\varvec{\beta }}-\nabla {\mathcal {R}}(\widetilde{\varvec{\beta }}))_r\vert >\varepsilon _{m,r}\Big )\\{} & {} \quad \le P\left( \sum _{l=1}^LT_{m,r}^{(l)}\ge \frac{L}{2}\right) \\{} & {} \quad \le P\left( \sum _{k\in {\mathcal {L}}_m}T_{m,r}^{(l)}+\vert {\mathcal {O}}_m\vert \ge \frac{L}{2}\right) \\{} & {} \quad \le P\left( \sum _{k\in {\mathcal {L}}_m}(T_{m,r}^{(l)}-E(T_{m,r}^{(l)}))\ge \left( \frac{1}{2}-C_1-p_{m,r}^{(l)}\right) L\right) \\{} & {} \quad \le e^{-2L(\frac{1}{2}-C_1-p_{m,r}^{(l)})^2}. \end{aligned}$$

The last inequality is derived by the one-sided Hoeffding’s inequality.

Note that

$$\begin{aligned} p_{m,r}^{(l)}=E(T_{m,r}^{(l)}) =P\Big (\vert ({\varvec{g}}_m^{(l)}(\widetilde{\varvec{\beta }}) -\nabla {\mathcal {R}}(\widetilde{\varvec{\beta }}))_r\vert >\varepsilon _{m,r}\Big ) \le \frac{(Cov(\nabla \ell (\widetilde{\varvec{\beta }})))_{rr}L}{n\varepsilon _{m,r}^2} \end{aligned}$$

for some $l\in {\mathcal {L}}$. Choose $\varepsilon _{m,r}=\sqrt{\frac{(2+c)(Cov(\nabla \ell (\widetilde{\varvec{\beta }})))_{rr}L}{n(1-2C_1)}}$ for all $c>0$, we can obtain that

$$\begin{aligned}{} & {} P\Bigg (\vert (\varvec{{\widehat{g}}}_m^{\textrm{MOM}}(\widetilde{\varvec{\beta }}) -\lambda _m\widetilde{\varvec{\beta }}-\nabla {\mathcal {R}}(\widetilde{\varvec{\beta }}))_r\vert >\sqrt{\frac{(2+c)(Cov(\nabla \ell (\widetilde{\varvec{\beta }})))_{rr}L}{n(1-2C_1)}}\Bigg )\\{} & {} \quad \le e^{-2L(\frac{1}{2}-C_1-\frac{1-2C_1}{2+c})^2}. \end{aligned}$$

This means that

$$\begin{aligned} \vert (\varvec{{\widehat{g}}}_m^{\textrm{MOM}}(\widetilde{\varvec{\beta }}) -\lambda _m\widetilde{\varvec{\beta }}-\nabla {\mathcal {R}}(\widetilde{\varvec{\beta }}))_r\vert \le \sqrt{\frac{(2+c)(Cov(\nabla \ell (\widetilde{\varvec{\beta }})))_{rr}L}{n(1-2C_1)}} \end{aligned}$$

with probability at least $1-e^{-\frac{L(1-2C_1)^2c^2}{2(2+c)^2}}$.

Thus, for given $\widetilde{\varvec{\beta }}$ with $\Vert \widetilde{\varvec{\beta }}-\varvec{\beta }^0\Vert \le C_0$, we can get that

$$\begin{aligned}{} & {} \Vert \varvec{{\widehat{g}}}_m^{\textrm{MOM}}(\widetilde{\varvec{\beta }}) -\nabla {\mathcal {R}}(\widetilde{\varvec{\beta }})\Vert _2\\{} & {} \quad \le \Vert \varvec{{\widehat{g}}}_m^{\textrm{MOM}}(\widetilde{\varvec{\beta }}) -\lambda _m\widetilde{\varvec{\beta }}-\nabla {\mathcal {R}}(\widetilde{\varvec{\beta }})\Vert _ 2+\lambda _m\Vert \widetilde{\varvec{\beta }}\Vert _2\\{} & {} \quad \le \sqrt{(k-1)p}\max _{1\le r \le (k-1)p}\vert (\varvec{{\widehat{g}}}_m^{\textrm{MOM}}(\widetilde{\varvec{\beta }}) -\lambda _m\widetilde{\varvec{\beta }}-\nabla {\mathcal {R}}(\widetilde{\varvec{\beta }}))_r\vert +Cn^{-1/2}\\{} & {} \quad \le \max _{1\le r \le (k-1)p}\sqrt{\frac{(2+c)(k-1)p(Cov(\nabla \ell (\widetilde{\varvec{\beta }})))_{rr}L}{n(1-2C_1)}}+Cn^{-1/2} \end{aligned}$$

with probability at least $1-e^{-\frac{L(1-2C_1)^2c^2}{2(2+c)^2}}$, where C is a numerical constant. $\square $

.

Proof of Theorem 4

Let $\varvec{\delta }^{m,t}=\varvec{{\widehat{g}}}_m^{\textrm{MOM}}(\varvec{\beta }^{m,t}) -\nabla {\mathcal {R}}(\varvec{\beta }^{m,t})\in {\mathbb {R}}^{(k-1)p}$, by the fact of $\nabla {\mathcal {R}}(\varvec{\beta }^{*})=0$ and our update rule of the MOM-based gradient estimation, i.e., $\varvec{\beta }^{m,t+1}=\varvec{\beta }^{m,t} -\eta \varvec{{\widehat{g}}}_m^{\textrm{MOM}}(\varvec{\beta }^{m,t})$, we have

$$\begin{aligned}{} & {} \Vert \varvec{\beta }^{m,t+1}-\varvec{\beta }^{*}\Vert _2\nonumber \\{} & {} \quad =\Vert \varvec{\beta }^{m,t}-\eta \varvec{{\widehat{g}}}_m^ {\textrm{MOM}}(\varvec{\beta }^{m,t})-\varvec{\beta }^{*} +\eta \nabla {\mathcal {R}}(\varvec{\beta }^{*})\Vert _2\nonumber \\{} & {} \quad =\Vert \varvec{\beta }^{m,t}-\varvec{\beta }^{*} -\eta [\nabla {\mathcal {R}}(\varvec{\beta }^{m,t}) -(\nabla {\mathcal {R}}(\varvec{\beta }^{*})]-\eta \varvec{\delta }^{m,t}\Vert _2\nonumber \\{} & {} \quad \le \Vert \varvec{\beta }^{m,t}-\varvec{\beta }^{*}-\eta [\nabla {\mathcal {R}}(\varvec{\beta }^{m,t}) -(\nabla {\mathcal {R}}(\varvec{\beta }^{*})]\Vert _2+\eta \Vert \varvec{\delta }^{m,t}\Vert _2. \end{aligned}$$

(A7)

Let ${\varvec{G}}(\varvec{\beta }^{m,t},\varvec{\beta }^{*})=\nabla {\mathcal {R}}(\varvec{\beta }^{m,t}) -\nabla {\mathcal {R}}(\varvec{\beta }^{*})$ for brevity and then use Lemma 3.11 introduced in Bubeck (2015), we can directly get that

$$\begin{aligned}{} & {} \Vert \varvec{\beta }^{m,t}-\varvec{\beta }^{*} -\eta {\varvec{G}}(\varvec{\beta }^{m,t},\varvec{\beta }^{*})\Vert _2^2\\{} & {} \quad =\Vert \varvec{\beta }^{m,t}-\varvec{\beta }^{*}\Vert _2^2 +\eta ^2\Vert {\varvec{G}}(\varvec{\beta }^{m,t},\varvec{\beta }^{*}) \Vert _2^2-2\eta ({\varvec{G}}(\varvec{\beta }^{m,t},\varvec{\beta }^{*}))^\top (\varvec{\beta } ^{m,t}-\varvec{\beta }^{*})\\{} & {} \quad \le \Vert \varvec{\beta }^{m(t)}-\varvec{\beta }^{*}\Vert _2^2 +\eta ^2\Vert {\varvec{G}}(\varvec{\beta }^{m,t},\varvec{\beta }^{*})\Vert _2^2 -2\eta \Big [\frac{\gamma _1\gamma _2}{\gamma _1+\gamma _2}\Vert \varvec{\beta }^{m,t}-\varvec{\beta }^{*}\Vert _2^2\\{} & {} \quad +\frac{1}{\gamma _1+\gamma _2}\Vert {\varvec{G}}(\varvec{\beta }^{m,t},\varvec{\beta }^{*})\Vert _2^2\Big ]\\{} & {} \quad =\left( 1-\frac{2\eta \gamma _1\gamma _2}{\gamma _1+\gamma _2}\right) \Vert \varvec{\beta }^{m,t}-\varvec{\beta }^{*}\Vert _2^2 +\left( \eta ^2-\frac{2\eta }{\gamma _1+\gamma _2}\right) \Vert {\varvec{G}}(\varvec{\beta }^{m,t},\varvec{\beta }^{*})\Vert _2^2. \end{aligned}$$

Choose $\eta =\frac{2}{\sqrt{M}(\gamma _1+\gamma _2)}$ with respect to the number of local machines M as the step size, there is

$$\begin{aligned} \Vert \varvec{\beta }^{m,t}-\varvec{\beta }^{*} -\eta {\varvec{G}}(\varvec{\beta }^{m,t},\varvec{\beta }^{*})\Vert _2^2 \le \left( 1-\frac{2\eta \gamma _1\gamma _2}{\gamma _1+\gamma _2}\right) \Vert \varvec{\beta }^{m,t}-\varvec{\beta }^{*}\Vert _2^2. \end{aligned}$$

(A8)

Thus, substitute (A8) into (A7), we have

$$\begin{aligned} \Vert \varvec{\beta }^{m,t+1}-\varvec{\beta }^{*}\Vert _2 \le \sqrt{1-\frac{2\eta \gamma _1\gamma _2}{\gamma _1+\gamma _2}}\Vert \varvec{\beta }^{m,t} -\varvec{\beta }^{*}\Vert _2+\eta \Vert \varvec{\delta }^{m,t}\Vert _2. \end{aligned}$$

Using Proposition 3 and the condition $Cov(\nabla \ell )$ is finite, for given $\varvec{\beta }^{m,t}$ with $\Vert \varvec{\beta }^{m,t}-\varvec{\beta }^0\Vert \le C_0$, we further have

$$\begin{aligned} \Vert \varvec{\beta }^{m,t+1}-\varvec{\beta }^{*}\Vert _2\le & {} \sqrt{1-\frac{2\eta \gamma _1\gamma _2}{\gamma _1+\gamma _2}}\Vert \varvec{\beta }^{m,t} -\varvec{\beta }^{*}\Vert _2\nonumber \\{} & {} +C\eta \sqrt{\frac{(2+c)(k-1)pL}{n(1-2C_1)}}+C\eta n^{-1/2}\nonumber \\= & {} \sqrt{1-\frac{2\eta \gamma _1\gamma _2}{\gamma _1+\gamma _2}}\Vert \varvec{\beta }^{m,t} -\varvec{\beta }^{*}\Vert _2+\frac{2\varepsilon _{_\textrm{MOM}}}{\gamma _1+\gamma _2} \end{aligned}$$

with probability at least $1-e^{-\frac{L(1-2C_1)^2c^2}{2(2+c)^2}}$ for all $c>0$, where

$$\begin{aligned} \varepsilon _{_\textrm{MOM}}=C\sqrt{\frac{(2+c)(k-1)pL}{N(1-2C_1)}}+CN^{-1/2}. \end{aligned}$$

and C is a numerical constant.

Write $\kappa =\sqrt{1-\frac{2\eta \gamma _1\gamma _2}{\gamma _1+\gamma _2}}$, by the fact of $\kappa <1$ for $0<\gamma _1<\gamma _2$, we can obtain that

$$\begin{aligned} \Vert \varvec{\beta }^{m,t}-\varvec{\beta }^{*}\Vert _2\le & {} \kappa \Vert \varvec{\beta }^{m,t-1}-\varvec{\beta }^{*}\Vert _2+\frac{2\varepsilon _{_\textrm{MOM}}}{\gamma _1+\gamma _2}\\\le & {} \kappa ^2\Vert \varvec{\beta }^{m,t-2}-\varvec{\beta }^{*}\Vert _2 +\frac{2(1+\kappa )\varepsilon _{_\textrm{MOM}}}{\gamma _1+\gamma _2}\\{} & {} \vdots \\\le & {} \kappa ^t\Vert \varvec{\beta }^{m,0}-\varvec{\beta }^{*}\Vert _2 +\frac{2(1+\kappa +\ldots +\kappa ^t)\varepsilon _{_\textrm{MOM}}}{\gamma _1+\gamma _2}\\\le & {} \kappa ^t\Vert \varvec{\beta }^{0}-\varvec{\beta }^{*}\Vert _2 +\frac{2\varepsilon _{_\textrm{MOM}}}{(1-\kappa )(\gamma _1+\gamma _2)} \end{aligned}$$

with probability at least $1-Te^{-\frac{L(1-2C_1)^2c^2}{2(2+c)^2}}$ for all $c>0$, and C is a numerical constant.

Finally, by the norm inequality of $\Vert \varvec{\beta }^{T}-\varvec{\beta }^{*}\Vert _2 \le \frac{1}{M}\sum _{m=1}^M\Vert \varvec{\beta }^{m,T}-\varvec{\beta }^{*}\Vert _2$, we have

$$\begin{aligned} \Vert \varvec{\beta }^{T}-\varvec{\beta }^{*}\Vert _2 \le \kappa ^T\Vert \varvec{\beta }^{0}-\varvec{\beta }^{*}\Vert _2 +\frac{2\varepsilon _{_\textrm{MOM}}}{(1-\kappa )(\gamma _1+\gamma _2)} \end{aligned}$$

with probability at least $1-Te^{-\frac{L(1-2C_1)^2c^2}{2(2+c)^2}}$. $\square $

1.3 A.3 Proof of Proposition 5 and Theorem 6

Before the proof of Proposition 5, we introduce Theorem 3.2 proposed by Minsker and Ndaoud (2021) and write it as the following form to provide the error bounds of the weighted-based gradient estimation.

Lemma 7

Suppose $\max _{1\le m\le M}\vert {\mathcal {O}}_m\vert \le C_2L$ for $0\le C_2<1$ and event $\Omega _m^\nu $ holds. Then, we have

$$\begin{aligned}{} & {} \vert (\varvec{{\widehat{g}}}_m^{\textrm{WT}}(\varvec{\beta }) -\lambda _m\varvec{\beta }-\nabla {\mathcal {R}}(\varvec{\beta }))_r\vert \\{} & {} \quad \le \frac{C_\nu \sqrt{Cov(\nabla \ell (\varvec{\beta }))_{rr}}}{(1-C_2)^\nu } \left( \sqrt{\frac{1+c_1}{n}}+\sqrt{\frac{L}{n}} +\quad \frac{\vert {\mathcal {O}}_m\vert }{\sqrt{nL}\left( \sqrt{\alpha ({\mathcal {O}})}\right) ^{\nu -1}}\right) \end{aligned}$$

with probability at least $1-2e^{-c_1}-Le^{-c_2n/L}$ for $m=1,2,\ldots ,M$ and $r=1,2,\ldots ,(k-1)p$. Here, $C_\nu >0$, $c_1>0$, $c_2>0$, and

$$\begin{aligned} \Omega _m^{\nu }=\bigcup _{r=1}^{(k-1)p}\left\{ \sum _{l=1}^L1/(\varvec{\sigma }_m^{(l)}(\varvec{\beta }))_r^\nu \le (4\sqrt{(Cov(\nabla \ell (\varvec{\beta })))_{rr}}/(1-C_2))^\nu \right\} . \end{aligned}$$

Thus, for given $\widetilde{\varvec{\beta }}$ with $\Vert \widetilde{\varvec{\beta }}-\varvec{\beta }^0\Vert \le C_0$, by Lemma 7, we can get that directly

$$\begin{aligned}{} & {} \Vert \varvec{{\widehat{g}}}_m^{\textrm{WT}}(\widetilde{\varvec{\beta }}) -\nabla {\mathcal {R}}(\widetilde{\varvec{\beta }})\Vert _2\\{} & {} \quad \le \Vert \varvec{{\widehat{g}}}_m^{\textrm{WT}}(\widetilde{\varvec{\beta }})-\lambda _m\widetilde{\varvec{\beta }} -\nabla {\mathcal {R}}(\widetilde{\varvec{\beta }})\Vert _2+\lambda _m\Vert \widetilde{\varvec{\beta }}\Vert _2\\{} & {} \quad \le \sqrt{(k-1)p}\max _{1\le r\le (k-1)p}\vert (\varvec{{\widehat{g}}}_m^{\textrm{WT}}(\widetilde{\varvec{\beta }})-\lambda _m\widetilde{\varvec{\beta }} -\nabla {\mathcal {R}}(\widetilde{\varvec{\beta }}))_r\vert +Cn^{-1/2}\\{} & {} \quad \le \max _{1\le r\le (k-1)p}\frac{C_\nu \sqrt{(k-1)pCov(\nabla \ell (\widetilde{\varvec{\beta }}))_{rr}}}{(1-C_2)^\nu } \Big (\sqrt{\frac{1+c_1}{n}}+\sqrt{\frac{L}{n}} +\frac{\vert {\mathcal {O}}_m\vert }{\sqrt{nL}(\sqrt{\alpha ({\mathcal {O}})})^{\nu -1}}\Big )\\{} & {} \quad +Cn^{-1/2} \end{aligned}$$

with probability at least $1-(2e^{-c_1}+Le^{-c_2n/L})(k-1)p$.

The proof of Theorem 6 is similar to that given Theorem 4 above and so is omitted. $\square $

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Cite this article

Sun, G., Wang, X., Yan, Y. et al. Robust distributed multicategory angle-based classification for massive data. Metrika 87, 299–323 (2024). https://doi.org/10.1007/s00184-023-00915-3

Download citation

Received: 24 June 2022
Accepted: 09 June 2023
Published: 28 June 2023
Issue Date: April 2024
DOI: https://doi.org/10.1007/s00184-023-00915-3

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Robust distributed multicategory angle-based classification for massive data

Abstract

Access this article

Similar content being viewed by others

Robust multicategory support vector machines using difference convex algorithm

Robust and sparse multinomial regression in high dimensions

Dataset Weighting via Intrinsic Data Characteristics for Pairwise Statistical Comparisons in Classification

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Corresponding authors

Ethics declarations

Conflict of interest

Additional information

Publisher's Note

Appendix A proof of theoretical results

1.1 A.1 Proof of Theorem 1

1.2 A.2 Proof of Proposition 3 and Theorem 4

Proof of Proposition 3

Proof of Theorem 4

1.3 A.3 Proof of Proposition 5 and Theorem 6

Lemma 7

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Robust distributed multicategory angle-based classification for massive data

Abstract

Access this article

Similar content being viewed by others

Robust multicategory support vector machines using difference convex algorithm

Robust and sparse multinomial regression in high dimensions

Dataset Weighting via Intrinsic Data Characteristics for Pairwise Statistical Comparisons in Classification

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Corresponding authors

Ethics declarations

Conflict of interest

Additional information

Publisher's Note

Appendix A proof of theoretical results

Appendix A proof of theoretical results

1.1 A.1 Proof of Theorem 1

1.2 A.2 Proof of Proposition 3 and Theorem 4

Proof of Proposition 3

Proof of Theorem 4

1.3 A.3 Proof of Proposition 5 and Theorem 6

Lemma 7

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation