Robust distributed multicategory angle-based classification for massive data


Abstract

Multicategory classification problems are frequently encountered in practice. Since massive data sets are increasingly common and often stored locally, we first propose a distributed estimator in the multicategory angle-based classification framework and obtain its excess risk bound under general conditions. Further, under different robustness settings, we develop two robust distributed algorithms for multicategory classification. The first algorithm takes advantage of the median-of-means (MOM) principle and is built on a MOM-based gradient estimation. The second algorithm is implemented by constructing a weighted-based gradient estimation. Theoretical guarantees for both algorithms are established via non-asymptotic error bounds on the iterative estimates. Numerical simulations demonstrate that our methods effectively reduce the impact of outliers.

References

  • Allwein EL, Schapire RE, Singer Y (2000) Reducing multiclass to binary: a unifying approach for margin classifiers. J Mach Learn Res 1:113–141

  • Alon N, Matias Y, Szegedy M (1999) The space complexity of approximating the frequency moments. J Comput Syst Sci 58(1):137–147

  • Bubeck S (2015) Convex optimization: algorithms and complexity. Found Trends® Mach Learn 8:231–357

  • Chang CC, Lin CJ (2011) LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol 2(3):1–27

  • Chen Y, Su L, Xu J (2017) Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. Proc ACM Meas Anal Comput Syst 1(2):1–25

  • Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3):273–297

  • Dobriban E, Sheng Y (2021) Distributed linear regression by averaging. Ann Stat 49(2):918–943

  • Hastie T, Tibshirani R, Friedman JH (2009) The elements of statistical learning: data mining, inference, and prediction. Springer, New York

  • Hill SI, Doucet A (2007) A framework for kernel-based multi-category classification. J Artif Intell Res 30:525–564

  • Holland MJ, Ikeda K (2019) Efficient learning with robust gradient descent. Mach Learn 108(8–9):1523–1560

  • Huber PJ, Ronchetti EM (2009) Robust statistics. Wiley, Hoboken

  • Jordan MI, Lee JD, Yang Y (2019) Communication-efficient distributed statistical inference. J Am Stat Assoc 114(526):668–681

  • Lange K, Wu T (2008) An MM algorithm for multicategory vertex discriminant analysis. J Comput Graph Stat 17(3):527–544

  • Lee Y, Lin Y, Wahba G (2004) Multicategory support vector machines, theory, and application to the classification of microarray data and satellite radiance data. J Am Stat Assoc 99(465):67–81

  • Li T, Sahu AK, Talwalkar A et al (2020) Federated learning: challenges, methods, and future directions. IEEE Signal Process Mag 37(3):50–60

  • Li K, Bao H, Zhang L (2021) Robust covariance estimation for distributed principal component analysis. Metrika. https://doi.org/10.1007/s00184-021-00848-9

  • Lian H, Fan Z (2018) Divide-and-conquer for debiased l1-norm support vector machine in ultra-high dimensions. J Mach Learn Res 18(182):1–26

  • Lin SB, Guo X, Zhou DX (2017) Distributed learning with regularized least squares. J Mach Learn Res 18(92):1–31

  • Liu Y, Shen X (2006) Multicategory \(\psi \)-learning. J Am Stat Assoc 101(474):500–509

  • Liu Y, Yuan M (2011) Reinforced multicategory support vector machines. J Comput Graph Stat 20(4):901–919

  • Luo J, Sun Q, Zhou W (2022) Distributed adaptive Huber regression. Comput Stat Data Anal 169:107419

  • Minsker S (2015) Geometric median and robust estimation in Banach spaces. Bernoulli 21(4):2308–2335

  • Minsker S (2019) Distributed statistical estimation and rates of convergence in normal approximation. Electron J Stat 13(2):5213–5252

  • Minsker S, Ndaoud M (2021) Robust and efficient mean estimation: an approach based on the properties of self-normalized sums. Electron J Stat 15(2):6036–6070

  • Prasad A, Suggala AS, Balakrishnan S et al (2020) Robust estimation via robust gradient estimation. J R Stat Soc Ser B Stat Methodol 82(3):601–627

  • Rosenblatt JD, Nadler B (2016) On the optimality of averaging in distributed statistical learning. Inf Inference 5(4):379–404

  • Sun H, Craig BA, Zhang L (2017) Angle-based multicategory distance-weighted SVM. J Mach Learn Res 18(1):2981–3001

  • Tu J, Liu W, Mao X et al (2021) Variance reduced median-of-means estimator for Byzantine-robust distributed inference. J Mach Learn Res 22(84):1–67

  • Wang L, Lian H (2020) Communication-efficient estimation of high-dimensional quantile regression. Anal Appl 18(6):1057–1075

  • Yang Y, Guo Y, Chang X (2021) Angle-based cost-sensitive multicategory classification. Comput Stat Data Anal 156:107107

  • Yin D, Chen Y, Ramchandran K et al (2018) Byzantine-robust distributed learning: towards optimal statistical rates. In: Proceedings of the 35th international conference on machine learning vol 80, pp 5650–5659

  • Yin D, Chen Y, Ramchandran K et al (2019) Defending against saddle point attack in Byzantine-robust distributed learning. In: Proceedings of the 36th international conference on machine learning, vol 97, pp 7074–7084

  • Zhang C, Liu Y (2014) Multicategory angle-based large-margin classification. Biometrika 101(3):625–640

  • Zhang C, Liu Y, Wang J et al (2016) Reinforced angle-based multicategory support vector machines. J Comput Graph Stat 25(3):806–825

  • Zhang Y, Duchi JC, Wainwright MJ (2013) Communication-efficient algorithms for statistical optimization. J Mach Learn Res 14(68):3321–3363

  • Zhang C, Pham M, Fu S et al (2018) Robust multicategory support vector machines using difference convex algorithm. Math Program 169(1):277–305

  • Zhang Y, Duchi JC, Wainwright MJ (2015) Divide and conquer kernel ridge regression: a distributed algorithm with minimax optimal rates. J Mach Learn Res 16(102):3299–3340

  • Zhao T, Cheng G, Liu H (2016) A partially linear framework for massive heterogeneous data. Ann Stat 44(4):1400–1437

  • Zhou WX, Bose K, Fan J et al (2018) A new perspective on robust m-estimation: finite sample theory and applications to dependence-adjusted multiple testing. Ann Stat 46(5):1904–1931

Acknowledgements

Xiaozhou Wang and Riquan Zhang are the co-corresponding authors. We thank the editor-in-chief and the two referees for their comments and suggestions, which we believe led to an improved manuscript.

Funding

Xiaozhou Wang’s research is supported by the National Natural Science Foundation of China (12101240), the Chenguang Program of Shanghai Education Development Foundation and Shanghai Municipal Education Commission (20CG29), and Shanghai Sailing Program (21YF1410500). Riquan Zhang’s research is supported by the National Natural Science Foundation of China (11971171, 11831008, 12171310), and the Basic Research Project of Shanghai Science and Technology Commission (22JC1400800).

Author information

Corresponding authors

Correspondence to Xiaozhou Wang or Riquan Zhang.

Ethics declarations

Conflict of interest

No potential conflict of interest was reported by the authors.


Appendix A: Proof of theoretical results

A.1 Proof of Theorem 1

Given a compact convex set \({\mathcal {B}}=\{\varvec{\beta }:\Vert \varvec{\beta }\Vert _2^2\le c_0\}\), by Lagrangian duality with the convex loss \(\ell \) and the convex penalty \(\Vert \varvec{\beta }\Vert _2^2\), minimizing the penalized objective (5) is equivalent to solving the constrained optimization problem

$$\begin{aligned} \min _{\varvec{\beta }}\frac{1}{\vert {\mathcal {D}}_m\vert } \sum _{i=1}^{\vert {\mathcal {D}}_m\vert } \ell (({\varvec{x}}_i\otimes {\varvec{W}}_{y_i})^{\top }\varvec{\beta }) \end{aligned}$$
(A1)

subject to \(\Vert \varvec{\beta }\Vert _2^2\le c_0\). Here, \(c_0\) is a positive constant determined by \(\lambda _m>0\) for \(m=1,\ldots ,M\). That is, it suffices to consider the optimization problem (A1) on the compact convex set \({\mathcal {B}}\subset {\mathbb {R}}^{(k-1)p}\).

According to the condition of

$$\begin{aligned} \frac{\gamma _1}{2}\Vert \varvec{\beta }-\varvec{\beta }^\prime \Vert _2^2\le {\mathcal {R}}(\varvec{\beta })-{\mathcal {R}}(\varvec{\beta }^\prime ) -\langle \nabla {\mathcal {R}}(\varvec{\beta }^\prime ),\varvec{\beta } -\varvec{\beta }^\prime \rangle \end{aligned}$$

in Assumption 3, we have

$$\begin{aligned} \nabla ^2{\mathcal {R}}(\varvec{\beta })\succeq \gamma _1I, \end{aligned}$$

where I denotes the \((k-1)p\times (k-1)p\) identity matrix.

Based on the above argument, under Assumptions 1–3 and by Corollary 2 of Zhang et al. (2013), we directly obtain that

$$\begin{aligned} E\big [\Vert \varvec{{\widehat{\beta }}}-\varvec{\beta }^{*}\Vert _2^2\big ]\le C_0\bigg (\frac{1}{nM\gamma _1^2}+\frac{1}{n^2}\Big (\frac{\log ((k-1)p)}{\gamma _1^4}+\frac{1}{\gamma _1^6}\Big ) +\frac{1}{n^2M}+\frac{1}{n^3}\bigg ), \nonumber \\ \end{aligned}$$
(A2)

where \(C_0\) is a numerical constant.

By the Lipschitz continuity of the loss \(\ell \) in Assumption 1, we get that

$$\begin{aligned} {\mathcal {R}}(\varvec{{\widehat{\beta }}})-{\mathcal {R}}(\varvec{\beta }^*) \le E[L_0({\varvec{X}},Y)\Vert \varvec{{\widehat{\beta }}}-\varvec{\beta }^*\Vert _2]. \end{aligned}$$
(A3)

Further, applying the Cauchy-Schwarz inequality yields that

$$\begin{aligned} E[L_0({\varvec{X}},Y)\Vert \varvec{{\widehat{\beta }}}-\varvec{\beta }^*\Vert _2] \le E[L_0^2({\varvec{X}},Y)]^{\frac{1}{2}}E[\Vert \varvec{{\widehat{\beta }}}-\varvec{\beta }^*\Vert _2^2]^{\frac{1}{2}}. \end{aligned}$$
(A4)

Combining (A3) and (A4), and bounding \(E[L_0^2({\varvec{X}},Y)]^{\frac{1}{2}}\) by a constant \(L_0\), we have

$$\begin{aligned} {\mathcal {E}}_{\ell }(\varvec{{\widehat{\beta }}}, \varvec{\beta }^{*}) \le L_0E[\Vert \varvec{{\widehat{\beta }}}-\varvec{\beta }^*\Vert _2^2]^{\frac{1}{2}}. \end{aligned}$$
(A5)

Substituting (A2) into (A5), we finally obtain

$$\begin{aligned} {\mathcal {E}}_{\ell }(\varvec{{\widehat{\beta }}}, \varvec{\beta }^{*}) \le C\bigg (\frac{1}{\sqrt{nM}\gamma _1}+\frac{1}{n}\Big (\frac{\sqrt{\log ((k-1)p)}}{\gamma _1^2} +\frac{1}{\gamma _1^3}\Big )\bigg ), \end{aligned}$$

where C is a numerical constant.
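To make the distributed estimator analysed in Theorem 1 concrete, the following is a minimal sketch (not the authors' implementation) of the one-shot averaging scheme: each machine minimizes its local regularized angle-based objective (the penalized form (5)) and the center averages the M local minimizers. The specific simplex vertex construction, the logistic-type surrogate loss, and the use of scipy.optimize are illustrative assumptions, not the paper's choices.

```python
import numpy as np
from scipy.optimize import minimize

def simplex_vertices(k):
    """k simplex vertices W_1,...,W_k in R^(k-1), as commonly used in
    angle-based multicategory classification."""
    W = np.zeros((k, k - 1))
    W[0] = np.ones(k - 1) / np.sqrt(k - 1)
    for j in range(1, k):
        W[j] = -(1 + np.sqrt(k)) / (k - 1) ** 1.5 * np.ones(k - 1)
        W[j, j - 1] += np.sqrt(k / (k - 1))
    return W

def local_fit(Xm, ym, W, lam):
    """Minimize the local objective: mean loss((x_i ⊗ W_{y_i})^T beta) + lam * ||beta||_2^2.
    loss(u) = log(1 + exp(-u)) is an illustrative convex surrogate."""
    Z = np.vstack([np.kron(Xm[i], W[ym[i]]) for i in range(len(ym))])
    def objective(beta):
        return np.mean(np.logaddexp(0.0, -Z @ beta)) + lam * beta @ beta
    return minimize(objective, np.zeros(Z.shape[1]), method="L-BFGS-B").x

def averaged_estimator(machines, W, lam):
    """One-shot divide-and-conquer: average the M local minimizers."""
    return np.mean([local_fit(Xm, ym, W, lam) for Xm, ym in machines], axis=0)

# toy usage: M = 4 machines, k = 3 classes, p = 5 features
rng = np.random.default_rng(0)
k, p = 3, 5
W = simplex_vertices(k)
machines = [(rng.normal(size=(200, p)), rng.integers(0, k, size=200)) for _ in range(4)]
beta_hat = averaged_estimator(machines, W, lam=1e-3)
```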

A.2 Proof of Proposition 3 and Theorem 4

Proof of Proposition 3

By the definition of the MOM-based gradient in (10), for a given \(\widetilde{\varvec{\beta }}\) with \(\Vert \widetilde{\varvec{\beta }}-\varvec{\beta }^0\Vert \le C_0\), we first observe that the event

$$\begin{aligned} \Big \{\vert (\varvec{{\widehat{g}}}_m^{\textrm{MOM}}(\widetilde{\varvec{\beta }}) -\lambda _m\widetilde{\varvec{\beta }}-\nabla {\mathcal {R}}(\widetilde{\varvec{\beta }}))_r\vert >\varepsilon _{m,r}\Big \} \end{aligned}$$

implies that at least \(L/2\) of the \(({\varvec{g}}_m^{(l)}(\widetilde{\varvec{\beta }})-\lambda _m\widetilde{\varvec{\beta }})_{r}\), \(l=1,\ldots ,L\), must lie at distance greater than \(\varepsilon _{m,r}\) from \((\nabla {\mathcal {R}}(\widetilde{\varvec{\beta }}))_{r}\). Namely,

$$\begin{aligned}{} & {} \Bigg \{\vert (\varvec{{\widehat{g}}}_m^{\textrm{MOM}}(\widetilde{\varvec{\beta }}) -\lambda _m\widetilde{\varvec{\beta }}-\nabla {\mathcal {R}}(\widetilde{\varvec{\beta }})) _r\vert>\varepsilon _{m,r}\Big \}\nonumber \\{} & {} \quad \subset \left\{ \sum _{l=1}^L{\varvec{1}}\{\vert ({\varvec{g}}_m^{(l)}(\widetilde{\varvec{\beta }}) -\lambda _m\widetilde{\varvec{\beta }}-\nabla {\mathcal {R}}(\widetilde{\varvec{\beta }}))_r\vert >\varepsilon _{m,r}\}\ge \frac{L}{2}\right\} . \end{aligned}$$
(A6)

Define \({\mathcal {L}}_m=\{l\in \{1,\ldots ,L\}:{\mathcal {G}}_m^{(l)}\cap {\mathcal {O}}_m=\emptyset \}\), the set of blocks free of outliers, and let \(T_{m,r}^{(l)}={\varvec{1}}\{\vert ({\varvec{g}}_m^{(l)}(\widetilde{\varvec{\beta }}) -\lambda _m\widetilde{\varvec{\beta }}-\nabla {\mathcal {R}}(\widetilde{\varvec{\beta }}))_{r}\vert >\varepsilon _{m,r}\}\) with \(p_{m,r}^{(l)}=E(T_{m,r}^{(l)})\). Then (A6) implies that

$$\begin{aligned}{} & {} P\Big (\vert (\varvec{{\widehat{g}}}_m^{\textrm{MOM}}(\widetilde{\varvec{\beta }}) -\lambda _m\widetilde{\varvec{\beta }}-\nabla {\mathcal {R}}(\widetilde{\varvec{\beta }}))_r\vert >\varepsilon _{m,r}\Big )\\{} & {} \quad \le P\left( \sum _{l=1}^LT_{m,r}^{(l)}\ge \frac{L}{2}\right) \\{} & {} \quad \le P\left( \sum _{l\in {\mathcal {L}}_m}T_{m,r}^{(l)}+\vert {\mathcal {O}}_m\vert \ge \frac{L}{2}\right) \\{} & {} \quad \le P\left( \sum _{l\in {\mathcal {L}}_m}(T_{m,r}^{(l)}-E(T_{m,r}^{(l)}))\ge \left( \frac{1}{2}-C_1-p_{m,r}^{(l)}\right) L\right) \\{} & {} \quad \le e^{-2L(\frac{1}{2}-C_1-p_{m,r}^{(l)})^2}. \end{aligned}$$

The last inequality follows from the one-sided Hoeffding inequality.
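For completeness, the version of the one-sided Hoeffding inequality used in this last step, for independent \(\{0,1\}\)-valued variables, reads

$$\begin{aligned} P\left( \sum _{l\in {\mathcal {L}}_m}\big (T_{m,r}^{(l)}-E(T_{m,r}^{(l)})\big )\ge Ls\right) \le \exp \left( -\frac{2(Ls)^2}{\vert {\mathcal {L}}_m\vert }\right) \le e^{-2Ls^2},\quad s>0, \end{aligned}$$

since \(\vert {\mathcal {L}}_m\vert \le L\); it is applied here with \(s=\frac{1}{2}-C_1-p_{m,r}^{(l)}\).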

Note that

$$\begin{aligned} p_{m,r}^{(l)}=E(T_{m,r}^{(l)}) =P\Big (\vert ({\varvec{g}}_m^{(l)}(\widetilde{\varvec{\beta }}) -\nabla {\mathcal {R}}(\widetilde{\varvec{\beta }}))_r\vert >\varepsilon _{m,r}\Big ) \le \frac{(Cov(\nabla \ell (\widetilde{\varvec{\beta }})))_{rr}L}{n\varepsilon _{m,r}^2} \end{aligned}$$

for each \(l\in {\mathcal {L}}_m\), by Chebyshev's inequality applied to the block-average gradient. Choosing \(\varepsilon _{m,r}=\sqrt{\frac{(2+c)(Cov(\nabla \ell (\widetilde{\varvec{\beta }})))_{rr}L}{n(1-2C_1)}}\) for any \(c>0\), we obtain that

$$\begin{aligned}{} & {} P\Bigg (\vert (\varvec{{\widehat{g}}}_m^{\textrm{MOM}}(\widetilde{\varvec{\beta }}) -\lambda _m\widetilde{\varvec{\beta }}-\nabla {\mathcal {R}}(\widetilde{\varvec{\beta }}))_r\vert >\sqrt{\frac{(2+c)(Cov(\nabla \ell (\widetilde{\varvec{\beta }})))_{rr}L}{n(1-2C_1)}}\Bigg )\\{} & {} \quad \le e^{-2L(\frac{1}{2}-C_1-\frac{1-2C_1}{2+c})^2}. \end{aligned}$$

This means that

$$\begin{aligned} \vert (\varvec{{\widehat{g}}}_m^{\textrm{MOM}}(\widetilde{\varvec{\beta }}) -\lambda _m\widetilde{\varvec{\beta }}-\nabla {\mathcal {R}}(\widetilde{\varvec{\beta }}))_r\vert \le \sqrt{\frac{(2+c)(Cov(\nabla \ell (\widetilde{\varvec{\beta }})))_{rr}L}{n(1-2C_1)}} \end{aligned}$$

with probability at least \(1-e^{-\frac{L(1-2C_1)^2c^2}{2(2+c)^2}}\).

Thus, for a given \(\widetilde{\varvec{\beta }}\) with \(\Vert \widetilde{\varvec{\beta }}-\varvec{\beta }^0\Vert \le C_0\), we get that

$$\begin{aligned}{} & {} \Vert \varvec{{\widehat{g}}}_m^{\textrm{MOM}}(\widetilde{\varvec{\beta }}) -\nabla {\mathcal {R}}(\widetilde{\varvec{\beta }})\Vert _2\\{} & {} \quad \le \Vert \varvec{{\widehat{g}}}_m^{\textrm{MOM}}(\widetilde{\varvec{\beta }}) -\lambda _m\widetilde{\varvec{\beta }}-\nabla {\mathcal {R}}(\widetilde{\varvec{\beta }})\Vert _ 2+\lambda _m\Vert \widetilde{\varvec{\beta }}\Vert _2\\{} & {} \quad \le \sqrt{(k-1)p}\max _{1\le r \le (k-1)p}\vert (\varvec{{\widehat{g}}}_m^{\textrm{MOM}}(\widetilde{\varvec{\beta }}) -\lambda _m\widetilde{\varvec{\beta }}-\nabla {\mathcal {R}}(\widetilde{\varvec{\beta }}))_r\vert +Cn^{-1/2}\\{} & {} \quad \le \max _{1\le r \le (k-1)p}\sqrt{\frac{(2+c)(k-1)p(Cov(\nabla \ell (\widetilde{\varvec{\beta }})))_{rr}L}{n(1-2C_1)}}+Cn^{-1/2} \end{aligned}$$

with probability at least \(1-e^{-\frac{L(1-2C_1)^2c^2}{2(2+c)^2}}\), where C is a numerical constant. \(\square \)
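As a complement to Proposition 3, here is a minimal sketch of a coordinatewise median-of-means gradient estimate computed on one machine. The definition of \(\varvec{{\widehat{g}}}_m^{\textrm{MOM}}\) in (10) of the main text is not reproduced in this appendix, so the block-splitting scheme and the logistic-type surrogate loss below are illustrative assumptions.

```python
import numpy as np

def mom_gradient(Z, beta, lam, L, rng=None):
    """Coordinatewise median-of-means gradient of
        mean_i loss(z_i^T beta) + lam * ||beta||_2^2
    on one machine, where the rows of Z are the features x_i ⊗ W_{y_i}.
    loss(u) = log(1 + exp(-u)) is an illustrative surrogate."""
    n = Z.shape[0]
    idx = np.arange(n) if rng is None else rng.permutation(n)
    block_grads = []
    for block in np.array_split(idx, L):          # split the local sample into L blocks
        Zb = Z[block]
        u = Zb @ beta
        # gradient of mean_i log(1 + exp(-u_i)) w.r.t. beta: mean_i (-1/(1+exp(u_i))) z_i
        g_loss = -(Zb * (1.0 / (1.0 + np.exp(u)))[:, None]).mean(axis=0)
        block_grads.append(g_loss + 2.0 * lam * beta)
    return np.median(np.vstack(block_grads), axis=0)   # coordinatewise median over blocks
```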

Proof of Theorem 4

Let \(\varvec{\delta }^{m,t}=\varvec{{\widehat{g}}}_m^{\textrm{MOM}}(\varvec{\beta }^{m,t}) -\nabla {\mathcal {R}}(\varvec{\beta }^{m,t})\in {\mathbb {R}}^{(k-1)p}\). By the fact that \(\nabla {\mathcal {R}}(\varvec{\beta }^{*})=0\) and the update rule of the MOM-based gradient estimation, i.e., \(\varvec{\beta }^{m,t+1}=\varvec{\beta }^{m,t} -\eta \varvec{{\widehat{g}}}_m^{\textrm{MOM}}(\varvec{\beta }^{m,t})\), we have

$$\begin{aligned}{} & {} \Vert \varvec{\beta }^{m,t+1}-\varvec{\beta }^{*}\Vert _2\nonumber \\{} & {} \quad =\Vert \varvec{\beta }^{m,t}-\eta \varvec{{\widehat{g}}}_m^{\textrm{MOM}}(\varvec{\beta }^{m,t})-\varvec{\beta }^{*} +\eta \nabla {\mathcal {R}}(\varvec{\beta }^{*})\Vert _2\nonumber \\{} & {} \quad =\Vert \varvec{\beta }^{m,t}-\varvec{\beta }^{*} -\eta [\nabla {\mathcal {R}}(\varvec{\beta }^{m,t}) -\nabla {\mathcal {R}}(\varvec{\beta }^{*})]-\eta \varvec{\delta }^{m,t}\Vert _2\nonumber \\{} & {} \quad \le \Vert \varvec{\beta }^{m,t}-\varvec{\beta }^{*}-\eta [\nabla {\mathcal {R}}(\varvec{\beta }^{m,t}) -\nabla {\mathcal {R}}(\varvec{\beta }^{*})]\Vert _2+\eta \Vert \varvec{\delta }^{m,t}\Vert _2. \end{aligned}$$
(A7)

Let \({\varvec{G}}(\varvec{\beta }^{m,t},\varvec{\beta }^{*})=\nabla {\mathcal {R}}(\varvec{\beta }^{m,t}) -\nabla {\mathcal {R}}(\varvec{\beta }^{*})\) for brevity. Then, using Lemma 3.11 of Bubeck (2015), we directly get

$$\begin{aligned}{} & {} \Vert \varvec{\beta }^{m,t}-\varvec{\beta }^{*} -\eta {\varvec{G}}(\varvec{\beta }^{m,t},\varvec{\beta }^{*})\Vert _2^2\\{} & {} \quad =\Vert \varvec{\beta }^{m,t}-\varvec{\beta }^{*}\Vert _2^2 +\eta ^2\Vert {\varvec{G}}(\varvec{\beta }^{m,t},\varvec{\beta }^{*}) \Vert _2^2-2\eta ({\varvec{G}}(\varvec{\beta }^{m,t},\varvec{\beta }^{*}))^\top (\varvec{\beta }^{m,t}-\varvec{\beta }^{*})\\{} & {} \quad \le \Vert \varvec{\beta }^{m,t}-\varvec{\beta }^{*}\Vert _2^2 +\eta ^2\Vert {\varvec{G}}(\varvec{\beta }^{m,t},\varvec{\beta }^{*})\Vert _2^2 -2\eta \Big [\frac{\gamma _1\gamma _2}{\gamma _1+\gamma _2}\Vert \varvec{\beta }^{m,t}-\varvec{\beta }^{*}\Vert _2^2\\{} & {} \quad +\frac{1}{\gamma _1+\gamma _2}\Vert {\varvec{G}}(\varvec{\beta }^{m,t},\varvec{\beta }^{*})\Vert _2^2\Big ]\\{} & {} \quad =\left( 1-\frac{2\eta \gamma _1\gamma _2}{\gamma _1+\gamma _2}\right) \Vert \varvec{\beta }^{m,t}-\varvec{\beta }^{*}\Vert _2^2 +\left( \eta ^2-\frac{2\eta }{\gamma _1+\gamma _2}\right) \Vert {\varvec{G}}(\varvec{\beta }^{m,t},\varvec{\beta }^{*})\Vert _2^2. \end{aligned}$$

Choosing the step size \(\eta =\frac{2}{\sqrt{M}(\gamma _1+\gamma _2)}\), which depends on the number of local machines M, we have

$$\begin{aligned} \Vert \varvec{\beta }^{m,t}-\varvec{\beta }^{*} -\eta {\varvec{G}}(\varvec{\beta }^{m,t},\varvec{\beta }^{*})\Vert _2^2 \le \left( 1-\frac{2\eta \gamma _1\gamma _2}{\gamma _1+\gamma _2}\right) \Vert \varvec{\beta }^{m,t}-\varvec{\beta }^{*}\Vert _2^2. \end{aligned}$$
(A8)
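To see why (A8) holds with this choice, note that the coefficient of \(\Vert {\varvec{G}}(\varvec{\beta }^{m,t},\varvec{\beta }^{*})\Vert _2^2\) in the previous display is nonpositive:

$$\begin{aligned} \eta ^2-\frac{2\eta }{\gamma _1+\gamma _2} =\frac{2\eta }{\gamma _1+\gamma _2}\left( \frac{1}{\sqrt{M}}-1\right) \le 0 \quad \text {for } M\ge 1, \end{aligned}$$

so the \(\Vert {\varvec{G}}\Vert _2^2\) term can be dropped, which gives (A8).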

Thus, substituting (A8) into (A7), we have

$$\begin{aligned} \Vert \varvec{\beta }^{m,t+1}-\varvec{\beta }^{*}\Vert _2 \le \sqrt{1-\frac{2\eta \gamma _1\gamma _2}{\gamma _1+\gamma _2}}\Vert \varvec{\beta }^{m,t} -\varvec{\beta }^{*}\Vert _2+\eta \Vert \varvec{\delta }^{m,t}\Vert _2. \end{aligned}$$

Using Proposition 3 and the condition that \(Cov(\nabla \ell )\) is finite, for a given \(\varvec{\beta }^{m,t}\) with \(\Vert \varvec{\beta }^{m,t}-\varvec{\beta }^0\Vert \le C_0\), we further have

$$\begin{aligned} \Vert \varvec{\beta }^{m,t+1}-\varvec{\beta }^{*}\Vert _2\le & {} \sqrt{1-\frac{2\eta \gamma _1\gamma _2}{\gamma _1+\gamma _2}}\Vert \varvec{\beta }^{m,t} -\varvec{\beta }^{*}\Vert _2\nonumber \\{} & {} +C\eta \sqrt{\frac{(2+c)(k-1)pL}{n(1-2C_1)}}+C\eta n^{-1/2}\nonumber \\= & {} \sqrt{1-\frac{2\eta \gamma _1\gamma _2}{\gamma _1+\gamma _2}}\Vert \varvec{\beta }^{m,t} -\varvec{\beta }^{*}\Vert _2+\frac{2\varepsilon _{_\textrm{MOM}}}{\gamma _1+\gamma _2} \end{aligned}$$

with probability at least \(1-e^{-\frac{L(1-2C_1)^2c^2}{2(2+c)^2}}\) for all \(c>0\), where

$$\begin{aligned} \varepsilon _{_\textrm{MOM}}=C\sqrt{\frac{(2+c)(k-1)pL}{N(1-2C_1)}}+CN^{-1/2}, \end{aligned}$$

with \(N=nM\) denoting the total sample size and C a numerical constant.

Write \(\kappa =\sqrt{1-\frac{2\eta \gamma _1\gamma _2}{\gamma _1+\gamma _2}}\). Since \(\kappa <1\) for \(0<\gamma _1<\gamma _2\), iterating the above recursion yields

$$\begin{aligned} \Vert \varvec{\beta }^{m,t}-\varvec{\beta }^{*}\Vert _2\le & {} \kappa \Vert \varvec{\beta }^{m,t-1}-\varvec{\beta }^{*}\Vert _2+\frac{2\varepsilon _{_\textrm{MOM}}}{\gamma _1+\gamma _2}\\\le & {} \kappa ^2\Vert \varvec{\beta }^{m,t-2}-\varvec{\beta }^{*}\Vert _2 +\frac{2(1+\kappa )\varepsilon _{_\textrm{MOM}}}{\gamma _1+\gamma _2}\\{} & {} \vdots \\\le & {} \kappa ^t\Vert \varvec{\beta }^{m,0}-\varvec{\beta }^{*}\Vert _2 +\frac{2(1+\kappa +\ldots +\kappa ^t)\varepsilon _{_\textrm{MOM}}}{\gamma _1+\gamma _2}\\\le & {} \kappa ^t\Vert \varvec{\beta }^{0}-\varvec{\beta }^{*}\Vert _2 +\frac{2\varepsilon _{_\textrm{MOM}}}{(1-\kappa )(\gamma _1+\gamma _2)} \end{aligned}$$

with probability at least \(1-Te^{-\frac{L(1-2C_1)^2c^2}{2(2+c)^2}}\) for all \(c>0\), where the factor T comes from a union bound over the T iterations and C is a numerical constant.

Finally, since the final estimator \(\varvec{\beta }^{T}\) averages the local iterates \(\varvec{\beta }^{m,T}\), the triangle inequality gives \(\Vert \varvec{\beta }^{T}-\varvec{\beta }^{*}\Vert _2 \le \frac{1}{M}\sum _{m=1}^M\Vert \varvec{\beta }^{m,T}-\varvec{\beta }^{*}\Vert _2\), and hence

$$\begin{aligned} \Vert \varvec{\beta }^{T}-\varvec{\beta }^{*}\Vert _2 \le \kappa ^T\Vert \varvec{\beta }^{0}-\varvec{\beta }^{*}\Vert _2 +\frac{2\varepsilon _{_\textrm{MOM}}}{(1-\kappa )(\gamma _1+\gamma _2)} \end{aligned}$$

with probability at least \(1-Te^{-\frac{L(1-2C_1)^2c^2}{2(2+c)^2}}\). \(\square \)
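The recursion in the proof corresponds to each machine running T robust gradient steps locally and the center averaging the final iterates \(\varvec{\beta }^{m,T}\). The sketch below mirrors that reading; the paper's actual algorithm may interleave communication differently, and the gradient estimator is passed in as a callable (for example the mom_gradient sketch above).

```python
import numpy as np

def distributed_robust_gd(machine_data, grad_fn, beta0, eta, T):
    """Local robust gradient descent on each machine, then average (cf. Theorem 4).
    machine_data: list of per-machine design matrices Z_m (rows x_i ⊗ W_{y_i});
    grad_fn(Z, beta): robust gradient estimate, e.g. a MOM- or weight-based one;
    beta0: common initial value; eta: step size; T: number of iterations."""
    finals = []
    for Z in machine_data:
        beta = beta0.copy()
        for _ in range(T):
            beta = beta - eta * grad_fn(Z, beta)   # beta^{m,t+1} = beta^{m,t} - eta * g_m
        finals.append(beta)
    return np.mean(finals, axis=0)                 # beta^{T}: average over the M machines

# illustrative step size matching the proof: eta = 2 / (sqrt(M) * (gamma1 + gamma2)),
# with gamma1, gamma2 the strong convexity and smoothness constants from Assumption 3.
```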

A.3 Proof of Proposition 5 and Theorem 6

Before proving Proposition 5, we restate Theorem 3.2 of Minsker and Ndaoud (2021) in the following form, which provides the error bound for the weighted-based gradient estimation.

Lemma 7

Suppose that \(\max _{1\le m\le M}\vert {\mathcal {O}}_m\vert \le C_2L\) for some \(0\le C_2<1\) and that the event \(\Omega _m^\nu \) holds. Then we have

$$\begin{aligned}{} & {} \vert (\varvec{{\widehat{g}}}_m^{\textrm{WT}}(\varvec{\beta }) -\lambda _m\varvec{\beta }-\nabla {\mathcal {R}}(\varvec{\beta }))_r\vert \\{} & {} \quad \le \frac{C_\nu \sqrt{Cov(\nabla \ell (\varvec{\beta }))_{rr}}}{(1-C_2)^\nu } \left( \sqrt{\frac{1+c_1}{n}}+\sqrt{\frac{L}{n}} +\frac{\vert {\mathcal {O}}_m\vert }{\sqrt{nL}\left( \sqrt{\alpha ({\mathcal {O}})}\right) ^{\nu -1}}\right) \end{aligned}$$

with probability at least \(1-2e^{-c_1}-Le^{-c_2n/L}\) for \(m=1,2,\ldots ,M\) and \(r=1,2,\ldots ,(k-1)p\). Here, \(C_\nu >0\), \(c_1>0\), \(c_2>0\), and

$$\begin{aligned} \Omega _m^{\nu }=\bigcup _{r=1}^{(k-1)p}\left\{ \sum _{l=1}^L1/(\varvec{\sigma }_m^{(l)}(\varvec{\beta }))_r^\nu \le (4\sqrt{(Cov(\nabla \ell (\varvec{\beta })))_{rr}}/(1-C_2))^\nu \right\} . \end{aligned}$$

Proof of Proposition 5

For a given \(\widetilde{\varvec{\beta }}\) with \(\Vert \widetilde{\varvec{\beta }}-\varvec{\beta }^0\Vert \le C_0\), Lemma 7 directly gives

$$\begin{aligned}{} & {} \Vert \varvec{{\widehat{g}}}_m^{\textrm{WT}}(\widetilde{\varvec{\beta }}) -\nabla {\mathcal {R}}(\widetilde{\varvec{\beta }})\Vert _2\\{} & {} \quad \le \Vert \varvec{{\widehat{g}}}_m^{\textrm{WT}}(\widetilde{\varvec{\beta }})-\lambda _m\widetilde{\varvec{\beta }} -\nabla {\mathcal {R}}(\widetilde{\varvec{\beta }})\Vert _2+\lambda _m\Vert \widetilde{\varvec{\beta }}\Vert _2\\{} & {} \quad \le \sqrt{(k-1)p}\max _{1\le r\le (k-1)p}\vert (\varvec{{\widehat{g}}}_m^{\textrm{WT}}(\widetilde{\varvec{\beta }})-\lambda _m\widetilde{\varvec{\beta }} -\nabla {\mathcal {R}}(\widetilde{\varvec{\beta }}))_r\vert +Cn^{-1/2}\\{} & {} \quad \le \max _{1\le r\le (k-1)p}\frac{C_\nu \sqrt{(k-1)pCov(\nabla \ell (\widetilde{\varvec{\beta }}))_{rr}}}{(1-C_2)^\nu } \Big (\sqrt{\frac{1+c_1}{n}}+\sqrt{\frac{L}{n}} +\frac{\vert {\mathcal {O}}_m\vert }{\sqrt{nL}(\sqrt{\alpha ({\mathcal {O}})})^{\nu -1}}\Big )\\{} & {} \quad +Cn^{-1/2} \end{aligned}$$

with probability at least \(1-(2e^{-c_1}+Le^{-c_2n/L})(k-1)p\).

The proof of Theorem 6 is similar to that of Theorem 4 above and is therefore omitted. \(\square \)
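For comparison with the MOM sketch, the following is a rough, heavily simplified illustration of a weight-based aggregation of block gradients in the spirit of Minsker and Ndaoud (2021): each block's gradient is weighted coordinatewise in inverse proportion to a power \(\nu \) of a per-block scale estimate \((\varvec{\sigma }_m^{(l)})_r\). This is only one plausible reading of the construction behind \(\varvec{{\widehat{g}}}_m^{\textrm{WT}}\); the paper's exact estimator is defined in the main text and is not reproduced here.

```python
import numpy as np

def weighted_gradient(block_grads, block_scales, nu=2.0):
    """Coordinatewise weighted aggregation of block-average gradients.
    block_grads:  array (L, d) of block-average gradients g_m^{(l)};
    block_scales: array (L, d) of per-block, per-coordinate scale estimates sigma_m^{(l)};
    weights proportional to sigma^{-nu}, normalized over blocks.
    An illustrative scheme only, not the paper's exact weighted-based estimator."""
    G = np.asarray(block_grads, dtype=float)
    S = np.asarray(block_scales, dtype=float)
    w = S ** (-nu)
    w = w / w.sum(axis=0, keepdims=True)   # normalize weights over the L blocks, per coordinate
    return (w * G).sum(axis=0)
```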


About this article

Cite this article

Sun, G., Wang, X., Yan, Y. et al. Robust distributed multicategory angle-based classification for massive data. Metrika 87, 299–323 (2024). https://doi.org/10.1007/s00184-023-00915-3
