Abstract
Distributed optimization with open collaboration is a popular field of study since it allows small groups, companies, universities, and individuals to jointly solve huge-scale problems. However, standard optimization algorithms are fragile in such settings due to the possible presence of so-called Byzantine workers – participants that can send (intentionally or not) incorrect information instead of the information prescribed by the protocol (e.g., send anti-gradients instead of stochastic gradients). Thus, the problem of designing distributed methods with provable robustness to Byzantine workers has been receiving a lot of attention recently. In particular, several works consider a very promising way to achieve Byzantine tolerance by exploiting variance reduction and robust aggregation. The existing approaches use SAGA- and SARAH-type variance-reduced estimators, while another popular estimator – SVRG – has not been studied in the context of Byzantine robustness. In this work, we close this gap in the literature and propose a new method – Byzantine-Robust Loopless Stochastic Variance Reduced Gradient (BR-LSVRG). We derive non-asymptotic convergence guarantees for the new method in the strongly convex case and compare its performance with existing approaches in numerical experiments.
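To make the two ingredients mentioned in the abstract concrete, the following Python sketch shows the loopless SVRG (L-SVRG) gradient estimator of [29] as a single worker would compute it. This is an illustrative sketch, not the authors' exact algorithm; the function names, signatures, and the way the oracles are passed in are assumptions made for the example.

```python
import numpy as np

def lsvrg_step(stoch_grad, full_grad, x, w, n_samples, p, rng):
    """One loopless SVRG (L-SVRG) gradient estimate [29] on a single worker.

    stoch_grad(z, i) -- stochastic gradient of the local loss at z for sample i
    full_grad(z)     -- full local gradient at z
    x, w             -- current iterate and reference point
    p                -- probability of refreshing the reference point
    """
    i = rng.integers(n_samples)
    g = stoch_grad(x, i) - stoch_grad(w, i) + full_grad(w)
    # "Loopless" part: instead of an outer loop of fixed length as in vanilla
    # SVRG, the reference point is refreshed with a small probability p.
    w_new = x.copy() if rng.random() < p else w
    return g, w_new
```

In a Byzantine-robust method of this type, each worker sends such an estimate to the server, which combines the received vectors with a robust aggregation rule (see Appendix A) rather than with plain averaging.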
The research was supported by the Ministry of Science and Higher Education of the Russian Federation (Goszadaniye) 075-00337-20-03, project No. 0714-2020-0005.
References
Nemirovski, A., Juditsky, A., Lan, G., Shapiro, A.: Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. 19(4), 1574–1609 (2009)
Baruch, G., Baruch, M., Goldberg, Y.: A little is enough: circumventing defenses for distributed learning. Adv. Neural Inf. Process. Syst. 32 (2019)
Blanchard, P., El Mhamdi, E.M., Guerraoui, R., Stainer, J.: Machine learning with adversaries: byzantine tolerant gradient descent. Adv. Neural Inf. Process. Syst. 30 (2017)
Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2(3), 1–27 (2011)
Damaskinos, G., El-Mhamdi, E.M., Guerraoui, R., Guirguis, A., Rouault, S.: AGGREGATHOR: byzantine machine learning via robust gradient aggregation. Proc. Mach. Learn. Res. 1, 81–106 (2019)
Defazio, A., Bach, F., Lacoste-Julien, S.: SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives. Adv. Neural Inf. Process. Syst. 27 (2014)
Fang, C., Li, C.J., Lin, Z., Zhang, T.: SPIDER: near-optimal non-convex optimization via stochastic path-integrated differential estimator. Adv. Neural Inf. Process. Syst. 31 (2018)
Ghadimi, S., Lan, G.: Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM J. Optim. 23(4), 2341–2368 (2013)
Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press, Cambridge (2016)
Gorbunov, E., Borzunov, A., Diskin, M., Ryabinin, M.: Secure distributed training at scale. In: International Conference on Machine Learning, pp. 7679–7739. PMLR (2022). http://proceedings.mlr.press/v162/gorbunov22a/gorbunov22a.pdf
Gorbunov, E., Hanzely, F., Richtárik, P.: A unified theory of SGD: variance reduction, sampling, quantization and coordinate descent. In: International Conference on Artificial Intelligence and Statistics, pp. 680–690. PMLR (2020)
Gorbunov, E., Horváth, S., Richtárik, P., Gidel, G.: Variance reduction is an antidote to byzantines: better rates, weaker assumptions and communication compression as a cherry on the top. arXiv preprint arXiv:2206.00529 (2022)
Gower, R.M., Schmidt, M., Bach, F., Richtárik, P.: Variance-reduced methods for machine learning. Proc. IEEE 108(11), 1968–1983 (2020)
Guerraoui, R., Rouault, S., et al.: The hidden vulnerability of distributed learning in byzantium. In: International Conference on Machine Learning, pp. 3521–3530. PMLR (2018)
He, L., Karimireddy, S.P., Jaggi, M.: Byzantine-robust decentralized learning via self-centered clipping. arXiv preprint arXiv:2202.01545 (2022)
Hofmann, T., Lucchi, A., Lacoste-Julien, S., McWilliams, B.: Variance reduced stochastic gradient descent with neighbors. Adv. Neural Inf. Process. Syst. 28 (2015)
Horváth, S., Lei, L., Richtárik, P., Jordan, M.I.: Adaptivity of stochastic gradient methods for nonconvex optimization. SIAM J. Math. Data Sci. 4(2), 634–648 (2022)
Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. Adv. Neural Inf. Process. Syst. 26 (2013)
Karimireddy, S.P., He, L., Jaggi, M.: Learning from history for byzantine robust optimization. In: International Conference on Machine Learning, pp. 5311–5319. PMLR (2021)
Karimireddy, S.P., He, L., Jaggi, M.: Byzantine-robust learning on heterogeneous datasets via bucketing. In: International Conference on Learning Representations (2022). https://arxiv.org/pdf/2006.09365.pdf
Kovalev, D., Horváth, S., Richtárik, P.: Don’t jump through hoops and remove those loops: SVRG and Katyusha are better without the outer loop. In: Algorithmic Learning Theory, pp. 451–467. PMLR (2020). http://proceedings.mlr.press/v117/kovalev20a/kovalev20a.pdf
Lamport, L., Shostak, R., Pease, M.: The byzantine generals problem. ACM Trans. Program. Lang. Syst. 4(3), 382–401 (1982)
Li, C.: Demystifying GPT-3 language model: a technical overview (2020)
Li, Z., Bao, H., Zhang, X., Richtárik, P.: PAGE: a simple and optimal probabilistic gradient estimator for nonconvex optimization. In: International Conference on Machine Learning, pp. 6286–6295. PMLR (2021)
Lojasiewicz, S.: A topological property of real analytic subsets. Coll. du CNRS, Les équations aux dérivées partielles 117(87–89), 2 (1963)
Lyu, L., et al.: Privacy and robustness in federated learning: attacks and defenses. IEEE Trans. Neural Netw. Learn. Syst. (2022)
Nesterov, Y.: Lectures on Convex Optimization. SOIA, vol. 137. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-91578-4
Nguyen, L.M., Liu, J., Scheinberg, K., Takáč, M.: SARAH: a novel method for machine learning problems using stochastic recursive gradient. In: International Conference on Machine Learning, pp. 2613–2621. PMLR (2017)
Ouyang, L., et al.: Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155 (2022)
Pillutla, K., Kakade, S.M., Harchaoui, Z.: Robust aggregation for federated learning. IEEE Trans. Signal Process. 70, 1142–1154 (2022)
Polyak, B.T.: Some methods of speeding up the convergence of iteration methods. USSR Comput. Math. Math. Phys. 4(5), 1–17 (1964)
Polyak, B.T.: Gradient methods for the minimisation of functionals. USSR Comput. Math. Math. Phys. 3(4), 864–878 (1963)
Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22(3), 400–407 (1951)
Shalev-Shwartz, S., Ben-David, S.: Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, Cambridge (2014)
Wu, Z., Ling, Q., Chen, T., Giannakis, G.B.: Federated variance-reduced stochastic gradient descent with robustness to byzantine attacks. IEEE Trans. Signal Process. 68, 4583–4596 (2020)
Xie, C., Koyejo, O., Gupta, I.: Fall of empires: breaking byzantine-tolerant SGD by inner product manipulation. In: Uncertainty in Artificial Intelligence, pp. 261–270. PMLR (2020)
Yin, D., Chen, Y., Kannan, R., Bartlett, P.: Byzantine-robust distributed learning: towards optimal statistical rates. In: International Conference on Machine Learning, pp. 5650–5659. PMLR (2018)
Zinkevich, M., Weimer, M., Li, L., Smola, A.: Parallelized stochastic gradient descent. Adv. Neural Inf. Process. Syst. 23 (2010)
A Examples of Robust Aggregators
In [20], the authors propose the procedure called bucketing (see Algorithm 2) that robustifies certain aggregation rules such as:
• geometric median (GM) \(\hat{x} = \arg \min _{x\in \mathbb {R}^d}\sum _{i=1}^n \Vert x - x_i\Vert \);
• coordinate-wise median (CM) \(\hat{x} = \arg \min _{x\in \mathbb {R}^d}\sum _{i=1}^n \Vert x - x_i\Vert _1\);
• Krum estimator [3] \(\hat{x} = \arg \min _{x_i \in \{x_1, \ldots , x_n\}} \sum _{j \in S_i} \Vert x_j - x_i\Vert ^2\), where \(S_i \subseteq \{x_1, \ldots , x_n\}\) is the subset of the \(n - |\mathcal{B}| - 2\) vectors closest (w.r.t. the \(\ell _2\)-norm) to \(x_i\).

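The Python sketch below illustrates the three aggregation rules and the bucketing wrapper. It is a simplified illustration rather than the exact Algorithm 2: the function names, the Weiszfeld iterations used as an approximate geometric-median solver, and the parameters (e.g., the bucket size s and the number of Byzantine inputs num_byz) are assumptions made for the example.

```python
import numpy as np

def coordinate_wise_median(xs):
    """CM: coordinate-wise median, the minimizer of the sum of l1 distances."""
    return np.median(xs, axis=0)

def geometric_median(xs, iters=100, eps=1e-8):
    """GM via Weiszfeld iterations (an approximate solver; GM has no closed form)."""
    y = xs.mean(axis=0)
    for _ in range(iters):
        d = np.maximum(np.linalg.norm(xs - y, axis=1), eps)  # avoid division by zero
        w = 1.0 / d
        y_new = (w[:, None] * xs).sum(axis=0) / w.sum()
        if np.linalg.norm(y_new - y) < eps:
            break
        y = y_new
    return y

def krum(xs, num_byz):
    """Krum [3]: return the input with the smallest sum of squared distances
    to its n - num_byz - 2 nearest neighbours."""
    n = len(xs)
    k = n - num_byz - 2
    dists = np.linalg.norm(xs[:, None, :] - xs[None, :, :], axis=-1) ** 2
    scores = [np.sort(dists[i])[1:k + 1].sum() for i in range(n)]  # [0] is the self-distance
    return xs[int(np.argmin(scores))]

def bucketing(xs, s, aggregator, rng):
    """Bucketing wrapper: shuffle the inputs, average them within buckets of
    size s, and aggregate the bucket means with the given rule."""
    perm = rng.permutation(len(xs))
    buckets = [xs[perm[i:i + s]].mean(axis=0) for i in range(0, len(xs), s)]
    return aggregator(np.asarray(buckets))
```

Averaging within random buckets dilutes the influence of any single Byzantine input before the (nonlinear) aggregation rule is applied, which is what makes otherwise brittle rules such as CM usable.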
The following result establishes the robustness of the aforementioned aggregation rules in combination with Bucketing.
Theorem 3
(Theorem D.1 from [12]). Assume that \(\{x_1, x_2, \ldots , x_n\}\) is such that there exist a subset \(\mathcal{G}\subseteq [n]\) with \(|\mathcal{G}| = G \ge (1-\delta )n\) and \(\sigma \ge 0\) such that \(\frac{1}{G(G-1)}\sum _{i,l \in \mathcal{G}}\mathbb {E}\Vert x_i - x_l\Vert ^2 \le \sigma ^2\), and assume that \(\delta \le \delta _{\max }\). If Algorithm 2 is run with an appropriately chosen bucket size \(s\), then its output \(\hat{x}\) satisfies \(\mathbb {E}\Vert \hat{x} - \bar{x}\Vert ^2 \le c\,\delta \sigma ^2\), where \(\bar{x} = \frac{1}{G}\sum _{i \in \mathcal{G}} x_i\) and \(c > 0\) is a constant that depends on the aggregation rule.
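As a rough numerical illustration of this kind of guarantee (reusing coordinate_wise_median and bucketing from the sketch above), one can corrupt a small fraction of the inputs and observe that the aggregate stays close to the mean of the honest vectors; the dimensions, fractions, and the crude "large vector" attack below are purely illustrative assumptions.

```python
rng = np.random.default_rng(0)
d, n_good, n_byz = 10, 17, 3                    # delta = 3/20 corrupted inputs
honest = rng.normal(size=(n_good, d))           # honest vectors with small spread
byz = 100.0 * np.ones((n_byz, d))               # a crude attack: huge identical vectors
xs = np.vstack([honest, byz])

agg = bucketing(xs, s=2, aggregator=coordinate_wise_median, rng=rng)
print(np.linalg.norm(agg - honest.mean(axis=0)))  # small despite the outliers
```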