Robust supervised learning with coordinate gradient descent

  • Original Paper
  • Published in Statistics and Computing

Abstract

This paper considers the problem of supervised learning with linear methods when both features and labels can be corrupted, either in the form of heavy-tailed data and/or corrupted rows. We introduce a combination of coordinate gradient descent as a learning algorithm together with robust estimators of the partial derivatives. This leads to robust statistical learning methods that have a numerical complexity nearly identical to non-robust ones based on empirical risk minimization. The main idea is simple: while robust learning with gradient descent requires the computational cost of robustly estimating the whole gradient to update all parameters, a parameter can be updated immediately using a robust estimator of a single partial derivative in coordinate gradient descent. We prove upper bounds on the generalization error of the algorithms derived from this idea, which control both the optimization and statistical errors, with and without a strong convexity assumption on the risk. Finally, we propose an efficient implementation of this approach in a new Python library called linlearn, and demonstrate through extensive numerical experiments that our approach introduces an interesting new compromise between robustness, statistical performance and numerical efficiency for this problem.

Availability of data and materials

All data sets used in the experimental sections of this work are publicly available, except for the synthetic ones, which were randomly generated. Detailed information and sources are given in Appendix A.3.

Code Availability

The experiments were carried out using the linlearn library, which was developed as part of this project and open-sourced on GitHub under the BSD-3 license; it is available at https://github.com/linlearn/linlearn.

Notes

  1. By implicit we mean defined as the \({{\,\mathrm{\hbox {argmin}}\,}}\) of some functional, as opposed to the explicit iterations of an optimization algorithm: an implicit estimator may differ from the output of the algorithm actually run on the data, while an explicit one is exactly that output.

  2. https://github.com/linlearn/linlearn.

  3. Or more generally the centered moment of order \(1+\alpha \) for \(\alpha \in (0,1]\), see below.

  4. We call “\(\eta \)-corruption” the context where the outlier set \({\mathcal {O}}\) in Assumption 2 satisfies \(\vert {\mathcal {O}}\vert = \eta n\) with \(\eta \in [0, 1/2)\).

  5. Indeed, under strong convexity, the optimization error converges linearly and the final bound is of the form \(a\exp (-bT) + c T \log (T/\delta )/n\) for some \(a,b,c >0\), so that \(T \sim \log n\) is approximately optimal.

References

  • Acharya, A., Hashemi, A., Jain, P., Sanghavi, S., Dhillon, I.S., Topcu, U.: Robust training in high dimensions via block coordinate geometric median descent. In: International Conference on Artificial Intelligence and Statistics, pp. 11145–11168 (2022)

  • Alon, N., Matias, Y., Szegedy, M.: The space complexity of approximating the frequency moments. J. Comput. Syst. Sci. 58(1), 137–147 (1999)

  • Armijo, L.: Minimization of functions having Lipschitz continuous first partial derivatives. Pac. J. Math. 16(1), 1–3 (1966)

  • Audibert, J.-Y., Munos, R., Szepesvári, C.: Exploration-exploitation tradeoff using variance estimates in multi-armed bandits. Theoret. Comput. Sci. 410(19), 1876–1902 (2009). (Algorithmic Learning Theory)

  • Ballester-Ripoll, R., Paredes, E.G., Pajarola, R.: Sobol tensor trains for global sensitivity analysis. Reliab. Eng. Syst. Saf. 183, 311–322 (2019)

  • Bartlett, P.L., Bousquet, O., Mendelson, S.: Local Rademacher complexities. Ann. Stat. 33(4), 1497–1537 (2005)

  • Beck, A., Tetruashvili, L.: On the convergence of block coordinate descent type methods. SIAM J. Optim. 23(4), 2037–2060 (2013)

  • Bhatia, K., Jain, P., Kamalaruban, P., Kar, P.: Consistent robust regression. Adv. Neural. Inf. Process. Syst. 30, 2110–2119 (2017)

  • Blondel, M., Seki, K., Uehara, K.: Block coordinate descent algorithms for large-scale sparse multiclass classification. Mach. Learn. 93(1), 31–52 (2013)

  • Boucheron, S., Lugosi, G., Massart, P., Ledoux, M.: Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, Oxford (2013)

  • Brownlees, C., Joly, E., Lugosi, G., et al.: Empirical risk minimization for heavy-tailed losses. Ann. Stat. 43(6), 2507–2536 (2015)

  • Bubeck, S.: Convex optimization: algorithms and complexity. Found. Trends® Mach. Learn. 8(3–4), 231–357 (2015)

  • Bubeck, S., Cesa-Bianchi, N., Lugosi, G.: Bandits with heavy tail. IEEE Trans. Inf. Theory 59(11), 7711–7717 (2013)

  • Candanedo, L.M., Feldheim, V.: Accurate occupancy detection of an office room from light, temperature, humidity and CO2 measurements using statistical learning models. Energy Build. 112, 28–39 (2016)

  • Candanedo, L.M., Feldheim, V., Deramaix, D.: Data driven prediction models of energy use of appliances in a low-energy house. Energy Build. 140, 81–97 (2017)

  • Candès, E.J., Li, X., Ma, Y., Wright, J.: Robust principal component analysis? J. ACM 58(3), 1–37 (2011)

  • Catoni, O.: Challenging the empirical mean and empirical variance: a deviation study. In: Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, vol. 48, pp. 1148–1185. Institut Henri Poincaré (2012)

  • Charikar, M., Steinhardt, J., Valiant, G.: Learning from untrusted data. In: Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pp. 47–60 (2017)

  • Chen, M., Gao, C., Ren, Z.: Robust covariance and scatter matrix estimation under Huber’s contamination model. Ann. Stat. 46(5), 1932–1960 (2018)

  • Chen, P., Jin, X., Li, X., Lihu, X.: A generalized Catoni’s M-estimator under finite \(\alpha \)-th moment assumption with \(\alpha \in (1, 2)\). Electron. J. Stat. 15(2), 5523–5544 (2021)

  • Chen, Y., Lili, S., Jiaming, X.: Distributed statistical machine learning in adversarial settings: byzantine gradient descent. Proc. ACM Meas. Anal. Comput. Syst. 1(2), 1–25 (2017)

  • Cherapanamjeri, Y., Aras, E., Tripuraneni, N., Jordan, M.I., Flammarion, N., Bartlett, P.L.: Optimal robust linear regression in nearly linear time. arXiv preprint arXiv:2007.08137 (2020)

  • Cherapanamjeri, Y., Flammarion, N., Bartlett, P.L.: Fast mean estimation with sub-gaussian rates. In: Conference on Learning Theory, pp. 786–806 (2019)

  • Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms. MIT Press, Cambridge (2009)

  • Depersin, J., Lecué, G.: Robust sub-gaussian estimation of a mean vector in nearly linear time. Ann. Stat. 50(1), 511–536 (2022)

  • Devroye, L., Györfi, L.: Nonparametric Density Estimation: The L1 View. Wiley Interscience Series in Discrete Mathematics. Wiley (1985)

  • Devroye, L., Lerasle, M., Lugosi, G., Oliveira, R.I.: Sub-gaussian mean estimators. Ann. Stat. 44(6), 2695–2725 (2016)

  • Diakonikolas, I., Kamath, G., Kane, D., Li, J., Moitra, A., Stewart, A.: Robust estimators in high-dimensions without the computational intractability. SIAM J. Comput. 48(2), 742–864 (2019)

  • Diakonikolas, I., Kamath, G., Kane, D., Li, J., Steinhardt, J., Stewart, A.: Sever: a robust meta-algorithm for stochastic optimization. In: International Conference on Machine Learning, pp. 1596–1606 (2019b)

  • Diakonikolas, I., Kane, D.M., Pensia, A.: Outlier robust mean estimation with subgaussian rates via stability. Adv. Neural. Inf. Process. Syst. 33, 1830–1840 (2020)

  • Diakonikolas, I., Kong, W., Stewart, A.: Efficient algorithms and lower bounds for robust linear regression. In: Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 2745–2754. SIAM (2019)

  • Dixon, W.J.: Analysis of extreme values. Ann. Math. Stat. 21(4), 488–506 (1950)

  • Donoho, D.L., Liu, R.C.: The “automatic’’ robustness of minimum distance functionals. Ann. Stat. 16(2), 552–586 (1988)

  • Dua, D., Graff, C.: UCI machine learning repository (2017)

  • Edgeworth, F.Y.: On observations relating to several quantities. Hermathena 6(13), 279–285 (1887)

  • Fanaee-T, H., Gama, J.: Event labeling combining ensemble detectors and background knowledge. Prog. Artif. Intell. 2(2), 113–127 (2014)

  • Fischler, M.A., Bolles, R.C.: Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 24(6), 381–395 (1981)

  • Friedman, J., Hastie, T., Tibshirani, R.: A note on the group lasso and a sparse group lasso. arXiv preprint arXiv:1001.0736 (2010)

  • Gao, C., et al.: Robust regression via mutivariate regression depth. Bernoulli 26(2), 1139–1170 (2020)

  • Geer, S.A., van de Geer, S.: Empirical Processes in M-estimation, vol. 6. Cambridge University Press, Cambridge (2000)

  • Genkin, A., Lewis, D.D., Madigan, D.: Large-scale bayesian logistic regression for text categorization. Technometrics 49(3), 291–304 (2007)

  • Geoffrey, C., Guillaume, L., Matthieu, L.: Robust high dimensional learning for Lipschitz and convex losses. J. Mach. Learn. Res. 21 (2020)

  • Grubbs, F.E.: Procedures for detecting outlying observations in samples. Technometrics 11(1), 1–21 (1969)

  • Gupta, A., Kohli, S.: An MCDM approach towards handling outliers in web data: a case study using OWA operators. Artif. Intell. Rev. 46(1), 59–82 (2016)

  • Hampel, F.R.: A general qualitative definition of robustness. Ann. Math. Stat. 42(6), 1887–1896 (1971)

  • Hampel, F.R., Ronchetti, E.M., Rousseeuw, P., Stahel, W.A.: Robust Statistics: The Approach Based on Influence Functions. Wiley-Interscience, New York (1986)

  • Hawkins, D.M.: Identification of Outliers, vol. 11. Springer, Berlin (1980)

  • Hoare, C.A.R.: Algorithm 65: find. Commun. ACM 4(7), 321–322 (1961)

  • Holland, M.: Robustness and scalability under heavy tails, without strong convexity. In International Conference on Artificial Intelligence and Statistics, pp. 865–873 (2021)

  • Holland, M., Ikeda, K.: Better generalization with less data using robust gradient descent. In International Conference on Machine Learning, pp. 2761–2770 (2019)

  • Holland, M.J.: Robust descent using smoothed multiplicative noise. In: The 22nd International Conference on Artificial Intelligence and Statistics, pp. 703–711 (2019)

  • Holland, M.J., Ikeda, K.: Efficient learning with robust gradient descent. Mach. Learn. 108(8), 1523–1560 (2019)

  • Hopkins, S.B.: Mean estimation with sub-Gaussian rates in polynomial time. Ann. Stat. 48(2), 1193–1213 (2020)

  • Hsu, D., Sabato, S.: Loss minimization and parameter estimation with heavy tails. J. Mach. Learn. Res. 17(1), 543–582 (2016)

  • Huber, P.J.: Robust estimation of a location parameter. Ann. Math. Stat. 35(1), 73–101 (1964)

  • Huber, P.J.: The 1972 wald lecture robust statistics: a review. Ann. Math. Stat. 43(4), 1041–1067 (1972)

  • Huber, P.J.: Robust Statistics. Wiley, New York (1981)

  • Jerrum, M.R., Valiant, L.G., Vazirani, V.V.: Random generation of combinatorial structures from a uniform distribution. Theoret. Comput. Sci. 43, 169–188 (1986)

  • Juditsky, A., Kulunchakov, A., Tsyntseus, H.: Sparse recovery by reduced variance stochastic approximation. Inf. Inference: J. IMA 12(2), 851–896 (2022)

  • Klivans, A., Kothari, P.K., Meka, R.: Efficient algorithms for outlier-robust regression. In: Conference On Learning Theory, pp. 1420–1430 (2018)

  • Klivans, A.R., Long, P.M., Servedio, R.A.: Learning halfspaces with malicious noise. J. Mach. Learn. Res. 10(12) (2009)

  • Knuth, D.E.: Seminumerical Algorithms (The Art of Computer Programming, vol. 2). Addison-Wesley, Reading, MA, pp. 124–125 (1969)

  • Koklu, M., Ozkan, I.A.: Multiclass classification of dry beans using computer vision and machine learning techniques. Comput. Electron. Agric. 174, 105507 (2020)

  • Koltchinskii, V.: Local Rademacher complexities and oracle inequalities in risk minimization. Ann. Stat. 34(6), 2593–2656 (2006)

  • Kuhn, M., Johnson, K.: Feature Engineering and Selection: A Practical Approach for Predictive Models. CRC Press (2019)

  • Lai, K.A., Rao, A.B., Vempala, S.: Agnostic estimation of mean and covariance. In: 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pp. 665–674. IEEE (2016)

  • Lecué, G., Lerasle, M., et al.: Robust machine learning by median-of-means: theory and practice. Ann. Stat. 48(2), 906–931 (2020)

  • Lecué, G., Lerasle, M., Mathieu, T.: Robust classification via MOM minimization. Mach. Learn. 109(8), 1635–1665 (2020)

  • Lecué, G., Mendelson, S.: Learning subgaussian classes: upper and minimax bounds. Topics in Learning Theory, Société Mathématique de France (S. Boucheron and N. Vayatis, Eds.) (2013)

  • Ledoux, M., Talagrand, M.: Probability in Banach Spaces: Isoperimetry and Processes. Springer, Berlin (1991)

  • Lei, Z., Luh, K., Venkat, P., Zhang, F.: A fast spectral algorithm for mean estimation with sub-Gaussian rates. In: Conference on Learning Theory, pp. 2598–2612 (2020)

  • Li, J.: Robust sparse estimation tasks in high dimensions. arXiv preprint arXiv:1702.05860 (2017)

  • Li, X., Zhao, T., Arora, R., Liu, H., Hong, M.: On faster convergence of cyclic block coordinate descent-type methods for strongly convex minimization. J Mach. Learn. Res. 18(1), 6741–6764 (2017)

  • Liu, L., Li, T., Caramanis, C.: High dimensional robust estimation of sparse models via trimmed hard thresholding. arXiv preprint arXiv:1901.08237 (2019)

  • Liu, L., Shen, Y., Li, T., Caramanis, C.: High dimensional robust sparse regression. In: International Conference on Artificial Intelligence and Statistics, pp. 411–421 (2020)

  • Liu, T., Tao, D.: Classification with noisy labels by importance reweighting. IEEE Trans. Pattern Anal. Mach. Intell. 38(3), 447–461 (2015)

  • Lugosi, G., Mendelson, S.: Mean estimation and regression under heavy-tailed distributions: a survey. Found. Comput. Math. 19(5), 1145–1190 (2019)

  • Lugosi, G., Mendelson, S.: Sub-gaussian estimators of the mean of a random vector. Ann. Stat. 47(2), 783–794 (2019)

  • Lugosi, G., Mendelson, S.: Robust multivariate mean estimation: the optimality of trimmed mean. Ann. Stat. 49(1), 393–410 (2021)

  • Massart, P., Nédélec, É.: Risk bounds for statistical learning. Ann. Stat. 34(5), 2326–2366 (2006)

  • Maurer, A., Pontil, M.: Empirical Bernstein bounds and sample-variance penalization. In: COLT (2009)

  • Minsker, S., et al.: Geometric median and robust estimation in Banach spaces. Bernoulli 21(4), 2308–2335 (2015)

  • Minsker, S., et al.: Sub-Gaussian estimators of the mean of a random matrix with heavy-tailed entries. Ann. Stat. 46(6A), 2871–2903 (2018)

  • Mizera, I., et al.: On depth and deep points: a calculus. Ann. Stat. 30(6), 1681–1736 (2002)

  • Mnih, V., Szepesvári, C., Audibert, J.-Y.: Empirical Bernstein stopping. In: Proceedings of the 25th International Conference on Machine Learning, pp. 672–679 (2008)

  • Nemirovskij, A.S., Yudin, D.B.: Problem Complexity and Method Efficiency in Optimization. Wiley-Interscience (1983)

  • Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Springer, New York (2004)

  • Nesterov, Y.: Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim. 22(2), 341–362 (2012)

  • Owen, A.: A robust hybrid of lasso and ridge regression. Contemp. Math. 443(7), 59–72 (2007)

  • Paul, D., Chakraborty, S., Das, S.: Robust principal component analysis: a median of means approach. arXiv preprint arXiv:2102.03403 (2021)

  • Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, É.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)

  • Pensia, A., Jog, V., Loh, P.-L.: Robust regression with covariate filtering: heavy tails and adversarial contamination. arXiv preprint arXiv:2009.12976 (2020)

  • Prasad, A., Balakrishnan, S., Ravikumar, P.: A robust univariate mean estimator is all you need. In: Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, vol. 108, pp. 4034–4044 (2020)

  • Prasad, A., Suggala, A.S., Balakrishnan, S., Ravikumar, P.: Robust estimation via robust gradient estimation. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 82(3), 601–627 (2020)

  • Shalev-Shwartz, S., Tewari, A.: Stochastic methods for \(\ell _1\)-regularized loss minimization. J. Mach. Learn. Res. 12, 1865–1892 (2011)

  • Shevade, S.K., Sathiya Keerthi, S.: A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics 19(17), 2246–2253 (2003)

  • Srebro, N., Sridharan, K., Tewari, A.: Optimistic rates for learning with a smooth loss. arXiv preprint arXiv:1009.3896 (2010)

  • Tu, J., Liu, W., Mao, X., Chen, X.: Variance reduced median-of-means estimator for Byzantine-robust distributed inference. J. Mach. Learn. Res. 22(84), 1–67 (2021)

  • Tukey, J.W.: A survey of sampling from contaminated distributions. In: Contributions to Probability and Statistics, pp. 448–485 (1960)

  • van der Vaart, A.W.: Asymptotic Statistics (Cambridge Series in Statistical and Probabilistic Mathematics). Cambridge University Press (1998)

  • van Erven, T., Sachs, S., Koolen, W.M., Kotlowski, W.: Robust online convex optimization in the presence of outliers. In: Proceedings of the Thirty Fourth Conference on Learning Theory, vol. 134, pp. 4174–4194. PMLR (2021)

  • Vapnik, V.: The Nature of Statistical Learning Theory. Springer, New York, NY (1999)

  • Vardi, Y., Zhang, C.-H.: The multivariate \(L_1\)-median and associated data depth. Proc. Natl. Acad. Sci. 97(4), 1423–1426 (2000)

  • Vergara, A., Vembu, S., Ayhan, T., Ryan, M.A., Homer, M.L., Huerta, R.: Chemical gas sensor drift compensation using classifier ensembles. Sens. Actuators, B Chem. 166, 320–329 (2012)

  • Wright, S.J.: Coordinate descent algorithms. Math. Program. 151(1), 3–34 (2015)

  • Wu, T.T., Lange, K.: Coordinate descent algorithms for lasso penalized regression. Ann. Appl. Stat. 2(1), 224–244 (2008)

  • Yuan, M., Lin, Y.: Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68(1), 49–67 (2006)

  • Zhang, L., Zhou, Z.-H.: \(\ell _1\)-regression with heavy-tailed distributions. In: Advances in Neural Information Processing Systems, pp. 1084–1094 (2018)

  • Zhang, T.: Solving large scale linear prediction problems using stochastic gradient descent algorithms. In: Proceedings of the Twenty-First International Conference on Machine Learning, p. 116. Association for Computing Machinery, New York, NY, USA (2004)

  • Zheng, A., Casari, A.: Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists. O’Reilly Media, Inc. (2018)

Acknowledgements

This research is supported by the Agence Nationale de la Recherche as part of the “Investissements d’avenir” program (reference ANR-19-P3IA-0001; PRAIRIE 3IA Institute).

Funding

This research is supported by the French Agence Nationale de la Recherche as well as by the PRAIRIE Institute.

Author information

Authors and Affiliations

Authors

Contributions

Conceptualization: Stéphane Gaïffas; Methodology: Ibrahim Merad, Stéphane Gaïffas; Formal analysis and investigation: Ibrahim Merad; Writing - original draft preparation: Ibrahim Merad; Writing - review and editing: Stéphane Gaïffas; Funding acquisition, Resources and Supervision: Stéphane Gaïffas.

Corresponding author

Correspondence to Ibrahim Merad.

Ethics declarations

Conflict of interest

The authors have no competing interests to declare that are relevant to the content of this article.

Ethics approval

The authors approve the ethical rules set by the journal and are committed to abiding by them. Declarations relative to conflicts of interest, research involving Human Participants and/or Animals and informed consent are not applicable for this work.

Consent to participate

Not applicable.

Consent for publication

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Supplementary theoretical results and details on experiments

1.1 A.1 The Lipschitz constants \(L_j\) are unknown

The step-sizes \((\beta _j)_{j \in \llbracket d \rrbracket }\) used in Theorems 1 and 2 are given by \(\beta _j = 1 / L_j\), where the Lipschitz constants \(L_j\) are defined by (8). This makes them non-observable, since they depend on the unknown distribution of the non-corrupted features \(P_{X_i}\) for \(i \in {\mathcal {I}}\). We cannot use line-search (Armijo 1966) here, since it requires evaluating the objective \(R(\theta )\), which is unknown as well. In order to provide theoretical guarantees similar to those of Theorem 1 without knowing \((L_j)_{j=1}^d\), we use the following approach. First, we use the upper bound

$$\begin{aligned} U_j:= \gamma {\mathbb {E}}\big [(X^j)^2\big ] \ge L_j, \end{aligned}$$
(A1)

which holds under Assumption 1, and we estimate \({\mathbb {E}}[(X^j)^2]\) in order to build a robust estimator of \(U_j\). In order to obtain an observable upper bound and to control its deviations with high probability, we introduce the following condition.

Definition 2

We say that a real random variable Z satisfies the \(L^{\zeta }\)-\(L^{\xi }\) condition with constant \(C \ge 1\) whenever it satisfies

$$\begin{aligned} \big ( {\mathbb {E}}\big [ \vert Z - {\mathbb {E}}Z\vert ^{\zeta } \big ] \big )^{1/\zeta } \le C \big ( {\mathbb {E}}\big [ \vert Z - {\mathbb {E}}Z\vert ^{\xi } \big ] \big )^{1/\xi }. \end{aligned}$$
(A2)

Under this condition, we can use the \(\texttt {MOM} \) estimator to obtain a high probability upper bound on \({\mathbb {E}}[(X^j)^2]\), as stated in the following lemma.

Lemma 5

Grant Assumption 2 with \(\alpha \in (0, 1]\) and suppose that for all \(j\in \llbracket d \rrbracket ,\) the variable \((X^j)^2\) satisfies the \(L^{(1+\alpha )}\)-\(L^1\) condition with a known constant C. For any fixed \(j \in \llbracket d \rrbracket ,\) let \(\widehat{\sigma }^2_j\) be the \(\texttt {MOM} \) estimator of \({\mathbb {E}}[(X^j)^2]\) with K blocks. If \(\vert {\mathcal {O}}\vert \le K / 12,\) we have

$$\begin{aligned}{} & {} {\mathbb {P}}\Big [ \Big (1 - 12^{1/(1+\alpha )}C\Big ( \frac{K}{ n} \Big )^{\alpha /(1+\alpha )} \Big )^{-1}\widehat{\sigma }_j^2 \\{} & {} \quad \le {\mathbb {E}}[(X^j)^2 ] \Big ] \le \exp (-K/18). \end{aligned}$$

If we fix a confidence level \(\delta \in (0, 1)\) and choose \(K:= \lceil 18 \log (1 / \delta ) \rceil ,\) we have

$$\begin{aligned} \left( 1 - 216^{1/(1+\alpha )}C\left( \frac{\log (1/\delta )}{ n} \right) ^{\alpha /(1+\alpha )} \right) ^{-1} \widehat{\sigma }_j^2 > {\mathbb {E}}[ (X^j)^2 ] \end{aligned}$$

with a probability larger than \(1 - \delta \).

The proof of Lemma 5 is given in Appendix B. Denoting by \(\widehat{U}_j\) the upper bound it provides on \({\mathbb {E}}[(X^j)^2],\) we can readily bound the Lipschitz constants as \(L_j \le \gamma \widehat{U}_j,\) which leads to the following statement.

Corollary 2

Grant the same assumptions as in Theorem 1 and Proposition 3. Suppose additionally that for all \(j\in \llbracket d \rrbracket \), the variable \((X^j)^2\) satisfies the \(L^{(1+\alpha )}\)-\(L^1\) condition with a known constant C and fix \(\delta \in (0, 1)\). Let \(\theta ^{(T)}\) be the output of Algorithm 1 with step-sizes \(\widehat{\beta }_j = 1 / {\overline{L}}_j\) where \({\overline{L}}_j:= \gamma \widehat{U}_j\) and \(\widehat{U}_j\) are the upper bounds from Lemma 5 with confidence \(\delta /2d,\) an initial iterate \(\theta ^{(0)},\) importance sampling distribution \(p_j = {\overline{L}}_j / \sum _{k \in \llbracket d \rrbracket } {\overline{L}}_{k}\) and estimators of the partial derivatives with error vector \(\epsilon (\cdot )\). Then, we have

$$\begin{aligned}&{\mathbb {E}}\big [ R(\theta ^{(T)}) \big ] - R^\star \le (R(\theta ^{(0)}) - R^\star ) \nonumber \\&\quad \left( 1 - \frac{\lambda }{\sum _{j \in \llbracket d \rrbracket } \overline{L}_{j}} \right) ^T + \frac{1}{2\lambda } \big \Vert \epsilon ( \delta /2 ) \big \Vert _2^2 \end{aligned}$$
(A3)

with probability at least \(1 - \delta \).

The proof of Corollary 2 is given in Appendix B. It is a direct consequence of Theorem 1 and Lemma 5 and shows that an upper bound similar to that of Theorem 1 can be achieved with observable step-sizes. One may argue that the \(L^{(1+\alpha )}\)-\(L^1\) condition simply bypasses the difficulty of deriving an observable upper bound by arbitrarily assuming that a ratio of moments is observed. However, we point out that a hypothesis of this nature is indispensable to obtain bounds such as the one above (consider, for instance, a real random variable with an infinitesimal mass drifting towards infinity). In fact, the \(L^{(1+\alpha )}\)-\(L^1\) condition is much weaker than the requirement of boundedness (with known range) common to most known empirical bounds (Maurer and Pontil 2009; Audibert et al. 2009; Mnih et al. 2008).
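
For concreteness, the observable step-sizes of Corollary 2 can be computed as in the following minimal NumPy sketch (an illustration, not the linlearn implementation); the smoothness constant gamma, the constant C from the \(L^{(1+\alpha )}\)-\(L^1\) condition and the confidence delta are user-supplied inputs, and one would pass delta / (2 d) to match the statement of Corollary 2.

```python
import numpy as np

def observable_step_sizes(X, gamma, C, alpha=1.0, delta=0.01, seed=0):
    """Sketch of the observable step-sizes of Corollary 2: MOM estimates of
    E[(X^j)^2] are inflated by the factor of Lemma 5 to obtain high-probability
    upper bounds U_j, and L_j <= gamma * U_j gives beta_j = 1 / (gamma * U_j)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    K = int(np.ceil(18 * np.log(1 / delta)))                 # number of MOM blocks
    idx = rng.permutation(n)[: (n // K) * K].reshape(K, -1)  # random equal-size blocks
    block_means = (X[idx] ** 2).mean(axis=1)                 # shape (K, d)
    sigma2_hat = np.median(block_means, axis=0)              # MOM estimate of E[(X^j)^2]
    shrink = 1 - 216 ** (1 / (1 + alpha)) * C * (np.log(1 / delta) / n) ** (alpha / (1 + alpha))
    if shrink <= 0:
        raise ValueError("n is too small for this choice of (C, alpha, delta)")
    U_hat = sigma2_hat / shrink          # observable upper bound from Lemma 5
    return 1.0 / (gamma * U_hat)         # step-sizes beta_j = 1 / L_bar_j
```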

1.2 A.2 Observable upper bound for the moment \(m_{\alpha , j}\)

Since the moment \(m_{\alpha , j}\) is not observable, we propose in Lemma 6 below an observable high-probability upper bound for it based on \(\texttt {MOM} \). Let us now introduce a robust estimator \(\widehat{m}^{\texttt {MOM} }_{\alpha , j}(\theta )\) of the unknown moment \(m_{\alpha , j}(\theta )\) using the following “two-step” \(\texttt {MOM} \) procedure. First, we compute \(\widehat{g}^{\texttt {MOM} }_j(\theta )\), the \(\texttt {MOM} \) estimator of \(g_j(\theta )\) with K blocks given by (14). Then, we compute again a \(\texttt {MOM} \) estimator on \(\vert g^i_j(\theta ) - \widehat{g}^{\texttt {MOM} }_j(\theta ) \vert ^{1+\alpha }\) for \(i\in \llbracket n \rrbracket \), namely

$$\begin{aligned} \widehat{m}_{\alpha , j}^{\texttt {MOM} }(\theta ):= {{\,\textrm{median}\,}}\big ( \widehat{m}_{\alpha , j}^{(1)}(\theta ), \ldots , \widehat{m}_{\alpha , j}^{(K)}(\theta ) \big ), \end{aligned}$$
(A4)

where

$$\begin{aligned} \widehat{m}_{\alpha , j}^{(k)}(\theta ):= \frac{1}{\vert B_k\vert } \sum _{i \in B_k} \big \vert g^i_j(\theta ) - \widehat{g}^{\texttt {MOM} }_j(\theta ) \big \vert ^{1+\alpha }, \end{aligned}$$

using uniformly sampled blocks \(B_1, \ldots , B_K\) of equal size that form a partition of \(\llbracket n \rrbracket \).
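
As an illustration, this two-step procedure can be written in a few lines of NumPy for a single coordinate (a sketch, not the linlearn implementation); the array g contains the per-sample partial derivatives \(g^i_j(\theta )\) and K is the number of blocks.

```python
import numpy as np

def mom(values, K, rng):
    """Median-of-means of a 1-d array over K uniformly sampled blocks of equal size."""
    idx = rng.permutation(values.shape[0])[: (values.shape[0] // K) * K]
    return np.median(values[idx].reshape(K, -1).mean(axis=1))

def two_step_mom_moment(g, alpha, K, seed=0):
    """Two-step MOM estimator (A4) of the centered (1+alpha)-moment of the
    per-sample partial derivatives g[i] = g_j^i(theta)."""
    rng = np.random.default_rng(seed)
    g_mom = mom(g, K, rng)                                # MOM estimator of g_j(theta), cf. (14)
    return mom(np.abs(g - g_mom) ** (1 + alpha), K, rng)  # MOM of the centered moments, cf. (A4)
```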

Lemma 6

Grant Assumptions 1 and 2 with \(\alpha \in (0, 1]\) and suppose that for all \(j\in \llbracket d \rrbracket \) and \(\theta \in \Theta \) the partial derivatives \(\ell '(X^\top \theta , Y)X^j\) satisfy the \(L^{(1+\alpha )^2}\)-\(L^{(1+\alpha )}\) condition with a known constant C (see Definition 2). Then, if \(\vert {\mathcal {O}}\vert \le K / 12,\) we have

$$\begin{aligned} {\mathbb {P}}\big [ \widehat{m}^{\texttt {MOM} }_{\alpha , j}(\theta ) \le (1 - \kappa )m_{\alpha , j}(\theta ) \big ] \le 2\exp (-K/18) \end{aligned}$$

where \(\kappa = \epsilon + 24 (1 + \alpha ) \big (\frac{(1 + \epsilon ) K}{n} \big )^{\alpha /(1+\alpha )}\) and \(\epsilon = (24 (1 + C^{(1+\alpha )^2} ))^{1/(1+\alpha )} \big (\frac{K}{n}\big )^{\alpha / (1 + \alpha )}\).

The proof of Lemma 6 is given in Appendix B.

1.3 A.3 Experimental details

We provide in this section supplementary information about the numerical experiments conducted in Sect. 6.

1.3.1 A.3.1 Data sets

The main characteristics of the data sets used from the UCI repository are given in Table 3 and their direct URLs are given in Table 4.

Table 3 Main characteristics of the data sets used in experiments, including number of samples, number of features, number of categorical features and number of classes
Table 4 The URLs of all the data sets used in the paper, giving direct download links and supplementary details

1.3.2 A.3.2 Data corruption

For a given corruption rate \(\eta \), we obtain a corrupted version of a data set by replacing an \(\eta \)-fraction of its samples with uninformative elements. For a data set of size n we choose \({\mathcal {O}}\subset \llbracket n \rrbracket \) which satisfies \(\vert {\mathcal {O}}\vert = \eta n\) up to integer rounding. The corruption is applied prior to any preprocessing, except in the regression case, where label scaling is applied beforehand. The affected subset is chosen uniformly at random. Since many data sets contain both continuous and categorical features, we distinguish two corruption mechanisms, which we apply depending on the nature of each feature; labels are corrupted as continuous or categorical values depending on whether the task is regression or classification. Denote by \({\widetilde{{\varvec{X}}}} \in {\mathbb {R}}^{n\times (d+1)}\) the data matrix with the vector of labels appended to its columns. Let \({\widetilde{J}} \subset \llbracket d+1 \rrbracket \) denote the set of indices of continuous columns; for \(j \in {{\widetilde{J}}}\), we compute their empirical means \(\widehat{\mu }_j\) and standard deviations \(\widehat{\sigma }_j\). We also sample a random unit vector u of size \(\vert {\widetilde{J}}\vert \). The two mechanisms are the following (a code sketch is given after the list).

  • For each categorical feature column j and each corrupted index \(i \in {\mathcal {O}}\), we replace \({\varvec{X}}_{i,j}\) with a value sampled uniformly among \(\{{\varvec{X}}_{\cdot ,j}\}\), i.e. among the observed modalities of the categorical feature in question.

  • For continuous features, for each corrupted index \(i \in {\mathcal {O}}\), we replace \({\varvec{X}}_{i, {\widetilde{J}}}\), with equal probability, with one of the following possibilities:

    • a vector \(\xi \) sampled coordinatewise according to \(\xi _j = r_j + 5 \widehat{\sigma }_j \nu \) where \(r_j\) is a value randomly picked in the column \({\varvec{X}}_{\cdot ,j}\) and \(\nu \) is a sample from the Student distribution with 2.1 degrees of freedom.

    • a vector \(\xi \) sampled coordinatewise according to \(\xi _j = \widehat{\mu }_j + 5\widehat{\sigma }_j u_j + z \) where z is a standard Gaussian variable.

    • a vector \(\xi \) sampled according to \(\xi = \widehat{\mu }+ 5\widehat{\sigma }\otimes w \) where w is a uniformly sampled unit vector.
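
The following NumPy sketch implements the continuous-feature part of this corruption for a single row (an illustration under the description above, not the exact experimental code); mu_hat, sigma_hat and u correspond to \(\widehat{\mu }\), \(\widehat{\sigma }\) and u, and X_cont holds the continuous columns of the data matrix.

```python
import numpy as np

def corrupt_continuous_row(mu_hat, sigma_hat, u, X_cont, rng):
    """Return one corrupted row of continuous features, drawn from one of the
    three mechanisms above with equal probability."""
    p = len(mu_hat)
    mechanism = rng.integers(3)
    if mechanism == 0:
        # values picked at random in each column, plus 5*sigma times a Student(2.1) sample
        r = X_cont[rng.integers(X_cont.shape[0], size=p), np.arange(p)]
        return r + 5 * sigma_hat * rng.standard_t(2.1)
    if mechanism == 1:
        # shift along the fixed random unit direction u, plus a standard Gaussian sample
        return mu_hat + 5 * sigma_hat * u + rng.standard_normal()
    # shift along a freshly sampled uniform unit direction w
    w = rng.standard_normal(p)
    w /= np.linalg.norm(w)
    return mu_hat + 5 * sigma_hat * w
```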

1.4 A.4 Preprocessing

We apply a minimal amount of preprocessing to the data before applying the considered learning algorithms. More precisely, categorical features are one-hot encoded, while centering and standard scaling are applied to the continuous features.
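
With scikit-learn, this preprocessing amounts to a simple ColumnTransformer, as in the sketch below; the column indices cat_cols and num_cols are placeholders that depend on the data set, and handle_unknown="ignore" is our choice here rather than a detail taken from the paper.

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

cat_cols, num_cols = [0, 1], [2, 3]  # placeholder column indices

# One-hot encode categorical columns; center and scale continuous ones.
preprocessor = ColumnTransformer([
    ("categorical", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ("continuous", StandardScaler(), num_cols),
])
# Usage: X_processed = preprocessor.fit_transform(X)
```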

1.5 A.5 Parameter hyper-optimization

We use the hyperopt library to find optimal hyper-parameters for all algorithms. For each data set, the available samples are split into training, validation and test sets with proportions \(70\%, 15\%, 15\%\). Whenever corruption is applied, it is restricted to the training set. We run 50 rounds of hyper-parameter optimization, each trained on the training set and evaluated on the validation set. Then, we report results on the test set for all hyper-optimized algorithms. For each algorithm, the hyper-parameters are tried out using the following sampling mechanism (the one we specify to hyperopt; a short code sketch is given after the list):

  • \(\texttt {MOM} \), \(\texttt {GMOM} \), \(\texttt {LLM} \): we optimize the number of blocks K used for the median-of-means computations. This is done through a block_size \(=K/n\) hyper-parameter chosen with log-uniform distribution over \([10^{-5}, 0.2]\)

  • \(\texttt {CH} \) and \(\texttt {CH\, GD} \): we optimize the confidence \(\delta \) used to define the \(\texttt {CH} \) estimator’s scale parameter (see Equation (21)) chosen with log-uniform distribution over \([e^{-10}, 1]\)

  • \(\texttt {TM} \), \(\texttt {HG} \): we optimize the percentage used for trimming uniformly in \([10^{-5}, 0.3]\)

  • \(\texttt {RANSAC} \): we optimize the value of the min_samples parameter in the scikit-learn implementation, chosen as \(4 + m\) with m an integer chosen uniformly in \(\llbracket 100 \rrbracket \)

  • \(\texttt {HUBER} \): we optimize the epsilon parameter in the scikit-learn implementation chosen uniformly in [1.0, 2.5]
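
For instance, the block-size search for the MOM-type estimators can be written with hyperopt as follows (a sketch; the objective below is a placeholder for fitting the estimator and returning its validation loss, and the other hyper-parameters are handled analogously).

```python
import numpy as np
from hyperopt import fmin, hp, tpe

# Log-uniform block_size = K/n over [1e-5, 0.2], as specified above.
space = {"block_size": hp.loguniform("block_size", np.log(1e-5), np.log(0.2))}

def objective(params):
    # Placeholder: fit the estimator with K = ceil(params["block_size"] * n) blocks
    # on the training set and return the validation loss; a dummy value keeps
    # this sketch runnable.
    return params["block_size"]

best = fmin(objective, space, algo=tpe.suggest, max_evals=50)
```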

Appendix B: Proofs

1.1 B.1 Proof of Theorem 1

This proof follows, with minor modifications, the proof of Theorem 1 from Wright (2015). Using Definition 1, we obtain

$$\begin{aligned}{} & {} {\mathbb {P}}[ {\mathcal {E}} ] \ge 1 - \delta \quad \text {where} \quad {\mathcal {E}}:= \big \{ \forall j \in \llbracket d \rrbracket , \quad \forall t \in [T], \nonumber \\{} & {} \quad \big \vert {\widehat{g}}_j(\theta ^{(t)}) - g_j(\theta ^{(t)})\big \vert \le \epsilon _j(\delta ) \big \}. \end{aligned}$$
(B5)

Let us recall that \(e_j\) stands for the j-th canonical basis vector of \({\mathbb {R}}^d\) and that, as described in Algorithm 1, we have

$$\begin{aligned} \theta ^{(t+1)} = \theta ^{(t)} - \beta _{j_t} {\widehat{g}}_t e_{j_t}, \end{aligned}$$

where we use the notations \({\widehat{g}}_t = {\widehat{g}}_{j_t}(\theta ^{(t)})\) and \(g_t = g_{j_t}(\theta ^{(t)})\) and where we recall that \(j_1, \ldots , j_t\) is an i.i.d. sequence with distribution p. We also introduce \(\epsilon _j:= \epsilon _j(\delta )\). Using Assumption 3, we obtain

$$\begin{aligned} R(&\theta ^{(t+1)}) = R\big (\theta ^{(t)} - \beta _{j_t} {\widehat{g}}_t e_{j_t}\big ) \nonumber \\&\le R(\theta ^{(t)}) - \big \langle g(\theta ^{(t)}), \beta _{j_t} {\widehat{g}}_t e_{j_t} \big \rangle + \frac{L_{j_t}}{2}\beta _{j_t}^2{\widehat{g}}_t^2 \nonumber \\&= R(\theta ^{(t)}) - \beta _{j_t} g_t^2 - \beta _{j_t} g_t ( {\widehat{g}}_t \!-\! g_t ) \nonumber \\&\quad + \frac{L_{j_t}\beta _{j_t}^2}{2}\big (g_t^2 + ({\widehat{g}}_t \!-\! g_t)^2 + 2g_t({\widehat{g}}_t \!-\! g_t)\big ) \nonumber \\&= R(\theta ^{(t)}) - \beta _{j_t} g_t (1 \!-\! L_{j_t}\beta _{j_t})( {\widehat{g}}_t \!-\! g_t ) \nonumber \\&\quad - \beta _{j_t} \Big ( 1 \!-\! \frac{L_{j_t}\beta _{j_t}}{2} \Big ) g_t^2 + \frac{L_{j_t}\beta _{j_t}^2}{2}({\widehat{g}}_t \!-\! g_t)^2 \nonumber \\&= R(\theta ^{(t)}) - \frac{1}{2 L_{j_t}}g_t^2 + \frac{1}{2L_{j_t}}({\widehat{g}}_t - g_t)^2 \nonumber \\&\le R(\theta ^{(t)}) - \frac{1}{2 L_{j_t}}g_t^2 + \frac{\epsilon _{j_t}^2}{2L_{j_t}} \end{aligned}$$
(B6)

on the event \({\mathcal {E}}\), where we used the choice \(\beta _{j_t} = 1/L_{j_t}\) and the fact that \(\vert {\widehat{g}}_t - g_t \vert \le \epsilon _{j_t}\) on \({\mathcal {E}}\).

Since \(j_1, \ldots , j_t\) is an i.i.d. sequence with distribution p, we have for any \((j_1, \ldots , j_{t-1})\)-measurable and integrable function \(\varphi \) that

$$\begin{aligned} {\mathbb {E}}_{t-1}\big [\varphi (j_t)\big ] = \sum _{j \in \llbracket d \rrbracket } \varphi (j) p_j, \end{aligned}$$

where we denote for short the conditional expectation \({\mathbb {E}}_{t-1}[\cdot ] = {\mathbb {E}}_{t-1}[\cdot \vert j_1, \ldots , j_{t-1}]\). So, taking \({\mathbb {E}}_{t-1}[\cdot ]\) on both sides of (B6) leads, whenever \(p_j = L_j / \sum _{k=1}^d L_k\), to

$$\begin{aligned}{} & {} {\mathbb {E}}_{t-1}\big [ R(\theta ^{(t+1)})\big ] \le R(\theta ^{(t)}) - \frac{1}{2 \sum _k L_k}\big \Vert g(\theta ^{(t)})\big \Vert ^2 \\{} & {} \quad + \frac{1}{2\sum _k L_k} \Xi , \end{aligned}$$

where we introduced \(\Xi := \Vert \epsilon (\delta )\Vert _2^2\), while it leads to

$$\begin{aligned}{} & {} {\mathbb {E}}_{t-1} \big [R(\theta ^{(t+1)})\big ] \le R(\theta ^{(t)}) - \frac{1}{2 L_{\max }d} \big \Vert g(\theta ^{(t)})\big \Vert ^2 \\{} & {} \quad + \frac{1}{2d L_{\min }} \Xi \end{aligned}$$

whenever \(p_j = 1 / d\), simply using \(L_{\min } \le L_j \le L_{\max }\). In order to treat both cases simultaneously, we set \({{\bar{L}}} = \sum _{k=1}^d L_k\) and \({{\bar{\epsilon }}} = \Xi / (2 \sum _{k=1}^d L_k)\) whenever \(p_j = L_j / \sum _{k=1}^d L_k\), and \({{\bar{L}}} = d L_{\max }\) and \({{\bar{\epsilon }}} = \Xi / (2 d L_{\min })\) whenever \(p_j = 1 / d\), and continue from the inequality

$$\begin{aligned} {\mathbb {E}}_{t-1}\big [ R(\theta ^{(t+1)})\big ] \le R(\theta ^{(t)}) - \frac{1}{2 {{\bar{L}}}} \big \Vert g(\theta ^{(t)})\big \Vert ^2 + {{\bar{\epsilon }}}. \end{aligned}$$

Introducing \(\phi _t:= {\mathbb {E}}\big [R(\theta ^{(t)})\big ] - R^\star \) and taking the expectation w.r.t. all \(j_1, \dots , j_t\) we obtain

$$\begin{aligned} \phi _{t+1} \le \phi _{t} - \frac{1}{2 {{\bar{L}}}} {\mathbb {E}}\big \Vert g(\theta ^{(t)}) \big \Vert ^2 + {{\bar{\epsilon }}}. \end{aligned}$$
(B7)

Using Inequality (9) with \(\theta _1 = \theta ^{(t)}\) gives

$$\begin{aligned} R(\theta _2) \ge R(\theta ^{(t)}) + \big \langle \nabla R(\theta ^{(t)}), \theta _2 - \theta ^{(t)} \big \rangle + \frac{\lambda }{2}\big \Vert \theta _2 - \theta ^{(t)}\big \Vert ^2 \end{aligned}$$

for any \(\theta _2 \in {\mathbb {R}}^d\), so that minimizing both sides with respect to \(\theta _2\) leads to

$$\begin{aligned} R^\star \ge R(\theta ^{(t)}) - \frac{1}{2\lambda }\big \Vert g(\theta ^{(t)})\big \Vert ^2 \end{aligned}$$

namely

$$\begin{aligned} \phi _t \le \frac{1}{2\lambda } {\mathbb {E}}\big \Vert g(\theta ^{(t)})\big \Vert ^2, \end{aligned}$$

by taking the expectation on both sides. Together with (B7) this leads to the following approximate contraction property:

$$\begin{aligned} \phi _{t+1} \le \phi _{t} \Big (1 - \frac{\lambda }{{{\bar{L}}}} \Big ) + {{\bar{\epsilon }}}, \end{aligned}$$

and by iterating \(t=1, \ldots , T\) to

$$\begin{aligned} \phi _T \le \phi _0 \Big ( 1 - \frac{\lambda }{{\bar{L}}} \Big )^T + \frac{{\bar{\epsilon }}{\bar{L}}}{\lambda }, \end{aligned}$$

which allows us to conclude the proof of Theorem 1. \(\square \)
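
For reference, the iteration analyzed in this proof is easy to write down explicitly. The sketch below (NumPy, not the linlearn implementation) assumes a user-supplied robust estimator partial_grad(theta, j) of \(g_j(\theta )\) (for instance \(\texttt {MOM} \) or a trimmed mean) and known constants \(L_j\), and uses the importance sampling distribution \(p_j = L_j / \sum _k L_k\) together with the step-sizes \(\beta _j = 1/L_j\).

```python
import numpy as np

def robust_cgd(theta0, partial_grad, L, T, seed=0):
    """Sketch of the coordinate gradient descent iteration of Algorithm 1 with
    importance sampling p_j = L_j / sum_k L_k and steps beta_j = 1 / L_j.
    `partial_grad(theta, j)` is any robust estimator of the partial derivative
    g_j(theta); `L` is the vector of coordinate-wise Lipschitz constants."""
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    L = np.asarray(L, dtype=float)
    p = L / L.sum()
    for _ in range(T):
        j = rng.choice(L.shape[0], p=p)                    # sample a coordinate
        theta[j] -= (1.0 / L[j]) * partial_grad(theta, j)  # robust coordinate update
    return theta
```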

1.2 B.2 Proof of Theorem 2

This proof reuses ideas from Li et al. (2017) and Beck and Tetruashvili (2013) and adapts them to our context, where the gradient coordinates are replaced with high confidence approximations. Without loss of generality, we initially assume that the coordinates are cycled upon in the natural order. We condition on the event (B5), which holds with probability at least \(1 - \delta \) as in the proof of Theorem 1, and denote \(\epsilon _j = \epsilon _j(\delta )\) and \(\epsilon _{Euc} = \Vert \epsilon (\delta )\Vert \).

Let the iterations be denoted as \(\theta ^{(t)}\) for \(t=0,\dots , T\) and \(\theta ^{(t)}_{i+1} = \theta ^{(t)}_{i} - \beta _{i+1} {\widehat{g}}(\theta ^{(t)}_{i})_{i+1} e_{i+1}\) for \(i=0, \dots , d-1\) with \(\beta _i = 1/L_i\), \(\theta ^{(t)}_0 = \theta ^{(t)}\) and \(\theta ^{(t)}_d = \theta ^{(t+1)}\). With these notations we have

$$\begin{aligned} R(\theta ^{(t)}) - R(\theta ^{(t+1)}) = \sum _{i=0}^{d-1} R(\theta ^{(t)}_i) - R(\theta ^{(t)}_{i+1}). \end{aligned}$$

Similarly to (B6) in the proof of Theorem 1 we find:

$$\begin{aligned} R(\theta ^{(t)}_i) - R(\theta ^{(t)}_{i+1}) \ge \frac{1}{2L_{i+1}}\big ( g(\theta ^{(t)}_{i})_{i+1}^2 - \epsilon _{i+1}^2\big ), \end{aligned}$$

leading to

$$\begin{aligned}{} & {} R(\theta ^{(t)}) - R(\theta ^{(t+1)}) \ge \sum _{i=0}^{d-1} \frac{1}{2L_{i+1}} g(\theta ^{(t)}_{i})_{i+1}^2 \nonumber \\{} & {} \quad - \frac{1}{2 L_{\min }} \sum _{i=0}^{d-1}\epsilon _{i+1}^2. \end{aligned}$$
(B8)

In the following, we aim to relate \(\sum _{i=0}^{d-1} \frac{1}{2L_{i+1}} g(\theta ^{(t)}_{i})_{i+1}^2\) to \(\big \Vert g(\theta ^{(t)})\Vert _2^2\), which we do by comparing coordinates. For the first step in a cycle we have \(g(\theta ^{(t)})_1 = g(\theta ^{(t)}_0)_1\) because \(\theta ^{(t)}= \theta ^{(t)}_0\). Let \(j \in \{1,\dots , d-1\}\); by the mean value theorem, there exists \(\gamma ^{(t)}_j \in {\mathbb {R}}^d\) such that we have:

$$\begin{aligned}&g(\theta ^{(t)})_{j+1} = g(\theta ^{(t)})_{j+1} - g(\theta ^{(t)}_j)_{j+1} + g(\theta ^{(t)}_j)_{j+1} \\&\quad = \big (\nabla g_{j+1}(\gamma ^{(t)}_j)\big )^\top \big (\theta ^{(t)}- \theta ^{(t)}_j\big ) + g(\theta ^{(t)}_j)_{j+1} \\&\quad = \bigg [\frac{\partial R(\gamma ^{(t)}_j)}{\partial _{j+1}\partial _1}, \dots , \frac{\partial R(\gamma ^{(t)}_j)}{\partial _{j+1}\partial _j}, 0, \dots , 0\bigg ] \\&\quad \qquad \big [ (\theta ^{(t)}- \theta ^{(t)}_j)_1, \dots , (\theta ^{(t)}- \theta ^{(t)}_j)_j, 0, \dots , 0 \big ]^\top \\ {}&\quad + g(\theta ^{(t)}_j)_{j+1} \\&\quad = [H_{j+1, 1}, \dots , H_{j+1, j}, 0, \dots , 0] \bigg [ \frac{{\widehat{g}}_1(\theta ^{(t)}_0)}{L_1}, \dots , \\&\quad \qquad \frac{{\widehat{g}}_j(\theta ^{(t)}_{j-1})}{L_j}, 0, \dots , 0 \bigg ]^\top + g(\theta ^{(t)}_j)_{j+1} \\&\quad = [H_{j+1, 1}, \dots , H_{j+1, j}, 0, \dots , 0] \bigg [ \frac{g_1(\theta ^{(t)}_0) + \delta ^{(t)}_{1}}{L_1}, \dots ,\\&\quad \qquad \frac{g_j(\theta ^{(t)}_{j-1}) + \delta ^{(t)}_{j}}{L_j}, 0, \dots , 0 \bigg ]^\top \\&\qquad + g(\theta ^{(t)}_j)_{j+1} \\&\quad = \underbrace{\bigg [\frac{H_{j+1, 1}}{\sqrt{L_1}}, \dots , \frac{H_{j+1, j}}{\sqrt{L_j}}, \sqrt{L_{j+1}}, 0, \dots , 0\bigg ]}_{{\widetilde{h}}^\top _{j+1}} \\&\quad \qquad \underbrace{\bigg [ \frac{g_1(\theta ^{(t)}_0)}{\sqrt{L_1}}, \dots , \frac{g_d(\theta ^{(t)}_{d-1})}{\sqrt{L_d}} \bigg ]^\top }_{{\widetilde{g}}_t} \\&\qquad + \underbrace{[H_{j+1, 1}, \dots , H_{j+1, j}, 0, \dots , 0]}_{h^\top _{j+1}} \bigg [ \frac{\delta ^{(t)}_{1}}{L_1}, \dots , \frac{\delta ^{(t)}_{d}}{L_d}\bigg ]^\top \\&\quad = {\widetilde{h}}_{j+1} {\widetilde{g}}_t + h_{j+1} A^{-1}\delta ^{(t)}, \end{aligned}$$

where we introduced the following quantities: the diagonal matrix \(A = \text {diag}(L_1, \ldots , L_d) \in {\mathbb {R}}^{d \times d}\), the vector \(\delta ^{(t)} \in {\mathbb {R}}^d\) such that \(\delta ^{(t)}_j = {\widehat{g}}(\theta ^{(t)}_{j-1})_{j} - g(\theta ^{(t)}_{j-1})_{j}\), which satisfies \(\vert \delta ^{(t)}_{j}\vert \le \epsilon _j\), the matrix \(H = (h_1, \dots , h_d)^\top \) and \({\widetilde{H}} = A^{1/2} + H A^{-1/2} = ( {\widetilde{h}}_1, \dots , {\widetilde{h}}_d)^\top \). In the case \(j=0\) the vector \(h_{j+1} = h_1\) is simply zero. This allows us to obtain the following estimate:

$$\begin{aligned} \big \Vert g(\theta ^{(t)})\big \Vert ^2&= \sum _{j=1}^d g(\theta ^{(t)})_j^2 = \sum _{j=1}^d ({\widetilde{h}}_j^\top {\widetilde{g}}_t + h_j^\top A^{-1} \delta ^{(t)})^2 \nonumber \\&\le \sum _{j=1}^d 2({\widetilde{h}}_j^\top {\widetilde{g}}_t)^2 + 2(h_j^\top A^{-1} \delta ^{(t)})^2 \nonumber \\&= 2\big \Vert {\widetilde{H}} {\widetilde{g}}_t\big \Vert ^2 + 2\big \Vert H A^{-1} \delta ^{(t)}\big \Vert ^2 \nonumber \\&\le 2\big \Vert {\widetilde{H}}\big \Vert ^2 \big \Vert {\widetilde{g}}_t\big \Vert ^2 + \frac{2}{L_{\min }^2}\Vert H\Vert ^2 \epsilon _{Euc}^2 \nonumber \\&= 2\Vert {\widetilde{H}}\Vert ^2 \sum _{i=0}^{d-1} \frac{1}{L_{i+1}} g(\theta ^{(t)}_{i})_{i+1}^2 + \frac{2}{L_{\min }^2}\Vert H\Vert ^2 \epsilon _{Euc}^2 . \end{aligned}$$
(B9)

We can bound the spectral norm \(\Vert {\widetilde{H}}\Vert \) as follows:

$$\begin{aligned}{} & {} \Vert {\widetilde{H}}\Vert ^2 = \Vert A^{1/2} + H A^{-1/2}\Vert ^2 \le 2\Vert A^{1/2}\Vert ^2 + 2\Vert H A^{-1/2}\Vert ^2 \\{} & {} \quad \le 2\left( L_{\max } + \frac{\Vert H\Vert ^2}{L_{\min }}\right) . \end{aligned}$$

For \(\Vert H\Vert \), we use the coordinate-wise Lipschitz-smoothness in order to find

$$\begin{aligned}\Vert H\Vert ^2{} & {} \le \Vert H\Vert _F^2 = \sum _{j=1}^d \Vert h_j\Vert ^2 \le \sum _{j=1}^d \big \Vert \nabla g_j(\gamma ^{(t)}_{j-1})\big \Vert ^2\\{} & {} \le \sum _{j=1}^d L_j^2 \le d L_{\max }^2. \end{aligned}$$

Combining the previous inequality with (B8) and (B9), we find:

$$\begin{aligned}&R(\theta ^{(t)}) - R(\theta ^{(t+1)}) \\&\quad \ge \frac{1}{8L_{\max }\left( 1 + d\frac{L_{\max }}{L_{\min }}\right) }\big \Vert g(\theta ^{(t)})\big \Vert ^2 \\&\qquad - \frac{\epsilon _{Euc}^2}{2} \left( \frac{1}{L_{\min }} + \frac{d\left( \frac{L_{\max }}{L_{\min }}\right) ^2}{2L_{\max }\left( 1 + d\frac{L_{\max }}{L_{\min }}\right) }\right) \\&\quad \ge \frac{1}{8L_{\max }\left( 1 + d\frac{L_{\max }}{L_{\min }}\right) }\big \Vert g(\theta ^{(t)})\big \Vert ^2 \\&\qquad - \frac{\epsilon _{Euc}^2}{2} \left( \frac{1}{L_{\min }} + \frac{1}{2L_{\min }}\frac{dL_{\max }/L_{\min }}{1 + d\frac{L_{\max }}{L_{\min }}}\right) \\&\quad \ge \underbrace{\frac{1}{8L_{\max }(1 + d\frac{L_{\max }}{L_{\min }})}}_{=: \kappa }\big \Vert g(\theta ^{(t)})\big \Vert ^2 - \frac{3}{4L_{\min }} \epsilon _{Euc}^2, \end{aligned}$$

where the last step uses that \(\frac{dL_{\max }/L_{\min }}{1 + d\frac{L_{\max }}{L_{\min }}} \le 1\). Using \(\lambda \)-strong convexity by choosing \(\theta _1 = \theta ^{(t)}\) in inequality (9) and minimizing both sides w.r.t. \(\theta _2\) we obtain:

$$\begin{aligned}R(\theta ^{(t)}) - R^\star \le \frac{1}{2\lambda }\Vert g(\theta ^{(t)})\Vert ^2, \end{aligned}$$

which combined with the previous inequality yields the contraction inequality:

$$\begin{aligned} R(\theta ^{(t+1)}) - R^\star \le (R(\theta ^{(t)}) - R^\star )(1 - 2\lambda \kappa ) + \frac{3}{4L_{\min }} \epsilon _{Euc}^2, \end{aligned}$$

and after T iterations we have:

$$\begin{aligned} R(\theta ^{(T)}) - R^\star \le (R(\theta ^{(0)}) - R^\star )(1 - 2\lambda \kappa )^T + \frac{3\epsilon _{Euc}^2}{8 L_{\min }\lambda \kappa }, \end{aligned}$$

which concludes the proof of Theorem 2. To see that the proof still holds for any choice of coordinates satisfying the conditions in the main claim, notice that the computations leading up to Inequality (B9) work all the same if one were to apply a permutation to the coordinates beforehand.

1.3 B.3 Convergence of the parameter error

We state and prove a result about the linear convergence of the parameter error under strong convexity.

Theorem 8

Grant Assumptions 1, 3 and 4. Let \(\theta ^{(T)}\) be the output of Algorithm 1 with constant step-size \(\beta = \frac{2}{\lambda + L},\) an initial iterate \(\theta ^{(0)},\) uniform coordinates sampling \(p_j = 1 / d\) and estimators of the partial derivatives with error vector \(\epsilon (\cdot )\). Then, we have

$$\begin{aligned}{} & {} {\mathbb {E}}\big \Vert \theta ^{(T)} - \theta ^\star \big \Vert _2 \le \big \Vert \theta ^{(0)} - \theta ^\star \big \Vert _2 \Big (1 - \frac{2\beta \lambda L}{d(\lambda + L)}\Big )^T \nonumber \\{} & {} \quad + \frac{\sqrt{d}(\lambda + L)}{\lambda L } \big \Vert \epsilon ( \delta ) \big \Vert _2 \end{aligned}$$
(B10)

with probability at least \(1 - \delta \), where the expectation is w.r.t. the sampling of the coordinates.

Proof

As in the proof of Theorem 1, let \(({\widehat{g}}_j(\theta ))_{j=1}^d\) be the estimators used and introduce the notations

$$\begin{aligned} {\widehat{g}}_t = {\widehat{g}}_{j_t}(\theta ^{(t)}) \quad \text { and } \quad g_t = g_{j_t}(\theta ^{(t)}). \end{aligned}$$

We also condition on the event (B5), which holds with probability \(1 - \delta \), and use the notations \(\epsilon _{Euc} = \Vert \epsilon (\delta )\Vert _2\) and \(\epsilon _j = \epsilon _j(\delta )\). We denote by \(\Vert \cdot \Vert _{L_2}\) the \(L_2\)-norm w.r.t. the distribution of \(j_t\), i.e. for a random variable \(\xi \) we have \(\Vert \xi \Vert _{L_2} = \sqrt{{\mathbb {E}}_{j_t} \Vert \xi \Vert ^2}\). We compute:

$$\begin{aligned}&\big \Vert \theta ^{(t+1)} - \theta ^\star \big \Vert _{L_2} = \big \Vert \theta ^{(t)} - \beta _{j_t} {\widehat{g}}_t e_{j_t} - \theta ^\star \big \Vert _{L_2} \nonumber \\&\le \big \Vert \theta ^{(t)} - \beta _{j_t} g_t e_{j_t} - \theta ^\star \big \Vert _{L_2} + \big \Vert \beta _{j_t}({\widehat{g}}_t - g_t)\big \Vert _{L_2}. \end{aligned}$$
(B11)

We first treat the first term of (B11). In the case of uniform sampling with equal step-sizes \(\beta _j = \beta \), we have:

$$\begin{aligned}{} & {} \big \Vert \theta ^{(t)} - \beta g_t e_{j_t} - \theta ^\star \big \Vert ^2\\{} & {} = \big \Vert \theta ^{(t)} - \theta ^\star \big \Vert ^2 + \beta ^2 g_t^2 - 2\beta \big \langle g_t e_{j_t},\theta ^{(t)} - \theta ^\star \big \rangle . \end{aligned}$$

By taking the expectation w.r.t. the random coordinate \(j_t\) we find:

$$\begin{aligned}&\big \Vert \theta ^{(t)} - \beta g_t e_{j_t} - \theta ^\star \big \Vert _{L_2}^2 = {\mathbb {E}}\big \Vert \theta ^{(t)} - \beta g_t e_{j_t} - \theta ^\star \big \Vert ^2 \\&\quad ={\mathbb {E}}\big \Vert \theta ^{(t)} - \theta ^\star \big \Vert ^2 +\frac{\beta ^2}{d} {\mathbb {E}}\big \Vert g(\theta ^{(t)})\big \Vert ^2 \\&\qquad - 2\frac{\beta }{d} {\mathbb {E}}\big \langle g(\theta ^{(t)}), \theta ^{(t)} - \theta ^\star \big \rangle \\&\quad ={\mathbb {E}}\Vert \theta ^{(t)} \!-\! \theta ^\star \Vert ^2 \!+\!\Big (\frac{\beta }{d}\Big )^2 {\mathbb {E}}\Vert g(\theta ^{(t)})\Vert ^2 \\&\qquad - 2\frac{\beta }{d} {\mathbb {E}}\big \langle g(\theta ^{(t)}), \theta ^{(t)} \!-\! \theta ^\star \big \rangle \\&\qquad + \frac{\beta ^2}{d} {\mathbb {E}}\big \Vert g(\theta ^{(t)})\big \Vert ^2\Big (1 \!-\! \frac{1}{d}\Big ) \\&\quad \le {\mathbb {E}}\big \Vert \theta ^{(t)} \!-\! \theta ^\star \big \Vert ^2\Big (1 \!-\! \frac{2\beta \lambda L}{d(\lambda + L)}\Big ) \!+\! \frac{\beta }{d} \Big (\frac{\beta }{d} \!-\! \frac{2}{\lambda + L}\Big ) {\mathbb {E}}\big \Vert g(\theta ^{(t)})\big \Vert ^2 \\&\qquad + \frac{\beta ^2}{d} {\mathbb {E}}\big \Vert g(\theta ^{(t)})\big \Vert ^2\Big (1 \!-\! \frac{1}{d}\Big ) \\&\quad = {\mathbb {E}}\big \Vert \theta ^{(t)} - \theta ^\star \big \Vert ^2\Big (1 - \frac{2\beta \lambda L}{d(\lambda + L)}\Big ) \\&\qquad + \frac{\beta }{d} \Big (\beta - \frac{2}{\lambda + L}\Big ){\mathbb {E}}\big \Vert g(\theta ^{(t)})\big \Vert ^2 \\&\quad \le {\mathbb {E}}\big \Vert \theta ^{(t)} - \theta ^\star \big \Vert ^2\underbrace{\Big (1 - \frac{2\beta \lambda L}{d(\lambda + L)}\Big )}_{=:\kappa ^2}. \end{aligned}$$

The first inequality is obtained by applying inequality (2.1.24) from Nesterov (2004) (see also Bubeck (2015) Lemma 3.11) and the second one is due to the choice of \(\beta \). We can bound the second term as follows:

$$\begin{aligned}{} & {} \big \Vert {\widehat{g}}_t - g_t\big \Vert ^2_{L_2} = {\mathbb {E}}_{j_t}\big \vert {\widehat{g}}_t - g_t\big \vert ^2 \\{} & {} \quad = \frac{1}{d} \sum _{j=1}^d\big \vert {\widehat{g}}_j(\theta ^{(t)}) - g_j(\theta ^{(t)})\big \vert ^2 \le \frac{\epsilon ^2_{Euc}}{d}. \end{aligned}$$

Combining the latter with the former bound, we obtain the approximate contraction:

$$\begin{aligned} \big \Vert \theta ^{(t+1)} - \theta ^\star \big \Vert _{L_2} \le \kappa \big \Vert \theta ^{(t)} - \theta ^\star \big \Vert _{L_2} + \frac{\beta \epsilon _{Euc}}{\sqrt{d}}. \end{aligned}$$

By iterating this argument on T rounds we find that:

$$\begin{aligned} \big \Vert \theta ^{(T)} - \theta ^\star \big \Vert _{L_2} \le \kappa ^T\big \Vert \theta ^{(0)} - \theta ^\star \big \Vert _{L_2} + \frac{\beta \epsilon _{Euc}}{\sqrt{d}(1 - \kappa )}. \end{aligned}$$

Finally, the following inequality yields the result in the case of uniform sampling:

$$\begin{aligned} \frac{1}{1-\kappa } \le \frac{1 + \sqrt{1 - \frac{2\beta \lambda L}{d(\lambda + L)}}}{\frac{2\beta \lambda L}{d(\lambda + L)}} \le \frac{d(\lambda + L)}{\beta \lambda L}. \end{aligned}$$

\(\square \)

1.4 B.4 Proof of Lemma 1

Let \(\theta \in \Theta \). Using Assumption 1, we have:

$$\begin{aligned}{} & {} \vert \ell (\theta ^{\top } X, Y)\vert \le C_{\ell , 1} + C_{\ell , 2}\vert \theta ^{\top } X- Y\vert ^{q} \\{} & {} \quad \le C_{\ell , 1} + 2^{q-1} C_{\ell , 2}(\vert \theta ^{\top } X\vert ^{q} + \vert Y\vert ^{q}). \end{aligned}$$

Taking the expectation and using Assumption 2 shows that the risk \(R(\theta )\) is well defined (recall that \(q\le 2\)). Next, since \(1\le q \le 2\), simple algebra gives

$$\begin{aligned}&\big \vert \ell ' (\theta ^{\top } X, Y) X_j \big \vert ^{1 + \alpha } \\ {}&\le \big \vert \big (C_{\ell ,1}' + C_{\ell ,2}'\vert \theta ^\top X - Y\vert ^{q-1}\big )X^j \big \vert ^{1 + \alpha } \\ {}&\le 2^\alpha \big (\big \vert C_{\ell ,1}' X^j\big \vert ^{1+\alpha } \\ {}&\quad + (C_{\ell ,2}' (\vert (\theta ^\top X)^{q-1} X^j\vert + \vert Y^{q-1} X^j\vert ))^{1 + \alpha }\big ) \\ {}&\le 2^\alpha \left( \big \vert C_{\ell ,1}' X^j\big \vert ^{1+\alpha } \right. \\ {}&\quad \left. + \left( C_{\ell ,2}' \left( \sum _{k=1}^d\vert \theta _k\vert ^{q-1} \vert (X^k)^{q-1} X^j\vert + \vert Y^{q-1} X^j\vert \right) \right) ^{1 + \alpha }\right) \\ {}&\le 2^\alpha \left( \big \vert C_{\ell ,1}' X^j\big \vert ^{1+\alpha }\right. \\ {}&\left. + 2^\alpha (C_{\ell ,2}')^{1+\alpha } \left( d^\alpha \sum _{k=1}^d\vert \theta _k\vert ^{(q-1)(1+\alpha )} \right. \right. \\ {}&\quad \left. \left. \vert (X^k)^{q-1} X^j\vert ^{1+\alpha } + \vert Y^{q-1} X^j\vert ^{1 + \alpha }\right) \right) . \end{aligned}$$

Given Assumption 2, it is straightforward that \({\mathbb {E}}\vert X^j\big \vert ^{1+\alpha } < \infty \) and \({\mathbb {E}}\vert Y^{q-1} X^j\vert ^{1 + \alpha }< \infty \). Moreover, using a Hölder inequality with exponents \(a = \frac{q(1+\alpha )}{(q-1)(1+\alpha )}\) and \(b = q\) (the case \(q=1\) is trivial) we find:

$$\begin{aligned} {\mathbb {E}}\big \vert (X^k)^{q-1} X^j\big \vert ^{1+\alpha }\le \big ({\mathbb {E}}\big \vert X^k\big \vert ^{q(1+\alpha )}\big )^{1/a} \big ({\mathbb {E}}\big \vert X^j\big \vert ^{q(1+\alpha )}\big )^{1/b}, \end{aligned}$$

which is finite under Assumption 2. This concludes the proof of Lemma 1.

1.5 B.5 Proof of Lemma 2

This proof follows a standard argument from Lugosi and Mendelson (2019a); Geoffrey et al. (2020), in which we use a lemma from Bubeck et al. (2013) in order to control the \((1+\alpha )\)-moment of the block means instead of their variance. Indeed, we know from Lemma 1 that under Assumptions 1 and 2, the gradient coordinates have finite \((1+\alpha )\)-moments, namely \({\mathbb {E}}[\vert \ell '(X^\top \theta , Y) X_j \vert ^{1 + \alpha } ] < +\infty \) for any \(j \in \llbracket d \rrbracket \). Recall that \((\widehat{g}_j^{(k)}(\theta ))_{k \in \llbracket K \rrbracket }\) stands for the block-wise empirical means given by Eq. (15) and introduce the set of non-corrupted block indices given by \({\mathcal {K}}= \{ k \in \llbracket K \rrbracket \;: \; B_k \cap {\mathcal {O}} = \emptyset \}\). We will initially assume that the number of outliers satisfies \(\vert {\mathcal {O}}\vert \le (1 - \varepsilon ) K / 2\) for some \(0< \varepsilon < 1\). Note that, since samples are i.i.d. in \(B_k\) for \(k \in {\mathcal {K}}\), we have \({\mathbb {E}}\big [\widehat{g}_j^{(k)}(\theta )\big ] = g_j(\theta )\). We use the following lemma from Bubeck et al. (2013).

Lemma 7

Let \(Z, Z_1, \ldots , Z_n\) be an i.i.d sequence with \(m_\alpha = {\mathbb {E}}[\vert Z - {\mathbb {E}}Z\vert ^{1 + \alpha }] < +\infty \) for some \(\alpha \in (0, 1]\) and put \({{\bar{Z}}}_n = \frac{1}{n} \sum _{i \in \llbracket n \rrbracket } Z_i\). Then, we have

$$\begin{aligned} {{\bar{Z}}}_n \le {\mathbb {E}}Z + \Big ( \frac{3 m_\alpha }{\delta n^{\alpha }} \Big )^{1 / (1 + \alpha )} \end{aligned}$$

for any \(\delta \in (0, 1)\), with probability at least \(1 - \delta \).
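As a quick illustration of this bound (a Monte Carlo sanity check of ours, not part of the original argument), the following Python sketch draws heavy-tailed Pareto samples, for which only moments of order strictly below 1.5 are finite, and compares the empirical exceedance frequency of the deviation bound with \(\alpha = 0.4\) to the nominal level \(\delta \); the population quantities \({\mathbb {E}}Z\) and \(m_\alpha \) are themselves approximated from a large simulated sample.

```python
import numpy as np

rng = np.random.default_rng(0)

# Lomax/Pareto with shape 1.5: moments of order >= 1.5 are infinite,
# so we may take alpha = 0.4 (finite (1 + alpha)-moment, infinite variance).
shape, alpha, n, delta, n_rep = 1.5, 0.4, 1_000, 0.05, 2_000

big = rng.pareto(shape, size=2_000_000)        # approximate E Z and m_alpha
mean_z = big.mean()
m_alpha = np.mean(np.abs(big - mean_z) ** (1 + alpha))
bound = (3 * m_alpha / (delta * n ** alpha)) ** (1 / (1 + alpha))

sample_means = rng.pareto(shape, size=(n_rep, n)).mean(axis=1)
freq = np.mean(sample_means > mean_z + bound)  # Lemma 7 predicts at most ~delta
print(f"exceedance frequency: {freq:.3f}  (nominal level delta = {delta})")
```

The observed frequency is typically well below \(\delta \), reflecting the fact that the bound of Lemma 7 is conservative.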

Lemma 7 entails that

$$\begin{aligned} \big \vert \widehat{g}_j^{(k)}(\theta ) - g_j(\theta )\big \vert \le \Big ( \frac{3 m_{j, \alpha }(\theta )}{\delta ' (n / K)^\alpha } \Big )^{1 / (1 + \alpha )} =: \eta _{j, \alpha , \delta '}(\theta ) \end{aligned}$$

with probability larger than \(1 - 2 \delta '\), for each \(k \in {\mathcal {K}}\), since block \(B_k\) contains \(n/K\) samples. Now, recalling that \(\widehat{g}_j^{\texttt {MOM} }(\theta )\) is the median of the block means (see (14)), we can upper bound its failure probability as follows:

$$\begin{aligned}&{\mathbb {P}}\Big [ \big \vert \widehat{g}_j^{\texttt {MOM} }(\theta ) - g_j(\theta ) \big \vert \ge \eta _{j, \alpha , \delta '}(\theta ) \Big ] \\&\quad \le {\mathbb {P}}\bigg [ \sum _{k \in \llbracket K \rrbracket } {{{\textbf {1}}}}_{}\Big \{\big \vert \widehat{g}_j^{(k)}(\theta ) - g_j(\theta )\big \vert \ge \eta _{j, \alpha , \delta '}(\theta )\Big \}> K / 2 \bigg ] \\&\quad {\le } {\mathbb {P}}\bigg [ \sum _{k \in {\mathcal {K}}} {{{\textbf {1}}}}_{}\Big \{\big \vert \widehat{g}_j^{(k)}(\theta ) {-} g_j(\theta )\big \vert {\ge } \eta _{j, \alpha , \delta '}(\theta )\Big \} > K / 2 {-} \vert {\mathcal {O}}\vert \bigg ], \end{aligned}$$

since at most \(\vert {\mathcal {O}}\vert \) blocks contain at least one outlier. Since the blocks \(B_k\) are disjoint and contain i.i.d samples for \(k \in {\mathcal {K}}\), we know that

$$\begin{aligned} \sum _{k \in {\mathcal {K}}} {{{\textbf {1}}}}_{}\Big \{\big \vert \widehat{g}_j^{(k)}(\theta ) - g_j(\theta )\big \vert \ge \eta _{j, \alpha , \delta '}(\theta )\Big \} \end{aligned}$$

follows a binomial distribution \(\text {Bin}(\vert {\mathcal {K}}\vert , p)\) with \(p \le 2\delta '\). Using the fact that \(\text {Bin}(\vert {\mathcal {K}}\vert , p)\) is stochastically dominated by \(\text {Bin}(\vert {\mathcal {K}}\vert , 2\delta ')\) and that \({\mathbb {E}}[\text {Bin}(\vert {\mathcal {K}}\vert , 2\delta ')] = 2\delta '\vert {\mathcal {K}}\vert \), we obtain, if \(S \sim \text {Bin}(\vert {\mathcal {K}}\vert , 2\delta ')\), that

$$\begin{aligned}&{\mathbb {P}}\Big [ \big \vert \widehat{g}_j^{\texttt {MOM} }(\theta ) - g_j(\theta ) \big \vert \ge \eta _{j, \alpha , \delta '}(\theta ) \Big ] \\&\quad \le {\mathbb {P}}\big [ S> K/2 - \vert {\mathcal {O}}\vert \big ] \\&\quad = {\mathbb {P}}\big [ S - {\mathbb {E}}S> K/2 - \vert {\mathcal {O}}\vert - 2\delta '\vert {\mathcal {K}}\vert \big ] \\&\quad \le {\mathbb {P}}\big [ S - {\mathbb {E}}S > K (\varepsilon - 4\delta ') / 2 \big ] \\&\quad \le \exp \big (-K (\varepsilon - 4\delta ')^2 / 2 \big ), \end{aligned}$$

where we used the fact that \(\vert {\mathcal {O}}\vert \le (1 - \varepsilon ) K / 2\) and \(\vert {\mathcal {K}}\vert \le K\) for the second inequality and the Hoeffding inequality for the last. This concludes the proof of Lemma 2 for the choice \(\varepsilon = 5/6\) and \(\delta ' = 1/8\).
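To make the object being analyzed concrete, here is a minimal NumPy sketch of the estimator \(\widehat{g}_j^{\texttt {MOM} }(\theta )\) for the least-squares loss, where \(\ell '(\theta ^\top x, y) = \theta ^\top x - y\) (an illustration of ours, not the linlearn implementation): a handful of corrupted rows ruins the empirical mean of the partial derivative but barely affects its median-of-means version.

```python
import numpy as np

def mom_partial_derivative(X, y, theta, j, K, rng):
    """Median-of-means estimate of g_j(theta) = E[(theta^T X - Y) X^j]
    for the least-squares loss, using K blocks of equal size."""
    n = X.shape[0]
    idx = rng.permutation(n)[: (n // K) * K]        # random blocks, drop the remainder
    g_i = (X @ theta - y) * X[:, j]                 # per-sample gradient coordinate
    block_means = g_i[idx].reshape(K, -1).mean(axis=1)
    return np.median(block_means)

rng = np.random.default_rng(0)
n, d = 10_000, 5
X = rng.normal(size=(n, d))
y = X @ np.ones(d) + rng.normal(size=n)
X[:10], y[:10] = 100.0, -100.0                      # corrupt 10 rows

theta = np.zeros(d)                                 # here g_0(0) = -E[Y X^0] = -1
g_mean = ((X @ theta - y) * X[:, 0]).mean()
g_mom = mom_partial_derivative(X, y, theta, j=0, K=30, rng=rng)
print(f"empirical mean: {g_mean:+.2f}   MOM: {g_mom:+.2f}   target: -1.00")
```

With 10 corrupted rows and \(K = 30\) blocks, fewer than \(K/2\) blocks are contaminated, which is exactly the regime covered by the proof above.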

1.6 B.6 Proof of Proposition 3

Step 1 First, we fix \(\theta \in \Theta \) and bound \(\big \vert \widehat{g}_j^{\texttt {MOM} }(\theta ) - g_j(\theta )\big \vert \) in terms of quantities depending only on \({\widetilde{\theta }}\), the closest point to \(\theta \) in an \(\varepsilon \)-net. Recall that \(\Delta \) is the diameter of the parameter set \(\Theta \) and let \(\varepsilon > 0\). There exists an \(\varepsilon \)-net covering \(\Theta \) with cardinality no more than \((3\Delta /2\varepsilon )^d\), i.e., a set \(N_{\varepsilon }\) such that for all \(\theta \in \Theta \) there exists \({\widetilde{\theta }} \in N_{\varepsilon }\) with \(\Vert {\widetilde{\theta }} - \theta \Vert \le \varepsilon \). Fix \(\theta \in \Theta \) and \(j\in \llbracket d \rrbracket \); we wish to bound the quantity \(\big \vert \widehat{g}_j^{\texttt {MOM} }(\theta ) - g_j(\theta )\big \vert \). By definition of \(N_{\varepsilon }\), there exists \({\widetilde{\theta }} \in N_{\varepsilon }\) such that \(\Vert {\widetilde{\theta }} - \theta \Vert \le \varepsilon \), which we can use as follows:

$$\begin{aligned} \big \vert \widehat{g}_j^{\texttt {MOM} }(\theta ) - g_j(\theta )\big \vert&\le \big \vert \widehat{g}_j^{\texttt {MOM} }(\theta ) - g_j({\widetilde{\theta }})\big \vert + \big \vert g_j({\widetilde{\theta }}) - g_j(\theta )\big \vert \nonumber \\&\le \big \vert \widehat{g}_j^{\texttt {MOM} }(\theta ) - g_j({\widetilde{\theta }})\big \vert + L_j \varepsilon , \end{aligned}$$
(B12)

where we used the coordinate-wise Lipschitz constant \(L_j\) of the gradient to bound the second term. We now focus on the first term. Introducing the notation \(g_j^i(\theta ) = \ell '(\theta ^\top X_i, Y_i)X_i^j\), we have

$$\begin{aligned} g_j^i(\theta ) = \ell '({\widetilde{\theta }}^{\top } X_i, Y_i)X_i^j + \underbrace{(\ell '(\theta ^{ \top } X_i, Y_i) - \ell '({\widetilde{\theta }}^{\top } X_i, Y_i))X_i^j}_{=:\Delta _i}. \end{aligned}$$

Let \((B_k)_{k\in \llbracket K \rrbracket }\) be the blocks used to compute the \(\texttt {MOM} \) estimator, with associated block means \(\widehat{g}_j^{(k)}(\theta )\) and \(\widehat{g}_j^{(k)}({\widetilde{\theta }})\). Notice that the \(\texttt {MOM} \) estimator is monotone non-decreasing w.r.t. each of the entries \(g_j^i(\theta )\) when the others are fixed. Without loss of generality, assume that \(\widehat{g}_j^{\texttt {MOM} }(\theta ) - g_j({\widetilde{\theta }}) \ge 0\); then we have:

(B13)

where is the \(\texttt {MOM} \) estimator obtained using the entries \(\ell '\big ({\widetilde{\theta }}^{\top } X_i, Y_i\big )X_i^j + \varepsilon \gamma \Vert X_i\Vert ^2 = g_j^i({\widetilde{\theta }}) + \varepsilon \gamma \Vert X_i\Vert ^2\) instead of \(g_j^i(\theta )\). Note that no longer depends on \(\theta \) except through the fact that \({\widetilde{\theta }}\) is chosen in \(N_{\varepsilon }\) so that \(\big \Vert {\widetilde{\theta }} - \theta \big \Vert \le \varepsilon \). Indeed, using the Lipschitz smoothness of the loss function and a Cauchy-Schwarz inequality we find that:

$$\begin{aligned} \vert \Delta _i\vert \le \gamma \Vert \theta - {\widetilde{\theta }}\Vert \cdot \Vert X_i\Vert \cdot \vert X_i^j\vert \le \varepsilon \gamma \Vert X_i\Vert ^2. \end{aligned}$$

Step 2

We now use the concentration property of \(\texttt {MOM} \) to bound the quantity which is in terms of \({\widetilde{\theta }}\). The samples \((g_j^i({\widetilde{\theta }}) + \varepsilon \gamma \Vert X_i\Vert ^2)_{i \in \llbracket n \rrbracket }\) are independent and distributed according to the random variable \(\ell '({\widetilde{\theta }}^{\top } X, Y)X^j + \varepsilon \gamma \Vert X\Vert ^2\). Denote \(\overline{L} = \gamma {\mathbb {E}}\Vert X\Vert ^2\) and for \(k \in \llbracket K \rrbracket \) let \(\widehat{g}_j^{(k)}({\widetilde{\theta }}) = \frac{K}{n}\sum _{i\in B_k} g_j^i({\widetilde{\theta }})\) and \(\widehat{L}^{(k)} = \frac{K}{n}\sum _{i\in B_k} \gamma \Vert X_i\Vert ^2\). We use Lemma 7 for each of these pairs of means to obtain that with probability at least \(1 - \delta '/2\):

$$\begin{aligned} \big \vert \widehat{g}_j^{(k)}({\widetilde{\theta }}) - g_j({\widetilde{\theta }}) \big \vert \le \Big ( \frac{6 m_{j, \alpha }({\widetilde{\theta }})}{\delta ' (n / K)^\alpha } \Big )^{1 / (1 + \alpha )} =: \eta _{j, \alpha , \delta '/2}({\widetilde{\theta }}), \end{aligned}$$

and with probability at least \(1 - \delta '/2\)

$$\begin{aligned} \big \vert \widehat{L}^{(k)} - \overline{L} \big \vert \le \Big ( \frac{6 m_{L, \alpha }}{\delta ' (n / K)^\alpha } \Big )^{1 / (1 + \alpha )} =: \eta _{L, \alpha , \delta '/2}, \end{aligned}$$

where \(m_{L, \alpha } = {\mathbb {E}}\vert \gamma \Vert X\Vert ^2 - \overline{L} \vert ^{1+\alpha }\). Hence for all \(k \in \llbracket K \rrbracket \)

$$\begin{aligned} {\mathbb {P}}&\big ( \big \vert \widehat{g}_j^{(k)}({\widetilde{\theta }}) + \varepsilon \widehat{L}^{(k)} - g_j({\widetilde{\theta }}) \big \vert> \eta _{j, \alpha , \delta '/2}({\widetilde{\theta }}) \\&\quad + \varepsilon (\overline{L} + \eta _{L, \alpha , \delta '/2}) \big ) \\ \quad&\le {\mathbb {P}}\big ( \big \vert \widehat{g}_j^{(k)}({\widetilde{\theta }}) - g_j({\widetilde{\theta }}) \big \vert> \eta _{j, \alpha , \delta '/2}({\widetilde{\theta }}) \big )\\&\quad + {\mathbb {P}}\big ( \big \vert \widehat{L}^{(k)} - \overline{L} \big \vert > \eta _{L, \alpha , \delta '/2} \big )\\ \quad&\le \delta '/2 + \delta '/2 = \delta '. \end{aligned}$$

Now defining the Bernoulli variables

$$\begin{aligned}{} & {} U_k:= {{{\textbf {1}}}}_{}\Big \{\big \vert \widehat{g}_j^{(k)}({\widetilde{\theta }}) + \varepsilon \widehat{L}^{(k)} - g_j({\widetilde{\theta }}) \big \vert > \eta _{j, \alpha , \delta '/2}({\widetilde{\theta }}) \\{} & {} + \varepsilon \big (\overline{L} + \eta _{L, \alpha , \delta '/2}\big ) \Big \}, \end{aligned}$$

we have just seen that they have success probability at most \(\delta '\); moreover

since at most \(\vert {\mathcal {O}}\vert \) blocks contain at least one outlier. Since the blocks \(B_k\) are disjoint and contain i.i.d samples for \(k \in {\mathcal {K}}\), we know that \(\sum _{k \in {\mathcal {K}}} U_k\) follows a binomial distribution \(\text {Bin}(\vert {\mathcal {K}}\vert , p)\) with \(p \le \delta '\). Using the fact that \(\text {Bin}(\vert {\mathcal {K}}\vert , p)\) is stochastically dominated by \(\text {Bin}(\vert {\mathcal {K}}\vert , \delta ')\) and that \({\mathbb {E}}[\text {Bin}(\vert {\mathcal {K}}\vert , \delta ')] = \delta '\vert {\mathcal {K}}\vert \), we obtain, if \(S \sim \text {Bin}(\vert {\mathcal {K}}\vert , \delta ')\), that

where we used the condition \(\vert {\mathcal {O}}\vert \le (1 - \varepsilon ') K / 2\) and \(\vert {\mathcal {K}}\vert \le K\) for the second inequality and Hoeffding's inequality for the last. To conclude, we choose \(\varepsilon ' = 5/6\) and \(\delta ' = 1/4\) and combine (B12), (B13) and the last inequality, in which we take \(K = \lceil 18\log (1/\delta ) \rceil \), and use a union bound argument to obtain that, with probability at least \(1 - \delta \), for all \(j \in \llbracket d \rrbracket \)

(B14)

Step 3 We use the \(\varepsilon \)-net to obtain a uniform bound. For \(\theta \in \Theta \), denote by \({\widetilde{\theta }}(\theta ) \in N_{\varepsilon }\) the closest point in \(N_{\varepsilon }\), which in particular satisfies \(\Vert {\widetilde{\theta }}(\theta ) - \theta \Vert \le \varepsilon \). Following the previous arguments, we write

Here, we make a union bound argument over \({\widetilde{\theta }} \in N_{\varepsilon }\) for the inequality (B14) and choose \(\varepsilon = n^{-\alpha /(1+\alpha )}\) to obtain the final result concluding the proof of Proposition 3.

1.7 B.7 Proof of Proposition 4

This proof reuses arguments from the proof of Theorem 2 in Lecué et al. (2020). We wish to bound \(\big \vert \widehat{g}_j^{\texttt {MOM} }(\theta ) - g_j(\theta ) \big \vert \) with high probability and uniformly on \(\theta \in \Theta \). Fix \(\theta \in \Theta \) and \(j \in \llbracket d \rrbracket \), we have \(\widehat{g}_j^{\texttt {MOM} }(\theta ) = {{\,\textrm{median}\,}}\big ( \widehat{g}_j^{(1)}(\theta ), \dots , \widehat{g}_j^{(K)}(\theta ) \big )\) with \(\widehat{g}_j^{(k)}(\theta ) = \frac{K}{n}\sum _{i\in B_k} g^i_j(\theta )\) where the blocks \(B_1, \dots , B_K\) constitute a partition of \(\llbracket n \rrbracket \).

Define the function \(\phi (t) = (t-1){{{\textbf {1}}}}_{1\le t \le 2} + {{{\textbf {1}}}}_{t > 2}\), let \({\mathcal {K}}= \{ k\in \llbracket K \rrbracket , \,\, B_k \cap {\mathcal {O}}= \emptyset \}\) and \({\mathcal {J}} = \bigcup _{k \in {\mathcal {K}}} B_k\). Thanks to the inequality \(\phi (t) \ge {{{\textbf {1}}}}_{t \ge 2}\), we have:

$$\begin{aligned}&\sup _{\theta \in \Theta } \sum _{k=1}^K {{{\textbf {1}}}}_{}\Big \{\big \vert \widehat{g}_j^{(k)}(\theta ) - g_j(\theta )\big \vert > x\Big \} \\&\quad \le \sup _{\theta \in \Theta } \sum _{k\in {\mathcal {K}}} {\mathbb {E}}\big [ \phi \big (2\big \vert \widehat{g}_j^{(k)}(\theta ) - g_j(\theta )\big \vert / x\big )\big ] + \vert {\mathcal {O}}\vert \\&\quad + \sup _{\theta \in \Theta } \sum _{k\in {\mathcal {K}}} \Big (\phi \big (2\big \vert \widehat{g}_j^{(k)}(\theta ) - g_j(\theta )\big \vert / x\big ) \\&\quad - {\mathbb {E}}\big [\phi \big (2\big \vert \widehat{g}_j^{(k)}(\theta ) - g_j(\theta )\big \vert / x\big )\big ]\Big ) . \end{aligned}$$

Besides, the inequality \(\phi (t) \le {{{\textbf {1}}}}_{t \ge 1}\), an application of Markov’s inequality and Lemma 7 yield:

$$\begin{aligned}{} & {} {\mathbb {E}}\big [\phi \big (2\big \vert \widehat{g}_j^{(k)}(\theta ) - g_j(\theta )\big \vert / x\big )\big ] \le {\mathbb {P}}\big ( \big \vert \widehat{g}_j^{(k)}(\theta ) - g_j(\theta )\big \vert \ge x/2 \big ) \\{} & {} \quad \le \frac{3 m_{\alpha , j}(\theta )}{(x/2)^{1 + \alpha } (n/K)^{\alpha }}. \end{aligned}$$

Therefore, recalling that we defined \(M_{\alpha , j}:= \sup _{\theta \in \Theta } m_{\alpha , j}(\theta )\) we have

$$\begin{aligned} \sup _{\theta \in \Theta }&\sum _{k=1}^K {{{\textbf {1}}}}_{}\Big \{\big \vert \widehat{g}_j^{(k)}(\theta ) - g_j(\theta )\big \vert > x\Big \} \\&\quad \le K\bigg (\frac{3 M_{\alpha , j}}{(x/2)^{1 + \alpha } (n/K)^{\alpha }} + \frac{\vert {\mathcal {O}}\vert }{K} \\&+ \sup _{\theta \in \Theta } \frac{1}{K} \left( \sum _{k\in {\mathcal {K}}} \phi \big (2\big \vert \widehat{g}_j^{(k)}(\theta ) - g_j(\theta )\big \vert / x\big ) \right. \\&\quad \left. - {\mathbb {E}}\big [\phi \big (2\big \vert \widehat{g}_j^{(k)}(\theta ) - g_j(\theta )\big \vert / x\big )\big ] \right) \bigg ). \end{aligned}$$

Now, since \(0\le \phi (t)\le 1 \) for all \(t\), McDiarmid's inequality gives, with probability \(\ge 1 - \exp (-2y^2 K)\), that:

$$\begin{aligned} \sup _{\theta \in \Theta }&\frac{1}{K} \Big ( \sum _{k\in {\mathcal {K}}} \phi \big (2\big \vert \widehat{g}_j^{(k)}(\theta ) - g_j(\theta )\big \vert / x\big ) \\&\quad - {\mathbb {E}}\big [\phi \big (2\big \vert \widehat{g}_j^{(k)}(\theta ) - g_j(\theta )\big \vert / x\big )\big ] \Big ) \le \\&{\mathbb {E}}\bigg [ \sup _{\theta \in \Theta } \frac{1}{K} \Big ( \sum _{k\in {\mathcal {K}}} \phi \big (2\big \vert \widehat{g}_j^{(k)}(\theta ) \!-\! g_j(\theta )\big \vert / x\big ) \\&\quad - {\mathbb {E}}\big [\phi \big (2\big \vert \widehat{g}_j^{(k)}(\theta ) \!-\! g_j(\theta )\big \vert / x\big )\big ] \Big ) \bigg ] + y. \end{aligned}$$

Using a simple symmetrization argument (see for instance Lemma 11.4 in Boucheron et al. (2013)) we find:

$$\begin{aligned} {\mathbb {E}}\bigg [&\sup _{\theta \in \Theta } \frac{1}{K}\Big ( \sum _{k\in {\mathcal {K}}} \phi \big (2\big \vert \widehat{g}_j^{(k)}(\theta ) - g_j(\theta )\big \vert / x\big ) \\ {}&\quad - {\mathbb {E}}\big [\phi \big (2\big \vert \widehat{g}_j^{(k)}(\theta ) - g_j(\theta )\big \vert / x\big )\big ] \Big ) \bigg ] \\ {}&\le 2{\mathbb {E}}\bigg [ \sup _{\theta \in \Theta } \frac{1}{K} \sum _{k\in {\mathcal {K}}} \varepsilon _k \phi \big (2\big \vert \widehat{g}_j^{(k)}(\theta ) - g_j(\theta )\big \vert / x\big ) \bigg ], \end{aligned}$$

where the \(\varepsilon _k\)s are independent Rademacher variables. Since \(\phi \) is 1-Lipschitz and satisfies \(\phi (0)=0\) we can use the contraction principle [see Theorem 11.6 in Boucheron et al. (2013)] followed by another symmetrization step to find

$$\begin{aligned}&2{\mathbb {E}}\bigg [ \sup _{\theta \in \Theta } \frac{1}{K} \!\sum _{k\in {\mathcal {K}}}\! \varepsilon _k \phi \big (2\big \vert \widehat{g}_j^{(k)}(\theta ) \!-\! g_j(\theta )\big \vert \big / x\big ) \bigg ] \\&\quad \le \! 4{\mathbb {E}}\bigg [ \sup _{\theta \in \Theta } \frac{1}{K} \!\sum _{k\in {\mathcal {K}}}\! \varepsilon _k \big \vert \widehat{g}_j^{(k)}(\theta ) \!-\! g_j(\theta )\big \vert \big / x \bigg ]\\&\quad \le \frac{8}{x n} {\mathbb {E}}\bigg [ \sup _{\theta \in \Theta } \sum _{i\in {\mathcal {J}}} \varepsilon _i g^{i}_j(\theta ) \bigg ] \le \frac{8 {\mathcal {R}}_j(\Theta )}{x n }. \end{aligned}$$

Taking \(\vert {\mathcal {O}}\vert \le (1 - \varepsilon )K/2\), we find that with probability \(\ge 1 - \exp (-2y^2 K)\)

$$\begin{aligned}&\sup _{\theta \in \Theta } \sum _{k=1}^K {{{\textbf {1}}}}_{}\Big \{\big \vert \widehat{g}_j^{(k)}(\theta ) - g_j(\theta )\big \vert > x\Big \} \\&\quad \le K\bigg (\frac{3 M_{\alpha , j}}{(x/2)^{1 + \alpha } (n/K)^{\alpha }} + \frac{\vert {\mathcal {O}}\vert }{K} + \frac{8{\mathcal {R}}_j(\Theta )}{x n }\bigg ). \end{aligned}$$

Now by choosing \(y = 1/4 - \vert {\mathcal {O}}\vert /K\) and \(x = \max \Big (\Big (\frac{36M_{\alpha , j}}{(n/K)^\alpha }\Big )^{1/(1+\alpha )}, \frac{64 {\mathcal {R}}_j(\Theta )}{n}\Big )\), we obtain the deviation bound:

$$\begin{aligned}&{\mathbb {P}}\bigg (\sup _{\theta \in \Theta } \big \vert \widehat{g}_j^{\texttt {MOM} }(\theta ) - g_j(\theta )\big \vert \\&\quad \ge \max \Big (\Big (\frac{36M_{\alpha , j}}{(n/K)^\alpha }\Big )^{1/(1+\alpha )} , \frac{64{\mathcal {R}}_j(\Theta )}{n}\Big )\bigg ) \\&\quad \le {\mathbb {P}}\bigg (\sup _{\theta \in \Theta } \sum _{k=1}^K {{{\textbf {1}}}}_{}\Big \{\big \vert \widehat{g}_j^{(k)}(\theta ) \!-\! g_j(\theta )\big \vert \!>\! x\Big \} \!>\! K/2\bigg ) \\&\quad \le \exp (-2 (\varepsilon - 1/2)^2 K/4)\\&\quad \le \exp (-K/18), \end{aligned}$$

where the last inequality comes from the choice \(\varepsilon = 5/6\). A union bound argument extends the previous inequality to all \(j\in \llbracket d \rrbracket \) simultaneously, at the cost of a factor \(d\) in the failure probability.

Finally, assuming that \(X^j\) has finite fourth moment for all \(j\in \llbracket d \rrbracket \), we can control the Rademacher complexity. In this part, we assume without loss of generality that \({\mathcal {I}}= \llbracket n \rrbracket \). We first write

$$\begin{aligned} {\mathcal {R}}_j(\Theta )&= {\mathbb {E}}\bigg [ \sup _{\theta \in \Theta } \sum _{i=1}^n \varepsilon _i \ell '(\theta ^\top X_i, Y_i)X_i^j \bigg ] \\&= {\mathbb {E}}\bigg [\sum _{i=1}^n \varepsilon _i \ell '(0, Y_i)X_i^j + \sup _{\theta \in \Theta } \\&\quad \sum _{i=1}^n \varepsilon _i (\ell '(\theta ^\top X_i, Y_i) - \ell '(0, Y_i))X_i^j \bigg ]. \end{aligned}$$

Denote \(\phi _i(t) = (\ell '(t, Y_i) - \ell '(0, Y_i))X_i^j\) and notice that \({\mathbb {E}}\big [\sum _{i=1}^n \varepsilon _i \ell '(0, Y_i)X_i^j\big ] = 0\). Notice also that \(\phi _i(0) = 0\) and \(\phi _i\) is \(\gamma \vert X_i^j\vert \)-Lipschitz for all i. We use a variant of the contraction principle adapted to our case in which functions with different Lipschitz constants appear. We use Lemma 11.7 from Boucheron et al. (2013) and adapt the proof of their Theorem 11.6 to make the following estimations:

$$\begin{aligned}&{\mathbb {E}}\bigg [\sup _{\theta \in \Theta } \sum _{i=1}^n \varepsilon _i \phi _i(\theta ^\top X_i) \bigg ] \\&= {\mathbb {E}}\bigg [ {\mathbb {E}}\bigg [\sup _{\theta \in \Theta } \sum _{i=1}^{n-1}\varepsilon _i \phi _i(\theta ^\top X_i) \\&\quad + \varepsilon _n \phi _n(\theta ^\top X_n) \Big \vert (\varepsilon _i)_{i=1}^{n-1}, (X_i, Y_i)_{i\in \llbracket n \rrbracket } \bigg ] \bigg ] \\&\le {\mathbb {E}}\bigg [ {\mathbb {E}}\bigg [\sup _{\theta \in \Theta } \sum _{i=1}^{n-1}\varepsilon _i \phi _i(\theta ^\top X_i) \\&\quad + \varepsilon _n \gamma \vert X_n^j\vert \theta ^\top X_n \Big \vert (\varepsilon _i)_{i=1}^{n-1}, (X_i, Y_i)_{i\in \llbracket n \rrbracket } \bigg ] \bigg ] \\&= {\mathbb {E}}\bigg [\sup _{\theta \in \Theta } \sum _{i=1}^{n-1}\varepsilon _i \phi _i(\theta ^\top X_i) + \varepsilon _n \gamma \vert X_n^j\vert \theta ^\top X_n \bigg ]. \end{aligned}$$

By iterating the previous argument n times we find:

$$\begin{aligned} {\mathbb {E}}&\bigg [\sup _{\theta \in \Theta } \sum _{i=1}^n \varepsilon _i \phi _i(\theta ^\top X_i) \bigg ] \\&\quad \le {\mathbb {E}}\bigg [\sup _{\theta \in \Theta } \sum _{i=1}^{n}\varepsilon _i \gamma \vert X_i^j\vert \theta ^\top X_i \bigg ]. \end{aligned}$$

Now recalling that the diameter of \(\Theta \) is \(\Delta \), we use Lemma 8 below with \(p=1\) to bound the previous quantity as:

$$\begin{aligned}&{\mathbb {E}}\bigg [\sup _{\theta \in \Theta } \sum _{i=1}^{n}\varepsilon _i \gamma \vert X_i^j\vert \theta ^\top X_i \bigg ]\\&\quad = \gamma {\mathbb {E}}\bigg [ \sup _{\theta \in \Theta } \bigg \langle \theta , \sum _{i=1}^{n} \varepsilon _i X_i\vert X_i^j\vert \bigg \rangle \bigg ] \\&\quad \le \gamma \Delta {\mathbb {E}}\bigg [ {\mathbb {E}}\bigg [ \bigg \Vert \sum _{i=1}^{n} \varepsilon _i X_i\vert X_i^j\vert \bigg \Vert _1\Big \vert (X_i)_{i\in \llbracket n \rrbracket } \bigg ]\bigg ] \\&\quad \le \gamma \Delta C_{\alpha } {\mathbb {E}}\bigg [ \sum _{i=1}^{n} \Vert X_i \Vert ^{1+\alpha } \vert X_i^j\vert ^{1+\alpha } \bigg ]^{1/(1+\alpha )} \\&\quad \le \gamma \Delta C_{\alpha } \bigg (n {\mathbb {E}}\big [(X^j)^{2(1+\alpha )}\big ]^{1/2} \sum _{k\in \llbracket d \rrbracket } {\mathbb {E}}\big [(X^k)^{2(1+\alpha )}\big ]^{1/2}\bigg )^{1/(1+\alpha )}, \end{aligned}$$

where we used a Cauchy–Schwarz inequality in the last step, which concludes the proof of Proposition 4. \(\square \)

Lemma 8

(Khintchine inequality variant) Let \(\alpha \in (0,1]\), \(p>0\) and \(n \in {\mathbb {N}}\), let \((x_i)_{i\in \llbracket n \rrbracket }\) be real numbers and let \((\varepsilon _i)_{i\in \llbracket n \rrbracket }\) be i.i.d Rademacher random variables. Then we have the inequality:

$$\begin{aligned} {\mathbb {E}}\bigg [ \bigg \vert \sum _{i=1}^n \varepsilon _i x_i\bigg \vert ^p\bigg ]^{1/p} \le B_{p,\alpha } \bigg (\sum _{i=1}^n \vert x_i\vert ^{1+\alpha }\bigg )^{1/(1+\alpha )} \end{aligned}$$

with the constant \(B_{p,\alpha }:= 2p \Big (\frac{1+\alpha }{\alpha }\Big )^{\alpha p/(1+\alpha ) - 1} \Gamma \Big (\frac{\alpha p}{1+\alpha }\Big )\). Moreover, for \(p=1\) the constant \(B_{1,\alpha }\) is bounded for any \(\alpha \ge 0\).

Proof

This proof is a generalization of Lemma 4.1 from Ledoux and Talagrand (1991) and uses similar methods. For all \(\lambda > 0\) we have:

$$\begin{aligned} {\mathbb {E}}\exp \Big (\lambda \sum _i \varepsilon _i x_i\Big )&= \prod _i {\mathbb {E}}\exp (\lambda \varepsilon _i x_i) = \prod _i \cosh (\lambda x_i) \\&\le \prod _i \exp \Big ( \frac{\vert \lambda x_i\vert ^{1+\alpha }}{1+\alpha }\Big ) \\&= \exp \Big ( \sum _i \frac{\vert \lambda x_i\vert ^{1+\alpha }}{1+\alpha }\Big ), \end{aligned}$$

where we used the inequality \(\cosh (u) \le \exp \Big ( \frac{\vert u\vert ^{1+\alpha }}{1+\alpha }\Big )\), valid for all \(u\in {\mathbb {R}}\), which can be proven quickly. Since both functions are even, fix \(u>0\) and define \(f_u(\alpha )=\exp \Big ( \frac{\vert u\vert ^{1+\alpha }}{1+\alpha }\Big ) - \cosh (u)\). One can show that \(f_u\) is monotone in \(\alpha \) on \([0, 1]\) separately for \(u\in (0,\sqrt{e})\) and \(u \in (e, +\infty )\), and notice that \(f_u(0)\) and \(f_u(1)\) are both non-negative for all \(u >0\) thanks to the well-known inequality \(\cosh (u)\le e^{u^2/2}\). Therefore, the inequality holds for \(u\in (0,\sqrt{e})\) and \(u \in (e, +\infty )\). Finally, for \(u\in (\sqrt{e}, e)\), the function \(f_u(\alpha )\) reaches its minimum at \(\alpha = 1/\log (u) - 1\), with value \(f_u(1/\log (u) - 1) = u^e - \cosh (u)\). Taking logarithms, \(u^e \ge \cosh (u) \iff \log (1+e^{2u})\le u + \log (2) + e\log (u)\); since the derivatives satisfy \(\frac{2}{1+e^{-2u}}\le 2 \le 1+e/u\) for \(u\in (\sqrt{e}, e)\) and since \(e^{e/2}\ge \cosh (\sqrt{e})\), the desired inequality follows by integration.

By homogeneity, we can focus on the case \(\big (\sum _{i=1}^n \vert x_i\vert ^{1+\alpha }\big )^{1/(1+\alpha )} = 1\). We compute:

$$\begin{aligned} {\mathbb {E}}\Big \vert \sum _i \varepsilon _i x_i\Big \vert ^p&= \int _0^{+\infty } {\mathbb {P}}\Big (\Big \vert \sum _i \varepsilon _i x_i\Big \vert ^p > t\Big )dt \\&\le 2\int _0^{+\infty } \exp \Big (\frac{\lambda ^{1+\alpha }}{1+\alpha } -\lambda t^{1/p}\Big ) dt \\&= 2\int _0^{+\infty } \exp \Big (-\frac{\alpha }{1+\alpha } u^{(1+\alpha )/\alpha }\Big ) du^p \\&= 2p \Big (\frac{1+\alpha }{\alpha }\Big )^{\alpha p/(1+\alpha ) - 1} \Gamma \Big (\frac{\alpha p}{1+\alpha }\Big ) = B_{p,\alpha }^p, \end{aligned}$$

where we used the previous inequality and chose \(\lambda = (t^{1/p})^{1/\alpha }\) in the last step. This proves the main inequality. Finally, it is easy to see that \(B_{1,\alpha }\) is bounded for large values of \(\alpha \), while for \(\alpha \sim 0\) it is a consequence of the fact that \(\Gamma (x) \sim 1/x\) near 0 and of the limit \(x^x \rightarrow 1\) when \(x \rightarrow 0^+\). \(\square \)
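Both elementary facts used in this proof are easy to check numerically; the short sketch below (ours, for illustration only) verifies \(\cosh (u) \le \exp \big (\vert u\vert ^{1+\alpha }/(1+\alpha )\big )\) on a grid and evaluates \(B_{1,\alpha }\) over a range of \(\alpha \), confirming that it stays bounded as \(\alpha \rightarrow 0^+\).

```python
import numpy as np
from math import gamma

# Check cosh(u) <= exp(|u|^(1+alpha)/(1+alpha)) on a grid of u and alpha.
u = np.linspace(-10.0, 10.0, 2001)
for alpha in (0.1, 0.3, 0.5, 0.7, 1.0):
    assert np.all(np.cosh(u) <= np.exp(np.abs(u) ** (1 + alpha) / (1 + alpha)) + 1e-12)

def B(p, alpha):
    """Constant B_{p,alpha} from Lemma 8."""
    e = alpha * p / (1 + alpha)
    return 2 * p * ((1 + alpha) / alpha) ** (e - 1) * gamma(e)

print([round(B(1, a), 3) for a in (1e-4, 1e-2, 0.1, 0.5, 1.0)])  # remains bounded as alpha -> 0
```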

1.8 B.8 Proof of Lemma 3

As previously, Lemma 1 along with Assumptions 1 and 2 guarantee that the gradient coordinates have finite \((1+\alpha )\)-moments. From here, Lemma 3 is a direct application of Lemma 9 stated and proved below. In the following lemma, for any sequence \((z_i)_{i=1}^N\) of real numbers, \((z_i^*)_{i=1}^N\) denotes a non-decreasing reordering of it.

Lemma 9

Let \({\widetilde{X}}_1, \ldots , {\widetilde{X}}_N, {\widetilde{Y}}_1, \dots , {\widetilde{Y}}_N\) denote an \(\eta \)-corrupted i.i.d sample from a random variable \(X\) with expectation \(\mu = {\mathbb {E}}X\) and with finite centered moment of order \(1 + \gamma \), namely \({\mathbb {E}}\vert X - \mu \vert ^{1+\gamma } = M < \infty \) for some \(0 < \gamma \le 1\). Denote by \({\widehat{\mu }}\) the \(\epsilon \)-trimmed mean estimator computed as \({\widehat{\mu }}=\frac{1}{N}\sum _{i=1}^N \phi _{\alpha , \beta }({\widetilde{X}}_i)\) with \(\phi _{\alpha , \beta }(x) = \max (\alpha , \min (x, \beta ))\) and the thresholds \(\alpha = {\widetilde{Y}}^*_{\epsilon N}\) and \(\beta = {\widetilde{Y}}^*_{(1-\epsilon ) N}\). Let \(1 > \delta \ge e^{-N}/4\). Taking \(\epsilon = 8\eta +12 \frac{\log (4/\delta )}{N}\), we have

$$\begin{aligned} \vert {\widehat{\mu }} - \mu \vert \le 7 M^{\frac{1}{1 + \gamma }}(\epsilon /2)^{\frac{\gamma }{1 + \gamma }} \end{aligned}$$
(B15)

with probability at least \(1 - \delta \).
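A minimal sketch of this estimator (ours, written under the two-sample convention of the lemma; not the linlearn implementation) is given below: the truncation levels are approximate empirical quantiles of the auxiliary sample \(({\widetilde{Y}}_i)\), and the estimate is the average of the clipped sample \(({\widetilde{X}}_i)\). Heavy-tailed Student data with a small fraction of corrupted entries is used purely for illustration.

```python
import numpy as np

def trimmed_mean(x_tilde, y_tilde, eps):
    """Trimmed mean of Lemma 9: thresholds alpha, beta are (approximate)
    eps- and (1 - eps)-order statistics of y_tilde; x_tilde is clipped and averaged."""
    n = len(y_tilde)
    y_sorted = np.sort(y_tilde)
    lo = y_sorted[int(eps * n)]
    hi = y_sorted[min(int((1 - eps) * n), n - 1)]
    return np.clip(x_tilde, lo, hi).mean()

rng = np.random.default_rng(0)
N, eta, delta = 5_000, 0.01, 0.01
data = rng.standard_t(df=2.5, size=2 * N)      # heavy tails: low-order moments only, true mean 0
data[: int(eta * 2 * N)] = 1e6                 # eta-corruption of the sample
x_tilde, y_tilde = data[:N], data[N:]

eps = 8 * eta + 12 * np.log(4 / delta) / N
print(f"trimmed mean: {trimmed_mean(x_tilde, y_tilde, eps):+.3f}   plain mean: {data.mean():+.1f}")
```

The corrupted entries are clipped to the quantile thresholds, so their contribution to the average is bounded, in line with Step 3 of the proof below.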

Proof

This proof goes along the lines of the proof of Theorem 1 from Lugosi and Mendelson (2021), with the main difference that only the \((1+\gamma )\)-moment is used instead of the variance. Denote by \(X\) the random variable whose expectation \(\mu = {\mathbb {E}}X\) is to be estimated and set \(\overline{X} = X - \mu \). Let \(X_1, \dots , X_N, Y_1, \dots , Y_N\) be the original uncorrupted i.i.d. sample from \(X\) and let \({\widetilde{X}}_1, \dots , {\widetilde{X}}_N, {\widetilde{Y}}_1, \dots , {\widetilde{Y}}_N\) denote the corrupted sample with rate \(\eta \). We define the following quantity, which will appear in the proof:

$$\begin{aligned} \overline{{\mathcal {E}}} (\epsilon , X){} & {} := \max \Big \{{\mathbb {E}}\big [ \big \vert \overline{X} - Q_{\epsilon /2}(\overline{X})\big \vert {{{\textbf {1}}}}_{\overline{X} \le Q_{\epsilon /2}(\overline{X})}\big ], \nonumber \\{} & {} {\mathbb {E}}\big [ \big \vert \overline{X} - Q_{1 - \epsilon /2}(\overline{X})\big \vert {{{\textbf {1}}}}_{\overline{X} \ge Q_{1 - \epsilon /2}(\overline{X})} \big ]\Big \}. \end{aligned}$$
(B16)

Step 1 We first derive confidence bounds on the truncation thresholds. Define the random variable \(U = {{{\textbf {1}}}}_{\overline{X} \ge Q_{1-2\epsilon }(\overline{X})}\). Its standard deviation satisfies \(\sigma _U \le {\mathbb {P}}^{1/2}(\overline{X} \ge Q_{1-2\epsilon }(\overline{X})) = \sqrt{2\epsilon }\). By applying Bernstein’s inequality we find with probability \(\ge 1 - \exp (-\epsilon N/12)\) that:

$$\begin{aligned} \big \vert \big \{i \,: \, Y_i \ge \mu + Q_{1-2\epsilon }(\overline{X})\big \}\big \vert \ge 3\epsilon N/2, \end{aligned}$$

a similar argument with \(U = {{{\textbf {1}}}}_{\overline{X} > Q_{1 - \epsilon /2}(\overline{X})}\) yields with probability \(\ge 1 - \exp (-\epsilon N /12)\) that:

$$\begin{aligned} \big \vert \big \{i \,: \, Y_i \le \mu + Q_{1-\epsilon /2}(\overline{X})\big \}\big \vert \ge (1 - (3/4)\epsilon ) N, \end{aligned}$$

and similarly with probability \(\ge 1 - \exp (-\epsilon N /12)\) we have:

$$\begin{aligned} \big \vert \big \{i \,: \, Y_i \le \mu + Q_{2\epsilon }(\overline{X})\big \}\big \vert \ge 3\epsilon N/2, \end{aligned}$$

and with probability \(\ge 1 - \exp (-\epsilon N /12)\):

$$\begin{aligned} \big \vert \big \{i \,: \, Y_i \ge \mu + Q_{\epsilon /2}(\overline{X})\big \}\big \vert \ge (1 - (3/4)\epsilon ) N, \end{aligned}$$

so that with probability \(\ge 1 - 4\exp (-\epsilon N /12) \ge 1 - \delta /2\) the four previous inequalities hold simultaneously. We call this event \(E\); it depends only on the variables \(Y_1, \dots , Y_N\). Since \(\eta \le \epsilon /8\), if \(2\eta N\) samples are corrupted we still have:

$$\begin{aligned}\big \vert \big \{i \,: \, {\widetilde{Y}}_i \ge \mu + Q_{1 - 2\epsilon }(\overline{X})\big \}\big \vert \ge ((3/2)\epsilon - 2\eta ) N \ge \epsilon N\end{aligned}$$

and

$$\begin{aligned}{} & {} \big \vert \big \{i \,: \, {\widetilde{Y}}_i \le \mu + Q_{1 - \epsilon /2}(\overline{X})\big \}\big \vert \\{} & {} \quad \ge (1 - (3/4)\epsilon - 2\eta ) N \ge (1 - \epsilon ) N\end{aligned}$$

consequently, the two following bounds hold

$$\begin{aligned}Q_{1-2\epsilon }(\overline{X})\le {\widetilde{Y}}^*_{(1-\epsilon )N} - \mu \le Q_{1-\epsilon /2}(\overline{X})\end{aligned}$$

and similarly

$$\begin{aligned} Q_{\epsilon /2}(\overline{X})\le {\widetilde{Y}}^*_{\epsilon N} - \mu \le Q_{2\epsilon }(\overline{X}). \end{aligned}$$

This provides guarantees on the truncation levels used which are \(\alpha = {\widetilde{Y}}^*_{\epsilon N}\) and \(\beta = {\widetilde{Y}}^*_{(1-\epsilon ) N}\).

Step 2

We first bound the deviation \(\Big \vert \frac{1}{N} \sum _{i=1}^N \phi _{\alpha , \beta }(X_i) - \mu \Big \vert \) in the absence of corruption. We write:

$$\begin{aligned} \frac{1}{N} \!\!\sum _{i=1}^N&\phi _{\alpha , \beta }(X_i) \!\le \!\! \frac{1}{N}\!\sum _{i=1}^N\! \phi _{\mu + Q_{2\epsilon }(\overline{X}), \mu + Q_{1 - \epsilon /2}(\overline{X})}(X_i) \nonumber \\&\quad =\! {\mathbb {E}}\big [ \phi _{\mu + Q_{2\epsilon }(\overline{X}), \mu + Q_{1 - \epsilon /2}(\overline{X})}(X)\big ] \nonumber \\ {}&+ \frac{1}{N}\sum _{i=1}^N\Big ( \phi _{\mu + Q_{2\epsilon }(\overline{X}), \mu + Q_{1 - \epsilon /2}(\overline{X})}(X_i) \nonumber \\&\quad - {\mathbb {E}}\big [ \phi _{\mu + Q_{2\epsilon }(\overline{X}), \mu + Q_{1 - \epsilon /2}(\overline{X})}(X) \big ]\Big ) . \end{aligned}$$
(B17)

The first term is dominated by:

$$\begin{aligned}&{\mathbb {E}}\big [\phi _{\mu + Q_{2\epsilon }(\overline{X}), \mu + Q_{1 - \epsilon /2}(\overline{X})}(X)\big ] = {\mathbb {E}}\big [\phi _{Q_{2\epsilon }(X), Q_{1 - \epsilon /2}(X)}(X)\big ]\\ {}&= {\mathbb {E}}\big [ Q_{2\epsilon }(X){{{{\textbf {1}}}}}_{X \le Q_{2\epsilon }(X)} \!+\! X{{{{\textbf {1}}}}}_{Q_{2\epsilon }(X)< X < Q_{1 - \epsilon /2}(X)} \\ {}&\quad \!+\! Q_{1 - \epsilon /2}(X){{{{\textbf {1}}}}}_{X \ge Q_{1 - \epsilon /2}(X)}\big ] \\ {}&= \mu + {\mathbb {E}}\big [(Q_{2\epsilon }(X) - X){{{{\textbf {1}}}}}_{X \le Q_{2\epsilon }(X)} \\ {}&\quad + (Q_{1 - \epsilon /2}(X) - X){{{{\textbf {1}}}}}_{X \ge Q_{1 - \epsilon /2}(X)}\big ] \\ {}&\le \mu + {\mathbb {E}}\big [(Q_{2\epsilon }(X) - X){{{{\textbf {1}}}}}_{X \le Q_{2\epsilon }(X)}\big ] \\ {}&= \mu + {\mathbb {E}}\big [(Q_{2\epsilon }(\overline{X}) - \overline{X}){{{{\textbf {1}}}}}_{\overline{X} \le Q_{2\epsilon }(\overline{X})}\big ] \\ {}&\le \mu + \overline{{\mathcal {E}}} (4\epsilon , X), \end{aligned}$$

and lower bounded by:

$$\begin{aligned}&{\mathbb {E}}\big [\phi _{\mu + Q_{2\epsilon }(\overline{X}), \mu + Q_{1 - \epsilon /2}(\overline{X})}(X)\big ] \\&= \mu + {\mathbb {E}}\big [(Q_{2\epsilon }(X) - X){{{\textbf {1}}}}_{X \le Q_{2\epsilon }(X)} \\&\quad + (Q_{1 - \epsilon /2}(X) - X){{{\textbf {1}}}}_{X \ge Q_{1 - \epsilon /2}(X)}\big ] \\&\ge \mu \!+\! {\mathbb {E}}\big [(Q_{1 - \epsilon /2}(X) \!-\! X){{{\textbf {1}}}}_{X \ge Q_{1 - \epsilon /2}(X)}\big ] \\&\quad =\! \mu \!+\! {\mathbb {E}}\big [(Q_{1 - \epsilon /2}(\overline{X}) \!-\! \overline{X}){{{\textbf {1}}}}_{\overline{X} \ge Q_{1 - \epsilon /2}(\overline{X})}\big ] \\&\ge \mu - \overline{{\mathcal {E}}} (\epsilon , X). \end{aligned}$$

The terms of the sum in (B17) above are upper bounded by \(Q_{1 - \epsilon /2}(\overline{X}) + \overline{{\mathcal {E}}}(\epsilon , X)\). We now use the assumption \({\mathbb {E}}\vert \overline{X}\vert ^{1+\gamma } = M < \infty \) in order to bound their variance:

$$\begin{aligned}&{\mathbb {E}}\big [ \phi _{\mu + Q_{2\epsilon }(\overline{X}), \mu + Q_{1 - \epsilon /2}(\overline{X})}(X) \\ {}&\quad - {\mathbb {E}}[\phi _{\mu + Q_{2\epsilon }(\overline{X}), \mu + Q_{1 - \epsilon /2}(\overline{X})}(X)] \big ]^2 \\ {}&\le {\mathbb {E}}\big [ \phi _{\mu + Q_{2\epsilon }(\overline{X}), \mu + Q_{1 - \epsilon /2}(\overline{X})}(X) - \mu \big ]^2 \\ {}&\quad = {\mathbb {E}}\big [\phi _{Q_{2\epsilon }(\overline{X}), Q_{1 - \epsilon /2}(\overline{X})}(\overline{X})^2\big ] \\ {}&= {\mathbb {E}}\big [ Q_{2\epsilon }(\overline{X}){{{{\textbf {1}}}}}_{\overline{X} \le Q_{2\epsilon }(\overline{X})} +\! \overline{X}{{{{\textbf {1}}}}}_{Q_{2\epsilon }(\overline{X})< \overline{X}< Q_{1 - \epsilon /2}(\overline{X})} \\ {}&\quad +\! Q_{1 - \epsilon /2}(\overline{X}){{{{\textbf {1}}}}}_{\overline{X} \ge Q_{1 - \epsilon /2}(\overline{X})}\big ]^2 \\ {}&= {\mathbb {E}}\big [ Q_{2\epsilon }(\overline{X})^2{{{{\textbf {1}}}}}_{\overline{X} \le Q_{2\epsilon }(\overline{X})} +\! \overline{X}^2{{{{\textbf {1}}}}}_{Q_{2\epsilon }(\overline{X})< \overline{X} < Q_{1 - \epsilon /2}(\overline{X})} \\ {}&\quad +\! Q_{1 \!-\! \epsilon /2}(\overline{X})^2{{{{\textbf {1}}}}}_{\overline{X} \ge Q_{1 \!-\!\epsilon /2}(\overline{X})}\big ]. \end{aligned}$$

To control the three terms in the previous expression we mimic the proof of Chebyshev’s inequality to obtain that, when \(Q_{2\epsilon }(\overline{X}) < 0\):

$$\begin{aligned} 2\epsilon= & {} {\mathbb {P}}\big (\overline{X} \le Q_{2\epsilon }(\overline{X})\big ) \le {\mathbb {P}}\big (\vert \overline{X}\vert ^{1+\gamma } \nonumber \\{} & {} \quad \ge \vert Q_{2\epsilon }(\overline{X})\vert ^{1+\gamma }\big ) \le \frac{M}{\vert Q_{2\epsilon }(\overline{X})\vert ^{1+\gamma }}, \end{aligned}$$
(B18)

analogously, when \(Q_{1 - \epsilon /2}(\overline{X}) > 0\) we have:

$$\begin{aligned} \epsilon /2= & {} {\mathbb {P}}\big (\overline{X} \ge Q_{1-\epsilon /2}(\overline{X})\big ) \le {\mathbb {P}}\big (\vert \overline{X}\vert ^{1+\gamma } \nonumber \\{} & {} \quad \ge \vert Q_{1-\epsilon /2}(\overline{X})\vert ^{1+\gamma }\big ) \le \frac{M}{\vert Q_{1-\epsilon /2}(\overline{X})\vert ^{1+\gamma }}, \end{aligned}$$
(B19)

from (B18), we deduce that

$$\begin{aligned}{} & {} {\mathbb {E}}\big [Q_{2\epsilon }(\overline{X})^2{{{\textbf {1}}}}_{\overline{X} \le Q_{2\epsilon }(\overline{X})}\big ] = 2\epsilon Q_{2\epsilon }(\overline{X})^2 \\{} & {} \quad \le 2\epsilon \Big (\frac{M}{2\epsilon }\Big )^{\frac{2}{1 + \gamma }} \le 2\epsilon \Big (\frac{2M}{\epsilon }\Big )^{2/(1 + \gamma )}, \end{aligned}$$

and from (B19) we find

$$\begin{aligned}{} & {} {\mathbb {E}}\big [Q_{1 - \epsilon /2}(\overline{X})^2{{{\textbf {1}}}}_{\overline{X} \ge Q_{1 - \epsilon /2}(\overline{X})}\big ] = Q_{1 - \epsilon /2}(\overline{X})^2 \epsilon /2 \\{} & {} \quad \le 2\epsilon \Big (\frac{2M}{\epsilon }\Big )^{2/(1 + \gamma )}. \end{aligned}$$

In the pathological case where \(Q_{2\epsilon }(\overline{X}) \ge 0\), we use that \(Q_{2\epsilon }(\overline{X}) \le Q_{1 - \epsilon /2}(\overline{X})\) (for \(\epsilon \le 2/5\)) to deduce \(\vert Q_{2\epsilon }(\overline{X})\vert \le \vert Q_{1 - \epsilon /2}(\overline{X})\vert \), and hence we still have

$$\begin{aligned}{} & {} {\mathbb {E}}\big [Q_{2\epsilon }(\overline{X})^2{{{\textbf {1}}}}_{\overline{X} \le Q_{2\epsilon }(\overline{X})}\big ] \le 2\epsilon Q_{1 - \epsilon /2}(\overline{X})^2 \\{} & {} \quad \le 2\epsilon \Big (\frac{2M}{\epsilon }\Big )^{2/(1 + \gamma )}. \end{aligned}$$

The case \(Q_{1 - \epsilon /2}(\overline{X})\le 0\) is similarly handled. Moreover, a simple calculation yields

$$\begin{aligned}{} & {} {\mathbb {E}}\big [\overline{X}^2{{{\textbf {1}}}}_{Q_{2\epsilon }(\overline{X}) \!\le \! \overline{X} \le Q_{1 - \epsilon /2}(\overline{X})}\big ] \!\le \! M \!\max \\{} & {} \quad \big \{\vert Q_{2\epsilon }(\overline{X})\vert , \vert Q_{1 - \epsilon /2}(\overline{X})\vert \big \}^{1 - \gamma } \!\le \! 2\epsilon \Big (\frac{2M}{\epsilon }\Big )^{2/(1 + \gamma )}. \end{aligned}$$

All in all, we have shown the inequality:

$$\begin{aligned}{} & {} {\mathbb {E}}\big [ \phi _{\mu + Q_{2\epsilon }(\overline{X}), \mu + Q_{1 - \epsilon /2}(\overline{X})}(X) \\{} & {} \quad - {\mathbb {E}}[\phi _{\mu + Q_{2\epsilon }(\overline{X}), \mu + Q_{1 -\epsilon /2}(\overline{X})}(X)]\big ]^2 \le 6\epsilon \Big (\frac{2M}{\epsilon }\Big )^{2/(1 + \gamma )}, \end{aligned}$$

which we now use to apply Bernstein's inequality on the sum in (B17) to find, conditionally on \(Y_1, \dots , Y_N\), with probability at least \(1 - \delta /4\):

$$\begin{aligned}&\frac{1}{N}\sum _{i=1}^N \phi _{\alpha , \beta }(X_i) \\&\le \mu + \overline{{\mathcal {E}}}(4\epsilon , X) + \sqrt{\frac{6 \epsilon \log (4/\delta )}{N}} \Big (\frac{2M}{\epsilon }\Big )^{1/(1+\gamma )} \\&\quad + \frac{\log (4/\delta )}{3N}(Q_{1 - \epsilon /2}(\overline{X}) + \overline{{\mathcal {E}}}(\epsilon , X)) \\&\le \mu + 2\overline{{\mathcal {E}}}(4\epsilon , X) + \sqrt{\frac{6 \epsilon \log (4/\delta )}{N}} \Big (\frac{2M}{\epsilon }\Big )^{1/(1+\gamma )} \\&\quad + \frac{\log (4/\delta )}{3N}Q_{1 - \epsilon /2}(\overline{X}) \\&\le \mu + 2\overline{{\mathcal {E}}}(4\epsilon , X) + (3/2) M^{1/(1+\gamma )} (\epsilon /2)^{\gamma /(1+\gamma )}, \end{aligned}$$

where we used (B19), the fact that \(\frac{\log (4/\delta )}{N} \le \epsilon /12\) and the assumption that \(\delta \ge e^{-N}/4\). Using the same argument on the lower tail, we obtain, on the event E, that with probability at least \(1 - \delta /2\)

$$\begin{aligned}{} & {} \Big \vert \frac{1}{N}\sum _{i=1}^N \phi _{\alpha , \beta }(X_i) - \mu \Big \vert \\{} & {} \le 2\overline{{\mathcal {E}}}(4\epsilon , X) + (3/2)M^{\frac{1}{1+\gamma }} (\epsilon /2)^{\gamma /(1+\gamma )}. \end{aligned}$$

Step 3

Now we show that \(\Big \vert \frac{1}{N}\sum _{i=1}^N \phi _{\alpha , \beta }(X_i) - \frac{1}{N}\sum _{i=1}^N \phi _{\alpha , \beta }({\widetilde{X}}_i)\Big \vert \) is of the same order as the previous bounds. There are at most \(2\eta N\) indices such that \(X_i \ne {\widetilde{X}}_i\) and for such differences we have the bound:

$$\begin{aligned}&\big \vert \phi _{\alpha , \beta }(X_i) - \phi _{\alpha , \beta }({\widetilde{X}}_i)\big \vert \le \vert Q_{\epsilon /2}(\overline{X})\vert + \vert Q_{1 - \epsilon /2}(\overline{X})\vert , \end{aligned}$$

and since \(\eta \le \epsilon /8\),

$$\begin{aligned}&\Big \vert \frac{1}{N} \sum _{i=1}^N \phi _{\alpha , \beta }(X_i) - \frac{1}{N} \sum _{i=1}^N \phi _{\alpha , \beta }({\widetilde{X}}_i)\Big \vert \\&\quad \le 2\eta \big (\vert Q_{\epsilon /2}(\overline{X})\vert + \vert Q_{1 - \epsilon /2}(\overline{X})\vert \big ) \\&\quad \le \frac{\epsilon }{2}\max \big \{\vert Q_{\epsilon /2}(\overline{X})\vert , \vert Q_{1 - \epsilon /2}(\overline{X})\vert \big \} \\&\quad \le M^{1/(1+\gamma )}(\epsilon /2)^{\gamma /(1+\gamma )}, \end{aligned}$$

where the last step follows from (B18) and (B19). Finally, using similar arguments along with Hölder’s inequality, we show that:

$$\begin{aligned}&{\mathbb {E}}\big [ \vert \overline{X} - Q_{\epsilon /2}(\overline{X})\vert {{{{\textbf {1}}}}}_{\overline{X} \le Q_{\epsilon /2}(\overline{X})}\big ] \\ {}&\quad \le {\mathbb {E}}\big [ \vert \overline{X}\vert {{{{\textbf {1}}}}}_{\overline{X} \le Q_{\epsilon /2}(\overline{X})}\big ] + {\mathbb {E}}\big [ \vert Q_{\epsilon /2}(\overline{X})\vert {{{{\textbf {1}}}}}_{\overline{X} \le Q_{\epsilon /2}(\overline{X})}\big ] \\ {}&\quad \le M^{1/(1+\gamma )}(\epsilon /2)^{\gamma /(1+\gamma )} + \vert Q_{\epsilon /2}(\overline{X})\vert (\epsilon /2) \\ {}&\quad \le 2M^{1/(1+\gamma )}(\epsilon /2)^{\gamma /(1+\gamma )}, \end{aligned}$$

and a similar computation for \({\mathbb {E}}\big [ \vert \overline{X} - Q_{1 - \epsilon /2}(\overline{X})\vert {{{\textbf {1}}}}_{\overline{X} \ge Q_{1 - \epsilon /2}(\overline{X})}\big ]\) leads to

$$\begin{aligned} \overline{{\mathcal {E}}}(4\epsilon , X) \le 2M^{1/(1+\gamma )}(2\epsilon )^{\gamma /(1+\gamma )}. \end{aligned}$$

This completes the proof of Lemma 9. \(\square \)

1.9 B.9 Proof of Proposition 5

Step 1. Notice that the \(\texttt {TM} \) estimator is also a monotone non-decreasing function of each of its entries when the others are fixed. This allows us to replicate Step 1 of the proof of Proposition 3. We define an \(\varepsilon \)-net \(N_{\varepsilon }\) on the set \(\Theta \), fix \(\theta \in \Theta \) and let \({\widetilde{\theta }}\) be the closest point in \(N_{\varepsilon }\). We obtain, for all \(j\in \llbracket d \rrbracket \), the inequalities:

(B20)

where is the \(\texttt {TM} \) estimator obtained for the entries \(\ell '\big ({\widetilde{\theta }}^{\top } X_i, Y_i\big )X_i^j + \varepsilon \gamma \Vert X_i\Vert ^2 = g_j^i({\widetilde{\theta }}) + \varepsilon \gamma \Vert X_i\Vert ^2\).

Step 2

We use the concentration property of the \(\texttt {TM} \) estimator to bound the previous quantity, which is in terms of \({\widetilde{\theta }}\). The terms \(\big (g_j^i({\widetilde{\theta }}) + \varepsilon \gamma \Vert X_i\Vert ^2\big )_{i\in \llbracket n \rrbracket }\) are independent and distributed according to \(Z:= \ell '\big ({\widetilde{\theta }}^\top X, Y\big )X^j + \gamma \varepsilon \Vert X\Vert ^2.\) Obviously we have \({\mathbb {E}}\,\ell '\big ({\widetilde{\theta }}^\top X, Y\big )X^j = g_j({\widetilde{\theta }}).\) Furthermore, let \(\overline{L} = {\mathbb {E}}\gamma \Vert X\Vert ^2\), so that \({\mathbb {E}}\big [g_j^i({\widetilde{\theta }}) + \varepsilon \gamma \Vert X_i\Vert ^2\big ] = g_j({\widetilde{\theta }}) + \varepsilon \overline{L}\). We will apply Lemma 9 to \(Z\). Before we do so, we need to compute the centered \((1+\alpha )\)-moment of \(Z\). Let \(m_{j,\alpha }({\widetilde{\theta }})\) and \(m_{L,\alpha }\) be the centered \((1+\alpha )\)-moments of \(\ell '({\widetilde{\theta }}^\top X, Y)X^j\) and \(\gamma \Vert X\Vert ^2\), respectively; we have:

$$\begin{aligned} {\mathbb {E}}\big \vert Z - {\mathbb {E}}Z\big \vert ^{1+\alpha } \le 2^{\alpha }\big (m_{j, \alpha }({\widetilde{\theta }}) + \varepsilon ^{1+\alpha }m_{L,\alpha }\big ). \end{aligned}$$

Now applying Lemma 9 we find with probability no less than \(1-\delta \)

with \(\epsilon _{\delta } = 8\eta +12 \frac{\log (4/\delta )}{n}\). By combining with (B20) and using a union bound argument, we deduce that with the same probability, we have for all \(j\in \llbracket d \rrbracket \)

(B21)

Step 3 We use the \(\varepsilon \)-net to obtain a uniform bound. We proceed similarly to the proof of Proposition 3. For \(\theta \in \Theta \), denote by \({\widetilde{\theta }}(\theta ) \in N_{\varepsilon }\) the closest point in \(N_{\varepsilon }\), which in particular satisfies \(\Vert {\widetilde{\theta }}(\theta ) - \theta \Vert \le \varepsilon \). Following the previous arguments, we write

Taking a union bound over \({\widetilde{\theta }} \in N_{\varepsilon }\) for the inequality (B21) and choosing \(\varepsilon = n^{-\alpha /(1+\alpha )}\) concludes the proof of Proposition 5.

1.10 B.10 Proof of Corollary 1

We first write the result of Proposition 5 using big-\(O\) notation. This tells us that, with probability at least \(1 - \delta \), for all \(\theta \in \Theta \) and all \(j\in \llbracket d \rrbracket \) we have:

$$\begin{aligned} \big \vert \epsilon _j^{\texttt {TM} }(\delta )\big \vert \le O \bigg ( M_{j,\alpha }^{1/(1+\alpha )}\Big ( \frac{\log (d/\delta ) + d\log (n)}{n} \Big )^{\alpha /(1+\alpha )}\bigg ) \end{aligned}$$

It only remains to apply Theorem 1 with importance sampling. The main result corresponds to having the second term (the statistical error) dominate the bound given by Theorem 1. This happens as soon as the number of iterations T is high enough so that

$$\begin{aligned}{} & {} \big ( R(\theta ^{(0)}) - R^\star \big )\Big ( 1 - \frac{\lambda }{\sum _{j\in \llbracket d \rrbracket }L_j} \Big )^T \\{} & {} \quad \le \frac{\big \Vert \epsilon ^{\texttt {TM} }(\delta ) \big \Vert _2^2}{2 \lambda }. \end{aligned}$$

From here, it is straightforward to check that the stated number of iterations suffices.
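For completeness, here is the elementary computation behind that last claim (our reconstruction; it only uses \(\log (1-x) \le -x\)): assuming the right-hand side below is positive, the displayed condition holds as soon as

$$\begin{aligned} T \ge \frac{\sum _{j\in \llbracket d \rrbracket } L_j}{\lambda } \log \bigg ( \frac{2\lambda \big ( R(\theta ^{(0)}) - R^\star \big )}{\big \Vert \epsilon ^{\texttt {TM} }(\delta ) \big \Vert _2^2} \bigg ), \end{aligned}$$

since \(T \log \big (1 - \tfrac{\lambda }{\sum _{j}L_j}\big ) \le - T \tfrac{\lambda }{\sum _{j}L_j}\). Because the statistical error \(\big \Vert \epsilon ^{\texttt {TM} }(\delta ) \big \Vert _2^2\) decays polynomially in \(n\), a number of iterations of order \(\frac{\sum _j L_j}{\lambda } \log n\) is therefore enough.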

1.11 B.11 Proof of Lemma 4

Similarly to the proof of Lemma 2, the assumptions, this time taken with \(\alpha = 1\), imply that the gradient coordinates have a finite second moment, so that the variance \(\sigma _j^2\) of the gradient coordinate \(\ell '(\theta ^{\top } X, Y) X^j\) is well defined. We apply Lemma 1 from Holland et al. (2019) with \(\delta /2\) to obtain:

$$\begin{aligned} \frac{1}{2}\vert {\widehat{g}}_j^{\texttt {CH} }(\theta ) - g_j(\theta )\vert \le \frac{C \sigma _j^2}{s} + \frac{s \log (4\delta ^{-1})}{n} \end{aligned}$$

with probability at least \(1 - \delta /2\), where C is a constant such that we have:

$$\begin{aligned} -\log (1 - u + Cu^2) \le \psi (u) \le \log (1 + u + Cu^2), \end{aligned}$$

and one can easily check that our choice of \(\psi \), the Gudermannian function, satisfies the previous inequality for \(C = 1/2\). This, along with the choice of scale s according to (21) and our assumption on \({\widehat{\sigma }}_j\) yields the announced deviation bound by a simple union bound argument.

1.12 B.12 Proof of Proposition 6

In this proof, for a scale \(s > 0\) and a set of real numbers \((x_i)_{i\in \llbracket n \rrbracket }\), we let \({\bar{x}}=\frac{1}{n} \sum _{i\in \llbracket n \rrbracket } x_i\) be their mean and define the function \(\zeta _s\big ((x_i)_{i\in \llbracket n \rrbracket }\big )\) as the unique x satisfying

$$\begin{aligned} \sum _{i\in \llbracket n \rrbracket } \psi \Big ( \frac{x_i - x}{s} \Big ) = 0. \end{aligned}$$

Since the function \(\psi \) is increasing, the previous equation has a unique solution. Moreover, for a fixed scale \(s\), the function \(\zeta _s\big ((x_i)_{i\in \llbracket n \rrbracket }\big )\) is monotone non-decreasing w.r.t. each \(x_i\) when the others are fixed.
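To fix ideas, here is a minimal sketch (ours, not the linlearn implementation) of the estimator \(\zeta _s\), taking \(\psi \) to be the Gudermannian function as in the proof of Lemma 4 and solving the defining scalar equation by bracketing; the scale \(s\) would normally be set from an estimated variance via (21), and is fixed arbitrarily here for illustration.

```python
import numpy as np
from scipy.optimize import brentq

def psi(u):
    """Gudermannian function: odd, increasing, bounded by pi/2 in absolute value."""
    return 2.0 * np.arctan(np.exp(u)) - np.pi / 2.0

def ch_estimator(x, s):
    """Unique root z of sum_i psi((x_i - z) / s) = 0 (Catoni/Holland-type estimator)."""
    f = lambda z: np.sum(psi((x - z) / s))
    # f is decreasing in z, positive for z below min(x) and negative above max(x).
    return brentq(f, x.min() - 1.0, x.max() + 1.0)

rng = np.random.default_rng(0)
x = rng.standard_t(df=2.5, size=10_000)    # heavy-tailed sample with true mean 0
x[:20] = 1e5                               # a few corrupted entries
print(f"CH estimate: {ch_estimator(x, s=5.0):+.3f}   plain mean: {x.mean():+.3f}")
```

Since \(\psi \) is bounded, each corrupted entry can shift the root by at most an amount proportional to \(s/n\), which is the mechanism exploited in the proof below.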

Step 1

We proceed similarly as in the proof of Proposition 3 except that we only use the monotonicity of the \(\texttt {CH} \) estimator with fixed scale. Let \(N_{\varepsilon }\) be an \(\varepsilon \)-net for \(\Theta \) with \(\varepsilon = 1/\sqrt{n}\). We have \(\vert N_{\varepsilon }\vert \le (3\Delta /2\varepsilon )^d\) with \(\Delta \) the diameter of \(\Theta \). Fix a coordinate \(j\in \llbracket d \rrbracket \), a point \(\theta \in \Theta \) and let \({{\widetilde{\theta }}}\) be the closest point to it in the \(\varepsilon \)-net. We wish to bound the difference

$$\begin{aligned}&\big \vert \widehat{g}_j^{\texttt {CH} }(\theta ) - g_j(\theta )\big \vert \le \big \vert \widehat{g}_j^{\texttt {CH} }(\theta ) - g_j({{\widetilde{\theta }}})\big \vert \\&\quad + \big \vert g_j({{\widetilde{\theta }}}) - g_j(\theta )\big \vert \\&\quad \le \big \vert \widehat{g}_j^{\texttt {CH} }(\theta ) - g_j(\widetilde{\theta })\big \vert + \varepsilon L_j, \end{aligned}$$

where we have the \(\texttt {CH} \) estimator \(\widehat{g}_j^{\texttt {CH} }(\theta ) = \zeta _{s(\theta )}\big ((g_j^i(\theta ))_{i\in \llbracket n \rrbracket }\big )\) with scale \(s(\theta )\) computed according to (21) and (22). Assume, without loss of generality that \(\widehat{g}_j^{\texttt {CH} }(\theta ) - g_j({{\widetilde{\theta }}}) \ge 0\). Using the non-decreasing property of the \(\texttt {CH} \) estimator at a fixed scale, we find that

$$\begin{aligned}&\big \vert \widehat{g}_j^{\texttt {CH} }(\theta ) - g_j({{\widetilde{\theta }}})\vert \\&\quad = \big \vert \zeta _{s(\theta )}\big ((g_j^i(\theta ))_{i\in \llbracket n \rrbracket }\big ) - g_j({{\widetilde{\theta }}})\big \vert \\&\quad \le \big \vert \zeta _{s(\theta )}\big ((g_j^i({{\widetilde{\theta }}}) + \varepsilon \gamma \Vert X_i\Vert ^2)_{i\in \llbracket n \rrbracket }\big ) - g_j(\widetilde{\theta })\big \vert . \end{aligned}$$

Indeed, one has

$$\begin{aligned} g_j^i(\theta )&= g_j^i({{\widetilde{\theta }}}) + \big (g_j^i(\theta ) - g_j^i({{\widetilde{\theta }}})\big ) \\&\le g_j^i({{\widetilde{\theta }}}) + \gamma \Vert {{\widetilde{\theta }}} - \theta \Vert \cdot \Vert X_i\Vert \cdot \vert X_i^j\vert \\&\le g_j^i({{\widetilde{\theta }}}) + \varepsilon \gamma \Vert X_i\Vert ^2. \end{aligned}$$

We introduce the notation so that:

Step 2

We now use the concentration property of \(\texttt {CH} \) to bound the previous quantity which is in terms of \({\widetilde{\theta }}\). We apply Lemma 1 from Holland et al. (2019) with \(\delta /2\) and scale \(s(\theta )\) to the samples \((g_j^i({{\widetilde{\theta }}}) + \varepsilon \gamma \Vert X_i\Vert ^2)_{i\in \llbracket n \rrbracket }\) which are independent and distributed according to the random variable \(\ell '\big ({{\widetilde{\theta }}}^\top X, Y\big )X^j + \varepsilon \gamma \Vert X\Vert ^2\) with expectation \(g_j({{\widetilde{\theta }}}) + \varepsilon \overline{L}\). Using our assumptions on \(\sigma _L, \sigma _j(\theta ), \sigma _j({{\widetilde{\theta }}}), \widehat{\sigma }_j(\theta )\) and the definition of the scale \(s(\theta )\) according to (21) we find:

A simple union bound yields that for all \(j\in \llbracket d \rrbracket \)

(B22)

Step 3

We use the \(\varepsilon \)-net to obtain a uniform bound. We proceed similarly to the proof of Proposition 3. For \(\theta \in \Theta \), denote by \({\widetilde{\theta }}(\theta ) \in N_{\varepsilon }\) the closest point in \(N_{\varepsilon }\), which in particular satisfies \(\Vert {\widetilde{\theta }}(\theta ) - \theta \Vert \le \varepsilon \). Following the previous arguments, we write

Taking a union bound over \({\widetilde{\theta }} \in N_{\varepsilon }\) for the inequality (B22) and using the choice \(\varepsilon = 1/\sqrt{n}\) concludes the proof of Proposition 6.

1.13 B.13 Proof of Corollary 2

Under the assumptions made, the constants \((L_j)_{j\in \llbracket d \rrbracket }\) are estimated using the \(\texttt {MOM} \) estimator and we obtain the bounds \(({\overline{L}}_j)_{j\in \llbracket d \rrbracket }\), which hold with probability at least \(1 - \delta /2\) by a union bound argument. The rest of the proof is the same as that of Theorem 1, using a failure probability \(\delta /2\) instead of \(\delta \) and replacing the constants \((L_j)_{j\in \llbracket d \rrbracket }\) by their upper bounds accordingly. The result then follows after a simple union bound argument.

1.14 B.14 Proof of Lemma 5

Let \(B_1, \dots , B_K\) be the blocks used for the estimation, so that \(B_1 \cup \dots \cup B_K = \llbracket n \rrbracket \) and \(B_{k_1} \cap B_{k_2} = \emptyset \) for \(k_1\ne k_2\). Let \({\mathcal {K}}\) denote the uncorrupted block indices \({\mathcal {K}}= \{k \in \llbracket K \rrbracket \text { such that } B_k \cap {\mathcal {O}}= \emptyset \}\) and assume \(\vert {\mathcal {O}}\vert \le (1 - \varepsilon )K/2\). For \(k \in \llbracket K \rrbracket \) let \(\widehat{\sigma }^2_k = \frac{K}{n}\sum _{i\in B_k}X_i^2\) be the block means computed by MOM. Denote \(N = n/K\). Using (a slight generalization of) Lemma 7 and the \(L^{(1+\alpha )}\)-\(L^{1}\) condition satisfied by \(X^2\) with a known constant \(C\), we obtain that, for any fixed \(k \in {\mathcal {K}}\), with probability at least \(1-\delta \) we have

$$\begin{aligned}{} & {} \vert \widehat{\sigma }^2_k - \sigma ^2\vert \!\le \! \left( \frac{3{\mathbb {E}}\vert X^2 \!-\! \sigma ^2\vert ^{1+\alpha }}{\delta N^{\alpha }} \right) ^{\frac{1}{1+\alpha }} \\{} & {} \quad \le \! \left( \frac{3}{\delta N^{\alpha }} \right) ^{\frac{1}{1+\alpha }}C {\mathbb {E}}\vert X^2 - \sigma ^2\vert \le \left( \frac{3}{\delta N^{\alpha }} \right) ^{\frac{1}{1+\alpha }}C \sigma ^2, \end{aligned}$$

which implies the inequality

$$\begin{aligned} \sigma ^2 \le \Big (1 - C\Big ( \frac{3}{\delta N^{\alpha }} \Big )^{\frac{1}{1+\alpha }}\Big )^{-1} \widehat{\sigma }^2_k. \end{aligned}$$

Define the Bernoulli random variables \(U_k = {{{\textbf {1}}}}_{}\Big \{\sigma ^2 > \Big (1 - C\big ( \frac{3}{\delta N^{\alpha }} \big )^{\frac{1}{1+\alpha }}\Big )^{-1} \widehat{\sigma }^2_k\Big \}\) for \(k \in \llbracket K \rrbracket \), which have success probability at most \(\delta \) for \(k \in {\mathcal {K}}\). Denote \(S = \sum _{k \in {\mathcal {K}}} U_k\); we can bound the failure probability of the estimator as follows:

$$\begin{aligned}&{\mathbb {P}}\Big ( \Big (1 - C\Big ( \frac{3}{\delta N^{\alpha }} \Big )^{\frac{1}{1+\alpha }}\Big )^{-1} \widehat{\sigma }^2 < \sigma ^2 \Big ) \\&\quad \le {\mathbb {P}}\big [ S> K/2 - \vert {\mathcal {O}}\vert \big ] \\&\quad = {\mathbb {P}}\big [ S - {\mathbb {E}}S> K/2 - \vert {\mathcal {O}}\vert - \delta \vert {\mathcal {K}}\vert \big ] \\&\quad \le {\mathbb {P}}\big [ S - {\mathbb {E}}S > K (\varepsilon - 2\delta ) / 2 \big ] \\&\quad \le \exp \big (-K (\varepsilon - 2\delta )^2 / 2 \big ), \end{aligned}$$

where we used the fact that \(\vert {\mathcal {O}}\vert \le (1 - \varepsilon ) K / 2\) and \(\vert {\mathcal {K}}\vert \le K\) for the second inequality and Hoeffding’s inequality for the last. The proof is finished by taking \(\varepsilon = 5/6\) and \(\delta = 1/4.\)
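As an illustration of the resulting estimator (a sketch of ours with indicative constants; the value of \(C\) is assumed to upper bound the \(L^{(1+\alpha )}\)-\(L^{1}\) ratio of \(X^2\), which holds comfortably for standard Gaussian data with \(C = 2\) and \(\alpha = 1\)), the following code computes the median of block-wise second moments and applies the inflation factor \(\big (1 - C(3/(\delta N^{\alpha }))^{1/(1+\alpha )}\big )^{-1}\):

```python
import numpy as np

def mom_second_moment_upper_bound(x, K, C=2.0, alpha=1.0, delta=0.25):
    """Median of block-wise means of X_i^2, inflated as in Lemma 5 so that the
    result upper-bounds sigma^2 = E[X^2] with high probability."""
    n = len(x)
    N = n // K
    block_means = (x[: N * K] ** 2).reshape(K, N).mean(axis=1)
    inflation = 1.0 / (1.0 - C * (3.0 / (delta * N ** alpha)) ** (1.0 / (1.0 + alpha)))
    return inflation * np.median(block_means)

rng = np.random.default_rng(0)
x = rng.normal(size=20_000)
x[:100] = 1e4                                      # a few corrupted entries
print(mom_second_moment_upper_bound(x, K=40))      # should exceed E[X^2] = 1 w.h.p.
```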

1.15 B.15 Proof of Lemma 6

Lemma 6 is a direct consequence of the following result.

Lemma 10

Let \(X_1, \dots , X_n\) be an i.i.d sample of a random variable \(X\) with expectation \({\mathbb {E}}X = \mu \) and \((1+\alpha )\)-moment \({\mathbb {E}}\vert X - \mu \vert ^{1+\alpha } = m_{\alpha }< \infty \). Assume that the variable \(X\) satisfies the \(L^{(1+\alpha )^2}\)-\(L^{(1+\alpha )}\) condition with constant \(C >1\). Let \(\widehat{\mu }\) be the median-of-means estimate of \(\mu \) with \(K\) blocks and \(\widehat{m}_{\alpha }\) a similarly obtained estimate of \(m_{\alpha }\) from the samples \((\vert X_i - \widehat{\mu }\vert ^{1+\alpha })_{i\in \llbracket n \rrbracket }\). Then, with probability at least \(1 - 2\exp (-K/18)\) we have

$$\begin{aligned} \widehat{m}_{\alpha } \ge (1 - \kappa )m_{\alpha }, \end{aligned}$$

with \(\kappa = \epsilon + 24(1+\alpha ) \Big (\frac{1 + \epsilon }{n/K}\Big )^{\frac{\alpha }{1+\alpha }}\) and \(\epsilon = \Big ( \frac{3\times 2^{2+\alpha }(1 + C^{(1+\alpha )^2})}{(n/K)^{\alpha }} \Big )^{\frac{1}{1+\alpha }}\).

Proof

Let \(\widehat{\mu }\) be the MOM estimate of \(\mu \) with \(K\) blocks. Using Lemma 2, we have, with probability at least \(1 -\exp (-K/18)\),

$$\begin{aligned} \vert \mu - \widehat{\mu }\vert \le (24 m_{\alpha })^{\frac{1}{1+\alpha }} \Big (\frac{K}{n}\Big )^{\frac{\alpha }{1+\alpha }}. \end{aligned}$$
(B23)

Let \(\widehat{m}_{\alpha }\) be the MOM estimate of \(m_{\alpha }\) obtained from the samples \(\big (\vert X_i - \widehat{\mu }\vert ^{1+\alpha }\big )_{i\in \llbracket n \rrbracket }\). Denote by \(B_1, \dots , B_K\) the blocks we use; we have:

$$\begin{aligned} \widehat{m}_{\alpha } = {{\,\textrm{median}\,}}\bigg (\frac{K}{n} \sum _{i\in B_j} \vert X_i - \widehat{\mu }\vert ^{1+\alpha }\bigg )_{j \in \llbracket K \rrbracket } \end{aligned}$$

Let \(N = n/K\). Using the convexity of the function \(f(x) = \vert x\vert ^{1+\alpha }\), we find, for any \(j \in \llbracket K \rrbracket \), that:

$$\begin{aligned}&\frac{1}{N} \sum _{i\in B_j} \big \vert X_i - \widehat{\mu }\big \vert ^{1+\alpha } \nonumber \\&\quad = \frac{1}{N} \sum _{i\in B_j} \big \vert (X_i - \mu ) + (\mu - \widehat{\mu })\big \vert ^{1+\alpha } \nonumber \\&\quad \ge \frac{1}{N} \sum _{i\in B_j} \vert X_i - \mu \vert ^{1+\alpha } +\frac{1}{N} (1+\alpha ) \sum _{i\in B_j} \vert X_i \nonumber \\&\quad - \mu \vert ^{\alpha }\text {sign}(X_i - \mu )(\mu - \widehat{\mu }) \nonumber \\&\quad \ge \frac{1}{N} \sum _{i\in B_j} \vert X_i - \mu \vert ^{1+\alpha } - (1+\alpha ) \vert \mu \nonumber \\&\quad - \widehat{\mu }\vert \Big [ \frac{1}{N}\sum _{i\in B_j} \vert X_i - \mu \vert ^{\alpha } \Big ] \nonumber \\&\quad \ge \frac{1}{N} \sum _{i\in B_j} \vert X_i - \mu \vert ^{1+\alpha } - (1+\alpha ) \vert \mu \nonumber \\&\quad - \widehat{\mu }\vert \Big [ \frac{1}{N}\sum _{i\in B_j} \vert X_i - \mu \vert ^{1+\alpha } \Big ]^{\frac{\alpha }{1+\alpha }}, \end{aligned}$$
(B24)

where the last step uses Jensen’s inequality. Using Lemma 7 we have, for \(\delta > 0\), the concentration bound

$$\begin{aligned}{} & {} {\mathbb {P}}\bigg ( \Big \vert \frac{1}{N} \sum _{i\in B_j} \big \vert X_i - \mu \big \vert ^{1+\alpha } - m_{\alpha }\Big \vert \\{} & {} \quad > \Big ( \frac{3 {\mathbb {E}}\big \vert \vert X - \mu \vert ^{1+\alpha } - m_{\alpha }\big \vert ^{1+\alpha }}{\delta N^{\alpha }} \Big )^{\frac{1}{1+\alpha }} \bigg ) \le \delta \end{aligned}$$

which, using that X satisfies the \(L^{(1+\alpha )^2}\)-\(L^{(1+\alpha )}\) condition, translates to

$$\begin{aligned}&{\mathbb {P}}\bigg ( \Big \vert \frac{1}{N} \sum _{i\in B_j} \big \vert X_i - \mu \big \vert ^{1+\alpha } - m_{\alpha }\Big \vert > \epsilon \bigg ) \\&\quad \le \frac{3{\mathbb {E}}\big \vert \vert X - \mu \vert ^{1+\alpha } - m_{\alpha }\big \vert ^{1+\alpha }}{\epsilon ^{1+\alpha } N^{\alpha }} \\&\le \frac{3\times 2^{\alpha } \big ( {\mathbb {E}}\vert X-\mu \vert ^{(1+\alpha )^2} + m_{\alpha }^{1+\alpha }\big )}{\epsilon ^{1+\alpha } N^{\alpha }}\\&\le \frac{3\times 2^{\alpha }m_{\alpha }^{1+\alpha } \big ( 1 + C^{(1+\alpha )^2}\big )}{\epsilon ^{1+\alpha } N^{\alpha }}. \end{aligned}$$

Replacing \(\epsilon \) with \(\epsilon m_{\alpha }\) we find

$$\begin{aligned} {\mathbb {P}}\bigg ( \Big \vert \frac{1}{N} \sum _{i\in B_j} \vert X_i - \mu \vert ^{1+\alpha } - m_{\alpha }\Big \vert > \epsilon m_{\alpha } \bigg ) \le \frac{3\times 2^{\alpha }\big (1 + C^{(1+\alpha )^2}\big )}{N^{\alpha } \epsilon ^{1+\alpha }}. \end{aligned}$$

Now conditioning on the event (B23) and using the previous bound with \(\epsilon = \Big ( \frac{3\times 2^{\alpha }\big (1 + C^{(1+\alpha )^2}\big )}{N^{\alpha }\delta } \Big )^{\frac{1}{1+\alpha }}\) in (B24), we obtain that

$$\begin{aligned}&{\mathbb {P}}\bigg ( \frac{1}{N} \sum _{i\in B_j} \big \vert X_i - \widehat{\mu }\big \vert ^{1+\alpha } \le (1 - \epsilon )m_{\alpha } - (1+\alpha ) \Big (\frac{24 m_{\alpha }}{N^\alpha }\Big )^{\frac{1}{1+\alpha }}\big ((1 + \epsilon ) m_{\alpha }\big )^{\frac{\alpha }{1+\alpha }} \bigg ) \le \delta \\&\quad \implies {\mathbb {P}}\bigg ( \frac{1}{N}\sum _{i\in B_j} \big \vert X_i - \widehat{\mu }\big \vert ^{1+\alpha } \le \underbrace{\Big (1 - \epsilon - 24(1+\alpha ) \Big (\frac{1 + \epsilon }{N}\Big )^{\frac{\alpha }{1+\alpha }} \Big )}_{=:(1- \kappa )}m_{\alpha } \bigg ) \le \delta . \end{aligned}$$

Now define \(U_j\) as the indicator variable of the event in the last probability. We have just seen that its success probability is at most \(\delta \). We can now use the MOM trick: assuming that the number of outliers satisfies \(\vert {\mathcal {O}}\vert \le K(1- \varepsilon )/2\) for some \(\varepsilon \in (0,1)\), we have, for \(S = \sum _j U_j\),

$$\begin{aligned} {\mathbb {P}}(\widehat{m}_{\alpha } \le (1 - \kappa )m_{\alpha } )&\le {\mathbb {P}}(S> K/2 - \vert {\mathcal {O}}\vert ) \\&\le {\mathbb {P}}\big [ S - {\mathbb {E}}S> K/2 - \vert {\mathcal {O}}\vert - \delta \vert {\mathcal {K}}\vert \big ] \\&\le {\mathbb {P}}\big [ S - {\mathbb {E}}S > K (\varepsilon - 2\delta ) / 2 \big ] \\&\le \exp \big (-K (\varepsilon - 2\delta )^2 / 2 \big ). \end{aligned}$$

Taking \(\varepsilon = 5/6\) and \(\delta = 1/4\) gives \((\varepsilon - 2\delta )^2/2 = (1/3)^2/2 = 1/18\), so the previous probability is at most \(\exp (-K/18)\). Finally, recall that we conditioned on the event (B23), which bounds the deviation \(\vert \mu - \widehat{\mu }\vert \) and holds with probability at least \(1 - \exp (-K/18)\). Taking this conditioning into account, a union bound shows that the bound

$$\begin{aligned} \widehat{m}_{\alpha } \ge (1 - \kappa )m_{\alpha } \end{aligned}$$

holds with probability at least \(1 - 2\exp (-K/18)\). \(\square \)

B.16 Proof of Theorem 7

This proof is inspired by Theorem 5 in Nesterov (2012) and Theorem 1 in Shalev-Shwartz and Tewari (2011), while keeping track of the degradation caused by the errors on the gradient coordinates.

We condition on the event (B5) and denote \(\epsilon _j = \epsilon _j(\delta )\) and \(\epsilon _{Euc} = \Vert \epsilon (\delta )\Vert _2\). We define for all \(\theta \in \Theta \)

$$\begin{aligned} u_j(\theta )&= \mathop {{{\,\mathrm{\hbox {argmin}}\,}}}\limits _{\vartheta \in \Theta _j} \widehat{g}_j(\theta )(\vartheta - \theta _j) + \frac{L_j}{2} (\vartheta - \theta _j)^2 \\&\quad + \epsilon _j\vert \vartheta - \theta _j\vert \\&= {{\,\textrm{proj}\,}}_{\Theta _j} \big (\theta _j - \beta _j \tau _{\epsilon _{j}}\big (\widehat{g}_j(\theta )\big )\big ) \end{aligned}$$

and denote by \(\theta ^{(t)}\) the optimization iterates for \(t=0,\dots , T\), by \(j_t\) the random coordinate sampled at step t, and write \(\widehat{g}_t = \widehat{g}_{j_t}(\theta ^{(t)})\) for brevity. The point \(u_{j_t}(\theta ^{(t)})\) satisfies the following optimality condition:

$$\begin{aligned} \forall \vartheta \in \Theta _{j_t}: \quad \big ( \widehat{g}_t + L_{j_t}\big (u_{j_t}(\theta ^{(t)}) - \theta ^{(t)}_{j_t}\big ) + \epsilon _{j_t} \rho _t \big ) \big ( \vartheta - u_{j_t}(\theta ^{(t)}) \big ) \ge 0, \end{aligned}$$

where \(\rho _t = \text {sign}\big (u_{j_t}(\theta ^{(t)}) - \theta ^{(t)}_{j_t}\big )\). Using this condition for \(\vartheta = \theta ^{(t)}_{j_t}\) and the coordinate-wise Lipschitz smoothness property of R we find

$$\begin{aligned} R(\theta ^{(t+1)})&\le R(\theta ^{(t)}) + g_{j_t}(\theta ^{(t)})\big (u_{j_t}(\theta ^{(t)}) - \theta ^{(t)}_{j_t}\big ) \nonumber \\&\quad + \frac{L_{j_t}}{2}\big (u_{j_t}(\theta ^{(t)}) - \theta ^{(t)}_{j_t}\big )^2 \nonumber \\&\le R(\theta ^{(t)}) + (\widehat{g}_t + \epsilon _{j_t}\rho _t)\big (u_{j_t}(\theta ^{(t)}) - \theta ^{(t)}_{j_t}\big ) \nonumber \\&\quad + \frac{L_{j_t}}{2}\big (u_{j_t}(\theta ^{(t)}) - \theta ^{(t)}_{j_t}\big )^2 \end{aligned}$$
(B25)
$$\begin{aligned}&\le R(\theta ^{(t)}) - \frac{L_{j_t}}{2}\big (u_{j_t}(\theta ^{(t)}) - \theta ^{(t)}_{j_t}\big )^2. \end{aligned}$$
(B26)

Defining the potential \(\Phi (\theta ) = \sum _{j=1}^d L_j(\theta _j - \theta ^\star _j)^2\), we have:

$$\begin{aligned} \Phi (\theta ^{(t+1)})&= \Phi (\theta ^{(t)}) + 2L_{j_t}\big (u_{j_t}(\theta ^{(t)}) - \theta ^{(t)}_{j_t}\big )\big (\theta ^{(t)}_{j_t} - \theta ^\star _{j_t}\big ) + L_{j_t} \big (u_{j_t}(\theta ^{(t)}) - \theta ^{(t)}_{j_t}\big )^2 \\&= \Phi (\theta ^{(t)}) + 2L_{j_t}\big (u_{j_t}(\theta ^{(t)}) - \theta ^{(t)}_{j_t}\big )\big (u_{j_t}(\theta ^{(t)}) - \theta ^\star _{j_t}\big ) - L_{j_t} \big (u_{j_t}(\theta ^{(t)}) - \theta ^{(t)}_{j_t}\big )^2 \\&\le \Phi (\theta ^{(t)}) - 2(\widehat{g}_t + \epsilon _{j_t} \rho _t)\big (u_{j_t}(\theta ^{(t)}) - \theta ^\star _{j_t}\big ) - L_{j_t} \big (u_{j_t}(\theta ^{(t)}) - \theta ^{(t)}_{j_t}\big )^2 \\&= \Phi (\theta ^{(t)}) + 2(\widehat{g}_t + \epsilon _{j_t} \rho _t)\big (\theta ^\star _{j_t} - \theta ^{(t)}_{j_t}\big ) - 2 \Big ( (\widehat{g}_t + \epsilon _{j_t} \rho _t)\big (u_{j_t}(\theta ^{(t)})- \theta ^{(t)}_{j_t}\big ) + \frac{L_{j_t}}{2} \big (u_{j_t}(\theta ^{(t)}) - \theta ^{(t)}_{j_t}\big )^2 \Big ) \\&\le \Phi (\theta ^{(t)}) + 2(\widehat{g}_t + \epsilon _{j_t} \rho _t)\big (\theta ^\star _{j_t} - \theta ^{(t)}_{j_t}\big ) + 2 \big ( R(\theta ^{(t)}) - R(\theta ^{(t+1)}) \big ) \\&\le \Phi (\theta ^{(t)}) + 2g_{j_t}(\theta ^{(t)}) \big (\theta ^\star _{j_t} - \theta ^{(t)}_{j_t}\big ) + 2 \big ( R(\theta ^{(t)}) - R(\theta ^{(t+1)}) \big ) + 4\epsilon _{j_t} \big \vert \theta ^\star _{j_t} - \theta ^{(t)}_{j_t}\big \vert , \end{aligned}$$

where the first inequality uses the optimality condition with \(\vartheta = \theta ^\star _{j_t}\), the second one uses (B25), and the last one uses \(\vert \widehat{g}_t - g_{j_t}(\theta ^{(t)})\vert \le \epsilon _{j_t}\) together with \(\vert \rho _t\vert \le 1\). Now, defining \(\Psi (\theta ) = \frac{1}{2} \Phi (\theta ) + R(\theta )\), taking the expectation w.r.t. \(j_t\) and using the convexity of R together with the Cauchy–Schwarz inequality, we find

$$\begin{aligned} {\mathbb {E}}\big [\Psi (\theta ^{(t)}) - \Psi (\theta ^{(t+1)})\big ] \ge \frac{1}{d} \big (R(\theta ^{(t)}) - R(\theta ^\star ) - 2 \epsilon _{Euc} \big \Vert \theta ^{(t)}- \theta ^\star \big \Vert _2\big ). \end{aligned}$$
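In more detail (a short expansion of this step, in the authors' notation): taking the expectation with respect to the uniformly sampled coordinate \(j_t\), conditionally on \(\theta ^{(t)}\), in the previous chain of inequalities gives

$$\begin{aligned} {\mathbb {E}}\big [\Phi (\theta ^{(t+1)})\big ]&\le \Phi (\theta ^{(t)}) + \frac{2}{d} \sum _{j=1}^d g_{j}(\theta ^{(t)})\big (\theta ^\star _{j} - \theta ^{(t)}_{j}\big ) + 2\big (R(\theta ^{(t)}) - {\mathbb {E}}\big [R(\theta ^{(t+1)})\big ]\big ) + \frac{4}{d} \sum _{j=1}^d \epsilon _{j} \big \vert \theta ^\star _{j} - \theta ^{(t)}_{j}\big \vert \\&\le \Phi (\theta ^{(t)}) + \frac{2}{d} \big (R(\theta ^\star ) - R(\theta ^{(t)})\big ) + 2\big (R(\theta ^{(t)}) - {\mathbb {E}}\big [R(\theta ^{(t+1)})\big ]\big ) + \frac{4\epsilon _{Euc}}{d} \big \Vert \theta ^{(t)} - \theta ^\star \big \Vert _2, \end{aligned}$$

where the second inequality uses \(\sum _{j=1}^d g_j(\theta ^{(t)})\big (\theta ^\star _j - \theta ^{(t)}_j\big ) = \langle \nabla R(\theta ^{(t)}), \theta ^\star - \theta ^{(t)}\rangle \le R(\theta ^\star ) - R(\theta ^{(t)})\) (convexity of R) and \(\sum _{j=1}^d \epsilon _j \big \vert \theta ^\star _j - \theta ^{(t)}_j\big \vert \le \epsilon _{Euc} \big \Vert \theta ^{(t)} - \theta ^\star \big \Vert _2\) (Cauchy–Schwarz). Dividing by two and rearranging with \(\Psi = \frac{1}{2}\Phi + R\) yields the bound above.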

Recall that, according to (B26), we have \(R (\theta ^{(t+1)}) \le R (\theta ^{(t)})\). Summing over \(t = 0,\dots , T\), we find:

$$\begin{aligned} {\mathbb {E}}\left[ \frac{T+1}{d}\big (R (\theta ^{(T)}) - R (\theta ^\star )\big )\right]&\le {\mathbb {E}}\left[ \frac{1}{d}\sum _{t=0}^T\big (R (\theta ^{(t)}) - R (\theta ^\star )\big )\right] \\&\le \sum _{t=0}^T \left( {\mathbb {E}}\big [\Psi (\theta ^{(t)}) - \Psi (\theta ^{(t+1)})\big ] + \frac{2\epsilon _{Euc}}{d}{\mathbb {E}}\big [\big \Vert \theta ^{(t)} - \theta ^\star \big \Vert _2\big ]\right) \\&= {\mathbb {E}}\left[ \Psi (\theta ^{(0)}) - \Psi (\theta ^{(T+1)})\right] + \frac{2\epsilon _{Euc}}{d}\sum _{t=0}^T{\mathbb {E}}\big [\big \Vert \theta ^{(t)} - \theta ^\star \big \Vert _2\big ] \\&\le \Psi (\theta ^{(0)}) + \frac{2\epsilon _{Euc}}{d}\sum _{t=0}^T{\mathbb {E}}\big [\big \Vert \theta ^{(t)} - \theta ^\star \big \Vert _2\big ], \end{aligned}$$

which yields the result after multiplying by \(\frac{d}{T+1}\). To finish, we show that, conditionally on any choice of \(j_t\), we have \(\Vert \theta ^{(t+1)} - \theta ^\star \Vert _2 \le \Vert \theta ^{(t)} - \theta ^\star \Vert _2\). Indeed, a straightforward computation yields

$$\begin{aligned} \big \Vert \theta ^{(t+1)} - \theta ^\star \big \Vert _2^2 = \big \Vert \theta ^{(t)} - \theta ^\star \big \Vert _2^2 + \big (u_{j_t}(\theta ^{(t)}) - \theta ^{(t)}_{j_t}\big )^2 + 2 \big (u_{j_t}(\theta ^{(t)}) - \theta ^{(t)}_{j_t}\big ) \big (\theta ^{(t)}_{j_t} - \theta ^\star _{j_t}\big ). \end{aligned}$$

We need to show that \(\delta _t^2 \le -2 \delta _t \big (\theta ^{(t)}_{j_t} - \theta ^\star _{j_t}\big )\) with \(\delta _t = u_{j_t}(\theta ^{(t)}) - \theta ^{(t)}_{j_t}\). Notice that \(\delta _t\) always has the opposite sign of \(g_{j_t}(\theta ^{(t)})\) (thanks to the thresholding), so by convexity of R along the coordinate \(j_t\) we have \(\delta _t \big (\theta ^{(t)}_{j_t} - \theta ^\star _{j_t}\big ) \le 0\). It therefore remains to show that \(\vert \delta _t\vert \le 2\big \vert \theta ^{(t)}_{j_t} - \theta ^\star _{j_t}\big \vert \), which follows from

$$\begin{aligned} \vert \delta _t\vert \le \frac{\big \vert g_{j_t}(\theta ^{(t)})\big \vert }{L_{j_t}} = \frac{\big \vert g_{j_t}(\theta ^{(t)}) - g_{j_t}(\theta ^\star )\big \vert }{L_{j_t}} \le \big \vert \theta ^{(t)}_{j_t} - \theta ^\star _{j_t}\big \vert , \end{aligned}$$

which concludes the proof of Theorem 7. \(\square \)
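For concreteness, here is a minimal sketch of the thresholded coordinate update \(u_{j}\) analyzed in this proof, assuming the step size \(\beta _j = 1/L_j\) (the choice for which the \({{\,\mathrm{\hbox {argmin}}\,}}\) and projection forms of \(u_j\) coincide), an interval constraint set \(\Theta _j = [a_j, b_j]\) and soft-thresholding for \(\tau _{\epsilon }\). This is an illustration only, not the linlearn implementation; the names soft_threshold and coordinate_update are ours, and the robust estimate of the partial derivative together with its error level \(\epsilon _j\) are supplied by the caller.

```python
import numpy as np

def soft_threshold(g, eps):
    """tau_eps(g) = sign(g) * max(|g| - eps, 0)."""
    return np.sign(g) * max(abs(g) - eps, 0.0)

def coordinate_update(theta, j, grad_j_hat, eps_j, lip_j, lower=-np.inf, upper=np.inf):
    """One thresholded coordinate step: project
    theta_j - (1 / L_j) * tau_{eps_j}(g_hat_j) onto the interval [lower, upper]."""
    new_theta = theta.copy()
    new_theta[j] = np.clip(theta[j] - soft_threshold(grad_j_hat, eps_j) / lip_j,
                           lower, upper)
    return new_theta

# Illustrative use: one update of coordinate j = 1, with a hypothetical robust
# estimate of the partial derivative (0.8) and its error level eps_j = 0.1.
theta = np.zeros(3)
theta = coordinate_update(theta, j=1, grad_j_hat=0.8, eps_j=0.1, lip_j=2.0)
print(theta)   # coordinate 1 moves by -(0.8 - 0.1) / 2 = -0.35
```

In a full solver, this update is applied repeatedly with \(j\) sampled uniformly at random, which produces the iterates \(\theta ^{(t)}\) analyzed above.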
