Learning with risks based on M-location

In this work, we study a new class of risks defined in terms of the location and deviation of the loss distribution, generalizing far beyond classical mean-variance risk functions. The class is easily implemented as a wrapper around any smooth loss, admits finite-sample stationarity guarantees for stochastic gradient methods, and is straightforward to interpret and adjust, with close links to M-estimators of the loss location. It also has a salient effect on the test loss distribution, giving us control over symmetry and deviations that is not possible under naive ERM.


Introduction
In machine learning, the important yet ambiguous notion of "good off-sample generalization" (or "small test error") is typically formalized in terms of minimizing the expected value of a random loss E_µ L(h), where h is a candidate decision rule and L(h) is a random variable on an underlying probability space (Ω, F, µ). This setup based on average off-sample performance has been famously called the "general setting of the learning problem" by Vapnik [48], and is central to the decision-theoretic formulation of learning in the influential work of Haussler [17]. This is by no means a purely theoretical concern; when average performance dictates the ultimate objective of learning, the data-driven feedback used for training in practice will naturally be designed to prioritize average performance [8, 20, 24, 43]. Take the default optimizers in popular software libraries such as PyTorch or TensorFlow; virtually without exception, these methods amount to efficient implementations of empirical risk minimization. While the minimal expected loss formulation is clearly an intuitive choice, the tacit emphasis on average performance represents an important and non-trivial value judgment, which may or may not be appropriate for any given real-world learning task.
To make this value judgment an explicit part of the machine learning workflow, in this work we consider a generalized class of risk functions, designed to give the user much greater flexibility in terms of how they choose to evaluate performance, while still allowing for theoretical performance guarantees. One core statistical concept is that of the M-location of the loss distribution under a candidate h, defined by
$$ \mathrm{M}(h) \in \operatorname*{arg\,min}_{\theta \in \mathbb{R}} \mathbf{E}_\mu\, \rho\!\left(\frac{L(h)-\theta}{\sigma}\right). \tag{1} $$
Here ρ : R → [0, ∞) is a function controlling how we measure deviations, and σ > 0 is a scaling parameter. Since the loss distribution µ is unknown, clearly M(h) is an ideal, unobservable quantity. If we replace µ with the empirical distribution induced by a sample (L_1, …, L_n), then for certain special cases of ρ we get an M-estimator of the location of L(h), a classical notion dating back to Huber [19], which justifies our naming. Ignoring integrability concerns for the moment, note that in the special case of ρ(u) = u², we get the classical risk M(h) = E_µ L(h), and in the case of ρ(u) = |u|, we get M(h) = inf{u : µ{L(h) ≤ u} ≥ 0.5}, namely the median of the loss distribution. This rich spectrum of evaluation metrics makes the notion of casting learning problems in terms of minimization of M-locations (via their corresponding M-estimators) very conceptually appealing. However, while the statistical properties of the minima of M-estimators in special cases are understood [10], the optimization involved is both costly and difficult, making the task of designing and studying M(·)-minimizing learning algorithms highly intractable. With these issues in mind, we study a closely related alternative which retains the conceptual appeal of raw M-locations, but is more computationally congenial.
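The two special cases mentioned above (mean for ρ(u) = u², median for ρ(u) = |u|) can be checked numerically. The following is a minimal sketch, not the authors' implementation: the helper name `m_location`, the ternary-search solver, and the toy sample are all assumptions made for illustration, and the scale is fixed at σ = 1 since it does not change the minimizer in these two cases.

```python
def m_location(losses, rho, iters=200):
    """Approximate the empirical M-location: argmin over theta of mean(rho(l - theta)).

    Uses ternary search on [min(losses), max(losses)], which is valid when rho is convex.
    """
    lo, hi = min(losses), max(losses)

    def objective(theta):
        return sum(rho(l - theta) for l in losses) / len(losses)

    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if objective(m1) <= objective(m2):
            hi = m2  # for convex objectives, a minimizer lies in [lo, m2]
        else:
            lo = m1
    return 0.5 * (lo + hi)


# Hypothetical toy sample with one large outlier.
losses = [0.1, 0.2, 0.3, 0.4, 9.0]
mean_like = m_location(losses, lambda u: u * u)  # squared deviations: sample mean
median_like = m_location(losses, abs)            # absolute deviations: sample median
```

On this sample the squared-deviation location is pulled toward the outlier, while the absolute-deviation location is not, which is the kind of value judgment the choice of ρ encodes.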
Our approach. With σ and ρ as before, our generalized risks will be defined implicitly by
$$ \min_{\theta \in \mathbb{R}} \left[ \theta + \eta\, \mathbf{E}_\mu\, \rho\!\left(\frac{L(h)-\theta}{\sigma}\right) \right], \tag{2} $$
where η > 0 is a weighting parameter that controls the balance of priority between location and deviation. A more formal definition will be given in section 2 (see equations (3)-(5)), including concrete forms for ρ(·) that are conducive to both fast computation and meaningful learning guarantees. In addition, we will see (cf. Proposition 3) that after minimization, the objective (2) can be written as
$$ \left(\mathrm{M}(h) - c_{\mathrm{M}}\right) + \eta\, \mathbf{E}_\mu\, \rho\!\left(\frac{L(h) - \mathrm{M}(h) + c_{\mathrm{M}}}{\sigma}\right), $$
where the shift term c_M > 0 can be simply characterized by the equality
$$ \mathbf{E}_\mu\, \rho'\!\left(\frac{L(h) - \mathrm{M}(h) + c_{\mathrm{M}}}{\sigma}\right) = \frac{\sigma}{\eta}, $$
noting that c_M → 0+ as η → ∞. By utilizing smoothness properties of loss functions typical to machine learning problems (e.g., squared error, cross-entropy, etc.), even though the generalized risks need not be convex, they can be shown to satisfy weak notions of convexity, which still admit finite-sample guarantees of near-stationarity for stochastic gradient-based learning algorithms (details in section 3). This approach has the additional benefit that implementation only requires a single wrapper around any given loss which can be set prior to training, making for easy integration with frameworks such as PyTorch and TensorFlow, while incurring negligible computational overhead.
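To see where a shift term of this kind comes from, the following short derivation may help. It is a sketch under stated assumptions rather than the paper's own argument: we assume the objective has the form θ + η E_µ ρ((L(h) − θ)/σ), that ρ is convex and differentiable, and that differentiation and expectation can be interchanged.

```latex
% Assumed objective: r(\theta) = \theta + \eta\,\mathbf{E}_\mu\,\rho((L(h)-\theta)/\sigma).
\begin{align*}
\frac{d}{d\theta}\left[\theta + \eta\,\mathbf{E}_\mu\,\rho\!\left(\frac{L(h)-\theta}{\sigma}\right)\right]
  &= 1 - \frac{\eta}{\sigma}\,\mathbf{E}_\mu\,\rho'\!\left(\frac{L(h)-\theta}{\sigma}\right) = 0 \\
\iff \quad \mathbf{E}_\mu\,\rho'\!\left(\frac{L(h)-\theta}{\sigma}\right) &= \frac{\sigma}{\eta}.
\end{align*}
% The M-location itself satisfies \mathbf{E}_\mu\,\rho'((L(h)-\mathrm{M}(h))/\sigma) = 0.
% Since \rho' is non-decreasing and \sigma/\eta > 0, the minimizing \theta sits below \mathrm{M}(h),
% and the gap shrinks to zero as \eta \to \infty because the right-hand side vanishes.
```

Under these assumptions, the minimizer of the joint objective is the M-location shifted down by a positive amount that depends on η, which matches the role played by the shift term in the text.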

Our contributions
The key contribution here is a new concrete class of risk functions, defined and analyzed in section 2. These risks are statistically easy to interpret, their empirical counterparts are simple to implement in practice, and as we prove in section 3, their design allows for standard stochastic gradient-based algorithms to be given competitive finite-sample excess risk guarantees. We also verify empirically (section 4) that the proposed feedback generation scheme has a demonstrable effect on the test distribution of the loss, which as a side-effect can be easily leveraged to outperform traditional ERM implementations, a result which is in line with early insights from Breiman [9] and Reyzin and Schapire [34] regarding the impact of the loss distribution on generalization. More broadly, the choice of which risk to use plays a central role in pursuit of increased transparency in machine learning, and our results represent a first step towards formalizing this process.
Relation to existing literature. With respect to alternative notions of "risk" in machine learning, perhaps the most salient example is conditional value-at-risk (CVaR) [33, 18], namely the expected loss conditioned on it exceeding a quantile at a pre-specified probability level. CVaR allows one to encode a sensitivity to extremely large losses, and admits convexity when the underlying loss is convex, though the conditioning often leaves the effective sample size very small. Other notions such as cumulative prospect theory (CPT) scores have also been considered [7, 26, 23], but the technical difficulties involved with computation and analysis arguably outweigh the conceptual benefits of learning using such scores. These proposals can all be interpreted as "location" parameters of the underlying loss distribution; our risks take the form of a sum of a location and a deviation term, where the location is a shifted M-location, as described above. The basic notion of combining location and deviation information in evaluation is a familiar concept; the mean-variance objective E_µ L(·) + var_µ L(·) dates back to classical work by Markowitz [28]; our proposed class includes this as a special case, but generalizes far beyond it. Mean-variance and other risk function classes are studied by Ruszczyński and Shapiro [40, 41], who give minimizers a useful dual characterization, though our proposed class is not captured by their work (see also Remark 4). We note also that the recent (and independent) work of Lee et al. [25] considers a form which is similar to (2) in the context of empirical risk minimizers; the critical technical difference is that their formulation is restricted to ρ which is monotonic, an assumption which enforces convexity. The special case of mean-variance is also treated in depth in more recent work by Duchi and Namkoong [14], who consider stochastic learning algorithms for doing empirical risk minimization with variance-based regularization. Finally, we note that our technical analysis in section 3 makes crucial use of weak convexity properties of function compositions, an area of active research in recent years [15, 13, 12]. Since our proposed objective can be naturally cast as a composition taking us from parameter space to a Banach space and finally to R, leveraging the insights of these previous works, we extend the existing machinery to handle learning over Banach spaces, and give finite-sample guarantees for arbitrary Hilbert spaces. More details are provided in section 3, plus the appendix.

Notation and terminology
To give the reader an approximate idea of the technical level of this paper, we assume some familiarity with probability spaces, the notion of sub-gradients and the sub-differential of convex functions, as well as special classes of vector spaces like Banach and Hilbert spaces, although the main text is written with a wide audience in mind. Strictly speaking, we will also deal with sub-differentials of non-convex functions, but these technical concepts are relegated to the appendix, where all formal proofs are given. In the main text, to improve readability, we write ∂f(x) to denote the sub-differential of f at x, regardless of the convexity of f. When we refer to a function being λ-smooth, this refers to its gradient being λ-Lipschitz continuous, and weak smoothness just requires such continuity on directional derivatives; all these concepts are given a detailed introduction in the appendix. Throughout this paper, we use E[·] for taking expectation, and P as a general-purpose probability function.
For indexing, we will write [k] := {1, …, k}. Distance of a vector v from a set A will be denoted by dist(v;

A concrete class of risk functions
The risks described by (2) are fairly intuitive as-is, but a bit more structure is needed to ensure they are well-defined and useful in practice. To make things more concrete, let us fix ρ as
$$ \rho(u) := u \operatorname{atan}(u) - \frac{\log(1+u^2)}{2}. \tag{3} $$
This function is handy in that it behaves approximately quadratically around zero, and it is both π/2-Lipschitz and strictly convex on the real line.¹ Fixing this particular choice of ρ and letting Z be a random variable (any F-measurable function), we interpolate between mean- and median-centric quantities via the following class of functions, indexed by σ ∈ [0, ∞]:
$$ \rho_\sigma(u) := \begin{cases} |u|, & \sigma = 0, \\ \rho(u/\sigma), & 0 < \sigma < \infty, \\ u^2, & \sigma = \infty, \end{cases} \tag{4} $$
$$ r_\sigma(Z, \theta) := \theta + \eta\, \mathbf{E}_\mu\, \rho_\sigma(Z - \theta). \tag{5} $$
With this class of ancillary functions in hand, it is natural to define
$$ R_\sigma(Z) := \inf_{\theta \in \mathbb{R}} r_\sigma(Z, \theta) \tag{6} $$
to construct a class of risk functions. In the context of learning, we will use this risk function to derive a generalized risk, namely the composite function h ↦ R_σ(L(h)). As a special case, clearly this includes risks of the form (2) given earlier. Visualizations of these functions are given in the supplementary appendix. Minimizing R_σ(L(h)) in h is our formal learning problem of interest.
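As a rough illustration of how an empirical counterpart of such a risk could be computed from a sample, consider the sketch below. It is not the authors' code. It assumes ρ(u) = u·atan(u) − log(1 + u²)/2, the three-case ρ_σ (absolute value at σ = 0, ρ(u/σ) for finite σ > 0, square at σ = ∞), and a joint objective θ + η·mean(ρ_σ(z − θ)); these forms are inferred from the properties stated in the text, and all function names are hypothetical.

```python
import math


def rho(u):
    """Assumed form: rho(u) = u * atan(u) - log(1 + u^2) / 2."""
    return u * math.atan(u) - 0.5 * math.log1p(u * u)


def rho_sigma(u, sigma):
    """Assumed interpolation: |u| at sigma = 0, u^2 at sigma = inf, rho(u / sigma) otherwise."""
    if sigma == 0:
        return abs(u)
    if math.isinf(sigma):
        return u * u
    return rho(u / sigma)


def joint_risk(sample, theta, sigma, eta):
    """Empirical joint objective: theta + eta * mean(rho_sigma(z - theta))."""
    dev = sum(rho_sigma(z - theta, sigma) for z in sample) / len(sample)
    return theta + eta * dev


def risk(sample, sigma, eta, iters=200):
    """Minimise joint_risk over theta (convex in theta); returns (risk value, minimising theta)."""
    def f(theta):
        return joint_risk(sample, theta, sigma, eta)

    hi = max(sample)  # the objective is increasing in theta beyond the largest sample value
    lo, step = min(sample), 1.0
    for _ in range(60):  # expand the bracket downwards until the objective stops decreasing
        if f(lo - step) >= f(lo):
            break
        lo -= step
        step *= 2.0
    else:
        raise ValueError("no minimiser found; eta may be too small for this sigma")
    lo -= step
    for _ in range(iters):  # ternary search on the bracket
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) <= f(m2):
            hi = m2
        else:
            lo = m1
    theta = 0.5 * (lo + hi)
    return f(theta), theta
```

For σ = ∞ the minimization can be done by hand: the minimizing θ is the sample mean minus 1/(2η), and the risk equals mean + η·variance − 1/(4η), which gives a convenient check on the numerical routine. The same helper `rho` can be used to observe that 2σ²ρ(u/σ) is close to u² for large σ, while (2σ/π)ρ(u/σ) is close to |u| for small σ.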
Before considering learning algorithms, we briefly cover the basic properties of the functions r_σ and R_σ. Without restricting ourselves to the specialized context of "losses," note that if Z is any square-µ-integrable random variable, this immediately implies that |r_σ(Z, θ)| < ∞ for all θ ∈ R, and thus R_σ(Z) < ∞. Furthermore, the following result shows that it is straightforward to set the weight η to ensure R_σ(Z) > −∞ also holds, and a minimum exists.
take any η > 0. Under these settings, the function θ ↦ r_σ(Z, θ) is bounded below and takes its minimum on R. Thus, for each square-µ-integrable Z, there always exists a (non-random) θ_Z ∈ R such that R_σ(Z) = r_σ(Z, θ_Z). Furthermore, when σ > 0, this minimum θ_Z is unique.
Remark 2 (Mean-median interpolation). In order to ensure that risk modulation via σ ∈ [0, ∞] smoothly transitions from a median-centric (σ = 0 case) to a mean-centric (σ = ∞ case) location, the parameter η plays a key role. Noting that for any u ∈ R, for ρ defined by (3) we have 2σ²ρ(u/σ) → u² as σ → ∞, and thus for large values of σ > 0 it is natural to set η = 2σ². On the other end of the spectrum, since σ log(1 + (u/σ)²) → 0+ whenever σ → 0+, it is thus natural to set η = σ/atan(∞) = 2σ/π when σ > 0 is small. Strictly speaking, in light of the conditions in Proposition 1, to ensure R_σ is finite one should take η > 2σ/π.
What can we say about our risk functions R_σ in terms of more traditional statistical risk properties? The form of R_σ given in (6) has a simple interpretation as a weighted sum of "location" and "deviation" terms. In the statistical risk literature, the seminal work of Artzner et al. [1] gives an axiomatic characterization of location-based risk functions that can be considered coherent, while Rockafellar et al. [36] characterize functions which capture the intuitive notion of "deviation," and establish a lucid relationship between coherent risks and their deviation class. The following result describes key properties of the proposed risk functions, in particular highlighting the fact that while our location terms are monotonic, our risk functions are non-traditional in that they are non-monotonic.
Proposition 3 (Non-monotonic risk functions). Let Z be a Banach space of square-µ-integrable functions. For any σ ∈ [0, ∞], let η > 0 be set as in Proposition 1. Then, the functions r_σ : Z × R → R and R_σ : Z → R satisfy the following properties:
• Both r_σ and R_σ are continuous, convex, and sub-differentiable.
• The deviation in (6) is non-negative and translation-invariant, namely for any a ∈ R, we have
$$ \eta\, \mathbf{E}_\mu\, \rho_\sigma\!\left((Z + a) - \theta_{Z+a}\right) = \eta\, \mathbf{E}_\mu\, \rho_\sigma\!\left(Z - \theta_Z\right). $$
In particular, the risk h ↦ R_σ(L(h)) need not be convex, even if L(·) is.
Remark 4. Since our risk function R_σ is not monotonic, standard results in the literature on optimizing generalized risks do not apply here. We remark that our proposed risk class does not appear among the comprehensive list of examples given in the works of Ruszczyński and Shapiro [40, 41], aside from the special case of σ = ∞ with η = 1. While the continuity and sub-differentiability of any risk function which is convex and monotonic is well-known for a large class of Banach spaces [40, Sec. 3], in Proposition 3 we obtain such properties without monotonicity by using square-µ-integrability combined with properties of our function class ρ_σ.
Algorithm 1 Projected sub-gradient method with randomized output.
Since our principal interest is the case where Z = L(h), the key takeaways from this section are that while the proposed risk h ↦ R_σ(L(h)) is well-defined and easy to estimate given a random sample L_1(h), …, L_n(h), the learning task is non-trivial since R_σ(L(·)) is not differentiable (and thus non-smooth) when σ = 0, and for any σ ∈ [0, ∞] need not be convex, even when the underlying loss is both smooth and convex. Fortunately, smoothness properties of the losses typically used in machine learning can be leveraged to overcome these technical barriers, opening a path towards learning guarantees for practical algorithms. This is the topic of the next section.

Learning algorithm analysis
Thus far, we have only been concerned with ideal quantities R_σ and r_σ used to define the ultimate formal goal of learning. In practice, the learner will only have access to noisy, incomplete information. In this work, we focus on iterative algorithms based on stochastic gradients, largely motivated by their practical utility and ubiquity in modern machine learning applications. For the rest of the paper, we overload our risk definitions to enhance readability, writing r_σ(h, θ) := r_σ(L(h), θ) and R_σ(h) := R_σ(L(h)). First note that we can break down the underlying joint objective as r_σ(h, θ) = E_µ (f_2 ∘ F_1)(h, θ), where we have defined
$$ F_1(h, \theta) := (L(h), \theta), \qquad f_2(u, \theta) := \theta + \eta\, \rho_\sigma(u - \theta). \tag{7} $$
From the point of view of the probability space (Ω, F, µ), the function F_1 is random, whereas f_2 is deterministic; our use of upper- and lower-case letters is just meant to emphasize this. Given some initial value (h_0, θ_0) ∈ H × R, one naively hopes to construct an efficient stochastic gradient algorithm using the update
$$ (h_{t+1}, \theta_{t+1}) = \Pi_C\left[ (h_t, \theta_t) - \alpha_t G_t \right], \tag{8} $$
where α_t > 0 is a step-size parameter, Π_C[·] denotes projection to some set C ⊂ H × R, and the stochastic feedback G_t is just a composition of sub-gradients, namely
$$ G_t \in \partial f_2\left(F_1(h_t, \theta_t)\right) \circ F_1'(h_t, \theta_t). \tag{9} $$
We call this approach "naive" since it is exactly what we would do if we knew a priori that the underlying objective was convex and/or smooth.² The precise learning algorithm studied here is summarized in Algorithm 1. Fortunately, as we describe below, this naive procedure actually enjoys lucid non-asymptotic guarantees, on par with the smooth case.
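The update described above can be sketched in a few lines for a scalar parameter h. This is an illustrative sketch under assumptions, not a transcription of Algorithm 1: it assumes the per-sample objective θ + η·ρ_σ(L(h) − θ), uses a box as the closed convex set C, and assumes the randomized output is drawn uniformly from the iterates. The names `rho_sigma_grad`, `projected_subgradient`, and `sample_loss` are hypothetical.

```python
import math
import random


def rho_sigma_grad(u, sigma):
    """A (sub-)derivative of the assumed rho_sigma: sign(u), atan(u/sigma)/sigma, or 2u."""
    if sigma == 0:
        return (u > 0) - (u < 0)
    if math.isinf(sigma):
        return 2.0 * u
    return math.atan(u / sigma) / sigma


def projected_subgradient(sample_loss, h0, theta0, sigma, eta, step, n_steps,
                          radius=10.0, seed=0):
    """Projected stochastic sub-gradient steps on (h, theta), with a randomly chosen output.

    sample_loss(h, rng) must return one stochastic loss value and its derivative in h.
    The set C is assumed to be the box [-radius, radius]^2.
    """
    rng = random.Random(seed)

    def clip(v):
        return max(-radius, min(radius, v))

    h, theta = h0, theta0
    iterates = [(h, theta)]
    for _ in range(n_steps):
        loss, loss_grad = sample_loss(h, rng)
        d = eta * rho_sigma_grad(loss - theta, sigma)
        g_h, g_theta = d * loss_grad, 1.0 - d  # chain rule through theta + eta * rho_sigma(L - theta)
        h, theta = clip(h - step * g_h), clip(theta - step * g_theta)
        iterates.append((h, theta))
    return rng.choice(iterates), iterates  # assumed uniform draw over the iterates


def quadratic_loss(h, rng):
    """Noise-free toy loss (h - 1)^2 with derivative 2(h - 1); rng is unused here."""
    return (h - 1.0) ** 2, 2.0 * (h - 1.0)


output, iterates = projected_subgradient(quadratic_loss, 0.0, 0.0, float("inf"), 1.0, 0.05, 5000)
```

With the noise-free toy loss, σ = ∞ and η = 1, the joint objective θ + ((h − 1)² − θ)² has its only stationary point at h = 1, θ = −1/2, so the last iterate gives a simple check that the update behaves as expected.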
How to measure algorithm performance? Before stating any formal results, we briefly discuss the means by which we evaluate learning algorithm performance. Since the sequence (R_σ(h_t)) cannot be controlled in general, a more tractable problem is that of finding a stationary point of r_σ, namely any (h*, θ*) such that 0 ∈ ∂r_σ(h*, θ*). However, it is not practical to analyze dist(0; ∂r_σ(h_t, θ_t)) directly, due to a lack of continuity. Instead, we consider a smoothed version of r_σ:
$$ r_{\sigma,\beta}(h, \theta) := \inf_{(h', \theta')} \left[ r_\sigma(h', \theta') + \frac{1}{2\beta} \left\| (h, \theta) - (h', \theta') \right\|^2 \right]. $$
This is none other than the Moreau envelope of r_σ, with weighting parameter β > 0. A familiar concept from convex analysis on Hilbert spaces [5, Ch. 12 and 24], the Moreau envelope of non-smooth functions satisfying weak convexity properties has recently been shown to be a very useful metric for evaluating stochastic optimizers [12, 13]. Our basic performance guarantees will first be stated in terms of the gradient of the smoothed function r_σ,β. We will then relate this to the joint risk r_σ and subsequently the risk R_σ.
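For intuition about this kind of smoothing, a one-dimensional example is helpful. The sketch below assumes the common parameterization in which the envelope of f at x is the infimum over y of f(y) + (y − x)²/(2β); whether the paper uses exactly this scaling is an assumption here, and the function names are hypothetical. For f(y) = |y| the envelope is the Huber function, which is differentiable even though f is not.

```python
def moreau_envelope(f, x, beta, lo=-10.0, hi=10.0, iters=200):
    """Return (envelope value, proximal point) for a one-dimensional function f.

    Minimises f(y) + (y - x)^2 / (2 * beta) over [lo, hi] by ternary search,
    which assumes that this objective is convex on the interval.
    """
    def objective(y):
        return f(y) + (y - x) ** 2 / (2.0 * beta)

    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if objective(m1) <= objective(m2):
            hi = m2
        else:
            lo = m1
    prox = 0.5 * (lo + hi)
    return objective(prox), prox


def envelope_gradient(f, x, beta):
    """Gradient of the envelope, computed as (x - prox) / beta."""
    _, prox = moreau_envelope(f, x, beta)
    return (x - prox) / beta


value_far, prox_far = moreau_envelope(abs, 2.0, 1.0)    # linear regime of the Huber function
value_near, prox_near = moreau_envelope(abs, 0.3, 1.0)  # quadratic regime of the Huber function
```

A small envelope gradient at x means x is close to its proximal point, which is what makes the envelope gradient a usable measure of near-stationarity for non-smooth objectives.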
Guarantees based on joint risk minimization. Within the context of the stochastic updates characterized by (8)-(9) […] := (G_0, …, G_t), we formalize our assumptions as follows:
A1. For all h ∈ H, the random loss L(h) is square-µ-integrable, locally Lipschitz, and weakly λ-smooth, with a gradient satisfying
A2. H is a Hilbert space, and C ⊂ H × R is a closed convex set.
The following is a performance guarantee for Algorithm 1 in terms of the smoothed joint risk.
Implications in terms of the original objective. The results described in Theorem 5 and Remark 6 are with respect to a smoothed version of the joint risk function r_σ. Linking these facts to insights in terms of the original proposed risk R_σ can be done as follows. Assuming we take n ≥ 2γκ²∆_0/ε⁴ to achieve the ε-precision discussed in Remark 6, the immediate conclusion is that the algorithm output is (ε/2γ)-close to an ε-nearly stationary point of r_σ. More precisely, we have that there exists an ideal point The above fact follows from basic properties of the Moreau envelope (cf. Appendix B.4). These non-asymptotic guarantees of being close to a "good" point extend to the function values of the risk R_σ since we are close to a candidate h*_n whose risk value can be no worse than We remark that these learning guarantees hold for a class of risks that are in general non-convex and need not even be differentiable, let alone satisfy smoothness requirements.
Key points in the proof of Theorem 5. Here we briefly highlight the key sub-results involved in proving Theorem 5; please see Appendix C.2 for all the details. The key structure that we require is a smooth loss, reflected in assumption A1. This along with the Lipschitz property of our function ρ_σ for all 0 ≤ σ < ∞ allows us to prove that the underlying objective r_σ is weakly convex, where H can be any Banach space (Proposition 12); this generalizes a result of Drusvyatskiy and Paquette [13, Lem. 4.2] from Euclidean space to any Banach space. This alone is not enough to obtain the desired result. Note that the assumption A3 is very weak, and trivially satisfied in most traditional machine learning settings (e.g., where losses are based on a sequence of iid data points). The question of whether the feedback is unbiased or not, i.e., whether E_µ G_t is in the sub-differential of r_σ at step t or not, is something that needs to be formally verified. In Proposition 14 we show that as long as the gradient has a finite expectation, this indeed holds for the feedback generated by (9), when H is any Banach space. With the two key properties of a weakly convex objective and unbiased random feedback in hand, we can leverage the techniques used in Davis and Drusvyatskiy [12, Thm. 3.1] for proximal stochastic gradient methods applied to weakly convex functions on R^d, extending their core argument to the case of any Hilbert space. Combining this technique with the proof of weak convexity and unbiasedness lets us obtain Theorem 5.

Empirical analysis
In this section we introduce representative results for a series of experiments designed to investigate the quantitative and qualitative repercussions of modulating the underlying risk function class.

Sanity check in one dimension
As a natural starting point, we use a toy example to ensure that Algorithm 1 takes us where we expect to go for a particular risk setting. Consider a loss on R with the form L(h) = h L_wide + (1 − h) L_thin, where L_wide and L_thin are random variables independent of h and each other. As a simple example, we use a folded Normal distribution for both, namely L_wide ∼ |Normal(a_wide, b_wide²)| and L_thin ∼ |Normal(a_thin, b_thin²)|, where a_wide = 0, a_thin = 2.0, b_wide = 1.0, and b_thin = 0.1. For simplicity, we fix α_t = 0.001 throughout, and each step uses a mini-batch of size 8. Regarding the risk settings, we look in particular at the case of σ = ∞, where we modify the setting of η = 2^k over k = 0, 1, …, 7. Results averaged over 100 trials are given in Figure 1. By modifying η, we can control whether the learning algorithm "prefers" candidates whose losses have a high degree of dispersion centered around a good location, or those whose losses are well-concentrated near a weaker location.
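A small simulation in the spirit of this sanity check is sketched below. It is a sketch only and does not reproduce the reported experiment: it assumes the σ = ∞ per-sample objective θ + η(L(h) − θ)², treats b_wide and b_thin as standard deviations, restricts h to [0, 1] as an illustrative choice of constraint set, starts from h = 0.5 and θ = 0, and reports the average of h over the final 1000 steps. The function name `run_toy` is hypothetical.

```python
import random


def run_toy(eta, n_steps=10000, step=0.001, batch=8, seed=0):
    """Mini-batch stochastic gradient on theta + eta * (L(h) - theta)^2 for the toy loss.

    L(h) = h * L_wide + (1 - h) * L_thin with folded-Normal components.
    Returns the average of h over the last 1000 steps.
    """
    rng = random.Random(seed)
    h, theta = 0.5, 0.0
    tail = []
    for t in range(n_steps):
        g_h = g_theta = 0.0
        for _ in range(batch):
            l_wide = abs(rng.gauss(0.0, 1.0))  # folded Normal, location 0, scale 1.0
            l_thin = abs(rng.gauss(2.0, 0.1))  # folded Normal, location 2.0, scale 0.1
            loss = h * l_wide + (1.0 - h) * l_thin
            d = eta * 2.0 * (loss - theta)     # eta times the derivative of u^2 at (loss - theta)
            g_h += d * (l_wide - l_thin) / batch
            g_theta += (1.0 - d) / batch
        h = min(1.0, max(0.0, h - step * g_h))  # assumed projection of h onto [0, 1]
        theta -= step * g_theta
        if t >= n_steps - 1000:
            tail.append(h)
    return sum(tail) / len(tail)


h_small_eta = run_toy(1.0)    # low weight on deviation
h_large_eta = run_toy(128.0)  # high weight on deviation
```

Under these assumptions, a small η favours the widely dispersed component with the better location (h near 1), while a large η favours the tightly concentrated component with the weaker location (h near 0), mirroring the qualitative trade-off described in the text.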

Impact of risk choice on linear regression
Next we consider how the key choice of σ (and thus the underlying risk R_σ) plays a role in the behavior of Algorithm 1. As another simple, yet more traditional example, consider linear regression in one dimension, where Y = w*_0 + w*_1 X + ε, where X and ε are independent zero-mean random variables, and (w*_0, w*_1) ∈ R² are unknown to the learner. Using squared error L(h) = (Y − h(X))², we run Algorithm 1 again with mini-batches of size 8 and α_t = 0.001 fixed throughout, over a range of σ ∈ [0, ∞] settings, for the same number of iterations as in the previous experiment. The initial value (h_0, θ_0) is initialized at zero plus uniform noise on [−0.05, 0.05]. We also consider multiple noise distributions; as a concrete example, letting N = Normal(0, (0.8)²), we consider both ε = N (Normal case) and ε = e^N − E e^N (log-Normal case). In Figure 2, we plot the learned regression lines (averaged over 100 trials) for each choice of σ and each noise setting. By modulating the target risk function, we can effectively choose between a self-imposed bias (smaller slope, lower intercept here), and a sensitivity to outlying values.
Tests using real-world data. Finally, we consider an application to some well-known benchmark datasets for classification. At a high level, we run Algorithm 1 for multi-class logistic regression for 10 independent trials, where in each trial we randomly shuffle and re-split each full dataset (88% training, 12% testing), and randomly re-initialize the model weights identically to the previous paragraph, again with mini-batches of size 8, and step sizes fixed to α_t = 0.01/√d, where d is the number of free parameters. Additional background on the datasets is given in appendix E. The key question of interest is how the test loss distribution changes as we modify the learner's feedback to optimize a range of risks R_σ. In Figure 3, we see a stark difference between doing traditional empirical risk minimization (ERM, denoted "off") and using R_σ-based feedback, particularly for moderately large values of σ. The logistic losses are concentrated much more tightly (visible in the bottom row histograms), and this also leads to a better classification error (visible in the top row plots), an interesting trend that we observed across many distinct datasets.

A Overview of appendix contents
Our appendix is comprised of several sections, ordered as follows:
B Background and setup
C Detailed proofs
D Helper results
E Empirical supplement
As with the main paper, we handle theoretical topics before diving into empirical topics. Section B gives a very detailed background including numerous formal definitions, supporting lemmas, and discussion on results used later in the detailed proofs (section C) for the main paper's results. Additional numerical test results are at the very end of section E.
To provide additional visual intuition for the reader, we include at the start of this appendix several figures related to ρ defined in (3), ρ_σ defined in (4), and the resulting risk functions. In Figure 4 we plot ρ and its derivatives, plus ρ_σ for a wide variety of σ ∈ [0, ∞] values. Additional details are given in the figure caption. In Figure 5, we show how specific choices of standard loss functions lead to different forms of the function composition h ↦ L(h) ↦ ηρ_σ(L(h) − θ).

B.1 Preliminaries
General notation (probability). Underlying all our analysis is a probability space (Ω, F, µ).⁵ All random variables, unless otherwise specified, will be assumed to be F-measurable functions with domain Ω. Integration using µ will be denoted by E_µ Z := ∫_Ω Z(ω) µ(dω), and P will be used as a generic probability function, typically representing µ itself, or the product measure induced by a sample of random variables on (Ω, F, µ). We use L_2 := L_2(Ω, F, µ) to denote the set of all square-µ-integrable functions.⁶
[Figure 5 caption fragment: Cross-entropy: L(h) = log(1 + exp(−h h*)). In all cases, we have fixed h* = π.]
General notation (normed spaces). Let V denote an arbitrary vector space. When we call V a normed (linear) space, we are referring to (V, ‖·‖), where ‖·‖ : V → R denotes the relevant norm. For any normed space V, we shall denote by V* the usual dual space of V, namely all continuous linear functionals defined on V. The space V* is equipped with the norm ‖v*‖ := sup{v*(u) : u ∈ V, ‖u‖ ≤ 1}. We shall use the notation ⟨·, ·⟩ to represent the "coupling" function between V and V*, that is for any u ∈ V and v* ∈ V*, we will write ⟨u, v*⟩ := v*(u). For any sequence (x_n) of elements x_1, x_2, … ∈ V, we denote convergence of (x_n) to some element x' by x_n → x'. When we take limits and do not specify a particular sequence, for example writing x → x', then this refers to any sequence (of elements from V) that converges to x'. In the special case of real-valued sequences (where V ⊂ R), if we write x_n → x'_+ (respectively x_n → x'_−), this refers to all sequences from above (resp. below), i.e., any convergent sequence such that x_n ≥ x' (resp. x_n ≤ x') for all n. We denote the open ball of radius r > 0 centered at x_0 ∈ V by B(x_0; r) := {x ∈ V : ‖x_0 − x‖ < r}. We denote the extended real line by R̄. On normed space V, we denote the interior of a set U ⊂ V by int U (all x ∈ U such that B(x; δ) ⊂ U for some δ).
General terminology. On any normed linear space V, a set A ⊂ V is said to be compact if for any sequence of elements in A, there exists a sub-sequence which converges on A. We denote the effective domain of an extended real-valued function f by dom f := {x : f(x) < ∞}. We call a convex function f :⁷ For a function f : X → Y, with X and Y being normed spaces, we say f is (locally) Lipschitz at x_0 ∈ X if there exists δ > 0 and λ > 0 such that
$$ \| f(x) - f(x') \| \leq \lambda \| x - x' \| \quad \text{for all } x, x' \in B(x_0; \delta). $$
We say f is λ-Lipschitz on X if this property holds with a common coefficient λ for all x_0 ∈ X.

Semi-continuous functions
We say that a function f is lower semi-continuous⁸ (LSC) at a point x if for each ε > 0, there exists δ > 0 such that ‖x − x'‖ < δ implies f(x') > f(x) − ε. If −g is LSC, then we say g is upper semi-continuous (USC). The property that f is LSC at a point x is equivalent⁹ to the property that for any sequence x_n → x, we have
$$ \liminf_{n \to \infty} f(x_n) \geq f(x). $$
Ordinary continuity is equivalent to being both USC and LSC, but the added generality of these weaker notions of continuity is often useful.
Differentiability. We start by introducing some common notions of directional differentiability at a high level of generality.¹⁰ Let X and Y be normed linear spaces, U ⊂ X an open set, and f : U → Y a function. The radial derivative of f at x ∈ U in direction u is defined by
$$ f'_r(x; u) := \lim_{\alpha \to 0_+} \frac{f(x + \alpha u) - f(x)}{\alpha}. \tag{13} $$
A slight modification to this gives us the (Hadamard) directional derivative of f at x ∈ U in direction u:
$$ f'(x; u) := \lim_{\alpha \to 0_+,\; u' \to u} \frac{f(x + \alpha u') - f(x)}{\alpha}. $$
When f'_r(x; u) exists for all directions u, we say that f is radially differentiable at x. Identically, when f'(x; u) exists for all directions u, we say that f is directionally differentiable at x. When the map u ↦ f'_r(x; u) is continuous and linear, we say that f is Gateaux differentiable at x. When the map u ↦ f'(x; u) is continuous and linear, we say f is Hadamard differentiable at x. If f is Hadamard differentiable, then it is Gateaux differentiable. The converse does not hold in general, but if f is Lipschitz on a neighborhood of x ∈ U, then radial differentiability and directional differentiability (at x) are equivalent.¹¹
When we simply say that a function f : X → Y is differentiable at x ∈ U, we mean that there exists a function f'(x)(·) : X → Y that is linear, continuous, and which satisfies
$$ \lim_{\|u\| \to 0} \frac{\left\| f(x + u) - f(x) - f'(x)(u) \right\|}{\|u\|} = 0. $$
This property is often referred to as Fréchet differentiability. When f is differentiable at x, the map f'(x) is uniquely determined.¹² In the special case where Y ⊂ R, the linear functional represented by f'(x) ∈ X* is called the gradient of f at x. Differentiability is also closely related to directional differentiability; if f is Gateaux differentiable on U and the map x ↦ f'(x; ·) is continuous at x, then f is differentiable at x.¹³
Sub-differentials. Let V be any normed linear space. If f : V → R̄ is convex, then the sub-differential of f at x ∈ dom f is
$$ \partial f(x) := \left\{ v^* \in V^* : f(u) - f(x) \geq \langle u - x, v^* \rangle \ \ \forall\, u \in V \right\} = \left\{ v^* \in V^* : f'_r(x; u) \geq \langle u, v^* \rangle \ \ \forall\, u \in V \right\}. $$
The second characterization of ∂f(x), given using the radial derivative (13), is useful and intuitive.¹⁴ Some authors refer to this as the Moreau-Rockafellar sub-differential to emphasize the context of convex analysis. More generally, however, if f is not convex, then the strong global property used to define the MR sub-differential is so restrictive that most interesting functions are left out. A more general notion is that of the Fréchet sub-differential.¹⁵ Denoted ∂_F f(x), the Fréchet sub-differential of f at x is the set of all bounded linear functionals v* ∈ V* such that for any ε > 0, there exists δ > 0 such that
$$ \|u - x\| < \delta \implies f(u) - f(x) - \langle u - x, v^* \rangle \geq -\varepsilon \|u - x\|. $$
This local requirement is much weaker than the condition characterizing the MR sub-differential, and clearly we have ∂f(x) ⊂ ∂_F f(x). When f is assumed to be locally Lipschitz, another class of sub-differentials is often useful. Define the Clarke directional derivative of f at x in the direction u by
$$ f^{\circ}(x; u) := \limsup_{x' \to x,\; \alpha \to 0_+} \frac{f(x' + \alpha u) - f(x')}{\alpha}. $$
The corresponding Clarke sub-differential is defined as
$$ \partial_C f(x) := \left\{ v^* \in V^* : f^{\circ}(x; u) \geq \langle u, v^* \rangle \ \ \forall\, u \in V \right\}. \tag{19} $$
In the special case where f is convex, all the sub-differentials coincide, i.e., ∂f(x) = ∂_F f(x) = ∂_C f(x).¹⁶ We say that a function f is sub-differentiable at x if its sub-differential (in any sense) at x is non-empty. Finally, a remark on notation when using set-valued functions like x ↦ ∂_C f(x). When we write something like "we have ⟨u, ∂_C f(x)⟩ ≥ g(u)," it is the same as writing "we have ⟨u, v*⟩ ≥ g(u) for all v* ∈ ∂_C f(x)." This kind of notation will be used frequently.

B.2 Generalized convexity
Let X be a normed linear space.Take an open set U ⊂ X and fix some point x 0 ∈ U .For a function f : X → R and parameter γ ∈ R, say that there exists δ > 0 such that for all x, x ∈ B(x 0 ; δ) and α ∈ (0, 1), we have When γ ≥ 0, we say f is γ-weakly convex at x 0 .When γ ≤ 0, we say f is (−γ)-strongly convex at x 0 .When (20) holds for all x 0 ∈ U , we say that f is γ-weakly/strongly convex on U .The special case of γ = 0 is the traditional definition of convexity on U . 17 The ability to construct a quadratic lower-bounding function for f is closely related to notions of weak/strong convexity.Consider the following condition: given γ ∈ R, there exists δ > 0 such that for all x, x ∈ B(x 0 ; δ) we have Here ∂ C f denotes the Clarke sub-differential of f , defined by (19).Let us assume henceforth that X is Banach, f is locally Lipschitz, and ∂ C f (x) is non-empty for all x ∈ U . 18For any γ ∈ R, it is straightforward to show that ( 20) =⇒ (21) holds. 19Since (21) gives us a lower bound on both f (x) and f (x ) for any x and x close enough to x 0 , adding up the inequalities immediately implies When X is Banach and f is locally Lipschitz, it is straightforward to show that ( 22) =⇒ ( 20) is valid. 20As such, for Banach spaces and locally Lipschitz functions, we have that the conditions ( 20), (21), and ( 22) are all equivalent for the general case of γ ∈ R.
Let us consider one more closely related property on the same open set $U \subset \mathcal{X}$:
$$x \mapsto f(x) + \frac{\gamma}{2} \|x\|^2 \text{ is convex on } U. \tag{23}$$
In the special case where $\mathcal{X}$ is a real Hilbert space and the norm $\|\cdot\|$ is induced by the inner product as $\|\cdot\| = \sqrt{\langle \cdot, \cdot \rangle}$, then for any $x, x' \in U$ and $\alpha \in \mathbb{R}$, the equality
$$\|\alpha x + (1-\alpha) x'\|^2 + \alpha (1-\alpha) \|x - x'\|^2 = \alpha \|x\|^2 + (1-\alpha) \|x'\|^2 \tag{24}$$
is easily checked to be valid.$^{21}$ In this case, the equivalence (20) $\iff$ (23) follows from direct verification using (24).$^{22}$ The facts above are summarized in the following result:

Proposition 7 (Characterization of generalized convexity). Consider a function $f : \mathcal{X} \to \mathbb{R}$ on normed linear space $\mathcal{X}$. When $\mathcal{X}$ is Banach and $f$ is locally Lipschitz, then with respect to open set $U \subset \mathcal{X}$ we have the following equivalence:
$$(20) \iff (21) \iff (22).$$
When $\mathcal{X}$ is Hilbert, this equivalence extends to (23).

$^{17}$ See for example Nesterov [30, Ch. 3].
$^{18}$ Penot [31, Prop. 5.3].
$^{19}$ See for example Daniilidis and Malick [11, Thm. 3.1]; in particular their proof of (i) $\implies$ (iii). Their result is stated for $\mathcal{X} = \mathbb{R}^d$ and locally Lipschitz $f$, but the proof easily generalizes to Banach spaces. See also the remarks following their proof about how the local Lipschitz condition can be removed.
$^{20}$ Just apply the argument for (ii) $\implies$ (i) employed by Daniilidis and Malick [11, Thm. 3.1], and strengthen their argument by using a more general form of Lebourg's mean value theorem [31, Thm. 5.12].
$^{21}$ Bauschke and Combettes [5, Cor. 2.15].
$^{22}$ See also Davis and Drusvyatskiy [12, Lem. 2.1] for a similar result when $\mathcal{X} = \mathbb{R}^d$ and $f$ is LSC.

B.3 Function composition on normed spaces
Next we consider the properties of compositions involving functions which are smooth and/or convex. Let $\mathcal{X}$ and $\mathcal{Y}$ be normed linear spaces. Let $g : \mathcal{X} \to \mathcal{Y}$ and $h : \mathcal{Y} \to \mathbb{R}$ be the maps used in our composition, and denote by $f := h \circ g$ the composition, i.e., $f(x) = h(g(x))$ for each $x \in \mathcal{X}$. Our goal will be to present sufficient conditions for the composition $f$ to be weakly convex on an open set $U \subset \mathcal{X}$, in the sense of (20). If we assume simply that $h$ is convex, fixing any point $x_0 \in U$ such that $h$ is sub-differentiable at $g(x_0)$, it follows that for any choice of $x \in \mathcal{X}$ we have
$$f(x) - f(x_0) = h(g(x)) - h(g(x_0)) \geq \langle g(x) - g(x_0), \partial h(g(x_0)) \rangle. \tag{25}$$
Let us further assume that $h$ is $\lambda_0$-Lipschitz, and $g$ is smooth in the sense that it is differentiable on $U$ and the map $x \mapsto g'(x)$ is $\lambda_1$-Lipschitz. For readability, denote the derivative $g'(x_0) : \mathcal{X} \to \mathcal{Y}$ by $g'_0(\cdot) := g'(x_0)(\cdot)$. Taking any choice of $v_0 \in \partial h(g(x_0))$, we can write
$$\langle g(x) - g(x_0) - g'_0(x - x_0), v_0 \rangle \geq -\|v_0\| \, \|g(x) - g(x_0) - g'_0(x - x_0)\| \geq -\frac{\lambda_1}{2} \|v_0\| \, \|x - x_0\|^2 \geq -\frac{\lambda_0 \lambda_1}{2} \|x - x_0\|^2. \tag{26}$$
The first inequality follows from the definition of the norm for linear functionals and the fact that $\partial h(g(x_0)) \subset \mathcal{Y}^{\ast}$. The second inequality follows from a Taylor approximation for Banach spaces (Proposition 16), using the smoothness of $g$. The final inequality follows from the fact that for convex functions, the Lipschitz coefficient implies a bound on all sub-gradients, see (47). To deal with the remaining term, note that we can write
$$\langle g'_0(x - x_0), v_0 \rangle = \langle x - x_0, (g'_0)^{\ast}(v_0) \rangle. \tag{27}$$
To explain the notation here, we use $(\cdot)^{\ast}$ to denote the adjoint, namely $A^{\ast}(y^{\ast}) := y^{\ast} \circ A$, induced by any continuous linear map $A : \mathcal{X} \to \mathcal{Y}$, defined for each $y^{\ast} \in \mathcal{Y}^{\ast}$.$^{23}$ The special case we have considered here is where $Au = g'_0(u)$, noting that differentiability means that the map $u \mapsto g'_0(u)$ is continuous and linear. Recalling the desired form of (21), we need to establish a connection with $\partial_{\mathrm{C}} f(x_0)$. If we further assume that $g$ is locally Lipschitz, then we have
$$(g'_0)^{\ast} \, \partial h(g(x_0)) = (g'_0)^{\ast} \, \partial_{\mathrm{C}} h(g(x_0)) \supset \partial_{\mathrm{C}} f(x_0), \tag{28}$$
where the equality follows from the coincidence of sub-differentials in the convex case, and the key inclusion follows from direct application of a generalized chain rule.$^{24}$ Taking these facts together yields the following result.
Proposition 8 (Weak convexity for composite functions).Let X and Y be Banach spaces.Let g : X → Y be locally Lipschitz and λ 1 -smooth on an open set U ⊂ X .Let h : Y → R be convex and λ 0 -Lipschitz on g(U ) ⊂ Y. Furthermore, let g(U ) ⊂ dom h.Then, the composite function f . .= h • g is γ-weakly convex on U , for γ = λ 0 λ 1 .
Proof. The desired result just requires us to piece together the key facts we have outlined in the main text. Local Lipschitz properties for $g$ and $h$ imply that $f$ is locally Lipschitz, and thus $\partial_{\mathrm{C}} f(x_0) \neq \emptyset$ for all $x_0 \in U$. Using (28), we have that $\partial h(g(x_0)) \neq \emptyset$ as well. With this in mind, linking up (25)-(27), under the assumptions stated, for each $x, x_0 \in U$ we have
$$f(x) - f(x_0) \geq \langle x - x_0, (g'_0)^{\ast} \, \partial h(g(x_0)) \rangle - \frac{\lambda_0 \lambda_1}{2} \|x - x_0\|^2. \tag{29}$$
Using the inclusion (28) with (29), we have (21) for $\gamma = \lambda_0 \lambda_1$, and the desired result holds since (21) implies (20).
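As a quick sanity check of Proposition 8, the following minimal sketch tests the weak convexity inequality numerically on a one-dimensional composite. The choices $h = |\cdot|$ (convex, 1-Lipschitz) and $g(x) = x^2 - 1$ (derivative 2-Lipschitz) are illustrative assumptions that do not come from the text, and the inequality coded below is the standard form of condition (20), which we assume matches the paper's statement.

```python
import itertools

def f(x):
    # Composite f = h(g(x)) with h = |.| (convex, 1-Lipschitz) and
    # g(x) = x**2 - 1, whose derivative 2x is 2-Lipschitz.
    # Both choices are illustrative assumptions.
    return abs(x * x - 1.0)

GAMMA = 1.0 * 2.0  # lambda_0 * lambda_1, the constant given by Proposition 8

def weak_convexity_gap(x, y, alpha, gamma=GAMMA):
    # Right-hand side minus left-hand side of the (assumed) inequality
    #   f(a x + (1-a) y) <= a f(x) + (1-a) f(y) + (gamma/2) a (1-a) |x - y|^2.
    lhs = f(alpha * x + (1.0 - alpha) * y)
    rhs = (alpha * f(x) + (1.0 - alpha) * f(y)
           + 0.5 * gamma * alpha * (1.0 - alpha) * (x - y) ** 2)
    return rhs - lhs

grid = [i / 10.0 for i in range(-30, 31)]
alphas = [0.1, 0.25, 0.5, 0.75, 0.9]
worst_gap = min(weak_convexity_gap(x, y, a)
                for x, y in itertools.product(grid, grid)
                for a in alphas)
print("smallest gap with gamma = 2:", worst_gap)
print("gap with gamma = 0 at (-0.5, 0.5):",
      weak_convexity_gap(-0.5, 0.5, 0.5, gamma=0.0))
```

On this grid the gap stays non-negative for $\gamma = \lambda_0 \lambda_1 = 2$, while setting $\gamma = 0$ produces a violation, reflecting the fact that this $f$ is weakly convex but not convex.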

B.4 Proximal maps of weakly convex functions
For normed linear space X and function f : X → R, the Moreau envelope env β f and proximal mapping (or proximity operator ) prox β f are respectively defined for each x ∈ X as follows: Here β > 0 is a parameter.In the case where f is convex, the basic properties of the proximal map and envelope are well-understood, particularly when X is a Hilbert space. 25These insights extend readily to the setting of weak convexity.Under the assumption that X is Hilbert, let f be γ-weakly convex on X .Trivially we can write If we write f γ,x (u) . .= f (u) + (γ/2) x − u 2 and β γ . .= (β −1 − γ) −1 for readability, then as long as β γ > 0 we have for all x ∈ X that env β f (x) = env βγ f γ,x (x) and prox β f (x) = prox βγ f γ,x (x).By leveraging Proposition 7 under the Hilbert space assumption, we have that for any x ∈ X , the function f γ,x (•) is convex.This means that as long as β γ > 0, namely whenever γ < β −1 , all the standard results available for the case of convex functions can be brought to bear on the problem. 26Of particular importance to us is the fact that when f is LSC and γ-weakly convex, the Moreau envelope is differentiable, with gradient well-defined for all β < γ −1 and x ∈ X . 27We will be interested in finding stationary points of f , namely those x ∈ X such that 0 ∈ ∂ C f (x).From the basic properties of the envelope and proximal mapping, for γ-weakly convex f we have That is, for any point x ∈ X , the point prox β f (x) ∈ X is approximately stationary.The degree of precision is controlled by the gradient of env β f evaluated at x.In addition, it follows immediately from (32) that Since one trivially also has f (prox β f (x)) ≤ f (x), the norm of the gradient of env β f evaluated at x also tells us how far we are from a point (namely prox β f (x) ∈ X ) which is no worse than x in terms of function value.These basic facts directly motivate the use of the Moreau envelope norm to quantify algorithm performance. 
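To make the role of the envelope gradient concrete, here is a minimal one-dimensional sketch. It assumes the usual definitions $\mathrm{env}_\beta f(x) = \min_u \{ f(u) + \|x - u\|^2 / (2\beta) \}$ and $\nabla \mathrm{env}_\beta f(x) = (x - \mathrm{prox}_\beta f(x)) / \beta$; the test function $f(u) = |u^2 - 1|$ (which is 2-weakly convex), the golden-section solver, and the search radius are all illustrative choices, not taken from the paper.

```python
import math

def f(u):
    # Illustrative 2-weakly convex function (an assumption, not from the text).
    return abs(u * u - 1.0)

def golden_section(obj, lo, hi, tol=1e-10):
    # Minimise a unimodal function on [lo, hi] by golden-section search.
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    while b - a > tol:
        if obj(c) < obj(d):
            b, d = d, c
            c = b - inv_phi * (b - a)
        else:
            a, c = c, d
            d = a + inv_phi * (b - a)
    return 0.5 * (a + b)

def prox_and_envelope(x, beta, radius=10.0):
    # For beta < 1/gamma the inner objective is strongly convex, hence
    # unimodal, so a bracketing search on [x - radius, x + radius] suffices
    # as long as the minimiser lies inside the bracket (assumed here).
    obj = lambda u: f(u) + (u - x) ** 2 / (2.0 * beta)
    u_star = golden_section(obj, x - radius, x + radius)
    return u_star, obj(u_star)

beta, x = 0.25, 2.0            # beta < 1/gamma = 0.5
p, env = prox_and_envelope(x, beta)
grad_env = (x - p) / beta      # assumed gradient identity for the envelope
print(p, env, grad_env)
```

For this example the proximal point can be worked out by hand as $4/3$, with envelope value $5/3$ and envelope gradient $8/3$. The point $\mathrm{prox}_\beta f(x)$ lies at distance $\beta |\nabla \mathrm{env}_\beta f(x)| = 2/3$ from $x$ and has a smaller function value, which is the behaviour the discussion above relies on.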
C Detailed proofs

C.1 Proofs for section 2

Lemma 10 (Lower semi-continuity). Let $\mathcal{Z}$ be a linear space of $\mathcal{F}$-measurable random variables, and let $\rho : \mathbb{R} \to \mathbb{R}$ be any non-negative LSC function that is Borel-measurable. Then we have that the functional $(Z, \theta) \mapsto E_\mu \rho(Z - \theta)$ is also LSC.
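Proof of Lemma 10. Let $(Z_k)$ and $(\theta_k)$ respectively be convergent sequences on $\mathcal{Z}$ and $\mathbb{R}$. As we take $k \to \infty$, say $Z_k \to Z^{\ast}$ pointwise, for some $Z^{\ast} \in \mathcal{Z}$, and $\theta_k \to \theta^{\ast} \in \mathbb{R}$. Since by assumption $\rho$ is LSC on $\mathbb{R}$, using (12) we have (again, pointwise) that $\rho(Z^{\ast} - \theta^{\ast}) \leq \liminf_{k \to \infty} \rho(Z_k - \theta_k)$, and thus
$$E_\mu \rho(Z^{\ast} - \theta^{\ast}) \leq E_\mu \liminf_{k \to \infty} \rho(Z_k - \theta_k) \leq \liminf_{k \to \infty} E_\mu \rho(Z_k - \theta_k). \tag{35}$$
The former inequality follows from monotonicity of the integral, and the latter inequality follows from an application of Fatou's inequality, which is valid since $\rho \geq 0$.$^{29}$ Taking both ends of (35) together, since the choice of sequences $(Z_k)$ and $(\theta_k)$ was arbitrary, it follows again from the equivalence (12) that the functional $(Z, \theta) \mapsto E_\mu \rho(Z - \theta)$ is LSC.

Lemma 11 (Basic integration properties). Let $E_\mu Z^2 < \infty$ hold, and take any $\theta \in \mathbb{R}$. Then the following properties of integrals based on $\rho_\sigma$ defined in (5) hold: Furthermore, the Leibniz integration property holds for both derivatives, that is for any $0 < \sigma < \infty$, and for the special case of $\sigma = \infty$, we have These equalities hold for any $\theta \in \mathbb{R}$.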
Proof of Lemma 11.Non-negativity of ρ σ implies 0 ≤ E µ ρ σ (Z − θ) for all σ.Regarding finiteness, starting with σ = 0 we have For the Leibniz property, let (a k ) be any real sequence such that a k → 0. Using the fact that ρ σ is bounded, the dominated convergence theorem lets us deduce the following: We note that the first equality just uses µ-integrability and linearity of the Lebesgue integral, the second equality uses boundedness and integrability of the derivative, plus dominated convergence (e.g., Ash and Doléans-Dade [2, Thm.1.6.9]).The last equality is just the chain rule applied to the differentiable function ρ σ (•).Since the sequence (a k ) was arbitrary, we conclude that the first equality of (36) holds.The second equality of (36), as well as both equalities in (37) hold via an identical argument.
For the case of σ = 0, note that Clearly, with η > 1 the right-hand side grows without bound as θ → −∞. For the case of 0 < σ < ∞, writing Z θ := (Z − θ)/σ for readability, the joint risk can be written conveniently as Since atan(•) is monotonic (increasing) on R, bounded as | atan(•)| < π/2, and atan(u) → π/2 as u → ∞, we have that E µ atan(Z θ ) → π/2 as θ → −∞, by monotone convergence. 30 Thus, taking η > 2σ/π ensures that eventually as θ → −∞, the first term on the right-hand side of (38) will become positive. Since this term grows linearly, it dominates the other unbounded term (which is logarithmic), and thus we have shown that r σ is coercive whenever 0 ≤ σ < ∞. Convexity and coercivity together imply that θ → r σ (Z, θ) takes its minimum on R; see Bertsekas [6, Sec. B.3.2] or Barbu and Precupanu [3, Thm. 2.11] for standard references. The case of σ = ∞ is easy, since by direct inspection we can write Since the sum of a strongly convex function and an affine function is strongly convex, we have that θ → r ∞ (Z, θ) has a unique minimum on R.
It only remains to prove the uniqueness of θ Z in the proposition statement for the case of 0 < σ < ∞.The most direct way of doing this is to use the Leibniz property (36) proved in our helper Lemma 11, which in particular tells us that where positivity follows from the fact that ρ (•) = 1/(1 + (•) 2 ) > 0. This implies strict convexity, and thus that the minimizer θ Z is unique.
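Since $\theta \mapsto r_\sigma(Z, \theta)$ is convex and coercive with a minimizer on $\mathbb{R}$, a one-dimensional solver is enough to evaluate the risk on a sample. The following minimal sketch does this with SciPy's `minimize_scalar`. It assumes the forms $r_\sigma(Z, \theta) = \theta + \eta \, E\, \rho_\sigma(Z - \theta)$ and $\rho_\sigma(u) = \rho(u/\sigma)$ with $\rho(u) = u \, \mathrm{atan}(u) - \log(1 + u^2)/2$, together with $\rho_0(u) = |u|$ and $\rho_\infty(u) = u^2$; these are inferred from the properties used in this appendix, since definitions (5) and (6) are not reproduced here. The function names are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def rho(u):
    # Assumed base function: rho' = atan and rho'' = 1 / (1 + u^2).
    return u * np.arctan(u) - 0.5 * np.log1p(u ** 2)

def rho_sigma(u, sigma):
    # Assumed scaling convention, with sigma = 0 and sigma = inf as
    # the absolute-value and squared special cases.
    if sigma == 0:
        return np.abs(u)
    if np.isinf(sigma):
        return u ** 2
    return rho(u / sigma)

def joint_risk(losses, theta, sigma, eta):
    # Empirical version of r_sigma(Z, theta) = theta + eta * E rho_sigma(Z - theta).
    return theta + eta * np.mean(rho_sigma(losses - theta, sigma))

def risk(losses, sigma, eta):
    # Empirical R_sigma: minimise the joint risk over theta (Brent's method).
    res = minimize_scalar(lambda t: joint_risk(losses, t, sigma, eta))
    return res.fun, res.x

losses = np.array([0.0, 1.0, 2.0, 3.0])
value_inf, theta_inf = risk(losses, np.inf, eta=1.0)   # mean-variance case
value_one, theta_one = risk(losses, 1.0, eta=1.0)      # needs eta > 2*sigma/pi
print(value_inf, theta_inf, value_one, theta_one)
```

Under these assumptions the $\sigma = \infty$ case has the closed form $\theta_Z = E Z - 1/(2\eta)$, which gives a simple check on the numerical minimizer. For finite $\sigma$, the solver output should satisfy the first-order condition $(\eta / \sigma) \, E \, \mathrm{atan}((Z - \theta)/\sigma) = 1$.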
Proof of Proposition 3. We take the points in the statement of the proposition in order, one at a time. To begin, the (joint) convexity of $r_\sigma$ follows from direct inspection, using the convexity of $\rho_\sigma$ for any $\sigma \in [0, \infty]$. With this fact in mind, note that the convexity of $R_\sigma$ can be checked easily as follows. For any $Z_1, Z_2 \in \mathcal{Z}$ and $\theta_1, \theta_2 \in \mathbb{R}$, the definition and convexity of $r_\sigma$ immediately implies that for any $\alpha \in (0, 1)$ we have
$$R_\sigma(\alpha Z_1 + (1-\alpha) Z_2) \leq r_\sigma(\alpha Z_1 + (1-\alpha) Z_2, \alpha \theta_1 + (1-\alpha) \theta_2) \leq \alpha \, r_\sigma(Z_1, \theta_1) + (1-\alpha) \, r_\sigma(Z_2, \theta_2).$$
Using the notation (and statement) of Proposition 1, we can set $\theta_1 = \theta_{Z_1}$ and $\theta_2 = \theta_{Z_2}$, and plugging this in to the above inequalities, we obtain
$$R_\sigma(\alpha Z_1 + (1-\alpha) Z_2) \leq \alpha R_\sigma(Z_1) + (1-\alpha) R_\sigma(Z_2),$$
and thus both $r_\sigma$ and $R_\sigma$ are convex for any $\sigma \in [0, \infty]$. Note that this does not require the minima $\theta_{Z_1}$ and $\theta_{Z_2}$ to be unique, and thus holds for the $\sigma = 0$ case without issue. From Lemma 11, we also have that so both functions are proper convex. As for continuity, note that from Lemma 10 and the continuity of $\rho_\sigma(\cdot)$ for all $\sigma \in [0, \infty]$, we can immediately infer that $r_\sigma$ is LSC. It is well-known that on Banach spaces, any proper convex LSC function is continuous and sub-differentiable on the interior of the effective domain.$^{31}$ Since our integrability assumptions imply $\mathrm{dom}\, r_\sigma = \mathcal{Z} \times \mathbb{R}$, the continuity and sub-differentiability of $r_\sigma$ is thus proved. To handle $R_\sigma$, take any sequence $(Z_k)$ converging to an arbitrarily chosen point $Z^{\ast} \in \mathcal{Z}$. Let $(\theta_k)$ be any sequence converging to $\theta_{Z^{\ast}} \in \mathbb{R}$. Then by definition of $R_\sigma$ and continuity of $r_\sigma$, we have
$$\limsup_{k \to \infty} R_\sigma(Z_k) \leq \limsup_{k \to \infty} r_\sigma(Z_k, \theta_k) = r_\sigma(Z^{\ast}, \theta_{Z^{\ast}}) = R_\sigma(Z^{\ast}). \tag{39}$$
The two ends of the inequality (39) imply that $R_\sigma$ is USC, via (12) and the relation of USC to LSC functions. On the effective domain of any convex USC function, the function is in fact continuous.$^{32}$ Thus, we have that $R_\sigma$ is continuous. Furthermore, the sub-differentiability of $R_\sigma$ follows in the exact same fashion as for $r_\sigma$.
Next, for the monotonicity of the location term Z → θ Z in (6), with 0 < σ ≤ ∞, recall that we can utilize the Leibniz properties ( 36)-( 37) from the integration Lemma 11.To start, we know that for any Z, the corresponding θ Z must satisfy the following first-order optimality condition: The desired monotonicity property is obvious for the σ = ∞ case using (40).As for the case of 0 < σ < ∞, it is evident from the second equality of (36) that the function θ → E µ ρ σ (Z − θ) is monotonically decreasing on R. Thus, if we assume Z 1 ≤ Z 2 almost surely but θ Z 1 > θ Z 2 , the first order optimality combined with monotonicity implies which is a contradiction.Thus, θ Z 1 ≤ θ Z 2 as desired for the 0 < σ < ∞ case as well.
The translation-equivariance property of Z → θ Z follows from direct inspection using the condition (40), that is for 0 < σ < ∞, we trivially have and thus r σ (Z + a, θ Z + a) = R σ (Z + a).Since Proposition 1 guarantees that the minimizer of θ → r σ (Z, θ) is unique, we can safely write θ Z+a = θ Z + a.The proof for the σ = ∞ case is analogous.
Finally, to prove that R σ is not in general monotonic, we give a concrete example of Z 1 and Z 2 in the case of σ = ∞, and for any η > 0 direct inspection shows that That is, the special case of σ = ∞ is equivalent to the mean-variance risk function of classical portfolio theory, dating back to Markowitz [28]. The random variables Z 1 and Z 2 are constructed as follows. Let c 1 and c 2 be the respective centers, w 1 and w 2 the respective widths, and v 1 < w 1 ² and v 2 < w 2 ² the respective scaling factors of Z 1 and Z 2 , which are characterized as for each j ∈ {1, 2}. Note that our assumptions imply 0 < P{Z j = c j } < 1, and direct inspection shows that E Z j = c j and var Z j = v j , again for each j ∈ {1, 2}. As a simple concrete example, note that setting c 2 = c 1 + w 1 + w 2 guarantees Z 1 ≤ Z 2 with probability 1. From the equality (41) given above, the difference in risks can be written as For concreteness, say for some ε > 0, we fix the variance factors to v 1 = w 1 ² − ε and v 2 = w 2 ² − ε respectively. Then the condition simplifies to w 1 ² > w 1 + w 2 + w 2 ². As an example, setting w 1 = 2 and w 2 = 1/2, the condition holds, implying R ∞ (Z 1 ) > R ∞ (Z 2 ), despite the fact that Z 1 ≤ Z 2 . This gives us a simple but intuitive example where monotonicity of R σ does not hold, and concludes the proof.
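The counterexample can be checked numerically with a short sketch. Two things here are assumptions rather than facts from the text: the random variables are taken to be symmetric three-point distributions on $\{c_j - w_j, c_j, c_j + w_j\}$ with mass $v_j / (2 w_j^2)$ on each outer point (one construction with mean $c_j$ and variance $v_j$), and the $\sigma = \infty$ risk is taken to be $E Z + \eta \, \mathrm{var}\, Z - 1/(4\eta)$ with $\eta = 1$. The additive constant cancels in the comparison, so only the mean-variance form matters.

```python
def three_point(c, w, v):
    # Assumed construction: values c - w, c, c + w, with probability
    # v / (2 w^2) on each outer point and the remaining mass at c.
    p = v / (2.0 * w * w)
    return [(c - w, p), (c, 1.0 - 2.0 * p), (c + w, p)]

def mean_var(dist):
    m = sum(z * p for z, p in dist)
    return m, sum((z - m) ** 2 * p for z, p in dist)

def mean_variance_risk(dist, eta=1.0):
    # Assumed closed form of the sigma = infinity risk.
    m, v = mean_var(dist)
    return m + eta * v - 1.0 / (4.0 * eta)

w1, w2, eps = 2.0, 0.5, 0.01
c1 = 0.0
c2 = c1 + w1 + w2          # ensures the supports are ordered
Z1 = three_point(c1, w1, w1 ** 2 - eps)
Z2 = three_point(c2, w2, w2 ** 2 - eps)

print("largest value of Z1 :", max(z for z, _ in Z1))
print("smallest value of Z2:", min(z for z, _ in Z2))
print("risk of Z1:", mean_variance_risk(Z1))
print("risk of Z2:", mean_variance_risk(Z2))
```

Every value of $Z_1$ is no larger than every value of $Z_2$, yet the wider spread of $Z_1$ gives it the larger risk, in line with the condition $w_1^2 > w_1 + w_2 + w_2^2$.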

C.2 Proofs for section 3
Recall that our basic probabilistic setup for the learning problem has an underlying probability space (Ω, F, µ), a hypothesis class H, and a random loss L(h) indexed by H.That is, we consider any F-measurable function L(h; •) : Ω → R as a loss.When a particular realization ω ∈ Ω is important, we will write L(h; ω), but otherwise, for readability we will typically write L(h) . .= L(h; •).Our basic integrability assumption, carried over from section 2, is that of square-µ-integrability, which in the context of losses is written explicitly as Loss-specific terminology To ensure our use of formal terms is clear, we apply the definitions of section B.1 to losses here.We shall typically suppress the dependence on ω ∈ Ω in directional derivatives and gradients, writing L r (h; g) . .= L r (h; g, •), L (h; g) . .= L (h; g, •), and L (h) . .= L (h; •).Let H ⊂ H be an open set.We say that L is radially differentiable at h ∈ H if the radial derivative L r (h; g) exists for all directions g ∈ H, µ-almost surely.We say that L is directionally differentiable at h ∈ H if the directional derivative L (h; g) exists for all directions g ∈ H, µ-almost surely.On this "good" event of probability 1, if the map g → L r (h; g) is linear and continuous, we say L is Gateaux differentiable at h, and if the map g → L (h; g) is linear and continuous, we say L is Hadamard differentiable at h.When we say that L is (Fréchet) differentiable at h ∈ H, we mean that there exists a function L (h)(•) : H → R + that is linear, continuous, and which satisfies (15) µ-almost surely. 33We say that

With the running assumption about second moments, this amounts to requiring
We say that L is weakly λ-smooth at h ∈ H if L is Gateaux differentiable and the map h → L r (h; •) is λ-Lipschitz µ-almost surely at h.That is, if for small enough δ > 0 we have Note that the norm used here is the operator norm applied to the linear map L r (h; •) − L r (h ; •).

C.2.1 Weak convexity of joint composition function
The joint risk function $r_\sigma(L(h), \theta)$ can be written as a simple composition $(h, \theta) \mapsto (L(h), \theta) \mapsto r_\sigma(L(h), \theta)$. For any $0 \leq \sigma < \infty$ and any smooth loss, using the preliminary results established in section B.3, it is straightforward to show the weak convexity of this composite function.
Proposition 12. Let the hypothesis class $\mathcal{H}$ be Banach. Let the loss $L$ be locally Lipschitz and weakly $\lambda$-smooth on $\mathcal{H}$. Then, for any $0 \leq \sigma < \infty$, defining a $\sigma$-dependent factor $\lambda_\sigma$ as
$$\lambda_\sigma := \begin{cases} 1, & \text{if } \sigma = 0, \\ \pi / (2\sigma), & \text{if } 0 < \sigma < \infty, \end{cases}$$
the function $(h, \theta) \mapsto r_\sigma(L(h), \theta)$ is $\gamma$-weakly convex with $\gamma = (1 + \eta \lambda_\sigma) \max\{1, \lambda\}$.

Proof. Recall the generic result given in Proposition 8 for the weak convexity of generic composite functions. Our proof here amounts to checking that the assumptions of Proposition 8 are satisfied for the composition $f = r_\sigma \circ f_1$, where $f_1(h, \theta) := (L(h), \theta)$. To start, let us consider the properties of $f_1$. Since $L$ is locally Lipschitz and Gateaux differentiable, it follows that $L$ is also Hadamard differentiable.$^{34}$ Since the map $h \mapsto L'_r(h; \cdot) = L'(h; \cdot)$ is continuous (by weak smoothness), it follows that $L$ is (Fréchet) differentiable.$^{35}$ Since $(h, \theta) \mapsto f_1(h, \theta) = (L(h), \theta)$ just passes $\theta$ through the identity, trivially the second component is also differentiable, and the differentiability of both components thus implies $f_1$ is differentiable.$^{36}$ Furthermore, the local Lipschitz property of $L$ is clearly retained by $f_1$. Evaluating the gradients we have $f'_1(h, \theta)(g, r) = (L'(h)(g), r)$, and thus using a typical product space norm we have By weak smoothness of $L$, it follows that $f_1$ is $\max\{1, \lambda\}$-smooth $\mu$-almost surely, where smoothness is in the sense defined in section B.3.
Next, let us look at properties of r σ .Note that ρ σ (•) is trivially 1-Lipschitz in the case of σ = 0, and for 0 < σ < ∞, the π/2-Lipschitz property of ρ defined in (3) implies that ρ σ (•) is π/(2σ)-Lipschitz.With these basic facts in place, it follows that where the σ-dependent Lipschitz coefficient λ σ is as defined in the statement of the desired result.To obtain bounds in terms of the correct norm, note that which follows from the fact that µ is a probability, and a simple application of Hölder's inequality. 37Plugging this into the previous inequality, and noting that it holds for any choice of Z 1 , Z 2 ∈ L 2 (Ω, F, µ) and θ 1 , θ 2 ∈ R, it follows that r σ is (1 + ηλ σ )-Lipschitz on L 2 (Ω, F, µ) × R, and f 1 (H × R) ⊂ dom f 2 .Furthermore, from Proposition 3, we have that r σ is convex.
Taking the above points together, if we consider the good event of probability 1 where f 1 satisfies the desired properties, direct application of Proposition 8 to the map (h, θ) → (r σ •f 1 )(h, θ) yields the desired result.
Remark 13.The result in the preceding Proposition 12 is rather useful, and it does not require the loss to be convex.When the loss is convex, the analysis becomes somewhat simpler and stronger arguments are naturally possible; composite risks under convex losses and convex, monotonic risk functions is the setting considered by Ruszczyński and Shapiro [40, Sec.3.2], for example.
We have established conditions under which the intermediate joint objective r σ (L(h), θ) is weakly convex, and characterized this weak convexity with respect to properties of the underlying risk function and data distribution.Since the data distribution µ is unknown, we can never actually compute r σ (L(h), θ).Any learning algorithm will only have access to feedback of a stochastic nature which provides incomplete, noisy information.Our next task is to establish conditions under which the feedback available to the learner is "good enough" to ensure reasonable performance guarantees.

C.2.2 Unbiased stochastic feedback
In considering stochastic feedback, recall that $r_\sigma(L(h), \theta) = E_\mu (f_2 \circ F_1)(h, \theta)$, with $F_1$ and $f_2$ given by (7) in the main text. For each $h \in \mathcal{H}$ and $\theta \in \mathbb{R}$, the value $F_1(h, \theta)$ returned by $F_1$ is a random vector. We shall assume that for any $h \in \mathcal{H}$, the learner can obtain independent random samples of the loss $L(h)$ and the associated gradient $L'(h)$. Since $F'_1(h, \theta) = (L'(h), 1)$, by which we mean $F'_1(h, \theta)(g, r) = (L'(h)(g), r)$ for all $g \in \mathcal{H}$ and $r \in \mathbb{R}$, clearly the learner can also independently sample from $F_1(h, \theta)$ and $F'_1(h, \theta)$. Sub-differentiability is already guaranteed by Proposition 3, and since $\rho_\sigma$ and $\eta$ are by design known to the learner, they can readily acquire an element from $\partial f_2(u, \theta)$. Thus if $(L(h), \theta) \in \mathrm{int}(\mathrm{dom}\, f_2) = \mathbb{R}^2$ $\mu$-almost surely, it follows that the learner can sample from $\partial f_2(L(h), \theta) \circ F'_1(h, \theta)$. This is the stochastic feedback available to the learner, and when we ask that it be "good enough," this means we require it to be an unbiased estimator of the (Clarke) sub-differential of $r_\sigma$. The following result gives mild conditions under which this is achieved.

Proposition 14. Under the conditions of Proposition 12, for any $h \in \mathcal{H}$ and $\theta \in \mathbb{R}$, as long as and this holds for any choice of $0 \leq \sigma < \infty$.
Proof. Using the weak smoothness of $L$, with probability 1, the map Since $F_1$ and $f_2$ are (locally) Lipschitz, the facts we have just laid out imply a strong chain rule.$^{39}$ That is, it holds $\mu$-almost surely that where the second equality follows from the convexity of $f_2$.
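The stochastic feedback described above can be sketched for a single sample as follows. This is an illustration under stated assumptions: the per-sample objective is taken to be $f_2(u, \theta) = \theta + \eta \, \rho_\sigma(u - \theta)$ with $\rho_\sigma(u) = \rho(u/\sigma)$ and $\rho' = \mathrm{atan}$ for $0 < \sigma < \infty$ (so the sub-differential is a singleton), the loss is a scalar squared error, and all function names are hypothetical. The chain rule then gives the pair $(\eta \, \rho'_\sigma(L - \theta) \, L'(h), \; 1 - \eta \, \rho'_\sigma(L - \theta))$, which the code compares against finite differences.

```python
import math

def rho_sigma_prime(u, sigma):
    # Derivative of u -> rho(u / sigma) under the assumption rho' = atan.
    return math.atan(u / sigma) / sigma

def sample_objective(h, theta, x, y, sigma, eta):
    # Per-sample value f_2(L(h), theta) = theta + eta * rho_sigma(L(h) - theta)
    # for the scalar squared-error loss L(h) = (h*x - y)^2 (illustrative).
    loss = (h * x - y) ** 2
    u = (loss - theta) / sigma
    return theta + eta * (u * math.atan(u) - 0.5 * math.log1p(u * u))

def stochastic_gradient(h, theta, x, y, sigma, eta):
    # Chain rule: gradient of f_2 at (L(h), theta) composed with (L'(h), 1).
    loss = (h * x - y) ** 2
    loss_grad = 2.0 * (h * x - y) * x
    slope = rho_sigma_prime(loss - theta, sigma)
    return eta * slope * loss_grad, 1.0 - eta * slope

h, theta, x, y, sigma, eta = 0.3, 0.5, 1.2, -0.7, 1.0, 1.0
g_h, g_theta = stochastic_gradient(h, theta, x, y, sigma, eta)

# Central finite differences as an independent check.
step = 1e-6
fd_h = (sample_objective(h + step, theta, x, y, sigma, eta)
        - sample_objective(h - step, theta, x, y, sigma, eta)) / (2 * step)
fd_theta = (sample_objective(h, theta + step, x, y, sigma, eta)
            - sample_objective(h, theta - step, x, y, sigma, eta)) / (2 * step)
print(g_h, fd_h, g_theta, fd_theta)
```

Averaging such per-sample gradients over independent draws of $(x, y)$ is what the unbiasedness property refers to; the sketch only verifies the per-sample chain rule, not the interchange of expectation and differentiation.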
Remark 15. The validity of interchanging the operations of (sub-)differentiation and expectation is a topic of fundamental importance in stochastic optimization and statistical learning theory. A useful, modern reference on this topic is included in Ruszczyński and Shapiro [39, Ch. 2]. A classical reference is Rockafellar and Wets [37]; see also Rockafellar [35] for a look at measurability of convex integrands. The interchangeability problem appears in various places in the literature over the years, see for example Shapiro [45].

C.2.3 Proximity to a nearly-stationary point
Proof of Theorem 5. With all the results established thus far, this proof has just two simple parts. First, we need to show that the objective function of interest is weakly convex, and that we have access to unbiased estimates of the sub-differential; this is done using the critical preparatory results in Propositions 12 and 14. Once this has been established, the remaining part just has us applying recent results from the literature for non-asymptotic control of the envelope gradient norm.
To begin, the assumptions of Proposition 12 are satisfied by A1, which ensures that r σ is γ-weakly convex for γ = (1 + ηλ σ ) max{1, λ}.Furthermore, the µ-integrability assumption on L (h) lets us use Proposition 14 to ensure that feedback drawn from ( 9) is such that for all t, since Algorithm 1 uses G t sampled via (9).
The desired result follows from an application of Davis and Drusvyatskiy [12, Thm. 3.1], where their objective function $f$ corresponds to our $r_\sigma$.$^{42}$ While their proof is given for the case of $\mathcal{H} = \mathbb{R}^d$, using assumption A2, if we leverage our characterization of weak convexity (Proposition 7), and replace their Lemma 2.2 with our (32), it is straightforward to see that their insights extend to arbitrary Hilbert spaces using the usual norm induced by the inner product. Thus with the moment bound A4 in hand, the generalized result can be applied to Algorithm 1, for objective function $r_\sigma$, which has just been proved to be $\gamma$-weakly convex. The desired result follows immediately.

D Helper results
In this section, we provide some standard results that are leveraged in the main paper.

D.1 Useful results based on Lipschitz properties
Let X be a normed linear space, and let f : X → R be convex and λ-Lipschitz.If f is sub-differentiable at a point x, then using the definition of the sub-differential, we have that That is, all sub-gradients of f at x have norm no greater than the Lipschitz coefficient λ.
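Let $\mathcal{X}$ and $\mathcal{Y}$ be Banach spaces, and let $f : \mathcal{X} \to \mathcal{Y}$ be differentiable on $U \subset \mathcal{X}$, an open set. Further, assume that the derivative is $\lambda$-Lipschitz on $U$, that is, for each $x, x' \in U$, we have $\|f'(x) - f'(x')\| \leq \lambda \|x - x'\|$. First-order Taylor approximations have direct analogues in this general setting, as the following result shows.$^{43}$

Proposition 16. Let $f : \mathcal{X} \to \mathcal{Y}$ be differentiable on an open set $U \subset \mathcal{X}$, with $\mathcal{X}$ and $\mathcal{Y}$ assumed to be Banach. If $f'(\cdot)$ is $\lambda$-Lipschitz on $U$, then for any $x, u \in U$ such that $x + u \in U$, we have
$$\|f(x + u) - f(x) - f'(x)(u)\| \leq \frac{\lambda}{2} \|u\|^2.$$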

D.2 Radial derivatives of convex functions
Say a function $f : \mathcal{V} \to \mathbb{R}$ is convex. Take any $u, v \in \mathrm{dom}\, f$, and any scalar $c \geq 0$ such that Filling in definitions and rearranging we have Note that this can be done for any pair of $u, v$ and scalar $c$ that keeps the relevant points on the domain. Clearly this property is necessary for convexity, but it is in fact also sufficient.$^{44}$ For any function $f : \mathcal{V} \to \mathbb{R}$ and open set $U \subset \mathcal{X}$, fix a point $x \in U$. We denote the difference quotient of $f$ at $x$, incremented in the direction $u$, modulated by scalar $\alpha \neq 0$ as
$$q(\alpha) := q(\alpha; f, x, u) := \frac{f(x + \alpha u) - f(x)}{\alpha}.$$
Consider the map g(t) . .= f (x + tu) − f (x), with all elements but t ≥ 0 fixed.When f is convex, direct inspection immediately shows that t → g(t) is convex.For any 0 ≤ t 1 < t 2 , take some t ∈ (t 1 , t 2 ).Clearly, there exists a β ∈ (0, 1) such that t = βt 1 + (1 − β)t 2 .Then, we have where the inequality follows from convexity of g.If we use this inequality in the special case of t 1 = 0, alongside the basic relation q(α) = (g(α) − g(0))/α, it immediately follows that α → q(α) is monotonic (non-decreasing) on the positive reals.Furthermore, the set {q(α) : α > 0} is bounded below.To see this, take some γ > 0 small enough that x − γu ∈ dom f , and note that by direct application of convexity and the basic property (48), it follows that That is, dividing both sides by α, we have Since the choice of γ > 0 depends only on x and u, and is free of α, it follows that the set {q(α) : α > 0} is bounded below, as desired.Using this boundedness alongside the monotonicity of α → q(α), we have that the infimum is finite.Thus, recalling the definition (13) of the radial derivative of f at x in the direction u, since we have it follows immediately that the radial derivative always exists (i.e., f (x; u) ∈ R).Note also that using convexity, direct inspection shows that for all u we have Furthermore, it is easily verified that whenever x ∈ dom f , the map u → f (x; u) is sub-additive and positively homogeneous, i.e., a sub-linear functional. 45The basic facts of interest here are summarized in the following proposition.
Proposition 17 (Difference quotients for convex functions).Let V be a vector space.If function f : V → R is proper and convex, then it is radially differentiable on int(dom f ).
Proof.The desired result follows immediately from previous discussion leading up to (51), and the fact that if x is an interior point of the effective domain of f , it follows that for any u ∈ V, we can find a γ > 0 small enough that x − γu ∈ dom f , which means we can apply the lower bound of (50) to the difference quotients q(α; f, x, u) indexed by α > 0.
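The monotonicity and lower-boundedness of the difference quotients can be seen in a small numerical sketch. The convex function $f(x) = e^x + |x|$, the base point $x = 0$, and the direction $u = 1$ are illustrative choices, not taken from the text; at this point $f$ is not differentiable, but the one-sided limit of the quotients still exists.

```python
import math

def difference_quotient(f, x, u, alpha):
    # q(alpha; f, x, u) = (f(x + alpha * u) - f(x)) / alpha for alpha != 0.
    return (f(x + alpha * u) - f(x)) / alpha

# Illustrative convex function with a kink at the base point x = 0.
f = lambda x: math.exp(x) + abs(x)

alphas = [2.0 ** (-k) for k in range(20)]          # decreasing towards 0
quotients = [difference_quotient(f, 0.0, 1.0, a) for a in alphas]

# Lower bound from convexity, using the point x - gamma * u with gamma = 1.
gamma = 1.0
lower_bound = (f(0.0) - f(0.0 - gamma * 1.0)) / gamma

for a, q in zip(alphas[:5], quotients[:5]):
    print(f"alpha = {a:.4f}, q(alpha) = {q:.6f}")
print("last quotient:", quotients[-1], "lower bound:", lower_bound)
```

As $\alpha$ decreases the quotients decrease and stay above the bound, settling near 2, which is the radial derivative of this $f$ at 0 in the direction $u = 1$ (the corresponding value in the direction $u = -1$ would be 0, so the map $u \mapsto f'_r(0; u)$ is sub-linear but not linear here).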

D.3 Loss example
Example 18. While stated with a somewhat high degree of abstraction, let us give a concrete example to emphasize that the assumptions of Proposition 12 are readily satisfied under natural and important learning settings. Consider the regression problem, where we observe random pairs (X, Y ) ∼ µ, assuming that X is a finite-dimensional real-valued random vector, and Y is a real-valued random variable, related to the inputs by the relation Y = h * (X) + ε, where ε is a zero-mean random noise term. For simplicity, let h * be a continuous linear map, and let H be the set of all continuous linear maps on the space that X is distributed over. Finally, let the loss be the squared error, such that L(h) = (Y − h(X))². Since we make almost no assumptions on the nature of the underlying noise distribution, clearly both the losses and the "gradients" can be unbounded and heavy-tailed. Fix any h 0 ∈ H, and note that for any h ∈ H, we have Absolute values can be bounded above as It follows immediately that as long as E µ ‖X‖⁴ < ∞, we have that the local Lipschitz property (42) of the loss is satisfied, for arbitrary choice of h 0 .
As for the weak smoothness requirement on the loss, note that Thus, if the random inputs X are µ-almost surely bounded, the desired smoothness condition (43) holds.Note that this does not preclude heavy-tailed losses and gradients since no additional assumptions have been made on the noise term.
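The smoothness claim in this example can be checked directly. For the squared error with linear $h$, the gradient difference is $L'(h) - L'(h') = 2 \langle h - h', X \rangle X$, so its norm is at most $2 \|X\|^2 \|h - h'\|$ regardless of the noise. The sketch below verifies this bound on simulated data; the dimension, the uniform inputs, and the Pareto-type heavy-tailed responses are illustrative assumptions only.

```python
import math
import random

def grad_sq_loss(h, x, y):
    # Gradient of L(h) = (<h, x> - y)^2 with respect to h.
    resid = sum(hi * xi for hi, xi in zip(h, x)) - y
    return [2.0 * resid * xi for xi in x]

def norm(v):
    return math.sqrt(sum(vi * vi for vi in v))

random.seed(0)
d = 3
checks = []
for _ in range(1000):
    x = [random.uniform(-1.0, 1.0) for _ in range(d)]            # bounded inputs
    y = random.paretovariate(1.5) * random.choice([-1.0, 1.0])   # heavy tails
    h = [random.gauss(0.0, 1.0) for _ in range(d)]
    g = [random.gauss(0.0, 1.0) for _ in range(d)]
    diff = [a - b for a, b in zip(grad_sq_loss(h, x, y), grad_sq_loss(g, x, y))]
    bound = 2.0 * norm(x) ** 2 * norm([a - b for a, b in zip(h, g)])
    # Small slack absorbs floating-point rounding.
    checks.append(norm(diff) <= bound * (1.0 + 1e-9) + 1e-9)

print("bound holds on all samples:", all(checks))
```

Because the response $Y$ cancels in the gradient difference, bounded inputs alone give a finite smoothness coefficient (here at most $2d$ for inputs in $[-1, 1]^d$), while the losses and gradients themselves remain heavy-tailed.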

E Empirical supplement
Due to limited space, we could only include key details and a few representative results in section 4 of the main text. Here we fill in those additional details. To begin, all our numerical experiments have been implemented entirely in Python (v. 3.8) using the following additional open-source software: Jupyter notebook (for interactive demos),$^{46}$ matplotlib (v. 3.4.1, for all visuals),$^{47}$ PyTables (v. 3.6.1, for dataset handling),$^{48}$ (v. 1.20.0, for almost all computations),$^{49}$ and SciPy (v. 1.6.2, for random variable statistics and special functions). In the following paragraphs, we provide information about the benchmark datasets used, as well as several figures including additional experimental results.

Dataset description
The real-world benchmark datasets used in our classification tests are as follows: adult,$^{50}$ australian,$^{51}$ cifar10,$^{52}$ cod_rna,$^{53}$ covtype,$^{54}$ emnist_balanced,$^{55}$ fashion_mnist,$^{56}$ and mnist.$^{57}$ See Table 1 for a summary. Further background on all datasets is available at the URLs provided in the footnotes. Dataset size reflects the size after removal of instances with missing values, where applicable. For all datasets with categorical features, the "input features" given in Table 1 represents the number of features after doing a one-hot encoding of all such features. The "model dimension" is just the product of the number of input features and the number of classes, since we are using multi-class logistic regression (one linear model for each class). All categorical features are given a one-hot representation. All input features are standardized to take values on the unit interval [0, 1]. As mentioned in the main text, we have prepared a GitHub repository that includes code for both re-creating the empirical tests and pre-processing the data, that will be made public following the review phase.

Additional results
In Figures 6-8, we give additional results that complement Figure 3 in the main text. The trends in terms of the histograms of test loss distributions are essentially uniform across this wide variety of datasets. We also see that a sharply-concentrated logistic (test) loss tends to correlate with better classification error (average zero-one error), with cifar10 being the only exception to this trend. As another point not raised in the main text, intuitively we would hope that Algorithm 1 performs well in terms of the risk $R_\sigma$ corresponding to its particular $\sigma$ setting; we have found this to be true across the benchmark datasets studied here. See Figure 9 for an example from the adult dataset. Moving from top to bottom, the order of colors shows a rather clear reversal. Very similar trends can be observed on the other datasets as well. In estimating $R_\sigma$ on the test set, we use an empirical mean estimate of $r_\sigma$, and then minimize with respect to $\theta$ using the minimize_scalar function of the SciPy (v. 1.6.2) optimize module.

Figure 1: A simple toy example using $L(h) = h \, L_{\text{wide}} + (1 - h) \, L_{\text{thin}}$. Trajectories shown are the sequence $(h_t)$ generated by running (8) on $\mathbb{R}^2$, with $h_0 = 0.5$ and $\theta_0 = 0.5$, averaged over all trials. Densities of $L_{\text{wide}}$ (red) and $L_{\text{thin}}$ (blue) are also plotted, with additional details in the main text.

Figure 6: Additional average classification error trajectories over epochs.

Figure 9: Empirical mean estimates of $R_\sigma$ for a variety of $\sigma \in [0, \infty]$ settings. The colored curves correspond to different $\sigma$ settings in running Algorithm 1, just as in previous plots, whereas the distinct plots correspond to the different $\sigma$ used in evaluation.

step sizes $(\alpha_t)$, and max iterations $n$.

), we consider the case in which $\mathcal{H}$ is any Hilbert space. All Hilbert spaces are reflexive Banach spaces, and the stochastic sub-gradient $G_t \in (\mathcal{H} \times \mathbb{R})^{\ast}$ (the dual of $\mathcal{H} \times \mathbb{R}$) can be uniquely identified with an element of $\mathcal{H} \times \mathbb{R}$, for which we use the same notation $G_t$. Denoting the partial sequence G[t]

9. We note that Proposition 8 extends a result of Drusvyatskiy and Paquette [13, Lem. 4.2] from the case where $\mathcal{X}$ and $\mathcal{Y}$ are finite-dimensional Euclidean spaces, to the general Banach space setting considered here. For the classical case of Euclidean spaces, exact chain rules are well-known [38, Ch. 10.B].

Table 1: A summary of the benchmark datasets used for performance evaluation.