1 Introduction

In machine learning, the important yet ambiguous notion of “good off-sample generalization” (or “small test error”) is typically formalized in terms of minimizing the expected value of a random loss \({{\,\mathrm{\mathbf {E}}\,}}_{\mu }{{\,\mathrm{L}\,}}(h)\), where h is a candidate decision rule and \({{\,\mathrm{L}\,}}(h)\) is a random variable on an underlying probability space \((\Omega ,\mathcal {F},\mu )\). This setup based on average off-sample performance has been famously called the “general setting of the learning problem” by Vapnik (1999), and is central to the decision-theoretic formulation of learning in the influential work of Haussler (1992). This is by no means a purely theoretical concern; when average performance dictates the ultimate objective of learning, the data-driven feedback used for training in practice will naturally be designed to prioritize average performance (Bottou et al. 2016; Johnson and Zhang 2014; Le Roux et al. 2013; Shalev-Shwartz and Zhang 2013). Take the default optimizers in popular software libraries such as PyTorch or TensorFlow; virtually without exception, these methods amount to efficient implementations of empirical risk minimization. While the minimal expected loss formulation is clearly an intuitive choice, the tacit emphasis on average performance represents an important and non-trivial value judgment, which may or may not be appropriate for any given real-world learning task.

1.1 Our basic motivation

Put simply, our goal is to develop learning algorithms that make risk function design an explicit part of the modern machine learning workflow, replacing the tacit emphasis on average performance, to the exclusion of all other properties of the (test) loss distribution, that is inherent in traditional empirical risk minimization (ERM). More concretely, we want to design a transformation that can be applied to any loss function, which is flexible enough to express varying degrees of sensitivity to loss deviations and tail behavior, easy to implement using existing stochastic gradient-based frameworks (e.g., PyTorch), and ultimately leads to novel risks which are analytically tractable. In the following paragraphs, we discuss the basic ideas underlying our approach at a high level, and touch on some of the key related literature. After some basic theoretical setup, we discuss the strengths and drawbacks of our proposed approach (relative to the existing literature) in more detail in Sect. 2.3.

1.2 Some statistical context

In order to make the value judgment of “how to define risk” an explicit part of the machine learning methodology, in this work we consider a generalized class of risk functions, designed to give the user much greater flexibility in terms of how they choose to evaluate performance, while still allowing for theoretical performance guarantees. One core statistical concept is that of the M-location of the loss distribution under a candidate h, defined by

$$\begin{aligned} {{\,\mathrm{M}\,}}(h) :=\mathop {\mathrm{arg\,min}}\limits _{\theta \in \mathbb {R}} \, {{\,\mathrm{\mathbf {E}}\,}}_{\mu }\rho \left( \frac{{{\,\mathrm{L}\,}}(h)-\theta }{\sigma }\right) . \end{aligned}$$
(1)

Here \(\rho :\mathbb {R}\rightarrow [0,\infty )\) is a function controlling how we measure deviations, and \(\sigma > 0\) is a scaling parameter. Since the loss distribution \(\mu \) is unknown, clearly \({{\,\mathrm{M}\,}}(h)\) is an ideal, unobservable quantity. If we replace \(\mu \) with the empirical distribution induced by a sample \(({{\,\mathrm{L}\,}}_1,\ldots ,{{\,\mathrm{L}\,}}_n)\), then for certain special cases of \(\rho \) we get an M-estimator of the location of \({{\,\mathrm{L}\,}}(h)\), a classical notion dating back to Huber (1964), which justifies our naming. Ignoring integrability concerns for the moment, note that in the special case of \(\rho (u) = u^{2}\), we get the classical risk \({{\,\mathrm{M}\,}}(h) = {{\,\mathrm{\mathbf {E}}\,}}_{\mu }{{\,\mathrm{L}\,}}(h)\), and in the case of \(\rho (u) = |u|\), we get \({{\,\mathrm{M}\,}}(h) = \inf \{u: \mu \{ {{\,\mathrm{L}\,}}(h) \le u \} \ge 0.5 \}\), namely the median of the loss distribution. This rich spectrum of evaluation metrics makes the notion of casting learning problems in terms of minimization of M-locations (via their corresponding M-estimators) conceptually very appealing. However, while the statistical properties of the minima of M-estimators in special cases are understood (Brownlees et al. 2015), the optimization involved is both costly and difficult, making the task of designing and studying \({{\,\mathrm{M}\,}}(\cdot )\)-minimizing learning algorithms highly intractable. With these issues in mind, we study a closely related alternative which retains the conceptual appeal of raw M-locations, but is more computationally congenial.
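As a concrete illustration of these special cases, the short sketch below (our own, with hypothetical names, and not part of the formal development) computes an empirical M-location by direct scalar minimization; with \(\rho (u) = u^{2}\) it recovers the sample mean, and with \(\rho (u) = |u|\) a sample median.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def m_location(losses, rho, sigma=1.0):
    # Empirical analogue of (1): minimize the average of rho((L_i - theta) / sigma).
    objective = lambda theta: np.mean(rho((losses - theta) / sigma))
    # The minimizer lies within the range of the sample, so a bounded solver suffices.
    return minimize_scalar(objective, bounds=(losses.min(), losses.max()),
                           method="bounded").x

rng = np.random.default_rng(0)
losses = rng.lognormal(size=1000)  # a skewed, heavy-tailed stand-in for L(h)

print(m_location(losses, lambda u: u ** 2), losses.mean())  # approximately equal
print(m_location(losses, np.abs), np.median(losses))        # approximately equal
```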

1.3 Our approach

With \(\sigma \) and \(\rho \) as before, our generalized risks will be defined implicitly by

$$\begin{aligned} \min _{\theta \in \mathbb {R}} \left[ \theta + \eta {{\,\mathrm{\mathbf {E}}\,}}_{\mu }\rho \left( \frac{{{\,\mathrm{L}\,}}(h)-\theta }{\sigma }\right) \right] , \end{aligned}$$
(2)

where \(\eta > 0\) is a weighting parameter that controls the balance of priority between location and deviation. A more formal definition will be given in Sect. 2 (see Eqs. 3–5), including concrete forms for \(\rho (\cdot )\) that are conducive to both fast computation and meaningful learning guarantees. In addition, we will show (cf. Proposition 3) that this new objective can be written as

$$\begin{aligned} {[}{{\,\mathrm{M}\,}}(h) - c_{{{\,\mathrm{M}\,}}}] + \eta {{\,\mathrm{\mathbf {E}}\,}}_{\mu }\rho \left( \frac{{{\,\mathrm{L}\,}}(h)-[{{\,\mathrm{M}\,}}(h) - c_{{{\,\mathrm{M}\,}}}]}{\sigma }\right) , \end{aligned}$$

where the shift term \(c_{{{\,\mathrm{M}\,}}} > 0\) can be simply characterized by the equality

$$\begin{aligned} {{\,\mathrm{\mathbf {E}}\,}}_{\mu }\rho ^{\prime }\left( \frac{{{\,\mathrm{L}\,}}(h)-[{{\,\mathrm{M}\,}}(h) - c_{{{\,\mathrm{M}\,}}}]}{\sigma }\right) = \frac{\sigma }{\eta }, \end{aligned}$$

noting that \(c_{{{\,\mathrm{M}\,}}} \rightarrow 0_{+}\) as \(\eta \rightarrow \infty \). By utilizing smoothness properties of loss functions typical to machine learning problems (e.g., squared error, cross-entropy, etc.), even though the generalized risks need not be convex, they can be shown to satisfy weak notions of convexity, which still admit finite-sample guarantees of near-stationarity for stochastic gradient-based learning algorithms (details in Sect. 3). This approach has the additional benefit that implementation only requires a single wrapper around any given loss, which can be set prior to training, making for easy integration with frameworks such as PyTorch and TensorFlow, while incurring negligible computational overhead.
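To preview what such a wrapper might look like, below is a minimal PyTorch-style sketch; the class name and default values are our own hypothetical choices, and the deviation function anticipates the concrete \(\rho \) defined in (3) of Sect. 2. The threshold \(\theta \) is treated as one extra trainable scalar, so the minimization over \(\theta \) in (2) happens jointly with ordinary training.

```python
import torch

def rho(u):
    # The concrete rho from (3) below: approximately quadratic near zero,
    # with derivative atan(u), hence Lipschitz in the tails.
    return u * torch.atan(u) - 0.5 * torch.log1p(u ** 2)

class RiskWrapper(torch.nn.Module):
    """Hypothetical wrapper implementing the objective (2) around any loss."""

    def __init__(self, sigma=1.0, eta=1.0):
        super().__init__()
        self.sigma, self.eta = sigma, eta
        self.theta = torch.nn.Parameter(torch.zeros(()))  # trainable threshold

    def forward(self, per_example_losses):
        dev = rho((per_example_losses - self.theta) / self.sigma)
        return self.theta + self.eta * dev.mean()

# Usage sketch: wrap the usual per-example loss before calling backward().
#   risk = RiskWrapper(sigma=2.0, eta=1.0)
#   opt = torch.optim.SGD(list(model.parameters()) + list(risk.parameters()), lr=0.01)
#   objective = risk(loss_fn(model(x), y))  # loss_fn with reduction="none"
#   objective.backward(); opt.step()
```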

1.4 Our contributions

The key contribution here is a new concrete class of risk functions, defined and analyzed in Sect. 2. These risks are statistically easy to interpret, their empirical counterparts are simple to implement in practice, and as we prove in Sect. 3, their design allows for standard stochastic gradient-based algorithms to be given competitive finite-sample excess risk guarantees. We also verify empirically (Sect. 4) that the proposed feedback generation scheme has a demonstrable effect on the test distribution of the loss, which as a side-effect can be easily leveraged to outperform traditional ERM implementations, a result which is in line with early insights from Breiman (1999) and Reyzin and Schapire (2006) regarding the impact of the loss distribution on generalization. More broadly, the choice of which risk to use plays a central role in pursuit of increased transparency in machine learning, and our results represent a first step towards formalizing this process.

1.5 Relation to existing literature

With respect to alternative notions of “risk” in machine learning, perhaps the most salient example is conditional value-at-risk (CVaR) (Prashanth et al. 2020; Holland and Haress 2021a), namely the expected loss conditioned on it exceeding a quantile at a pre-specified probability level. CVaR allows one to encode a sensitivity to extremely large losses, and admits convexity when the underlying loss is convex, though the conditioning often leaves the effective sample size very small. Other notions such as spectral risk and cumulative prospect theory (CPT) scores have also been considered (Bhat and Prashanth 2020; Holland and Haress 2021b; Khim et al. 2020; Leqi et al. 2019), but the technical difficulties involved with computation and analysis arguably outweigh the conceptual benefits of learning using such scores. These proposals can all be interpreted as “location” parameters of the underlying loss distribution; our risks take the form of a sum of a location and a deviation term, where the location is a shifted M-location, as described above. The basic notion of combining location and deviation information in evaluation is a familiar concept; the mean-variance objective \({{\,\mathrm{\mathbf {E}}\,}}_{\mu }{{\,\mathrm{L}\,}}(\cdot )+{{\,\mathrm{var}\,}}_{\mu }{{\,\mathrm{L}\,}}(\cdot )\) dates back to classical work by Markowitz (1952); our proposed class includes this as a special case, but generalizes far beyond it. Mean-variance and other risk function classes are studied by Ruszczyński and Shapiro (2006a, 2006b), who give minimizers a useful dual characterization, though our proposed class is not captured by their work (see also Remark 4). We note also that the recent (and independent) work of Lee et al. (2020) considers a form which is similar to (2) in the context of empirical risk minimizers; the critical technical difference is that their formulation is restricted to \(\rho \) which is monotonic, an assumption which enforces convexity. The special case of mean-variance is also treated in depth in more recent work by Duchi and Namkoong (2019), who consider stochastic learning algorithms for doing empirical risk minimization with variance-based regularization. Finally, we note that our technical analysis in Sect. 3 makes crucial use of weak convexity properties of function compositions, an area of active research in recent years (Duchi and Ruan 2018; Drusvyatskiy and Paquette 2019; Davis and Drusvyatskiy 2019). Since our proposed objective can be naturally cast as a composition taking us from parameter space to a Banach space and finally to \(\mathbb {R}\), leveraging the insights of these previous works, we extend the existing machinery to handle learning over Banach spaces, and give finite-sample guarantees for arbitrary Hilbert spaces. More details are provided in Sect. 3, plus the appendix.

1.6 Notation and terminology

To give the reader an approximate idea of the technical level of this paper, we assume some familiarity with probability spaces, the notion of sub-gradients and the sub-differential of convex functions, as well as special classes of vector spaces like Banach and Hilbert spaces, although the main text is written with a wide audience in mind. Strictly speaking, we will also deal with sub-differentials of non-convex functions, but these technical concepts are relegated to the appendix, where all formal proofs are given. In the main text, to improve readability, we write \(\partial f(x)\) to denote the sub-differential of f at x, regardless of the convexity of f. When we refer to a function being \(\lambda \)-smooth, this refers to its gradient being \(\lambda \)-Lipschitz continuous, and weak smoothness just requires such continuity on directional derivatives; all these concepts are given a detailed introduction in the appendix. Throughout this paper, we use \({{\,\mathrm{\mathbf {E}}\,}}[\cdot ]\) for taking expectation, and \({{\,\mathrm{\mathbf {P}}\,}}\) as a general-purpose probability function. For indexing, we will write \([k] :=\{1,\ldots ,k\}\). Distance of a vector v from a set A will be denoted by \({{\,\mathrm{dist}\,}}(v;A) :=\inf \{\Vert v-v^{\prime }\Vert : v^{\prime } \in A \}\).

2 A concrete class of risk functions

Fig. 1 Left: graphs of \(\rho \) defined in (3) (solid line), \(\rho ^{\prime }\) (dashed line), and \(\rho ^{\prime \prime }\) (dot-dash line). Middle and right: graphs of \(\theta \mapsto \eta \rho _{\sigma }(1.0-\theta )\), with \(\rho _{\sigma }\) defined in (4), where the minimizer is \(\theta _{\min }=1.0\), the colors correspond to different \(\sigma \) values, and \(\eta \) is set following Remark 2: for \(\sigma = 0\), \(\eta = 1.05\); for \(0< \sigma < 1.0\), \(\eta = \sigma /{{\,\mathrm{atan}\,}}(\infty )\); for \(1.0 \le \sigma < \infty \), \(\eta = 2\sigma ^{2}\); and for \(\sigma = \infty \), \(\eta = 1.0\)

The risks described by (2) are fairly intuitive as-is, but a bit more structure is needed to ensure they are well-defined and useful in practice. In this section, we introduce a concrete choice of \(\rho \) for measuring deviations, consider a risk class modulated via re-scaling, establish basic theoretical properties and discuss both the utility and limitations of the proposed risk class.

2.1 Deviations and re-scaling

To make things more concrete, let us fix \(\rho \) as

$$\begin{aligned} \rho (u) :=u {{\,\mathrm{atan}\,}}(u) - \frac{\log (1+u^2)}{2}, \quad u \in \mathbb {R}. \end{aligned}$$
(3)

This function is handy in that it behaves approximately quadratically around zero, and since \(\rho ^{\prime }(u) = {{\,\mathrm{atan}\,}}(u)\), it is both \(\pi /2\)-Lipschitz and strictly convex on the real line.Footnote 1 Fixing this particular choice of \(\rho \) and letting Z be a random variable (any \(\mathcal {F}\)-measurable function), we interpolate between mean- and median-centric quantities via the following class of functions, indexed by \(\sigma \in [0,\infty ]\):

$$\begin{aligned} {{\,\mathrm{r}\,}}_{\sigma }(Z,\theta ) :=\theta + \eta {{\,\mathrm{\mathbf {E}}\,}}_{\mu } \rho _{\sigma }\left( Z-\theta \right) , \text { where } \rho _{\sigma }(u) :=\begin{cases} |u|, &\text {if } \sigma = 0,\\ \rho \left( u/\sigma \right) , &\text {if } 0< \sigma < \infty ,\\ u^2, &\text {if } \sigma = \infty . \end{cases} \end{aligned}$$
(4)

With this class of ancillary functions in hand, it is natural to define

$$\begin{aligned} {{\,\mathrm{R}\,}}_{\sigma }(Z) :=\inf _{\theta \in \mathbb {R}} \, {{\,\mathrm{r}\,}}_{\sigma }(Z,\theta ) \end{aligned}$$
(5)

to construct a class of risk functions. In the context of learning, we will use this risk function to derive a generalized risk, namely the composite function \(h \mapsto {{\,\mathrm{R}\,}}_{\sigma }({{\,\mathrm{L}\,}}(h))\). As a special case, clearly this includes risks of the form (2) given earlier. Visualizations of \(\rho \) highlighting the role of re-scaling are given in Fig. 1. See also Fig. 5 in the appendix for visualizations of different loss functions transformed according to (4). Minimizing \({{\,\mathrm{R}\,}}_{\sigma }({{\,\mathrm{L}\,}}(h))\) in h is our formal learning problem of interest.
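For concreteness, a direct NumPy transcription of (3)–(5), with \({{\,\mathrm{\mathbf {E}}\,}}_{\mu }\) replaced by a sample mean, might look as follows (an illustrative sketch with our own naming; since \(\theta \mapsto {{\,\mathrm{r}\,}}_{\sigma }(Z,\theta )\) is convex, cf. Proposition 3, a scalar solver suffices for the inner minimization):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def rho(u):
    # Deviation function (3).
    return u * np.arctan(u) - 0.5 * np.log1p(u ** 2)

def rho_sigma(u, sigma):
    # Re-scaled family (4): absolute value at sigma=0, squared error at sigma=inf.
    if sigma == 0.0:
        return np.abs(u)
    if np.isinf(sigma):
        return u ** 2
    return rho(u / sigma)

def R_sigma(sample, sigma, eta):
    # Empirical analogue of (5): minimize theta + eta * mean(rho_sigma(Z - theta)).
    r = lambda theta: theta + eta * np.mean(rho_sigma(sample - theta, sigma))
    result = minimize_scalar(r)  # the objective is convex in theta
    return result.fun, result.x  # the risk value and the minimizing theta_Z
```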

2.2 Basic properties

Before considering learning algorithms, we briefly cover the basic properties of the functions \({{\,\mathrm{r}\,}}_{\sigma }\) and \({{\,\mathrm{R}\,}}_{\sigma }\). Without restricting ourselves to the specialized context of “losses,” note that if Z is any square-\(\mu \)-integrable random variable, then \(|{{\,\mathrm{r}\,}}_{\sigma }(Z,\theta )|< \infty \) for all \(\theta \in \mathbb {R}\), and thus \({{\,\mathrm{R}\,}}_{\sigma }(Z) < \infty \). Furthermore, the following result shows that it is straightforward to set the weight \(\eta \) to ensure \({{\,\mathrm{R}\,}}_{\sigma }(Z) > -\infty \) also holds, and a minimum exists.

Proposition 1

Assuming that \({{\,\mathrm{\mathbf {E}}\,}}_{\mu }Z^2 < \infty \), set \(\eta \) based on \(\sigma \in [0,\infty ]\) as follows: if \(\sigma = 0\), take \(\eta > 1\); if \(0< \sigma < \infty \), take \(\eta > 2\sigma /\pi \); if \(\sigma = \infty \), take any \(\eta > 0\). Under these settings, the function \(\theta \mapsto {{\,\mathrm{r}\,}}_{\sigma }(Z,\theta )\) is bounded below and takes its minimum on \(\mathbb {R}\). Thus, for each square-\(\mu \)-integrable Z, there always exists a (non-random) \(\theta _{Z} \in \mathbb {R}\) such that

$$\begin{aligned} {{\,\mathrm{R}\,}}_{\sigma }(Z) = \theta _{Z} + \eta {{\,\mathrm{\mathbf {E}}\,}}_{\mu }\rho _{\sigma }\left( Z-\theta _{Z}\right) . \end{aligned}$$
(6)

Furthermore, when \(\sigma > 0\), this minimum \(\theta _{Z}\) is unique.

Remark 2

(Mean-median interpolation) In order to ensure that risk modulation via \(\sigma \in [0,\infty ]\) smoothly transitions from a median-centric (\(\sigma = 0\) case) to a mean-centric (\(\sigma = \infty \) case) location, the parameter \(\eta \) plays a key role. For \(\rho \) defined by (3), we have \(2\sigma ^2 \rho (u/\sigma ) \rightarrow u^2\) as \(\sigma \rightarrow \infty \) for any \(u \in \mathbb {R}\), and thus for large values of \(\sigma > 0\) it is natural to set \(\eta = 2\sigma ^2\). On the other end of the spectrum, since \(\sigma \log (1+(u/\sigma )^2) \rightarrow 0_{+}\) as \(\sigma \rightarrow 0_{+}\), it is natural to set \(\eta = \sigma /{{\,\mathrm{atan}\,}}(\infty ) = 2\sigma /\pi \) when \(\sigma > 0\) is small. Strictly speaking, in light of the conditions in Proposition 1, to ensure \({{\,\mathrm{R}\,}}_{\sigma }\) is finite one should take \(\eta > 2\sigma /\pi \).
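Packaged as code, the \(\sigma \)-dependent defaults used in Fig. 1 read as follows; this is a hypothetical helper of our own, where the value 1.05 at \(\sigma = 0\) and the switch at \(\sigma = 1\) simply follow the figure, and where Proposition 1 strictly requires \(\eta > 2\sigma /\pi \) in the middle regime.

```python
import numpy as np

def eta_default(sigma):
    # Interpolates the median-centric (sigma=0) and mean-centric (sigma=inf) ends.
    if sigma == 0.0:
        return 1.05                        # any eta > 1 suffices (Proposition 1)
    if np.isinf(sigma):
        return 1.0                         # any eta > 0 suffices
    if sigma < 1.0:
        return sigma / np.arctan(np.inf)   # = 2*sigma/pi, the small-sigma limit
    return 2.0 * sigma ** 2                # matches the large-sigma limit above
```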

What can we say about our risk functions \({{\,\mathrm{R}\,}}_{\sigma }\) in terms of more traditional statistical risk properties? The form of \({{\,\mathrm{R}\,}}_{\sigma }\) given in (6) has a simple interpretation as a weighted sum of “location” and “deviation” terms. In the statistical risk literature, the seminal work of Artzner et al. (1999) gives an axiomatic characterization of location-based risk functions that can be considered coherent, while Rockafellar et al. (2006) characterize functions which capture the intuitive notion of “deviation,” and establish a lucid relationship between coherent risks and their deviation class. The following result describes key properties of the proposed risk functions, in particular highlighting the fact that while our location terms are monotonic, our risk functions are non-traditional in that they are non-monotonic.

Proposition 3

(Non-monotonic risk functions) Let \(\mathcal {Z}\) be a Banach space of square-\(\mu \)-integrable functions. For any \(\sigma \in [0,\infty ]\), let \(\eta > 0\) be set as in Proposition 1. Then, the functions \({{\,\mathrm{r}\,}}_{\sigma }:\mathcal {Z}\times \mathbb {R}\rightarrow \mathbb {R}\) and \({{\,\mathrm{R}\,}}_{\sigma }:\mathcal {Z}\rightarrow \mathbb {R}\) satisfy the following properties:

  • Both \({{\,\mathrm{r}\,}}_{\sigma }\) and \({{\,\mathrm{R}\,}}_{\sigma }\) are continuous, convex, and sub-differentiable.

  • The location in (6) is monotonic (i.e., \(Z_1 \le Z_2\) implies \(\theta _{Z_1} \le \theta _{Z_2}\)) and translation-equivariant (i.e., \(\theta _{Z+a} = \theta _{Z} + a\) for any \(a \in \mathbb {R}\)), for any \(0 < \sigma \le \infty \).

  • The deviation in (6) is non-negative and translation-invariant, namely for any \(a \in \mathbb {R}\), we have \({{\,\mathrm{\mathbf {E}}\,}}_{\mu }\rho _{\sigma }(Z+a-\theta _{Z+a}) = {{\,\mathrm{\mathbf {E}}\,}}_{\mu }\rho _{\sigma }(Z-\theta _{Z})\), for any \(0 < \sigma \le \infty \).

  • The risk \({{\,\mathrm{R}\,}}_{\sigma }\) is not monotonic, i.e., \(\mu \{Z_1 \le Z_2\}=1\) need not imply \({{\,\mathrm{R}\,}}_{\sigma }(Z_1) \le {{\,\mathrm{R}\,}}_{\sigma }(Z_2)\).

In particular, the risk \(h \mapsto {{\,\mathrm{R}\,}}_{\sigma }({{\,\mathrm{L}\,}}(h))\) need not be convex, even if \({{\,\mathrm{L}\,}}(\cdot )\) is.
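To see the failure of monotonicity concretely, note that in the special case \(\sigma = \infty \), solving the first-order condition for \(\theta \) in (4)–(5) gives the closed form

$$\begin{aligned} {{\,\mathrm{R}\,}}_{\infty }(Z) = {{\,\mathrm{\mathbf {E}}\,}}_{\mu }Z + \eta {{\,\mathrm{var}\,}}_{\mu }Z - \frac{1}{4\eta }. \end{aligned}$$

With \(\eta = 1\), taking \(Z_1 = \pm 2\) with equal probability and \(Z_2 \equiv 2\) gives \(Z_1 \le Z_2\) almost surely, and yet \({{\,\mathrm{R}\,}}_{\infty }(Z_1) = 15/4 > 7/4 = {{\,\mathrm{R}\,}}_{\infty }(Z_2)\); the deviation penalty can overturn the pointwise ordering.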

Remark 4

Since our risk function \({{\,\mathrm{R}\,}}_{\sigma }\) is not monotonic, standard results in the literature on optimizing generalized risks do not apply here. We remark that our proposed risk class does not appear among the comprehensive list of examples given in the works of Ruszczyński and Shapiro (2006a, 2006b), aside from the special case of \(\sigma = \infty \) with \(\eta = 1\). While the continuity and sub-differentiability of any risk function which is convex and monotonic is well-known for a large class of Banach spaces (Ruszczyński and Shapiro 2006a, Sec. 3), in Proposition 3 we obtain such properties without monotonicity by using square-\(\mu \)-integrability combined with properties of our function class \(\rho _{\sigma }\).

Since our principal interest is the case where \(Z = {{\,\mathrm{L}\,}}(h)\), the key takeaways from this section are that while the proposed risk \(h \mapsto {{\,\mathrm{R}\,}}_{\sigma }({{\,\mathrm{L}\,}}(h))\) is well-defined and easy to estimate given a random sample \({{\,\mathrm{L}\,}}_1(h),\ldots ,{{\,\mathrm{L}\,}}_n(h)\), the learning task is non-trivial since \({{\,\mathrm{R}\,}}_{\sigma }({{\,\mathrm{L}\,}}(\cdot ))\) is not differentiable (and thus non-smooth) when \(\sigma = 0\), and for any \(\sigma \in [0,\infty ]\) need not be convex, even when the underlying loss is both smooth and convex. Fortunately, smoothness properties of the losses typically used in machine learning can be leveraged to overcome these technical barriers, opening a path towards learning guarantees for practical algorithms, which is the topic of Sect. 3.

2.3 Strengths and limitations

Before diving further into the analysis of learning algorithms in the next section, let us pause for a moment to consider the following important question: when should we use the proposed risk class over more traditional alternatives? In addition, what kind of limitations or tradeoffs are faced when using this risk class? In the following paragraphs we attempt to provide an initial answer to these questions, in the context of traditional ERM and the alternative risk functions discussed in the literature review of Sect. 1.

2.3.1 Flexible control over deviations

The most obvious feature of the risk class defined in (4)–(5) is that it offers significant control over how we penalize deviations in the loss distribution. Unlike traditional ERM, which simply asks us to minimize the location \({{\,\mathrm{\mathbf {E}}\,}}_{\mu }{{\,\mathrm{L}\,}}(h)\), the proposed risk can only be made small when both the location and the deviations are sufficiently small. It is well known that encouraging the (test) loss distribution to have small variance yields sharper bounds on the expected test loss than are available with naive ERM (Maurer and Pontil 2009). Furthermore, loss deviations are deeply linked to notions of fairness in machine learning (Williamson and Menon 2019), where an explicit design decision is made to ensure performance is similar across sensitive features (such as age, race, or gender), potentially at the cost of a larger expected loss. The alternative risks considered in this line of research add a deviation term to the expected loss \({{\,\mathrm{\mathbf {E}}\,}}_{\mu }{{\,\mathrm{L}\,}}(h)\) (Hashimoto et al. 2018), and thus are always at least as sensitive to the loss distribution tails as the mean itself is. On the other hand, our deviation penalty based on \(\sigma \) and \(\rho \) can be used to enforce fairness with significant flexibility over the influence that errant observations have, since the location term is not fixed to the mean, but rather determined by the \(\rho \) and \(\sigma \) setting that we use; e.g., by taking \(\sigma \rightarrow 0\) we obtain a location close to the median (often much lower than the mean), with deviations measured using a function that is insensitive to outliers. Please see Sect. 4 for empirical evidence that our risk class can be effectively used to control test loss deviations.

2.3.2 Symmetry

Another important feature of the risk class that we study here is the bidirectional nature of the function \(\rho \) used to measure deviations. This is in stark contrast with existing alternative risk classes such as CVaR, entropic risk, and other so-called “optimal certainty equivalent” risks (Lee et al. 2020), as well as distributionally robust optimization (DRO) risks (Namkoong and Duchi 2016; Zhai et al. 2021), which all place a strong emphasis on the loss tails on the upside, while downplaying or completely ignoring tails on the downside. This difference becomes important for losses that are unbounded below and can take on negative values, such as in logistic regression or for more general negative log-likelihood minimization. When the loss distribution (during training) is asymmetric with heavy tails on the downside, risks such as CVaR will provide a much smaller penalty than our risk class, which by design picks up on deviations in both directions. This can be expected to encourage greater symmetry in the test loss distribution, a phenomenon that we have also observed empirically (e.g., Fig. 4).

2.3.3 Drawbacks and workarounds

One of the key limitations of our approach is that the proposed risk class does not preserve convexity: even when the underlying loss \({{\,\mathrm{L}\,}}(h)\) is convex in h, the risk \({{\,\mathrm{R}\,}}_{\sigma }({{\,\mathrm{L}\,}}(h))\) need not be. One immediate consequence is that analysis of learning algorithms cannot in general yield (global) excess risk bounds, but rather must be limited to either analysis centered around an arbitrary local minimum or global analysis of stationarity (see Sect. 3 for more details). Another side to this is computational. For extremely computationally intensive tasks where production-class convex solvers are an essential ingredient, our risk class cannot be used as-is. One natural work-around is to replace \(\rho (\cdot )\) with \(\rho ((\cdot )_{+})\), which sacrifices the aforementioned symmetry of the deviation measure in exchange for monotonicity and convexity, while still retaining flexible control over the influence of tails on the upside. On the other hand, a great deal of modern machine learning applications are built upon non-linear models (e.g., most neural networks) which involve \({{\,\mathrm{L}\,}}(h)\) that is non-convex in h to begin with, so the drawbacks discussed here really only arise within the context of (large production-grade) linear models.
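In code, this work-around amounts to a one-line change to the deviation function (reusing the rho_sigma sketch from Sect. 2.1):

```python
import numpy as np

def rho_sigma_plus(u, sigma):
    # One-sided variant rho_sigma((u)_+): only upward deviations are penalized.
    # Composing the nondecreasing convex branch of rho_sigma with max(u, 0)
    # restores monotonicity and convexity of the resulting risk.
    return rho_sigma(np.maximum(u, 0.0), sigma)
```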

3 Learning algorithm analysis

Algorithm 1 (pseudocode figure): stochastic projected sub-gradient updates on the joint variable \((h,\theta )\), as specified by (8)–(9) below.

Thus far, we have only been concerned with ideal quantities \({{\,\mathrm{R}\,}}_{\sigma }\) and \({{\,\mathrm{r}\,}}_{\sigma }\) used to define the ultimate formal goal of learning. In practice, the learner will only have access to noisy, incomplete information. In this work, we focus on iterative algorithms based on stochastic gradients, largely motivated by their practical utility and ubiquity in modern machine learning applications. For the rest of the paper, we overload our risk definitions to enhance readability, writing \({{\,\mathrm{r}\,}}_{\sigma }(h,\theta ) :={{\,\mathrm{r}\,}}_{\sigma }({{\,\mathrm{L}\,}}(h),\theta )\) and \({{\,\mathrm{R}\,}}_{\sigma }(h) :={{\,\mathrm{R}\,}}_{\sigma }({{\,\mathrm{L}\,}}(h))\). First note that we can break down the underlying joint objective as \({{\,\mathrm{r}\,}}_{\sigma }(h,\theta ) = {{\,\mathrm{\mathbf {E}}\,}}_{\mu }(f_2 \circ F_1)(h,\theta )\), where we have defined

$$\begin{aligned} F_1(h,\theta ) :=({{\,\mathrm{L}\,}}(h),\theta ), \qquad f_2(u,\theta ) :=\theta + \eta \rho _{\sigma }(u-\theta ). \end{aligned}$$
(7)

From the point of view of the probability space \((\Omega ,\mathcal {F},\mu )\), the function \(F_1\) is random, whereas \(f_2\) is deterministic; our use of upper- and lower-case letters is just meant to emphasize this. Given some initial value \((h_0,\theta _0) \in \mathcal {H}\times \mathbb {R}\), one naively hopes to construct an efficient stochastic gradient algorithm using the update

$$\begin{aligned} (h_{t+1},\theta _{t+1})&= {{\,\mathrm{\Pi }\,}}_{C} \left[ (h_t,\theta _t) - \alpha _{t} G_t \right] , \end{aligned}$$
(8)

where \(\alpha _t > 0\) is a step-size parameter, \({{\,\mathrm{\Pi }\,}}_{C}[\cdot ]\) denotes projection to some set \(C \subset \mathcal {H}\times \mathbb {R}\), and the stochastic feedback \(G_t\) is just a composition of sub-gradients, namely

$$\begin{aligned} G_t&\in \partial f_2({{\,\mathrm{L}\,}}(h_t),\theta _t) \circ \partial F_1(h_t,\theta _t). \end{aligned}$$
(9)

We call this approach “naive” since it is exactly what we would do if we knew a priori that the underlying objective was convex and/or smooth.Footnote 2 The precise learning algorithm studied here is summarized in Algorithm 1. Fortunately, as we describe below, this naive procedure actually enjoys lucid non-asymptotic guarantees, on par with the smooth case.
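Since Algorithm 1 is given only in pseudocode, the following sketch shows one way the update (8)–(9) might be realized for a parametric model with a differentiable per-example loss; the ball projection and the step-size-weighted averaging used to produce the final output \((\overline{h}_{[n]},\overline{\theta }_{[n]})\) are our own simplified stand-ins.

```python
import numpy as np

def algorithm1(per_example_grads, per_example_losses, batches,
               w0, theta0, sigma, eta, radius, alphas):
    """Projected stochastic sub-gradient method on the joint variable (w, theta).

    per_example_grads(w, batch): loss gradients, one row per example.
    per_example_losses(w, batch): loss values, one per example.
    """
    w, theta = np.asarray(w0, dtype=float), float(theta0)
    iterates = []
    for alpha, batch in zip(alphas, batches):
        u = (per_example_losses(w, batch) - theta) / sigma
        rp = np.arctan(u)  # rho'(u) for the choice (3)
        # Chain rule through f2 composed with F1, as in (7) and (9):
        g_w = (eta / sigma) * np.mean(rp[:, None] * per_example_grads(w, batch), axis=0)
        g_theta = 1.0 - (eta / sigma) * np.mean(rp)
        w, theta = w - alpha * g_w, theta - alpha * g_theta
        # Projection onto C, taken here to be a closed ball (a closed convex set).
        norm = np.sqrt(w @ w + theta ** 2)
        if norm > radius:
            w, theta = w * (radius / norm), theta * (radius / norm)
        iterates.append((w.copy(), theta, alpha))
    # Step-size-weighted average of the iterates as the final output.
    total = sum(a for _, _, a in iterates)
    w_bar = sum(a * wi for wi, _, a in iterates) / total
    theta_bar = sum(a * th for _, th, a in iterates) / total
    return w_bar, theta_bar
```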

3.1 How to measure algorithm performance?

Before stating any formal results, we briefly discuss the means by which we evaluate learning algorithm performance. Since the sequence \(({{\,\mathrm{R}\,}}_{\sigma }(h_t))\) cannot be controlled in general, a more tractable problem is that of finding a stationary point of \({{\,\mathrm{r}\,}}_{\sigma }\), namely any \((h^{*},\theta ^{*})\) such that \(0 \in \partial {{\,\mathrm{r}\,}}_{\sigma }(h^{*},\theta ^{*})\). However, it is not practical to analyze \({{\,\mathrm{dist}\,}}(0;\partial {{\,\mathrm{r}\,}}_{\sigma }(h_t,\theta _t))\) directly, due to a lack of continuity. Instead, we consider a smoothed version of \({{\,\mathrm{r}\,}}_{\sigma }\):

$$\begin{aligned} {{\,\mathrm{\widetilde{r}}\,}}_{\sigma ,\beta }(h,\theta ) :=\inf _{h^{\prime },\theta ^{\prime }} \left[ {{\,\mathrm{r}\,}}_{\sigma }(h^{\prime },\theta ^{\prime }) + \frac{1}{2\beta }\Vert (h,\theta )-(h^{\prime },\theta ^{\prime })\Vert ^{2} \right] . \end{aligned}$$
(10)

This is none other than the Moreau envelope of \({{\,\mathrm{r}\,}}_{\sigma }\), with weighting parameter \(\beta > 0\). A familiar concept from convex analysis on Hilbert spaces (Bauschke and Combettes 2017, Ch. 12 and 24), the Moreau envelope of non-smooth functions satisfying weak convexity properties has recently been shown to be a very useful metric for evaluating stochastic optimizers (Davis and Drusvyatskiy 2019; Drusvyatskiy and Paquette 2019). Our basic performance guarantees will first be stated in terms of the gradient of the smoothed function \({{\,\mathrm{\widetilde{r}}\,}}_{\sigma ,\beta }\). We will then relate this to the joint risk \({{\,\mathrm{r}\,}}_{\sigma }\) and subsequently the risk \({{\,\mathrm{R}\,}}_{\sigma }\).
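For intuition, a standard one-dimensional example: the Moreau envelope of the absolute value is the Huber function, so the smoothing in (10) replaces the non-differentiable kink at zero with a quadratic, while away from the kink the function is unchanged up to the constant \(\beta /2\):

$$\begin{aligned} \inf _{x^{\prime } \in \mathbb {R}} \left[ |x^{\prime }| + \frac{1}{2\beta }(x-x^{\prime })^{2} \right] = \begin{cases} x^{2}/(2\beta ), &\text {if } |x| \le \beta \\ |x| - \beta /2, &\text {if } |x| > \beta . \end{cases} \end{aligned}$$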

3.2 Guarantees based on joint risk minimization

Within the context of the stochastic updates characterized by (8)–(9), we consider the case in which \(\mathcal {H}\) is any Hilbert space. All Hilbert spaces are reflexive Banach spaces, and the stochastic sub-gradient \(G_t \in (\mathcal {H}\times \mathbb {R})^{*}\) (the dual of \(\mathcal {H}\times \mathbb {R}\)) can be uniquely identified with an element of \(\mathcal {H}\times \mathbb {R}\), for which we use the same notation \(G_t\). Denoting the partial sequence \(G_{[t]} :=(G_0,\ldots ,G_t)\), we formalize our assumptions as follows:

  A1. For all \(h \in \mathcal {H}\), the random loss \({{\,\mathrm{L}\,}}(h)\) is square-\(\mu \)-integrable, locally Lipschitz, and weakly \(\lambda \)-smooth, with a gradient satisfying \({{\,\mathrm{\mathbf {E}}\,}}_{\mu }|{{\,\mathrm{L}\,}}^{\prime }(h;\cdot )|^{2} < \infty \).

  A2. \(\mathcal {H}\) is a Hilbert space, and \(C \subset \mathcal {H}\times \mathbb {R}\) is a closed convex set.

  A3. The feedback (9) satisfies \({{\,\mathrm{\mathbf {E}}\,}}[G_t \,\vert \,G_{[t-1]}] = {{\,\mathrm{\mathbf {E}}\,}}_{\mu }G_t\) for all \(t > 0\).Footnote 3

  A4. For some \(0< \kappa < \infty \), the second moments are bounded as \({{\,\mathrm{\mathbf {E}}\,}}_{\mu }\Vert G_t\Vert ^{2} \le \kappa ^{2}\) for all t.

The following is a performance guarantee for Algorithm 1 in terms of the smoothed joint risk.

Theorem 5

(Nearly-stationary point of smoothed objective) If \(0< \sigma < \infty \), set smoothing parameter \(\gamma = (1+\eta \pi /(2\sigma ))\max \{1,\lambda \}\). Otherwise, if \(\sigma = 0\), set \(\gamma = (1+\eta )\max \{1,\lambda \}\). Under these \(\sigma \)-dependent settings and assumptions A1–A4, let \((\overline{h}_{[n]},\overline{\theta }_{[n]})\) denote the output of Algorithm 1, with \({{\,\mathrm{r}\,}}_{C}^{*} :=\inf \{{{\,\mathrm{r}\,}}_{\sigma }(h,\theta ): (h,\theta ) \in C\}\) denoting the minimum over the feasible set and \(\Delta _0 :={{\,\mathrm{\widetilde{r}}\,}}_{\sigma ,\beta }(h_0,\theta _0) - {{\,\mathrm{r}\,}}_{C}^{*}\) denoting the initialization error. Then, for any choice of \(n>1\), \(\eta > 0\), and \(\beta < 1/\gamma \), we have that

$$\begin{aligned} {{\,\mathrm{\mathbf {E}}\,}}\Vert {{\,\mathrm{\widetilde{r}}\,}}_{\sigma ,\beta }^{\prime }(\overline{h}_{[n]},\overline{\theta }_{[n]})\Vert ^{2} \le \left( \frac{1}{1-\beta \gamma }\right) \frac{\Delta _0 + \gamma \kappa ^{2}\sum _{t=0}^{n-1}\alpha _{t}^{2}/2}{\sum _{t=0}^{n-1}\alpha _t}, \end{aligned}$$

where expectation is taken over all the feedback \(G_{[n-1]}\).

Remark 6

(Sample complexity) Let us briefly describe a direct take-away from Theorem 5. If \(\Delta _0\), \(\gamma \), and \(\kappa \) are known (upper bounds will of course suffice), then choosing step sizes such that \(\alpha _t^2 \ge \Delta _0/(n\gamma \kappa ^2)\) and setting \(\beta = 1/(2\gamma )\), it follows immediately that

$$\begin{aligned} {{\,\mathrm{\mathbf {E}}\,}}\Vert {{\,\mathrm{\widetilde{r}}\,}}_{\sigma ,\beta }^{\prime }(\overline{h}_{[n]},\overline{\theta }_{[n]})\Vert ^2 \le \sqrt{\frac{2\gamma \kappa ^{2}\Delta _0}{n}}. \end{aligned}$$

Fixing some desired precision level of \(\sqrt{{{\,\mathrm{\mathbf {E}}\,}}\Vert {{\,\mathrm{\widetilde{r}}\,}}_{\sigma ,\beta }^{\prime }(\overline{h}_{[n]},\overline{\theta }_{[n]})\Vert ^2} \le \varepsilon \), the sample complexity is \(\mathcal {O}(\varepsilon ^{-4})\). This matches guarantees available in the smooth (but non-convex) case (Ghadimi and Lan 2013), and suggests that the “naive” strategy implemented by Algorithm 1 in fact comes with a clear theoretical justification.
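As a quick worked instance of this remark (our own arithmetic, taking equality in the step-size condition above):

```python
import math

def budget_for_precision(epsilon, gamma, kappa, delta0):
    # Solving sqrt(2 * gamma * kappa**2 * delta0 / n) <= epsilon**2 for n
    # gives the O(1/epsilon^4) sample complexity quoted above.
    n = math.ceil(2.0 * gamma * kappa ** 2 * delta0 / epsilon ** 4)
    alpha = math.sqrt(delta0 / (n * gamma * kappa ** 2))  # constant step size
    return n, alpha

print(budget_for_precision(0.1, 2.0, 1.0, 1.0))  # n = 40000 iterations
```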

3.3 Implications in terms of the original objective

The results described in Theorem 5 and Remark 6 are with respect to a smoothed version of the joint risk function \({{\,\mathrm{r}\,}}_{\sigma }\). Linking these facts to insights in terms of the original proposed risk \({{\,\mathrm{R}\,}}_{\sigma }\) can be done as follows. Assuming we take \(n \ge 2\gamma \kappa ^{2}\Delta _0/\varepsilon ^{4}\) to achieve the \(\varepsilon \)-precision discussed in Remark 6, the immediate conclusion is that the algorithm output is \((\varepsilon /(2\gamma ))\)-close to an \(\varepsilon \)-nearly stationary point of \({{\,\mathrm{r}\,}}_{\sigma }\). More precisely, we have that there exists an ideal point \((h_{n}^{*},\theta _{n}^{*})\) such that

$$\begin{aligned} {{\,\mathrm{\mathbf {E}}\,}}\left[ {{\,\mathrm{dist}\,}}\left( 0;\partial {{\,\mathrm{r}\,}}_{\sigma }(h_{n}^{*},\theta _{n}^{*})\right) \right] \le \varepsilon , \text { and } {{\,\mathrm{\mathbf {E}}\,}}\left\| (\overline{h}_{[n]},\overline{\theta }_{[n]})-(h_{n}^{*},\theta _{n}^{*}) \right\| \le \frac{\varepsilon }{2\gamma }. \end{aligned}$$
(11)

The above fact follows from basic properties of the Moreau envelope (cf. Appendix 2.4). These non-asymptotic guarantees of being close to a “good” point extend to the function values of the risk \({{\,\mathrm{R}\,}}_{\sigma }\) since we are close to a candidate \(h_{n}^{*}\) whose risk value can be no worse than

$$\begin{aligned} {{\,\mathrm{\mathbf {E}}\,}}\left[ {{\,\mathrm{R}\,}}_{\sigma }(h_{n}^{*})\right] \le {{\,\mathrm{\mathbf {E}}\,}}\left[ {{\,\mathrm{r}\,}}_{\sigma }(h_{n}^{*},\theta _{n}^{*})\right] \le {{\,\mathrm{\mathbf {E}}\,}}\left[ {{\,\mathrm{r}\,}}_{\sigma }(\overline{h}_{[n]},\overline{\theta }_{[n]})\right] . \end{aligned}$$

We remark that these learning guarantees hold for a class of risks that are in general non-convex and need not even be differentiable, let alone satisfy smoothness requirements.

3.4 Key points in the proof of Theorem 5

Here we briefly highlight the key sub-results involved in proving Theorem 5; please see Appendix 3.2 for all the details. The key structure that we require is a smooth loss, reflected in assumption A1. This, along with the Lipschitz property of our function \(\rho _{\sigma }\) for all \(0 \le \sigma < \infty \), allows us to prove that the underlying objective \({{\,\mathrm{r}\,}}_{\sigma }\) is weakly convex, where \(\mathcal {H}\) can be any Banach space (Proposition 12); this generalizes a result of (Drusvyatskiy and Paquette 2019, Lem. 4.2) from Euclidean space to any Banach space. This alone is not enough to obtain the desired result. Note that assumption A3 is very weak, and trivially satisfied in most traditional machine learning settings (e.g., where losses are based on a sequence of iid data points). The question of whether the feedback is unbiased, i.e., whether \({{\,\mathrm{\mathbf {E}}\,}}_{\mu }G_t\) lies in the sub-differential of \({{\,\mathrm{r}\,}}_{\sigma }\) at step t, is something that needs to be formally verified. In Proposition 14 we show that as long as the gradient has a finite expectation, this indeed holds for the feedback generated by (9), when \(\mathcal {H}\) is any Banach space. With the two key properties of a weakly convex objective and unbiased random feedback in hand, we can leverage the techniques used in (Davis and Drusvyatskiy 2019, Thm. 3.1) for proximal stochastic gradient methods applied to weakly convex functions on \(\mathbb {R}^{d}\), extending their core argument to the case of any Hilbert space. Combining this technique with the proof of weak convexity and unbiasedness lets us obtain Theorem 5.

4 Empirical analysis

In this section we introduce representative results for a series of experiments designed to investigate the quantitative and qualitative repercussions of modulating the underlying risk function class. We have prepared a GitHub repository that includes code for both re-creating the empirical tests and pre-processing all the benchmark data.Footnote 4

4.1 Sanity check in one dimension

As a natural starting point, we use a toy example to ensure that Algorithm 1 takes us where we expect to go for a particular risk setting. Consider a loss on \(\mathbb {R}\) with the form \({{\,\mathrm{L}\,}}(h) = h {{\,\mathrm{L}\,}}_{\text {wide}} + (1-h){{\,\mathrm{L}\,}}_{\text {thin}}\), where \({{\,\mathrm{L}\,}}_{\text {wide}}\) and \({{\,\mathrm{L}\,}}_{\text {thin}}\) are random variables independent of h and each other. As a simple example, we use a folded Normal distribution for both, namely \({{\,\mathrm{L}\,}}_{*}=|\text {Normal}(a_{*},b_{*}^{2})|\), where \(a_{\text {wide}}=0\), \(a_{\text {thin}}=2.0\), \(b_{\text {wide}}=1.0\), and \(b_{\text {thin}}=0.1\). For simplicity, we fix \(\alpha _t = 0.001\) throughout, and each step uses a mini-batch of size 8.Footnote 5 Regarding the risk settings, we focus on the case of \(\sigma = \infty \), varying \(\eta = 2^k\) over \(k=0,1,\ldots ,7\). Results averaged over 100 trials are given in Fig. 2. By modifying \(\eta \), we can control whether the learning algorithm “prefers” candidates whose losses have a high degree of dispersion centered around a good location, or those whose losses are well-concentrated near a weaker location.
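For readers who want the flavor of this test without the full repository, a condensed sketch of the \(\sigma = \infty \) case follows; this is our own simplification, in which \(\rho _{\infty }(u) = u^{2}\) makes the stochastic gradients explicit, and clipping h to [0, 1] serves as a crude stand-in for the projection in (8).

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_losses(h, batch=8):
    # Folded-Normal components with the parameters from the main text.
    wide = np.abs(rng.normal(0.0, 1.0, size=batch))
    thin = np.abs(rng.normal(2.0, 0.1, size=batch))
    return h * wide + (1.0 - h) * thin, wide, thin

def run(eta, steps=10000, alpha=0.001):
    h, theta = 0.5, 0.5
    for _ in range(steps):
        L, wide, thin = sample_losses(h)
        # sigma = inf: rho_inf(u) = u^2, so both partial derivatives are explicit.
        g_h = 2.0 * eta * np.mean((L - theta) * (wide - thin))
        g_theta = 1.0 - 2.0 * eta * np.mean(L - theta)
        h, theta = h - alpha * g_h, theta - alpha * g_theta
        h = min(max(h, 0.0), 1.0)  # crude stand-in for the projection in (8)
    return h, theta

# Small eta tolerates the dispersed arm with the better location (h near 1);
# large eta prefers the tightly concentrated arm (h near 0).
for k in [0, 3, 7]:
    print(2 ** k, run(eta=2.0 ** k))
```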

Fig. 2 A simple toy example using \({{\,\mathrm{L}\,}}(h) = h {{\,\mathrm{L}\,}}_{\text {wide}} + (1-h){{\,\mathrm{L}\,}}_{\text {thin}}\). Trajectories shown are the sequence \((h_t)\) generated by running (8) on \(\mathbb {R}^{2}\), with \(h_0 = 0.5\) and \(\theta _0 = 0.5\), averaged over all trials. Densities of \({{\,\mathrm{L}\,}}_{\text {wide}}\) (red) and \({{\,\mathrm{L}\,}}_{\text {thin}}\) (blue) are also plotted, with additional details in the main text (Color figure online)

4.2 Impact of risk choice on linear regression

Next we consider how the key choice of \(\sigma \) (and thus the underlying risk \({{\,\mathrm{R}\,}}_{\sigma }\)) plays a role in the behavior of Algorithm 1. As another simple, yet more traditional example, consider linear regression in one dimension, where \(Y = w_{0}^{*} + w_{1}^{*}X + \epsilon \), with X and \(\epsilon \) independent zero-mean random variables, and \((w_{0}^{*},w_{1}^{*}) \in \mathbb {R}^{2}\) unknown to the learner. Using squared error \({{\,\mathrm{L}\,}}(h) = (Y - h(X))^{2}\), we run Algorithm 1 again with mini-batches of size 8 and \(\alpha _t = 0.001\) fixed throughout, over a range of \(\sigma \in [0, \infty ]\) settings, for the same number of iterations as in the previous experiment. The initial value \((h_0,\theta _0)\) is set to zero plus uniform noise on \([-0.05,0.05]\). We also consider multiple noise distributions; as a concrete example, letting \(N = \text {Normal}(0,(0.8)^2)\), we consider both \(\epsilon = N\) (Normal case) and \(\epsilon = \mathrm {e}^{N} - {{\,\mathrm{\mathbf {E}}\,}}\mathrm {e}^{N}\) (log-Normal case). In Fig. 3, we plot the learned regression lines (averaged over 100 trials) for each choice of \(\sigma \) and each noise setting. By modulating the target risk function, we can effectively choose between a self-imposed bias (smaller slope and lower intercept here) and a sensitivity to outlying values.

Fig. 3 Learned regression lines (solid; colors denote \(\sigma \in [0,\infty ]\)) are plotted along with the true model \((w_{0}^{*},w_{1}^{*}) = (1.0,1.0)\) (dashed; black). Histograms are of independent samples of \(w_{0}^{*} + \epsilon \). The left plots are the Normal case, and the right plots are the log-Normal case

4.3 Tests using real-world data

Finally, we consider an application to some well-known benchmark datasets for classification. At a high level, we run Algorithm 1 for multi-class logistic regression over 10 independent trials, where in each trial we randomly shuffle and re-split each full dataset (88% training, 12% testing), and randomly re-initialize the model weights exactly as in the previous experiment, again with mini-batches of size 8, and step sizes fixed to \(\alpha _t = 0.01/\sqrt{d}\), where d is the number of free parameters. Additional background on the datasets is given in Appendix 6. The key question of interest is how the test loss distribution changes as we modify the learner’s feedback to optimize a range of risks \({{\,\mathrm{R}\,}}_{\sigma }\). In Fig. 4, we see a stark difference between traditional empirical risk minimization (ERM, denoted “off”) and \({{\,\mathrm{R}\,}}_{\sigma }\)-based feedback, particularly for moderately large values of \(\sigma \). The logistic losses are concentrated much more tightly (visible in the bottom row histograms), and this also leads to better classification error (visible in the top row plots), an interesting trend that we observed across many distinct datasets.

Fig. 4 Top row: average test error (zero-one loss) as a function of epochs, for four datasets and five \(\sigma \) levels, plus traditional ERM (denoted “off”). Middle and bottom rows: histograms of the test error (logistic loss) incurred after the final epoch for one trial under the covtype and emnist_balanced datasets

5 Concluding remarks

Moving forward, an appealing direction is to consider risk function classes which can encode properties going beyond location and deviation, such as explicit symmetry, multi-modality, tail properties, and so forth. The hope is to develop a flexible and highly expressive framework coupled with efficient and practical algorithms with guarantees, which encodes distributional properties that are not readily captured by the current framework or traditional risk classes such as spectral risks (Khim et al. 2020).

Taking the results of this paper together as a whole, we have obtained a generalized class of risk functions and a practical class of learning algorithms, which together effectively allow the user to encode the intuitive notions of location and deviation into the learning process. Of particular note is how changing the underlying risk has a lucid impact on the outcome of learning. In Sect. 4, we empirically verified that a simple modification of the scaling parameter of the underlying risk class (changing \(\sigma \)) was enough to result in a salient effect on the test loss distribution, using real-world datasets and typical stochastic gradient-based learning methods. While the importance of considering more properties of the loss distribution than just the expected value has long been understood (Breiman 1999; Reyzin and Schapire 2006), the key takeaway here is that we have seen how principled algorithmic modifications can bring about interpretable and meaningful effects on the test distribution, without having to abandon formal performance guarantees.