1 Introduction

Optimal transport theory has a long and distinguished history in mathematics dating back to the seminal work of Monge [107] and Kantorovich [79]. While originally envisaged for applications in civil engineering, logistics and economics, optimal transport problems provide a natural framework for comparing probability measures and have therefore recently found numerous applications in statistics and machine learning. Indeed, the minimum cost of transforming a probability measure \(\mu \) on \({\mathcal {X}}\) to some other probability measure \(\nu \) on \({\mathcal {Y}}\) with respect to a prescribed cost function on \({\mathcal {X}}\times {\mathcal {Y}}\) can be viewed as a measure of distance between \(\mu \) and \(\nu \). If \({\mathcal {X}}={\mathcal {Y}}\) and the cost function coincides with (the \(p^{\text {th}}\) power of) a metric on \({\mathcal {X}}\times {\mathcal {X}}\), then the resulting optimal transport distance represents (the \(p^{\text {th}}\) power of) a Wasserstein metric on the space of probability measures over \({\mathcal {X}}\) [168]. In the remainder of this paper we distinguish discrete, semi-discrete and continuous optimal transport problems in which either both, only one or none of the two probability measures \(\mu \) and \(\nu \) are discrete, respectively.

In the wider context of machine learning, discrete optimal transport problems are nowadays routinely used, for example, in the analysis of mixture models [84, 118] as well as in image processing [8, 58, 83, 121, 160], computer vision and graphics [124, 125, 140, 156, 157], data-driven bioengineering [59, 86, 169], clustering [73], dimensionality reduction [29, 60, 139, 145, 148], domain adaptation [38, 109], distributionally robust optimization [106, 117, 150, 151], scenario reduction [72, 142], scenario generation [74, 129], the assessment of the fairness properties of machine learning algorithms [67, 161, 162] and signal processing [163].

The discrete optimal transport problem represents a tractable linear program that is susceptible to the network simplex algorithm [119]. Alternatively, it can be addressed with dual ascent methods [21], the Hungarian algorithm for assignment problems [85] or customized auction algorithms [19, 20]. The currently best known complexity bound for computing an exact solution is attained by modern interior-point algorithms. Indeed, if N denotes the number of atoms in \(\mu \) or in \(\nu \), whichever is larger, then the discrete optimal transport problem can be solved in time \(\mathcal {{\tilde{O}}}(N^{2.5})\) with an interior point algorithm by Lee and Sidford [89]. The need to evaluate optimal transport distances between increasingly fine-grained histograms has also motivated efficient approximation schemes. Blanchet et al. [23] and Quanrud [134] show that an \(\epsilon \)-optimal solution can be found in time \({\mathcal {O}}(N^2/\epsilon )\) by reducing the discrete optimal transport problem to a matrix scaling or a positive linear programming problem, which can be solved efficiently by a Newton-type algorithm. Jambulapati et al. [77] describe a parallelizable primal-dual first-order method that achieves a similar convergence rate.
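
To make the linear programming formulation concrete, the following minimal sketch solves a small discrete optimal transport problem with a general-purpose LP solver (assuming NumPy and SciPy are available; the marginals and the cost matrix are illustrative placeholders, and dedicated network simplex implementations are substantially faster in practice):

```python
import numpy as np
from scipy.optimize import linprog

# Illustrative discrete measures: mu with m atoms, nu with n atoms.
m, n = 3, 4
rng = np.random.default_rng(0)
mu_w = np.full(m, 1.0 / m)              # probabilities of the atoms of mu
nu_w = np.full(n, 1.0 / n)              # probabilities of the atoms of nu
C = rng.random((m, n))                  # transportation costs c(x_i, y_j)

# Variables: the transportation plan pi, flattened row-major.
# Equality constraints enforce the prescribed marginals.
A_eq = np.zeros((m + n, m * n))
for i in range(m):
    A_eq[i, i * n:(i + 1) * n] = 1.0    # sum_j pi_ij = mu_w[i]
for j in range(n):
    A_eq[m + j, j::n] = 1.0             # sum_i pi_ij = nu_w[j]
b_eq = np.concatenate([mu_w, nu_w])

res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
print("W_c(mu, nu) =", res.fun)
```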

The tractability of the discrete optimal transport problem can be improved by adding an entropy regularizer to its objective function, which penalizes the entropy of the transportation plan for morphing \(\mu \) into \(\nu \). When the weight of the regularizer grows, this problem reduces to the classical Schrödinger bridge problem of finding the most likely random evolution from \(\mu \) to \(\nu \) [147]. Generic linear programs with entropic regularizers were first studied by Fang [56]. Cominetti and San Martín [35] prove that the optimal values of these regularized problems converge exponentially fast to the optimal values of the corresponding unregularized problems as the regularization weight drops to zero. Non-asymptotic convergence rates for entropy regularized linear programs are derived by Weed [171]. Cuturi [39] was the first to realize that entropic penalties are computationally attractive because they make the discrete optimal transport problem susceptible to a fast matrix scaling algorithm by Sinkhorn [155]. This insight has spurred widespread interest in machine learning and led to a host of new applications of optimal transport in color transfer [31], inverse problems [2, 80], texture synthesis [128], the analysis of crowd evolutions [126] and shape interpolation [157] to name a few. This surge of applications inspired in turn several new algorithms for the entropy regularized discrete optimal transport problem such as a greedy dual coordinate descent method also known as the Greenkhorn algorithm [1, 6, 30]. Dvurechensky et al. [51] and Lin et al. [94] prove that both the Sinkhorn and the Greenkhorn algorithms are guaranteed to find an \(\epsilon \)-optimal solution in time \(\tilde{{\mathcal {O}}}({N^2}/{\epsilon ^2})\). In practice, however, the Greenkhorn algorithm often outperforms the Sinkhorn algorithm [94]. The runtime guarantee of both algorithms can be improved to \(\tilde{{\mathcal {O}}}(N^{7/3}/\epsilon )\) via a randomization scheme [93]. In addition, the regularized discrete optimal transport problem can be addressed by tailoring general-purpose optimization algorithms such as accelerated gradient descent algorithms [51], iterative Bregman projections [18], quasi-Newton methods [24] or stochastic average gradient descent algorithms [64]. While the original optimal transport problem induces sparse solutions, the entropy penalty forces the optimal transportation plan of the regularized optimal transport problem to be strictly positive and thus completely dense. In applications where the interpretability of the optimal transportation plan is important, the lack of sparsity could be undesirable; examples include color transfer [131], domain adaptation [38] or ecological inference [110]. Hence, there is merit in exploring alternative regularization schemes that retain the attractive computational properties of the entropic regularizer but induce sparsity. Examples that have attracted significant interest include smooth convex regularization and Tikhonov regularization [24, 47, 54, 149], Lasso regularization [92], Tsallis entropy regularization [110] or group Lasso regularization [38].
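
As a concrete illustration of the matrix scaling approach, the sketch below implements the basic Sinkhorn iteration for the entropy regularized problem (a minimal version of our own with placeholder data; production implementations add log-domain stabilization and convergence checks):

```python
import numpy as np

def sinkhorn(mu, nu, C, eps=0.1, iters=500):
    """Entropy regularized optimal transport via Sinkhorn's matrix scaling."""
    K = np.exp(-C / eps)                  # Gibbs kernel
    u = np.ones_like(mu)
    for _ in range(iters):
        v = nu / (K.T @ u)                # rescale to match column marginals
        u = mu / (K @ v)                  # rescale to match row marginals
    pi = u[:, None] * K * v[None, :]      # regularized transportation plan
    return pi, np.sum(pi * C)

rng = np.random.default_rng(0)
pi, cost = sinkhorn(np.full(3, 1/3), np.full(4, 1/4), rng.random((3, 4)))
print(cost, pi.sum(axis=1), pi.sum(axis=0))  # cost, (near-)matched marginals
```

Note that the returned plan is strictly positive, which illustrates the loss of sparsity discussed above.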

Much like the discrete optimal transport problems, the significantly more challenging semi-discrete optimal transport problems emerge in numerous applications including variational inference [9], blue noise sampling [133], computational geometry [90], image quantization [42] or deep learning with generative adversarial networks [11, 65, 68]. Semi-discrete optimal transport problems are also used in fluid mechanics to simulate incompressible fluids [43].

Exact solutions of a semi-discrete optimal transport problem can be constructed by solving an incompressible Euler-type partial differential equation discovered by Brenier [27]. Any optimal solution is known to partition the support of the non-discrete measure into cells corresponding to the atoms of the discrete measure [12], and the resulting tessellation is usually referred to as a power diagram. Mirebeau [103] uses this insight to solve Monge-Ampère equations with a damped Newton algorithm, and Kitagawa et al. [82] show that a closely related algorithm with a global linear convergence rate lends itself to the numerical solution of generic semi-discrete optimal transport problems. In addition, Mérigot [102] proposes a quasi-Newton algorithm for semi-discrete optimal transport, which improves a method due to Aurenhammer et al. [12] by exploiting Lloyd's algorithm to iteratively simplify the discrete measure. If the transportation cost is quadratic, Bonnotte [25] relates the optimal transportation plan to the Knothe-Rosenblatt rearrangement for mapping \(\mu \) to \(\nu \), which is very easy to compute.

As usual, regularization improves tractability. Genevay et al. [64] show that the dual of a semi-discrete optimal transport problem with an entropic regularizer is susceptible to an averaged stochastic gradient descent algorithm that enjoys a convergence rate of \(\mathcal O(1/\sqrt{T})\), T being the number of iterations. Altschuler et al. [7] show that the optimal value of the entropically regularized problem converges to the optimal value of the unregularized problem at a quadratic rate as the regularization weight drops to zero. Improved error bounds under stronger regularity conditions are derived by Delalande [46].
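
The averaged SGD scheme is simple to sketch for the entropic case, where the smoothed dual objective involves a log-sum-exp and its stochastic gradient is the vector of logit choice probabilities (a minimal illustration of the idea, not the implementation of [64]; here \(\mu \) is taken to be uniform on \([0,1]^2\), the cost quadratic, and all data placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
N, eps = 5, 0.1
Y = rng.random((N, 2))                   # atoms y_1, ..., y_N of nu
nu = np.full(N, 1.0 / N)                 # their probabilities

def grad_sample(phi, x):
    """Stochastic gradient of the smoothed dual objective at one sample x."""
    u = (phi - np.sum((x - Y) ** 2, axis=1)) / eps
    p = np.exp(u - u.max()); p /= p.sum()   # logit choice probabilities
    return nu - p

phi, phi_avg = np.zeros(N), np.zeros(N)
for t in range(1, 20001):
    x = rng.random(2)                    # fresh sample from mu = Unif([0,1]^2)
    phi += grad_sample(phi, x) / np.sqrt(t)
    phi_avg += (phi - phi_avg) / t       # Polyak-Ruppert averaging
print(phi_avg)
```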

Continuous optimal transport problems constitute difficult variational problems involving infinitely many variables and constraints. Benamou and Brenier [17] recast them as boundary value problems in fluid dynamics, and Papadakis et al. [122] solve discretized versions of these reformulations using first-order methods. For a comprehensive survey of the interplay between partial differential equations and optimal transport we refer to [55]. As nearly all numerical methods for partial differential equations suffer from a curse of dimensionality, current research focuses on solution schemes for regularized continuous optimal transport problems. For instance, Genevay et al. [64] embed their duals into a reproducing kernel Hilbert space to obtain finite-dimensional optimization problems that can be solved with a stochastic gradient descent algorithm. Seguy et al. [149] solve regularized continuous optimal transport problems by representing the transportation plan as a multilayer neural network. This approach results in finite-dimensional optimization problems that are non-convex and offer no approximation guarantees. However, it provides an effective means to compute approximate solutions in high dimensions. Indeed, the optimal value of the entropically regularized continuous optimal transport problem is known to converge to the optimal value of the unregularized problem at a linear rate as the regularization weight drops to zero [32, 36, 53, 120]. Due to a lack of efficient algorithms, applications of continuous optimal transport problems are scarce in the extant literature. Peyré and Cuturi [127] provide a comprehensive survey of numerous applications and solution methods for discrete, semi-discrete and continuous optimal transport problems.

This paper focuses on semi-discrete optimal transport problems. Our main goal is to formally establish that these problems are computationally hard, to propose a unifying regularization scheme for improving their tractability and to develop efficient algorithms for solving the resulting regularized problems, assuming only that we have access to independent samples from the continuous probability measure \(\mu \). Our regularization scheme is based on the observation that any dual semi-discrete optimal transport problem maximizes the expectation of a piecewise affine function with N pieces, where the expectation is evaluated with respect to \(\mu \), and where N denotes the number of atoms of the discrete probability measure \(\nu \). We argue that this piecewise affine function can be interpreted as the optimal value of a discrete choice problem, which can be smoothed by adding random disturbances to the underlying utility values [99, 164]. As probabilistic discrete choice problems are routinely studied in economics and psychology, we can draw on a wealth of literature in choice theory to design various smooth (dual) optimal transport problems with favorable numerical properties. For maximal generality we will also study semi-parametric discrete choice models where the disturbance distribution is itself subject to uncertainty [4, 57, 105, 111]. Specifically, we aim to evaluate the best-case (maximum) expected utility across a Fréchet ambiguity set containing all disturbance distributions with prescribed marginals. Such models can be addressed with customized methods from modern distributionally robust optimization [111]. For Fréchet ambiguity sets, we prove that smoothing the dual objective is equivalent to regularizing the primal objective of the semi-discrete optimal transport problem. The corresponding regularizer penalizes the discrepancy between the chosen transportation plan and the product measure \(\mu \otimes \nu \) with respect to a divergence measure constructed from the marginal disturbance distributions. Connections between primal regularization and dual smoothing were previously recognized by Blondel et al. [24] and Paty and Cuturi [123] in discrete optimal transport and by Genevay et al. [64] in semi-discrete optimal transport. As they are constructed ad hoc or under a specific adversarial noise model, these existing regularization schemes lack the intuitive interpretation offered by discrete choice theory and emerge as special cases of our unifying scheme.

The key contributions of this paper are summarized below.

  i. We study the computational complexity of semi-discrete optimal transport problems. Specifically, we prove that computing the optimal transport distance between two probability measures \(\mu \) and \(\nu \) on the same Euclidean space is \(\#\)P-hard even if only approximate solutions are sought and even if \(\mu \) is the Lebesgue measure on the standard hypercube and \(\nu \) is supported on merely two points.

  ii. We propose a unifying framework for regularizing semi-discrete optimal transport problems by leveraging ideas from distributionally robust optimization and discrete choice theory [4, 57, 105, 111]. Specifically, we perturb the transportation cost to every atom of the discrete measure \(\nu \) with a random disturbance, and we assume that the vector of all disturbances is governed by an uncertain probability distribution from within a Fréchet ambiguity set that prescribes the marginal disturbance distributions. Solving the dual optimal transport problem under the least favorable disturbance distribution in the ambiguity set amounts to smoothing the dual and regularizing the primal objective function. We show that numerous known and new regularization schemes emerge as special cases of this framework, and we derive a priori approximation bounds for the resulting regularized optimal transport problems.

  iii. We derive new convergence guarantees for an averaged stochastic gradient descent (SGD) algorithm that has access only to a biased stochastic gradient oracle. Specifically, we prove that this algorithm enjoys a convergence rate of \(\mathcal O(1/\sqrt{T})\) for Lipschitz continuous and of \(\mathcal O(1/T)\) for generalized self-concordant objective functions. We also show that this algorithm lends itself to solving the smooth dual optimal transport problems obtained from the proposed regularization scheme. When the smoothing is based on a semi-parametric discrete choice model with a Fréchet ambiguity set, the algorithm’s convergence rate depends on the smoothness properties of the marginal noise distributions, and its per-iteration complexity depends on our ability to compute the optimal choice probabilities. We demonstrate that these choice probabilities can indeed be computed efficiently via bisection or sorting, and in special cases they are even available in closed form. As a byproduct, we show that our algorithm can improve the state-of-the-art \(\mathcal O(1/\sqrt{T})\) convergence guarantee of Genevay et al. [64] for the semi-discrete optimal transport problem with an entropic regularizer.

The rest of this paper unfolds as follows. In Sect. 2 we study the computational complexity of semi-discrete optimal transport problems, and in Sect. 3 we develop our unifying regularization scheme. In Sect. 4 we analyze the convergence rate of an averaged SGD algorithm with a biased stochastic gradient oracle that can be used for solving smooth dual optimal transport problems, and in Sect. 5 we compare its empirical convergence behavior against the theoretical convergence guarantees.

Notation. We denote by \(\Vert \cdot \Vert \) the 2-norm, by \([N] = \{1, \ldots , N \}\) the set of all integers up to \(N\in {\mathbb {N}}\) and by \(\Delta ^d = \{\varvec{x} \in {\mathbb {R}}_+^d : \sum _{i = 1}^d x_i =1\}\) the probability simplex in \(\mathbb R^d\). For a logical statement \(\mathcal E\) we define \(\mathbbm {1}_{\mathcal E} = 1\) if \(\mathcal E\) is true and \(\mathbbm {1}_{\mathcal E} = 0\) if \(\mathcal E\) is false. For any closed set \({\mathcal {X}}\subseteq {\mathbb {R}}^d\) we define \({\mathcal {M}}({\mathcal {X}})\) as the family of all Borel measures and \({\mathcal {P}}({\mathcal {X}})\) as its subset of all Borel probability measures on \({\mathcal {X}}\). For \(\mu \in {\mathcal {P}}({\mathcal {X}})\), we denote by \({\mathbb {E}}_{\varvec{x} \sim \mu }[\cdot ]\) the expectation operator under \(\mu \) and define \({\mathcal {L}}({\mathcal {X}}, \mu )\) as the family of all \(\mu \)-integrable functions \(f:{\mathcal {X}}\rightarrow {\mathbb {R}}\), that is, \(f \in {\mathcal {L}}({\mathcal {X}}, \mu )\) if and only if \(\int _{{\mathcal {X}}} |f(\varvec{x})| \mu (\mathrm {d}\varvec{x})<\infty \). The Lipschitz modulus of a function \(f: {\mathbb {R}}^d \rightarrow {\mathbb {R}}\) is defined as \({{\,\mathrm{lip}\,}}(f) = \sup _{\varvec{x}, \varvec{x}'}\{|f(\varvec{x}) - f(\varvec{x}')|/\Vert \varvec{x} - \varvec{x}'\Vert : \varvec{x} \ne \varvec{x}'\}\). The convex conjugate of \(f: {\mathbb {R}}^d \rightarrow [-\infty ,+\infty ]\) is the function \(f^*:{\mathbb {R}}^d\rightarrow [-\infty ,+\infty ]\) defined through \(f^{*}(\varvec{y}) = \sup _{\varvec{x} \in {\mathbb {R}}^d}\varvec{y}^\top \varvec{x} - f(\varvec{x})\).

2 Hardness of computing optimal transport distances

If \({\mathcal {X}}\) and \({\mathcal {Y}}\) are closed subsets of finite-dimensional Euclidean spaces and \(c: {\mathcal {X}}\times {\mathcal {Y}}\rightarrow [0,+\infty ]\) is a lower-semicontinuous cost function, then the Monge-Kantorovich optimal transport distance between two probability measures \(\mu \in \mathcal P({\mathcal {X}})\) and \(\nu \in \mathcal P({\mathcal {Y}})\) is defined as

$$\begin{aligned} W_c(\mu , \nu ) = \min \limits _{\pi \in \Pi (\mu ,\nu )} ~ {\mathbb {E}}_{(\varvec{x}, \varvec{y}) \sim \pi }\left[ {c(\varvec{x}, \varvec{y})}\right] , \end{aligned}$$
(1)

where \(\Pi (\mu ,\nu )\) denotes the family of all couplings of \(\mu \) and \(\nu \), that is, the set of all probability measures on \({\mathcal {X}}\times {\mathcal {Y}}\) with marginals \(\mu \) on \({\mathcal {X}}\) and \(\nu \) on \({\mathcal {Y}}\). One can show that the minimum in (1) is always attained ([168], Theorem 4.1). If \({\mathcal {X}}={\mathcal {Y}}\) is a metric space with metric \(d:{\mathcal {X}}\times {\mathcal {X}}\rightarrow {\mathbb {R}}_+\) and the transportation cost is defined as \(c(\varvec{x}, \varvec{y})=d^p(\varvec{x},\varvec{y})\) for some \(p \ge 1\), then \(W_c(\mu , \nu )^{1/p}\) is termed the p-th Wasserstein distance between \(\mu \) and \(\nu \). The optimal transport problem (1) constitutes an infinite-dimensional linear program over measures and admits a strong dual linear program over functions ([168], Theorem 5.9).

Proposition 2.1

(Kantorovich duality) The optimal transport distance between \(\mu \in {\mathcal {P}}({\mathcal {X}})\) and \(\nu \in {\mathcal {P}}({\mathcal {Y}})\) admits the dual representation

$$\begin{aligned} W_c(\mu , \nu ) =\left\{ \begin{array}{c@{\quad }l@{\qquad }l} \sup &{} \displaystyle {\mathbb {E}}_{\varvec{y} \sim \nu }\left[ {\phi (\varvec{y})}\right] - {\mathbb {E}}_{\varvec{x} \sim \mu }\left[ {\psi (\varvec{x})}\right] &{} \\ \mathrm {s.t.}&{} \psi \in {\mathcal {L}}({\mathcal {X}}, \mu ),~ \phi \in {\mathcal {L}}({\mathcal {Y}}, \nu )&{} \\ &{} \phi (\varvec{y}) - \psi (\varvec{x}) \le c(\varvec{x}, \varvec{y}) \quad \forall \varvec{x} \in {\mathcal {X}},~ \varvec{y} \in {\mathcal {Y}}. \end{array}\right. \end{aligned}$$
(2)

The linear program (2) optimizes over the two Kantorovich potentials \(\psi \in {\mathcal {L}}({\mathcal {X}}, \mu )\) and \(\phi \in {\mathcal {L}}({\mathcal {Y}}, \nu )\), but it can be reformulated as the following non-linear program over a single potential function,

$$\begin{aligned} W_c(\mu , \nu ) =\sup _{\phi \in {\mathcal {L}}({\mathcal {Y}}, \nu )} ~ \displaystyle {\mathbb {E}}_{\varvec{y} \sim \nu }\left[ \phi (\varvec{y})\right] - {\mathbb {E}}_{\varvec{x} \sim \mu }\left[ \phi _c(\varvec{x}) \right] , \end{aligned}$$
(3)

where \(\phi _c:{\mathcal {X}}\rightarrow [-\infty ,+\infty ]\) is called the c-transform of \(\phi \) and is defined through

$$\begin{aligned} \phi _c(\varvec{x}) = \sup _{\varvec{y} \in {\mathcal {Y}}} ~ \phi (\varvec{y}) - c(\varvec{x}, \varvec{y}) \qquad \forall \varvec{x} \in {\mathcal {X}}, \end{aligned}$$
(4)

see Villani ([168], § 5) for details. The Kantorovich duality is the key enabling mechanism to study the computational complexity of the optimal transport problem (1).

Theorem 2.2

(Hardness of computing optimal transport distances) Computing \(W_c(\mu , \nu )\) is #P-hard even if \({\mathcal {X}}={\mathcal {Y}}={\mathbb {R}}^d\), \(c(\varvec{x}, \varvec{y}) = \Vert \varvec{x}-\varvec{y}\Vert ^{p}\) for some \(p\ge 1\), \(\mu \) is the Lebesgue measure on the standard hypercube \([0,1]^d\), and \(\nu \) is a discrete probability measure supported on only two points.

To prove Theorem 2.2, we will show that computing the optimal transport distance \(W_c(\mu , \nu )\) is at least as hard as computing the volume of the knapsack polytope \(P( \varvec{w}, b) = \{\varvec{x}\in [0,1]^d : \varvec{w}^\top \varvec{x}\le b\}\) for a given \(\varvec{w}\in {\mathbb {R}}^d_+\) and \( b \in {\mathbb {R}}_+\), which is known to be \(\#\)P-hard ([52], Theorem 1). Specifically, we will leverage the following variant of this hardness result, which establishes that approximating the volume of the knapsack polytope \(P( \varvec{w}, b)\) to a sufficiently high accuracy is already \(\#\)P-hard.

Lemma 2.3

(Hanasusanto et al. ([70], Lemma 1)) Computing the volume of the knapsack polytope \(P( \varvec{w}, b)\) for a given \(\varvec{w}\in {\mathbb {R}}^d_+\) and \( b \in {\mathbb {R}}_+\) to within an absolute accuracy of \(\delta >0\) is \(\#\)P-hard whenever

$$\begin{aligned} \delta <\frac{1}{ {2d!(\Vert \varvec{w}\Vert _1+2)^d(d+1)^{d+1}\prod _{i = 1}^{d}w_i}}. \end{aligned}$$
(5)

Fix now any knapsack polytope \(P( \varvec{w}, b)\) encoded by \(\varvec{w}\in {\mathbb {R}}_+^d\) and \( b \in {\mathbb {R}}_+\). Without loss of generality, we may assume that \(\varvec{w} \ne \varvec{0}\) and \(b > 0\). Indeed, we are allowed to exclude \(\varvec{w} = \varvec{0} \) because the volume of \(P(\varvec{0}, b) \) is trivially equal to 1. On the other hand, \(b= 0\) can be excluded by applying a suitable rotation and translation, which are volume-preserving transformations. In the remainder, we denote by \(\mu \) the Lebesgue measure on the standard hypercube \([0,1]^d\) and by \({\nu }_ t = t \delta _{\varvec{y}_1} + (1-t) \delta _{\varvec{y}_2}\) a family of discrete probability measures with two atoms at \(\varvec{y}_1=\varvec{0}\) and \(\varvec{y}_2=2b\varvec{w}/ \Vert \varvec{w}\Vert ^2\), respectively, whose probabilities are parameterized by \(t \in [0, 1]\).
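
By construction, a point of the hypercube lies in \(P(\varvec{w}, b)\) exactly when it is at least as close to \(\varvec{y}_1\) as to \(\varvec{y}_2\), which can be verified numerically (a quick Monte Carlo sketch with illustrative \(\varvec{w}\) and b):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
w = rng.random(d)
b = 0.3 * np.sum(w)
y1 = np.zeros(d)
y2 = 2 * b * w / np.dot(w, w)

X = rng.random((100_000, d))   # samples from the Lebesgue measure on [0,1]^d
in_polytope = X @ w <= b
closer_to_y1 = np.linalg.norm(X - y1, axis=1) <= np.linalg.norm(X - y2, axis=1)
print("agreement:", np.mean(in_polytope == closer_to_y1))  # ≈ 1.0
print("Vol(P(w, b)) ≈", in_polytope.mean())                # Monte Carlo volume
```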

The following preparatory lemma relates the volume of \(P( \varvec{{w}},b)\) to the optimal transport problem (1) and is thus instrumental for the proof of Theorem 2.2.

Lemma 2.4

If \(c(\varvec{x}, \varvec{y})=\Vert \varvec{x}- \varvec{y} \Vert ^p\) for some \(p\ge 1\), then we have \({\mathrm{Vol}}(P( \varvec{{w}},b)) = {{\,\mathrm{argmin}\,}}_{ t \in [0,1]} W_c(\mu , {\nu }_ t )\).

Proof

By the definition of the optimal transport distance in (1) and our choice of \(c(\varvec{x}, \varvec{y})\), we have

$$\begin{aligned}&\underset{ t \in [0,1]}{\min }W_c(\mu , {\nu }_ t )\\&\quad = \underset{ t \in [0,1]}{\min } ~ \min \limits _{\pi \in \Pi (\mu ,\nu _t)} ~ {\mathbb {E}}_{(\varvec{x}, \varvec{y})\sim \pi }\left[ \Vert \varvec{x}- \varvec{y} \Vert ^p \right] \\&\quad =\min \limits _{ t \in [0,1]}~ \left\{ \begin{array}{cl} \min \limits _{q_1, q_2 \in {\mathcal {P}}({\mathbb {R}}^d)}&{} t \displaystyle \int _{{\mathbb {R}}^d} \Vert \varvec{x}-\varvec{y}_1\Vert ^p q_1(\mathrm {d}\varvec{x}) + (1-t) \displaystyle \int _{{\mathbb {R}}^d}\left\| \varvec{x}-\varvec{y}_2 \right\| ^p q_2(\mathrm {d}\varvec{x})\\ \text {s.t.} &{} t \cdot q_1 + (1-t) \cdot q_2 = \mu , \end{array}\right. \end{aligned}$$

where the second equality holds because any coupling \(\pi \) of \(\mu \) and \(\nu _t\) can be constructed from the marginal probability measure \(\nu _t\) of \(\varvec{y}\) and the probability measures \(q_1\) and \(q_2\) of \(\varvec{x}\) conditional on \(\varvec{y} =\varvec{y}_1\) and \(\varvec{y} = \varvec{y}_2\), respectively, that is, we may write \(\pi = t\cdot q_1\otimes \delta _{\varvec{y}_1} + (1-t)\cdot q_2\otimes \delta _{\varvec{y}_2}\). The constraint of the inner minimization problem ensures that the marginal probability measure of \(\varvec{x}\) under \(\pi \) coincides with \(\mu \). By applying the variable transformations \(q_1\leftarrow t \cdot q_1 \) and \(q_2 \leftarrow (1-t)\cdot q_2\) to eliminate all bilinear terms, we then obtain

$$\begin{aligned} \underset{ t \in [0,1]}{\min }W_c(\mu , {\nu }_ t )=\left\{ \begin{array}{cll} \underset{\begin{array}{c} t \in [0,1] \\ q_1, q_2 \in {\mathcal {M}}({\mathbb {R}}^d) \end{array}}{\min } &{}\displaystyle \int _{{\mathbb {R}}^d} \Vert \varvec{x} -\varvec{y}_1\Vert ^p q_1(\mathrm {d}\varvec{x}) + \displaystyle \int _{{\mathbb {R}}^d} \left\| \varvec{x}-\varvec{y}_2 \right\| ^p q_2(\mathrm {d}\varvec{x})\\ \text {s.t.} &{}\displaystyle \int _{{\mathbb {R}}^d} q_1(\mathrm {d}\varvec{x}) = t \\ &{}\displaystyle \int _{{\mathbb {R}}^d} q_2(\mathrm {d}\varvec{x}) = 1- t \\ &{} q_1 + q_2 = \mu . \end{array}\right. \end{aligned}$$

Observe next that the decision variable t and the two normalization constraints can be eliminated without affecting the optimal value of the resulting infinite-dimensional linear program because the Borel measures \(q_1\) and \(q_2\) are non-negative and because the constraint \(q_1+q_2=\mu \) implies that \(q_1({\mathbb {R}}^d)+q_2({\mathbb {R}}^d)=\mu ({\mathbb {R}}^d)=1\). Thus, there always exists \(t\in [0,1]\) such that \(q_1({\mathbb {R}}^d)=t\) and \(q_2({\mathbb {R}}^d)=1-t\). This reasoning implies that

$$\begin{aligned} \underset{ t \in [0,1]}{\min }W_c(\mu , {\nu }_ t )=\left\{ \begin{array}{cl} \min \limits _{q_1,q_2\in {\mathcal {M}}({\mathbb {R}}^d)} &{} \displaystyle \int _{{\mathbb {R}}^d} \Vert \varvec{x} -\varvec{y}_1\Vert ^p q_1(\mathrm {d}\varvec{x}) + \displaystyle \int _{{\mathbb {R}}^d}\left\| \varvec{x}-\varvec{y}_2 \right\| ^p q_2(\mathrm {d}\varvec{x}) \\ \text {s.t.} &{} q_1 + q_2= \mu . \end{array}\right. \end{aligned}$$

The constraint \(q_1+q_2=\mu \) also implies that \(q_1\) and \(q_2\) are absolutely continuous with respect to \(\mu \), and thus

$$\begin{aligned} \underset{ t \in [0,1]}{\min }W_c(\mu , {\nu }_ t )&=\left\{ \begin{array}{ccll} &{}\min \limits _{q_1,q_2\in {\mathcal {M}}({\mathbb {R}}^d)}\; &{} \displaystyle \int _{{\mathbb {R}}^d} \Vert \varvec{x} \!-\!\varvec{y}_1\Vert ^p \frac{\mathrm {d}q_1}{\mathrm {d}\mu }(\varvec{x}) \!+\! \left\| \varvec{x} \!-\! \varvec{y}_2 \right\| ^p \, \frac{\mathrm {d}q_2}{\mathrm {d}\mu }(\varvec{x})\, \mu (\mathrm {d}\varvec{x}) \\ &{} \text {s.t.} &{} \displaystyle \frac{\mathrm {d}q_1}{\mathrm {d}\mu }(\varvec{x}) + \frac{\mathrm {d}q_2}{\mathrm {d}\mu }(\varvec{x})= 1 \quad \forall \varvec{x}\in [0,1]^d \end{array}\right. \nonumber \\&= \int _{{\mathbb {R}}^d} \min \left\{ \Vert \varvec{x} -\varvec{y}_1 \Vert ^p,\left\| \varvec{x} - \varvec{y}_2 \right\| ^p \right\} \,\mu (\mathrm {d}\varvec{x}), \end{aligned}$$
(6)

where the second equality holds because at optimality the Radon-Nikodym derivatives must satisfy

$$\begin{aligned} \frac{\mathrm {d}q_i}{\mathrm {d}\mu }(\varvec{x})=\left\{ \begin{array}{cl} 1 &{} \text {if } \Vert \varvec{x}-\varvec{y}_i\Vert ^p \le \Vert \varvec{x}-\varvec{y}_{3-i}\Vert ^p \\ 0 &{} \text {otherwise} \end{array} \right. \end{aligned}$$

for \(\mu \)-almost every \(\varvec{x}\in {\mathbb {R}}^d\) and for every \(i=1,2\).

In the second part of the proof we will demonstrate that the minimization problem \(\min _{t\in [0,1]} W_c(\mu , \nu _ t )\) is solved by \(t^\star =\text {Vol}(P(\varvec{w}, b))\). By Proposition 2.1 and the definition of the c-transform, we first note that

$$\begin{aligned} W_c(\mu , \nu _ {t^\star } )&=\underset{\phi \in {\mathcal {L}}({\mathbb {R}}^d, \nu _{t^\star })}{\max } ~ {\mathbb {E}}_{\varvec{y}\sim \nu _{t^\star }}[\phi (\varvec{y})] - {\mathbb {E}}_{\varvec{x}\sim \mu }[\phi _c(\varvec{x})] \nonumber \\&= \underset{\varvec{\phi }\in {\mathbb {R}}^2}{\max } ~ t^\star \cdot \phi _1 + (1- t^\star ) \cdot \phi _2- \int _{{\mathbb {R}}^d}\max _{i=1,2}\left\{ \phi _i- \Vert \varvec{x}-\varvec{y}_i \Vert ^p\right\} \mu (\mathrm {d}\varvec{x})\nonumber \\&= \max \limits _{\varvec{\phi }\in {\mathbb {R}}^2} ~ t^\star \cdot \phi _1 + (1-t^\star )\cdot \phi _2- \sum \limits _{i = 1}^2 \int _{{\mathcal {X}}_i(\varvec{\phi })}(\phi _i - \Vert \varvec{x} - \varvec{y_i}\Vert ^p)\,\mu (\mathrm {d}\varvec{x}), \end{aligned}$$
(7)

where

$$\begin{aligned} {\mathcal {X}}_i(\varvec{\phi }) = \{\varvec{x}\in {\mathbb {R}}^d: \phi _i - \Vert \varvec{x}-\varvec{y}_i \Vert ^p \ge \phi _{3-i} - \left\| \varvec{x} - \varvec{y}_{3-i} \right\| ^p\}\quad \forall i=1,2. \end{aligned}$$

The second equality in (7) follows from the construction of \(\nu _{t^\star }\) as a probability measure with only two atoms at the points \(\varvec{y}_i\) for \(i=1,2\). Indeed, by fixing the corresponding function values \(\phi _i=\phi (\varvec{y}_i)\) for \(i=1,2\), the expectation \({\mathbb {E}}_{\varvec{y} \sim \nu _{t^\star }}[\phi (\varvec{y})]\) simplifies to \(t^\star \cdot \phi _1 + (1-t^\star )\cdot \phi _2\), while the negative expectation \(-{\mathbb {E}}_{\varvec{x} \sim \mu }[\phi _c(\varvec{x})]\) is maximized by setting \(\phi (\varvec{y})\) to a large negative constant for all \(\varvec{y}\notin \{\varvec{y}_1,\varvec{y}_2\}\), which implies that

$$\begin{aligned} \phi _c(\varvec{x}) = \sup _{\varvec{y} \in {\mathbb {R}}^d} \phi (\varvec{y}) - \Vert \varvec{x} - \varvec{y}\Vert ^p = \max _{i=1,2}\left\{ \phi _i- \Vert \varvec{x}-\varvec{y}_i \Vert ^p\right\} \quad \forall \varvec{x}\in [0,1]^d. \end{aligned}$$

Next, we will prove that any \(\varvec{\phi }^\star \in {\mathbb {R}}^2\) with \(\phi ^\star _1=\phi ^\star _2\) attains the maximum of the unconstrained convex optimization problem on the last line of (7). To see this, note that

$$\begin{aligned} \nabla _{\varvec{\phi }} \left[ \sum \limits _{i = 1}^2 \int _{{\mathcal {X}}_i(\varvec{\phi })}(\phi _i - \Vert \varvec{x} - \varvec{y}_i\Vert ^p)\,\mu (\mathrm {d}\varvec{x})\right]= & {} \sum \limits _{i = 1}^2 \int _{{\mathcal {X}}_i(\varvec{\phi })} \nabla _{\varvec{\phi }}(\phi _i - \Vert \varvec{x} - \varvec{y}_i\Vert ^p)\,\mu (\mathrm {d}\varvec{x})\\= & {} \begin{bmatrix} \mu ({\mathcal {X}}_1(\varvec{\phi }))\\ \mu ({\mathcal {X}}_2(\varvec{\phi })) \end{bmatrix} \end{aligned}$$

by virtue of the Reynolds theorem. Thus, the first-order optimality condition \(t^\star =\mu ({\mathcal {X}}_1(\varvec{\phi }))\) is necessary and sufficient for global optimality. Fix now any \(\varvec{\phi }^\star \in {\mathbb {R}}^2\) with \(\phi ^\star _1=\phi ^\star _2\) and observe that

$$\begin{aligned} t^\star =\text {Vol}(P(\varvec{w}, b)) =&\mu \left( \left\{ \varvec{x}\in {\mathbb {R}}^d: \varvec{w}^\top \varvec{x}\le b \right\} \right) \\ =&\mu \left( \left\{ \varvec{x}\in {\mathbb {R}}^d: \Vert \varvec{x} \Vert ^2\le \Vert \varvec{x}-2b \varvec{w}/\Vert \varvec{w}\Vert ^2\Vert ^2 \right\} \right) \\ =&\mu \left( \left\{ \varvec{x}\in {\mathbb {R}}^d: \Vert \varvec{x} -\varvec{y}_1\Vert ^p\le \Vert \varvec{x}-\varvec{y}_2\Vert ^p \right\} \right) =\mu ({\mathcal {X}}_1(\varvec{\phi }^\star )), \end{aligned}$$

where the first and second equalities follow from the definitions of \(t^\star \) and the knapsack polytope \(P(\varvec{w}, b)\), respectively, the third equality holds because \(\varvec{w}^\top \varvec{x}\le b\) if and only if \(\Vert \varvec{x}\Vert ^2\le \Vert \varvec{x}-2b\varvec{w}/\Vert \varvec{w}\Vert ^2\Vert ^2\) (expand the square and use \(b>0\)), the fourth equality holds because \(\varvec{y}_1=\varvec{0}\) and \(\varvec{y}_2=2b\varvec{w}/\Vert \varvec{w}\Vert ^2\), and the fifth equality follows from the definition of \({\mathcal {X}}_1(\varvec{\phi }^\star )\) and our assumption that \(\phi ^\star _1=\phi ^\star _2\). This reasoning implies that \(\varvec{\phi }^\star \) indeed attains the maximum of the optimization problem on the last line of (7). Hence, we find

$$\begin{aligned} W_c(\mu , \nu _ {t^\star } )&= t^\star \cdot \phi ^\star _1 + (1-t^\star )\cdot \phi ^\star _2- \sum \limits _{i = 1}^2 \int _{{\mathcal {X}}_i(\varvec{\phi }^\star )}(\phi ^\star _i - \Vert \varvec{x} - \varvec{y_i}\Vert ^p)\,\mu (\mathrm {d}\varvec{x})\\&= \sum \limits _{i = 1}^2 \int _{{\mathcal {X}}_i(\varvec{\phi }^\star )} \Vert \varvec{x} - \varvec{y_i}\Vert ^p \,\mu (\mathrm {d}\varvec{x}) = \int _{{\mathbb {R}}^d} \min _{i=1,2}\left\{ \Vert \varvec{x} -\varvec{y}_i \Vert ^p\right\} \,\mu (\mathrm {d}\varvec{x})\\&=\underset{ t \in [0,1]}{\min }W_c(\mu , {\nu }_ t ), \end{aligned}$$

where the second equality holds because \(\phi ^\star _1=\phi ^\star _2\), the third equality exploits the definition of \({\mathcal {X}}_1(\varvec{\phi }^\star )\), and the fourth equality follows from (6). We may therefore conclude that \(t^\star =\text {Vol}(P(\varvec{w}, b))\) indeed solves the minimization problem \(\min _{t\in [0,1]} W_c(\mu , \nu _ t )\). Using similar techniques, one can further prove that \(\partial _t W_c(\mu , \nu _t)\) exists and is strictly increasing in t, which ensures that \(W_c(\mu , \nu _t)\) is strictly convex in t and, in particular, that \(t^\star \) is the unique solution of \(\min _{t\in [0,1]} W_c(\mu , \nu _ t )\). Details are omitted for brevity. \(\square \)
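
The closed-form expression (6) also suggests a direct Monte Carlo check of Lemma 2.4 (a sketch with \(p = 2\) and the same kind of placeholder data as above):

```python
import numpy as np

rng = np.random.default_rng(1)
d, p = 4, 2
w = rng.random(d); b = 0.3 * np.sum(w)
y1 = np.zeros(d); y2 = 2 * b * w / np.dot(w, w)

X = rng.random((200_000, d))   # samples from the Lebesgue measure on [0,1]^d
d1 = np.linalg.norm(X - y1, axis=1) ** p
d2 = np.linalg.norm(X - y2, axis=1) ** p
print("min_t W_c(mu, nu_t) ≈", np.minimum(d1, d2).mean())  # right-hand side of (6)
print("argmin t* ≈", np.mean(d1 <= d2))                    # = Vol(P(w, b))
```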

Proof of Theorem 2.2

Lemma 2.4 applies under the assumptions of the theorem, and therefore the volume of the knapsack polytope \(P(\varvec{w}, b)\) coincides with the unique minimizer of

$$\begin{aligned} \min _{ t \in [0,1]} W_c(\mu , {\nu }_ t ). \end{aligned}$$
(8)

From the proof of Lemma 2.4 we know that the Wasserstein distance \(W_c(\mu ,{\nu }_ t )\) is strictly convex in t, which implies that the minimization problem (8) constitutes a one-dimensional convex program with a unique minimizer. A near-optimal solution that approximates the exact minimizer to within an absolute accuracy \(\delta =(6d!(\Vert \varvec{w}\Vert _1+2)^d(d+1)^{d+1}\prod _{i = 1}^{d}w_i)^{-1}\) can readily be computed with a binary search method such as Algorithm 3 described in Lemma A.1 (i), which evaluates \(g(t)=W_c(\mu ,\nu _t)\) at exactly \(2L=2({\lceil }{\log _2(1/\delta )}{\rceil } + 1)\) test points. Note that \(\delta \) falls within the interval (0, 1) and satisfies the strict inequality (5). Note also that L grows only polynomially with the bit length of \(\varvec{w}\) and b; see Appendix B for details. One readily verifies that all operations in Algorithm 3 except for the computation of \(W_c(\mu , \nu _t)\) can be carried out in time polynomial in the bit length of \(\varvec{w}\) and b. Thus, if we could compute \(W_c(\mu , \nu _t)\) in time polynomial in the bit length of \(\varvec{w}\), b and t, then we could efficiently compute the volume of the knapsack polytope \(P( \varvec{w}, b)\) to within accuracy \(\delta \), which is \(\#\)P-hard by Lemma 2.3. We have thus constructed a polynomial-time Turing reduction from the \(\#\)P-hard problem of (approximately) computing the volume of a knapsack polytope to computing the Wasserstein distance \(W_c(\mu , {\nu }_ t )\). By the definition of the class of \(\#\)P-hard problems (see, e.g., ([167], Definition 1)), we may thus conclude that computing \(W_c(\mu , \nu _t)\) is \(\#\)P-hard. \(\square \)
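
Algorithm 3 itself is relegated to the appendix and not reproduced here, but the mechanism is that of a standard search for the minimizer of a one-dimensional convex function using two function evaluations per iteration; the following generic sketch (which may differ from Algorithm 3 in its details) illustrates the idea:

```python
def convex_min_search(g, lo=0.0, hi=1.0, L=30):
    """Localize the minimizer of a convex g on [lo, hi] with 2*L evaluations."""
    for _ in range(L):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if g(m1) <= g(m2):
            hi = m2   # by convexity, the minimizer cannot lie in (m2, hi]
        else:
            lo = m1   # by convexity, the minimizer cannot lie in [lo, m1)
    return (lo + hi) / 2

# Stand-in for t -> W_c(mu, nu_t): strictly convex with minimizer 0.3.
print(convex_min_search(lambda t: (t - 0.3) ** 2))
```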

Corollary 2.5

(Hardness of computing approximate optimal transport distances) Computing \(W_c(\mu , \nu )\) to within an absolute accuracy of

$$\begin{aligned} \varepsilon =\frac{1}{4} \min \limits _{l\in [ 2^L]} \left\{ |W_c(\mu , \nu _{t_{l}}) - W_c(\mu , \nu _{t_{l-1}})| : W_c(\mu , \nu _{t_{l}}) \ne W_c(\mu , \nu _{t_{l-1}})\right\} , \end{aligned}$$

where \(L = {\lceil }{\log _2(1/ \delta )}{\rceil } + 1\), \(\delta = (6 d!(\Vert \varvec{w}\Vert _1+2)^d(d+1)^{d+1}\prod _{i = 1}^{d}w_i)^{-1} \) and \(t_l = l/ 2^{L}\) for all \(l =0, \ldots , 2^L\), is #P-hard even if \({\mathcal {X}}={\mathcal {Y}}={\mathbb {R}}^d\), \(c(\varvec{x}, \varvec{y}) = \Vert \varvec{x}-\varvec{y}\Vert ^{p}\) for some \(p\ge 1\), \(\mu \) is the Lebesgue measure on the standard hypercube \([0,1]^d\), and \(\nu \) is a discrete probability measure supported on only two points.

Proof

Assume that we have access to an inexact oracle that outputs, for any fixed \(t\in [0,1]\), an approximate optimal transport distance \({{\widetilde{W}}}_c(\mu , \nu _t)\) with \(|{{\widetilde{W}}}_c(\mu , \nu _t) - W_c(\mu , \nu _t) |\le \varepsilon \). By Lemma A.1 (ii), which applies thanks to the definition of \(\varepsilon \), we can then find a \(2\delta \)-approximation for the unique minimizer of (8) using 2L oracle calls. Note that \(\delta '=2\delta \) falls within the interval (0, 1) and satisfies the strict inequality (5). Recall also that L grows only polynomially with the bit length of \(\varvec{w}\) and b; see Appendix B for details. Thus, if we could compute \({{\widetilde{W}}}_c(\mu , \nu _t)\) in time polynomial in the bit length of \(\varvec{w}\), b and t, then we could efficiently compute the volume of the knapsack polytope \(P( \varvec{w}, b)\) to within accuracy \(\delta '\), which is \(\#\)P-hard by Lemma 2.3. Computing \(W_c(\mu , \nu )\) to within an absolute accuracy of \(\varepsilon \) is therefore also \(\#\)P-hard. \(\square \)

The hardness of optimal transport established in Theorem 2.2 and Corollary 2.5 is predicated on the hardness of numerical integration. A popular technique to reduce the complexity of numerical integration is smoothing, whereby an initial (possibly discontinuous) integrand is approximated with a differentiable one [48]. Smoothness is also a desired property of objective functions when designing scalable optimization algorithms [28]. These observations prompt us to develop a systematic way to smooth the optimal transport problem that leads to efficient approximate numerical solution schemes.

3 Smooth optimal transport

The semi-discrete optimal transport problem evaluates the optimal transport distance (1) between an arbitrary probability measure \(\mu \) supported on \({\mathcal {X}}\) and a discrete probability measure \(\nu = \sum _{i=1}^N {\nu }_i\delta _{\varvec{y_i}}\) with atoms \(\varvec{y}_1,\ldots , \varvec{y}_N \in {\mathcal {Y}}\) and corresponding probabilities \(\varvec{\nu }=(\nu _1,\ldots , \nu _N)\in \Delta ^N\) for some \(N\ge 2\). In the following, we define the discrete c-transform \(\psi _c:{\mathbb {R}}^N\times {\mathcal {X}}\rightarrow [-\infty ,+\infty )\) of \(\varvec{\phi }\in {\mathbb {R}}^N\) through

$$\begin{aligned} \psi _c(\varvec{\phi }, \varvec{x}) = \max \limits _{i \in [N]} \phi _i - c(\varvec{x}, \varvec{y}_i) \quad \forall \varvec{x} \in {\mathcal {X}}. \end{aligned}$$
(9)

Armed with the discrete c-transform, we can now reformulate the semi-discrete optimal transport problem as a finite-dimensional maximization problem over a single dual potential vector.

Lemma 3.1

(Discrete c-transform) The semi-discrete optimal transport problem is equivalent to

$$\begin{aligned} W_c(\mu , \nu ) = \sup _{ \varvec{\phi } \in {\mathbb {R}}^N} \varvec{\nu }^\top \varvec{\phi } - {\mathbb {E}}_{\varvec{x} \sim \mu }[{\psi _c(\varvec{\phi }, \varvec{x}) } ]. \end{aligned}$$
(10)

Proof

As \(\nu = \sum _{i=1}^N {\nu }_i\delta _{\varvec{y_i}}\) is discrete, the dual optimal transport problem (3) simplifies to

$$\begin{aligned} W_c(\mu , \nu )&=\sup _{\varvec{\phi }\in {\mathbb {R}}^N} \sup _{\phi \in {\mathcal {L}}({\mathcal {Y}}, \nu )} \left\{ \varvec{\nu }^\top \varvec{\phi }- {\mathbb {E}}_{\varvec{x} \sim \mu }\left[ \phi _c(\varvec{x}) \right] \;:\;\phi (\varvec{y}_i)=\phi _i~\forall i\in [N] \right\} \\&=\sup _{\varvec{\phi }\in {\mathbb {R}}^N}~ \varvec{\nu }^\top \varvec{\phi }- \inf _{\phi \in {\mathcal {L}}({\mathcal {Y}}, \nu )} \Big \{ {\mathbb {E}}_{\varvec{x} \sim \mu }\left[ \phi _c(\varvec{x}) \right] \;:\;\phi (\varvec{y}_i)=\phi _i~\forall i\in [N] \Big \} . \end{aligned}$$

Using the definition of the standard c-transform, we can then recast the inner minimization problem as

$$\begin{aligned}&\inf _{\phi \in {\mathcal {L}}({\mathcal {Y}}, \nu )} \left\{ {\mathbb {E}}_{\varvec{x} \sim \mu }\left[ \sup _{\varvec{y} \in {\mathcal {Y}}} \phi (\varvec{y}) - c(\varvec{x}, \varvec{y}) \right] \;:\;\phi (\varvec{y}_i)=\phi _i~\forall i\in [N] \right\} \\&\quad = ~{\mathbb {E}}_{\varvec{x} \sim \mu } \left[ \max _{i \in [N]}\left\{ \phi _i- c(\varvec{x}, \varvec{y}_i)\right\} \right] ~=~ {\mathbb {E}}_{\varvec{x} \sim \mu } \left[ {\psi _c(\varvec{\phi }, \varvec{x}) } \right] , \end{aligned}$$

where the first equality follows from setting \(\phi (\varvec{y})={{\underline{\phi }}}\) for all \(\varvec{y}\notin \{\varvec{y}_1, \ldots , \varvec{y}_N\}\) and letting \({{\underline{\phi }}}\) tend to \(-\infty \), while the second equality exploits the definition of the discrete c-transform. Thus, (10) follows. \(\square \)
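
For intuition, the dual representation (10) can be estimated directly from samples of \(\mu \): for any fixed \(\varvec{\phi }\), the sample average below is a Monte Carlo estimate of a lower bound on \(W_c(\mu , \nu )\). A sketch with quadratic cost and placeholder data:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5
Y = rng.random((N, 2))               # atoms of nu
nu = np.full(N, 1.0 / N)             # their probabilities
phi = rng.standard_normal(N)         # an arbitrary dual potential

def psi_c(phi, x):
    """Discrete c-transform (9) with c(x, y) = ||x - y||^2."""
    return np.max(phi - np.sum((x - Y) ** 2, axis=1))

X = rng.random((100_000, 2))         # samples from mu = Unif([0,1]^2)
dual_value = nu @ phi - np.mean([psi_c(phi, x) for x in X])
print(dual_value)                    # estimates a lower bound on W_c(mu, nu)
```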

The discrete c-transform (9) can be viewed as the optimal value of a discrete choice model, where a utility-maximizing agent selects one of N mutually exclusive alternatives with utilities \(\phi _i - c(\varvec{x}, \varvec{y}_i)\), \(i\in [N]\), respectively. Discrete choice models are routinely used for explaining the preferences of travelers selecting among different modes of transportation [16], but they are also used for modeling the choice of residential location [100], the interests of end-users in engineering design [170] or the propensity of consumers to adopt new technologies [69].

In practice, the preferences of decision-makers and the attributes of the different choice alternatives are invariably subject to uncertainty, and it is impossible to specify a discrete choice model that reliably predicts the behavior of multiple individuals. Psychological theory thus models the utilities as random variables [164], in which case the optimal choice becomes random, too. The theory as well as the econometric analysis of probabilistic discrete choice models were pioneered by McFadden [99].

The availability of a wealth of elegant theoretical results in discrete choice theory prompts us to add a random noise term to each deterministic utility value \(\phi _i - c(\varvec{x}, \varvec{y}_i)\) in (9). We will argue below that the expected value of the resulting maximal utility with respect to the noise distribution provides a smooth approximation for the c-transform \(\psi _c(\varvec{\phi }, \varvec{x})\), which in turn leads to a smooth optimal transport problem that displays favorable numerical properties. For a comprehensive survey of additive random utility models in discrete choice theory we refer to Dubin and McFadden [49] and Daganzo [40]. Generalized semi-parametric discrete choice models where the noise distribution is itself subject to uncertainty are studied by Natarajan et al. [111]. Using techniques from modern distributionally robust optimization, these models evaluate the best-case (maximum) expected utility across an ambiguity set of multivariate noise distributions. Semi-parametric discrete choice models are studied in the context of appointment scheduling [97], traffic management [3] and product line pricing [91].

We now define the smooth (discrete) c-transform as a best-case expected utility of the type studied in semi-parametric discrete choice theory, that is,

$$\begin{aligned} {\overline{\psi }}_c(\varvec{\phi }, \varvec{x}) = \sup _{\theta \in \Theta }\;{\mathbb {E}}_{\varvec{z} \sim \theta }\left[ \max _{i \in [N]} \phi _i -c(\varvec{x}, \varvec{y_i}) +z_i \right] , \end{aligned}$$
(11)

where \(\varvec{z}\) represents a random vector of perturbations that are independent of \(\varvec{x}\) and \(\varvec{y}\). Specifically, we assume that \(\varvec{z}\) is governed by a Borel probability measure \(\theta \) from within some ambiguity set \(\Theta \subseteq {\mathcal {P}}({\mathbb {R}}^N)\). Note that if \(\Theta \) is a singleton that contains only the Dirac measure at the origin of \({\mathbb {R}}^N\), then the smooth c-transform collapses to the ordinary c-transform defined in (9), which is piecewise affine and thus non-smooth in \(\varvec{\phi }\). For many commonly used ambiguity sets, however, we will show below that the smooth c-transform is indeed differentiable in \(\varvec{\phi }\). In practice, the additive noise \(z_i\) in the transportation cost could originate, for example, from uncertainty about the position \(\varvec{y}_i\) of the i-th atom of the discrete distribution \(\nu \). This interpretation is justified if \(c(\varvec{x},\varvec{y})\) is approximately affine in \(\varvec{y}\) around the atoms \(\varvec{y}_i\), \(i\in [N]\). The smooth c-transform gives rise to the following smooth (semi-discrete) optimal transport problem in dual form.

$$\begin{aligned} {{\overline{W}}}_c (\mu , \nu ) = \sup \limits _{\varvec{\phi } \in {\mathbb {R}}^N} {\mathbb {E}}_{\varvec{x} \sim \mu } \left[ \varvec{\nu }^\top \varvec{\phi }- {{\overline{\psi }}}_c(\varvec{\phi }, \varvec{x})\right] \end{aligned}$$
(12)
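
Before turning to the primal representation, we record a concrete special case of (11)–(12): if \(\Theta \) is a singleton containing only the distribution of i.i.d. mean-zero Gumbel noise with scale parameter \(\eta \), then the classical logit formula from discrete choice theory expresses the smooth c-transform as a log-sum-exp, and the objective of (12) can again be estimated from samples of \(\mu \) (a sketch with placeholder data):

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)
N, eta = 5, 0.1
Y = rng.random((N, 2))               # atoms of nu
nu = np.full(N, 1.0 / N)             # their probabilities

def smooth_psi_c(phi, x):
    """Smooth c-transform (11) under i.i.d. mean-zero Gumbel(eta) noise."""
    return eta * logsumexp((phi - np.sum((x - Y) ** 2, axis=1)) / eta)

phi = rng.standard_normal(N)
X = rng.random((50_000, 2))          # samples from mu = Unif([0,1]^2)
print(nu @ phi - np.mean([smooth_psi_c(phi, x) for x in X]))
```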

Note that (12) is indeed obtained from the original dual optimal transport problem (10) by replacing the original c-transform \(\psi _c(\varvec{\phi }, \varvec{x})\) with the smooth c-transform \({{\overline{\psi }}}_c(\varvec{\phi }, \varvec{x})\). As smooth functions are susceptible to efficient numerical integration, we expect that (12) is easier to solve than (10). A key insight of this work is that the smooth dual optimal transport problem (12) typically has a primal representation of the form

$$\begin{aligned} \min \limits _{\pi \in \Pi (\mu ,\nu )}\mathbb E_{(\varvec{x}, \varvec{y}) \sim \pi }\left[ c(\varvec{x}, \varvec{y})\right] + R_\Theta (\pi ), \end{aligned}$$
(13)

where \(R_\Theta (\pi )\) can be viewed as a regularization term that penalizes the complexity of the transportation plan \(\pi \). In the remainder of this section we will prove (13) and derive \(R_\Theta (\pi )\) for different ambiguity sets \(\Theta \). We will see that this regularization term is often related to an f-divergence, where \(f:{\mathbb {R}}_+ \rightarrow {\mathbb {R}}\cup \{\infty \}\) constitutes a lower-semicontinuous convex function with \(f(1) = 0\). If \(\tau \) and \(\rho \) are two Borel probability measures on a closed subset \(\mathcal Z\) of a finite-dimensional Euclidean space, and if \(\tau \) is absolutely continuous with respect to \(\rho \), then the continuous f-divergence from \(\tau \) to \(\rho \) is defined as \(D_f(\tau \parallel \rho ) = \int _{\mathcal Z} f({\mathrm {d}\tau }/{\mathrm {d}\rho }(\varvec{z})) \rho (\mathrm {d}\varvec{z})\), where \({\mathrm {d}\tau }/{\mathrm {d}\rho }\) stands for the Radon-Nikodym derivative of \(\tau \) with respect to \(\rho \). By slight abuse of notation, if \(\varvec{\tau }\) and \(\varvec{\rho }\) are two probability vectors in \(\Delta ^N\) and if \(\varvec{\rho }>\varvec{0}\), then the discrete f-divergence from \(\varvec{\tau }\) to \(\varvec{\rho }\) is defined as \(D_f(\varvec{\tau }\parallel \varvec{\rho }) = \sum _{i =1}^N f({\tau _i}/{\rho _i}) \rho _i\). The correct interpretation of \(D_f\) is usually clear from the context.
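
Both variants of the f-divergence are direct to implement; for instance, with \(f(t) = t\log t\) (the Kullback-Leibler case, using the convention \(0 \log 0 = 0\)), a short sketch:

```python
import numpy as np

def f_divergence(tau, rho, f):
    """Discrete f-divergence D_f(tau || rho) = sum_i f(tau_i / rho_i) * rho_i."""
    return np.sum(f(tau / rho) * rho)

def f_kl(t):
    t = np.asarray(t, dtype=float)
    return np.where(t > 0, t * np.log(np.where(t > 0, t, 1.0)), 0.0)

tau = np.array([0.5, 0.3, 0.2])
rho = np.full(3, 1.0 / 3)
print(f_divergence(tau, rho, f_kl))  # KL divergence from tau to rho
```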

The following lemma shows that the smooth optimal transport problem (13) equipped with an f-divergence regularization term is equivalent to a finite-dimensional convex minimization problem. This result will be instrumental to prove the equivalence of (12) and (13) for different ambiguity sets \(\Theta \).

Lemma 3.2

(Strong duality) If \(\varvec{\eta }\in \Delta ^N\) with \(\varvec{\eta }>\varvec{0}\) and \(\eta = \sum _{i=1}^N \eta _i \delta _{\varvec{y}_i}\) is a discrete probability measure on \({\mathcal {Y}}\), then problem (13) with regularization term \(R_\Theta (\pi ) = D_{f}(\pi \Vert \mu \otimes \eta )\) is equivalent to

$$\begin{aligned} \sup \limits _{ \varvec{\phi }\in {\mathbb {R}}^N} ~ {\mathbb {E}}_{\varvec{x} \sim \mu }\left[ \min \limits _{\varvec{p}\in \Delta ^N} \sum \limits _{i=1}^N{\phi _i\nu _i}- (\phi _i - c(\varvec{x}, \varvec{y_i}))p_i + D_f(\varvec{p} \parallel \varvec{\eta }) \right] . \end{aligned}$$
(14)

Proof of Lemma 3.2

If \({\mathbb {E}}_{\varvec{x} \sim \mu }[c(\varvec{x},\varvec{y}_i)]=\infty \) for some \(i\in [N]\), then both (13) and (14) evaluate to infinity, and the claim holds trivially. In the remainder of the proof we may thus assume without loss of generality that \({\mathbb {E}}_{\varvec{x} \sim \mu }[c(\varvec{x},\varvec{y}_i)]<\infty \) for all \(i\in [N]\). Using ([138], Theorem 14.6) to interchange the minimization over \(\varvec{p}\) with the expectation over \(\varvec{x}\), problem (14) can first be reformulated as

$$\begin{aligned} \begin{array}{cll} \sup \limits _{ \varvec{\phi }\in {\mathbb {R}}^N} &{} \min \limits _{\varvec{p}\in \mathcal L_\infty ^N({\mathcal {X}},\mu )} ~ &{}{\mathbb {E}}_{\varvec{x} \sim \mu }\left[ \displaystyle \sum \limits _{i=1}^N{\phi _i\nu _i} - (\phi _i - c(\varvec{x}, \varvec{y_i}))p_i(\varvec{x})+ D_f(\varvec{p}(\varvec{x})\Vert \varvec{\eta })\right] \\ &{}\text {s.t.} &{}\displaystyle \varvec{p}(\varvec{x})\in \Delta ^N \quad \mu \text {-a.s.}, \end{array} \end{aligned}$$

where \(\mathcal L_\infty ^N({\mathcal {X}},\mu )\) denotes the Banach space of all Borel-measurable functions from \({\mathcal {X}}\) to \({\mathbb {R}}^N\) that are essentially bounded with respect to \(\mu \). Interchanging the supremum over \(\varvec{\phi }\) with the minimum over \(\varvec{p}\) and evaluating the resulting unconstrained linear program over \(\varvec{\phi }\) in closed form then yields the dual problem

$$\begin{aligned} \begin{array}{cl} \min \limits _{\varvec{p}\in \mathcal L_\infty ^N({\mathcal {X}},\mu )} &{}\displaystyle {\mathbb {E}}_{\varvec{x} \sim \mu }\Bigg [ \sum \limits _{i=1}^Nc(\varvec{x}, \varvec{y_i})p_{i}(\varvec{x}) +\displaystyle D_f (\varvec{p}(\varvec{x}) \! \parallel \!\varvec{\eta }) \Bigg ] \\ \text {s.t.} &{}\displaystyle {\mathbb {E}}_{\varvec{x} \sim \mu }\left[ \varvec{p}(\varvec{x})\right] = \varvec{\nu },\quad \varvec{p}(\varvec{x})\in \Delta ^N \quad \mu \text {-a.s.} \end{array} \end{aligned}$$
(15)

Strong duality holds for the following reasons. As c and f are lower-semicontinuous and c is non-negative, we may proceed as in ([154], § 3.2) to show that the dual objective function is weakly\({}^*\) lower semicontinuous in \(\varvec{p}\). Similarly, as \(\Delta ^N\) is compact, one can use the Banach-Alaoglu theorem to show that the dual feasible set is weakly\({}^*\) compact. Finally, as f is real-valued and \({\mathbb {E}}_{\varvec{x} \sim \mu }[c(\varvec{x},\varvec{y}_i)]<\infty \) for all \(i\in [N]\), the constant solution \(\varvec{p}(\varvec{x})=\varvec{\nu }\) is dual feasible for all \(\varvec{\nu }\in \Delta ^N\). Thus, the dual problem is solvable and has a finite optimal value. This argument remains valid if we add a perturbation \(\varvec{\delta }\in H=\{\varvec{\delta }'\in {\mathbb {R}}^N: \sum _{i=1}^N\delta '_i=0\}\) to the right hand side vector \(\varvec{\nu }\) as long as \(\varvec{\delta }>-\varvec{\nu }\). The optimal value of the perturbed dual problem is thus pointwise finite as well as convex and—consequently—continuous and locally bounded in \(\varvec{\delta }\) at the origin of H. As \(\varvec{\nu }>\varvec{0}\), strong duality therefore follows from ([137], Theorem 17 (a)).

Any dual feasible solution \(\varvec{p}\in \mathcal L^N_\infty ({\mathcal {X}},\mu )\) gives rise to a Borel probability measure \(\pi \in \mathcal P(\mathcal X \times \mathcal Y)\) defined through \(\pi ( \varvec{y} \in \mathcal B) = \nu (\varvec{y} \in \mathcal B)\) for all Borel sets \(\mathcal B \subseteq \mathcal Y\) and \(\pi (\varvec{x} \in \mathcal A | \varvec{y} = \varvec{y}_i) = \int _{ \mathcal A} p_i(\varvec{x}) \mu (\mathrm {d}\varvec{x}) / \nu _i\) for all Borel sets \(\mathcal A \subseteq \mathcal X\) and \(i \in [N]\). This follows from the law of total probability, whereby the joint distribution of \(\varvec{x}\) and \(\varvec{y}\) is uniquely determined if we specify the marginal distribution of \(\varvec{y}\) and the conditional distribution of \(\varvec{x}\) given \(\varvec{y}=\varvec{y}_i\) for every \(i\in [N]\). By construction, the marginal distributions of \(\varvec{x}\) and \(\varvec{y}\) under \(\pi \) are determined by \(\mu \) and \(\nu \), respectively. Indeed, note that for any Borel set \(\mathcal A \subseteq \mathcal X\) we have

$$\begin{aligned} \pi (\varvec{x} \in \mathcal A)&= \sum \limits _{i=1}^N \pi (\varvec{x} \in \mathcal A | \varvec{y} = \varvec{y}_i) \cdot \pi (\varvec{y} = \varvec{y}_i) = \sum \limits _{i=1}^N \pi (\varvec{x} \in \mathcal A | \varvec{y} = \varvec{y}_i) \cdot \nu _i\\&= \sum \limits _{i=1}^N \int _{\mathcal A} {p_i(\varvec{x})}\mu (\mathrm {d}\varvec{x}) = \int _{\mathcal A} \mu (\mathrm {d}\varvec{x}) = \mu (\varvec{x}\in \mathcal A), \end{aligned}$$

where the first equality follows from the law of total probability, the second and the third equalities both exploit the construction of \(\pi \), and the fourth equality holds because \(\varvec{p}(\varvec{x})\in \Delta ^N\) \(\mu \)-almost surely due to dual feasibility. This reasoning implies that \(\pi \) constitutes a coupling of \(\mu \) and \(\nu \) (that is, \(\pi \in \Pi (\mu , \nu )\)) and is thus feasible in (13). Conversely, any \(\pi \in \Pi (\mu ,\nu )\) gives rise to a function \(\varvec{p}\in \mathcal L_\infty ^N({\mathcal {X}},\mu )\) defined through

$$\begin{aligned} p_i(\varvec{x}) =\nu _i\cdot \frac{\mathrm {d}\pi }{\mathrm {d}(\mu \otimes \nu )} (\varvec{x}, \varvec{y}_i)\quad \forall i\in [N]. \end{aligned}$$

By the properties of the Radon-Nikodym derivative, we have \(p_i(\varvec{x})\ge 0\) \(\mu \)-almost surely for all \(i\in [N]\). In addition, for any Borel set \(\mathcal A\subseteq {\mathcal {X}}\) we have

$$\begin{aligned} \int _{\mathcal A}\sum _{i=1}^N p_i(\varvec{x})\,\mu (\mathrm {d}\varvec{x})&= \int _{\mathcal A} \sum _{i=1}^N \nu _i\cdot \frac{\mathrm {d}\pi }{\mathrm {d}(\mu \otimes \nu )} (\varvec{x}, \varvec{y}_i)\,\mu (\mathrm {d}\varvec{x})\\&= \int _{\mathcal A\times {\mathcal {Y}}} \frac{\mathrm {d}\pi }{\mathrm {d}(\mu \otimes \nu )} (\varvec{x}, \varvec{y})\,(\mu \otimes \nu )(\mathrm {d}\varvec{x},\mathrm {d}\varvec{y}) \\&= \int _{\mathcal A\times {\mathcal {Y}}} \pi (\mathrm {d}\varvec{x}, \mathrm {d}\varvec{y}) = \int _{\mathcal A}\mu (\mathrm {d}\varvec{x}), \end{aligned}$$

where the second equality follows from Fubini’s theorem and the definition of \(\nu =\sum _{i=1}^N\nu _i\delta _{\varvec{y}_i}\), while the fourth equality exploits that the marginal distribution of \(\varvec{x}\) under \(\pi \) is determined by \(\mu \). As the above identity holds for all Borel sets \(\mathcal A\subseteq {\mathcal {X}}\), we find that \(\sum _{i=1}^N p_i(\varvec{x})=1\) \(\mu \)-almost surely. Similarly, we have

$$\begin{aligned} \mathbb E_{\varvec{x}\sim \mu }\left[ p_i(\varvec{x})\right]&=\int _{\mathcal {X}}\nu _i\cdot \frac{\mathrm {d}\pi }{\mathrm {d}(\mu \otimes \nu )} (\varvec{x}, \varvec{y}_i) \,\mu (\mathrm {d}\varvec{x}) \\&=\int _{{\mathcal {X}}\times \{\varvec{y}_i\}} \frac{\mathrm {d}\pi }{\mathrm {d}(\mu \otimes \nu )} (\varvec{x}, \varvec{y}) \,(\mu \otimes \nu )(\mathrm {d}\varvec{x},\mathrm {d}\varvec{y}) \\&= \int _{{\mathcal {X}}\times \{\varvec{y}_i\}} \pi (\mathrm {d}\varvec{x},\mathrm {d}\varvec{y})=\int _{\{\varvec{y}_i\}}\nu (\mathrm {d}\varvec{y})=\nu _i \end{aligned}$$

for all \(i\in [N]\). In summary, \(\varvec{p}\) is feasible in (15). Thus, we have shown that every probability measure \(\pi \) feasible in (13) induces a function \(\varvec{p}\) feasible in (15) and vice versa. We further find that the objective value of \(\varvec{p}\) in (15) coincides with the objective value of the corresponding \(\pi \) in (13). Specifically, we have

$$\begin{aligned}&{\mathbb {E}}_{\varvec{x} \sim \mu }\Bigg [ \sum \limits _{i=1}^N c(\varvec{x}, \varvec{y_i})\, p_{i}(\varvec{x}) +\displaystyle D_f (\varvec{p}(\varvec{x}) \Vert \varvec{\eta }) \Bigg ]\\&\quad =\displaystyle \int _{\mathcal {X}}\sum \limits _{i=1}^N c(\varvec{x}, \varvec{y}_i) p_i(\varvec{x}) \,\mu ( \mathrm {d}\varvec{x}) + \displaystyle \int _{\mathcal {X}}\sum _{i=1}^N f\left( \frac{p_i(\varvec{x})}{\eta _i}\right) \eta _i \, \mu (\mathrm {d}\varvec{x}) \\&\quad =\displaystyle \int _{\mathcal {X}}\sum \limits _{i=1}^N c(\varvec{x}, \varvec{y}_i) \cdot \nu _i\cdot \frac{\mathrm {d}\pi }{\mathrm {d}(\mu \otimes \nu )}(\varvec{x}, \varvec{y}_i)\, \mu ( \mathrm {d}\varvec{x}) \\&\qquad + \int _{\mathcal {X}}\sum _{i=1}^N f\left( \frac{\nu _i}{\eta _i} \cdot \frac{\mathrm {d}\pi }{\mathrm {d}(\mu \otimes \nu )}(\varvec{x}, \varvec{y}_i)\right) \cdot \eta _i \,\mu ( \mathrm {d}\varvec{x}) \\&\quad =\displaystyle \int _{{\mathcal {X}}\times {\mathcal {Y}}} c(\varvec{x}, \varvec{y})\frac{\mathrm {d}\pi }{\mathrm {d}(\mu \otimes \nu )}(\varvec{x}, \varvec{y}) \,(\mu \otimes \nu )(\mathrm {d}\varvec{x}, \mathrm {d}\varvec{y}) \\&\qquad + \displaystyle \int _{{\mathcal {X}}\times {\mathcal {Y}}} f\left( \frac{\mathrm {d}\pi }{\mathrm {d}(\mu \otimes \eta )}(\varvec{x}, \varvec{y})\right) (\mu \otimes \eta )(\mathrm {d}\varvec{x},\mathrm {d}\varvec{y}) \\&\quad =\mathbb E_{(\varvec{x}, \varvec{y}) \sim \pi } \left[ c(\varvec{x}, \varvec{y})\right] + D_f(\pi \Vert \mu \otimes \eta ), \end{aligned}$$

where the first equality exploits the definition of the discrete f-divergence, the second equality expresses the function \(\varvec{p}\) in terms of the corresponding probability measure \(\pi \), the third equality follows from Fubini’s theorem and uses the definitions \(\nu =\sum _{i=1}^N \nu _i\delta _{\varvec{y}_i}\) and \(\eta =\sum _{i=1}^N \eta _i\delta _{\varvec{y}_i}\), and the fourth equality follows from the definition of the continuous f-divergence. In summary, we have thus shown that (13) is equivalent to (15), which in turn is equivalent to (14). This observation completes the proof. \(\square \)

Proposition 3.3

(Approximation bound) If \(\varvec{\eta }\in \Delta ^N\) with \(\varvec{\eta }>\varvec{0}\) and \(\eta = \sum _{i=1}^N \eta _i \delta _{\varvec{y}_i}\) is a discrete probability measure on \({\mathcal {Y}}\), then problem (13) with regularization term \(R_\Theta (\pi ) = D_{f}(\pi \Vert \mu \otimes \eta )\) satisfies

$$\begin{aligned}|{{\overline{W}}}_c(\mu , \nu ) - W_c(\mu , \nu )| \le \max \Bigg \{\bigg |\min _{\varvec{p} \in \Delta ^N} D_f(\varvec{p} \Vert \varvec{\eta })\bigg |, \bigg |\max _{i \in [N]}\bigg \{ f\bigg (\frac{1}{\eta _i}\bigg ) \eta _i+ f(0) \sum _{k \ne i} \eta _k\bigg \}\bigg |\Bigg \}.\end{aligned}$$

Proof

By Lemma 3.2, problem (13) is equivalent to (14). Note that the inner optimization problem in (14) can be viewed as an f-divergence regularized linear program with optimal value \(\varvec{\nu }^\top \varvec{\phi }-\ell (\varvec{\phi }, \varvec{x})\), where

$$\begin{aligned} \ell (\varvec{\phi }, \varvec{x}) = \max \limits _{\varvec{p} \in \Delta ^N} \sum \limits _{i=1}^N (\phi _i - c(\varvec{x}, \varvec{y}_i)) p_i - D_f(\varvec{p} \Vert \varvec{\eta }). \end{aligned}$$

Bounding \(D_f(\varvec{p} \Vert \varvec{\eta })\) by its minimum and its maximum over \(\varvec{p}\in \Delta ^N\) then yields the estimates

$$\begin{aligned} \psi _c(\varvec{\phi }, \varvec{x}) - \max _{ \varvec{p} \in \Delta ^N} D_f(\varvec{p} \Vert \varvec{\eta }) \le \ell (\varvec{\phi }, \varvec{x}) \le \psi _c(\varvec{\phi }, \varvec{x}) - \min _{\varvec{p} \in \Delta ^N} D_f(\varvec{p} \Vert \varvec{\eta }). \end{aligned}$$
(16)

Here, \(\psi _c(\varvec{\phi }, \varvec{x})\) stands as usual for the discrete c-transform defined in (9), which can be represented as

$$\begin{aligned} \psi _c(\varvec{\phi }, \varvec{x}) = \max \limits _{\varvec{p} \in \Delta ^N}\sum \limits _{i=1}^N (\phi _i - c(\varvec{x}, \varvec{y}_i)) p_i. \end{aligned}$$
(17)

Multiplying (16) by \(-1\), adding \(\varvec{\nu }^\top \varvec{\phi }\), averaging over \(\varvec{x}\) using the probability measure \(\mu \) and maximizing over \(\varvec{\phi }\in {\mathbb {R}}^N\) further implies via (10) and (14) that

$$\begin{aligned} W_c(\mu ,\nu )+ \min _{ \varvec{p} \in \Delta ^N} D_f(\varvec{p} \Vert \varvec{\eta }) \le {{\overline{W}}}_c(\mu , \nu ) \le W_c(\mu ,\nu ) + \max _{\varvec{p} \in \Delta ^N} D_f(\varvec{p} \Vert \varvec{\eta }). \end{aligned}$$
(18)

As \(D_f(\varvec{p} \Vert \varvec{\eta })\) is convex in \(\varvec{p}\), its maximum is attained at a vertex of \(\Delta ^N\) ([75], Theorem 1), that is,

$$\begin{aligned} \max _{\varvec{p} \in \Delta ^N} D_f(\varvec{p} \Vert \varvec{\eta }) = \max _{i \in [N]}\bigg \{ f\bigg (\frac{1}{\eta _i}\bigg ) \eta _i + f(0) \sum _{k \ne i} \eta _k\bigg \}. \end{aligned}$$

The claim then follows by substituting the above formula into (18) and rearranging terms. \(\square \)

In the following we discuss three different classes of ambiguity sets \(\Theta \) for which the dual smooth optimal transport problem (12) is indeed equivalent to the primal regularized optimal transport problem (13).

3.1 Generalized extreme value distributions

Assume first that the ambiguity set \(\Theta \) is a singleton containing a single Borel probability measure \(\theta \) on \({\mathbb {R}}^N\) defined through

$$\begin{aligned} \theta (\varvec{z} \le \varvec{s}) = \exp \left( -G \left( \exp (-s_1),\ldots , \exp (-s_N) \right) \right) \quad \forall \varvec{s}\in {\mathbb {R}}^N, \end{aligned}$$
(19)

where \(G:{\mathbb {R}}^N \rightarrow {\mathbb {R}}_+\) is a smooth generating function with the following properties. First, G is homogeneous of degree \(1/\lambda \) for some \(\lambda >0\), that is, for any \(\alpha > 0\) and \(\varvec{s}\in {\mathbb {R}}^N\) we have \(G(\alpha \varvec{s}) = \alpha ^{1/\lambda }G(\varvec{s})\). In addition, \(G(\varvec{s})\) tends to infinity as \(s_i\) grows for any \(i \in [N]\). Finally, the partial derivative of G with respect to k distinct arguments is non-negative if k is odd and non-positive if k is even. These properties ensure that the noise vector \(\varvec{z}\) follows a generalized extreme value distribution in the sense of ([165], § 4.1).

Proposition 3.4

(Entropic regularization) Assume that \(\Theta \) is a singleton ambiguity set that contains only a generalized extreme value distribution with \(G( \varvec{s}) = \exp (-e)N\sum _{i=1}^N \eta _i s_i^{1/\lambda }\) for some \(\lambda > 0\) and \(\varvec{\eta }\in \Delta ^N\), \(\varvec{\eta }> \varvec{0}\), where e stands for Euler's constant. Then, the components of \(\varvec{z}\) follow independent Gumbel distributions with means \(\lambda \log (N \eta _i)\) and variances \(\lambda ^2 \pi ^2 /6\) for all \(i\in [N]\), while the smooth c-transform (11) reduces to the \(\log \)-partition function

$$\begin{aligned} {{\overline{\psi }}}(\varvec{\phi }, \varvec{x}) = \lambda \log \left( \sum _{i=1}^N \eta _i \exp \left( \frac{\phi _i -c(\varvec{x},\varvec{y_i})}{\lambda } \right) \right) . \end{aligned}$$
(20)

In addition, the smooth dual optimal transport problem (12) is equivalent to the regularized primal optimal transport problem (13) with \(R_\Theta (\pi ) = D_f(\pi \Vert \mu \otimes \eta )\), where \(f(s) =\lambda s\log (s)\) and \(\eta = \sum _{i =1}^N \eta _i \delta _{\varvec{y}_i}\).

Note that the log-partition function (20) indeed constitutes a smooth approximation for the maximum function in the definition (9) of the discrete c-transform, and this approximation becomes increasingly accurate as \(\lambda \) decreases. It is also instructive to consider the special case where \(\mu =\sum _{i=1}^M\mu _i\delta _{\varvec{x}_i}\) is a discrete probability measure with atoms \(\varvec{x}_1,\ldots ,\varvec{x}_M\in {\mathcal {X}}\) and corresponding vector of probabilities \(\varvec{\mu }\in \Delta ^M\). In this case, any coupling \(\pi \in \Pi (\mu ,\nu )\) constitutes a discrete probability measure \(\pi =\sum _{i=1}^M\sum _{j=1}^N \pi _{ij}\delta _{(\varvec{x}_i,\varvec{y}_j)}\) with matrix of probabilities \(\varvec{\pi }\in \Delta ^{M\times N}\). If \(f(s)=s\log (s)\), then the continuous f-divergence reduces to

$$\begin{aligned} D_f(\pi \Vert \mu \otimes \eta )&=\sum _{i=1}^M\sum _{j=1}^N \pi _{ij}\log (\pi _{ij})-\sum _{i=1}^M\sum _{j=1}^N \pi _{ij}\log (\mu _i)-\sum _{i=1}^M\sum _{j=1}^N \pi _{ij}\log (\eta _j)\\&=\sum _{i=1}^M\sum _{j=1}^N \pi _{ij}\log (\pi _{ij})-\sum _{i=1}^M\mu _i\log (\mu _i)-\sum _{j=1}^N \nu _j\log (\eta _j), \end{aligned}$$

where the second equality holds because \(\pi \) is a coupling of \(\mu \) and \(\nu \). Thus, \(D_f(\pi \Vert \mu \otimes \eta )\) coincides with the negative entropy of the probability matrix \(\varvec{\pi }\) offset by a constant that is independent of \(\varvec{\pi }\). For \(f(s)=s\log (s)\) the choice of \(\varvec{\eta }\) has therefore no impact on the minimizer of the smooth optimal transport problem (13), and we simply recover the celebrated entropic regularization proposed by Cuturi [39], Genevay et al. [64], Rigollet and Weed [135], Peyré and Cuturi [127] and Clason et al. [33].
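In this fully discrete setting the entropy regularized problem can be solved by the matrix scaling algorithm of Sinkhorn popularized by Cuturi [39]. The following minimal Python sketch (function and variable names are ours; it is an illustration under default parameters rather than a tuned solver) alternately rescales the rows and columns of the Gibbs kernel \(\exp (-\varvec{C}/\lambda )\) until both marginal constraints are (approximately) satisfied:

```python
import numpy as np

def sinkhorn_plan(mu, nu, C, lam, n_iter=500):
    """Minimal Sinkhorn-style sketch (names ours) for the discrete problem
    min_pi <C, pi> + lam * sum_ij pi_ij log(pi_ij) over couplings of mu, nu;
    as noted above, the choice of eta only shifts the objective by a constant.
    """
    K = np.exp(-C / lam)                  # Gibbs kernel
    v = np.ones_like(nu)
    for _ in range(n_iter):
        u = mu / (K @ v)                  # enforce row marginals
        v = nu / (K.T @ u)                # enforce column marginals
    return u[:, None] * K * v[None, :]    # dense optimal plan
```

The returned plan is strictly positive, which illustrates the loss of sparsity incurred by entropic regularization.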

Proof of Proposition 3.4

Substituting the explicit formula for the generating function G into (19) yields

$$\begin{aligned} \theta (\varvec{z} \le \varvec{s})&= \exp \left( -\exp (-e)N\sum \limits _{i=1}^N \eta _i \exp \left( -\frac{s_i}{\lambda }\right) \right) \\&=\prod \limits _{i=1}^N \exp \left( -\exp (-e)N\eta _i \exp \left( -\frac{s_i}{\lambda } \right) \right) \\&= \prod \limits _{i=1}^N \exp \left( -\exp \left( -\frac{s_i - \lambda (\log (N\eta _i)-e)}{\lambda }\right) \right) , \end{aligned}$$

where e stands for Euler’s constant. The components of the noise vector \(\varvec{z}\) are thus independent under \(\theta \), and \(z_i\) follows a Gumbel distribution with location parameter \(\lambda (\log (N\eta _i)-e)\) and scale parameter \(\lambda \) for every \(i \in [N]\). Therefore, \(z_i\) has mean \(\lambda \log (N \eta _i)\) and variance \(\lambda ^2 \pi ^2/6\).

If the ambiguity set \(\Theta \) contains only one single probability measure \(\theta \) of the form (19), then Theorem 5.2 of McFadden [101] readily implies that the smooth c-transform (11) simplifies to

$$\begin{aligned} {{\overline{\psi }}}(\varvec{\phi }, \varvec{x}) = \lambda \log G \left( \exp (\phi _1 -c(\varvec{x},\varvec{y}_1)),\dots , \exp (\phi _N - c(\varvec{x}, \varvec{y}_N)) \right) + \lambda e.\qquad \end{aligned}$$
(21)

The closed-form expression for the smooth c-transform in (20) follows immediately by substituting the explicit formula for the generating function G into (21). One further verifies that (20) can be reformulated as

$$\begin{aligned} {\overline{\psi }}_c(\varvec{\phi }, \varvec{x}) = \max \limits _{\varvec{p} \in \Delta ^N} \sum \limits _{i=1}^N (\phi _i - c(\varvec{x}, \varvec{y}_i)) p_i - \lambda \sum \limits _{i=1}^N p_i \log \left( \frac{p_i}{\eta _i}\right) . \end{aligned}$$
(22)

Indeed, solving the underlying Karush-Kuhn-Tucker conditions analytically shows that the optimal value of the nonlinear program (22) coincides with the smooth c-transform (20). Specifically, abbreviating \(u_i = \phi _i - c(\varvec{x}, \varvec{y}_i)\) and denoting by \(\tau \) the multiplier of the normalization constraint \(\sum _{i=1}^N p_i = 1\), the stationarity condition \(u_i - \lambda (\log (p_i/\eta _i) + 1) = \tau \) yields \(p_i^\star = \eta _i \exp (u_i/\lambda ) / \sum _{j=1}^N \eta _j \exp (u_j/\lambda )\), and substituting \(\varvec{p}^\star \) into the objective of (22) recovers (20). In the special case where \(\eta _i = 1/N\) for all \(i \in [N]\), the equivalence of (20) and (22) has already been recognized by Anderson et al. [10]. Substituting the representation (22) of the smooth c-transform into the dual smooth optimal transport problem (12) yields (14) with \(f(s)= \lambda s \log (s)\). By Lemma 3.2, problem (12) is thus equivalent to the regularized primal optimal transport problem (13) with \(R_\Theta (\pi ) = D_f(\pi \Vert \mu \otimes \eta )\), where \(\eta = \sum _{i =1}^N \eta _i \delta _{\varvec{y}_i}\). \(\square \)

3.2 Chebyshev ambiguity sets

Assume next that \(\Theta \) constitutes a Chebyshev ambiguity set comprising all Borel probability measures on \({\mathbb {R}}^N\) with mean vector \(\varvec{0}\) and positive definite covariance matrix \(\lambda \varvec{\Sigma }\) for some \(\varvec{\Sigma }\succ \varvec{0}\) and \(\lambda > 0\). Formally, we thus set \(\Theta = \{\theta \in \mathcal P({\mathbb {R}}^N) : {\mathbb {E}}_\theta [\varvec{z}] = \varvec{0},\, \mathbb E_\theta [\varvec{z} \varvec{z}^\top ] = \lambda \varvec{\Sigma }\}\). In this case, ([4], Theorem 1) implies that the smooth c-transform (11) can be equivalently expressed as

$$\begin{aligned} {{\overline{\psi }}}_c(\varvec{\phi }, \varvec{x}) = \max _{\varvec{p}\in \Delta ^N} \sum \limits _{i=1}^N(\phi _i -c(\varvec{x}, \varvec{y_i}))p_i + \lambda \,\text {tr}\left( (\varvec{\Sigma }^{1/2}(\text {diag}(\varvec{p})-\varvec{p}\varvec{p}^\top )\varvec{\Sigma }^{1/2})^{1/2}\right) , \end{aligned}$$
(23)

where \(\text {diag}(\varvec{p})\in {\mathbb {R}}^{N\times N}\) represents the diagonal matrix with \(\varvec{p}\) on its main diagonal. Note that the maximum in (23) evaluates the convex conjugate of the extended real-valued regularization function

$$\begin{aligned} V(\varvec{p})=\left\{ \begin{array}{c@{\qquad }l} -\lambda \,\text {tr}\left( (\varvec{\Sigma }^{1/2}(\text {diag}(\varvec{p})-\varvec{p}\varvec{p}^\top )\varvec{\Sigma }^{1/2})^{1/2}\right) &{} \text {if }\quad \varvec{p}\in \Delta ^N \\ \infty &{} \text {if }\quad \varvec{p}\notin \Delta ^N \end{array}\right. \end{aligned}$$

at the point \((\phi _i -c(\varvec{x}, \varvec{y_i}))_{i\in [N]}\). As \(\varvec{\Sigma }\succ \varvec{0}\) and \(\lambda >0\), ([4], Theorem 1) implies that \(V(\varvec{p})\) is strongly convex over its effective domain \(\Delta ^N\). By ([138], Proposition 12.60), the smooth discrete c-transform \({{\overline{\psi }}}_c(\varvec{\phi }, \varvec{x})\) is therefore indeed differentiable in \(\varvec{\phi }\) for any fixed \(\varvec{x}\). It is further known that problem (23) admits an exact reformulation as a tractable semidefinite program; see ([104], Proposition 1). If \(\varvec{\Sigma }= \varvec{I}\), then the regularization function \(V(\varvec{p})\) can be re-expressed in terms of a discrete f-divergence, which implies via Lemma 3.2 that the smooth optimal transport problem is equivalent to the original optimal transport problem regularized with a continuous f-divergence.
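For illustration, the trace term in (23) and the regularization function \(V(\varvec{p})\) can be evaluated with standard dense linear algebra. The following Python sketch (names are ours; a plain eigendecomposition is used, which is adequate for moderate N) computes the penalty \(-V(\varvec{p})\) for a given \(\varvec{p}\in \Delta ^N\):

```python
import numpy as np

def sqrtm_psd(M):
    """Symmetric square root of a positive semidefinite matrix."""
    w, V = np.linalg.eigh(M)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def chebyshev_penalty(p, Sigma, lam):
    """lam * tr((Sigma^{1/2}(diag(p) - p p^T) Sigma^{1/2})^{1/2}), cf. (23)."""
    A = np.diag(p) - np.outer(p, p)   # positive semidefinite on the simplex
    S = sqrtm_psd(Sigma)              # symmetric square root of Sigma
    return lam * np.trace(sqrtm_psd(S @ A @ S))
```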

Proposition 3.5

(Chebyshev regularization) If \(\Theta \) is the Chebyshev ambiguity set of all Borel probability measures with mean \(\varvec{0}\) and covariance matrix \(\lambda \varvec{I}\) with \(\lambda > 0\), then the smooth c-transform (11) simplifies to

$$\begin{aligned} {{\overline{\psi }}}_c(\varvec{\phi }, \varvec{x}) = \max _{ \varvec{p}\in \Delta ^N} \sum \limits _{i=1}^N(\phi _i -c(\varvec{x}, \varvec{y_i})) p_i + \lambda \sum _{i=1}^N\sqrt{p_i(1-p_i)}. \end{aligned}$$
(24)

In addition, the smooth dual optimal transport problem (12) is equivalent to the regularized primal optimal transport problem (13) with \(R_\Theta (\pi ) = D_f(\pi \Vert \mu \otimes \eta )- \lambda \sqrt{N-1}\), where \(\eta = \frac{1}{N} \sum _{i =1}^N \delta _{\varvec{y}_i}\) and

$$\begin{aligned} f(s) = {\left\{ \begin{array}{ll} -\lambda \sqrt{s(N - s)} + \lambda s \sqrt{N-1} \quad \quad &{} \text {if }\quad 0 \le s \le N\\ +\infty &{} \text {if }\quad s>N. \end{array}\right. }\end{aligned}$$
(25)

Proof

The relation (24) follows directly from (23) by replacing \(\varvec{\Sigma }\) with \(\varvec{I}\). Next, one readily verifies that \(-\lambda \sum _{i \in [N]} \sqrt{p_i(1-p_i)} = D_f(\varvec{p}\Vert \varvec{\eta }) - \lambda \sqrt{N-1}\), where \(D_f(\varvec{p}\Vert \varvec{\eta })\) denotes the discrete f-divergence from \(\varvec{p}\) to \(\varvec{\eta }=(\frac{1}{N},\ldots ,\frac{1}{N})\) induced by the generator \(f(s) =-\lambda \sqrt{s (N - s)}+ \lambda s \sqrt{N-1}\) from (25). This implies that (24) is equivalent to

$$\begin{aligned} {{\overline{\psi }}}_c(\varvec{\phi }, \varvec{x}) = \max _{ \varvec{p}\in \Delta ^N} \sum \limits _{i=1}^N(\phi _i -c(\varvec{x}, \varvec{y_i})) p_i - D_f(\varvec{p}\Vert \varvec{\eta }) + \lambda \sqrt{N-1}. \end{aligned}$$

Substituting the above representation of the smooth c-transform into the dual smooth optimal transport problem (12) yields (14) with \(f(s)= -\lambda \sqrt{s (N - s)} +\lambda s \sqrt{N-1} \), up to the additive constant \(\lambda \sqrt{N-1}\), which is independent of \(\varvec{\phi }\) and \(\varvec{p}\). By Lemma 3.2, (12) thus reduces to the regularized primal optimal transport problem (13) with \(R_\Theta (\pi ) = D_f(\pi \Vert \mu \otimes \eta ) - \lambda \sqrt{N-1}\), where \(\eta = \frac{1}{N} \sum _{i =1}^N \delta _{\varvec{y}_i}\). \(\square \)

Note that the function f(s) defined in (25) is indeed convex, lower-semicontinuous and satisfies \(f(1)=0\). Therefore, it induces a standard f-divergence. Proposition 3.5 can be generalized to arbitrary diagonal matrices \(\varvec{\Sigma }\), but the emerging f-divergences are rather intricate and not insightful, and we therefore omit this generalization. We were not able to generalize Proposition 3.5 to non-diagonal matrices \(\varvec{\Sigma }\).

3.3 Marginal ambiguity sets

We now investigate the class of marginal ambiguity sets of the form

$$\begin{aligned} \Theta = \Big \{ \theta \in {\mathcal {P}}({\mathbb {R}}^N) \, : \, \theta (z_i \le s) = F_i(s)\;\forall s\in {\mathbb {R}}, \; \forall i \in [N] \Big \}, \end{aligned}$$
(26)

where \(F_i\) stands for the cumulative distribution function of the uncertain disturbance \(z_i\), \(i\in [N]\). Marginal ambiguity sets completely specify the marginal distributions of the components of the random vector \(\varvec{z}\) but impose no restrictions on their dependence structure (i.e., their copula). Sometimes marginal ambiguity sets are also referred to as Fréchet ambiguity sets [62]. We will argue below that the marginal ambiguity sets explain most known as well as several new regularization methods for the optimal transport problem. In particular, they are more expressive than the extreme value distributions as well as the Chebyshev ambiguity sets in the sense that they induce a richer family of regularization terms. Below we denote by \(F_i^{-1} : [0, 1] \rightarrow {\mathbb {R}}\) the (left) quantile function corresponding to \(F_i\), which is defined through

$$\begin{aligned} F_i^{-1}(t) = \inf \{s :F_i(s) \ge t \}\quad \forall t\in [0, 1]. \end{aligned}$$

We first prove that if \(\Theta \) constitutes a marginal ambiguity set, then the smooth c-transform (11) admits an equivalent reformulation as the optimal value of a finite convex program.

Proposition 3.6

(Smooth c-transform for marginal ambiguity sets) If \(\Theta \) is a marginal ambiguity set of the form (26), and if the underlying cumulative distribution functions \(F_i\), \(i\in [N]\), are continuous, then the smooth c-transform (11) can be equivalently expressed as

$$\begin{aligned} {{\overline{\psi }}}_c(\varvec{\phi }, \varvec{x})= & {} \max _{ \varvec{p} \in \Delta ^N} \displaystyle \sum \limits _{i=1}^N ~ (\phi _i - c(\varvec{x}, \varvec{y_i}))p_i \nonumber \\&+ \sum _{i=1}^N \int _{1-p_i}^1 F_i^{-1}(t) \mathrm {d}t \end{aligned}$$
(27)

for all \(\varvec{x}\in {\mathcal {X}}\) and \(\varvec{\phi }\in {\mathbb {R}}^N\). In addition, the smooth c-transform is convex and differentiable with respect to \(\varvec{\phi }\), and \(\nabla _{\varvec{\phi }} {{\overline{\psi }}}_c(\varvec{\phi }, \varvec{x})\) represents the unique solution of the convex maximization problem (27).

Recall that the smooth c-transform (11) can be viewed as the best-case utility of a semi-parametric discrete choice model. Thus, (27) follows from [111, Theorem 1]. To keep this paper self-contained, we provide a new proof of Proposition 3.6, which exploits a natural connection between the smooth c-transform induced by a marginal ambiguity set and the conditional value-at-risk (CVaR).
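Before turning to the proof, we note that the representation (27) also suggests a simple numerical scheme: since the objective of (27) is concave and separable in \(\varvec{p}\), an interior maximizer equalizes the derivatives \(u_i + F_i^{-1}(1-p_i)\) across \(i\) (cf. (32) below), so that \(p_i = 1 - F_i(\tau - u_i)\) for a scalar \(\tau \) determined by \(\sum _{i} p_i = 1\). The following Python sketch (names and the bisection bracket are our own assumptions; any continuous CDFs can be supplied) solves for \(\tau \) by bisection:

```python
import numpy as np
from scipy.stats import norm  # only used in the usage example below

def smooth_c_transform_argmax(u, cdfs, lo=-50.0, hi=50.0, tol=1e-10):
    """Maximizer p* of (27): p_i = 1 - F_i(tau - u_i), with tau chosen by
    bisection so that sum_i p_i = 1 (the bracket [lo, hi] is an assumption)."""
    def total(tau):
        # sum_i p_i(tau); decreasing in tau since each F_i is a CDF
        return sum(1.0 - F(tau - ui) for F, ui in zip(cdfs, u))
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if total(mid) > 1.0:
            lo = mid     # sum too large: increase tau
        else:
            hi = mid
    tau = 0.5 * (lo + hi)
    return np.array([1.0 - F(tau - ui) for F, ui in zip(cdfs, u)])

# Example with standard Gaussian marginals; by Proposition 3.6 the result
# also equals the gradient of the smooth c-transform with respect to phi.
u = np.array([0.3, -0.1, 0.7])
p_star = smooth_c_transform_argmax(u, [norm.cdf] * 3)
```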

Proof of Proposition 3.6

Throughout the proof we fix \(\varvec{x}\in {\mathcal {X}}\) and \(\varvec{\phi }\in {\mathbb {R}}^N\), and we introduce the nominal utility vector \(\varvec{u} \in {\mathbb {R}}^N\) with components \(u_i= \phi _i - c(\varvec{x}, \varvec{y}_i)\) in order to simplify notation. In addition, it is useful to define the binary function \(\varvec{r}: {\mathbb {R}}^N \rightarrow \{ 0, 1 \}^N\) with components

$$\begin{aligned} r_i(\varvec{z}) = {\left\{ \begin{array}{ll} 1 &{} \text {if } i = \displaystyle \min \, \mathop {\mathrm{argmax}}\limits _{j \in [N]} ~ u_j + z_j, \\ 0 &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$

For any fixed \(\theta \in \Theta \), we then have

$$\begin{aligned} {\mathbb {E}}_{\varvec{z} \sim \theta } \Big [ \max \limits _{i \in [N]} u_i + z_{i} \Big ] = {\mathbb {E}}_{\varvec{z} \sim \theta } \Big [ \; \sum _{i=1}^N ( u_i + z_i) r_i(\varvec{z}) \Big ]&= \sum _{i=1}^N u_i p_i + \sum _{i=1}^N {\mathbb {E}}_{\varvec{z} \sim \theta } \left[ z_i q_i(z_i) \right] , \end{aligned}$$

where \(p_i = {\mathbb {E}}_{\varvec{z} \sim \theta } [ r_i(\varvec{z}) ]\) and \(q_i(z_i) = {\mathbb {E}}_{\varvec{z} \sim \theta } [ r_i(\varvec{z}) | z_i ]\) almost surely with respect to \(\theta \). From now on we denote by \(\theta _i\) the marginal probability distribution of the random variable \(z_i\) under \(\theta \). As \(\theta \) belongs to a marginal ambiguity set of the form (26), we thus have \(\theta _i (z_i \le s) = F_i(s)\) for all \(s \in {\mathbb {R}}\), that is, \(\theta _i\) is uniquely determined by the cumulative distribution function \(F_i\). The above reasoning then implies that

$$\begin{aligned} {{\overline{\psi }}}_c(\varvec{\phi }, \varvec{x}) = \sup _{\theta \in \Theta } ~ {\mathbb {E}}_{\varvec{z} \sim \theta } \Big [ \max _{i \in [N]} u_i + z_i \Big ]&= \left\{ \begin{array}{cll} \sup &{} \displaystyle \sum _{i=1}^N u_i p_i + \sum _{i=1}^N {\mathbb {E}}_{\varvec{z} \sim \theta } \left[ z_i q_i(z_i) \right] \\ \text {s.t.} &{} \theta \in \Theta , ~\varvec{p} \in \Delta ^N, ~\varvec{q} \in \mathcal L^N({\mathbb {R}}) \\ &{} {\mathbb {E}}_{\varvec{z} \sim \theta } \left[ r_i(\varvec{z}) \right] = p_i &{} \forall i \in [N] \\ &{} {\mathbb {E}}_{\varvec{z} \sim \theta } [ r_i(\varvec{z}) | z_i ] = q_i(z_i) \quad \theta \text {-a.s.} &{} \forall i \in [N] \end{array} \right. \nonumber \\&\le \left\{ \begin{array}{cll} \sup &{} \displaystyle \sum _{i=1}^N u_i p_i + \sum _{i=1}^N {\mathbb {E}}_{z_i \sim \theta _i} \left[ z_i q_i(z_i) \right] \\ \text {s.t.} &{} \varvec{p} \in \Delta ^N,~ \varvec{q} \in \mathcal L^N({\mathbb {R}}) \\ &{} {\mathbb {E}}_{z_i \sim \theta _i} \left[ q_i(z_i) \right] = p_i &{} \forall i \in [N] \\ &{} 0 \le q_i(z_i) \le 1 \quad \theta _i\text {-a.s.} &{} \forall i \in [N]. \end{array} \right. \end{aligned}$$
(28)

The inequality can be justified as follows. One may first add the redundant expectation constraints \(p_i = {\mathbb {E}}_{z_i \sim \theta } [q_i(z_i)]\) and the redundant \(\theta _i\)-almost sure constraints \(0\le q_i(z_i)\le 1\) to the maximization problem over \( \theta \), \(\varvec{p}\) and \(\varvec{q}\) without affecting the problem’s optimal value. Next, one may remove the constraints that express \(p_i\) and \(q_i(z_i)\) in terms of \(r_i(\varvec{z})\). The resulting relaxation provides an upper bound on the original maximization problem. Note that all remaining expectation operators involve integrands that depend on \(\varvec{z}\) only through \(z_i\) for some \(i\in [N]\), and therefore the expectations with respect to the joint probability measure \(\theta \) can all be simplified to expectations with respect to one of the marginal probability measures \(\theta _i\). As neither the objective nor the constraints of the resulting problem depend on \(\theta \), we may finally remove \(\theta \) from the list of decision variables without affecting the problem’s optimal value.

For any fixed \(\varvec{p} \in \Delta ^N\), the upper bounding problem (28) gives rise to the following N subproblems indexed by \(i\in [N]\).

$$\begin{aligned} \sup _{q_i \in \mathcal L({\mathbb {R}})} \bigg \{ {\mathbb {E}}_{z_i \sim \theta _i} \left[ z_i q_i(z_i) \right] : {\mathbb {E}}_{z_i \sim \theta _i} \left[ q_i(z_i) \right] = p_i, ~ 0 \le q_i(z_i) \le 1 ~ \theta _i\text {-a.s.} \bigg \} \end{aligned}$$
(29a)

If \(p_i > 0 \), the optimization problem (29a) over the functions \(q_i \in \mathcal L({\mathbb {R}})\) can be recast as an optimization problem over probability measures \({{\tilde{\theta }}}_i \in \mathcal P({\mathbb {R}})\) that are absolutely continuous with respect to \(\theta _i\),

$$\begin{aligned} \sup _{{{\tilde{\theta }}}_i \in \mathcal P({\mathbb {R}})} \bigg \{ p_i \; {\mathbb {E}}_{z_i \sim {{\tilde{\theta }}}_i} \left[ z_i \right] : \frac{\mathrm {d}{{\tilde{\theta }}}_i}{\mathrm {d}\theta _i}(z_i) \le \frac{1}{p_i} ~ \theta _i\text {-a.s.} \bigg \}, \end{aligned}$$
(29b)

where \(\mathrm {d}{{\tilde{\theta }}}_i / \mathrm {d}\theta _i \) denotes as usual the Radon-Nikodym derivative of \({{\tilde{\theta }}}_i\) with respect to \(\theta _i\). Indeed, if \(q_i\) is feasible in (29a), then \({{\tilde{\theta }}}_i\) defined through \({{\tilde{\theta }}}_i(B)= \frac{1}{p_i} \int _B q_i(z_i)\, \theta _i(\mathrm {d}z_i)\) for all Borel sets \(B\subseteq {\mathbb {R}}\) is feasible in (29b) and attains the same objective function value. Conversely, if \({{\tilde{\theta }}}_i\) is feasible in (29b), then \(q_i (z_i)= p_i \, \mathrm {d}{{\tilde{\theta }}}_i / \mathrm {d}\theta _i (z_i)\) is feasible in (29a) and attains the same objective function value. Thus, (29a) and (29b) are indeed equivalent. By ([61], Theorem 4.47), the optimal value of (29b) is given by \(p_i \, \theta _i \text {-CVaR}_{p_i}(z_i) = \int _{1-p_i}^1 F_i^{-1}(t) \mathrm {d}t\), where \(\theta _i \text {-CVaR}_{p_i}(z_i)\) denotes the CVaR of \(z_i\) at level \(p_i\) under \(\theta _i\).

If \(p_i = 0\), on the other hand, then the optimal value of (29a) and the integral \(\int _{1-p_i}^1 F_i^{-1}(t) \mathrm {d}t\) both evaluate to zero. Thus, the optimal value of the subproblem (29a) coincides with \(\int _{1-p_i}^1 F_i^{-1}(t) \mathrm {d}t\) irrespective of \(p_i\). Substituting this optimal value into (28) finally yields the explicit upper bound

$$\begin{aligned} \sup _{\theta \in \Theta } ~ {\mathbb {E}}_{z \sim \theta } \Big [ \max \limits _{i \in [N]} u_i + z_i \Big ]&\le \sup _{\varvec{p} \in \Delta ^N} ~ \sum _{i=1}^N u_i p_i + \sum _{i=1}^N \int _{1-p_i}^1 F_i^{-1}(t) \mathrm {d}t. \end{aligned}$$
(30)

Note that the objective function of the upper bounding problem on the right hand side of (30) constitutes a sum of the strictly concave and differentiable univariate functions \(u_i p_i + \int _{1-p_i}^1 F_i^{-1}(t)\,\mathrm {d}t\). Indeed, the derivative of the \(i^{\text {th}}\) function with respect to \(p_i\) is given by \(u_i + F_i^{-1}(1-p_i)\), which is strictly decreasing in \(p_i\) because \(F_i^{-1}\) is strictly increasing thanks to the assumed continuity of \(F_i\). The upper bounding problem in (30) is thus solvable as it has a compact feasible set as well as a differentiable objective function. Moreover, the solution is unique thanks to the strict concavity of the objective function. In the following we denote this unique solution by \(\varvec{p}^\star \).

It remains to be shown that there exists a distribution \(\theta ^\star \in \Theta \) that attains the upper bound in (30). To this end, we define the functions \( q_i^\star (z_i) = \mathbbm {1}_{\{ z_i > F_i^{-1}(1 - p_i^\star ) \}}\) for all \(i \in [N]\). By ([61], Remark 4.48), \(q_i^\star (z_i)\) is optimal in (29a) for \(p_i=p_i^\star \). In other words, we have \({\mathbb {E}}_{z_i \sim \theta _i} [q_i^\star (z_i)] = p_i^\star \) and \({\mathbb {E}}_{z_i \sim \theta _i}[z_i q_i^\star (z_i)] = \int _{1 - p_i^\star }^1 F_i^{-1}(t) \mathrm {d}t\). In addition, we also define the Borel measures \(\theta _i^+\) and \(\theta _i^-\) through

$$\begin{aligned} \theta _i^+(B) = \theta _i(B | z_i > F_i^{-1}(1 - p_i^\star )) \quad \text {and} \quad \theta _i^-(B) = \theta _i(B | z_i \le F_i^{-1}(1 - p_i^\star )) \end{aligned}$$

for all Borel sets \(B \subseteq {\mathbb {R}}\), respectively. By construction, \(\theta _i^+\) is supported on \((F_i^{-1}(1 - p_i^\star ), \infty )\), while \(\theta _i^-\) is supported on \((-\infty , F_i^{-1}(1 - p_i^\star )]\). The law of total probability further implies that \(\theta _i = p_i^\star \theta _i^+ + (1 - p_i^\star ) \theta _i^-\). In the remainder of the proof we will demonstrate that the maximization problem on the left hand side of (30) is solved by the mixture distribution

$$\begin{aligned} \theta ^\star = \sum _{j=1}^N p_j^\star \cdot \left( \otimes _{k=1}^{j-1} \theta _k^- \right) \otimes \theta _j^+ \otimes \left( \otimes _{k=j+1}^{N} \theta _k^- \right) . \end{aligned}$$

This will show that the inequality in (30) is in fact an equality, which in turn implies that the smooth c-transform is given by (27). We first prove that \(\theta ^\star \in \Theta \). To see this, note that for all \(i \in [N]\) we have

$$\begin{aligned} \theta ^\star (z_i \le s) = p_i^\star \theta _i^+ (z_i \le s) + \left( \sum _{j \ne i} p_j^\star \right) \theta _i^- (z_i \le s) = \theta _i (z_i \le s) = F_i(s), \end{aligned}$$

where the second equality exploits the relation \(\sum _{j \ne i} p_j^\star = 1 - p_i^\star \). This observation implies that \(\theta ^\star \in \Theta \). Next, we prove that \(\theta ^\star \) attains the upper bound in (30). By the definition of the binary function \(\varvec{r}\), we have

$$\begin{aligned} {\mathbb {E}}_{{\varvec{z}} \sim \theta ^\star } \Big [ \max \limits _{i \in [N]} u_i + z_{i} \Big ]&={\mathbb {E}}_{{\varvec{z}} \sim \theta ^\star } \Big [ \sum _{i=1}^N ( u_i + z_i) r_i({\varvec{z}}) \Big ] \\&= \sum _{i=1}^N {\mathbb {E}}_{z_i \sim \theta _i} \left[ (u_i + z_i) {\mathbb {E}}_{{\varvec{z}} \sim \theta ^\star } \left[ r_i({\varvec{z}}) | z_i \right] \right] \\&= \sum _{i=1}^N {\mathbb {E}}_{z_i \sim \theta _i} \Big [ ( u_i + z_i) \, \theta ^\star \Big ( i = \min \, \mathop {\mathrm{argmax}}\limits _{j \in [N]} ~ u_j + z_j \big | z_i \Big ) \Big ] \\&= \sum _{i=1}^N {\mathbb {E}}_{z_i \sim \theta _i} \left[ ( u_i + z_i) \, \theta ^\star \left( z_j < u_i + z_i - u_j~ \forall j \ne i \big | z_i \right) \right] , \end{aligned}$$

where the third equality holds because \(r_i(\varvec{z})=1\) if and only if \(i = \min {{\,\mathrm{argmax}\,}}_{j \in [N]} u_j + z_j\), and the fourth equality follows from the assumed continuity of the marginal distribution functions \(F_i\), \(i\in [N]\), which implies that \(\theta ^\star ( z_j = u_i + z_i - u_j \text { for some } j \ne i \big | z_i ) = 0\) \(\theta _i\)-almost surely for all \(i\in [N]\).

Hence, we find

$$\begin{aligned} {\mathbb {E}}_{\varvec{z} \sim \theta ^\star } \Big [ \max \limits _{i \in [N]} u_i + z_{i} \Big ]&= \sum _{i=1}^N p_i^\star \, {\mathbb {E}}_{z_i \sim \theta _i^+} \left[ ( u_i + z_i) \, \theta ^\star \left( z_j< u_i + z_i - u_j~ \forall j \ne i \big | z_i \right) \right] \nonumber \\&\quad + \sum _{i=1}^N (1 - p_i^\star )\, {\mathbb {E}}_{z_i \sim \theta _i^-} \left[ ( u_i + z_i) \, \theta ^\star \left( z_j< u_i + z_i - u_j~ \forall j \ne i \big | z_i \right) \right] \nonumber \\&= \displaystyle \sum _{i=1}^N p_i^\star \, {\mathbb {E}}_{z_i \sim \theta _i^+} \Big [ (u_i + z_i) \Big ( \prod _{j \ne i} \theta _j^-(z_j < z_i + u_i - u_j) \Big ) \Big ] \end{aligned}$$
(31a)
$$\begin{aligned}&\quad + \displaystyle \sum _{i=1}^N \sum _{j \ne i} p_j^\star \,{\mathbb {E}}_{z_i \sim \theta _i^-} \Big [ (u_i + z_i) \Big ( \prod _{k \ne i, j} \theta _k^-(z_k< z_i + u_i - u_k) \Big )\, \theta _j^+(z_j < z_i + u_i - u_j) \Big ], \end{aligned}$$
(31b)

where the first equality exploits the relation \(\theta _i = p_i^\star \theta _i^+ + (1 - p_i^\star ) \theta _i^-\), while the second equality follows from the definition of \(\theta ^\star \). The expectations in (31) can be further simplified by using the stationarity conditions of the upper bounding problem in (30), which imply that the partial derivatives of the objective function with respect to the decision variables \(p_i\), \(i\in [N]\), are all equal at \(\varvec{p}=\varvec{p}^\star \). Thus, \(\varvec{p}^\star \) must satisfy

$$\begin{aligned} u_i + F_i^{-1}(1 - p_i^\star ) = u_j + F_j^{-1}(1 - p_j^\star ) \quad \forall i, j \in [N]. \end{aligned}$$
(32)

Consequently, for every \(z_i > F_i^{-1}(1 - p_i^\star )\) and \(j\ne i\) we have

$$\begin{aligned} \theta _j^-(z_j < z_i + u_i - u_j) \ge \theta _j^-(z_j \le F_i^{-1}(1 - p_i^\star ) + u_i - u_j) = \theta _j^-(z_j \le F_j^{-1}(1 - p_j^\star )) = 1, \end{aligned}$$

where the first equality follows from (32), and the second equality holds because \(\theta _j^-\) is supported on \((-\infty , F_j^{-1}(1 - p_j^\star )]\). As no probability can exceed 1, the above reasoning implies that \(\theta _j^-(z_j < z_i + u_i - u_j)=1\) for all \(z_i > F_i^{-1}(1 - p_i^\star )\) and \(j\ne i\). Noting that \(q_i^\star (z_i)= \mathbbm {1}_{\{ z_i > F_i^{-1}(1 - p_i^\star ) \}}\) represents the characteristic function of the set \((F_i^{-1}(1 - p_i^\star ), \infty )\) covering the support of \(\theta _i^+\), the term (31a) can thus be simplified to

$$\begin{aligned}&\sum _{i=1}^N p_i^\star \,{\mathbb {E}}_{z_i \sim \theta _i^+} \left[ (u_i + z_i) \left( \prod _{j \ne i} \theta _j^-(z_j < z_i + u_i - u_j) \right) q_i^\star (z_i) \right] = \sum _{i=1}^N {\mathbb {E}}_{z_i \sim \theta _i} \left[ (u_i + z_i) q_i^\star (z_i) \right] . \end{aligned}$$

Similarly, for any \(z_i \le F_i^{-1}(1 - p_i^\star )\) and \(j\ne i\) we have

$$\begin{aligned} \theta _j^+(z_j< z_i + u_i - u_j) \le \theta _j^+(z_j< F_i^{-1}(1 - p_i^\star ) + u_i - u_j) = \theta _j^+(z_j < F_j^{-1}(1 - p_j^\star )) = 0, \end{aligned}$$

where the two equalities follow from (32) and from the observation that \(\theta _j^+\) is supported on \((F_j^{-1}(1 - p_j^\star ), \infty )\), respectively. As probabilities are non-negative, the above implies that \(\theta _j^+(z_j < z_i + u_i - u_j)=0\) for all \(z_i \le F_i^{-1}(1 - p_i^\star )\) and \(j\ne i\). Hence, as \(\theta _i^-\) is supported on \((-\infty , F_i^{-1}(1 - p_i^\star )]\), the term (31b) simplifies to

$$\begin{aligned} \sum _{i=1}^N \sum _{j \ne i} p_j^\star \,{\mathbb {E}}_{z_i \sim \theta _i^-} \Big [ (u_i + z_i) \Big ( \prod _{k \ne i, j} \theta _k^-(z_k< z_i + u_i - u_k) \Big )\, \theta _j^+(z_j < z_i + u_i - u_j) \Big ] = 0. \end{aligned}$$

By combining the simplified reformulations of (31a) and (31b), we finally obtain

$$\begin{aligned} {\mathbb {E}}_{\varvec{z} \sim \theta ^\star } \Big [ \max \limits _{i \in [N]} u_i + z_{i} \Big ] = \sum _{i=1}^N {\mathbb {E}}_{z_i \sim \theta _i} \left[ ( u_i + z_i) q_i^\star (z_i) \right] = \sum _{i=1}^N u_i p_i^\star + \sum _{i=1}^N \int _{1-p_i^\star }^1 F_i^{-1}(t) \mathrm {d}t, \end{aligned}$$

where the last equality exploits the relations \({\mathbb {E}}_{z_i \sim \theta _i} [q_i^\star (z_i)] = p_i^\star \) and \({\mathbb {E}}_{z_i \sim \theta _i}[z_i q_i^\star (z_i)] = \int _{1 - p_i^\star }^1 F_i^{-1}(t) \mathrm {d}t\) derived in the first part of the proof. We have thus shown that the smooth c-transform is given by (27).

Finally, by the envelope theorem ([44], Theorem 2.16), the gradient \(\nabla _{\varvec{\phi }}{{\overline{\psi }}}_c(\varvec{\phi }, \varvec{x})\) exists and coincides with the unique maximizer \(\varvec{p}^\star \) of the maximization problem in (27). \(\square \)

The next theorem reveals that the smooth dual optimal transport problem (12) with a marginal ambiguity set corresponds to a regularized primal optimal transport problem of the form (13).

Theorem 3.7

(Fréchet regularization) Suppose that \(\Theta \) is a marginal ambiguity set of the form (26) and that the marginal cumulative distribution functions are defined through

$$\begin{aligned} F_i(s) = \min \{1, \max \{0, 1-\eta _i F(-s)\}\} \end{aligned}$$
(33)

for some probability vector \(\varvec{\eta }\in \Delta ^N\) and strictly increasing function \(F: {\mathbb {R}}\rightarrow {\mathbb {R}}\) with \(\int _0^1 F^{-1} (t) \mathrm {d}t = 0\). Then, the smooth dual optimal transport problem (12) is equivalent to the regularized primal optimal transport problem (13) with \(R_\Theta (\pi ) = D_f(\pi \Vert \mu \otimes \eta )\), where \(f(s) = \int _{0 }^{s} F^{-1}(t) \mathrm {d}t\) and \(\eta = \sum _{i=1}^N \eta _i \delta _{\varvec{y}_i}\).

The function f(s) introduced in Theorem 3.7 is smooth and convex because its derivative \( \mathrm {d}f(s) / \mathrm {d}s = F^{-1}(s)\) is strictly increasing, and \(f(1) = \int _0^1 F^{-1}(t) \mathrm {d}t=0\) by assumption. Therefore, this function induces a standard f-divergence. From now on we will refer to F as the marginal generating function.

Proof of Theorem 3.7

By Proposition 3.6, the smooth dual optimal transport problem (12) is equivalent to

$$\begin{aligned} {\overline{W}}_{c}(\mu , \nu )&= \sup \limits _{ \varvec{\phi }\in {\mathbb {R}}^N} ~ {\mathbb {E}}_{\varvec{x} \sim \mu }\left[ \min \limits _{\varvec{p}\in \Delta ^N} \sum \limits _{i=1}^N{\phi _i\nu _i}- \sum \limits _{i=1}^N(\phi _i - c(\varvec{x}, \varvec{y_i}))p_i - \sum _{i=1}^N \displaystyle \int _{1-p_i}^1 F_i^{-1}(t)\mathrm {d}t \right] . \end{aligned}$$

As F is strictly increasing, we have \(F_i^{-1}(s) = -F^{-1}((1-s) / \eta _i)\) for all \(s \in (0, 1)\). Thus, we find

$$\begin{aligned} f(s) = \int _{0}^{s} F^{-1}(t) \mathrm {d}t = -\frac{1}{\eta _i} \int _{1}^{1 - s \eta _i} F^{-1} \left( \frac{1 - z}{\eta _i} \right) \mathrm {d}z= -\frac{1}{ \eta _i} \int _{1 - s \eta _i}^1 F_i^{-1}(z) \mathrm {d}z, \end{aligned}$$
(34)

where the second equality follows from the variable substitution \(z\leftarrow 1-\eta _i t\). This integral representation of f(s) then allows us to reformulate the smooth dual optimal transport problem as

$$\begin{aligned} {\overline{W}}_{c}(\mu , \nu )= \sup \limits _{ \varvec{\phi }\in {\mathbb {R}}^N} ~ {\mathbb {E}}_{\varvec{x} \sim \mu }\left[ \min \limits _{\varvec{p}\in \Delta ^N} \sum \limits _{i=1}^N{\phi _i\nu _i}- \sum \limits _{i=1}^N(\phi _i - c(\varvec{x}, \varvec{y_i}))p_i + \sum \limits _{i=1}^N \eta _i \,f\left( \frac{p_i}{\eta _i} \right) \right] , \end{aligned}$$

which is manifestly equivalent to problem (14) thanks to the definition of the discrete f-divergence. Lemma 3.2 finally implies that the resulting instance of (14) is equivalent to the regularized primal optimal transport problem (13) with regularization term \(R_\Theta (\pi ) = D_{f}(\pi \Vert \mu \otimes \eta )\). Hence, the claim follows. \(\square \)

Theorem 3.7 imposes relatively restrictive conditions on the marginals of \(\varvec{z}\). Indeed, it requires that all marginal distribution functions \(F_i\), \(i\in [N]\), must be generated by a single marginal generating function F through the relation (33). The following examples showcase, however, that the freedom to select F offers significant flexibility in designing various (existing as well as new) regularization schemes. Details of the underlying derivations are relegated to Appendix C. Table 1 summarizes the marginal generating functions F studied in these examples and lists the corresponding divergence generators f.

Table 1 Marginal generating functions F with parameter \(\lambda \) and corresponding divergence generators f (cf. Examples 3.8-3.12)

Exponential: \(F(s) = \exp (s/\lambda - 1)\), \(f(s) = \lambda s \log (s)\)
Uniform: \(F(s) = s/(2\lambda ) + 1/2\), \(f(s) = \lambda (s^2 - s)\)
Pareto: \(F(s) = (s (q-1)/(\lambda q) + 1/q)^{1/(q-1)}\), \(f(s) = \lambda (s^q - s)/(q-1)\)
Hyperbolic cosine: \(F(s) = \sinh (s/\lambda - k)\), \(f(s) = \lambda (s\, \text {arcsinh}(s) - \sqrt{s^2+1} + 1 + ks)\)
t-distribution: F(s) as in Example 3.12, f(s) as in (25)

Example 3.8

(Exponential distribution model) Suppose that \(\Theta \) is a marginal ambiguity set with (shifted) exponential marginals of the form (33) induced by the generating function \(F(s) = \exp (s / \lambda - 1)\) with \(\lambda > 0\). Then the smooth dual optimal transport problem (12) is equivalent to the regularized optimal transport problem (13) with an entropic regularizer of the form \(R_\Theta (\pi ) = D_f(\pi \Vert \mu \otimes \eta )\), where \(f(s) =\lambda s \log (s)\), while the smooth c-transform (11) reduces to the log-partition function (20). This example shows that entropic regularizers are not only induced by singleton ambiguity sets containing a generalized extreme value distribution (see Sect. 3.1) but also by marginal ambiguity sets with exponential marginals.
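As a quick sanity check, the recipe \(f(s) = \int _0^s F^{-1}(t)\,\mathrm {d}t\) of Theorem 3.7 can be evaluated numerically for this generating function. The following Python sketch (assuming scipy; names are ours) recovers \(f(s) = \lambda s \log (s)\) up to quadrature error:

```python
import numpy as np
from scipy.integrate import quad

lam = 0.7

def F_inv(t):
    """Inverse of the exponential generator F(s) = exp(s/lam - 1)."""
    return lam * (1.0 + np.log(t))

def f(s):
    """Divergence generator of Theorem 3.7: f(s) = int_0^s F^{-1}(t) dt."""
    return quad(F_inv, 0.0, s)[0]

for s in (0.5, 1.0, 2.0):
    print(f(s), lam * s * np.log(s))   # the two values agree for each s
```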

Example 3.9

(Uniform distribution model) Suppose that \(\Theta \) is a marginal ambiguity set with uniform marginals of the form (33) induced by the generating function \(F(s) = s/(2\lambda ) + 1/2\) with \(\lambda > 0\). In this case the smooth dual optimal transport problem (12) is equivalent to the regularized optimal transport problem (13) with a \(\chi ^2\)-divergence regularizer of the form \(R_\Theta (\pi ) = D_f(\pi \Vert \mu \otimes \eta )\), where \(f(s) = \lambda (s^2 -s)\). Such regularizers were previously investigated by Blondel et al. [24] and Seguy et al. [149] under the additional assumption that \(\eta _i\) is independent of \(i\in [N]\), yet their intimate relation to noise models with uniform marginals remained undiscovered until now. In addition, the smooth c-transform (11) satisfies

$$\begin{aligned} {{\overline{\psi }}}(\varvec{\phi }, \varvec{x}) = \lambda + \lambda \, \mathop {\mathrm{spmax}}\limits _{i \in [N]} \;\frac{\phi _i - c(\varvec{x}, \varvec{y_i})}{\lambda }, \end{aligned}$$

where the sparse maximum operator ‘\({{\,\mathrm{spmax}\,}}\)’ inspired by Martins and Astudillo [98] is defined through

$$\begin{aligned} \mathop {\mathrm{spmax}}\limits _{i \in [N]} \; u_i = \max _{\varvec{p} \in \Delta ^N} \; \sum _{i=1}^N u_i p_i - {p_i^2}/{\eta _i} \qquad \forall \varvec{u}\in {\mathbb {R}}^N. \end{aligned}$$
(35)

The envelope theorem ([44], Theorem 2.16) ensures that \({{\,\mathrm{spmax}\,}}_{i \in [N]} u_i\) is smooth and that its gradient with respect to \(\varvec{u}\) is given by the unique solution \(\varvec{p}^\star \) of the maximization problem on the right hand side of (35). We note that \(\varvec{p}^\star \) has many zero entries due to the sparsity-inducing nature of the problem’s simplicial feasible set. In addition, we have \(\lim _{\lambda \downarrow 0} \lambda {{\,\mathrm{spmax}\,}}_{i \in [N]} u_i/\lambda = \max _{i\in [N]}u_i\). Thus, the sparse maximum can indeed be viewed as a smooth approximation of the ordinary maximum. In marked contrast to the more widely used LogSumExp function, however, the sparse maximum has a sparse gradient. Proposition D.1 in Appendix D shows that \(\varvec{p}^\star \) can be computed efficiently by sorting.
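For the uniform choice \(\eta _i = 1/N\), completing the square shows that the maximization in (35) reduces to a Euclidean projection of \(\varvec{u}/(2N)\) onto the simplex, which can be computed by the classical sorting procedure underlying the sparsemax of Martins and Astudillo [98]. A minimal Python sketch follows (names are ours; non-uniform \(\varvec{\eta }\) requires the weighted variant treated in Proposition D.1):

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection onto the probability simplex via sorting."""
    u = np.sort(v)[::-1]                       # sort in decreasing order
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def spmax_argmax(u):
    """Maximizer p* of (35) for eta_i = 1/N: completing the square shows
    that p* is the projection of u/(2N) onto the simplex."""
    N = len(u)
    return project_simplex(np.asarray(u) / (2.0 * N))
```

Many entries of the returned vector are exactly zero, in line with the sparsity of \(\varvec{p}^\star \) noted above.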

Example 3.10

(Pareto distribution model) Suppose that \(\Theta \) is a marginal ambiguity set with (shifted) Pareto distributed marginals of the form (33) induced by the generating function \(F(s) = (s (q-1) / (\lambda q)+1/q)^{1/(q-1)}\) with \(\lambda ,q>0\). Then the smooth dual optimal transport problem (12) is equivalent to the regularized optimal transport problem (13) with a Tsallis divergence regularizer of the form \(R_\Theta (\pi ) = D_f(\pi \Vert \mu \otimes \eta )\), where \(f(s) = \lambda (s^q - s)/(q-1)\). Such regularizers were investigated by [110] under the additional assumption that \(\eta _i\) is independent of \(i\in [N]\). The Pareto distribution model encapsulates the exponential model (in the limit \(q\rightarrow 1\)) and the uniform distribution model (for \(q=2\)) as special cases. The smooth c-transform admits no simple closed-form representation under this model.

Example 3.11

(Hyperbolic cosine distribution model) Suppose that \(\Theta \) is a marginal ambiguity set with hyperbolic cosine distributed marginals of the form (33) induced by the generating function \(F(s) = \sinh (s/\lambda - k)\) with \(k = \sqrt{2} - 1 - \text {arcsinh}(1)\) and \(\lambda > 0\). Then the marginal probability density functions are given by scaled and truncated hyperbolic cosine functions, and the smooth dual optimal transport problem (12) is equivalent to the regularized optimal transport problem (13) with a hyperbolic divergence regularizer of the form \(R_\Theta (\pi ) = D_f(\pi \Vert \mu \otimes \eta )\), where \(f(s) = \lambda (s \text {arcsinh}(s) - \sqrt{s^2 + 1} + 1 + ks)\). Hyperbolic divergences were introduced by Ghai et al. [66] in order to unify several gradient descent algorithms.

Example 3.12

(t-distribution model) Suppose that \(\Theta \) is a marginal ambiguity set where the marginals are determined by (33), and assume that the generating function is given by

$$\begin{aligned} F(s) = \frac{N}{2}\left( 1 + \frac{s - \sqrt{N-1}}{\sqrt{\lambda ^2 + (s - \sqrt{N-1})^{2}}}\right) \end{aligned}$$

for some \(\lambda > 0\). In this case one can show that all marginals constitute t-distributions with 2 degrees of freedom and that the smooth dual optimal transport problem (12) is equivalent to the Chebyshev regularized optimal transport problem described in Proposition 3.5.

To close this section, we remark that different regularization schemes differ as to how well they approximate the original (unregularized) optimal transport problem. Proposition 3.3 provides simple error bounds that may help in selecting suitable regularizers. For the entropic regularization scheme associated with the exponential distribution model of Example 3.8, for example, the error bound evaluates to \(\max _{i\in [N]}\lambda \log (1/\eta _i)\), while for the \(\chi ^2\)-divergence regularization scheme associated with the uniform distribution model of Example 3.9, the error bound is given by \(\max _{i \in [N]}\lambda (1/\eta _i - 1)\). In both cases, the error is minimized by setting \(\eta _i = 1/N \) for all \(i \in [N]\). Thus, the error bound grows logarithmically with N for entropic regularization and linearly with N for \(\chi ^2\)-divergence regularization. Different regularization schemes also differ with regard to their computational properties, which will be discussed in Sect. 4.
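These two bounds are easily tabulated. A short Python sketch (names are ours) makes the logarithmic versus linear growth explicit for the error-minimizing choice \(\eta _i = 1/N\):

```python
import numpy as np

lam = 0.1
for N in (10, 100, 1000):
    eta = np.full(N, 1.0 / N)                   # error-minimizing choice
    entropic = lam * np.max(np.log(1.0 / eta))  # = lam * log(N)
    chi2 = lam * np.max(1.0 / eta - 1.0)        # = lam * (N - 1)
    print(N, round(entropic, 3), round(chi2, 3))
```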

4 Numerical solution of smooth optimal transport problems

The smooth semi-discrete optimal transport problem (12) constitutes a stochastic optimization problem and can therefore be addressed with a stochastic gradient descent (SGD) algorithm. In Sect. 4.1 we first derive new convergence guarantees for an averaged gradient descent algorithm that has only access to a biased stochastic gradient oracle. This algorithm outputs the uniform average of the iterates (instead of the last iterate) as the recommended candidate solution. We prove that if the objective function is Lipschitz continuous, then the suboptimality of this candidate solution is of the order \(\mathcal O(1/\sqrt{T})\), where T stands for the number of iterations. An improvement in the non-leading terms is possible if the objective function is additionally smooth. We further prove that a convergence rate of \(\mathcal O(1/{T})\) can be obtained for generalized self-concordant objective functions. In Sect. 4.2 we then show that the algorithm of Sect. 4.1 can be used to efficiently solve the smooth semi-discrete optimal transport problem (12) corresponding to a marginal ambiguity set of the type (26). As a byproduct, we prove that the convergence rate of the averaged SGD algorithm for the semi-discrete optimal transport problem with entropic regularization is of the order \(\mathcal O(1/T)\), which improves the \(\mathcal O(1/\sqrt{T})\) guarantee of Genevay et al. [64].

4.1 Averaged gradient descent algorithm with biased gradient oracles

Consider a general convex minimization problem of the form

$$\begin{aligned} \min _{\varvec{\phi }\in {\mathbb {R}}^n} ~ h(\varvec{\phi }), \end{aligned}$$
(36)

where the objective function \(h: {\mathbb {R}}^n \rightarrow {\mathbb {R}}\) is convex and differentiable. We assume that problem (36) admits a minimizer \(\varvec{\phi }^\star \). We study the convergence behavior of the inexact gradient descent algorithm

$$\begin{aligned} \varvec{\phi }_{t} = \varvec{\phi }_{t-1} - \gamma \varvec{g}_t(\varvec{\phi }_{t-1}), \end{aligned}$$
(37)

where \(\gamma > 0\) is a fixed step size, \(\varvec{\phi }_0\) is a given deterministic initial point and the function \(\varvec{g}_t: {\mathbb {R}}^n \rightarrow {\mathbb {R}}^n\) is an inexact gradient oracle that returns for every fixed \(\varvec{\phi }\in {\mathbb {R}}^n\) a random estimate of the gradient of h at \(\varvec{\phi }\). Note that we allow the gradient oracle to depend on the iteration counter t, which makes it possible to account for increasingly accurate gradient estimates. In contrast to the previous sections, we henceforth model all random objects as measurable functions on an abstract filtered probability space \((\Omega , \mathcal F, (\mathcal F_t)_{t \ge 0}, \mathbb P)\), where \({\mathcal {F}}_0 = \{ \emptyset ,\Omega \}\) represents the trivial \(\sigma \)-field, while the gradient oracle \(\varvec{g}_t(\varvec{\phi })\) is \(\mathcal F_t\)-measurable for all \(t\in \mathbb N\) and \(\varvec{\phi }\in {\mathbb {R}}^n\). In order to avoid clutter, we use \(\mathbb E[\cdot ]\) to denote the expectation operator with respect to \(\mathbb P\), and all inequalities and equalities involving random variables are understood to hold \(\mathbb P\)-almost surely.

In the following we analyze the effect of averaging in inexact gradient descent algorithms. We will show that after T iterations with a constant step size \(\gamma = \mathcal O(1 / \sqrt{T})\), the objective function value of the uniform average of all iterates generated by (37) converges to the optimal value of (36) at a sublinear rate. Specifically, we will prove that the rate of convergence varies between \({\mathcal {O}}(1 / \sqrt{T})\) and \({\mathcal {O}}(1/T)\) depending on properties of the objective function. Our convergence analysis will rely on several regularity conditions.
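In pseudocode terms, the method under study can be summarized by the following Python sketch (the gradient oracle, initial point and horizon are placeholders, and the step size \(\gamma = 1/\sqrt{T}\) merely instantiates the \(\mathcal O(1/\sqrt{T})\) scaling):

```python
import numpy as np

def averaged_inexact_gd(grad_oracle, phi0, T):
    """Inexact gradient descent (37) with uniform (Polyak-Ruppert) averaging.

    grad_oracle(t, phi) returns a possibly biased stochastic estimate
    g_t(phi) of the gradient of h at phi; the recommended candidate
    solution is the uniform average of phi_0, ..., phi_{T-1}.
    """
    gamma = 1.0 / np.sqrt(T)          # constant step size, O(1/sqrt(T))
    phi = np.array(phi0, dtype=float)
    avg = np.zeros_like(phi)
    for t in range(1, T + 1):
        avg += (phi - avg) / t        # running average of the iterates
        phi -= gamma * grad_oracle(t, phi)
    return avg
```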

Assumption 4.1

 (Regularity conditions) Different combinations of the following regularity conditions will enable us to establish different convergence guarantees for the averaged inexact gradient descent algorithm.

  1. (i)

Biased gradient oracle: There exist tolerances \(\varepsilon _t>0\), \(t\in \mathbb N\cup \{0\}\), such that

$$\begin{aligned} \left\| {\mathbb {E}}\left[ \varvec{g}_t(\varvec{\phi }_{t-1}) \big | \mathcal F_{t-1} \right] - \nabla h(\varvec{\phi }_{t-1}) \right\| \le \varepsilon _{t-1}\quad \forall t\in \mathbb N. \end{aligned}$$
  2. (ii)

    Bounded gradients: There exists \(R > 0\) such that

    $$\begin{aligned} \Vert \nabla h(\varvec{\phi }) \Vert \le R\quad \text {and} \quad \Vert \varvec{g}_t(\varvec{\phi }) \Vert \le R \quad \forall \varvec{\phi }\in {\mathbb {R}}^n,~ \forall t \in \mathbb N. \end{aligned}$$
  3. (iii)

    Generalized self-concordance: The function h is M-generalized self-concordant for some \(M > 0\), that is, h is three times differentiable, and for any \(\varvec{\phi }, \varvec{\phi }' \in {\mathbb {R}}^n\) the function \(u(s) = h(\varvec{\phi }+ s (\varvec{\phi }' - \varvec{\phi }))\) satisfies the inequality

    $$\begin{aligned} \left| \frac{\mathrm {d}^3 u(s)}{\mathrm {d}s^3} \right| \le M \Vert \varvec{\phi }- \varvec{\phi }' \Vert \, \frac{\mathrm {d}^2 u(s)}{\mathrm {d}s^2} \quad \forall s \in {\mathbb {R}}.\end{aligned}$$
  4. (iv)

    Lipschitz continuous gradient: The function h is L-smooth for some \(L > 0\), that is, we have

    $$\begin{aligned} \Vert \nabla h(\varvec{\phi }) - \nabla h(\varvec{\phi }') \Vert \le L \Vert \varvec{\phi }- \varvec{\phi }' \Vert \quad \forall \varvec{\phi }, \varvec{\phi }' \in {\mathbb {R}}^n. \end{aligned}$$
  5. (v)

    Bounded second moments: There exists \(\sigma > 0\) such that

    $$\begin{aligned} {\mathbb {E}}\left[ \left\| \varvec{g}_t(\varvec{\phi }_{t-1}) - \nabla h(\varvec{\phi }_{t-1}) \right\| ^2 | \mathcal F_{t-1} \right] \le \sigma ^2 \quad \forall t \in \mathbb N. \end{aligned}$$

The averaged gradient descent algorithm with biased gradient oracles lends itself to solving both deterministic and stochastic optimization problems. In deterministic optimization, the gradient oracles \(\varvec{g}_t\) are deterministic and output inexact gradients satisfying \(\Vert \varvec{g}_t(\varvec{\phi }) - \nabla h(\varvec{\phi }) \Vert \le \varepsilon _t\) for all \(\varvec{\phi }\in {\mathbb {R}}^n\), where the tolerances \(\varepsilon _t\) bound the errors associated with the numerical computation of the gradients. A vast body of literature on deterministic optimization focuses on exact gradient oracles for which these tolerances can be set to 0. Inexact deterministic gradient oracles with bounded error tolerances are investigated by Nedić and Bertsekas [112] and d'Aspremont [41]. In this case exact convergence to \(\varvec{\phi }^\star \) is not possible. If the error bounds decrease to 0, however, Luo and Tseng [96], Schmidt et al. [144] and Friedlander and Schmidt [63] show that adaptive gradient descent algorithms are guaranteed to converge to \(\varvec{\phi }^\star \).

In stochastic optimization, the objective function is representable as \(h(\varvec{\phi }) = {\mathbb {E}}[H(\varvec{\phi }, \varvec{x})]\), where the marginal distribution of the random vector \(\varvec{x}\) under \(\mathbb P\) is given by \(\mu \), while the integrand \(H(\varvec{\phi },\varvec{x})\) is convex and differentiable in \(\varvec{\phi }\) and \(\mu \)-integrable in \(\varvec{x}\). In this setting it is convenient to use gradient oracles of the form \(\varvec{g}_t(\varvec{\phi }) = \nabla _{\varvec{\phi }} H(\varvec{\phi }, \varvec{x}_t)\) for all \(t \in \mathbb N\), where the samples \(\varvec{x}_t\) are drawn independently from \(\mu \). As these oracles output unbiased estimates for \(\nabla h(\varvec{\phi })\), all tolerances \(\varepsilon _t\) in Assumption 4.1 (i) may be set to 0. SGD algorithms with unbiased gradient oracles date back to the seminal paper by Robbins and Monro [136]. Nowadays, averaged SGD algorithms with Polyak-Ruppert averaging figure among the most popular variants of the SGD algorithm [113, 132, 143]. For general convex objective functions the best possible convergence rate of any averaged SGD algorithm run over T iterations amounts to \({\mathcal {O}}(1 / \sqrt{T})\), but it improves to \({\mathcal {O}}(1 / T)\) if the objective function is strongly convex; see for example [50, 87, 108, 113, 116, 152, 153, 172]. While smoothness plays a critical role in achieving acceleration in deterministic optimization, it only improves the constants in the convergence rate in stochastic optimization [34, 45, 81, 88, 158]. In fact, Tsybakov [166] demonstrates that smoothness does not provide any acceleration in general, that is, the best possible convergence rate of any averaged SGD algorithm can still not be improved beyond \({\mathcal {O}}(1 / \sqrt{T})\). Nevertheless, a substantial acceleration is possible when focusing on special problem classes such as linear or logistic regression problems [14, 15, 71]. In these special cases, the improvement in the convergence rate is facilitated by a generalized self-concordance property of the objective function [13]. Self-concordance was originally introduced in the context of Newton-type interior point methods [115] and later generalized to facilitate the analysis of probabilistic models [13] and second-order optimization algorithms [159].

In the following we analyze the convergence properties of the averaged SGD algorithm when we only have access to an inexact stochastic gradient oracle, in which case the tolerances \(\varepsilon _t\) cannot be set to 0. To the best of our knowledge, inexact stochastic gradient oracles have only been considered by Cohen et al. [34], Hu et al. [76] and Ajalloeian and Stich [5]. Specifically, Hu et al. [76] use sequential semidefinite programs to analyze the convergence rate of the averaged SGD algorithm when \(\mu \) has a finite support. In contrast, we do not impose any restrictions on the support of \(\mu \). Cohen et al. [34] and Ajalloeian and Stich [5], on the other hand, study the convergence behavior of accelerated gradient descent algorithms for smooth stochastic optimization problems under the assumption that \(\varvec{\phi }\) ranges over a compact domain. The proposed algorithms necessitate a projection onto the compact feasible set in each iteration. In contrast, our convergence analysis does not rely on any compactness assumptions. We note that compactness assumptions have been critical for the convergence analysis of the averaged SGD algorithm in the context of convex stochastic optimization [28, 34, 45, 113]. By leveraging a trick due to Bach [14], however, we can relax this assumption provided that the objective function is Lipschitz continuous.

Proposition 4.2

Consider the inexact gradient descent algorithm (37) with constant step size \(\gamma > 0\). If Assumptions 4.1 (i)–(ii) hold with \(\varepsilon _t \le {{{\bar{\varepsilon }}}}/{(2\sqrt{1+t})}\) for some \({{\bar{\varepsilon }}} \ge 0\), then we have for all \( p \in \mathbb N\) that

$$\begin{aligned} {\mathbb {E}}\left[ \left( h \left( \frac{1}{T} \sum _{t=1}^{T} \varvec{\phi }_{t-1} \right) - h(\varvec{\phi }^\star ) \right) ^p \right] ^{1/p} \le \frac{\Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2}{\gamma T} + 20 \gamma \left( R + {{\bar{\varepsilon }}} \right) ^2 p. \end{aligned}$$

If additionally Assumption 4.1 (iii) holds and if \(G = \max \{ M, R + {{\bar{\varepsilon }}} \}\), then we have for all \( p \in \mathbb N\) that

$$\begin{aligned} {\mathbb {E}}\left[ \left\| \nabla h \left( \frac{1}{T} \sum _{t=1}^{T} \varvec{\phi }_{t-1} \right) \right\| ^{2p} \right] ^{1/p}&\le \frac{G^{2}}{T} \left( 10 \sqrt{p} + \frac{4p}{\sqrt{T}} + 80 G^2 \gamma \sqrt{T} p + \frac{2 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2}{\gamma \sqrt{T}} \right. \\&\quad \left. + \frac{3 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert }{G \gamma \sqrt{T}} \right) ^2. \end{aligned}$$

The proof of Proposition 4.2 relies on two lemmas. In order to state these lemmas concisely, we define the \(L_p\)-norm of a random variable \(\varvec{z} \in {\mathbb {R}}^n\) for any \(p > 0\) through \(\Vert \varvec{z} \Vert _{L_p} = \left( {\mathbb {E}}\left[ \Vert \varvec{z} \Vert ^p \right] \right) ^{1/p}\). For any random variables \(\varvec{z}, \varvec{z}' \in {\mathbb {R}}^n\) and \(p \ge 1\), Minkowski's inequality ([26], § 2.11) then states that

$$\begin{aligned} \Vert \varvec{z} + \varvec{z}' \Vert _{L_p} \le \Vert \varvec{z} \Vert _{L_p} + \Vert \varvec{z}' \Vert _{L_p}. \end{aligned}$$
(38)

Another essential tool for proving Proposition 4.2 is the Burkholder-Rosenthal-Pinelis (BRP) inequality ([130], Theorem 4.1), which we restate below without proof to keep this paper self-contained.

Lemma 4.3

(BRP inequality) Let \(\varvec{z}_t\) be an \(\mathcal F_t\)-measurable random variable for every \(t\in \mathbb N\), and assume that \(p \ge 2\). If \({\mathbb {E}}[\varvec{z}_t | \mathcal F_{t-1}] = 0 \) and \(\Vert \varvec{z}_t \Vert _{L_p}<\infty \) for all \(t \in [T]\), then we have

$$\begin{aligned} \left\| \max _{t \in [T]} \left\| \sum _{k=1}^t \varvec{z}_k \right\| \right\| _{L_p} \le \sqrt{p} \left\| \sum _{t=1}^T {\mathbb {E}}[ \Vert \varvec{z}_t \Vert ^2 | \mathcal F_{t-1}] \right\| _{L_{p/2}}^{1/2} + p \left\| \max _{t \in [T]} \Vert \varvec{z}_t \Vert \right\| _{L_p}. \end{aligned}$$

The following lemma reviews two useful properties of generalized self-concordant functions.

Lemma 4.4

(Generalized self-concordance) Assume that the objective function h of the convex optimization problem (36) is M-generalized self-concordant in the sense of Assumption 4.1 (iii) for some \(M>0\).

  1. (i)

    ([14], Appendix D.2) For any sequence \(\varvec{\phi }_0, \dots , \varvec{\phi }_{T-1} \in {\mathbb {R}}^n\), we have

    $$\begin{aligned} \left\| \nabla h \left( \frac{1}{T} \sum _{t=1}^T \varvec{\phi }_{t-1} \right) - \frac{1}{T} \sum _{t=1}^T \nabla h(\varvec{\phi }_{t-1}) \right\| \le 2 M \left( \frac{1}{T} \sum _{t=1}^T h(\varvec{\phi }_{t-1}) - h(\varvec{\phi }^\star ) \right) . \end{aligned}$$
  2. (ii)

    ([14], Lemma 9) For any \(\varvec{\phi }\in {\mathbb {R}}^n\) with \( \Vert \nabla h(\varvec{\phi }) \Vert \le 3 \kappa / (4 M) \), where \(\kappa \) is the smallest eigenvalue of \(\nabla ^2 h(\varvec{\phi }^\star )\), and \(\varvec{\phi }^\star \) is the optimizer of (36), we have \( h(\varvec{\phi }) - h(\varvec{\phi }^\star ) \le 2 {\Vert \nabla h(\varvec{\phi }) \Vert ^2}/{\kappa }.\)

Armed with Lemmas 4.3 and 4.4, we are now ready to prove Proposition 4.2.

Proof of Proposition 4.2

The first claim generalizes Proposition 5 by Bach [14] to inexact gradient oracles. By the assumed convexity and differentiability of the objective function h, we have

$$\begin{aligned} h(\varvec{\phi }_{k-1})&\le h(\varvec{\phi }^\star ) + \nabla h(\varvec{\phi }_{k-1})^\top (\varvec{\phi }_{k-1} - \varvec{\phi }^\star ) \nonumber \\&= h(\varvec{\phi }^\star ) + \varvec{g}_k(\varvec{\phi }_{k-1})^\top (\varvec{\phi }_{k-1} - \varvec{\phi }^\star ) + \left( \nabla h(\varvec{\phi }_{k-1}) - \varvec{g}_k(\varvec{\phi }_{k-1}) \right) ^\top (\varvec{\phi }_{k-1} - \varvec{\phi }^\star ). \end{aligned}$$
(39)

In addition, elementary algebra yields the recursion

$$\begin{aligned} \Vert \varvec{\phi }_{k} - \varvec{\phi }^\star \Vert ^2 = \Vert \varvec{\phi }_{k} - \varvec{\phi }_{k-1} \Vert ^2 + \Vert \varvec{\phi }_{k-1} - \varvec{\phi }^\star \Vert ^2 + 2 (\varvec{\phi }_{k} - \varvec{\phi }_{k-1})^\top (\varvec{\phi }_{k-1} - \varvec{\phi }^\star ). \end{aligned}$$

Thanks to the update rule (37), this recursion can be re-expressed as

$$\begin{aligned} \varvec{g}_k(\varvec{\phi }_{k-1})^\top (\varvec{\phi }_{k-1} - \varvec{\phi }^\star ) = \frac{1}{2 \gamma } \left( \gamma ^2 \Vert \varvec{g}_k(\varvec{\phi }_{k-1}) \Vert ^2 + \Vert \varvec{\phi }_{k-1} - \varvec{\phi }^\star \Vert ^2 - \Vert \varvec{\phi }_{k} - \varvec{\phi }^\star \Vert ^2 \right) , \end{aligned}$$

where \(\gamma > 0\) is an arbitrary step size. Combining the above identity with (39) then yields

$$\begin{aligned} h(\varvec{\phi }_{k-1}) \le&~h(\varvec{\phi }^\star ) + \frac{1}{2 \gamma } \left( \gamma ^2 \Vert \varvec{g}_k(\varvec{\phi }_{k-1}) \Vert ^2 + \Vert \varvec{\phi }_{k-1} - \varvec{\phi }^\star \Vert ^2 - \Vert \varvec{\phi }_{k} - \varvec{\phi }^\star \Vert ^2 \right) \\&+ \left( \nabla h(\varvec{\phi }_{k-1}) - \varvec{g}_k(\varvec{\phi }_{k-1}) \right) ^\top \! (\varvec{\phi }_{k-1} - \varvec{\phi }^\star ) \\ \le&~h(\varvec{\phi }^\star ) + \frac{1}{2 \gamma } \left( \gamma ^2 R^2 + \Vert \varvec{\phi }_{k-1} - \varvec{\phi }^\star \Vert ^2 - \Vert \varvec{\phi }_{k} - \varvec{\phi }^\star \Vert ^2 \right) \\&+ \left( \nabla h(\varvec{\phi }_{k-1}) - \varvec{g}_k(\varvec{\phi }_{k-1}) \right) ^\top (\varvec{\phi }_{k-1} - \varvec{\phi }^\star ), \end{aligned}$$

where the last inequality follows from Assumption 4.1 (ii). Summing this inequality over k then shows that

$$\begin{aligned} 2 \gamma \sum _{k=1}^t \big ( h ( \varvec{\phi }_{k-1}) - h(\varvec{\phi }^\star ) \big ) + \Vert \varvec{\phi }_{t} - \varvec{\phi }^\star \Vert ^2 \le A_t, \end{aligned}$$
(40)

where

$$\begin{aligned} A_t = t \gamma ^2 R^2 + \Vert \varvec{\phi }_{0} - \varvec{\phi }^\star \Vert ^2 + \sum _{k=1}^t B_k \quad \text {and} \quad B_t = 2 \gamma \left( \nabla h(\varvec{\phi }_{t-1}) - \varvec{g}_t(\varvec{\phi }_{t-1}) \right) ^\top (\varvec{\phi }_{t-1} - \varvec{\phi }^\star ) \end{aligned}$$

for all \(t \in \mathbb N\). Note that the term on the left-hand side of (40) is non-negative because \(\varvec{\phi }^\star \) is a global minimizer of h, which implies that the random variable \(A_t\) is also non-negative for all \(t\in \mathbb N\). For later use we further define \(A_0 = \Vert \varvec{\phi }_{0} - \varvec{\phi }^\star \Vert ^2\). The estimate (40) for \(t=T\) then implies via the convexity of h that

$$\begin{aligned} h \left( \frac{1}{T} \sum _{t=1}^T \varvec{\phi }_{t-1} \right) - h(\varvec{\phi }^\star ) \le \frac{A_T}{2 \gamma T }, \end{aligned}$$
(41)

where we dropped the non-negative term \(\Vert \varvec{\phi }_T-\varvec{\phi }^\star \Vert ^2/(2\gamma T)\) without invalidating the inequality. In the following we analyze the \(L_p\)-norm of \(A_T\) in order to obtain the desired bounds from the proposition statement. To do so, we distinguish three different regimes for \(p \in \mathbb N\), and we show that the \(L_p\)-norm of the non-negative random variable \(A_T\) is upper bounded by an affine function of p in each of these regimes.

Case I (\(p \ge T / 4\)): By using the update rule (37) and Assumption 4.1 (ii), one readily verifies that

$$\begin{aligned} \Vert \varvec{\phi }_k - \varvec{\phi }^\star \Vert \le \Vert \varvec{\phi }_{k-1} - \varvec{\phi }^\star \Vert + \Vert \varvec{\phi }_k - \varvec{\phi }_{k-1} \Vert \le \Vert \varvec{\phi }_{k-1} - \varvec{\phi }^\star \Vert + \gamma R. \end{aligned}$$

Iterating the above recursion k times then yields the conservative estimate \(\Vert \varvec{\phi }_k - \varvec{\phi }^\star \Vert \le \Vert \varvec{\phi }_{0} - \varvec{\phi }^\star \Vert + k \gamma R\). By the definitions of \(A_t\) and \(B_t\) for \(t\in \mathbb N\), we thus have

$$\begin{aligned} A_t&= t \gamma ^2 R^2 + \Vert \varvec{\phi }_{0} - \varvec{\phi }^\star \Vert ^2 + 2 \gamma \sum _{k=1}^t \left( \nabla h(\varvec{\phi }_{k-1}) - \varvec{g}_k(\varvec{\phi }_{k-1}) \right) ^\top (\varvec{\phi }_{k-1} - \varvec{\phi }^\star ) \\&\le t \gamma ^2 R^2 + \Vert \varvec{\phi }_{0} - \varvec{\phi }^\star \Vert ^2 + 4 \gamma R \sum _{k=1}^t \Vert \varvec{\phi }_{k-1} - \varvec{\phi }^\star \Vert \\&\le t \gamma ^2 R^2 + \Vert \varvec{\phi }_{0} - \varvec{\phi }^\star \Vert ^2 + 4 \gamma R \sum _{k=1}^t \left( \Vert \varvec{\phi }_{0} - \varvec{\phi }^\star \Vert + (k-1) \gamma R \right) \\&\le t \gamma ^2 R^2 + \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 4 t \gamma R \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert + 2 t^2 \gamma ^2 R^2 \\&\le t \gamma ^2 R^2 + \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 4 t^2 \gamma ^2 R^2 + \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 2 t^2 \gamma ^2 R^2 \\&\le 7 t^2 \gamma ^2 R^2 + 2 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2, \end{aligned}$$

where the first two inequalities follow from Assumption 4.1 (ii) and the conservative estimate derived above, respectively, while the fourth inequality holds because \(2 a b \le a^2 + b^2\) for all \(a,b\in {\mathbb {R}}\). As \(A_t \ge 0\), the random variable \(A_t\) is bounded and satisfies \(| A_t| \le 2 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 7 t^2 \gamma ^2 R^2\) for all \(t\in \mathbb N\), which implies that

$$\begin{aligned} \Vert A_T \Vert _{L_p} \le 2 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 7 T^2 \gamma ^2 R^2&\le 2 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 28 T \gamma ^2 R^2 p, \end{aligned}$$
(42)

where the last inequality holds because \(p \ge T/4\). Note that the resulting upper bound is affine in p.

Case II \(({2 \le p \le T/4})\): The subsequent analysis relies on the simple bounds

$$\begin{aligned} \max _{t \in [T]} \varepsilon _{t-1} \le \frac{{{\bar{\varepsilon }}}}{2} \quad \text {and} \quad \sum _{t=1}^T \varepsilon _{t-1} \le {{\bar{\varepsilon }}} \sqrt{T}, \end{aligned}$$
(43)

which hold because \(\varepsilon _t \le {{\bar{\varepsilon }}} / (2 \sqrt{1+t})\) by assumption and because \(\sum _{t=1}^T 1 / \sqrt{t} \le 2 \sqrt{T}\), which follows from the telescoping bound \(1/\sqrt{t} \le 2(\sqrt{t} - \sqrt{t-1})\) for every \(t \ge 1\). In addition, it proves useful to introduce the martingale differences \( {{\bar{B}}}_t = B_t - {\mathbb {E}}[B_t | \mathcal F_{t-1}]\) for all \(t\in \mathbb N\). By the definition of \(A_t\) and the subadditivity of the supremum operator, we then have

$$\begin{aligned} \max _{t \in [T+1]} A_{t-1}&= \max _{t \in [T+1]} \left\{ (t-1) \gamma ^2 R^2 + \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + \sum _{k=1}^{t-1} {\mathbb {E}}[B_k | \mathcal F_{k-1}] + \sum _{k=1}^{t-1} {{\bar{B}}}_k \right\} \\&\le T \gamma ^2 R^2 + \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + \max _{t \in [T]} \sum _{k=1}^t {\mathbb {E}}[B_k | \mathcal F_{k-1}] + \max _{t \in [T]} \sum _{k=1}^t {{\bar{B}}}_k . \end{aligned}$$

As \(p \ge 2\), Minkowski’s inequality (38) thus implies that

$$\begin{aligned} \left\| \max _{t \in [T+1]} A_{t-1} \right\| _{L_p}&\le T \gamma ^2 R^2 + \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 \nonumber \\&\quad + \left\| \max _{t \in [T]} \sum _{k=1}^t {\mathbb {E}}[B_k | \mathcal F_{k-1}] \right\| _{L_p} + \left\| \max _{t \in [T]} \sum _{k=1}^t {{\bar{B}}}_k \right\| _{L_p}. \end{aligned}$$
(44)

In order to bound the penultimate term in (44), we first note that

$$\begin{aligned} \left| {\mathbb {E}}[B_k | \mathcal F_{k-1}] \right|&= 2 \gamma \left| {\mathbb {E}}\left[ \left( \nabla h(\varvec{\phi }_{k-1}) - \varvec{g}_k(\varvec{\phi }_{k-1}) \right) | \mathcal F_{k-1} \right] ^\top (\varvec{\phi }_{k-1} - \varvec{\phi }^\star ) \right| \nonumber \\&\le 2 \gamma \Vert {\mathbb {E}}\left[ \left( \nabla h(\varvec{\phi }_{k-1}) - \varvec{g}_k(\varvec{\phi }_{k-1}) \right) | \mathcal F_{k-1} \right] \Vert \Vert \varvec{\phi }_{k-1} - \varvec{\phi }^\star \Vert \nonumber \\&\le 2 \gamma \varepsilon _{k-1} \Vert \varvec{\phi }_{k-1} - \varvec{\phi }^\star \Vert \le 2 \gamma \varepsilon _{k-1}\sqrt{ A_{k-1}} \end{aligned}$$
(45)

for all \(k\in \mathbb N\), where the first inequality follows from the Cauchy–Schwarz inequality, the second inequality holds due to Assumption 4.1 (i), and the last inequality follows from (40). This in turn implies that for all \(t \in [T]\) we have

$$\begin{aligned} \left| \sum _{k=1}^t {\mathbb {E}}[B_k | \mathcal F_{k-1}] \right| \le&\,2 \gamma \sum _{k=1}^t \varepsilon _{k-1} \sqrt{A_{k-1}} \le 2 \gamma \left( \sum _{k=1}^t \varepsilon _{k-1} \right) \left( \max _{k \in [t]} \sqrt{A_{k-1}} \right) \\ \le&\,2 \gamma {{\bar{\varepsilon }}} \sqrt{t} \max _{k \in [t]} \sqrt{A_{k-1}}, \end{aligned}$$

where the last inequality exploits (43). Therefore, the penultimate term in (44) satisfies

$$\begin{aligned} \left\| \max _{t \in [T]} \sum _{k=1}^t {\mathbb {E}}[B_k | \mathcal F_{k-1}] \right\| _{L_p} \le 2 \gamma {{\bar{\varepsilon }}} \sqrt{T} \left\| \max _{t \in [T+1]} \sqrt{A_{t-1}} \right\| _{L_p} = 2 \gamma {{\bar{\varepsilon }}} \sqrt{T} \left\| \max _{t \in [T+1]} A_{t-1} \right\| _{L_{p/2}}^{1/2}, \end{aligned}$$
(46)

where the equality follows from the definition of the \(L_p\)-norm.

Next, we bound the last term in (44) by using the BRP inequality of Lemma 4.3. To this end, note that

$$\begin{aligned} |{{\bar{B}}}_t |&\le | B_t | + | {\mathbb {E}}[B_t | \mathcal F_{t-1}] | \\&\le 2 \gamma \Vert \varvec{\phi }_{t-1} - \varvec{\phi }^\star \Vert \Vert \nabla h(\varvec{\phi }_{t-1}) - \varvec{g}_t(\varvec{\phi }_{t-1}) \Vert + 2 \gamma \varepsilon _{t-1} \sqrt{A_{t-1}} \\&\le 2 \gamma \sqrt{A_{t-1}} \left( \Vert \nabla h(\varvec{\phi }_{t-1}) \Vert + \Vert \varvec{g}_t(\varvec{\phi }_{t-1}) \Vert \right) + 2 \gamma \varepsilon _{t-1} \sqrt{A_{t-1}} \le 2 \gamma (2R + \varepsilon _{t-1}) \sqrt{A_{t-1}} \end{aligned}$$

for all \(t\in \mathbb N\), where the second inequality exploits the definition of \(B_t\) and (45), the third inequality follows from (40), and the last inequality holds because of Assumption 4.1 (ii). Hence, we obtain

$$\begin{aligned} \textstyle \left\| \max _{t \in [T]} | {{\bar{B}}}_t | \right\| _{L_p} \le&\, 2 \gamma \left( 2 R + \max _{t \in [T]} \varepsilon _{t-1} \right) \left\| \max _{t \in [T]} \sqrt{A_{t-1}} \right\| _{L_p}\\ \le&\, ( 4 \gamma R + \gamma {{\bar{\varepsilon }}}) \left\| \max _{t \in [T+1]} A_{t-1} \right\| _{L_{p/2}}^{1/2}, \end{aligned}$$

where the second inequality follows from (43) and the definition of the \(L_p\)-norm. In addition, we have

$$\begin{aligned} \left\| \sum _{t=1}^T {\mathbb {E}}[ {{\bar{B}}}_t^2 | \mathcal F_{t-1}] \right\| _{L_{p/2}}^{1/2}&= \left\| \sqrt{\sum _{t=1}^T {\mathbb {E}}[ {{\bar{B}}}_t^2 | \mathcal F_{t-1}]} \right\| _{L_p}\\&\le 2 \gamma \left\| \sqrt{ \sum _{t=1}^T (2R + \varepsilon _{t-1})^2 A_{t-1} } \right\| _{L_p} \\&\le 2 \gamma \left( \sum _{t=1}^T (2R + \varepsilon _{t-1})^2 \right) ^{1/2} \left\| \max _{t \in [T+1]} A_{t-1}^{1/2} \right\| _{L_p} \\&\le 2 \gamma \left( 2 R \sqrt{T} + \sqrt{\sum _{t=1}^T \varepsilon _{t-1}^2} \right) \left\| \max _{t \in [T+1]} A_{t-1}^{1/2} \right\| _{L_p} \\&\le \left( 4 \gamma R \sqrt{T} + \gamma {{\bar{\varepsilon }}} \sqrt{T} \right) \left\| \max _{t \in [T+1]} A_{t-1} \right\| _{L_{p/2}}^{1/2}, \end{aligned}$$

where the first inequality exploits the upper bound on \(|{{\bar{B}}}_t|\) derived above, which implies that \({\mathbb {E}}[ {{\bar{B}}}_t ^2 | \mathcal F_{t-1}] \le 4 \gamma ^2 (2R + \varepsilon _{t-1})^2 A_{t-1}\). The last three inequalities follow from the Hölder inequality, the triangle inequality for the Euclidean norm and the two inequalities in (43), respectively. Recalling that \(p \ge 2\), we may then apply the BRP inequality of Lemma 4.3 to the martingale differences \({{\bar{B}}}_t\), \(t\in [T]\), and use the bounds derived in the last two display equations in order to conclude that

$$\begin{aligned} \left\| \max _{t \in [T]} \left| \sum _{k=1}^t {{\bar{B}}}_k \right| \right\| _{L_p}&\le \left( 4 \gamma R \sqrt{pT} + \gamma {{\bar{\varepsilon }}} \sqrt{pT} + \gamma {{\bar{\varepsilon }}} p + 4 \gamma R p \right) \left\| \max _{t \in [T+1]} A_{t-1} \right\| _{L_{p/2}}^{1/2}. \end{aligned}$$
(47)

Substituting (46) and (47) into (44), we thus obtain

$$\begin{aligned} \left\| \max _{t \in [T+1]} A_{t-1} \right\| _{L_p}&\le T \gamma ^2 R^2 + \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + \left( 4 \gamma R \left( \sqrt{pT} + p \right) \right. \\&\left. \quad + \gamma {{\bar{\varepsilon }}} \left( \sqrt{pT} + p +2 \sqrt{T} \right) \right) \left\| \max _{t \in [T+1]} A_{t-1} \right\| _{L_{p/2}}^{1/2} \\&\le T \gamma ^2 R^2 + \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 6 \gamma \left( R + {{\bar{\varepsilon }}} \right) \sqrt{pT} \left\| \max _{t \in [T+1]} A_{t-1} \right\| _{L_{p/2}}^{1/2}, \end{aligned}$$

where the second inequality holds because \(p \le T/4\) by assumption, which implies that \(\sqrt{pT} + p \le 1.5 \sqrt{pT} \) and \( \sqrt{pT} + p + 2 \sqrt{T} \le 6 \sqrt{pT}\). As Jensen’s inequality ensures that \(\Vert \varvec{z} \Vert _{L_{p/2}} \le \Vert \varvec{z} \Vert _{L_p}\) for any random variable \(\varvec{z}\) and \(p > 0\), the following inequality holds for all \(2 \le p \le T/4\).

$$\begin{aligned} \left\| \max _{t \in [T+1]} A_{t-1} \right\| _{L_p}&\le T \gamma ^2 R^2 + \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 6 \gamma \left( R + {{\bar{\varepsilon }}} \right) \sqrt{pT} \left\| \max _{t \in [T+1]} A_{t-1} \right\| _{L_p}^{1/2} \end{aligned}$$

To complete the proof of Case II, we note that for any numbers \(a, b, c \ge 0\) the inequality \(c \le a + 2b \sqrt{c} \) is equivalent to \(\sqrt{c} \le b + \sqrt{b^2+a}\) and therefore also to \(c \le (b + \sqrt{b^2+a})^2 \le 4b^2 + 2a\), where the last estimate uses \((x+y)^2 \le 2x^2 + 2y^2\) for all \(x,y\in {\mathbb {R}}\). Identifying a with \(T \gamma ^2 R^2 + \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2\), b with \(3\gamma \left( R + {{\bar{\varepsilon }}} \right) \sqrt{pT}\) and c with \(\Vert \max _{t \in [T+1]} A_{t-1}\Vert _{L_p}\) then allows us to translate the inequality in the last display equation to

$$\begin{aligned} \left\| A_{T} \right\| _{L_p} \le \left\| \max _{t \in [T+1]} A_{t-1} \right\| _{L_p}&\le 2 T \gamma ^2 R^2 + 2 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 36 \gamma ^2 \left( R + {{\bar{\varepsilon }}} \right) ^2 p T. \end{aligned}$$
(48)

Thus, for any \(2 \le p \le T/4\), we have again found an upper bound on \(\Vert A_{T}\Vert _{L_p}\) that is affine in p.

Case III \(({p = 1})\): Recalling the definition of \(A_T\ge 0\), we find that

$$\begin{aligned} \Vert A_T \Vert _{L_{1}} = {\mathbb {E}}[A_T]&= T \gamma ^2 R^2 + \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + {\mathbb {E}}\left[ \, \sum _{t=1}^T {\mathbb {E}}[B_t | \mathcal F_{t-1}] \right] \\&\le T \gamma ^2 R^2 + \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + \left\| \max _{t \in [T]} \sum _{k=1}^t {\mathbb {E}}[B_k | \mathcal F_{k-1}] \right\| _{L_1} \\&\le T \gamma ^2 R^2 + \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 2 \gamma {{\bar{\varepsilon }}} \sqrt{T} \left\| \max _{t \in [T+1]} A_{t-1} \right\| ^{1/2}_{L_{1/2}} \\&\le T \gamma ^2 R^2 + \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 2 \gamma {{\bar{\varepsilon }}} \sqrt{T} \left\| \max _{t \in [T+1]} A_{t-1} \right\| ^{1/2}_{L_{2}}, \end{aligned}$$

where the second inequality follows from the estimate (46), which indeed holds for all \(p\in \mathbb N\), while the last inequality follows from Jensen’s inequality. By the second inequality in (48) for \(p=2\), we thus find

$$\begin{aligned} \Vert A_T \Vert _{L_{1}}&\le T \gamma ^2 R^2 + \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 2 {{\bar{\varepsilon }}} \gamma \sqrt{T} \cdot \sqrt{2 T \gamma ^2 R^2 + 2 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 72 \gamma ^2 (R + {{\bar{\varepsilon }}})^2 T} \end{aligned}$$
(49a)
$$\begin{aligned}&\le 2 T \gamma ^2 R^2 + 2 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 36 \gamma ^2 (R + {{\bar{\varepsilon }}})^2 T + 2 {{\bar{\varepsilon }}}^2 \gamma ^2 T , \end{aligned}$$
(49b)

where the last inequality holds because \(2ab \le 2a^2 + b^2/ 2\) for all \(a,b\in {\mathbb {R}}\).

We now combine the bounds derived in Cases I, II and III to obtain a universal bound on \(\left\| A_{T} \right\| _{L_p}\) that holds for all \(p\in \mathbb N\). Specifically, one readily verifies that the bound

$$\begin{aligned} \left\| A_{T} \right\| _{L_p}&\le 2 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 40 \gamma ^2 \left( R + {{\bar{\varepsilon }}} \right) ^2 p T, \end{aligned}$$
(50)

is more conservative than each of the bounds (42), (48) and (49), and thus it indeed holds for any \(p \in \mathbb N\). Combining this universal bound with (41) proves the first inequality from the proposition statement.

In order to prove the second inequality, we need to extend ([14], Proposition 7) to biased gradient oracles. To this end, we first note that

$$\begin{aligned} \left\| \nabla h \left( \frac{1}{T} \sum _{t=1}^{T} \varvec{\phi }_{t-1} \right) \right\|&\le \left\| \nabla h \left( \frac{1}{T} \sum _{t=1}^{T} \varvec{\phi }_{t-1} \right) - \frac{1}{T} \sum _{t=1}^T \nabla h(\varvec{\phi }_{t-1}) \right\| + \left\| \frac{1}{T} \sum _{t=1}^T \nabla h(\varvec{\phi }_{t-1}) \right\| \\&\le 2 M \left( \frac{1}{T} \sum _{t=1}^T h(\varvec{\phi }_{t-1}) - h(\varvec{\phi }^\star ) \right) + \left\| \frac{1}{T} \sum _{t=1}^T \nabla h(\varvec{\phi }_{t-1}) \right\| \\&\le \frac{M}{T \gamma } A_T + \left\| \frac{1}{T} \sum _{t=1}^T \nabla h(\varvec{\phi }_{t-1}) \right\| , \end{aligned}$$

where the second inequality follows from Lemma 4.4 (i), and the third inequality holds due to (40). By Minkowski’s inequality (38), we thus have for any \(p \ge 1\) that

$$\begin{aligned} \left\| \nabla h \left( \frac{1}{T} \sum _{t=1}^{T} \varvec{\phi }_{t-1} \right) \right\| _{L_{2p}}&\le \frac{M}{T \gamma } \Vert A_T \Vert _{L_{2p}} + \left\| \frac{1}{T} \sum _{t=1}^T \nabla h(\varvec{\phi }_{t-1}) \right\| _{L_{2p}} \\&\le \frac{2 M}{T \gamma } \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 80 M \gamma \left( R + {{\bar{\varepsilon }}} \right) ^2 p\\&\qquad + \left\| \frac{1}{T} \sum _{t=1}^T \nabla h(\varvec{\phi }_{t-1}) \right\| _{L_{2p}}, \end{aligned}$$

where the last inequality follows from the universal bound (50). In order to estimate the last term in the above expression, we recall that the update rule (37) is equivalent to \(\varvec{g}_t(\varvec{\phi }_{t-1}) = \left( \varvec{\phi }_{t-1} - \varvec{\phi }_{t} \right) / \gamma ,\) which in turn implies that \(\sum _{t=1}^T \varvec{g}_t(\varvec{\phi }_{t-1}) = \left( \varvec{\phi }_0 - \varvec{\phi }_T \right) / \gamma .\) Hence, for any \(p \ge 1\), we have

$$\begin{aligned}&\left\| \frac{1}{T} \sum _{t=1}^T \nabla h(\varvec{\phi }_{t-1}) \right\| _{L_{2p}}\\&\quad = \left\| \frac{1}{T} \sum _{t=1}^T \Big ( \nabla h(\varvec{\phi }_{t-1}) - \varvec{g}_t(\varvec{\phi }_{t-1}) \Big ) + \frac{\varvec{\phi }_0 - \varvec{\phi }^\star }{T \gamma } + \frac{\varvec{\phi }^\star - \varvec{\phi }_T}{T \gamma } \right\| _{L_{2p}} \\&\quad \le \left\| \frac{1}{T} \sum _{t=1}^T \nabla h(\varvec{\phi }_{t-1}) - \varvec{g}_t(\varvec{\phi }_{t-1}) \right\| _{L_{2p}} + \frac{1}{T \gamma } \left\| \varvec{\phi }_0 - \varvec{\phi }^\star \right\| + \frac{1}{T \gamma } \left\| \varvec{\phi }^\star - \varvec{\phi }_T \right\| _{L_{2p}} \\&\quad \le \left\| \frac{1}{T} \sum _{t=1}^T \nabla h(\varvec{\phi }_{t-1}) - \varvec{g}_t(\varvec{\phi }_{t-1}) \right\| _{L_{2p}} + \frac{1}{T \gamma } \left\| \varvec{\phi }_0 - \varvec{\phi }^\star \right\| + \frac{1}{T \gamma } \left\| A_T \right\| _{L_{p}}^{1/2} \\&\quad \le \left\| \frac{1}{T} \sum _{t=1}^T \nabla h(\varvec{\phi }_{t-1}) - \varvec{g}_t(\varvec{\phi }_{t-1}) \right\| _{L_{2p}} + \frac{1 + \sqrt{2}}{T \gamma } \left\| \varvec{\phi }_0 - \varvec{\phi }^\star \right\| \\&\qquad + \frac{2 \sqrt{10} \left( R + {{\bar{\varepsilon }}} \right) \sqrt{p}}{\sqrt{T}}, \end{aligned}$$

where the first inequality exploits Minkowski’s inequality (38), the second inequality follows from (40), which implies that \(\Vert \varvec{\phi }^\star - \varvec{\phi }_T \Vert \le \sqrt{A_T}\), and the definition of the \(L_p\)-norm. The last inequality in the above expression is a direct consequence of the universal bound (50) and the inequality \( \sqrt{a+b} \le \sqrt{a} + \sqrt{b}\) for all \(a,b\ge 0\). Next, define for any \(t\in \mathbb N\) a martingale difference of the form

$$\begin{aligned}\varvec{C}_t = \frac{1}{T} \Big ( \nabla h(\varvec{\phi }_{t-1}) - \varvec{g}_t(\varvec{\phi }_{t-1}) - {\mathbb {E}}[\nabla h(\varvec{\phi }_{t-1}) - \varvec{g}_t(\varvec{\phi }_{t-1}) | \mathcal F_{t-1}] \Big ).\end{aligned}$$

Note that these martingale differences are bounded because

$$\begin{aligned} \Vert \varvec{C}_t \Vert&\le \frac{1}{T} \Big ( \Vert \nabla h(\varvec{\phi }_{t-1}) \Vert + \Vert \varvec{g}_t(\varvec{\phi }_{t-1}) \Vert + \Vert {\mathbb {E}}[\nabla h(\varvec{\phi }_{t-1}) - \varvec{g}_t(\varvec{\phi }_{t-1}) | \mathcal F_{t-1}] \Vert \Big )\\&\le \frac{2R + \varepsilon _{t-1}}{T}\\&\le \frac{2R + {{\bar{\varepsilon }}}}{T}, \end{aligned}$$

and thus the BRP inequality of Lemma 4.3 implies that

$$\begin{aligned} \left\| \sum _{t=1}^T \varvec{C}_t \right\| _{L_{2p}} \le \sqrt{2p} \, \frac{2R + {{\bar{\varepsilon }}}}{\sqrt{T}} + 2p \, \frac{2R + {{\bar{\varepsilon }}}}{T}. \end{aligned}$$

Recalling the definition of the martingale differences \(\varvec{C}_t\), \(t\in \mathbb N\), this bound allows us to conclude that

$$\begin{aligned}&\frac{1}{T} \left\| \sum _{t=1}^T \nabla h(\varvec{\phi }_{t-1}) - \varvec{g}_t(\varvec{\phi }_{t-1}) \right\| _{L_{2p}}\\&\quad \le \left\| \sum _{t=1}^T \varvec{C}_t \right\| _{L_{2p}} + \frac{1}{T} \left\| \sum _{t=1}^T {\mathbb {E}}[\nabla h(\varvec{\phi }_{t-1}) - \varvec{g}_t(\varvec{\phi }_{t-1}) | \mathcal F_{t-1}] \right\| _{L_{2p}} \\&\quad \le \sqrt{2p} \, \frac{2R + {{\bar{\varepsilon }}}}{\sqrt{T}} + 2p \, \frac{2R + {{\bar{\varepsilon }}}}{T} + \frac{{{\bar{\varepsilon }}}}{\sqrt{T}} \le 2 \sqrt{2p} \, \frac{R + {{\bar{\varepsilon }}}}{\sqrt{T}} + 4p \, \frac{R + {{\bar{\varepsilon }}}}{T}, \end{aligned}$$

where the second inequality exploits Assumption 4.1 (i) as well as the second inequality in (43). Combining all inequalities derived above and observing that \(2\sqrt{2} + 2 \sqrt{10} < 10 \) finally yields

$$\begin{aligned}&\left\| \nabla h \left( \frac{1}{T} \sum _{t=1}^{T} \varvec{\phi }_{t-1} \right) \right\| _{L_{2p}}\\&\quad \le \frac{2 M}{T \gamma } \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 80 M \gamma \left( R + {{\bar{\varepsilon }}} \right) ^2 p + 2 \sqrt{2p} \, \frac{R + {{\bar{\varepsilon }}}}{\sqrt{T}} + 4p \, \frac{R + {{\bar{\varepsilon }}}}{T} \\&\qquad + \frac{1 + \sqrt{2}}{T \gamma } \left\| \varvec{\phi }_0 - \varvec{\phi }^\star \right\| + \frac{2 \sqrt{10} \left( R + {{\bar{\varepsilon }}} \right) \sqrt{p}}{\sqrt{T}} \\&\quad \le \frac{G}{\sqrt{T}} \left( 10 \sqrt{p} + \frac{4p}{\sqrt{T}} + 80 G^2 \gamma \sqrt{T} p + \frac{2}{\gamma \sqrt{T}} \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + \frac{3}{G \gamma \sqrt{T}} \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert \right) , \end{aligned}$$

where \(G = \max \{ M, R + {{\bar{\varepsilon }}} \}\). This proves the second inequality from the proposition statement. \(\square \)

The following corollary follows immediately from the proof of Proposition 4.2.

Corollary 4.5

Consider the inexact gradient descent algorithm (37) with constant step size \(\gamma > 0\). If Assumptions 4.1 (i)–(ii) hold with \(\varepsilon _t \le {{{\bar{\varepsilon }}}}/{(2\sqrt{1+t})}\) for some \({{\bar{\varepsilon }}} \ge 0\), then we have

$$\begin{aligned} \frac{1}{T} \sum _{t=1}^T \mathbb E \left[ \left( \nabla h(\varvec{\phi }_{t-1}) - \varvec{g}_t(\varvec{\phi }_{t-1}) \right) ^\top (\varvec{\phi }_{t-1} - \varvec{\phi }^\star ) \right] \le \frac{{{\bar{\varepsilon }}}}{\sqrt{T}} \sqrt{2 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 74 \gamma ^2 (R + {{\bar{\varepsilon }}})^2 T}. \end{aligned}$$

Proof of Corollary 4.5

Defining \(B_t\) as in the proof of Proposition 4.2, we find

$$\begin{aligned}&\frac{1}{T} \sum _{t=1}^T \mathbb E \left[ \left( \nabla h(\varvec{\phi }_{t-1}) - \varvec{g}_t(\varvec{\phi }_{t-1}) \right) ^\top (\varvec{\phi }_{t-1} - \varvec{\phi }^\star ) \right] \\&\quad = \frac{1}{2 \gamma T} {\mathbb {E}}\left[ \sum _{t=1}^T {\mathbb {E}}[B_t | \mathcal F_{t-1}] \right] \\&\quad \le \frac{{{\bar{\varepsilon }}}}{\sqrt{T}} \sqrt{2 T \gamma ^2 R^2 + 2 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 72 \gamma ^2 (R + {{\bar{\varepsilon }}})^2 T}, \end{aligned}$$

where the inequality is an immediate consequence of the reasoning in Case III in the proof of Proposition 4.2. The claim then follows from the trivial inequality \(R+ {{\bar{\varepsilon }}} \ge R\). \(\square \)

Armed with Proposition 4.2 and Corollary 4.5, we are now ready to prove the main convergence result.

Theorem 4.6

Consider the inexact gradient descent algorithm (37) with constant step size \(\gamma > 0\). If Assumptions 4.1 (i)–(ii) hold with \(\varepsilon _t \le {{{\bar{\varepsilon }}}}/{(2\sqrt{1+t})}\) for some \({{\bar{\varepsilon }}} \ge 0\), then the following statements hold.

  1. (i)

    If \(\gamma = 1 / (2 (R + {{\bar{\varepsilon }}})^2 \sqrt{T})\), then we have

    $$\begin{aligned} {\mathbb {E}}\left[ h \left( \frac{1}{T} \sum _{t=1}^{T} \varvec{\phi }_{t-1} \right) \right] - h(\varvec{\phi }^\star )&\le \frac{(R + {{\bar{\varepsilon }}})^2}{\sqrt{T}} \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + \frac{1}{4\sqrt{T}}\\&\quad + \frac{{{\bar{\varepsilon }}}}{\sqrt{T}} \sqrt{2 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + \frac{37}{2(R + {{\bar{\varepsilon }}})^2}} . \end{aligned}$$
  2. (ii)

If \(\gamma = 1 / (2 (R + {{\bar{\varepsilon }}})^2 \sqrt{T} + L)\) and Assumptions 4.1 (iv)–(v) hold in addition to the blanket assumptions mentioned above, then we have

    $$\begin{aligned} {\mathbb {E}}\left[ h \left( \frac{1}{T} \sum _{t=1}^{T} \varvec{\phi }_{t} \right) \right] - h(\varvec{\phi }^\star )&\le \frac{L}{2T}\Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + \frac{(R + {{\bar{\varepsilon }}})^2}{\sqrt{T}} \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2\\&\qquad + \frac{\sigma ^2}{4 (R+{{\bar{\varepsilon }}})^2\sqrt{T}} \\&\qquad + \frac{{{\bar{\varepsilon }}}}{\sqrt{T}} \sqrt{2 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + \frac{37}{2(R + {{\bar{\varepsilon }}})^2}}. \end{aligned}$$
  3. (iii)

    If \(\gamma = 1 / (2 G^2 \sqrt{T})\) with \(G = \max \{M, R + {{\bar{\varepsilon }}} \}\), the smallest eigenvalue \(\kappa \) of \(\nabla ^2 h(\varvec{\phi }^\star )\) is strictly positive and Assumption 4.1 (iii) holds in addition to the blanket assumptions mentioned above, then we have

    $$\begin{aligned} \mathbb E \left[ h\left( \frac{1}{T} \sum _{t=1}^{T} \varvec{\phi }_{t-1}\right) \right] - h(\varvec{\phi }^\star )&\le \frac{G^2}{\kappa T} \left( 4 G \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert + 20 \right) ^4. \end{aligned}$$
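The three step-size choices in Theorem 4.6 depend only on the constants from Assumption 4.1. A small helper along the following lines (an illustration of ours, not code from the paper) makes this dependence explicit.

```python
import numpy as np

def step_size(T, R, eps_bar, M=None, L=None, variant="i"):
    """Step sizes prescribed by Theorem 4.6 (illustrative sketch).

    variant "i":   1 / (2 (R + eps_bar)^2 sqrt(T))
    variant "ii":  1 / (2 (R + eps_bar)^2 sqrt(T) + L), requires L-smoothness
    variant "iii": 1 / (2 G^2 sqrt(T)) with G = max(M, R + eps_bar)
    """
    base = 2.0 * (R + eps_bar) ** 2 * np.sqrt(T)
    if variant == "i":
        return 1.0 / base
    if variant == "ii":
        return 1.0 / (base + L)
    G = max(M, R + eps_bar)
    return 1.0 / (2.0 * G ** 2 * np.sqrt(T))
```

Note that all three step sizes are of order \(T^{-1/2}\), yet assertion (iii) upgrades the suboptimality guarantee from \(\mathcal O(T^{-1/2})\) to \(\mathcal O(T^{-1})\) whenever \(\kappa > 0\).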

The proof of Theorem 4.6 relies on the following concentration inequalities due to Bach [14].

Lemma 4.7

(Concentration inequalities)

  1. (i)

    ([14], Lemma 11): If there exist \(a,b>0\) and a random variable \(\varvec{z} \in {\mathbb {R}}^n\) with \( \Vert \varvec{z} \Vert _{L_p} \le a + b p \) for all \(p \in \mathbb N\), then we have

    $$\begin{aligned} \mathbb P \left[ \Vert \varvec{z} \Vert \ge 3 b s + 2 a \right] \le 2 \exp (-s)\quad \forall s \ge 0. \end{aligned}$$
  2. (ii)

    ([14], Lemma 12): If there exist \(a,b,c>0\) and a random variable \(\varvec{z} \in {\mathbb {R}}^n\) with \( \Vert \varvec{z} \Vert _{L_p} \le (a \sqrt{p} + b p + c)^2 \) for all \(p \in [T]\), then we have

    $$\begin{aligned} \mathbb P \left[ \Vert \varvec{z} \Vert \ge (2 a \sqrt{s} + 2 b s + 2 c)^2 \right] \le 4 \exp (-s)\quad \forall s \le T. \end{aligned}$$

Proof of Theorem 4.6

Define \(A_t\) as in the proof of Proposition 4.2. Then, we have

$$\begin{aligned} {\mathbb {E}}\left[ h \left( \frac{1}{T} \sum _{t=0}^{T-1} \varvec{\phi }_{t} \right) - h(\varvec{\phi }^\star ) \right]&\le \frac{\mathbb E[A_T]}{2 \gamma T} = \frac{\Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2}{2 \gamma T} + \frac{\gamma R^2}{2}\nonumber \\&\quad + \frac{1}{T} \sum _{t=1}^T \mathbb E \left[ \left( \nabla h(\varvec{\phi }_{t-1}) - \varvec{g}_t(\varvec{\phi }_{t-1}) \right) ^\top (\varvec{\phi }_{t-1} - \varvec{\phi }^\star ) \right] \nonumber \\&\le \frac{\Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2}{2 \gamma T} + \frac{\gamma R^2}{2} \nonumber \\&\quad + \frac{{{\bar{\varepsilon }}}}{\sqrt{T}} \sqrt{2 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 74 \gamma ^2 (R + {{\bar{\varepsilon }}})^2 T}, \end{aligned}$$
(51)

where the two inequalities follow from (41) and from Corollary 4.5, respectively. Setting the step size to \(\gamma = 1 / ( 2 (R+ {{\bar{\varepsilon }}})^2 \sqrt{T} )\) then completes the proof of assertion (i).

Assertion (ii) generalizes ([45], Theorem 1). By the L-smoothness of \(h(\varvec{\phi })\), we have

$$\begin{aligned} h(\varvec{\phi }_{t})&\le h(\varvec{\phi }_{t-1}) + \nabla h(\varvec{\phi }_{t-1})^\top (\varvec{\phi }_{t} - \varvec{\phi }_{t-1}) + \frac{L}{2}\Vert \varvec{\phi }_{t} - \varvec{\phi }_{t-1}\Vert ^2 \nonumber \\&= h(\varvec{\phi }_{t-1}) + \varvec{g}_t(\varvec{\phi }_{t-1})^\top (\varvec{\phi }_{t} - \varvec{\phi }_{t-1}) + \left( \nabla h(\varvec{\phi }_{t-1}) - \varvec{g}_t(\varvec{\phi }_{t-1}) \right) ^\top \nonumber \\&\quad (\varvec{\phi }_{t} - \varvec{\phi }_{t-1}) + \frac{L}{2}\Vert \varvec{\phi }_{t} - \varvec{\phi }_{t-1}\Vert ^2 \nonumber \\&\le h(\varvec{\phi }_{t-1}) + \varvec{g}_t(\varvec{\phi }_{t-1})^\top (\varvec{\phi }_{t} - \varvec{\phi }_{t-1}) + \frac{\zeta }{2}\Vert \nabla h(\varvec{\phi }_{t-1}) - \varvec{g}_t(\varvec{\phi }_{t-1}) \Vert ^2\nonumber \\&\quad + \frac{L + 1/\zeta }{2}\Vert \varvec{\phi }_{t} - \varvec{\phi }_{t-1}\Vert ^2, \end{aligned}$$
(52)

where the last inequality exploits the Cauchy-Schwarz inequality together with the elementary inequality \(2ab \le \zeta a^2 + b^2 / \zeta \), which holds for all \(a,b\in {\mathbb {R}}\) and \(\zeta > 0\). Next, note that the iterates satisfy the recursion

$$\begin{aligned} \Vert \varvec{\phi }_{t-1} - \varvec{\phi }^\star \Vert ^2 = \Vert \varvec{\phi }_{t-1} - \varvec{\phi }_{t} \Vert ^2 + \Vert \varvec{\phi }_{t} - \varvec{\phi }^\star \Vert ^2 + 2 (\varvec{\phi }_{t-1} - \varvec{\phi }_{t})^\top (\varvec{\phi }_{t} - \varvec{\phi }^\star ), \end{aligned}$$

which can be re-expressed as

$$\begin{aligned} \varvec{g}_t(\varvec{\phi }_{t-1})^\top (\varvec{\phi }_{t} - \varvec{\phi }^\star ) = \frac{1}{2 \gamma } \left( \Vert \varvec{\phi }_{t-1} - \varvec{\phi }^\star \Vert ^2 - \Vert \varvec{\phi }_{t-1} - \varvec{\phi }_{t} \Vert ^2 - \Vert \varvec{\phi }_{t} - \varvec{\phi }^\star \Vert ^2 \right) \end{aligned}$$

by using the update rule (37). In the remainder of the proof we assume that \(0< \gamma < 1 / L\). Substituting the above equality into (52) and setting \(\zeta = \gamma / (1 - \gamma L)\) then yields

$$\begin{aligned} h(\varvec{\phi }_{t})&\le h(\varvec{\phi }_{t-1}) + \varvec{g}_t(\varvec{\phi }_{t-1})^\top (\varvec{\phi }^\star - \varvec{\phi }_{t-1}) + \frac{\gamma }{2(1 - \gamma L)} \Vert \nabla h(\varvec{\phi }_{t-1}) - \varvec{g}_t(\varvec{\phi }_{t-1}) \Vert ^2 \\&\qquad + \frac{1}{2 \gamma } \left( \Vert \varvec{\phi }_{t-1} - \varvec{\phi }^\star \Vert ^2 - \Vert \varvec{\phi }_{t} - \varvec{\phi }^\star \Vert ^2 \right) . \end{aligned}$$

By the convexity of h, we have \(h(\varvec{\phi }^\star ) \ge h(\varvec{\phi }_{t-1}) + \nabla h(\varvec{\phi }_{t-1})^\top (\varvec{\phi }^\star - \varvec{\phi }_{t-1})\), which finally implies that

$$\begin{aligned} h(\varvec{\phi }_{t})&\le h(\varvec{\phi }^\star ) + \left( \nabla h(\varvec{\phi }_{t-1}) - \varvec{g}_t(\varvec{\phi }_{t-1}) \right) ^\top ( \varvec{\phi }_{t-1} - \varvec{\phi }^\star )\\&\qquad + \frac{\gamma }{2(1 - \gamma L)} \Vert \nabla h(\varvec{\phi }_{t-1}) - \varvec{g}_t(\varvec{\phi }_{t-1}) \Vert ^2 \\&\qquad + \frac{1}{2\gamma } \left( \Vert \varvec{\phi }_{t-1} - \varvec{\phi }^\star \Vert ^2 - \Vert \varvec{\phi }_{t} - \varvec{\phi }^\star \Vert ^2 \right) . \end{aligned}$$

Averaging the above inequality over t and taking expectations then yields the estimate

$$\begin{aligned}&\mathbb E \left[ \frac{1}{T} \sum _{t=1}^T h(\varvec{\phi }_{t}) \right] - h(\varvec{\phi }^\star )\\&\quad \le \frac{\Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2}{2\gamma T} + \frac{\gamma }{2 (1 - \gamma L)} \mathbb E \left[ \frac{1}{T} \sum _{t=1}^T \Vert \nabla h(\varvec{\phi }_{t-1}) - \varvec{g}_t(\varvec{\phi }_{t-1}) \Vert ^2 \right] \\&\qquad + \mathbb E \left[ \frac{1}{T} \sum _{t=1}^T \left( \nabla h(\varvec{\phi }_{t-1}) - \varvec{g}_t(\varvec{\phi }_{t-1}) \right) ^\top (\varvec{\phi }_{t-1} - \varvec{\phi }^\star ) \right] \\&\quad \le \frac{\Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2}{2\gamma T} + \frac{\gamma \sigma ^2}{2 (1 - \gamma L)} + \frac{{{\bar{\varepsilon }}}}{\sqrt{T}} \sqrt{2 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 74 \gamma ^2 (R + {{\bar{\varepsilon }}})^2 T}, \end{aligned}$$

where the second inequality exploits Assumption 4.1 (v) and Corollary 4.5. Using Jensen’s inequality to move the average over t inside h, assertion (ii) then follows by setting \(\gamma = 1 / (2 (R + {{\bar{\varepsilon }}})^2 \sqrt{T} + L)\) and observing that \(\gamma / ( 1 - \gamma L) = 1 / ( 2(R+{{\bar{\varepsilon }}})^2 \sqrt{T} )\).

To prove assertion (iii), we distinguish two different cases.

Case I: Assume first that \(4 G^2 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 6 G \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert \le {\kappa \sqrt{T}}/{(8 G^2)}\), where \(G = \max \{M, R + {{\bar{\varepsilon }}} \}\) and \(\kappa \) denotes the smallest eigenvalue of \(\nabla ^2 h(\varvec{\phi }^\star )\). By a standard formula for the expected value of a non-negative random variable, we find

$$\begin{aligned}&{\mathbb {E}}\left[ h \left( \frac{1}{T} \sum _{t=1}^{T} \varvec{\phi }_{t-1} \right) - h(\varvec{\phi }^\star ) \right] \nonumber \\&\quad = \int _{0}^\infty \mathbb P \left[ h \left( \frac{1}{T} \sum _{t=1}^{T} \varvec{\phi }_{t-1} \right) - h(\varvec{\phi }^\star ) \ge u \right] \mathrm {d}u \nonumber \\&\quad = \int _{0}^{u_1} \mathbb P \left[ h \left( \frac{1}{T} \sum _{t=1}^{T} \varvec{\phi }_{t-1} \right) - h(\varvec{\phi }^\star ) \ge u \right] \mathrm {d}u \nonumber \\&\qquad + \int _{u_1}^{u_2} \mathbb P \left[ h \left( \frac{1}{T} \sum _{t=1}^{T} \varvec{\phi }_{t-1} \right) - h(\varvec{\phi }^\star ) \ge u \right] \mathrm {d}u \nonumber \\&\qquad + \int _{u_2}^{\infty } \mathbb P \left[ h \left( \frac{1}{T} \sum _{t=1}^{T} \varvec{\phi }_{t-1} \right) - h(\varvec{\phi }^\star ) \ge u \right] \mathrm {d}u, \end{aligned}$$
(53)

where \(u_1 = \frac{8 G^2}{\kappa T}(4 G^2 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 6 G \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert )^2\) and \(u_2 = \frac{8 G^2}{\kappa T}(\frac{\kappa \sqrt{T}}{4 G^2} + 4 G^2 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 6 G \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert )^2\). The first of the three integrals in (53) is trivially upper bounded by \(u_1\). Next, we investigate the third integral in (53), which is easier to bound from above than the second one. By combining the first inequality in Proposition 4.2 for \(\gamma = 1 / (2 G^2 \sqrt{T})\) with the trivial inequality \(G \ge R + {{\bar{\varepsilon }}}\), we find

$$\begin{aligned} \left\| h\left( \frac{1}{T} \sum _{t=1}^{T} \varvec{\phi }_{t-1} \right) - h(\varvec{\phi }^\star ) \right\| _{L_p} \le \frac{2G^2}{\sqrt{T}}\,\Vert \varvec{\phi }_0-\varvec{\phi }^\star \Vert ^2 + \frac{10}{\sqrt{T}} \,p\quad \forall p\in \mathbb N. \end{aligned}$$

Lemma 4.7 (i) with \(a = 2 G^2 \Vert \varvec{\phi }_0 -\varvec{\phi }^\star \Vert ^2 / \sqrt{T}\) and \(b = 10 / \sqrt{T}\) thus implies that

$$\begin{aligned} \mathbb P \left[ h \left( \frac{1}{T} \sum _{t=1}^{T} \varvec{\phi }_{t-1} \right) - h(\varvec{\phi }^\star ) \ge \frac{30}{\sqrt{T}} s + \frac{4 G^2}{\sqrt{T}} \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 \right] \le 2 \exp (-s) \quad \forall s \ge 0. \end{aligned}$$
(54)

We also have

$$\begin{aligned} u_2 - \frac{4 G^2}{\sqrt{T}} \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 \ge u_2 - \frac{\kappa }{8 G^2} \ge \frac{8 G^2}{\kappa T} \left( \frac{\kappa \sqrt{T}}{4 G^2} \right) ^2 - \frac{\kappa }{8 G^2} = \frac{3 \kappa }{8 G^2} \ge 0, \end{aligned}$$
(55)

where the first inequality follows from the basic assumption underlying Case I, while the second inequality holds due to the definition of \(u_2\). By (54) and (55), the third integral in (53) satisfies

$$\begin{aligned}&\int _{u_2}^{\infty } \mathbb P \left[ h \left( \frac{1}{T} \sum _{t=1}^{T} \varvec{\phi }_{t-1} \right) - h(\varvec{\phi }^\star ) \ge u \right] \mathrm {d}u \\&\quad =\; \int _{u_2 - \frac{4 G^2}{\sqrt{T}} \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2}^{\infty } \mathbb P \left[ h \left( \frac{1}{T} \sum _{t=1}^{T} \varvec{\phi }_{t-1} \right) - h(\varvec{\phi }^\star ) \ge u + \frac{4 G^2}{\sqrt{T}} \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 \right] \mathrm {d}u \\&\quad \le \; 2 \int _{u_2 - \frac{4 G^2}{\sqrt{T}} \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2}^\infty \exp \left( -\frac{\sqrt{T} u}{30} \right) \mathrm {d}u= \frac{60}{\sqrt{T}} \exp \left( -\frac{\sqrt{T}}{30} \left( u_2 - \frac{4 G^2}{\sqrt{T}} \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 \right) \right) \\&\quad \le \; \frac{60}{\sqrt{T}} \exp \left( -\frac{\kappa \sqrt{T}}{80 G^2} \right) \le \frac{2400 G^2}{\kappa T}, \end{aligned}$$

where the first inequality follows from the concentration inequality (54) and the insight from (55) that \(u_2 - \frac{4 G^2}{\sqrt{T}}\Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 \ge 0\). The second inequality exploits again (55), and the last inequality holds because \(\exp (-x) \le 1 / (2x)\) for all \( x > 0\). We have thus found a simple upper bound on the third integral in (53). It remains to derive an upper bound on the second integral in (53). To this end, we first observe that the second inequality in Proposition 4.2 for \(\gamma = 1 / (2 G^2 \sqrt{T})\) translates to

$$\begin{aligned}&\left\| \left\| \nabla h \left( \frac{1}{T} \sum _{t=1}^{T} \varvec{\phi }_{t-1} \right) \right\| ^2 \right\| _{L_p} \\&\quad \le \frac{G^{2}}{T} \left( 10 \sqrt{p} + \frac{4p}{\sqrt{T}} + 40 p + 4G^2 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 6G \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert \right) ^2 \quad \forall p\in \mathbb N. \end{aligned}$$

Lemma 4.7 (ii) with \(a = 10 G / \sqrt{T}\), \(b = 4 G / T + 40 G / \sqrt{T}\) and \(c = 4 G^3 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 / \sqrt{T} + 6 G^2 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert /\sqrt{T}\) thus gives rise to the concentration inequality

$$\begin{aligned}&\mathbb P \left[ \;\left\| \nabla h \left( \frac{1}{T} \sum _{t=1}^{T} \varvec{\phi }_{t-1} \!\right) \right\| ^2 \right. \\&\quad \left. \ge \! \frac{4G^2}{T} \left( 10 \sqrt{s} + \frac{4s}{\sqrt{T}} + 40 s + 4 G^2 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 6 G \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert \right) ^2 \right] \le 4 \exp (-s), \end{aligned}$$

which holds only for small deviations \(s\le T\). However, this concentration inequality can be simplified to

$$\begin{aligned}&\mathbb P \left[ \;\left\| \nabla h \left( \frac{1}{T} \sum _{t=1}^{T} \varvec{\phi }_{t-1} \right) \right\| \right. \\&\quad \left. \ge \frac{2G}{\sqrt{T}} \left( 12 \sqrt{s} + 40 s + 4 G^2 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 6 G \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert \right) \right] \\&\quad \le 4 \exp (-s), \end{aligned}$$

which remains valid for all deviations \( s\ge 0\). To see this, note that if \( s \le T/4 \), then the simplified concentration inequality holds because \( 4 s / T \le 2 \sqrt{s / T}\). Otherwise, if \( s > T/4 \), then the simplified concentration inequality holds trivially because the probability on the left-hand side vanishes. Indeed, this is an immediate consequence of Assumption 4.1 (ii), which stipulates that the norm of the gradient of h is bounded by R, and of the elementary estimate \(24 G \sqrt{s / T} > G\ge R\), which holds for all \(s > T / 4\).

In the following, we restrict attention to those deviations \(s\ge 0\) that are small in the sense that

$$\begin{aligned} \displaystyle 12 \sqrt{s} + 40 s \le \frac{ \kappa \sqrt{T}}{4G^2}. \end{aligned}$$
(56)

Assume now for the sake of argument that the event inside the probability in the simplified concentration inequality does not occur, that is, assume that

$$\begin{aligned} \left\| \nabla h \left( \frac{1}{T} \sum _{t=1}^{T} \varvec{\phi }_{t-1} \right) \right\| < \frac{2G}{\sqrt{T}} \left( 12 \sqrt{s} + 40 s + 4 G^2 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 6 G \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert \right) . \end{aligned}$$
(57)

By (56) and the assumption of Case I, (57) implies that \(\Vert \nabla h ( \frac{1}{T}\sum _{t=1}^T \varvec{\phi }_{t-1} ) \Vert< 3 \kappa / (4G) \le 3 \kappa / (4M)\). Hence, we may apply Lemma 4.4 (ii) to conclude that \(h ( \frac{1}{T}\sum _{t=1}^T \varvec{\phi }_{t-1} ) - h(\varvec{\phi }^\star ) \le \frac{2}{\kappa } \Vert \nabla h ( \frac{1}{T} \sum _{t=1}^T \varvec{\phi }_{t-1} ) \Vert ^2\). Combining this inequality with (57) then yields

$$\begin{aligned} h \left( \frac{1}{T}\sum _{t=1}^T \varvec{\phi }_{t-1} \right) - h(\varvec{\phi }^\star ) < \frac{8G^2}{\kappa T} \left( 12 \sqrt{s} + 40 s + 4 G^2 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 6 G \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert \right) ^2. \end{aligned}$$
(58)

By the simplified concentration inequality derived above, we may thus conclude that

$$\begin{aligned} 4 \exp (-s)&\ge \; \mathbb P \left[ \; \left\| \nabla h \left( \frac{1}{T} \sum _{t=1}^{T} \varvec{\phi }_{t-1} \right) \right\| \right. \nonumber \\&\left. \ge \frac{2G}{\sqrt{T}} \left( 12 \sqrt{s} + 40 s + 4 G^2 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 6 G \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert \right) \right] \nonumber \\&\ge \; \mathbb P \left[ \; h \left( \frac{1}{T} \sum _{t=1}^{T} \varvec{\phi }_{t-1} \right) - h(\varvec{\phi }^\star ) \right. \nonumber \\&\left. \ge \frac{8G^2}{\kappa T} \left( 12 \sqrt{s} + 40 s + 4 G^2 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 6 G \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert \right) ^2 \right] \end{aligned}$$
(59)

for any \(s\ge 0\) that satisfies (56), where the second inequality holds because (57) implies (58) or, equivalently, because the negation of (58) implies the negation of (57). The resulting concentration inequality (59) now enables us to construct an upper bound on the second integral in (53). To this end, we define the function

$$\begin{aligned} \ell (s) = \frac{8 G^2}{\kappa T} \left( 12 \sqrt{s} + 40 s + 4 G^2 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 6 G \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert \right) ^2 \end{aligned}$$

for all \(s\ge 0\), and set \({{\bar{s}}} = ((9/400 + \kappa \sqrt{T} / (160 G^2))^{\frac{1}{2}} - 3 / 20)^{2}\). Note that \(s\ge 0\) satisfies the inequality (56) if and only if \(s\le {{\bar{s}}}\) and that \(\ell (0) = u_1\) as well as \(\ell ({{\bar{s}}}) = u_2\). By substituting u with \( \ell (s)\) and using the concentration inequality (59) to bound the integrand, we find that the second integral in (53) satisfies

$$\begin{aligned}&\int _{u_1}^{u_2} \mathbb P \left[ h \left( \frac{1}{T} \sum _{t=1}^{T} \varvec{\phi }_{t-1} \right) - h(\varvec{\phi }^\star ) \ge u \right] \mathrm {d}u\\&\quad = \int _{0}^{{{\bar{s}}}} \mathbb P \left[ h \left( \frac{1}{T} \sum _{t=1}^{T} \varvec{\phi }_{t-1} \right) - h(\varvec{\phi }^\star ) \ge \ell (s) \right] \frac{\mathrm {d}\ell (s)}{\mathrm {d}s} \mathrm {d}s \\&\quad \le \int _{0}^{{{\bar{s}}}} 4 \mathrm {e}^{-s} \; \frac{\mathrm {d}}{\mathrm {d}s} \! \left( \frac{8 G^2}{\kappa T} \left( 12 \sqrt{s} + 40 s + \tau \right) ^2 \right) \mathrm {d}s \\&\quad \le \frac{32 G^2}{\kappa T} \int _{0}^{\infty } \mathrm {e}^{-s} \left( 144 + 3200 s + 1440 s^{1/2} + 80 \tau + 12 \tau s^{-1/2} \right) \mathrm {d}s \\&\quad = \frac{32 G^2}{\kappa T} \big ( 144 + 3200 \Gamma (2) + 1440 \Gamma (3/2) + 80 \tau + 12 \tau \Gamma (1/2) \big ) \\&\quad \le \frac{32 G^2}{\kappa T} ( 4621 + 102 \tau ), \end{aligned}$$

where \(\tau \) is a shorthand for \(4 G^2 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 6 G \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert \), and \(\Gamma \) denotes the Gamma function with \(\Gamma (2) = 1\), \(\Gamma (1/2) = \sqrt{\pi }\) and \(\Gamma (3/2) = \sqrt{\pi }/2\); see for example ([141], Chapter 8). The last inequality is obtained by rounding all fractional numbers up to the next higher integer. Combining the upper bounds for the three integrals in (53) finally yields

$$\begin{aligned} {\mathbb {E}}\left[ h \left( \frac{1}{T} \sum _{t=1}^{T} \varvec{\phi }_{t-1} \right) - h(\varvec{\phi }^\star ) \right]&\le \frac{8 G^2}{\kappa T} \left( \tau ^2 + 18484 + 408 \tau + 300 \right) \\&= \frac{8 G^2}{\kappa T} \Big ( 16 G^4 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^4 + 48 G^3 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^3 \\&\quad + 1668 G^2 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 \\&\quad + 2448 G \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert + 18784 \Big ) \\&\le \frac{G^2}{\kappa T} (4 G \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert + 20)^4. \end{aligned}$$

This completes the proof of assertion (iii) in Case I.

Case II: Assume now that \(4 G^2 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 6 G \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert > {\kappa \sqrt{T}}/{(8 G^2)}\), where G is defined as before. Since h has bounded gradients, the inequality (51) remains valid. Setting the step size to \(\gamma = 1 / (2 G^2 \sqrt{T})\) and using the trivial inequalities \(G \ge R + {{\bar{\varepsilon }}} \ge R\), we thus obtain

$$\begin{aligned} {\mathbb {E}}\left[ h \left( \frac{1}{T} \sum _{t=1}^{T} \varvec{\phi }_{t-1} \right) \right] - h(\varvec{\phi }^\star )&\le \frac{G^2}{\sqrt{T}} \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + \frac{1}{4\sqrt{T}} + \frac{{{\bar{\varepsilon }}}}{\sqrt{T}} \sqrt{2 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + \frac{37}{2G^2}} \\&\le \frac{G^2}{\sqrt{T}} \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + \frac{2G}{\sqrt{T}} \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert + \frac{5}{\sqrt{T}} , \end{aligned}$$

where the second inequality holds because \(G \ge {{\bar{\varepsilon }}}\) and \(\sqrt{a + b} \le \sqrt{a} + \sqrt{b}\) for all \(a,b\ge 0\). Multiplying the right hand side of the last inequality by \(G^2 (32 G^2 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 48 G \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ) / (\kappa \sqrt{T})\), which is strictly larger than 1 by the basic assumption underlying Case II, we then find

$$\begin{aligned}&{\mathbb {E}}\left[ h \left( \frac{1}{T} \sum _{t=1}^{T} \varvec{\phi }_{t-1} \right) \right] - h(\varvec{\phi }^\star ) \\&\quad \le \frac{G^2}{\kappa T} \left( G^2 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 2 G \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert + 5 \right) \left( 32 G^2 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 48 G \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert \right) \\&\quad \le \frac{G^2}{\kappa T} (4 G \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert + 20)^4. \end{aligned}$$

This observation completes the proof. \(\square \)

4.2 Smooth optimal transport problems with marginal ambiguity sets

The smooth optimal transport problem (12) can be viewed as an instance of a stochastic optimization problem, that is, a concave maximization problem akin to (36), where the objective function is representable as \(h(\varvec{\phi }) = {\mathbb {E}}_{\varvec{x} \sim \mu } [ \varvec{\nu }^\top \varvec{\phi }- {{\overline{\psi }}}_c(\varvec{\phi }, \varvec{x})]\). Throughout this section we assume that the smooth (discrete) c-transform \({{\overline{\psi }}}_c(\varvec{\phi }, \varvec{x})\) defined in (11) is induced by a marginal ambiguity set of the form (26) with continuous marginal distribution functions. By Proposition 3.6, the integrand \(\varvec{\nu }^\top \varvec{\phi }- {{\overline{\psi }}}_c(\varvec{\phi }, \varvec{x})\) is therefore concave and differentiable in \(\varvec{\phi }\). We also assume that \({{\overline{\psi }}}_c(\varvec{\phi }, \varvec{x})\) is \(\mu \)-integrable in \(\varvec{x}\), that we have access to an oracle that generates independent samples from \(\mu \), and that problem (12) is solvable.
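If, for illustration, every marginal distribution function is exponential, \(F_i(s) = 1 - e^{-s/\eta }\), then the smooth c-transform reduces to a log-sum-exp function (up to an additive constant) whose gradient is a softmax vector in \(\Delta ^N\) (cf. Proposition 3.6). Under this assumption, a bare-bones implementation of averaged stochastic gradient ascent for problem (12) could look as follows; all names and interfaces are ours.

```python
import numpy as np

def smooth_c_transform_grad(phi, c_x, eta):
    """Gradient of the smooth c-transform for one sample x (softmax form).

    Assumes exponential marginals, for which the smooth c-transform equals
    eta * logsumexp((phi - c_x) / eta) up to a constant; the gradient then
    lies in the simplex.
    """
    z = (phi - c_x) / eta
    z -= z.max()                 # stabilize the exponentials
    p = np.exp(z)
    return p / p.sum()

def averaged_sga_smooth_ot(nu, cost, sample_mu, eta, gamma, T):
    """Averaged stochastic gradient ascent for the concave problem (12).

    `cost(x)` returns the vector (c(x, y_1), ..., c(x, y_N)) and `sample_mu()`
    draws an independent sample from mu; both are hypothetical interfaces.
    """
    phi = np.zeros_like(nu, dtype=float)
    running_sum = np.zeros_like(phi)
    for _ in range(T):
        running_sum += phi
        x = sample_mu()
        g = nu - smooth_c_transform_grad(phi, cost(x), eta)  # stochastic gradient of h
        phi = phi + gamma * g                                # ascent step on the concave h
    return running_sum / T
```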

The following proposition establishes several useful properties of the smooth c-transform.

Proposition 4.8

(Properties of the smooth c-transform) If \(\Theta \) is a marginal ambiguity set of the form (26) with cumulative distribution functions \(F_i\), \(i\in [N]\), then \({{\overline{\psi }}}_c(\varvec{\phi }, \varvec{x})\) has the following properties for all \(\varvec{x} \in \mathcal X\).

  1. (i)

    Bounded gradient: If \(F_i\), \(i\in [N]\), are continuous, then we have \( \Vert \nabla _{\varvec{\phi }} {{\overline{\psi }}}_c(\varvec{\phi }, \varvec{x}) \Vert \le 1 \) for all \(\varvec{\phi }\in {\mathbb {R}}^N\).

  2. (ii)

    Lipschitz continuous gradient: If \(F_i\), \(i\in [N]\), are Lipschitz continuous with Lipschitz constant \(L>0\), then \({{\overline{\psi }}}_c(\varvec{\phi }, \varvec{x})\) is L-smooth with respect to \(\varvec{\phi }\) in the sense of Assumption 4.1 (iv).

  3. (iii)

    Generalized self-concordance: If \(F_i\), \(i\in [N]\), are twice differentiable on the interiors of their respective supports and if there is \(M > 0\) with

$$\begin{aligned} \sup _{s \in F_i^{-1}(0,1)} ~ \frac{|\mathrm {d}^2F_i(s) / \mathrm {d}s^2|}{\mathrm {d}F_i(s) / \mathrm {d}s} \le M \quad \forall i \in [N], \end{aligned}$$
    (60)

    then \({{\overline{\psi }}}_c(\varvec{\phi }, \varvec{x})\) is M-generalized self-concordant with respect to \(\varvec{\phi }\) in the sense of Assumption 4.1 (iii).
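For example (a simple check of ours, not needed for the proof), if every \(F_i\) is the exponential distribution function \(F_i(s) = 1 - e^{-s/\eta }\) on \([0, \infty )\) for some \(\eta > 0\), then

$$\begin{aligned} \frac{\mathrm {d}F_i(s)}{\mathrm {d}s} = \frac{1}{\eta } e^{-s/\eta } \le \frac{1}{\eta } \quad \text {and} \quad \frac{|\mathrm {d}^2F_i(s) / \mathrm {d}s^2|}{\mathrm {d}F_i(s) / \mathrm {d}s} = \frac{1}{\eta } \qquad \forall s > 0, \end{aligned}$$

and thus all three assertions apply simultaneously with \(L = M = 1/\eta \).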

Proof

As for (i), Proposition 3.6 implies that \(\nabla _{\varvec{\phi }} {{\overline{\psi }}}_c(\varvec{\phi }, \varvec{x}) \in \Delta ^N\), and thus we have \(\Vert \nabla _{\varvec{\phi }} {{\overline{\psi }}}_c(\varvec{\phi }, \varvec{x}) \Vert \le 1\). As for (ii), note that the convex conjugate of the smooth c-transform with respect to \(\varvec{\phi }\) is given by

$$\begin{aligned} {{\overline{\psi }}}{}_c^*(\varvec{p}, \varvec{x})&= \sup _{\varvec{\phi }\in {\mathbb {R}}^N} \varvec{p}^\top \varvec{\phi }- {{\overline{\psi }}}_c(\varvec{\phi }, \varvec{x}) \\&= \sup _{\varvec{\phi }\in {\mathbb {R}}^N} \inf _{\varvec{q} \in \Delta ^N} ~ \sum _{i=1}^N p_i \phi _i - (\phi _i - c(\varvec{x}, \varvec{y_i})) q_i - \int _{1-q_i}^1 F_i^{-1}(t)\mathrm {d}t \\&= \inf _{\varvec{q} \in \Delta ^N} \sup _{\varvec{\phi }\in {\mathbb {R}}^N} ~ \sum _{i=1}^N p_i \phi _i - (\phi _i - c(\varvec{x}, \varvec{y_i})) q_i - \int _{1-q_i}^1 F_i^{-1}(t)\mathrm {d}t \\&= {\left\{ \begin{array}{ll} \;\displaystyle \sum \limits _{i=1}^N c(\varvec{x}, \varvec{y_i}) p_i - \int _{1-p_i}^1 F_i^{-1}(t)\mathrm {d}t &{} \text {if } \varvec{p} \in \Delta ^N \\ \;+\infty &{} \text {otherwise,} \end{array}\right. } \end{aligned}$$

where the second equality follows again from Proposition 3.6, and the interchange of the infimum and the supremum is allowed by Sion’s classical minimax theorem. In the following we first prove that \({{\overline{\psi }}}{}_c^*(\varvec{p}, \varvec{x})\) is 1/L-strongly convex in \(\varvec{p}\), that is, the function \({{\overline{\psi }}}{}_c^*(\varvec{p}, \varvec{x}) - \Vert \varvec{p}\Vert ^2/ (2L)\) is convex in \(\varvec{p}\) for any fixed \(\varvec{x} \in \mathcal X\). To this end, recall that \(F_i\) is assumed to be Lipschitz continuous with Lipschitz constant L. Thus, we have

$$\begin{aligned} L\!\ge \!\sup _{\begin{array}{c} s_1,s_2 \in {\mathbb {R}}\\ s_1 \ne s_2 \end{array}}\!\frac{\left| F_i (s_1) \!-\! F_i(s_2)\right| }{|s_1 - s_2|} \!=\! \sup _{\begin{array}{c} s_1, s_2 \in {\mathbb {R}}\\ s_1> s_2 \end{array}}\frac{ F_i (s_1) \!-\! F_i(s_2)}{s_1 - s_2}\!\ge \! \sup _{\begin{array}{c} p_i, q_i \in (0,1)\\ p_i > q_i \end{array}} \frac{p_i - q_i}{F_i^{-1}(p_i) \!-\! F_i^{-1}(q_i)}, \end{aligned}$$

where the second inequality follows from restricting \(s_1\) and \(s_2\) to the preimage of (0, 1) with respect to \(F_i\). Rearranging terms in the above inequality then yields

$$\begin{aligned} -F_i^{-1}(1 - q_i) - q_i/L&\le -F_i^{-1}(1-p_i)-p_i/L \end{aligned}$$

for all \(p_i, q_i \in (0, 1)\) with \(q_i < p_i\). Consequently, the function \(- F_i^{-1}(1-p_i) - {p_i}/L\) is non-decreasing and its primitive \(- \int _{1-p_i}^1 F_i^{-1}(t)\mathrm {d}t - p_i^2/(2 L)\) is convex in \(p_i\) on the interval (0, 1). This implies that

$$\begin{aligned} {{\overline{\psi }}}{}_c^*(\varvec{p}, \varvec{x}) - \frac{\Vert \varvec{p}\Vert _2^2}{2 L} = \sum _{i=1}^N c(\varvec{x}, \varvec{y_i}) p_i - \int _{1-p_i}^1 F_i^{-1}(t)\mathrm {d}t - \frac{p_i^2}{2 L} \end{aligned}$$

constitutes a sum of convex univariate functions for every fixed \(\varvec{x}\in {\mathcal {X}}\). Thus, \({{\overline{\psi }}}{}_c^*(\varvec{p}, \varvec{x})\) is 1/L-strongly convex in \(\varvec{p}\). By ([78], Theorem 6), any convex function whose conjugate is 1/L-strongly convex is guaranteed to be L-smooth. This observation completes the proof of assertion (ii). As for assertion (iii), choose any \(\varvec{\phi }, \varvec{\varphi }\in {\mathbb {R}}^N\) and \(\varvec{x} \in \mathcal X\), and introduce the auxiliary function

$$\begin{aligned} u(s)&= {{\overline{\psi }}}_c \left( \varvec{\phi }+ s (\varvec{\varphi }- \varvec{\phi }), \varvec{x} \right) = \max _{ \varvec{p} \in \Delta ^N} \displaystyle \sum \limits _{i=1}^N ~ (\phi _i + s (\varphi _i - \phi _i) - c(\varvec{x}, \varvec{y_i}))p_i \nonumber \\&\quad + \int _{1-p_i}^1 F_i^{-1}(t) \mathrm {d}t. \end{aligned}$$
(61)

For ease of exposition, in the remainder of the proof we use prime symbols to designate derivatives of univariate functions. A direct calculation then yields

$$\begin{aligned} u'(s)&= \left( \varvec{\varphi }- \varvec{\phi }\right) ^\top \nabla _{\varvec{\phi }} {{\overline{\psi }}}_c \left( \varvec{\phi }+ s (\varvec{\varphi }- \varvec{\phi }), \varvec{x} \right) \quad \text {and} \\ u''(s)&= \left( \varvec{\varphi }- \varvec{\phi }\right) ^\top \nabla _{\varvec{\phi }}^2 {{\overline{\psi }}}_c \left( \varvec{\phi }+ s (\varvec{\varphi }- \varvec{\phi }), \varvec{x} \right) \left( \varvec{\varphi }- \varvec{\phi }\right) . \end{aligned}$$

By Proposition 3.6, \(\varvec{p}^\star (s)=\nabla _{\varvec{\phi }} {{\overline{\psi }}}_c \left( \varvec{\phi }+ s (\varvec{\varphi }- \varvec{\phi }), \varvec{x} \right) \) represents the unique solution of the maximization problem in (61). In addition, by ([159], Proposition 6), the Hessian of the smooth c-transform with respect to \(\varvec{\phi }\) can be computed from the Hessian of its convex conjugate as follows.

$$\begin{aligned}&\nabla _{\varvec{\phi }}^2 {{\overline{\psi }}}_c \left( \varvec{\phi }+ s (\varvec{\varphi }- \varvec{\phi }), \varvec{x} \right) = \left( \nabla ^2_{\varvec{p}} {{\overline{\psi }}}{}_c^*(\varvec{p}^\star (s), \varvec{x}) \right) ^{-1}\\&\quad = \mathrm {diag} \left( [F_1'(F_1^{-1}(1 - p_1^\star (s))), \dots , F_N'(F_N^{-1}(1 - p_N^\star (s))) ] \right) \end{aligned}$$

Hence, the first two derivatives of the auxiliary function u(s) simplify to

$$\begin{aligned} u'(s) = \sum _{i=1}^N (\varphi _i- \phi _i) p^\star _i(s) \quad \text {and} \quad u''(s) = \sum _{i=1}^N (\varphi _i- \phi _i)^2 F_i'(F_i^{-1}(1 - p_i^\star (s))).\end{aligned}$$

Similarly, the above formula for the Hessian of the smooth c-transform can be used to show that \((p_i^\star )'(s) = (\varphi _i- \phi _i) F_i'(F_i^{-1}(1 - p_i^\star (s)))\) for all \(i \in [N]\). The third derivative of u(s) therefore simplifies to

$$\begin{aligned} u'''(s)&= - \sum _{i=1}^N (\varphi _i- \phi _i)^2 \,\frac{ F_i''(F_i^{-1}(1 - p_i^\star (s)))}{F_i'(F_i^{-1}(1 - p_i^\star (s)))}\, (p_i^\star )'(s) \\&= - \sum _{i=1}^N (\varphi _i- \phi _i)^3 F_i''(F_i^{-1}(1 - p_i^\star (s))). \end{aligned}$$

This implies via Hölder’s inequality that

$$\begin{aligned} | u'''(s) |&= \left| \sum _{i=1}^N (\varphi _i- \phi _i)^2\, F_i'(F_i^{-1}(1 - p_i^\star (s))) \, \frac{F_i''(F_i^{-1}(1 - p_i^\star (s)))}{F_i'(F_i^{-1}(1 - p_i^\star (s)))} \, (\varphi _i- \phi _i) \right| \\&\le \left( \sum _{i=1}^N (\varphi _i- \phi _i)^2\, F_i'(F_i^{-1}(1 - p_i^\star (s))) \right) \left( \max _{i \in [N]} \left| \frac{F_i''(F_i^{-1}(1 - p_i^\star (s)))}{F_i'(F_i^{-1}(1 - p_i^\star (s)))} \, (\varphi _i- \phi _i) \right| \right) . \end{aligned}$$

Notice that the first term in the above expression coincides with \(u''(s)\), and the second term satisfies

$$\begin{aligned}&\max _{i \in [N]} \left| \frac{F_i''(F_i^{-1}(1 - p_i^\star (s)))}{F_i'(F_i^{-1}(1 - p_i^\star (s)))} \, (\varphi _i- \phi _i) \right| \\&\quad \le \max _{i \in [N]} \left| \frac{F_i''(F_i^{-1}(1 - p_i^\star (s)))}{F_i'(F_i^{-1}(1 - p_i^\star (s)))} \right| \, \Vert \varvec{\varphi }- \varvec{\phi }\Vert _\infty \le M \Vert \varvec{\varphi }- \varvec{\phi }\Vert , \end{aligned}$$

where the first inequality holds because \(\max _{i \in [N]} |a_i b_i| \le \Vert \varvec{a} \Vert _{\infty } \Vert \varvec{b} \Vert _\infty \) for all \(\varvec{a}, \varvec{b} \in \mathbb R^N\), and the second inequality follows from the definition of M and the fact that the 2-norm provides an upper bound on the \(\infty \)-norm. Combining the above results shows that \(|u'''(s)|\le M \Vert \varvec{\varphi }- \varvec{\phi }\Vert u''(s)\) for all \(s\in {\mathbb {R}}\). The claim now follows because \(\varvec{\phi }, \varvec{\varphi }\in {\mathbb {R}}^N\) and \(\varvec{x} \in \mathcal X\) were chosen arbitrarily. \(\square \)

Algorithm 1: Averaged SGD for the smooth optimal transport problem (12)

In the following we use the averaged SGD algorithm of Sect. 4.1 to solve the smooth optimal transport problem (12). A detailed description of this algorithm in pseudocode is provided in Algorithm 1. This algorithm repeatedly calls a sub-routine for estimating the gradient of \({{\overline{\psi }}}_c(\varvec{\phi }, \varvec{x})\) with respect to \(\varvec{\phi }\). By Proposition 3.6, this gradient coincides with the unique solution \(\varvec{p}^\star \) of the convex maximization problem (27). In addition, from the proof of Proposition 3.6 it is clear that its components are given by

$$\begin{aligned} p^\star _i = \theta ^\star \left[ i = \min \, \mathop {\mathrm{argmax}}\limits _{j \in [N]} \phi _j - c(\varvec{x}, \varvec{y}_j) + z_j \right] \quad \forall i \in [N], \end{aligned}$$

where \(\theta ^\star \) represents an optimizer of the semi-parametric discrete choice problem (11). Therefore, \(\varvec{p}^\star \) can be interpreted as a vector of choice probabilities under the best-case probability measure \(\theta ^\star \). Sometimes these choice probabilities are available in closed form. This is the case, for instance, in the exponential distribution model of Example 3.8, which is equivalent to the generalized extreme value distribution model of Sect. 3.1. Indeed, in this case \(\varvec{p}^\star \) is given by a softmax of the utility values \(\phi _i - c(\varvec{x}, \varvec{y_i})\), \(i\in [N]\), i.e.,

$$\begin{aligned} p_i^\star = \frac{\eta _i \exp \left( ({\phi _i - c(\varvec{x}, \varvec{y_i}) )}/{\lambda }\right) }{\sum _{j=1}^N \eta _j \exp \left( ({\phi _j - c(\varvec{x},\varvec{y_j}) })/{\lambda } \right) } \quad \forall i \in [N]. \end{aligned}$$
(62)

Note that these particular choice probabilities are routinely studied in the celebrated multinomial logit choice model ([16], § 5.1). The choice probabilities are also available in closed form in the uniform distribution model of Example 3.9. As the derivation of \(\varvec{p}^\star \) is somewhat cumbersome in this case, we relegate it to Appendix D. For general marginal ambiguity sets with continuous marginal distribution functions, we propose a bisection method to compute the gradient of the smooth c-transform numerically up to any prescribed accuracy; see Algorithm 2.
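To make the closed-form oracle (62) concrete, the following minimal Python sketch evaluates it in a numerically stable way (shifting all exponents by their maximum, which leaves the softmax unchanged); the function name and the array-based interface are our own illustration and not part of the accompanying MATLAB codes.

```python
import numpy as np

def softmax_choice_probabilities(phi, cost, eta, lam):
    """Closed-form gradient (62) of the smooth c-transform for the
    exponential distribution model: a softmax of the utilities
    phi_i - c(x, y_i), weighted by eta_i."""
    utilities = (np.asarray(phi) - np.asarray(cost)) / lam
    utilities -= utilities.max()          # shift for numerical stability
    weights = np.asarray(eta) * np.exp(utilities)
    return weights / weights.sum()        # p* lies in the probability simplex
```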

Theorem 4.9

(Biased gradient oracle) If \(\Theta \) is a marginal ambiguity set of the form (26) and the cumulative distribution function \(F_i\) is continuous for every \(i\in [N]\), then, for any \(\varvec{x} \in \mathcal X\), \(\varvec{\phi }\in {\mathbb {R}}^N\) and \(\varepsilon > 0\), Algorithm 2 outputs \(\varvec{p} \in {\mathbb {R}}^N \) with \(\Vert \varvec{p} \Vert \le 1\) and \(\Vert \nabla _{\varvec{\phi }} {{\overline{\psi }}}_c(\varvec{\phi }, \varvec{x}) - {\varvec{p}} \Vert \le \varepsilon \).

Proof

Thanks to Proposition 3.6, we can recast the smooth c-transform in dual form as

$$\begin{aligned} {{\overline{\psi }}}_c(\varvec{\phi }, \varvec{x})&= \min _{\begin{array}{c} \varvec{\zeta }\in {\mathbb {R}}_+^N \\ \tau \in {\mathbb {R}} \end{array}}\;\sup _{\varvec{p} \in {\mathbb {R}}^N} ~ \sum \limits _{i=1}^N (\phi _i - c(\varvec{x},\varvec{y_i}))p_i +\sum \limits _{i=1}^N \int ^1_{1- p_i} F_i^{-1}(t)\mathrm {d}t \\&\quad + \tau \left( \sum \limits _{i=1}^N p_i - 1 \right) + \sum \limits _{i=1}^N \zeta _i p_i. \end{aligned}$$

Strong duality and dual solvability hold because we may construct a Slater point for the primal problem by setting \(p_i=1/N\), \(i\in [N]\). By the Karush-Kuhn-Tucker optimality conditions, \(\varvec{p}^\star \) and \((\tau ^\star ,\varvec{\zeta }^\star )\) are therefore optimal in the primal and dual problems, respectively, if and only if we have

$$\begin{aligned} \begin{array}{lll} \sum _{i=1}^N p^\star _i =1, ~p^\star _i \ge 0 &{} \forall i \in [N] &{} \text {(primal feasibility)}\\ \zeta ^\star _i\ge 0 &{} \forall i \in [N] &{} \text {(dual feasibility)}\\ \zeta _i^\star p_i^\star =0 &{} \forall i \in [N] &{} \text {(complementary slackness)} \\ \phi _i-c(\varvec{x},\varvec{y_i}) + F_i^{-1}(1-p^\star _i) + \tau ^\star + \zeta ^\star _i = 0 &{} \forall i \in [N] &{} \text {(stationarity)}. \end{array} \end{aligned}$$

If \(p_i^\star > 0\), then the complementary slackness and stationarity conditions imply that \(\zeta _i^\star = 0\) and that \(\phi _i-c(\varvec{x},\varvec{y_i}) + F_i^{-1}(1-p^\star _i) + \tau ^\star = 0\), respectively. Thus, we have \(p_i^\star = 1 - F_i(c(\varvec{x},\varvec{y_i})-\phi _i -\tau ^\star )\). If \(p_i^\star = 0\), on the other hand, then similar arguments show that \(\zeta _i^\star \ge 0\) and \(\phi _i-c(\varvec{x},\varvec{y_i}) + F_i^{-1}(1) + \tau ^\star \le 0\). These two inequalities are equivalent to \(1 - F_i(c(\varvec{x},\varvec{y_i})-\phi _i -\tau ^\star ) \le 0\). As all values of \(F_i\) are smaller than or equal to 1, the last inequality must in fact hold as an equality. Combining the insights gained so far thus yields \(p_i^\star = 1 - F_i(c(\varvec{x},\varvec{y_i})-\phi _i -\tau ^\star )\), which holds for all \(i\in [N]\) irrespective of the sign of \(p_i^\star \). Primal feasibility therefore ensures that \(\sum _{i=1}^N 1 - F_i(c(\varvec{x},\varvec{y_i})-\phi _i -\tau ^\star ) = 1\). Finding the unique optimizer \(\varvec{p}^\star \) of (27) (i.e., finding the gradient of \( {{\overline{\psi }}}_c(\varvec{\phi }, \varvec{x})\)) is therefore tantamount to finding a root \(\tau ^\star \) of the univariate equation

$$\begin{aligned} \sum _{i=1}^N 1 - F_i(c(\varvec{x},\varvec{y_i})-\phi _i -\tau ) = 1. \end{aligned}$$
(63)

Note that the function on the left-hand side of (63) is continuous and non-decreasing in \(\tau \) because of the continuity (by assumption) and monotonicity (by definition) of the cumulative distribution functions \(F_i\), \(i\in [N]\). Hence, the root-finding problem can be solved efficiently via bisection. To complete the proof, we first show that the interval between the constants \({\underline{\tau }}\) and \({\overline{\tau }}\) defined in Algorithm 2 is guaranteed to contain \(\tau ^\star \). Specifically, we will demonstrate that evaluating the function on the left-hand side of (63) at \({{\underline{\tau }}}\) or \({{\overline{\tau }}}\) yields a value of at most 1 or at least 1, respectively. For \(\tau ={{\underline{\tau }}}\) we have

$$\begin{aligned}&1 - F_i(c(\varvec{x},\varvec{y_i})-\phi _i -{{\underline{\tau }}})\\&\quad = 1 - F_i \left( c(\varvec{x},\varvec{y_i})-\phi _i - \min _{j \in [N]} \left\{ c \left( \varvec{x}, \varvec{y_j} \right) - \phi _j -F_j^{-1}(1-1/N) \right\} \right) \\&\quad \le 1 - F_i \left( F_i^{-1}(1-1/N) \right) = 1 / N\qquad \forall i\in [N], \end{aligned}$$

where the inequality follows from the monotonicity of \(F_i\). Summing the above inequality over all \(i\in [N]\) then yields the desired inequality \(\sum _{i =1}^N 1 - F_i(c(\varvec{x},\varvec{y_i})-\phi _i -{{\underline{\tau }}}) \le 1\). Similarly, for \(\tau ={{\overline{\tau }}}\) we have

$$\begin{aligned}&1 - F_i(c(\varvec{x},\varvec{y_i})-\phi _i -{{\overline{\tau }}})\\&\quad = 1 - F_i \left( c(\varvec{x},\varvec{y_i})-\phi _i - \max _{j \in [N]} \left\{ c \left( \varvec{x}, \varvec{y_j} \right) - \phi _j -F_j^{-1}(1-1/N) \right\} \right) \\&\quad \ge 1 - F_i \left( F_i^{-1}(1-1/N) \right) = 1/N \qquad \forall i\in [N]. \end{aligned}$$

We may thus conclude that \(\sum _{i =1}^N 1 - F_i(c(\varvec{x},\varvec{y_i})-\phi _i -{{\overline{\tau }}}) \ge 1\). Therefore, \([{{\underline{\tau }}}, {{\overline{\tau }}}]\) constitutes a valid initial search interval for the bisection algorithm. Note that the function \(1 - F_i(c(\varvec{x},\varvec{y_i})-\phi _i -\tau )\), which defines \(p_i\) in terms of \(\tau \), is uniformly continuous in \(\tau \) throughout \(\mathbb R\). This follows from ([22], Problem 14.8) and our assumption that \(F_i\) is continuous. The uniform continuity ensures that the tolerance

$$\begin{aligned} \delta (\varepsilon ) = \min _{i \in [N]} \left\{ \max _\delta \left\{ \delta : | F_i(t_1) - F_i(t_2) | \le \varepsilon / \sqrt{N} ~~ \forall t_1,t_2\in {\mathbb {R}}\text { with } | t_1 - t_2 | \le \delta \right\} \right\} \end{aligned}$$
(64)

is strictly positive for every \(\varepsilon >0\). As the length of the search interval is halved in each iteration, Algorithm 2 outputs a near-optimal solution \(\tau \) with \(| \tau - \tau ^\star | \le \delta (\varepsilon )\) after \(\lceil \log _2 (({\overline{\tau }} - {\underline{\tau }}) / \delta (\varepsilon )) \rceil \) iterations. Moreover, the construction of \(\delta (\varepsilon )\) guarantees that \(|1 - F_i(c(\varvec{x},\varvec{y_i})-\phi _i -\tau ) - p_i^\star | \le \varepsilon / \sqrt{N}\) for all \(\tau \) with \(|\tau - \tau ^\star | \le \delta (\varepsilon )\). Therefore, the output \(\varvec{p}\in {\mathbb {R}}^N_+\) of Algorithm 2 satisfies \(|p_i - p_i^\star | \le \varepsilon / \sqrt{N} \) for each \(i\in [N]\), which in turn implies that \( \Vert \varvec{p} - \varvec{p}^\star \Vert \le \varepsilon \). By construction, finally, Algorithm 2 outputs \(\varvec{p}\ge \varvec{0}\) with \(\sum _{i \in [N]} p_i < 1\), which ensures that \(\Vert \varvec{p} \Vert \le 1\). Thus, the claim follows. \(\square \)

If all cumulative distribution functions \(F_i\), \(i\in [N]\), are Lipschitz continuous with a common Lipschitz constant \(L>0\), then the uniform continuity parameter \(\delta (\varepsilon )\) required in Algorithm 2 can simply be set to \(\delta (\varepsilon ) = \varepsilon / (L \sqrt{N})\). We are now ready to prove that Algorithm 1 offers different convergence guarantees depending on the continuity and smoothness properties of the marginal cumulative distribution functions.
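Since Algorithm 2 is not reproduced here, the following Python sketch illustrates the bisection scheme from the proof of Theorem 4.9 under the simplifying assumption that all \(F_i\) are Lipschitz continuous with a known common constant L, so that \(\delta (\varepsilon ) = \varepsilon / (L \sqrt{N})\); the per-marginal interfaces `F(i, t)` and `F_inv(i, t)` are hypothetical conveniences of this sketch.

```python
import numpy as np

def bisection_gradient(phi, cost, F, F_inv, L, eps):
    """Sketch of the bisection oracle of Theorem 4.9: approximate the
    gradient p* of the smooth c-transform by locating a root tau* of
    equation (63). F(i, t) and F_inv(i, t) evaluate the i-th marginal
    cdf and its quantile function (hypothetical interfaces)."""
    N = len(phi)

    def p(tau):
        # candidate gradient implied by the stationarity condition
        return np.array([1.0 - F(i, cost[i] - phi[i] - tau) for i in range(N)])

    # initial search interval [tau_low, tau_high] constructed as in the proof
    anchors = [cost[j] - phi[j] - F_inv(j, 1.0 - 1.0 / N) for j in range(N)]
    tau_low, tau_high = min(anchors), max(anchors)

    delta = eps / (L * np.sqrt(N))      # tolerance (64) in the Lipschitz case
    while tau_high - tau_low > delta:
        tau_mid = 0.5 * (tau_low + tau_high)
        if p(tau_mid).sum() < 1.0:      # left-hand side of (63) is non-decreasing in tau
            tau_low = tau_mid
        else:
            tau_high = tau_mid
    return p(tau_low)                   # entries are nonnegative and sum to at most 1
```

Returning the candidate at the left endpoint keeps the sum of the output's entries below 1, which matches the normalization claim at the end of the proof.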

Corollary 4.10

Use \(h(\varvec{\phi }) = {\mathbb {E}}_{\varvec{x} \sim \mu } [ \varvec{\nu }^\top \varvec{\phi }- {{\overline{\psi }}}_c(\varvec{\phi }, \varvec{x})]\) as a shorthand for the objective function of the smooth optimal transport problem (12), and let \(\varvec{\phi }^\star \) be a maximizer of (12). If \(\Theta \) is a marginal ambiguity set of the form (26) with cumulative distribution functions \(F_i\), \(i\in [N]\), then for any \(T \in \mathbb N\) and \({{\bar{\varepsilon }}}\ge 0\), the output \(\bar{\varvec{\phi }}_T = \frac{1}{T} \sum _{t=1}^{T} \varvec{\phi }_{t}\) of Algorithm 1 satisfies the following inequalities.

(i) If \(\gamma = 1 / (2 (2 + {{\bar{\varepsilon }}}) \sqrt{T})\) and \(F_i\) is continuous for every \(i\in [N]\), then we have

(ii) If \(\gamma = 1 / (2 \sqrt{T} + L)\) and \(F_i\) is Lipschitz continuous with Lipschitz constant \(L>0\) for every \(i\in [N]\), then we have

$$\begin{aligned} {{\overline{W}}}_c (\mu , \nu ) - {\mathbb {E}}\left[ h \big (\bar{\varvec{\phi }}_T \big ) \right]&\le \frac{L}{2T}\Vert \varvec{\phi }^\star \Vert ^2 + \frac{(2 + {{\bar{\varepsilon }}})^2}{\sqrt{T}} \Vert \varvec{\phi }^\star \Vert ^2 \\&\quad + \frac{{{\bar{\varepsilon }}}^2 + 2}{4 (2+{{\bar{\varepsilon }}})^2\sqrt{T}} + \frac{{{\bar{\varepsilon }}}}{\sqrt{T}} \sqrt{2 \Vert \varvec{\phi }^\star \Vert ^2 + \frac{37}{2(2 + {{\bar{\varepsilon }}})^2}}. \end{aligned}$$

(iii) If \(\gamma = 1 / (2 G^2 \sqrt{T}) \) with \(G = \max \{M, 2 + {{\bar{\varepsilon }}}\}\), \(F_i\) satisfies the generalized self-concordance condition (60) with \(M> 0\) for every \(i\in [N]\), and the smallest eigenvalue \(\kappa \) of \(-\nabla ^2_{\varvec{\phi }} h(\varvec{\phi }^\star )\) is strictly positive, then we have

Proof

Recall that problem (12) can be viewed as an instance of the convex minimization problem (36) provided that its objective function is negated. Throughout the proof we denote by \(\varvec{p}_t(\varvec{\phi }_t, \varvec{x}_t)\) the inexact estimate for \(\nabla _{\varvec{\phi }} {{\overline{\psi }}}_c(\varvec{\phi }_t, \varvec{x}_t)\) output by Algorithm 2 in iteration t of the averaged SGD algorithm. Note that

$$\begin{aligned} \left\| {\mathbb {E}}\left[ \varvec{\nu }- \varvec{p}_t(\varvec{\phi }_{t-1}, \varvec{x}_t) \big | \mathcal F_{t-1} \right] - \nabla h(\varvec{\phi }_{t-1}) \right\|&= \left\| {\mathbb {E}}\left[ \varvec{p}_t(\varvec{\phi }_{t-1}, \varvec{x}_t) - \nabla _{\varvec{\phi }} {{\overline{\psi }}}_c(\varvec{\phi }_{t-1}, \varvec{x}_t)\right] \right\| \\&\le {\mathbb {E}}\left[ \left\| \varvec{p}_t(\varvec{\phi }_{t-1}, \varvec{x}_t) - \nabla _{\varvec{\phi }} {{\overline{\psi }}}_c(\varvec{\phi }_{t-1}, \varvec{x}_t) \right\| \right] \\&\le \varepsilon _{t-1} \le \frac{{{\bar{\varepsilon }}}}{2 \sqrt{t}}, \end{aligned}$$

where the two inequalities follow from Jensen's inequality and the choice of \(\varepsilon _{t-1}\) in Algorithm 1, respectively. Jensen's inequality, the triangle inequality and Proposition 4.8 (i) further imply that

$$\begin{aligned} \left\| \nabla h(\varvec{\phi }) \right\| \le {\mathbb {E}}\left[ \left\| \varvec{\nu }- \nabla _{\varvec{\phi }} {{\overline{\psi }}}_c(\varvec{\phi }, \varvec{x}) \right\| \right] \le \left\| \varvec{\nu }\right\| + {\mathbb {E}}\left[ \left\| \nabla _{\varvec{\phi }} {{\overline{\psi }}}_c(\varvec{\phi }, \varvec{x}) \right\| \right] \le 2. \end{aligned}$$

Assertion (i) thus follows from Theorem 4.6 (i) with \(R=2\). As for assertion (ii), we have

$$\begin{aligned}&\; {\mathbb {E}}\left[ \left\| \varvec{\nu }- \varvec{p}_t(\varvec{\phi }_{t-1}, \varvec{x}_t) - \nabla h(\varvec{\phi }_{t-1}) \right\| ^2 | \mathcal F_{t-1} \right] \\&\quad = {\mathbb {E}}\left[ \left\| \varvec{p}_t(\varvec{\phi }_{t-1}, \varvec{x}_t) - {\mathbb {E}}\left[ \nabla _{\varvec{\phi }} {{\overline{\psi }}}_c(\varvec{\phi }_{t-1}, \varvec{x}_t) \right] \right\| ^2 | \mathcal F_{t-1} \right] \\&\quad = {\mathbb {E}}\left[ \left\| \varvec{p}_t(\varvec{\phi }_{t-1}, \varvec{x}_t) - \nabla _{\varvec{\phi }} {{\overline{\psi }}}_c(\varvec{\phi }_{t-1}, \varvec{x}_t) + \nabla _{\varvec{\phi }} {{\overline{\psi }}}_c(\varvec{\phi }_{t-1}, \varvec{x}_t) - {\mathbb {E}}\left[ \nabla _{\varvec{\phi }} {{\overline{\psi }}}_c(\varvec{\phi }_{t-1}, \varvec{x}_t) \right] \right\| ^2 | \mathcal F_{t-1} \right] \\&\quad \le {\mathbb {E}}\left[ 2 \left\| \varvec{p}_t(\varvec{\phi }_{t-1}, \varvec{x}_t) - \nabla _{\varvec{\phi }} {{\overline{\psi }}}_c(\varvec{\phi }_{t-1}, \varvec{x}_t) \right\| ^2 + 2 \left\| \nabla _{\varvec{\phi }} {{\overline{\psi }}}_c(\varvec{\phi }_{t-1}, \varvec{x}_t) - {\mathbb {E}}\left[ \nabla _{\varvec{\phi }} {{\overline{\psi }}}_c(\varvec{\phi }_{t-1}, \varvec{x}_t) \right] \right\| ^2 | \mathcal F_{t-1} \right] \\&\quad \le 2\varepsilon _{t-1}^2 + 2 \le {{\bar{\varepsilon }}}^2 + 2, \end{aligned}$$

where the second inequality holds because \(\nabla _{\varvec{\phi }} {{\overline{\psi }}}_c(\varvec{\phi }_{t-1}, \varvec{x}_t) \in \Delta ^N\) and thus \(\Vert \nabla _{\varvec{\phi }} {{\overline{\psi }}}_c(\varvec{\phi }_{t-1}, \varvec{x}_t) \Vert _2^2 \le 1\), while the last inequality follows from the choice of \(\varepsilon _{t-1}\) in Algorithm 1. As \({{\overline{\psi }}}_c(\varvec{\phi }, \varvec{x})\) is L-smooth with respect to \(\varvec{\phi }\) by virtue of Proposition 4.8 (ii), we further have

$$\begin{aligned} \Vert \nabla h(\varvec{\phi }) - \nabla h(\varvec{\phi }') \Vert = \left\| {\mathbb {E}}\left[ \nabla _{\varvec{\phi }} {{\overline{\psi }}}_c(\varvec{\phi }, \varvec{x}) - \nabla _{\varvec{\phi }} {{\overline{\psi }}}_c(\varvec{\phi }', \varvec{x}) \right] \right\| \le L \Vert \varvec{\phi }- \varvec{\phi }' \Vert \quad \forall \varvec{\phi }, \varvec{\phi }' \in {\mathbb {R}}^N. \end{aligned}$$

Assertion (ii) thus follows from Theorem 4.6 (ii) with \(R=2\) and \(\sigma = \sqrt{{{\bar{\varepsilon }}}^2 + 2}\). As for assertion (iii), finally, we observe that h is M-generalized self-concordant thanks to Proposition 4.8 (iii). Assertion (iii) thus follows from Theorem 4.6 (iii) with \(R=2\).

\(\square \)

One can show that the objective function of the smooth optimal transport problem (12) with marginal exponential noise distributions as described in Example 3.8 is generalized self-concordant. Hence, the convergence rate of Algorithm 1 for this model is of the order \(\mathcal O(1/T)\), which improves on the state-of-the-art \(\mathcal O(1/\sqrt{T})\) guarantee established by Genevay et al. [64].
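For concreteness, the three step-size regimes of Corollary 4.10 transcribe directly into code; the following Python helper (names and interface are our own) merely collects the formulas stated in parts (i)-(iii).

```python
import numpy as np

def step_size(T, regime, eps_bar=0.0, L=None, M=None):
    """Step sizes gamma prescribed by Corollary 4.10 (i)-(iii)."""
    if regime == "continuous":          # part (i): continuous F_i
        return 1.0 / (2.0 * (2.0 + eps_bar) * np.sqrt(T))
    if regime == "lipschitz":           # part (ii): L-Lipschitz F_i
        return 1.0 / (2.0 * np.sqrt(T) + L)
    if regime == "self_concordant":     # part (iii): M-generalized self-concordant F_i
        G = max(M, 2.0 + eps_bar)
        return 1.0 / (2.0 * G**2 * np.sqrt(T))
    raise ValueError("unknown regime")
```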

5 Numerical experiments

All experiments are run on a 2.6 GHz 6-Core Intel Core i7, and all optimization problems are implemented in MATLAB R2020a. The corresponding codes are available at https://github.com/RAO-EPFL/Semi-Discrete-Smooth-OT.git.

We now aim to assess the empirical convergence behavior of Algorithm 1 and to showcase the effects of regularization in semi-discrete optimal transport. To this end, we solve the original dual optimal transport problem (10) as well as its smooth variant (12) with a Fréchet ambiguity set corresponding to the exponential distribution model of Example 3.8, to the uniform distribution model of Example 3.9 and to the hyperbolic cosine distribution model of Example 3.11. Recall from Theorem 3.7 that any Fréchet ambiguity set is uniquely determined by a marginal generating function F and a probability vector \(\varvec{\eta }\). As for the exponential distribution model of Example 3.8, we set \(F(s) = \exp (10 s - 1)\) and \(\eta _i = 1/N\) for all \(i\in [N]\). In this case problem (12) is equivalent to the regularized primal optimal transport problem (13) with an entropic regularizer, and the gradient \(\nabla _{\varvec{\phi }}{{\overline{\psi }}}_c(\varvec{\phi }, \varvec{x})\), which is known to coincide with the vector \(\varvec{p}^\star \) of optimal choice probabilities in problem (27), admits the closed-form representation (62). We can therefore solve problem (12) with a variant of Algorithm 1 that calculates \(\nabla _{\varvec{\phi }}{{\overline{\psi }}}_c(\varvec{\phi }, \varvec{x})\) exactly instead of approximately via bisection.

As for the uniform distribution model of Example 3.9, we set \(F(s) = s / 20 + 1/2\) and \(\eta _i = 1/N\) for all \(i\in [N]\). In this case problem (12) is equivalent to the regularized primal optimal transport problem (13) with a \(\chi ^2\)-divergence regularizer, and the vector \(\varvec{p}^\star \) of optimal choice probabilities can be computed exactly and highly efficiently by sorting thanks to Proposition D.1 in the appendix. We can therefore again solve problem (12) with a variant of Algorithm 1 that calculates \(\nabla _{\varvec{\phi }}{{\overline{\psi }}}_c(\varvec{\phi }, \varvec{x})\) exactly. As for the hyperbolic cosine model of Example 3.11, we set \(F(s) = \sinh (10s - k)\) with \(k=\sqrt{2} - 1 - \text {arcsinh}(1)\) and \(\eta _i = 1/N\) for all \(i \in [N]\). In this case problem (12) is equivalent to the regularized primal optimal transport problem (13) with a hyperbolic divergence regularizer. However, the vector \(\varvec{p}^\star \) is not available in closed form, and thus we use Algorithm 2 to compute \(\varvec{p}^\star \) approximately. Lastly, note that the original dual optimal transport problem (10) can be interpreted as an instance of (12) equipped with a degenerate singleton ambiguity set that only contains the Dirac measure at the origin of \({\mathbb {R}}^N\). In this case \({{\overline{\psi }}}_c(\varvec{\phi },\varvec{x}) = \psi _c(\varvec{\phi },\varvec{x})\) fails to be smooth in \(\varvec{\phi }\), but an exact subgradient \(\varvec{p}^\star \in \partial _{\varvec{\phi }} {{\overline{\psi }}}_c(\varvec{\phi },\varvec{x})\) is given by

$$\begin{aligned} p_i^\star = {\left\{ \begin{array}{ll} 1 \quad &{}\text {if } i = \min \, \mathop {\mathrm{argmax}}\limits _{i \in [N]}~\phi _i - c(\varvec{x}, \varvec{y}_i),\\ 0 &{}\text {otherwise.} \end{array}\right. } \end{aligned}$$

We can therefore solve problem (10) with a variant of Algorithm 1 that has access to exact subgradients (instead of gradients) of \({{\overline{\psi }}}_c(\varvec{\phi }, \varvec{x})\). Note that the maximizer \(\varvec{\phi }^\star \) of (10) may not be unique. In our experiments, we force Algorithm 1 to converge to the maximizer with minimal Euclidean norm by adding a vanishingly small Tikhonov regularization term to \(\psi _c(\varvec{\phi },\varvec{x})\). Thus, we set \({{\overline{\psi }}}_c(\varvec{\phi },\varvec{x}) = \psi _c(\varvec{\phi },\varvec{x}) + \varepsilon \Vert \varvec{\phi }\Vert _2^2\) for some small regularization weight \(\varepsilon > 0\), in which case \(\varvec{p}^\star +2\varepsilon \varvec{\phi }\in \partial _{\varvec{\phi }}{{\overline{\psi }}}_c(\varvec{\phi },\varvec{x})\) is an exact subgradient.
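A minimal Python sketch of this exact subgradient oracle, including the Tikhonov correction, could read as follows; the function name and interface are our own illustration.

```python
import numpy as np

def exact_subgradient(phi, x, cost_fn, ys, eps_reg=0.0):
    """Exact subgradient of the (Tikhonov-corrected) c-transform at (phi, x):
    a unit vector at the smallest index maximizing phi_i - c(x, y_i),
    plus the term 2 * eps_reg * phi from the regularizer eps * ||phi||^2."""
    scores = np.asarray(phi) - np.array([cost_fn(x, y) for y in ys])
    p = np.zeros(len(scores))
    p[int(np.argmax(scores))] = 1.0   # np.argmax picks the smallest maximizing index
    return p + 2.0 * eps_reg * np.asarray(phi)
```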

In the following we set \(\mu \) to the standard Gaussian measure on \(\mathcal X= {\mathbb {R}}^2\) and \(\nu \) to the uniform measure on 10 independent samples drawn uniformly from \(\mathcal Y=[-1,\, 1]^2\). We further set the transportation cost to \(c(\varvec{x}, \varvec{y}) = \Vert \varvec{x} - \varvec{y}\Vert _\infty \). Under these assumptions, we use Algorithm 1 to solve the original as well as the three smooth optimal transport problems approximately for \(T=1,\ldots , 10^5\). For each fixed T the step size is selected in accordance with Corollary 4.10.
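Since Algorithm 1 is stated only in pseudocode above, we sketch the experimental loop in Python below; the plain ascent update \(\varvec{\phi }_t = \varvec{\phi }_{t-1} + \gamma (\varvec{\nu }- \varvec{p}_t)\) and the generic oracle `grad_oracle` reflect our reading of the algorithm's description rather than a verbatim transcription.

```python
import numpy as np

def averaged_sgd(nu, sample_mu, grad_oracle, T, gamma, eps_bar=0.0):
    """Sketch of an averaged SGD loop for the smooth dual problem (12).

    grad_oracle(phi, x, eps) must return an eps-accurate (sub)gradient of
    the smooth c-transform at (phi, x)."""
    N = len(nu)
    phi = np.zeros(N)                          # initial dual iterate
    phi_bar = np.zeros(N)                      # running average of the iterates
    for t in range(1, T + 1):
        x = sample_mu()                        # draw a fresh sample x_t ~ mu
        eps_t = eps_bar / (2.0 * np.sqrt(t))   # oracle accuracy used in the proof of Cor. 4.10
        p = grad_oracle(phi, x, eps_t)         # estimate of grad_phi psi_c(phi, x)
        phi = phi + gamma * (nu - p)           # stochastic ascent step on the dual objective
        phi_bar += (phi - phi_bar) / t         # incremental mean, phi_bar_T = (1/T) sum phi_t
    return phi_bar
```

For the experiment above one would, for instance, pass `nu = np.ones(10) / 10`, `sample_mu = lambda: np.random.randn(2)` and one of the oracles sketched earlier as `grad_oracle`.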

We emphasize that Corollary 4.10 (i) remains valid if \({{\overline{\psi }}}_c(\varvec{\phi },\varvec{x})\) fails to be smooth in \(\varvec{\phi }\) and we only have access to subgradients; see ([116], Corollary 1). Denoting by \(\bar{\varvec{\phi }}_T\) the output of Algorithm 1, we record the suboptimality

$$\begin{aligned} {{\overline{W}}}_c(\mu , \nu ) - {\mathbb {E}}_{\varvec{x} \sim \mu } \left[ \varvec{\nu }^\top \bar{\varvec{\phi }}_T - {{\overline{\psi }}}_c(\bar{\varvec{\phi }}_T , \varvec{x})\right] \end{aligned}$$

of \(\bar{\varvec{\phi }}_T\) in (12) as well as the discrepancy \(\Vert \bar{\varvec{\phi }}_T - \varvec{\phi }^\star \Vert ^2_2\) between \(\bar{\varvec{\phi }}_T\) and the exact maximizer \(\varvec{\phi }^\star \) of problem (12) as a function of T. In order to faithfully measure the convergence rate of \(\bar{\varvec{\phi }}_T\) and its suboptimality, we need to compute \(\varvec{\phi }^\star \) as well as \({{\overline{W}}}_c(\mu , \nu )\) to within high accuracy. This is only possible if the dimension of \(\mathcal X\) is small (e.g., if \(\mathcal X= {\mathbb {R}}^2\) as in our numerical example), even though Algorithm 1 can efficiently solve optimal transport problems in high dimensions. We obtain high-quality approximations for \({{\overline{W}}}_c(\mu , \nu )\) and \(\varvec{\phi }^\star \) by solving the finite-dimensional optimal transport problem between \(\nu \) and the discrete distribution that places equal weight on \(10 \times T\) samples drawn independently from \(\mu \). Note that only the first T of these samples are used by Algorithm 1. The high-quality approximations of the entropic and \(\chi ^2\)-divergence regularized optimal transport problems are conveniently computed via Nesterov's accelerated gradient descent method, where the suboptimality gap of the \(t^{\text {th}}\) iterate is guaranteed to decay as \(\mathcal O(1/ t^2)\) under the step size rule advocated in ([114], Theorem 1). To the best of our knowledge, Nesterov's accelerated gradient descent algorithm is not guaranteed to converge with inexact gradients. For the hyperbolic divergence regularized optimal transport problem, we thus use Algorithm 1 with \(50 \times T\) iterations to obtain an approximation for \({{\overline{W}}}_c(\mu , \nu )\) and \(\varvec{\phi }^\star \). In contrast, we compute the high-quality approximation of the original optimal transport problem (10) by modeling it in YALMIP [95] and solving it with MOSEK. If this problem has multiple maximizers, we report the one with minimal Euclidean norm.
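For completeness, the marginal generating functions F of the three regularization models above, together with their elementary inverses (as used, e.g., by the bisection oracle), transcribe directly from the specifications given earlier. The dictionary layout below is our own, and we do not address the construction of the marginals \(F_i\) from F and \(\varvec{\eta }\) via Theorem 3.7.

```python
import numpy as np

K = np.sqrt(2.0) - 1.0 - np.arcsinh(1.0)   # the constant k of the hyperbolic model

MARGINAL_GENERATORS = {
    # model name: (F, inverse of F), transcribed from the experimental setup
    "exponential": (lambda s: np.exp(10.0 * s - 1.0),
                    lambda t: (1.0 + np.log(t)) / 10.0),
    "uniform":     (lambda s: s / 20.0 + 0.5,
                    lambda t: 20.0 * (t - 0.5)),
    "hyperbolic":  (lambda s: np.sinh(10.0 * s - K),
                    lambda t: (np.arcsinh(t) + K) / 10.0),
}
```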

Fig. 1: Suboptimality (a) and discrepancy to \(\varvec{\phi }^\star \) (b) of the outputs \(\bar{\varvec{\phi }}_T\) of Algorithm 1 for the original (blue), the entropic regularized (orange), the \(\chi ^2\)-divergence regularized (red) and the hyperbolic divergence regularized (purple) optimal transport problems

Figure 1 shows how the suboptimality of \(\bar{\varvec{\phi }}_T\) and the discrepancy between \(\bar{\varvec{\phi }}_T\) and the exact maximizer decay with T, both for the original as well as for the entropic, the \(\chi ^2\)-divergence and the hyperbolic divergence regularized optimal transport problems, averaged across 20 independent simulation runs. Figure 1a suggests that the suboptimality decays as \(\mathcal O(1/\sqrt{T})\) for the original optimal transport problem, which is in line with the theoretical guarantees by Nesterov and Vial ([116], Corollary 1), and as \(\mathcal O(1/ T)\) for the entropic, the \(\chi ^2\)-divergence and the hyperbolic divergence regularized optimal transport problems, which is consistent with the theoretical guarantees established in Corollary 4.10. Indeed, entropic regularization can be explained by the exponential distribution model of Example 3.8, where the exponential distribution functions \(F_i\) satisfy the generalized self-concordance condition (60) with \(M =1/ \lambda \). Similarly, \(\chi ^2\)-divergence regularization can be explained by the uniform distribution model of Example 3.9, where the uniform distribution functions \(F_i\) satisfy the generalized self-concordance condition with any \(M > 0\). Finally, hyperbolic divergence regularization can be explained by the hyperbolic cosine distribution model of Example 3.11, where the hyperbolic cosine distribution functions \(F_i\) satisfy the generalized self-concordance condition with \(M = 1/\lambda \). In all cases the smallest eigenvalue of \(-\nabla _{\varvec{\phi }}^2 {\mathbb {E}}_{\varvec{x} \sim \mu } [\varvec{\nu }^\top \varvec{\phi }^\star - {\overline{\psi }}_{c}(\varvec{\phi }^\star , \varvec{x})]\), which we estimate when solving the high-quality approximations of the three smooth optimal transport problems, is strictly positive. Therefore, Corollary 4.10 (iii) is indeed applicable and guarantees that the suboptimality gap is bounded above by \(\mathcal O (1/T)\).

Finally, Fig. 1b suggests that \(\Vert \bar{\varvec{\phi }}_T - \varvec{\phi }^\star \Vert ^2_2\) converges to 0 at rate \(\mathcal O(1/T)\) for the entropic, the \(\chi ^2\)-divergence and the hyperbolic divergence regularized optimal transport problems, which is consistent with ([14], Proposition 10).