1 Introduction

Optimal transport theory has a long and distinguished history in mathematics dating back to the seminal work of Monge [107] and Kantorovich [79]. While originally envisaged for applications in civil engineering, logistics and economics, optimal transport problems provide a natural framework for comparing probability measures and have therefore recently found numerous applications in statistics and machine learning. Indeed, the minimum cost of transforming a probability measure \(\mu \) on \({\mathcal {X}}\) to some other probability measure \(\nu \) on \({\mathcal {Y}}\) with respect to a prescribed cost function on \({\mathcal {X}}\times {\mathcal {Y}}\) can be viewed as a measure of distance between \(\mu \) and \(\nu \). If \({\mathcal {X}}={\mathcal {Y}}\) and the cost function coincides with (the \(p^{\text {th}}\) power of) a metric on \({\mathcal {X}}\times {\mathcal {X}}\), then the resulting optimal transport distance represents (the \(p^{\text {th}}\) power of) a Wasserstein metric on the space of probability measures over \({\mathcal {X}}\) [168]. In the remainder of this paper we distinguish discrete, semi-discrete and continuous optimal transport problems in which either both, only one or none of the two probability measures \(\mu \) and \(\nu \) are discrete, respectively.

In the wider context of machine learning, discrete optimal transport problems are nowadays routinely used, for example, in the analysis of mixture models [84, 118] as well as in image processing [8, 58, 83, 121, 160], computer vision and graphics [124, 125, 140, 156, 157], data-driven bioengineering [59, 86, 169], clustering [73], dimensionality reduction [29, 60, 139, 145, 148], domain adaptation [38, 109], distributionally robust optimization [106, 117, 150, 151], scenario reduction [72, 142], scenario generation [74, 129], the assessment of the fairness properties of machine learning algorithms [67, 161, 162] and signal processing [163].

The discrete optimal transport problem represents a tractable linear program that is susceptible to the network simplex algorithm [119]. Alternatively, it can be addressed with dual ascent methods [21], the Hungarian algorithm for assignment problems [85] or customized auction algorithms [19, 20]. The currently best known complexity bound for computing an exact solution is attained by modern interior-point algorithms. Indeed, if N denotes the number of atoms in \(\mu \) or in \(\nu \), whichever is larger, then the discrete optimal transport problem can be solved in time \(\mathcal {{\tilde{O}}}(N^{2.5})\) with an interior point algorithm by Lee and Sidford [89]. The need to evaluate optimal transport distances between increasingly fine-grained histograms has also motivated efficient approximation schemes. Blanchet et al. [23] and Quanrud [134] show that an \(\epsilon \)-optimal solution can be found in time \({\mathcal {O}}(N^2/\epsilon )\) by reducing the discrete optimal transport problem to a matrix scaling or a positive linear programming problem, which can be solved efficiently by a Newton-type algorithm. Jambulapati et al. [77] describe a parallelizable primal-dual first-order method that achieves a similar convergence rate.
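
To make the linear programming formulation concrete, the following minimal sketch solves a small discrete optimal transport problem with a general-purpose LP solver (assuming NumPy and SciPy are available; the marginals and the cost matrix are illustrative placeholders, and dedicated network simplex implementations are substantially faster in practice):

```python
import numpy as np
from scipy.optimize import linprog

# Illustrative discrete measures: mu with m atoms, nu with n atoms.
m, n = 3, 4
rng = np.random.default_rng(0)
mu_w = np.full(m, 1.0 / m)              # probabilities of the atoms of mu
nu_w = np.full(n, 1.0 / n)              # probabilities of the atoms of nu
C = rng.random((m, n))                  # transportation costs c(x_i, y_j)

# Variables: the transportation plan pi, flattened row-major.
# Equality constraints enforce the prescribed marginals.
A_eq = np.zeros((m + n, m * n))
for i in range(m):
    A_eq[i, i * n:(i + 1) * n] = 1.0    # sum_j pi_ij = mu_w[i]
for j in range(n):
    A_eq[m + j, j::n] = 1.0             # sum_i pi_ij = nu_w[j]
b_eq = np.concatenate([mu_w, nu_w])

res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
print("W_c(mu, nu) =", res.fun)
```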

The tractability of the discrete optimal transport problem can be improved by adding an entropy regularizer to its objective function, which penalizes the entropy of the transportation plan for morphing \(\mu \) into \(\nu \). When the weight of the regularizer grows, this problem reduces to the classical Schrödinger bridge problem of finding the most likely random evolution from \(\mu \) to \(\nu \) [147]. Generic linear programs with entropic regularizers were first studied by Fang [56]. Cominetti and San Martín [35] prove that the optimal values of these regularized problems converge exponentially fast to the optimal values of the corresponding unregularized problems as the regularization weight drops to zero. Non-asymptotic convergence rates for entropy regularized linear programs are derived by Weed [171]. Cuturi [39] was the first to realize that entropic penalties are computationally attractive because they make the discrete optimal transport problem susceptible to a fast matrix scaling algorithm by Sinkhorn [155]. This insight has spurred widespread interest in machine learning and led to a host of new applications of optimal transport in color transfer [31], inverse problems [2, 80], texture synthesis [128], the analysis of crowd evolutions [126] and shape interpolation [157] to name a few. This surge of applications inspired in turn several new algorithms for the entropy regularized discrete optimal transport problem such as a greedy dual coordinate descent method also known as the Greenkhorn algorithm [1, 6, 30]. Dvurechensky et al. [51] and Lin et al. [94] prove that both the Sinkhorn and the Greenkhorn algorithms are guaranteed to find an \(\epsilon \)-optimal solution in time \(\tilde{{\mathcal {O}}}({N^2}/{\epsilon ^2})\). In practice, however, the Greenkhorn algorithm often outperforms the Sinkhorn algorithm [94]. The runtime guarantee of both algorithms can be improved to \(\tilde{{\mathcal {O}}}(N^{7/3}/\epsilon )\) via a randomization scheme [93]. In addition, the regularized discrete optimal transport problem can be addressed by tailoring general-purpose optimization algorithms such as accelerated gradient descent algorithms [51], iterative Bregman projections [18], quasi-Newton methods [24] or stochastic average gradient descent algorithms [64]. While the original optimal transport problem induces sparse solutions, the entropy penalty forces the optimal transportation plan of the regularized optimal transport problem to be strictly positive and thus completely dense. In applications where the interpretability of the optimal transportation plan is important, the lack of sparsity could be undesirable; examples include color transfer [131], domain adaptation [38] or ecological inference [110]. Hence, there is merit in exploring alternative regularization schemes that retain the attractive computational properties of the entropic regularizer but induce sparsity. Examples that have attracted significant interest include smooth convex regularization and Tikhonov regularization [24, 47, 54, 149], Lasso regularization [92], Tsallis entropy regularization [110] or group Lasso regularization [38].
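
As a concrete illustration of the matrix scaling approach, the sketch below implements the basic Sinkhorn iteration for the entropy regularized problem (a minimal version of our own with placeholder data; production implementations add log-domain stabilization and convergence checks):

```python
import numpy as np

def sinkhorn(mu, nu, C, eps=0.1, iters=500):
    """Entropy regularized optimal transport via Sinkhorn's matrix scaling."""
    K = np.exp(-C / eps)                  # Gibbs kernel
    u = np.ones_like(mu)
    for _ in range(iters):
        v = nu / (K.T @ u)                # rescale to match column marginals
        u = mu / (K @ v)                  # rescale to match row marginals
    pi = u[:, None] * K * v[None, :]      # regularized transportation plan
    return pi, np.sum(pi * C)

rng = np.random.default_rng(0)
pi, cost = sinkhorn(np.full(3, 1/3), np.full(4, 1/4), rng.random((3, 4)))
print(cost, pi.sum(axis=1), pi.sum(axis=0))  # cost, (near-)matched marginals
```

Note that the returned plan is strictly positive, which illustrates the loss of sparsity discussed above.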

Much like the discrete optimal transport problems, the significantly more challenging semi-discrete optimal transport problems emerge in numerous applications including variational inference [9], blue noise sampling [133], computational geometry [90], image quantization [42] or deep learning with generative adversarial networks [11, 65, 68]. Semi-discrete optimal transport problems are also used in fluid mechanics to simulate incompressible fluids [43].

Exact solutions of a semi-discrete optimal transport problem can be constructed by solving an incompressible Euler-type partial differential equation discovered by Brenier [27]. Any optimal solution is known to partition the support of the non-discrete measure into cells corresponding to the atoms of the discrete measure [12], and the resulting tessellation is usually referred to as a power diagram. Mirebeau [103] uses this insight to solve Monge-Ampère equations with a damped Newton algorithm, and Kitagawa et al. [82] show that a closely related algorithm with a global linear convergence rate lends itself to the numerical solution of generic semi-discrete optimal transport problems. In addition, Mérigot [102] proposes a quasi-Newton algorithm for semi-discrete optimal transport, which improves a method due to Aurenhammer et al. [12] by exploiting Lloyd's algorithm to iteratively simplify the discrete measure. If the transportation cost is quadratic, Bonnotte [25] relates the optimal transportation plan to the Knothe-Rosenblatt rearrangement for mapping \(\mu \) to \(\nu \), which is very easy to compute.

As usual, regularization improves tractability. Genevay et al. [64] show that the dual of a semi-discrete optimal transport problem with an entropic regularizer is susceptible to an averaged stochastic gradient descent algorithm that enjoys a convergence rate of \(\mathcal O(1/\sqrt{T})\), T being the number of iterations. Altschuler et al. [7] show that the optimal value of the entropically regularized problem converges to the optimal value of the unregularized problem at a quadratic rate as the regularization weight drops to zero. Improved error bounds under stronger regularity conditions are derived by Delalande [46].
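
The averaged SGD scheme is simple to sketch for the entropic case, where the smoothed dual objective involves a log-sum-exp and its stochastic gradient is the vector of logit choice probabilities (a minimal illustration of the idea, not the implementation of [64]; here \(\mu \) is taken to be uniform on \([0,1]^2\), the cost quadratic, and all data placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
N, eps = 5, 0.1
Y = rng.random((N, 2))                   # atoms y_1, ..., y_N of nu
nu = np.full(N, 1.0 / N)                 # their probabilities

def grad_sample(phi, x):
    """Stochastic gradient of the smoothed dual objective at one sample x."""
    u = (phi - np.sum((x - Y) ** 2, axis=1)) / eps
    p = np.exp(u - u.max()); p /= p.sum()   # logit choice probabilities
    return nu - p

phi, phi_avg = np.zeros(N), np.zeros(N)
for t in range(1, 20001):
    x = rng.random(2)                    # fresh sample from mu = Unif([0,1]^2)
    phi += grad_sample(phi, x) / np.sqrt(t)
    phi_avg += (phi - phi_avg) / t       # Polyak-Ruppert averaging
print(phi_avg)
```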

Continuous optimal transport problems constitute difficult variational problems involving infinitely many variables and constraints. Benamou and Brenier [17] recast them as boundary value problems in fluid dynamics, and Papadakis et al. [122] solve discretized versions of these reformulations using first-order methods. For a comprehensive survey of the interplay between partial differential equations and optimal transport we refer to [55]. As nearly all numerical methods for partial differential equations suffer from a curse of dimensionality, current research focuses on solution schemes for regularized continuous optimal transport problems. For instance, Genevay et al. [64] embed their duals into a reproducing kernel Hilbert space to obtain finite-dimensional optimization problems that can be solved with a stochastic gradient descent algorithm. Seguy et al. [149] solve regularized continuous optimal transport problems by representing the transportation plan as a multilayer neural network. This approach results in finite-dimensional optimization problems that are non-convex and offer no approximation guarantees. However, it provides an effective means to compute approximate solutions in high dimensions. Indeed, the optimal value of the entropically regularized continuous optimal transport problem is known to converge to the optimal value of the unregularized problem at a linear rate as the regularization weight drops to zero [32, 36, 53, 120]. Due to a lack of efficient algorithms, applications of continuous optimal transport problems are scarce in the extant literature. Peyré and Cuturi [127] provide a comprehensive survey of numerous applications and solution methods for discrete, semi-discrete and continuous optimal transport problems.

This paper focuses on semi-discrete optimal transport problems. Our main goal is to formally establish that these problems are computationally hard, to propose a unifying regularization scheme for improving their tractability and to develop efficient algorithms for solving the resulting regularized problems, assuming only that we have access to independent samples from the continuous probability measure \(\mu \). Our regularization scheme is based on the observation that any dual semi-discrete optimal transport problem maximizes the expectation of a piecewise affine function with N pieces, where the expectation is evaluated with respect to \(\mu \), and where N denotes the number of atoms of the discrete probability measure \(\nu \). We argue that this piecewise affine function can be interpreted as the optimal value of a discrete choice problem, which can be smoothed by adding random disturbances to the underlying utility values [99, 164]. As probabilistic discrete choice problems are routinely studied in economics and psychology, we can draw on a wealth of literature in choice theory to design various smooth (dual) optimal transport problems with favorable numerical properties. For maximal generality we will also study semi-parametric discrete choice models where the disturbance distribution is itself subject to uncertainty [4, 57, 105, 111]. Specifically, we aim to evaluate the best-case (maximum) expected utility across a Fréchet ambiguity set containing all disturbance distributions with prescribed marginals. Such models can be addressed with customized methods from modern distributionally robust optimization [111]. For Fréchet ambiguity sets, we prove that smoothing the dual objective is equivalent to regularizing the primal objective of the semi-discrete optimal transport problem. The corresponding regularizer penalizes the discrepancy between the chosen transportation plan and the product measure \(\mu \otimes \nu \) with respect to a divergence measure constructed from the marginal disturbance distributions. Connections between primal regularization and dual smoothing were previously recognized by Blondel et al. [24] and Paty and Cuturi [123] in discrete optimal transport and by Genevay et al. [64] in semi-discrete optimal transport. As they are constructed ad hoc or under a specific adversarial noise model, these existing regularization schemes lack the intuitive interpretation offered by discrete choice theory and emerge as special cases of our unifying scheme.

The key contributions of this paper are summarized below.

  i. We study the computational complexity of semi-discrete optimal transport problems. Specifically, we prove that computing the optimal transport distance between two probability measures \(\mu \) and \(\nu \) on the same Euclidean space is \(\#\)P-hard even if only approximate solutions are sought and even if \(\mu \) is the Lebesgue measure on the standard hypercube and \(\nu \) is supported on merely two points.

  ii. We propose a unifying framework for regularizing semi-discrete optimal transport problems by leveraging ideas from distributionally robust optimization and discrete choice theory [4, 57, 105, 111]. Specifically, we perturb the transportation cost to every atom of the discrete measure \(\nu \) with a random disturbance, and we assume that the vector of all disturbances is governed by an uncertain probability distribution from within a Fréchet ambiguity set that prescribes the marginal disturbance distributions. Solving the dual optimal transport problem under the least favorable disturbance distribution in the ambiguity set amounts to smoothing the dual and regularizing the primal objective function. We show that numerous known and new regularization schemes emerge as special cases of this framework, and we derive a priori approximation bounds for the resulting regularized optimal transport problems.

  iii. We derive new convergence guarantees for an averaged stochastic gradient descent (SGD) algorithm that has access only to a biased stochastic gradient oracle. Specifically, we prove that this algorithm enjoys a convergence rate of \(\mathcal O(1/\sqrt{T})\) for Lipschitz continuous and of \(\mathcal O(1/T)\) for generalized self-concordant objective functions. We also show that this algorithm lends itself to solving the smooth dual optimal transport problems obtained from the proposed regularization scheme. When the smoothing is based on a semi-parametric discrete choice model with a Fréchet ambiguity set, the algorithm’s convergence rate depends on the smoothness properties of the marginal noise distributions, and its per-iteration complexity depends on our ability to compute the optimal choice probabilities. We demonstrate that these choice probabilities can indeed be computed efficiently via bisection or sorting, and in special cases they are even available in closed form. As a byproduct, we show that our algorithm can improve the state-of-the-art \(\mathcal O(1/\sqrt{T})\) convergence guarantee of Genevay et al. [64] for the semi-discrete optimal transport problem with an entropic regularizer.

The rest of this paper unfolds as follows. In Sect. 2 we study the computational complexity of semi-discrete optimal transport problems, and in Sect. 3 we develop our unifying regularization scheme. In Sect. 4 we analyze the convergence rate of an averaged SGD algorithm with a biased stochastic gradient oracle that can be used for solving smooth dual optimal transport problems, and in Sect. 5 we compare its empirical convergence behavior against the theoretical convergence guarantees.

Notation. We denote by \(\Vert \cdot \Vert \) the 2-norm, by \([N] = \{1, \ldots , N \}\) the set of all integers up to \(N\in {\mathbb {N}}\) and by \(\Delta ^d = \{\varvec{x} \in {\mathbb {R}}_+^d : \sum _{i = 1}^d x_i =1\}\) the probability simplex in \(\mathbb R^d\). For a logical statement \(\mathcal E\) we define \(\mathbbm {1}_{\mathcal E} = 1\) if \(\mathcal E\) is true and \(\mathbbm {1}_{\mathcal E} = 0\) if \(\mathcal E\) is false. For any closed set \({\mathcal {X}}\subseteq {\mathbb {R}}^d\) we define \({\mathcal {M}}({\mathcal {X}})\) as the family of all Borel measures and \({\mathcal {P}}({\mathcal {X}})\) as its subset of all Borel probability measures on \({\mathcal {X}}\). For \(\mu \in {\mathcal {P}}({\mathcal {X}})\), we denote by \({\mathbb {E}}_{\varvec{x} \sim \mu }[\cdot ]\) the expectation operator under \(\mu \) and define \({\mathcal {L}}({\mathcal {X}}, \mu )\) as the family of all \(\mu \)-integrable functions \(f:{\mathcal {X}}\rightarrow {\mathbb {R}}\), that is, \(f \in {\mathcal {L}}({\mathcal {X}}, \mu )\) if and only if \(\int _{{\mathcal {X}}} |f(\varvec{x})| \mu (\mathrm {d}\varvec{x})<\infty \). The Lipschitz modulus of a function \(f: {\mathbb {R}}^d \rightarrow {\mathbb {R}}\) is defined as \({{\,\mathrm{lip}\,}}(f) = \sup _{\varvec{x}, \varvec{x}'}\{|f(\varvec{x}) - f(\varvec{x}')|/\Vert \varvec{x} - \varvec{x}'\Vert : \varvec{x} \ne \varvec{x}'\}\). The convex conjugate of \(f: {\mathbb {R}}^d \rightarrow [-\infty ,+\infty ]\) is the function \(f^*:{\mathbb {R}}^d\rightarrow [-\infty ,+\infty ]\) defined through \(f^{*}(\varvec{y}) = \sup _{\varvec{x} \in {\mathbb {R}}^d}\varvec{y}^\top \varvec{x} - f(\varvec{x})\).

2 Hardness of computing optimal transport distances

If \({\mathcal {X}}\) and \({\mathcal {Y}}\) are closed subsets of finite-dimensional Euclidean spaces and \(c: {\mathcal {X}}\times {\mathcal {Y}}\rightarrow [0,+\infty ]\) is a lower-semicontinuous cost function, then the Monge-Kantorovich optimal transport distance between two probability measures \(\mu \in \mathcal P({\mathcal {X}})\) and \(\nu \in \mathcal P({\mathcal {Y}})\) is defined as

$$\begin{aligned} W_c(\mu , \nu ) = \min \limits _{\pi \in \Pi (\mu ,\nu )} ~ {\mathbb {E}}_{(\varvec{x}, \varvec{y}) \sim \pi }\left[ {c(\varvec{x}, \varvec{y})}\right] , \end{aligned}$$
(1)

where \(\Pi (\mu ,\nu )\) denotes the family of all couplings of \(\mu \) and \(\nu \), that is, the set of all probability measures on \({\mathcal {X}}\times {\mathcal {Y}}\) with marginals \(\mu \) on \({\mathcal {X}}\) and \(\nu \) on \({\mathcal {Y}}\). One can show that the minimum in (1) is always attained ([168], Theorem 4.1). If \({\mathcal {X}}={\mathcal {Y}}\) is a metric space with metric \(d:{\mathcal {X}}\times {\mathcal {X}}\rightarrow {\mathbb {R}}_+\) and the transportation cost is defined as \(c(\varvec{x}, \varvec{y})=d^p(\varvec{x},\varvec{y})\) for some \(p \ge 1\), then \(W_c(\mu , \nu )^{1/p}\) is termed the p-th Wasserstein distance between \(\mu \) and \(\nu \). The optimal transport problem (1) constitutes an infinite-dimensional linear program over measures and admits a strong dual linear program over functions ([168], Theorem 5.9).

Proposition 2.1

(Kantorovich duality) The optimal transport distance between \(\mu \in {\mathcal {P}}({\mathcal {X}})\) and \(\nu \in {\mathcal {P}}({\mathcal {Y}})\) admits the dual representation

$$\begin{aligned} W_c(\mu , \nu ) =\left\{ \begin{array}{c@{\quad }l@{\qquad }l} \sup &{} \displaystyle {\mathbb {E}}_{\varvec{y} \sim \nu }\left[ {\phi (\varvec{y})}\right] - {\mathbb {E}}_{\varvec{x} \sim \mu }\left[ {\psi (\varvec{x})}\right] &{} \\ \mathrm {s.t.}&{} \psi \in {\mathcal {L}}({\mathcal {X}}, \mu ),~ \phi \in {\mathcal {L}}({\mathcal {Y}}, \nu )&{} \\ &{} \phi (\varvec{y}) - \psi (\varvec{x}) \le c(\varvec{x}, \varvec{y}) \quad \forall \varvec{x} \in {\mathcal {X}},~ \varvec{y} \in {\mathcal {Y}}. \end{array}\right. \end{aligned}$$
(2)

The linear program (2) optimizes over the two Kantorovich potentials \(\psi \in {\mathcal {L}}({\mathcal {X}}, \mu )\) and \(\phi \in {\mathcal {L}}({\mathcal {Y}}, \nu )\), but it can be reformulated as the following non-linear program over a single potential function,

$$\begin{aligned} W_c(\mu , \nu ) =\sup _{\phi \in {\mathcal {L}}({\mathcal {Y}}, \nu )} ~ \displaystyle {\mathbb {E}}_{\varvec{y} \sim \nu }\left[ \phi (\varvec{y})\right] - {\mathbb {E}}_{\varvec{x} \sim \mu }\left[ \phi _c(\varvec{x}) \right] , \end{aligned}$$
(3)

where \(\phi _c:{\mathcal {X}}\rightarrow [-\infty ,+\infty ]\) is called the c-transform of \(\phi \) and is defined through

$$\begin{aligned} \phi _c(\varvec{x}) = \sup _{\varvec{y} \in {\mathcal {Y}}} ~ \phi (\varvec{y}) - c(\varvec{x}, \varvec{y}) \qquad \forall \varvec{x} \in {\mathcal {X}}, \end{aligned}$$
(4)

see Villani ([168], § 5) for details. The Kantorovich duality is the key enabling mechanism to study the computational complexity of the optimal transport problem (1).

Theorem 2.2

(Hardness of computing optimal transport distances) Computing \(W_c(\mu , \nu )\) is #P-hard even if \({\mathcal {X}}={\mathcal {Y}}={\mathbb {R}}^d\), \(c(\varvec{x}, \varvec{y}) = \Vert \varvec{x}-\varvec{y}\Vert ^{p}\) for some \(p\ge 1\), \(\mu \) is the Lebesgue measure on the standard hypercube \([0,1]^d\), and \(\nu \) is a discrete probability measure supported on only two points.

To prove Theorem 2.2, we will show that computing the optimal transport distance \(W_c(\mu , \nu )\) is at least as hard as computing the volume of the knapsack polytope \(P( \varvec{w}, b) = \{\varvec{x}\in [0,1]^d : \varvec{w}^\top \varvec{x}\le b\}\) for a given \(\varvec{w}\in {\mathbb {R}}^d_+\) and \( b \in {\mathbb {R}}_+\), which is known to be \(\#\)P-hard ([52], Theorem 1). Specifically, we will leverage the following variant of this hardness result, which establishes that approximating the volume of the knapsack polytope \(P( \varvec{w}, b)\) to a sufficiently high accuracy is already \(\#\)P-hard.

Lemma 2.3

(Hanasusanto et al. ([70], Lemma 1)) Computing the volume of the knapsack polytope \(P( \varvec{w}, b)\) for a given \(\varvec{w}\in {\mathbb {R}}^d_+\) and \( b \in {\mathbb {R}}_+\) to within an absolute accuracy of \(\delta >0\) is \(\#\)P-hard whenever

$$\begin{aligned} \delta <\frac{1}{ {2d!(\Vert \varvec{w}\Vert _1+2)^d(d+1)^{d+1}\prod _{i = 1}^{d}w_i}}. \end{aligned}$$
(5)

Fix now any knapsack polytope \(P( \varvec{w}, b)\) encoded by \(\varvec{w}\in {\mathbb {R}}_+^d\) and \( b \in {\mathbb {R}}_+\). Without loss of generality, we may assume that \(\varvec{w} \ne \varvec{0}\) and \(b > 0\). Indeed, we are allowed to exclude \(\varvec{w} = \varvec{0} \) because the volume of \(P(\varvec{0}, b) \) is trivially equal to 1. On the other hand, \(b= 0\) can be excluded by applying a suitable rotation and translation, which are volume-preserving transformations. In the remainder, we denote by \(\mu \) the Lebesgue measure on the standard hypercube \([0,1]^d\) and by \({\nu }_ t = t \delta _{\varvec{y}_1} + (1-t) \delta _{\varvec{y}_2}\) a family of discrete probability measures with two atoms at \(\varvec{y}_1=\varvec{0}\) and \(\varvec{y}_2=2b\varvec{w}/ \Vert \varvec{w}\Vert ^2\), respectively, whose probabilities are parameterized by \(t \in [0, 1]\).
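
By construction, a point of the hypercube lies in \(P(\varvec{w}, b)\) exactly when it is at least as close to \(\varvec{y}_1\) as to \(\varvec{y}_2\), which can be verified numerically (a quick Monte Carlo sketch with illustrative \(\varvec{w}\) and b):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
w = rng.random(d)
b = 0.3 * np.sum(w)
y1 = np.zeros(d)
y2 = 2 * b * w / np.dot(w, w)

X = rng.random((100_000, d))   # samples from the Lebesgue measure on [0,1]^d
in_polytope = X @ w <= b
closer_to_y1 = np.linalg.norm(X - y1, axis=1) <= np.linalg.norm(X - y2, axis=1)
print("agreement:", np.mean(in_polytope == closer_to_y1))  # ≈ 1.0
print("Vol(P(w, b)) ≈", in_polytope.mean())                # Monte Carlo volume
```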

The following preparatory lemma relates the volume of \(P( \varvec{{w}},b)\) to the optimal transport problem (1) and is thus instrumental for the proof of Theorem 2.2.

Lemma 2.4

If \(c(\varvec{x}, \varvec{y})=\Vert \varvec{x}- \varvec{y} \Vert ^p\) for some \(p\ge 1\), then we have \({\mathrm{Vol}}(P( \varvec{{w}},b)) = {{\,\mathrm{argmin}\,}}_{ t \in [0,1]} W_c(\mu , {\nu }_ t )\).

Proof

By the definition of the optimal transport distance in (1) and our choice of \(c(\varvec{x}, \varvec{y})\), we have

$$\begin{aligned}&\underset{ t \in [0,1]}{\min }W_c(\mu , {\nu }_ t )\\&\quad = \underset{ t \in [0,1]}{\min } ~ \min \limits _{\pi \in \Pi (\mu ,\nu _t)} ~ {\mathbb {E}}_{(\varvec{x}, \varvec{y})\sim \pi }\left[ \Vert \varvec{x}- \varvec{y} \Vert ^p \right] \\&\quad =\min \limits _{ t \in [0,1]}~ \left\{ \begin{array}{cl} \min \limits _{q_1, q_2 \in {\mathcal {P}}({\mathbb {R}}^d)}&{} t \displaystyle \int _{{\mathbb {R}}^d} \Vert \varvec{x}-\varvec{y}_1\Vert ^p q_1(\mathrm {d}\varvec{x}) + (1-t) \displaystyle \int _{{\mathbb {R}}^d}\left\| \varvec{x}-\varvec{y}_2 \right\| ^p q_2(\mathrm {d}\varvec{x})\\ \text {s.t.} &{} t \cdot q_1 + (1-t) \cdot q_2 = \mu , \end{array}\right. \end{aligned}$$

where the second equality holds because any coupling \(\pi \) of \(\mu \) and \(\nu _t\) can be constructed from the marginal probability measure \(\nu _t\) of \(\varvec{y}\) and the probability measures \(q_1\) and \(q_2\) of \(\varvec{x}\) conditional on \(\varvec{y} =\varvec{y}_1\) and \(\varvec{y} = \varvec{y}_2\), respectively, that is, we may write \(\pi = t\cdot q_1\otimes \delta _{\varvec{y}_1} + (1-t)\cdot q_2\otimes \delta _{\varvec{y}_2}\). The constraint of the inner minimization problem ensures that the marginal probability measure of \(\varvec{x}\) under \(\pi \) coincides with \(\mu \). By applying the variable transformations \(q_1\leftarrow t \cdot q_1 \) and \(q_2 \leftarrow (1-t)\cdot q_2\) to eliminate all bilinear terms, we then obtain

$$\begin{aligned} \underset{ t \in [0,1]}{\min }W_c(\mu , {\nu }_ t )=\left\{ \begin{array}{cll} \underset{\begin{array}{c} t \in [0,1] \\ q_1, q_2 \in {\mathcal {M}}({\mathbb {R}}^d) \end{array}}{\min } &{}\displaystyle \int _{{\mathbb {R}}^d} \Vert \varvec{x} -\varvec{y}_1\Vert ^p q_1(\mathrm {d}\varvec{x}) + \displaystyle \int _{{\mathbb {R}}^d} \left\| \varvec{x}-\varvec{y}_2 \right\| ^p q_2(\mathrm {d}\varvec{x})\\ \text {s.t.} &{}\displaystyle \int _{{\mathbb {R}}^d} q_1(\mathrm {d}\varvec{x}) = t \\ &{}\displaystyle \int _{{\mathbb {R}}^d} q_2(\mathrm {d}\varvec{x}) = 1- t \\ &{} q_1 + q_2 = \mu . \end{array}\right. \end{aligned}$$

Observe next that the decision variable t and the two normalization constraints can be eliminated without affecting the optimal value of the resulting infinite-dimensional linear program because the Borel measures \(q_1\) and \(q_2\) are non-negative and because the constraint \(q_1+q_2=\mu \) implies that \(q_1({\mathbb {R}}^d)+q_2({\mathbb {R}}^d)=\mu ({\mathbb {R}}^d)=1\). Thus, there always exists \(t\in [0,1]\) such that \(q_1({\mathbb {R}}^d)=t\) and \(q_2({\mathbb {R}}^d)=1-t\). This reasoning implies that

$$\begin{aligned} \underset{ t \in [0,1]}{\min }W_c(\mu , {\nu }_ t )=\left\{ \begin{array}{cl} \min \limits _{q_1,q_2\in {\mathcal {M}}({\mathbb {R}}^d)} &{} \displaystyle \int _{{\mathbb {R}}^d} \Vert \varvec{x} -\varvec{y}_1\Vert ^p q_1(\mathrm {d}\varvec{x}) + \displaystyle \int _{{\mathbb {R}}^d}\left\| \varvec{x}-\varvec{y}_2 \right\| ^p q_2(\mathrm {d}\varvec{x}) \\ \text {s.t.} &{} q_1 + q_2= \mu . \end{array}\right. \end{aligned}$$

The constraint \(q_1+q_2=\mu \) also implies that \(q_1\) and \(q_2\) are absolutely continuous with respect to \(\mu \), and thus

$$\begin{aligned} \underset{ t \in [0,1]}{\min }W_c(\mu , {\nu }_ t )&=\left\{ \begin{array}{ccll} &{}\min \limits _{q_1,q_2\in {\mathcal {M}}({\mathbb {R}}^d)}\; &{} \displaystyle \int _{{\mathbb {R}}^d} \Vert \varvec{x} \!-\!\varvec{y}_1\Vert ^p \frac{\mathrm {d}q_1}{\mathrm {d}\mu }(\varvec{x}) \!+\! \left\| \varvec{x} \!-\! \varvec{y}_2 \right\| ^p \, \frac{\mathrm {d}q_2}{\mathrm {d}\mu }(\varvec{x})\, \mu (\mathrm {d}\varvec{x}) \\ &{} \text {s.t.} &{} \displaystyle \frac{\mathrm {d}q_1}{\mathrm {d}\mu }(\varvec{x}) + \frac{\mathrm {d}q_2}{\mathrm {d}\mu }(\varvec{x})= 1 \quad \forall \varvec{x}\in [0,1]^d \end{array}\right. \nonumber \\&= \int _{{\mathbb {R}}^d} \min \left\{ \Vert \varvec{x} -\varvec{y}_1 \Vert ^p,\left\| \varvec{x} - \varvec{y}_2 \right\| ^p \right\} \,\mu (\mathrm {d}\varvec{x}), \end{aligned}$$
(6)

where the second equality holds because at optimality the Radon-Nikodym derivatives must satisfy

$$\begin{aligned} \frac{\mathrm {d}q_i}{\mathrm {d}\mu }(\varvec{x})=\left\{ \begin{array}{cl} 1 &{} \text {if } \Vert \varvec{x}-\varvec{y}_i\Vert ^p \le \Vert \varvec{x}-\varvec{y}_{3-i}\Vert ^p \\ 0 &{} \text {otherwise} \end{array} \right. \end{aligned}$$

for \(\mu \)-almost every \(\varvec{x}\in {\mathbb {R}}^d\) and for every \(i=1,2\).

In the second part of the proof we will demonstrate that the minimization problem \(\min _{t\in [0,1]} W_c(\mu , \nu _ t )\) is solved by \(t^\star =\text {Vol}(P(\varvec{w}, b))\). By Proposition 2.1 and the definition of the c-transform, we first note that

$$\begin{aligned} W_c(\mu , \nu _ {t^\star } )&=\underset{\phi \in {\mathcal {L}}({\mathbb {R}}^d, \nu _{t^\star })}{\max } ~ {\mathbb {E}}_{\varvec{y}\sim \nu _{t^\star }}[\phi (\varvec{y})] - {\mathbb {E}}_{\varvec{x}\sim \mu }[\phi _c(\varvec{x})] \nonumber \\&= \underset{\varvec{\phi }\in {\mathbb {R}}^2}{\max } ~ t^\star \cdot \phi _1 + (1- t^\star ) \cdot \phi _2- \int _{{\mathbb {R}}^d}\max _{i=1,2}\left\{ \phi _i- \Vert \varvec{x}-\varvec{y}_i \Vert ^p\right\} \mu (\mathrm {d}\varvec{x})\nonumber \\&= \max \limits _{\varvec{\phi }\in {\mathbb {R}}^2} ~ t^\star \cdot \phi _1 + (1-t^\star )\cdot \phi _2- \sum \limits _{i = 1}^2 \int _{{\mathcal {X}}_i(\varvec{\phi })}(\phi _i - \Vert \varvec{x} - \varvec{y_i}\Vert ^p)\,\mu (\mathrm {d}\varvec{x}), \end{aligned}$$
(7)

where

$$\begin{aligned} {\mathcal {X}}_i(\varvec{\phi }) = \{\varvec{x}\in {\mathbb {R}}^d: \phi _i - \Vert \varvec{x}-\varvec{y}_i \Vert ^p \ge \phi _{3-i} - \left\| \varvec{x} - \varvec{y}_{3-i} \right\| ^p\}\quad \forall i=1,2. \end{aligned}$$

The second equality in (7) follows from the construction of \(\nu _{t^\star }\) as a probability measure with only two atoms at the points \(\varvec{y}_i\) for \(i=1,2\). Indeed, by fixing the corresponding function values \(\phi _i=\phi (\varvec{y}_i)\) for \(i=1,2\), the expectation \({\mathbb {E}}_{\varvec{y} \sim \nu _{t^\star }}[\phi (\varvec{y})]\) simplifies to \(t^\star \cdot \phi _1 + (1-t^\star )\cdot \phi _2\), while the negative expectation \(-{\mathbb {E}}_{\varvec{x} \sim \mu }[\phi _c(\varvec{x})]\) is maximized by setting \(\phi (\varvec{y})\) to a large negative constant for all \(\varvec{y}\notin \{\varvec{y}_1,\varvec{y}_2\}\), which implies that

$$\begin{aligned} \phi _c(\varvec{x}) = \sup _{\varvec{y} \in {\mathbb {R}}^d} \phi (\varvec{y}) - \Vert \varvec{x} - \varvec{y}\Vert ^p = \max _{i=1,2}\left\{ \phi _i- \Vert \varvec{x}-\varvec{y}_i \Vert ^p\right\} \quad \forall \varvec{x}\in [0,1]^d. \end{aligned}$$

Next, we will prove that any \(\varvec{\phi }^\star \in {\mathbb {R}}^2\) with \(\phi ^\star _1=\phi ^\star _2\) attains the maximum of the unconstrained convex optimization problem on the last line of (7). To see this, note that

$$\begin{aligned} \nabla _{\varvec{\phi }} \left[ \sum \limits _{i = 1}^2 \int _{{\mathcal {X}}_i(\varvec{\phi })}(\phi _i - \Vert \varvec{x} - \varvec{y}_i\Vert ^p)\,\mu (\mathrm {d}\varvec{x})\right]= & {} \sum \limits _{i = 1}^2 \int _{{\mathcal {X}}_i(\varvec{\phi })} \nabla _{\varvec{\phi }}(\phi _i - \Vert \varvec{x} - \varvec{y}_i\Vert ^p)\,\mu (\mathrm {d}\varvec{x})\\= & {} \begin{bmatrix} \mu ({\mathcal {X}}_1(\varvec{\phi }))\\ \mu ({\mathcal {X}}_2(\varvec{\phi })) \end{bmatrix} \end{aligned}$$

by virtue of the Reynolds theorem. Thus, the first-order optimality condition \(t^\star =\mu ({\mathcal {X}}_1(\varvec{\phi }))\) is necessary and sufficient for global optimality. Fix now any \(\varvec{\phi }^\star \in {\mathbb {R}}^2\) with \(\phi ^\star _1=\phi ^\star _2\) and observe that

$$\begin{aligned} t^\star =\text {Vol}(P(\varvec{w}, b)) =&\mu \left( \left\{ \varvec{x}\in {\mathbb {R}}^d: \varvec{w}^\top \varvec{x}\le b \right\} \right) \\ =&\mu \left( \left\{ \varvec{x}\in {\mathbb {R}}^d: \Vert \varvec{x} \Vert ^2\le \Vert \varvec{x}-2b \varvec{w}/\Vert \varvec{w}\Vert ^2\Vert ^2 \right\} \right) \\ =&\mu \left( \left\{ \varvec{x}\in {\mathbb {R}}^d: \Vert \varvec{x} -\varvec{y}_1\Vert ^p\le \Vert \varvec{x}-\varvec{y}_2\Vert ^p \right\} \right) =\mu ({\mathcal {X}}_1(\varvec{\phi }^\star )), \end{aligned}$$

where the first and second equalities follow from the definitions of \(t^\star \) and the knapsack polytope \(P(\varvec{w}, b)\), respectively, the third equality holds because \(\varvec{w}^\top \varvec{x}\le b\) if and only if \(\Vert \varvec{x}\Vert ^2\le \Vert \varvec{x}-2b\varvec{w}/\Vert \varvec{w}\Vert ^2\Vert ^2\) (expand the square and use \(b>0\)), the fourth equality holds because \(\varvec{y}_1=\varvec{0}\) and \(\varvec{y}_2=2b\varvec{w}/\Vert \varvec{w}\Vert ^2\), and the fifth equality follows from the definition of \({\mathcal {X}}_1(\varvec{\phi }^\star )\) and our assumption that \(\phi ^\star _1=\phi ^\star _2\). This reasoning implies that \(\varvec{\phi }^\star \) indeed attains the maximum of the optimization problem on the last line of (7). Hence, we find

$$\begin{aligned} W_c(\mu , \nu _ {t^\star } )&= t^\star \cdot \phi ^\star _1 + (1-t^\star )\cdot \phi ^\star _2- \sum \limits _{i = 1}^2 \int _{{\mathcal {X}}_i(\varvec{\phi }^\star )}(\phi ^\star _i - \Vert \varvec{x} - \varvec{y_i}\Vert ^p)\,\mu (\mathrm {d}\varvec{x})\\&= \sum \limits _{i = 1}^2 \int _{{\mathcal {X}}_i(\varvec{\phi }^\star )} \Vert \varvec{x} - \varvec{y_i}\Vert ^p \,\mu (\mathrm {d}\varvec{x}) = \int _{{\mathbb {R}}^d} \min _{i=1,2}\left\{ \Vert \varvec{x} -\varvec{y}_i \Vert ^p\right\} \,\mu (\mathrm {d}\varvec{x})\\&=\underset{ t \in [0,1]}{\min }W_c(\mu , {\nu }_ t ), \end{aligned}$$

where the second equality holds because \(\phi ^\star _1=\phi ^\star _2\), the third equality exploits the definition of \({\mathcal {X}}_1(\varvec{\phi }^\star )\), and the fourth equality follows from (6). We may therefore conclude that \(t^\star =\text {Vol}(P(\varvec{w}, b))\) indeed solves the minimization problem \(\min _{t\in [0,1]} W_c(\mu , \nu _ t )\). Using similar techniques, one can further prove that \(\partial _t W_c(\mu , \nu _t)\) exists and is strictly increasing in t, which ensures that \(W_c(\mu , \nu _t)\) is strictly convex in t and, in particular, that \(t^\star \) is the unique solution of \(\min _{t\in [0,1]} W_c(\mu , \nu _ t )\). Details are omitted for brevity. \(\square \)
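
The closed-form expression (6) also suggests a direct Monte Carlo check of Lemma 2.4 (a sketch with \(p = 2\) and the same kind of placeholder data as above):

```python
import numpy as np

rng = np.random.default_rng(1)
d, p = 4, 2
w = rng.random(d); b = 0.3 * np.sum(w)
y1 = np.zeros(d); y2 = 2 * b * w / np.dot(w, w)

X = rng.random((200_000, d))   # samples from the Lebesgue measure on [0,1]^d
d1 = np.linalg.norm(X - y1, axis=1) ** p
d2 = np.linalg.norm(X - y2, axis=1) ** p
print("min_t W_c(mu, nu_t) ≈", np.minimum(d1, d2).mean())  # right-hand side of (6)
print("argmin t* ≈", np.mean(d1 <= d2))                    # = Vol(P(w, b))
```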

Proof of Theorem 2.2

Lemma 2.4 applies under the assumptions of the theorem, and therefore the volume of the knapsack polytope \(P(\varvec{w}, b)\) coincides with the unique minimizer of

$$\begin{aligned} \min _{ t \in [0,1]} W_c(\mu , {\nu }_ t ). \end{aligned}$$
(8)

From the proof of Lemma 2.4 we know that the Wasserstein distance \(W_c(\mu ,{\nu }_ t )\) is strictly convex in t, which implies that the minimization problem (8) constitutes a one-dimensional convex program with a unique minimizer. A near-optimal solution that approximates the exact minimizer to within an absolute accuracy \(\delta =(6d!(\Vert \varvec{w}\Vert _1+2)^d(d+1)^{d+1}\prod _{i = 1}^{d}w_i)^{-1}\) can readily be computed with a binary search method such as Algorithm 3 described in Lemma A.1 (i), which evaluates \(g(t)=W_c(\mu ,\nu _t)\) at exactly \(2L=2({\lceil }{\log _2(1/\delta )}{\rceil } + 1)\) test points. Note that \(\delta \) falls within the interval (0, 1) and satisfies the strict inequality (5). Note also that L grows only polynomially with the bit length of \(\varvec{w}\) and b; see Appendix B for details. One readily verifies that all operations in Algorithm 3 except for the computation of \(W_c(\mu , \nu _t)\) can be carried out in time polynomial in the bit length of \(\varvec{w}\) and b. Thus, if we could compute \(W_c(\mu , \nu _t)\) in time polynomial in the bit length of \(\varvec{w}\), b and t, then we could efficiently compute the volume of the knapsack polytope \(P( \varvec{w}, b)\) to within accuracy \(\delta \), which is \(\#\)P-hard by Lemma 2.3. We have thus constructed a polynomial-time Turing reduction from the \(\#\)P-hard problem of (approximately) computing the volume of a knapsack polytope to computing the Wasserstein distance \(W_c(\mu , {\nu }_ t )\). By the definition of the class of \(\#\)P-hard problems (see, e.g., ([167], Definition 1)), we may thus conclude that computing \(W_c(\mu , \nu _t)\) is \(\#\)P-hard. \(\square \)
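
Algorithm 3 itself is relegated to the appendix and not reproduced here, but the mechanism is that of a standard search for the minimizer of a one-dimensional convex function using two function evaluations per iteration; the following generic sketch (which may differ from Algorithm 3 in its details) illustrates the idea:

```python
def convex_min_search(g, lo=0.0, hi=1.0, L=30):
    """Localize the minimizer of a convex g on [lo, hi] with 2*L evaluations."""
    for _ in range(L):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if g(m1) <= g(m2):
            hi = m2   # by convexity, the minimizer cannot lie in (m2, hi]
        else:
            lo = m1   # by convexity, the minimizer cannot lie in [lo, m1)
    return (lo + hi) / 2

# Stand-in for t -> W_c(mu, nu_t): strictly convex with minimizer 0.3.
print(convex_min_search(lambda t: (t - 0.3) ** 2))
```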

Corollary 2.5

(Hardness of computing approximate optimal transport distances) Computing \(W_c(\mu , \nu )\) to within an absolute accuracy of

$$\begin{aligned} \varepsilon =\frac{1}{4} \min \limits _{l\in [ 2^L]} \left\{ |W_c(\mu , \nu _{t_{l}}) - W_c(\mu , \nu _{t_{l-1}})| : W_c(\mu , \nu _{t_{l}}) \ne W_c(\mu , \nu _{t_{l-1}})\right\} , \end{aligned}$$

where \(L = {\lceil }{\log _2(1/ \delta )}{\rceil } + 1\), \(\delta = (6 d!(\Vert \varvec{w}\Vert _1+2)^d(d+1)^{d+1}\prod _{i = 1}^{d}w_i)^{-1} \) and \(t_l = l/ 2^{L}\) for all \(l =0, \ldots , 2^L\), is #P-hard even if \({\mathcal {X}}={\mathcal {Y}}={\mathbb {R}}^d\), \(c(\varvec{x}, \varvec{y}) = \Vert \varvec{x}-\varvec{y}\Vert ^{p}\) for some \(p\ge 1\), \(\mu \) is the Lebesgue measure on the standard hypercube \([0,1]^d\), and \(\nu \) is a discrete probability measure supported on only two points.

Proof

Assume that we have access to an inexact oracle that outputs, for any fixed \(t\in [0,1]\), an approximate optimal transport distance \({{\widetilde{W}}}_c(\mu , \nu _t)\) with \(|{{\widetilde{W}}}_c(\mu , \nu _t) - W_c(\mu , \nu _t) |\le \varepsilon \). By Lemma A.1 (ii), which applies thanks to the definition of \(\varepsilon \), we can then find a \(2\delta \)-approximation for the unique minimizer of (8) using 2L oracle calls. Note that \(\delta '=2\delta \) falls within the interval (0, 1) and satisfies the strict inequality (5). Recall also that L grows only polynomially with the bit length of \(\varvec{w}\) and b; see Appendix B for details. Thus, if we could compute \({{\widetilde{W}}}_c(\mu , \nu _t)\) in time polynomial in the bit length of \(\varvec{w}\), b and t, then we could efficiently compute the volume of the knapsack polytope \(P( \varvec{w}, b)\) to within accuracy \(\delta '\), which is \(\#\)P-hard by Lemma 2.3. Computing \(W_c(\mu , \nu )\) to within an absolute accuracy of \(\varepsilon \) is therefore also \(\#\)P-hard. \(\square \)

The hardness of optimal transport established in Theorem 2.2 and Corollary 2.5 is predicated on the hardness of numerical integration. A popular technique to reduce the complexity of numerical integration is smoothing, whereby an initial (possibly discontinuous) integrand is approximated with a differentiable one [48]. Smoothness is also a desired property of objective functions when designing scalable optimization algorithms [28]. These observations prompt us to develop a systematic way to smooth the optimal transport problem that leads to efficient approximate numerical solution schemes.

3 Smooth optimal transport

The semi-discrete optimal transport problem evaluates the optimal transport distance (1) between an arbitrary probability measure \(\mu \) supported on \({\mathcal {X}}\) and a discrete probability measure \(\nu = \sum _{i=1}^N {\nu }_i\delta _{\varvec{y_i}}\) with atoms \(\varvec{y}_1,\ldots , \varvec{y}_N \in {\mathcal {Y}}\) and corresponding probabilities \(\varvec{\nu }=(\nu _1,\ldots , \nu _N)\in \Delta ^N\) for some \(N\ge 2\). In the following, we define the discrete c-transform \(\psi _c:{\mathbb {R}}^N\times {\mathcal {X}}\rightarrow [-\infty ,+\infty )\) of \(\varvec{\phi }\in {\mathbb {R}}^N\) through

$$\begin{aligned} \psi _c(\varvec{\phi }, \varvec{x}) = \max \limits _{i \in [N]} \phi _i - c(\varvec{x}, \varvec{y}_i) \quad \forall \varvec{x} \in {\mathcal {X}}. \end{aligned}$$
(9)

Armed with the discrete c-transform, we can now reformulate the semi-discrete optimal transport problem as a finite-dimensional maximization problem over a single dual potential vector.

Lemma 3.1

(Discrete c-transform) The semi-discrete optimal transport problem is equivalent to

$$\begin{aligned} W_c(\mu , \nu ) = \sup _{ \varvec{\phi } \in {\mathbb {R}}^N} \varvec{\nu }^\top \varvec{\phi } - {\mathbb {E}}_{\varvec{x} \sim \mu }[{\psi _c(\varvec{\phi }, \varvec{x}) } ]. \end{aligned}$$
(10)

Proof

As \(\nu = \sum _{i=1}^N {\nu }_i\delta _{\varvec{y_i}}\) is discrete, the dual optimal transport problem (3) simplifies to

$$\begin{aligned} W_c(\mu , \nu )&=\sup _{\varvec{\phi }\in {\mathbb {R}}^N} \sup _{\phi \in {\mathcal {L}}({\mathcal {Y}}, \nu )} \left\{ \varvec{\nu }^\top \varvec{\phi }- {\mathbb {E}}_{\varvec{x} \sim \mu }\left[ \phi _c(\varvec{x}) \right] \;:\;\phi (\varvec{y}_i)=\phi _i~\forall i\in [N] \right\} \\&=\sup _{\varvec{\phi }\in {\mathbb {R}}^N}~ \varvec{\nu }^\top \varvec{\phi }- \inf _{\phi \in {\mathcal {L}}({\mathcal {Y}}, \nu )} \Big \{ {\mathbb {E}}_{\varvec{x} \sim \mu }\left[ \phi _c(\varvec{x}) \right] \;:\;\phi (\varvec{y}_i)=\phi _i~\forall i\in [N] \Big \} . \end{aligned}$$

Using the definition of the standard c-transform, we can then recast the inner minimization problem as

$$\begin{aligned}&\inf _{\phi \in {\mathcal {L}}({\mathcal {Y}}, \nu )} \left\{ {\mathbb {E}}_{\varvec{x} \sim \mu }\left[ \sup _{\varvec{y} \in {\mathcal {Y}}} \phi (\varvec{y}) - c(\varvec{x}, \varvec{y}) \right] \;:\;\phi (\varvec{y}_i)=\phi _i~\forall i\in [N] \right\} \\&\quad = ~{\mathbb {E}}_{\varvec{x} \sim \mu } \left[ \max _{i \in [N]}\left\{ \phi _i- c(\varvec{x}, \varvec{y}_i)\right\} \right] ~=~ {\mathbb {E}}_{\varvec{x} \sim \mu } \left[ {\psi _c(\varvec{\phi }, \varvec{x}) } \right] , \end{aligned}$$

where the first equality follows from setting \(\phi (\varvec{y})={{\underline{\phi }}}\) for all \(\varvec{y}\notin \{\varvec{y}_1, \ldots , \varvec{y}_N\}\) and letting \({{\underline{\phi }}}\) tend to \(-\infty \), while the second equality exploits the definition of the discrete c-transform. Thus, (10) follows. \(\square \)
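
For intuition, the dual representation (10) can be estimated directly from samples of \(\mu \): for any fixed \(\varvec{\phi }\), the sample average below is a Monte Carlo estimate of a lower bound on \(W_c(\mu , \nu )\). A sketch with quadratic cost and placeholder data:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5
Y = rng.random((N, 2))               # atoms of nu
nu = np.full(N, 1.0 / N)             # their probabilities
phi = rng.standard_normal(N)         # an arbitrary dual potential

def psi_c(phi, x):
    """Discrete c-transform (9) with c(x, y) = ||x - y||^2."""
    return np.max(phi - np.sum((x - Y) ** 2, axis=1))

X = rng.random((100_000, 2))         # samples from mu = Unif([0,1]^2)
dual_value = nu @ phi - np.mean([psi_c(phi, x) for x in X])
print(dual_value)                    # estimates a lower bound on W_c(mu, nu)
```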

The discrete c-transform (9) can be viewed as the optimal value of a discrete choice model, where a utility-maximizing agent selects one of N mutually exclusive alternatives with utilities \(\phi _i - c(\varvec{x}, \varvec{y}_i)\), \(i\in [N]\), respectively. Discrete choice models are routinely used for explaining the preferences of travelers selecting among different modes of transportation [16], but they are also used for modeling the choice of residential location [100], the interests of end-users in engineering design [170] or the propensity of consumers to adopt new technologies [69].

In practice, the preferences of decision-makers and the attributes of the different choice alternatives are invariably subject to uncertainty, and it is impossible to specify a discrete choice model that reliably predicts the behavior of multiple individuals. Psychological theory thus models the utilities as random variables [164], in which case the optimal choice becomes random, too. The theory as well as the econometric analysis of probabilistic discrete choice models were pioneered by McFadden [99].

The availability of a wealth of elegant theoretical results in discrete choice theory prompts us to add a random noise term to each deterministic utility value \(\phi _i - c(\varvec{x}, \varvec{y}_i)\) in (9). We will argue below that the expected value of the resulting maximal utility with respect to the noise distribution provides a smooth approximation for the c-transform \(\psi _c(\varvec{\phi }, \varvec{x})\), which in turn leads to a smooth optimal transport problem that displays favorable numerical properties. For a comprehensive survey of additive random utility models in discrete choice theory we refer to Dubin and McFadden [49] and Daganzo [40]. Generalized semi-parametric discrete choice models where the noise distribution is itself subject to uncertainty are studied by Natarajan et al. [111]. Using techniques from modern distributionally robust optimization, these models evaluate the best-case (maximum) expected utility across an ambiguity set of multivariate noise distributions. Semi-parametric discrete choice models are studied in the context of appointment scheduling [97], traffic management [3] and product line pricing [91].

We now define the smooth (discrete) c-transform as a best-case expected utility of the type studied in semi-parametric discrete choice theory, that is,

$$\begin{aligned} {\overline{\psi }}_c(\varvec{\phi }, \varvec{x}) = \sup _{\theta \in \Theta }\;{\mathbb {E}}_{\varvec{z} \sim \theta }\left[ \max _{i \in [N]} \phi _i -c(\varvec{x}, \varvec{y_i}) +z_i \right] , \end{aligned}$$
(11)

where \(\varvec{z}\) represents a random vector of perturbations that are independent of \(\varvec{x}\) and \(\varvec{y}\). Specifically, we assume that \(\varvec{z}\) is governed by a Borel probability measure \(\theta \) from within some ambiguity set \(\Theta \subseteq {\mathcal {P}}({\mathbb {R}}^N)\). Note that if \(\Theta \) is a singleton that contains only the Dirac measure at the origin of \({\mathbb {R}}^N\), then the smooth c-transform collapses to the ordinary c-transform defined in (9), which is piecewise affine and thus non-smooth in \(\varvec{\phi }\). For many commonly used ambiguity sets, however, we will show below that the smooth c-transform is indeed differentiable in \(\varvec{\phi }\). In practice, the additive noise \(z_i\) in the transportation cost could originate, for example, from uncertainty about the position \(\varvec{y}_i\) of the i-th atom of the discrete distribution \(\nu \). This interpretation is justified if \(c(\varvec{x},\varvec{y})\) is approximately affine in \(\varvec{y}\) around the atoms \(\varvec{y}_i\), \(i\in [N]\). The smooth c-transform gives rise to the following smooth (semi-discrete) optimal transport problem in dual form.

$$\begin{aligned} {{\overline{W}}}_c (\mu , \nu ) = \sup \limits _{\varvec{\phi } \in {\mathbb {R}}^N} {\mathbb {E}}_{\varvec{x} \sim \mu } \left[ \varvec{\nu }^\top \varvec{\phi }- {{\overline{\psi }}}_c(\varvec{\phi }, \varvec{x})\right] \end{aligned}$$
(12)
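
Before turning to the primal representation, we record a concrete special case of (11)–(12): if \(\Theta \) is a singleton containing only the distribution of i.i.d. mean-zero Gumbel noise with scale parameter \(\eta \), then the classical logit formula from discrete choice theory expresses the smooth c-transform as a log-sum-exp, and the objective of (12) can again be estimated from samples of \(\mu \) (a sketch with placeholder data):

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)
N, eta = 5, 0.1
Y = rng.random((N, 2))               # atoms of nu
nu = np.full(N, 1.0 / N)             # their probabilities

def smooth_psi_c(phi, x):
    """Smooth c-transform (11) under i.i.d. mean-zero Gumbel(eta) noise."""
    return eta * logsumexp((phi - np.sum((x - Y) ** 2, axis=1)) / eta)

phi = rng.standard_normal(N)
X = rng.random((50_000, 2))          # samples from mu = Unif([0,1]^2)
print(nu @ phi - np.mean([smooth_psi_c(phi, x) for x in X]))
```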

Note that (12) is indeed obtained from the original dual optimal transport problem (10) by replacing the original c-transform \(\psi _c(\varvec{\phi }, \varvec{x})\) with the smooth c-transform \({{\overline{\psi }}}_c(\varvec{\phi }, \varvec{x})\). As smooth functions are susceptible to efficient numerical integration, we expect that (12) is easier to solve than (10). A key insight of this work is that the smooth dual optimal transport problem (12) typically has a primal representation of the form

$$\begin{aligned} \min \limits _{\pi \in \Pi (\mu ,\nu )}\mathbb E_{(\varvec{x}, \varvec{y}) \sim \pi }\left[ c(\varvec{x}, \varvec{y})\right] + R_\Theta (\pi ), \end{aligned}$$
(13)

where \(R_\Theta (\pi )\) can be viewed as a regularization term that penalizes the complexity of the transportation plan \(\pi \). In the remainder of this section we will prove (13) and derive \(R_\Theta (\pi )\) for different ambiguity sets \(\Theta \). We will see that this regularization term is often related to an f-divergence, where \(f:{\mathbb {R}}_+ \rightarrow {\mathbb {R}}\cup \{\infty \}\) constitutes a lower-semicontinuous convex function with \(f(1) = 0\). If \(\tau \) and \(\rho \) are two Borel probability measures on a closed subset \(\mathcal Z\) of a finite-dimensional Euclidean space, and if \(\tau \) is absolutely continuous with respect to \(\rho \), then the continuous f-divergence from \(\tau \) to \(\rho \) is defined as \(D_f(\tau \parallel \rho ) = \int _{\mathcal Z} f({\mathrm {d}\tau }/{\mathrm {d}\rho }(\varvec{z})) \rho (\mathrm {d}\varvec{z})\), where \({\mathrm {d}\tau }/{\mathrm {d}\rho }\) stands for the Radon-Nikodym derivative of \(\tau \) with respect to \(\rho \). By slight abuse of notation, if \(\varvec{\tau }\) and \(\varvec{\rho }\) are two probability vectors in \(\Delta ^N\) and if \(\varvec{\rho }>\varvec{0}\), then the discrete f-divergence from \(\varvec{\tau }\) to \(\varvec{\rho }\) is defined as \(D_f(\varvec{\tau }\parallel \varvec{\rho }) = \sum _{i =1}^N f({\tau _i}/{\rho _i}) \rho _i\). The correct interpretation of \(D_f\) is usually clear from the context.
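
Both variants of the f-divergence are direct to implement; for instance, with \(f(t) = t\log t\) (the Kullback-Leibler case, using the convention \(0 \log 0 = 0\)), a short sketch:

```python
import numpy as np

def f_divergence(tau, rho, f):
    """Discrete f-divergence D_f(tau || rho) = sum_i f(tau_i / rho_i) * rho_i."""
    return np.sum(f(tau / rho) * rho)

def f_kl(t):
    t = np.asarray(t, dtype=float)
    return np.where(t > 0, t * np.log(np.where(t > 0, t, 1.0)), 0.0)

tau = np.array([0.5, 0.3, 0.2])
rho = np.full(3, 1.0 / 3)
print(f_divergence(tau, rho, f_kl))  # KL divergence from tau to rho
```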

The following lemma shows that the smooth optimal transport problem (13) equipped with an f-divergence regularization term is equivalent to a finite-dimensional convex minimization problem. This result will be instrumental to prove the equivalence of (12) and (13) for different ambiguity sets \(\Theta \).

Lemma 3.2

(Strong duality) If \(\varvec{\eta }\in \Delta ^N\) with \(\varvec{\eta }>\varvec{0}\) and \(\eta = \sum _{i=1}^N \eta _i \delta _{\varvec{y}_i}\) is a discrete probability measure on \({\mathcal {Y}}\), then problem (13) with regularization term \(R_\Theta (\pi ) = D_{f}(\pi \Vert \mu \otimes \eta )\) is equivalent to

$$\begin{aligned} \sup \limits _{ \varvec{\phi }\in {\mathbb {R}}^N} ~ {\mathbb {E}}_{\varvec{x} \sim \mu }\left[ \min \limits _{\varvec{p}\in \Delta ^N} \sum \limits _{i=1}^N{\phi _i\nu _i}- (\phi _i - c(\varvec{x}, \varvec{y_i}))p_i + D_f(\varvec{p} \parallel \varvec{\eta }) \right] . \end{aligned}$$
(14)

Proof of Lemma 3.2

If \({\mathbb {E}}_{\varvec{x} \sim \mu }[c(\varvec{x},\varvec{y}_i)]=\infty \) for some \(i\in [N]\), then both (13) and (14) evaluate to infinity, and the claim holds trivially. In the remainder of the proof we may thus assume without loss of generality that \({\mathbb {E}}_{\varvec{x} \sim \mu }[c(\varvec{x},\varvec{y}_i)]<\infty \) for all \(i\in [N]\). Using ([138], Theorem 14.6) to interchange the minimization over \(\varvec{p}\) with the expectation over \(\varvec{x}\), problem (14) can first be reformulated as

$$\begin{aligned} \begin{array}{cll} \sup \limits _{ \varvec{\phi }\in {\mathbb {R}}^N} &{} \min \limits _{\varvec{p}\in \mathcal L_\infty ^N({\mathcal {X}},\mu )} ~ &{}{\mathbb {E}}_{\varvec{x} \sim \mu }\left[ \displaystyle \sum \limits _{i=1}^N{\phi _i\nu _i} - (\phi _i - c(\varvec{x}, \varvec{y_i}))p_i(\varvec{x})+ D_f(\varvec{p}(\varvec{x})\Vert \varvec{\eta })\right] \\ &{}\text {s.t.} &{}\displaystyle \varvec{p}(\varvec{x})\in \Delta ^N \quad \mu \text {-a.s.}, \end{array} \end{aligned}$$

where \(\mathcal L_\infty ^N({\mathcal {X}},\mu )\) denotes the Banach space of all Borel-measurable functions from \({\mathcal {X}}\) to \({\mathbb {R}}^N\) that are essentially bounded with respect to \(\mu \). Interchanging the supremum over \(\varvec{\phi }\) with the minimum over \(\varvec{p}\) and evaluating the resulting unconstrained linear program over \(\varvec{\phi }\) in closed form then yields the dual problem

$$\begin{aligned} \begin{array}{cl} \min \limits _{\varvec{p}\in \mathcal L_\infty ^N({\mathcal {X}},\mu )} &{}\displaystyle {\mathbb {E}}_{\varvec{x} \sim \mu }\Bigg [ \sum \limits _{i=1}^Nc(\varvec{x}, \varvec{y_i})p_{i}(\varvec{x}) +\displaystyle D_f (\varvec{p}(\varvec{x}) \! \parallel \!\varvec{\eta }) \Bigg ] \\ \text {s.t.} &{}\displaystyle {\mathbb {E}}_{\varvec{x} \sim \mu }\left[ \varvec{p}(\varvec{x})\right] = \varvec{\nu },\quad \varvec{p}(\varvec{x})\in \Delta ^N \quad \mu \text {-a.s.} \end{array} \end{aligned}$$
(15)

Strong duality holds for the following reasons. As c and f are lower-semicontinuous and c is non-negative, we may proceed as in ([154], § 3.2) to show that the dual objective function is weakly\({}^*\) lower semicontinuous in \(\varvec{p}\). Similarly, as \(\Delta ^N\) is compact, one can use the Banach-Alaoglu theorem to show that the dual feasible set is weakly\({}^*\) compact. Finally, as f is real-valued and \({\mathbb {E}}_{\varvec{x} \sim \mu }[c(\varvec{x},\varvec{y}_i)]<\infty \) for all \(i\in [N]\), the constant solution \(\varvec{p}(\varvec{x})=\varvec{\nu }\) is dual feasible for all \(\varvec{\nu }\in \Delta ^N\). Thus, the dual problem is solvable and has a finite optimal value. This argument remains valid if we add a perturbation \(\varvec{\delta }\in H=\{\varvec{\delta }'\in {\mathbb {R}}^N: \sum _{i=1}^N\delta '_i=0\}\) to the right hand side vector \(\varvec{\nu }\) as long as \(\varvec{\delta }>-\varvec{\nu }\). The optimal value of the perturbed dual problem is thus pointwise finite as well as convex and—consequently—continuous and locally bounded in \(\varvec{\delta }\) at the origin of H. As \(\varvec{\nu }>\varvec{0}\), strong duality therefore follows from ([137], Theorem 17 (a)).

Any dual feasible solution \(\varvec{p}\in \mathcal L^N_\infty ({\mathcal {X}},\mu )\) gives rise to a Borel probability measure \(\pi \in \mathcal P(\mathcal X \times \mathcal Y)\) defined through \(\pi ( \varvec{y} \in \mathcal B) = \nu (\varvec{y} \in \mathcal B)\) for all Borel sets \(\mathcal B \subseteq \mathcal Y\) and \(\pi (\varvec{x} \in \mathcal A | \varvec{y} = \varvec{y}_i) = \int _{ \mathcal A} p_i(\varvec{x}) \mu (\mathrm {d}\varvec{x}) / \nu _i\) for all Borel sets \(\mathcal A \subseteq \mathcal X\) and \(i \in [N]\). This follows from the law of total probability, whereby the joint distribution of \(\varvec{x}\) and \(\varvec{y}\) is uniquely determined if we specify the marginal distribution of \(\varvec{y}\) and the conditional distribution of \(\varvec{x}\) given \(\varvec{y}=\varvec{y}_i\) for every \(i\in [N]\). By construction, the marginal distributions of \(\varvec{x}\) and \(\varvec{y}\) under \(\pi \) are determined by \(\mu \) and \(\nu \), respectively. Indeed, note that for any Borel set \(\mathcal A \subseteq \mathcal X\) we have

$$\begin{aligned} \pi (\varvec{x} \in \mathcal A)&= \sum \limits _{i=1}^N \pi (\varvec{x} \in \mathcal A | \varvec{y} = \varvec{y}_i) \cdot \pi (\varvec{y} = \varvec{y}_i) = \sum \limits _{i=1}^N \pi (\varvec{x} \in \mathcal A | \varvec{y} = \varvec{y}_i) \cdot \nu _i\\&= \sum \limits _{i=1}^N \int _{\mathcal A} {p_i(\varvec{x})}\mu (\mathrm {d}\varvec{x}) = \int _{\mathcal A} \mu (\mathrm {d}\varvec{x}) = \mu (\varvec{x}\in \mathcal A), \end{aligned}$$

where the first equality follows from the law of total probability, the second and the third equalities both exploit the construction of \(\pi \), and the fourth equality holds because \(\varvec{p}(\varvec{x})\in \Delta ^N\) \(\mu \)-almost surely due to dual feasibility. This reasoning implies that \(\pi \) constitutes a coupling of \(\mu \) and \(\nu \) (that is, \(\pi \in \Pi (\mu , \nu )\)) and is thus feasible in (13). Conversely, any \(\pi \in \Pi (\mu ,\nu )\) gives rise to a function \(\varvec{p}\in \mathcal L_\infty ^N({\mathcal {X}},\mu )\) defined through

$$\begin{aligned} p_i(\varvec{x}) =\nu _i\cdot \frac{\mathrm {d}\pi }{\mathrm {d}(\mu \otimes \nu )} (\varvec{x}, \varvec{y}_i)\quad \forall i\in [N]. \end{aligned}$$

By the properties of the Radon-Nikodym derivative, we have \(p_i(\varvec{x})\ge 0\) \(\mu \)-almost surely for all \(i\in [N]\). In addition, for any Borel set \(\mathcal A\subseteq {\mathcal {X}}\) we have

$$\begin{aligned} \int _{\mathcal A}\sum _{i=1}^N p_i(\varvec{x})\,\mu (\mathrm {d}\varvec{x})&= \int _{\mathcal A} \sum _{i=1}^N \nu _i\cdot \frac{\mathrm {d}\pi }{\mathrm {d}(\mu \otimes \nu )} (\varvec{x}, \varvec{y}_i)\,\mu (\mathrm {d}\varvec{x})\\&= \int _{\mathcal A\times {\mathcal {Y}}} \frac{\mathrm {d}\pi }{\mathrm {d}(\mu \otimes \nu )} (\varvec{x}, \varvec{y})\,(\mu \otimes \nu )(\mathrm {d}\varvec{x},\mathrm {d}\varvec{y}) \\&= \int _{\mathcal A\times {\mathcal {Y}}} \pi (\mathrm {d}\varvec{x}, \mathrm {d}\varvec{y}) = \int _{\mathcal A}\mu (\mathrm {d}\varvec{x}), \end{aligned}$$

where the second equality follows from Fubini’s theorem and the definition of \(\nu =\sum _{i=1}^N\nu _i\delta _{\varvec{y}_i}\), while the fourth equality exploits that the marginal distribution of \(\varvec{x}\) under \(\pi \) is determined by \(\mu \). As the above identity holds for all Borel sets \(\mathcal A\subseteq {\mathcal {X}}\), we find that \(\sum _{i=1}^N p_i(\varvec{x})=1\) \(\mu \)-almost surely. Similarly, we have

$$\begin{aligned} \mathbb E_{\varvec{x}\sim \mu }\left[ p_i(\varvec{x})\right]&=\int _{\mathcal {X}}\nu _i\cdot \frac{\mathrm {d}\pi }{\mathrm {d}(\mu \otimes \nu )} (\varvec{x}, \varvec{y}_i) \,\mu (\mathrm {d}\varvec{x}) \\&=\int _{{\mathcal {X}}\times \{\varvec{y}_i\}} \frac{\mathrm {d}\pi }{\mathrm {d}(\mu \otimes \nu )} (\varvec{x}, \varvec{y}) \,(\mu \otimes \nu )(\mathrm {d}\varvec{x},\mathrm {d}\varvec{y}) \\&= \int _{{\mathcal {X}}\times \{\varvec{y}_i\}} \pi (\mathrm {d}\varvec{x},\mathrm {d}\varvec{y})=\int _{\{\varvec{y}_i\}}\nu (\mathrm {d}\varvec{y})=\nu _i \end{aligned}$$

for all \(i\in [N]\). In summary, \(\varvec{p}\) is feasible in (15). Thus, we have shown that every probability measure \(\pi \) feasible in (13) induces a function \(\varvec{p}\) feasible in (15) and vice versa. We further find that the objective value of \(\varvec{p}\) in (15) coincides with the objective value of the corresponding \(\pi \) in (13). Specifically, we have

$$\begin{aligned}&{\mathbb {E}}_{\varvec{x} \sim \mu }\Bigg [ \sum \limits _{i=1}^N c(\varvec{x}, \varvec{y_i})\, p_{i}(\varvec{x}) +\displaystyle D_f (\varvec{p}(\varvec{x}) \Vert \varvec{\eta }) \Bigg ]\\&\quad =\displaystyle \int _{\mathcal {X}}\sum \limits _{i=1}^N c(\varvec{x}, \varvec{y}_i) p_i(\varvec{x}) \,\mu ( \mathrm {d}\varvec{x}) + \displaystyle \int _{\mathcal {X}}\sum _{i=1}^N f\left( \frac{p_i(\varvec{x})}{\eta _i}\right) \eta _i \, \mu (\mathrm {d}\varvec{x}) \\&\quad =\displaystyle \int _{\mathcal {X}}\sum \limits _{i=1}^N c(\varvec{x}, \varvec{y}_i) \cdot \nu _i\cdot \frac{\mathrm {d}\pi }{\mathrm {d}(\mu \otimes \nu )}(\varvec{x}, \varvec{y}_i)\, \mu ( \mathrm {d}\varvec{x}) \\&\qquad + \int _{\mathcal {X}}\sum _{i=1}^N f\left( \frac{\nu _i}{\eta _i} \cdot \frac{\mathrm {d}\pi }{\mathrm {d}(\mu \otimes \nu )}(\varvec{x}, \varvec{y}_i)\right) \cdot \eta _i \,\mu ( \mathrm {d}\varvec{x}) \\&\quad =\displaystyle \int _{{\mathcal {X}}\times {\mathcal {Y}}} c(\varvec{x}, \varvec{y})\frac{\mathrm {d}\pi }{\mathrm {d}(\mu \otimes \nu )}(\varvec{x}, \varvec{y}) \,(\mu \otimes \nu )(\mathrm {d}\varvec{x}, \mathrm {d}\varvec{y}) \\&\qquad + \displaystyle \int _{{\mathcal {X}}\times {\mathcal {Y}}} f\left( \frac{\mathrm {d}\pi }{\mathrm {d}(\mu \otimes \eta )}(\varvec{x}, \varvec{y})\right) (\mu \otimes \eta )(\mathrm {d}\varvec{x},\mathrm {d}\varvec{y}) \\&\quad =\mathbb E_{(\varvec{x}, \varvec{y}) \sim \pi } \left[ c(\varvec{x}, \varvec{y})\right] + D_f(\pi \Vert \mu \otimes \eta ), \end{aligned}$$

where the first equality exploits the definition of the discrete f-divergence, the second equality expresses the function \(\varvec{p}\) in terms of the corresponding probability measure \(\pi \), the third equality follows from Fubini’s theorem and uses the definitions \(\nu =\sum _{i=1}^N \nu _i\delta _{\varvec{y}_i}\) and \(\eta =\sum _{i=1}^N \eta _i\delta _{\varvec{y}_i}\), and the fourth equality follows from the definition of the continuous f-divergence. In summary, we have thus shown that (13) is equivalent to (15), which in turn is equivalent to (14). This observation completes the proof. \(\square \)

Proposition 3.3

(Approximation bound) If \(\varvec{\eta }\in \Delta ^N\) with \(\varvec{\eta }>\varvec{0}\) and \(\eta = \sum _{i=1}^N \eta _i \delta _{\varvec{y}_i}\) is a discrete probability measure on \({\mathcal {Y}}\), then problem (13) with regularization term \(R_\Theta (\pi ) = D_{f}(\pi \Vert \mu \otimes \eta )\) satisfies

$$\begin{aligned}|{{\overline{W}}}_c(\mu , \nu ) - W_c(\mu , \nu )| \le \max \Bigg \{\bigg |\min _{\varvec{p} \in \Delta ^N} D_f(\varvec{p} \Vert \varvec{\eta })\bigg |, \bigg |\max _{i \in [N]}\bigg \{ f\bigg (\frac{1}{\eta _i}\bigg ) \eta _i+ f(0) \sum _{k \ne i} \eta _k\bigg \}\bigg |\Bigg \}.\end{aligned}$$

Proof

By Lemma 3.2, problem (13) is equivalent to (14). Note that the inner optimization problem in (14) can be viewed as an f-divergence regularized linear program with optimal value \(\varvec{\nu }^\top \varvec{\phi }-\ell (\varvec{\phi }, \varvec{x})\), where

$$\begin{aligned} \ell (\varvec{\phi }, \varvec{x}) = \max \limits _{\varvec{p} \in \Delta ^N} \sum \limits _{i=1}^N (\phi _i - c(\varvec{x}, \varvec{y}_i)) p_i - D_f(\varvec{p} \Vert \varvec{\eta }). \end{aligned}$$

Bounding \(D_f(\varvec{p} \Vert \varvec{\eta })\) by its minimum and its maximum over \(\varvec{p}\in \Delta ^N\) then yields the estimates

$$\begin{aligned} \psi _c(\varvec{\phi }, \varvec{x}) - \max _{ \varvec{p} \in \Delta ^N} D_f(\varvec{p} \Vert \varvec{\eta }) \le \ell (\varvec{\phi }, \varvec{x}) \le \psi _c(\varvec{\phi }, \varvec{x}) - \min _{\varvec{p} \in \Delta ^N} D_f(\varvec{p} \Vert \varvec{\eta }). \end{aligned}$$
(16)

Here, \(\psi _c(\varvec{\phi }, \varvec{x})\) stands as usual for the discrete c-transform defined in (9), which can be represented as

$$\begin{aligned} \psi _c(\varvec{\phi }, \varvec{x}) = \max \limits _{\varvec{p} \in \Delta ^N}\sum \limits _{i=1}^N (\phi _i - c(\varvec{x}, \varvec{y}_i)) p_i. \end{aligned}$$
(17)

Multiplying (16) by \(-1\), adding \(\varvec{\nu }^\top \varvec{\phi }\), averaging over \(\varvec{x}\) using the probability measure \(\mu \) and maximizing over \(\varvec{\phi }\in {\mathbb {R}}^N\) further implies via (10) and (14) that

$$\begin{aligned} W_c(\mu ,\nu )+ \min _{ \varvec{p} \in \Delta ^N} D_f(\varvec{p} \Vert \varvec{\eta }) \le {{\overline{W}}}_c(\mu , \nu ) \le W_c(\mu ,\nu ) + \max _{\varvec{p} \in \Delta ^N} D_f(\varvec{p} \Vert \varvec{\eta }). \end{aligned}$$
(18)

As \(D_f(\varvec{p} \Vert \varvec{\eta })\) is convex in \(\varvec{p}\), its maximum is attained at a vertex of \(\Delta ^N\) ([75], Theorem 1), that is,

$$\begin{aligned} \max _{\varvec{p} \in \Delta ^N} D_f(\varvec{p} \Vert \varvec{\eta }) = \max _{i \in [N]}\bigg \{ f\bigg (\frac{1}{\eta _i}\bigg ) \eta _i + f(0) \sum _{k \ne i} \eta _k\bigg \}. \end{aligned}$$

The claim then follows by substituting the above formula into (18) and rearranging terms. \(\square \)

In the following we discuss three different classes of ambiguity sets \(\Theta \) for which the dual smooth optimal transport problem (12) is indeed equivalent to the primal regularized optimal transport problem (13).

3.1 Generalized extreme value distributions

Assume first that the ambiguity set \(\Theta \) is a singleton containing a single Borel probability measure \(\theta \) on \({\mathbb {R}}^N\) defined through

$$\begin{aligned} \theta (\varvec{z} \le \varvec{s}) = \exp \left( -G \left( \exp (-s_1),\ldots , \exp (-s_N) \right) \right) \quad \forall \varvec{s}\in {\mathbb {R}}^N, \end{aligned}$$
(19)

where \(G:{\mathbb {R}}^N \rightarrow {\mathbb {R}}_+\) is a smooth generating function with the following properties. First, G is homogeneous of degree \(1/\lambda \) for some \(\lambda >0\), that is, for any \(\alpha > 0\) and \(\varvec{s}\in {\mathbb {R}}^N\) we have \(G(\alpha \varvec{s}) = \alpha ^{1/\lambda }G(\varvec{s})\). In addition, \(G(\varvec{s})\) tends to infinity as \(s_i\) grows for any \(i \in [N]\). Finally, the partial derivative of G with respect to k distinct arguments is non-negative if k is odd and non-positive if k is even. These properties ensure that the noise vector \(\varvec{z}\) follows a generalized extreme value distribution in the sense of ([165], § 4.1).

Proposition 3.4

(Entropic regularization) Assume that \(\Theta \) is a singleton ambiguity set that contains only a generalized extreme value distribution with \(G( \varvec{s}) = \exp (-e)N\sum _{i=1}^N \eta _i s_i^{1/\lambda }\) for some \(\lambda > 0\) and \(\varvec{\eta }\in \Delta ^N\), \(\varvec{\eta }> \varvec{0}\), where e stands for Euler's constant. Then, the components of \(\varvec{z}\) follow independent Gumbel distributions with means \(\lambda \log (N \eta _i)\) and variances \(\lambda ^2 \pi ^2 /6\) for all \(i\in [N]\), while the smooth c-transform (11) reduces to the \(\log \)-partition function

$$\begin{aligned} {{\overline{\psi }}}(\varvec{\phi }, \varvec{x}) = \lambda \log \left( \sum _{i=1}^N \eta _i \exp \left( \frac{\phi _i -c(\varvec{x},\varvec{y_i})}{\lambda } \right) \right) . \end{aligned}$$
(20)

In addition, the smooth dual optimal transport problem (12) is equivalent to the regularized primal optimal transport problem (13) with \(R_\Theta (\pi ) = D_f(\pi \Vert \mu \otimes \eta )\), where \(f(s) =\lambda s\log (s)\) and \(\eta = \sum _{i =1}^N \eta _i \delta _{\varvec{y}_i}\).

Note that the log-partition function (20) indeed constitutes a smooth approximation for the maximum function in the definition (9) of the discrete c-transform, and this approximation becomes increasingly accurate as \(\lambda \) decreases. It is also instructive to consider the special case where \(\mu =\sum _{i=1}^M\mu _i\delta _{\varvec{x}_i}\) is a discrete probability measure with atoms \(\varvec{x}_1,\ldots ,\varvec{x}_M\in {\mathcal {X}}\) and corresponding vector of probabilities \(\varvec{\mu }\in \Delta ^M\). In this case, any coupling \(\pi \in \Pi (\mu ,\nu )\) constitutes a discrete probability measure \(\pi =\sum _{i=1}^M\sum _{j=1}^N \pi _{ij}\delta _{(\varvec{x}_i,\varvec{y}_j)}\) with matrix of probabilities \(\varvec{\pi }\in \Delta ^{M\times N}\). If \(f(s)=s\log (s)\), then the continuous f-divergence reduces to

$$\begin{aligned} D_f(\pi \Vert \mu \otimes \eta )&=\sum _{i=1}^M\sum _{j=1}^N \pi _{ij}\log (\pi _{ij})-\sum _{i=1}^M\sum _{j=1}^N \pi _{ij}\log (\mu _i)-\sum _{i=1}^M\sum _{j=1}^N \pi _{ij}\log (\eta _j)\\&=\sum _{i=1}^M\sum _{j=1}^N \pi _{ij}\log (\pi _{ij})-\sum _{i=1}^M\mu _i\log (\mu _i)-\sum _{j=1}^N \nu _j\log (\eta _j), \end{aligned}$$

where the second equality holds because \(\pi \) is a coupling of \(\mu \) and \(\nu \). Thus, \(D_f(\pi \Vert \mu \otimes \eta )\) coincides with the negative entropy of the probability matrix \(\varvec{\pi }\) offset by a constant that is independent of \(\varvec{\pi }\). For \(f(s)=s\log (s)\) the choice of \(\varvec{\eta }\) has therefore no impact on the minimizer of the smooth optimal transport problem (13), and we simply recover the celebrated entropic regularization proposed by Cuturi [39], Genevay et al. [64], Rigollet and Weed [135], Peyré and Cuturi [127] and Clason et al. [33].
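In this fully discrete setting the entropy regularized problem can be solved by the matrix scaling algorithm of Sinkhorn popularized by Cuturi [39]. The following minimal Python sketch (function and variable names are ours; it is an illustration under default parameters rather than a tuned solver) alternately rescales the rows and columns of the Gibbs kernel \(\exp (-\varvec{C}/\lambda )\) until both marginal constraints are (approximately) satisfied:

```python
import numpy as np

def sinkhorn_plan(mu, nu, C, lam, n_iter=500):
    """Minimal Sinkhorn-style sketch (names ours) for the discrete problem
    min_pi <C, pi> + lam * sum_ij pi_ij log(pi_ij) over couplings of mu, nu;
    as noted above, the choice of eta only shifts the objective by a constant.
    """
    K = np.exp(-C / lam)                  # Gibbs kernel
    v = np.ones_like(nu)
    for _ in range(n_iter):
        u = mu / (K @ v)                  # enforce row marginals
        v = nu / (K.T @ u)                # enforce column marginals
    return u[:, None] * K * v[None, :]    # dense optimal plan
```

The returned plan is strictly positive, which illustrates the loss of sparsity incurred by entropic regularization.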

Proof of Proposition 3.4

Substituting the explicit formula for the generating function G into (19) yields

$$\begin{aligned} \theta (\varvec{z} \le \varvec{s})&= \exp \left( -\exp (-e)N\sum \limits _{i=1}^N \eta _i \exp \left( -\frac{s_i}{\lambda }\right) \right) \\&=\prod \limits _{i=1}^N \exp \left( -\exp (-e)N\eta _i \exp \left( -\frac{s_i}{\lambda } \right) \right) \\&= \prod \limits _{i=1}^N \exp \left( -\exp \left( -\frac{s_i - \lambda (\log (N\eta _i)-e)}{\lambda }\right) \right) , \end{aligned}$$

where e stands for Euler’s constant. The components of the noise vector \(\varvec{z}\) are thus independent under \(\theta \), and \(z_i\) follows a Gumbel distribution with location parameter \(\lambda (\log (N\eta _i)-e)\) and scale parameter \(\lambda \) for every \(i \in [N]\). Therefore, \(z_i\) has mean \(\lambda \log (N \eta _i)\) and variance \(\lambda ^2 \pi ^2/6\).

If the ambiguity set \(\Theta \) contains only one single probability measure \(\theta \) of the form (19), then Theorem 5.2 of McFadden [101] readily implies that the smooth c-transform (11) simplifies to

$$\begin{aligned} {{\overline{\psi }}}(\varvec{\phi }, \varvec{x}) = \lambda \log G \left( \exp (\phi _1 -c(\varvec{x},\varvec{y}_1)),\dots , \exp (\phi _N - c(\varvec{x}, \varvec{y}_N)) \right) + \lambda e.\qquad \end{aligned}$$
(21)

The closed-form expression for the smooth c-transform in (20) follows immediately by substituting the explicit formula for the generating function G into (21). One further verifies that (20) can be reformulated as

$$\begin{aligned} {\overline{\psi }}_c(\varvec{\phi }, \varvec{x}) = \max \limits _{\varvec{p} \in \Delta ^N} \sum \limits _{i=1}^N (\phi _i - c(\varvec{x}, \varvec{y}_i)) p_i - \lambda \sum \limits _{i=1}^N p_i \log \left( \frac{p_i}{\eta _i}\right) . \end{aligned}$$
(22)

Indeed, solving the underlying Karush-Kuhn-Tucker conditions analytically shows that the optimal value of the nonlinear program (22) coincides with the smooth c-transform (20). Specifically, abbreviating \(u_i = \phi _i - c(\varvec{x}, \varvec{y}_i)\) and denoting by \(\tau \) the multiplier of the normalization constraint \(\sum _{i=1}^N p_i = 1\), the stationarity condition \(u_i - \lambda (\log (p_i/\eta _i) + 1) = \tau \) yields \(p_i^\star = \eta _i \exp (u_i/\lambda ) / \sum _{j=1}^N \eta _j \exp (u_j/\lambda )\), and substituting \(\varvec{p}^\star \) into the objective of (22) recovers (20). In the special case where \(\eta _i = 1/N\) for all \(i \in [N]\), the equivalence of (20) and (22) has already been recognized by Anderson et al. [10]. Substituting the representation (22) of the smooth c-transform into the dual smooth optimal transport problem (12) yields (14) with \(f(s)= \lambda s \log (s)\). By Lemma 3.2, problem (12) is thus equivalent to the regularized primal optimal transport problem (13) with \(R_\Theta (\pi ) = D_f(\pi \Vert \mu \otimes \eta )\), where \(\eta = \sum _{i =1}^N \eta _i \delta _{\varvec{y}_i}\). \(\square \)

3.2 Chebyshev ambiguity sets

Assume next that \(\Theta \) constitutes a Chebyshev ambiguity set comprising all Borel probability measures on \({\mathbb {R}}^N\) with mean vector \(\varvec{0}\) and positive definite covariance matrix \(\lambda \varvec{\Sigma }\) for some \(\varvec{\Sigma }\succ \varvec{0}\) and \(\lambda > 0\). Formally, we thus set \(\Theta = \{\theta \in \mathcal P({\mathbb {R}}^N) : {\mathbb {E}}_\theta [\varvec{z}] = \varvec{0},\, \mathbb E_\theta [\varvec{z} \varvec{z}^\top ] = \lambda \varvec{\Sigma }\}\). In this case, ([4], Theorem 1) implies that the smooth c-transform (11) can be equivalently expressed as

$$\begin{aligned} {{\overline{\psi }}}_c(\varvec{\phi }, \varvec{x}) = \max _{\varvec{p}\in \Delta ^N} \sum \limits _{i=1}^N(\phi _i -c(\varvec{x}, \varvec{y_i}))p_i + \lambda \,\text {tr}\left( (\varvec{\Sigma }^{1/2}(\text {diag}(\varvec{p})-\varvec{p}\varvec{p}^\top )\varvec{\Sigma }^{1/2})^{1/2}\right) , \end{aligned}$$
(23)

where \(\text {diag}(\varvec{p})\in {\mathbb {R}}^{N\times N}\) represents the diagonal matrix with \(\varvec{p}\) on its main diagonal. Note that the maximum in (23) evaluates the convex conjugate of the extended real-valued regularization function

$$\begin{aligned} V(\varvec{p})=\left\{ \begin{array}{c@{\qquad }l} -\lambda \,\text {tr}\left( (\varvec{\Sigma }^{1/2}(\text {diag}(\varvec{p})-\varvec{p}\varvec{p}^\top )\varvec{\Sigma }^{1/2})^{1/2}\right) &{} \text {if }\quad \varvec{p}\in \Delta ^N \\ \infty &{} \text {if }\quad \varvec{p}\notin \Delta ^N \end{array}\right. \end{aligned}$$

at the point \((\phi _i -c(\varvec{x}, \varvec{y_i}))_{i\in [N]}\). As \(\varvec{\Sigma }\succ \varvec{0}\) and \(\lambda >0\), ([4], Theorem 1) implies that \(V(\varvec{p})\) is strongly convex over its effective domain \(\Delta ^N\). By ([138], Proposition 12.60), the smooth discrete c-transform \({{\overline{\psi }}}_c(\varvec{\phi }, \varvec{x})\) is therefore indeed differentiable in \(\varvec{\phi }\) for any fixed \(\varvec{x}\). It is further known that problem (23) admits an exact reformulation as a tractable semidefinite program; see ([104], Proposition 1). If \(\varvec{\Sigma }= \varvec{I}\), then the regularization function \(V(\varvec{p})\) can be re-expressed in terms of a discrete f-divergence, which implies via Lemma 3.2 that the smooth optimal transport problem is equivalent to the original optimal transport problem regularized with a continuous f-divergence.
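For illustration, the trace term in (23) and the regularization function \(V(\varvec{p})\) can be evaluated with standard dense linear algebra. The following Python sketch (names are ours; a plain eigendecomposition is used, which is adequate for moderate N) computes the penalty \(-V(\varvec{p})\) for a given \(\varvec{p}\in \Delta ^N\):

```python
import numpy as np

def sqrtm_psd(M):
    """Symmetric square root of a positive semidefinite matrix."""
    w, V = np.linalg.eigh(M)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def chebyshev_penalty(p, Sigma, lam):
    """lam * tr((Sigma^{1/2}(diag(p) - p p^T) Sigma^{1/2})^{1/2}), cf. (23)."""
    A = np.diag(p) - np.outer(p, p)   # positive semidefinite on the simplex
    S = sqrtm_psd(Sigma)              # symmetric square root of Sigma
    return lam * np.trace(sqrtm_psd(S @ A @ S))
```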

Proposition 3.5

(Chebyshev regularization) If \(\Theta \) is the Chebyshev ambiguity set of all Borel probability measures with mean \(\varvec{0}\) and covariance matrix \(\lambda \varvec{I}\) with \(\lambda > 0\), then the smooth c-transform (11) simplifies to

$$\begin{aligned} {{\overline{\psi }}}_c(\varvec{\phi }, \varvec{x}) = \max _{ \varvec{p}\in \Delta ^N} \sum \limits _{i=1}^N(\phi _i -c(\varvec{x}, \varvec{y_i})) p_i + \lambda \sum _{i=1}^N\sqrt{p_i(1-p_i)}. \end{aligned}$$
(24)

In addition, the smooth dual optimal transport problem (12) is equivalent to the regularized primal optimal transport problem (13) with \(R_\Theta (\pi ) = D_f(\pi \Vert \mu \otimes \eta )- \lambda \sqrt{N-1}\), where \(\eta = \frac{1}{N} \sum _{i =1}^N \delta _{\varvec{y}_i}\) and

$$\begin{aligned} f(s) = {\left\{ \begin{array}{ll} -\lambda \sqrt{s(N - s)} + \lambda s \sqrt{N-1} \quad \quad &{} \text {if }\quad 0 \le s \le N\\ +\infty &{} \text {if }\quad s>N. \end{array}\right. }\end{aligned}$$
(25)

Proof

The relation (24) follows directly from (23) by replacing \(\varvec{\Sigma }\) with \(\varvec{I}\). Next, one readily verifies that \(-\lambda \sum _{i \in [N]} \sqrt{p_i(1-p_i)} = D_f(\varvec{p}\Vert \varvec{\eta }) - \lambda \sqrt{N-1}\), where \(D_f(\varvec{p}\Vert \varvec{\eta })\) denotes the discrete f-divergence from \(\varvec{p}\) to \(\varvec{\eta }=(\frac{1}{N},\ldots ,\frac{1}{N})\) induced by the generator \(f(s) =-\lambda \sqrt{s (N - s)}+ \lambda s \sqrt{N-1}\) from (25). This implies that (24) is equivalent to

$$\begin{aligned} {{\overline{\psi }}}_c(\varvec{\phi }, \varvec{x}) = \max _{ \varvec{p}\in \Delta ^N} \sum \limits _{i=1}^N(\phi _i -c(\varvec{x}, \varvec{y_i})) p_i - D_f(\varvec{p}\Vert \varvec{\eta }) + \lambda \sqrt{N-1}. \end{aligned}$$

Substituting the above representation of the smooth c-transform into the dual smooth optimal transport problem (12) yields (14) with \(f(s)= -\lambda \sqrt{s (N - s)} +\lambda s \sqrt{N-1} \), up to the additive constant \(\lambda \sqrt{N-1}\), which is independent of \(\varvec{\phi }\) and \(\varvec{p}\). By Lemma 3.2, (12) thus reduces to the regularized primal optimal transport problem (13) with \(R_\Theta (\pi ) = D_f(\pi \Vert \mu \otimes \eta ) - \lambda \sqrt{N-1}\), where \(\eta = \frac{1}{N} \sum _{i =1}^N \delta _{\varvec{y}_i}\). \(\square \)

Note that the function f(s) defined in (25) is indeed convex, lower-semicontinuous and satisfies \(f(1)=0\). Therefore, it induces a standard f-divergence. Proposition 3.5 can be generalized to arbitrary diagonal matrices \(\varvec{\Sigma }\), but the emerging f-divergences are rather intricate and not insightful, and we therefore omit this generalization. We were not able to generalize Proposition 3.5 to non-diagonal matrices \(\varvec{\Sigma }\).

3.3 Marginal ambiguity sets

We now investigate the class of marginal ambiguity sets of the form

$$\begin{aligned} \Theta = \Big \{ \theta \in {\mathcal {P}}({\mathbb {R}}^N) \, : \, \theta (z_i \le s) = F_i(s)\;\forall s\in {\mathbb {R}}, \; \forall i \in [N] \Big \}, \end{aligned}$$
(26)

where \(F_i\) stands for the cumulative distribution function of the uncertain disturbance \(z_i\), \(i\in [N]\). Marginal ambiguity sets completely specify the marginal distributions of the components of the random vector \(\varvec{z}\) but impose no restrictions on their dependence structure (i.e., their copula). Sometimes marginal ambiguity sets are also referred to as Fréchet ambiguity sets [62]. We will argue below that the marginal ambiguity sets explain most known as well as several new regularization methods for the optimal transport problem. In particular, they are more expressive than the extreme value distributions as well as the Chebyshev ambiguity sets in the sense that they induce a richer family of regularization terms. Below we denote by \(F_i^{-1} : [0, 1] \rightarrow {\mathbb {R}}\) the (left) quantile function corresponding to \(F_i\), which is defined through

$$\begin{aligned} F_i^{-1}(t) = \inf \{s :F_i(s) \ge t \}\quad \forall t\in [0, 1]. \end{aligned}$$

We first prove that if \(\Theta \) constitutes a marginal ambiguity set, then the smooth c-transform (11) admits an equivalent reformulation as the optimal value of a finite convex program.

Proposition 3.6

(Smooth c-transform for marginal ambiguity sets) If \(\Theta \) is a marginal ambiguity set of the form (26), and if the underlying cumulative distribution functions \(F_i\), \(i\in [N]\), are continuous, then the smooth c-transform (11) can be equivalently expressed as

$$\begin{aligned} {{\overline{\psi }}}_c(\varvec{\phi }, \varvec{x})= & {} \max _{ \varvec{p} \in \Delta ^N} \displaystyle \sum \limits _{i=1}^N ~ (\phi _i - c(\varvec{x}, \varvec{y_i}))p_i \nonumber \\&+ \sum _{i=1}^N \int _{1-p_i}^1 F_i^{-1}(t) \mathrm {d}t \end{aligned}$$
(27)

for all \(\varvec{x}\in {\mathcal {X}}\) and \(\varvec{\phi }\in {\mathbb {R}}^N\). In addition, the smooth c-transform is convex and differentiable with respect to \(\varvec{\phi }\), and \(\nabla _{\varvec{\phi }} {{\overline{\psi }}}_c(\varvec{\phi }, \varvec{x})\) represents the unique solution of the convex maximization problem (27).

Recall that the smooth c-transform (11) can be viewed as the best-case utility of a semi-parametric discrete choice model. Thus, (27) follows from [111, Theorem 1]. To keep this paper self-contained, we provide a new proof of Proposition 3.6, which exploits a natural connection between the smooth c-transform induced by a marginal ambiguity set and the conditional value-at-risk (CVaR).
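Before turning to the proof, we note that the representation (27) also suggests a simple numerical scheme: since the objective of (27) is concave and separable in \(\varvec{p}\), an interior maximizer equalizes the derivatives \(u_i + F_i^{-1}(1-p_i)\) across \(i\) (cf. (32) below), so that \(p_i = 1 - F_i(\tau - u_i)\) for a scalar \(\tau \) determined by \(\sum _{i} p_i = 1\). The following Python sketch (names and the bisection bracket are our own assumptions; any continuous CDFs can be supplied) solves for \(\tau \) by bisection:

```python
import numpy as np
from scipy.stats import norm  # only used in the usage example below

def smooth_c_transform_argmax(u, cdfs, lo=-50.0, hi=50.0, tol=1e-10):
    """Maximizer p* of (27): p_i = 1 - F_i(tau - u_i), with tau chosen by
    bisection so that sum_i p_i = 1 (the bracket [lo, hi] is an assumption)."""
    def total(tau):
        # sum_i p_i(tau); decreasing in tau since each F_i is a CDF
        return sum(1.0 - F(tau - ui) for F, ui in zip(cdfs, u))
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if total(mid) > 1.0:
            lo = mid     # sum too large: increase tau
        else:
            hi = mid
    tau = 0.5 * (lo + hi)
    return np.array([1.0 - F(tau - ui) for F, ui in zip(cdfs, u)])

# Example with standard Gaussian marginals; by Proposition 3.6 the result
# also equals the gradient of the smooth c-transform with respect to phi.
u = np.array([0.3, -0.1, 0.7])
p_star = smooth_c_transform_argmax(u, [norm.cdf] * 3)
```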

Proof of Proposition 3.6

Throughout the proof we fix \(\varvec{x}\in {\mathcal {X}}\) and \(\varvec{\phi }\in {\mathbb {R}}^N\), and we introduce the nominal utility vector \(\varvec{u} \in {\mathbb {R}}^N\) with components \(u_i= \phi _i - c(\varvec{x}, \varvec{y}_i)\) in order to simplify notation. In addition, it is useful to define the binary function \(\varvec{r}: {\mathbb {R}}^N \rightarrow \{ 0, 1 \}^N\) with components

$$\begin{aligned} r_i(\varvec{z}) = {\left\{ \begin{array}{ll} 1 &{} \text {if } i = \displaystyle \min \, \mathop {\mathrm{argmax}}\limits _{j \in [N]} ~ u_j + z_j, \\ 0 &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$

For any fixed \(\theta \in \Theta \), we then have

$$\begin{aligned} {\mathbb {E}}_{\varvec{z} \sim \theta } \Big [ \max \limits _{i \in [N]} u_i + z_{i} \Big ] = {\mathbb {E}}_{\varvec{z} \sim \theta } \Big [ \; \sum _{i=1}^N ( u_i + z_i) r_i(\varvec{z}) \Big ]&= \sum _{i=1}^N u_i p_i + \sum _{i=1}^N {\mathbb {E}}_{\varvec{z} \sim \theta } \left[ z_i q_i(z_i) \right] , \end{aligned}$$

where \(p_i = {\mathbb {E}}_{\varvec{z} \sim \theta } [ r_i(\varvec{z}) ]\) and \(q_i(z_i) = {\mathbb {E}}_{\varvec{z} \sim \theta } [ r_i(\varvec{z}) | z_i ]\) almost surely with respect to \(\theta \). From now on we denote by \(\theta _i\) the marginal probability distribution of the random variable \(z_i\) under \(\theta \). As \(\theta \) belongs to a marginal ambiguity set of the form (26), we thus have \(\theta _i (z_i \le s) = F_i(s)\) for all \(s \in {\mathbb {R}}\), that is, \(\theta _i\) is uniquely determined by the cumulative distribution function \(F_i\). The above reasoning then implies that

$$\begin{aligned} {{\overline{\psi }}}_c(\varvec{\phi }, \varvec{x}) = \sup _{\theta \in \Theta } ~ {\mathbb {E}}_{\varvec{z} \sim \theta } \Big [ \max _{i \in [N]} u_i + z_i \Big ]&= \left\{ \begin{array}{cll} \sup &{} \displaystyle \sum _{i=1}^N u_i p_i + \sum _{i=1}^N {\mathbb {E}}_{\varvec{z} \sim \theta } \left[ z_i q_i(z_i) \right] \\ \text {s.t.} &{} \theta \in \Theta , ~\varvec{p} \in \Delta ^N, ~\varvec{q} \in \mathcal L^N({\mathbb {R}}) \\ &{} {\mathbb {E}}_{\varvec{z} \sim \theta } \left[ r_i(\varvec{z}) \right] = p_i &{} \forall i \in [N] \\ &{} {\mathbb {E}}_{\varvec{z} \sim \theta } [ r_i(\varvec{z}) | z_i ] = q_i(z_i) \quad \theta \text {-a.s.} &{} \forall i \in [N] \end{array} \right. \nonumber \\&\le \left\{ \begin{array}{cll} \sup &{} \displaystyle \sum _{i=1}^N u_i p_i + \sum _{i=1}^N {\mathbb {E}}_{z_i \sim \theta _i} \left[ z_i q_i(z_i) \right] \\ \text {s.t.} &{} \varvec{p} \in \Delta ^N,~ \varvec{q} \in \mathcal L^N({\mathbb {R}}) \\ &{} {\mathbb {E}}_{z_i \sim \theta _i} \left[ q_i(z_i) \right] = p_i &{} \forall i \in [N] \\ &{} 0 \le q_i(z_i) \le 1 \quad \theta _i\text {-a.s.} &{} \forall i \in [N]. \end{array} \right. \end{aligned}$$
(28)

The inequality can be justified as follows. One may first add the redundant expectation constraints \(p_i = {\mathbb {E}}_{z_i \sim \theta } [q_i(z_i)]\) and the redundant \(\theta _i\)-almost sure constraints \(0\le q_i(z_i)\le 1\) to the maximization problem over \( \theta \), \(\varvec{p}\) and \(\varvec{q}\) without affecting the problem’s optimal value. Next, one may remove the constraints that express \(p_i\) and \(q_i(z_i)\) in terms of \(r_i(\varvec{z})\). The resulting relaxation provides an upper bound on the original maximization problem. Note that all remaining expectation operators involve integrands that depend on \(\varvec{z}\) only through \(z_i\) for some \(i\in [N]\), and therefore the expectations with respect to the joint probability measure \(\theta \) can all be simplified to expectations with respect to one of the marginal probability measures \(\theta _i\). As neither the objective nor the constraints of the resulting problem depend on \(\theta \), we may finally remove \(\theta \) from the list of decision variables without affecting the problem’s optimal value.

For any fixed \(\varvec{p} \in \Delta ^N\), the upper bounding problem (28) gives rise to the following N subproblems indexed by \(i\in [N]\).

$$\begin{aligned} \sup _{q_i \in \mathcal L({\mathbb {R}})} \bigg \{ {\mathbb {E}}_{z_i \sim \theta _i} \left[ z_i q_i(z_i) \right] : {\mathbb {E}}_{z_i \sim \theta _i} \left[ q_i(z_i) \right] = p_i, ~ 0 \le q_i(z_i) \le 1 ~ \theta _i\text {-a.s.} \bigg \} \end{aligned}$$
(29a)

If \(p_i > 0 \), the optimization problem (29a) over the functions \(q_i \in \mathcal L({\mathbb {R}})\) can be recast as an optimization problem over probability measures \({{\tilde{\theta }}}_i \in \mathcal P({\mathbb {R}})\) that are absolutely continuous with respect to \(\theta _i\),

$$\begin{aligned} \sup _{{{\tilde{\theta }}}_i \in \mathcal P({\mathbb {R}})} \bigg \{ p_i \; {\mathbb {E}}_{z_i \sim {{\tilde{\theta }}}_i} \left[ z_i \right] : \frac{\mathrm {d}{{\tilde{\theta }}}_i}{\mathrm {d}\theta _i}(z_i) \le \frac{1}{p_i} ~ \theta _i\text {-a.s.} \bigg \}, \end{aligned}$$
(29b)

where \(\mathrm {d}{{\tilde{\theta }}}_i / \mathrm {d}\theta _i \) denotes as usual the Radon-Nikodym derivative of \({{\tilde{\theta }}}_i\) with respect to \(\theta _i\). Indeed, if \(q_i\) is feasible in (29a), then \({{\tilde{\theta }}}_i\) defined through \({{\tilde{\theta }}}_i(B)= \frac{1}{p_i} \int _B q_i(z_i)\, \theta _i(\mathrm {d}z_i)\) for all Borel sets \(B\subseteq {\mathbb {R}}\) is feasible in (29b) and attains the same objective function value. Conversely, if \({{\tilde{\theta }}}_i\) is feasible in (29b), then \(q_i (z_i)= p_i \, \mathrm {d}{{\tilde{\theta }}}_i / \mathrm {d}\theta _i (z_i)\) is feasible in (29a) and attains the same objective function value. Thus, (29a) and (29b) are indeed equivalent. By ([61], Theorem 4.47), the optimal value of (29b) is given by \(p_i \, \theta _i \text {-CVaR}_{p_i}(z_i) = \int _{1-p_i}^1 F_i^{-1}(t) \mathrm {d}t\), where \(\theta _i \text {-CVaR}_{p_i}(z_i)\) denotes the CVaR of \(z_i\) at level \(p_i\) under \(\theta _i\).

If \(p_i = 0\), on the other hand, then the optimal value of (29a) and the integral \(\int _{1-p_i}^1 F_i^{-1}(t) \mathrm {d}t\) both evaluate to zero. Thus, the optimal value of the subproblem (29a) coincides with \(\int _{1-p_i}^1 F_i^{-1}(t) \mathrm {d}t\) irrespective of \(p_i\). Substituting this optimal value into (28) finally yields the explicit upper bound

$$\begin{aligned} \sup _{\theta \in \Theta } ~ {\mathbb {E}}_{z \sim \theta } \Big [ \max \limits _{i \in [N]} u_i + z_i \Big ]&\le \sup _{\varvec{p} \in \Delta ^N} ~ \sum _{i=1}^N u_i p_i + \sum _{i=1}^N \int _{1-p_i}^1 F_i^{-1}(t) \mathrm {d}t. \end{aligned}$$
(30)

Note that the objective function of the upper bounding problem on the right hand side of (30) constitutes a sum of the strictly concave and differentiable univariate functions \(u_i p_i + \int _{1-p_i}^1 F_i^{-1}(t)\,\mathrm {d}t\). Indeed, the derivative of the \(i^{\text {th}}\) function with respect to \(p_i\) is given by \(u_i + F_i^{-1}(1-p_i)\), which is strictly decreasing in \(p_i\) because \(F_i^{-1}\) is strictly increasing thanks to the assumed continuity of \(F_i\). The upper bounding problem in (30) is thus solvable as it has a compact feasible set as well as a differentiable objective function. Moreover, the solution is unique thanks to the strict concavity of the objective function. In the following we denote this unique solution by \(\varvec{p}^\star \).

It remains to be shown that there exists a distribution \(\theta ^\star \in \Theta \) that attains the upper bound in (30). To this end, we define the functions \( q_i^\star (z_i) = \mathbbm {1}_{\{ z_i > F_i^{-1}(1 - p_i^\star ) \}}\) for all \(i \in [N]\). By ([61], Remark 4.48), \(q_i^\star (z_i)\) is optimal in (29a) for \(p_i=p_i^\star \). In other words, we have \({\mathbb {E}}_{z_i \sim \theta _i} [q_i^\star (z_i)] = p_i^\star \) and \({\mathbb {E}}_{z_i \sim \theta _i}[z_i q_i^\star (z_i)] = \int _{1 - p_i^\star }^1 F_i^{-1}(t) \mathrm {d}t\). In addition, we also define the Borel measures \(\theta _i^+\) and \(\theta _i^-\) through

$$\begin{aligned} \theta _i^+(B) = \theta _i(B | z_i > F_i^{-1}(1 - p_i^\star )) \quad \text {and} \quad \theta _i^-(B) = \theta _i(B | z_i \le F_i^{-1}(1 - p_i^\star )) \end{aligned}$$

for all Borel sets \(B \subseteq {\mathbb {R}}\), respectively. By construction, \(\theta _i^+\) is supported on \((F_i^{-1}(1 - p_i^\star ), \infty )\), while \(\theta _i^-\) is supported on \((-\infty , F_i^{-1}(1 - p_i^\star )]\). The law of total probability further implies that \(\theta _i = p_i^\star \theta _i^+ + (1 - p_i^\star ) \theta _i^-\). In the remainder of the proof we will demonstrate that the maximization problem on the left hand side of (30) is solved by the mixture distribution

$$\begin{aligned} \theta ^\star = \sum _{j=1}^N p_j^\star \cdot \left( \otimes _{k=1}^{j-1} \theta _k^- \right) \otimes \theta _j^+ \otimes \left( \otimes _{k=j+1}^{N} \theta _k^- \right) . \end{aligned}$$

This will show that the inequality in (30) is in fact an equality, which in turn implies that the smooth c-transform is given by (27). We first prove that \(\theta ^\star \in \Theta \). To see this, note that for all \(i \in [N]\) we have

$$\begin{aligned} \theta ^\star (z_i \le s) = p_i^\star \theta _i^+ (z_i \le s) + \left( \sum _{j \ne i} p_j^\star \right) \theta _i^- (z_i \le s) = \theta _i (z_i \le s) = F_i(s), \end{aligned}$$

where the second equality exploits the relation \(\sum _{j \ne i} p_j^\star = 1 - p_i^\star \). This observation implies that \(\theta ^\star \in \Theta \). Next, we prove that \(\theta ^\star \) attains the upper bound in (30). By the definition of the binary function \(\varvec{r}\), we have

$$\begin{aligned} {\mathbb {E}}_{{\varvec{z}} \sim \theta ^\star } \Big [ \max \limits _{i \in [N]} u_i + z_{i} \Big ]&={\mathbb {E}}_{{\varvec{z}} \sim \theta ^\star } \Big [ \sum _{i=1}^N ( u_i + z_i) r_i({\varvec{z}}) \Big ] \\&= \sum _{i=1}^N {\mathbb {E}}_{z_i \sim \theta _i} \left[ (u_i + z_i) {\mathbb {E}}_{{\varvec{z}} \sim \theta ^\star } \left[ r_i({\varvec{z}}) | z_i \right] \right] \\&= \sum _{i=1}^N {\mathbb {E}}_{z_i \sim \theta _i} \Big [ ( u_i + z_i) \, \theta ^\star \Big ( i = \min \, \mathop {\mathrm{argmax}}\limits _{j \in [N]} ~ u_j + z_j \big | z_i \Big ) \Big ] \\&= \sum _{i=1}^N {\mathbb {E}}_{z_i \sim \theta _i} \left[ ( u_i + z_i) \, \theta ^\star \left( z_j < u_i + z_i - u_j~ \forall j \ne i \big | z_i \right) \right] , \end{aligned}$$

where the third equality holds because \(r_i(\varvec{z})=1\) if and only if \(i = \min {{\,\mathrm{argmax}\,}}_{j \in [N]} u_j + z_j\), and the fourth equality follows from the assumed continuity of the marginal distribution functions \(F_i\), \(i\in [N]\), which implies that \(\theta ^\star ( z_j = u_i + z_i - u_j \text { for some } j \ne i \big | z_i ) = 0\) \(\theta _i\)-almost surely for all \(i\in [N]\).

Hence, we find

$$\begin{aligned} {\mathbb {E}}_{\varvec{z} \sim \theta ^\star } \Big [ \max \limits _{i \in [N]} u_i + z_{i} \Big ]&= \sum _{i=1}^N p_i^\star \, {\mathbb {E}}_{z_i \sim \theta _i^+} \left[ ( u_i + z_i) \, \theta ^\star \left( z_j< u_i + z_i - u_j~ \forall j \ne i \big | z_i \right) \right] \nonumber \\&\quad + \sum _{i=1}^N (1 - p_i^\star )\, {\mathbb {E}}_{z_i \sim \theta _i^-} \left[ ( u_i + z_i) \, \theta ^\star \left( z_j< u_i + z_i - u_j~ \forall j \ne i \big | z_i \right) \right] \nonumber \\&= \displaystyle \sum _{i=1}^N p_i^\star \, {\mathbb {E}}_{z_i \sim \theta _i^+} \Big [ (u_i + z_i) \Big ( \prod _{j \ne i} \theta _j^-(z_j < z_i + u_i - u_j) \Big ) \Big ] \end{aligned}$$
(31a)
$$\begin{aligned}&\quad + \displaystyle \sum _{i=1}^N \sum _{j \ne i} p_j^\star \,{\mathbb {E}}_{z_i \sim \theta _i^-} \Big [ (u_i + z_i) \Big ( \prod _{k \ne i, j} \theta _k^-(z_k< z_i + u_i - u_k) \Big )\, \theta _j^+(z_j < z_i + u_i - u_j) \Big ], \end{aligned}$$
(31b)

where the first equality exploits the relation \(\theta _i = p_i^\star \theta _i^+ + (1 - p_i^\star ) \theta _i^-\), while the second equality follows from the definition of \(\theta ^\star \). The expectations in (31) can be further simplified by using the stationarity conditions of the upper bounding problem in (30), which imply that the partial derivatives of the objective function with respect to the decision variables \(p_i\), \(i\in [N]\), are all equal at \(\varvec{p}=\varvec{p}^\star \). Thus, \(\varvec{p}^\star \) must satisfy

$$\begin{aligned} u_i + F_i^{-1}(1 - p_i^\star ) = u_j + F_j^{-1}(1 - p_j^\star ) \quad \forall i, j \in [N]. \end{aligned}$$
(32)

Consequently, for every \(z_i > F_i^{-1}(1 - p_i^\star )\) and \(j\ne i\) we have

$$\begin{aligned} \theta _j^-(z_j < z_i + u_i - u_j) \ge \theta _j^-(z_j \le F_i^{-1}(1 - p_i^\star ) + u_i - u_j) = \theta _j^-(z_j \le F_j^{-1}(1 - p_j^\star )) = 1, \end{aligned}$$

where the first equality follows from (32), and the second equality holds because \(\theta _j^-\) is supported on \((-\infty , F_j^{-1}(1 - p_j^\star )]\). As no probability can exceed 1, the above reasoning implies that \(\theta _j^-(z_j < z_i + u_i - u_j)=1\) for all \(z_i > F_i^{-1}(1 - p_i^\star )\) and \(j\ne i\). Noting that \(q_i^\star (z_i)= \mathbbm {1}_{\{ z_i > F_i^{-1}(1 - p_i^\star ) \}}\) represents the characteristic function of the set \((F_i^{-1}(1 - p_i^\star ), \infty )\) covering the support of \(\theta _i^+\), the term (31a) can thus be simplified to

$$\begin{aligned}&\sum _{i=1}^N p_i^\star \,{\mathbb {E}}_{z_i \sim \theta _i^+} \left[ (u_i + z_i) \left( \prod _{j \ne i} \theta _j^-(z_j < z_i + u_i - u_j) \right) q_i^\star (z_i) \right] = \sum _{i=1}^N {\mathbb {E}}_{z_i \sim \theta _i} \left[ (u_i + z_i) q_i^\star (z_i) \right] . \end{aligned}$$

Similarly, for any \(z_i \le F_i^{-1}(1 - p_i^\star )\) and \(j\ne i\) we have

$$\begin{aligned} \theta _j^+(z_j< z_i + u_i - u_j) \le \theta _j^+(z_j< F_i^{-1}(1 - p_i^\star ) + u_i - u_j) = \theta _j^+(z_j < F_j^{-1}(1 - p_j^\star )) = 0, \end{aligned}$$

where the two equalities follow from (32) and from the observation that \(\theta _j^+\) is supported on \((F_j^{-1}(1 - p_j^\star ), \infty )\), respectively. As probabilities are non-negative, the above implies that \(\theta _j^+(z_j < z_i + u_i - u_j)=0\) for all \(z_i \le F_i^{-1}(1 - p_i^\star )\) and \(j\ne i\). Hence, as \(\theta _i^-\) is supported on \((-\infty , F_i^{-1}(1 - p_i^\star )]\), the term (31b) simplifies to

$$\begin{aligned} \sum _{i=1}^N \sum _{j \ne i} p_j^\star \,{\mathbb {E}}_{z_i \sim \theta _i^-} \Big [ (u_i + z_i) \Big ( \prod _{k \ne i, j} \theta _k^-(z_k< z_i + u_i - u_k) \Big )\, \theta _j^+(z_j < z_i + u_i - u_j) \Big ] = 0. \end{aligned}$$

By combining the simplified reformulations of (31a) and (31b), we finally obtain

$$\begin{aligned} {\mathbb {E}}_{\varvec{z} \sim \theta ^\star } \Big [ \max \limits _{i \in [N]} u_i + z_{i} \Big ] = \sum _{i=1}^N {\mathbb {E}}_{z_i \sim \theta _i} \left[ ( u_i + z_i) q_i^\star (z_i) \right] = \sum _{i=1}^N u_i p_i^\star + \sum _{i=1}^N \int _{1-p_i^\star }^1 F_i^{-1}(t) \mathrm {d}t, \end{aligned}$$

where the last equality exploits the relations \({\mathbb {E}}_{z_i \sim \theta _i} [q_i^\star (z_i)] = p_i^\star \) and \({\mathbb {E}}_{z_i \sim \theta _i}[z_i q_i^\star (z_i)] = \int _{1 - p_i^\star }^1 F_i^{-1}(t) \mathrm {d}t\) derived in the first part of the proof. We have thus shown that the smooth c-transform is given by (27).

Finally, by the envelope theorem ([44], Theorem 2.16), the gradient \(\nabla _{\varvec{\phi }}{{\overline{\psi }}}_c(\varvec{\phi }, \varvec{x})\) exists and coincides with the unique maximizer \(\varvec{p}^\star \) of the maximization problem in (27). \(\square \)

The next theorem reveals that the smooth dual optimal transport problem (12) with a marginal ambiguity set corresponds to a regularized primal optimal transport problem of the form (13).

Theorem 3.7

(Fréchet regularization) Suppose that \(\Theta \) is a marginal ambiguity set of the form (26) and that the marginal cumulative distribution functions are defined through

$$\begin{aligned} F_i(s) = \min \{1, \max \{0, 1-\eta _i F(-s)\}\} \end{aligned}$$
(33)

for some probability vector \(\varvec{\eta }\in \Delta ^N\) and strictly increasing function \(F: {\mathbb {R}}\rightarrow {\mathbb {R}}\) with \(\int _0^1 F^{-1} (t) \mathrm {d}t = 0\). Then, the smooth dual optimal transport problem (12) is equivalent to the regularized primal optimal transport problem (13) with \(R_\Theta (\pi ) = D_f(\pi \Vert \mu \otimes \eta )\), where \(f(s) = \int _{0 }^{s} F^{-1}(t) \mathrm {d}t\) and \(\eta = \sum _{i=1}^N \eta _i \delta _{\varvec{y}_i}\).

The function f(s) introduced in Theorem 3.7 is smooth and convex because its derivative \( \mathrm {d}f(s) / \mathrm {d}s = F^{-1}(s)\) is strictly increasing, and \(f(1) = \int _0^1 F^{-1}(t) \mathrm {d}t=0\) by assumption. Therefore, this function induces a standard f-divergence. From now on we will refer to F as the marginal generating function.

Proof of Theorem 3.7

By Proposition 3.6, the smooth dual optimal transport problem (12) is equivalent to

$$\begin{aligned} {\overline{W}}_{c}(\mu , \nu )&= \sup \limits _{ \varvec{\phi }\in {\mathbb {R}}^N} ~ {\mathbb {E}}_{\varvec{x} \sim \mu }\left[ \min \limits _{\varvec{p}\in \Delta ^N} \sum \limits _{i=1}^N{\phi _i\nu _i}- \sum \limits _{i=1}^N(\phi _i - c(\varvec{x}, \varvec{y_i}))p_i - \sum _{i=1}^N \displaystyle \int _{1-p_i}^1 F_i^{-1}(t)\mathrm {d}t \right] . \end{aligned}$$

As F is strictly increasing, we have \(F_i^{-1}(s) = -F^{-1}((1-s) / \eta _i)\) for all \(s \in (0, 1)\). Thus, we find

$$\begin{aligned} f(s) = \int _{0}^{s} F^{-1}(t) \mathrm {d}t = -\frac{1}{\eta _i} \int _{1}^{1 - s \eta _i} F^{-1} \left( \frac{1 - z}{\eta _i} \right) \mathrm {d}z= -\frac{1}{ \eta _i} \int _{1 - s \eta _i}^1 F_i^{-1}(z) \mathrm {d}z, \end{aligned}$$
(34)

where the second equality follows from the variable substitution \(z\leftarrow 1-\eta _i t\). This integral representation of f(s) then allows us to reformulate the smooth dual optimal transport problem as

$$\begin{aligned} {\overline{W}}_{c}(\mu , \nu )= \sup \limits _{ \varvec{\phi }\in {\mathbb {R}}^N} ~ {\mathbb {E}}_{\varvec{x} \sim \mu }\left[ \min \limits _{\varvec{p}\in \Delta ^N} \sum \limits _{i=1}^N{\phi _i\nu _i}- \sum \limits _{i=1}^N(\phi _i - c(\varvec{x}, \varvec{y_i}))p_i + \sum \limits _{i=1}^N \eta _i \,f\left( \frac{p_i}{\eta _i} \right) \right] , \end{aligned}$$

which is manifestly equivalent to problem (14) thanks to the definition of the discrete f-divergence. Lemma 3.2 finally implies that the resulting instance of (14) is equivalent to the regularized primal optimal transport problem (13) with regularization term \(R_\Theta (\pi ) = D_{f}(\pi \Vert \mu \otimes \eta )\). Hence, the claim follows. \(\square \)

Theorem 3.7 imposes relatively restrictive conditions on the marginals of \(\varvec{z}\). Indeed, it requires that all marginal distribution functions \(F_i\), \(i\in [N]\), must be generated by a single marginal generating function F through the relation (33). The following examples showcase, however, that the freedom to select F offers significant flexibility in designing various (existing as well as new) regularization schemes. Details of the underlying derivations are relegated to Appendix C. Table 1 summarizes the marginal generating functions F studied in these examples and lists the corresponding divergence generators f.

Table 1 Marginal generating functions F with parameter \(\lambda \) and corresponding divergence generators f (cf. Examples 3.8-3.12)

Exponential: \(F(s) = \exp (s/\lambda - 1)\), \(f(s) = \lambda s \log (s)\)
Uniform: \(F(s) = s/(2\lambda ) + 1/2\), \(f(s) = \lambda (s^2 - s)\)
Pareto: \(F(s) = (s (q-1)/(\lambda q) + 1/q)^{1/(q-1)}\), \(f(s) = \lambda (s^q - s)/(q-1)\)
Hyperbolic cosine: \(F(s) = \sinh (s/\lambda - k)\), \(f(s) = \lambda (s\, \text {arcsinh}(s) - \sqrt{s^2+1} + 1 + ks)\)
t-distribution: F(s) as in Example 3.12, f(s) as in (25)

Example 3.8

(Exponential distribution model) Suppose that \(\Theta \) is a marginal ambiguity set with (shifted) exponential marginals of the form (33) induced by the generating function \(F(s) = \exp (s / \lambda - 1)\) with \(\lambda > 0\). Then the smooth dual optimal transport problem (12) is equivalent to the regularized optimal transport problem (13) with an entropic regularizer of the form \(R_\Theta (\pi ) = D_f(\pi \Vert \mu \otimes \eta )\), where \(f(s) =\lambda s \log (s)\), while the smooth c-transform (11) reduces to the log-partition function (20). This example shows that entropic regularizers are not only induced by singleton ambiguity sets containing a generalized extreme value distribution (see Sect. 3.1) but also by marginal ambiguity sets with exponential marginals.
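As a quick sanity check, the recipe \(f(s) = \int _0^s F^{-1}(t)\,\mathrm {d}t\) of Theorem 3.7 can be evaluated numerically for this generating function. The following Python sketch (assuming scipy; names are ours) recovers \(f(s) = \lambda s \log (s)\) up to quadrature error:

```python
import numpy as np
from scipy.integrate import quad

lam = 0.7

def F_inv(t):
    """Inverse of the exponential generator F(s) = exp(s/lam - 1)."""
    return lam * (1.0 + np.log(t))

def f(s):
    """Divergence generator of Theorem 3.7: f(s) = int_0^s F^{-1}(t) dt."""
    return quad(F_inv, 0.0, s)[0]

for s in (0.5, 1.0, 2.0):
    print(f(s), lam * s * np.log(s))   # the two values agree for each s
```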

Example 3.9

(Uniform distribution model) Suppose that \(\Theta \) is a marginal ambiguity set with uniform marginals of the form (33) induced by the generating function \(F(s) = s/(2\lambda ) + 1/2\) with \(\lambda > 0\). In this case the smooth dual optimal transport problem (12) is equivalent to the regularized optimal transport problem (13) with a \(\chi ^2\)-divergence regularizer of the form \(R_\Theta (\pi ) = D_f(\pi \Vert \mu \otimes \eta )\), where \(f(s) = \lambda (s^2 -s)\). Such regularizers were previously investigated by Blondel et al. [24] and Seguy et al. [149] under the additional assumption that \(\eta _i\) is independent of \(i\in [N]\), yet their intimate relation to noise models with uniform marginals remained undiscovered until now. In addition, the smooth c-transform (11) satisfies

$$\begin{aligned} {{\overline{\psi }}}(\varvec{\phi }, \varvec{x}) = \lambda + \lambda \, \mathop {\mathrm{spmax}}\limits _{i \in [N]} \;\frac{\phi _i - c(\varvec{x}, \varvec{y_i})}{\lambda }, \end{aligned}$$

where the sparse maximum operator ‘\({{\,\mathrm{spmax}\,}}\)’ inspired by Martins and Astudillo [98] is defined through

$$\begin{aligned} \mathop {\mathrm{spmax}}\limits _{i \in [N]} \; u_i = \max _{\varvec{p} \in \Delta ^N} \; \sum _{i=1}^N u_i p_i - {p_i^2}/{\eta _i} \qquad \forall \varvec{u}\in {\mathbb {R}}^N. \end{aligned}$$
(35)

The envelope theorem ([44], Theorem 2.16) ensures that \({{\,\mathrm{spmax}\,}}_{i \in [N]} u_i\) is smooth and that its gradient with respect to \(\varvec{u}\) is given by the unique solution \(\varvec{p}^\star \) of the maximization problem on the right hand side of (35). We note that \(\varvec{p}^\star \) has many zero entries due to the sparsity-inducing nature of the problem’s simplicial feasible set. In addition, we have \(\lim _{\lambda \downarrow 0} \lambda {{\,\mathrm{spmax}\,}}_{i \in [N]} u_i/\lambda = \max _{i\in [N]}u_i\). Thus, the sparse maximum can indeed be viewed as a smooth approximation of the ordinary maximum. In marked contrast to the more widely used LogSumExp function, however, the sparse maximum has a sparse gradient. Proposition D.1 in Appendix D shows that \(\varvec{p}^\star \) can be computed efficiently by sorting.
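For the uniform choice \(\eta _i = 1/N\), completing the square shows that the maximization in (35) reduces to a Euclidean projection of \(\varvec{u}/(2N)\) onto the simplex, which can be computed by the classical sorting procedure underlying the sparsemax of Martins and Astudillo [98]. A minimal Python sketch follows (names are ours; non-uniform \(\varvec{\eta }\) requires the weighted variant treated in Proposition D.1):

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection onto the probability simplex via sorting."""
    u = np.sort(v)[::-1]                       # sort in decreasing order
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def spmax_argmax(u):
    """Maximizer p* of (35) for eta_i = 1/N: completing the square shows
    that p* is the projection of u/(2N) onto the simplex."""
    N = len(u)
    return project_simplex(np.asarray(u) / (2.0 * N))
```

Many entries of the returned vector are exactly zero, in line with the sparsity of \(\varvec{p}^\star \) noted above.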

Example 3.10

(Pareto distribution model) Suppose that \(\Theta \) is a marginal ambiguity set with (shifted) Pareto distributed marginals of the form (33) induced by the generating function \(F(s) = (s (q-1) / (\lambda q)+1/q)^{1/(q-1)}\) with \(\lambda ,q>0\). Then the smooth dual optimal transport problem (12) is equivalent to the regularized optimal transport problem (13) with a Tsallis divergence regularizer of the form \(R_\Theta (\pi ) = D_f(\pi \Vert \mu \otimes \eta )\), where \(f(s) = \lambda (s^q - s)/(q-1)\). Such regularizers were investigated by [110] under the additional assumption that \(\eta _i\) is independent of \(i\in [N]\). The Pareto distribution model encapsulates the exponential model (in the limit \(q\rightarrow 1\)) and the uniform distribution model (for \(q=2\)) as special cases. The smooth c-transform admits no simple closed-form representation under this model.

Example 3.11

(Hyperbolic cosine distribution model) Suppose that \(\Theta \) is a marginal ambiguity set with hyperbolic cosine distributed marginals of the form (33) induced by the generating function \(F(s) = \sinh (s/\lambda - k)\) with \(k = \sqrt{2} - 1 - \text {arcsinh}(1)\) and \(\lambda > 0\). Then the marginal probability density functions are given by scaled and truncated hyperbolic cosine functions, and the smooth dual optimal transport problem (12) is equivalent to the regularized optimal transport problem (13) with a hyperbolic divergence regularizer of the form \(R_\Theta (\pi ) = D_f(\pi \Vert \mu \otimes \eta )\), where \(f(s) = \lambda (s \text {arcsinh}(s) - \sqrt{s^2 + 1} + 1 + ks)\). Hyperbolic divergences were introduced by Ghai et al. [66] in order to unify several gradient descent algorithms.

Example 3.12

(t-distribution model) Suppose that \(\Theta \) is a marginal ambiguity set where the marginals are determined by (33), and assume that the generating function is given by

$$\begin{aligned} F(s) = \frac{N}{2}\left( 1 + \frac{s - \sqrt{N-1}}{\sqrt{\lambda ^2 + (s - \sqrt{N-1})^{2}}}\right) \end{aligned}$$

for some \(\lambda > 0\). In this case one can show that all marginals constitute t-distributions with 2 degrees of freedom and that the smooth dual optimal transport problem (12) is equivalent to the Chebyshev regularized optimal transport problem described in Proposition 3.5.

To close this section, we remark that different regularization schemes differ as to how well they approximate the original (unregularized) optimal transport problem. Proposition 3.3 provides simple error bounds that may help in selecting suitable regularizers. For the entropic regularization scheme associated with the exponential distribution model of Example 3.8, for example, the error bound evaluates to \(\max _{i\in [N]}\lambda \log (1/\eta _i)\), while for the \(\chi ^2\)-divergence regularization scheme associated with the uniform distribution model of Example 3.9, the error bound is given by \(\max _{i \in [N]}\lambda (1/\eta _i - 1)\). In both cases, the error is minimized by setting \(\eta _i = 1/N \) for all \(i \in [N]\). Thus, the error bound grows logarithmically with N for entropic regularization and linearly with N for \(\chi ^2\)-divergence regularization. Different regularization schemes also differ with regard to their computational properties, which will be discussed in Sect. 4.
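These two bounds are easily tabulated. A short Python sketch (names are ours) makes the logarithmic versus linear growth explicit for the error-minimizing choice \(\eta _i = 1/N\):

```python
import numpy as np

lam = 0.1
for N in (10, 100, 1000):
    eta = np.full(N, 1.0 / N)                   # error-minimizing choice
    entropic = lam * np.max(np.log(1.0 / eta))  # = lam * log(N)
    chi2 = lam * np.max(1.0 / eta - 1.0)        # = lam * (N - 1)
    print(N, round(entropic, 3), round(chi2, 3))
```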

4 Numerical solution of smooth optimal transport problems

The smooth semi-discrete optimal transport problem (12) constitutes a stochastic optimization problem and can therefore be addressed with a stochastic gradient descent (SGD) algorithm. In Sect. 4.1 we first derive new convergence guarantees for an averaged gradient descent algorithm that has only access to a biased stochastic gradient oracle. This algorithm outputs the uniform average of the iterates (instead of the last iterate) as the recommended candidate solution. We prove that if the objective function is Lipschitz continuous, then the suboptimality of this candidate solution is of the order \(\mathcal O(1/\sqrt{T})\), where T stands for the number of iterations. An improvement in the non-leading terms is possible if the objective function is additionally smooth. We further prove that a convergence rate of \(\mathcal O(1/{T})\) can be obtained for generalized self-concordant objective functions. In Sect. 4.2 we then show that the algorithm of Sect. 4.1 can be used to efficiently solve the smooth semi-discrete optimal transport problem (12) corresponding to a marginal ambiguity set of the type (26). As a byproduct, we prove that the convergence rate of the averaged SGD algorithm for the semi-discrete optimal transport problem with entropic regularization is of the order \(\mathcal O(1/T)\), which improves the \(\mathcal O(1/\sqrt{T})\) guarantee of Genevay et al. [64].

4.1 Averaged gradient descent algorithm with biased gradient oracles

Consider a general convex minimization problem of the form

$$\begin{aligned} \min _{\varvec{\phi }\in {\mathbb {R}}^n} ~ h(\varvec{\phi }), \end{aligned}$$
(36)

where the objective function \(h: {\mathbb {R}}^n \rightarrow {\mathbb {R}}\) is convex and differentiable. We assume that problem (36) admits a minimizer \(\varvec{\phi }^\star \). We study the convergence behavior of the inexact gradient descent algorithm

$$\begin{aligned} \varvec{\phi }_{t} = \varvec{\phi }_{t-1} - \gamma \varvec{g}_t(\varvec{\phi }_{t-1}), \end{aligned}$$
(37)

where \(\gamma > 0\) is a fixed step size, \(\varvec{\phi }_0\) is a given deterministic initial point and the function \(\varvec{g}_t: {\mathbb {R}}^n \rightarrow {\mathbb {R}}^n\) is an inexact gradient oracle that returns for every fixed \(\varvec{\phi }\in {\mathbb {R}}^n\) a random estimate of the gradient of h at \(\varvec{\phi }\). Note that we allow the gradient oracle to depend on the iteration counter t, which makes it possible to account for increasingly accurate gradient estimates. In contrast to the previous sections, we henceforth model all random objects as measurable functions on an abstract filtered probability space \((\Omega , \mathcal F, (\mathcal F_t)_{t \ge 0}, \mathbb P)\), where \({\mathcal {F}}_0 = \{ \emptyset ,\Omega \}\) represents the trivial \(\sigma \)-field, while the gradient oracle \(\varvec{g}_t(\varvec{\phi })\) is \(\mathcal F_t\)-measurable for all \(t\in \mathbb N\) and \(\varvec{\phi }\in {\mathbb {R}}^n\). In order to avoid clutter, we use \(\mathbb E[\cdot ]\) to denote the expectation operator with respect to \(\mathbb P\), and all inequalities and equalities involving random variables are understood to hold \(\mathbb P\)-almost surely.

In the following we analyze the effect of averaging in inexact gradient descent algorithms. We will show that after T iterations with a constant step size \(\gamma = \mathcal O(1 / \sqrt{T})\), the objective function value of the uniform average of all iterates generated by (37) converges to the optimal value of (36) at a sublinear rate. Specifically, we will prove that the rate of convergence varies between \({\mathcal {O}}(1 / \sqrt{T})\) and \({\mathcal {O}}(1/T)\) depending on properties of the objective function. Our convergence analysis will rely on several regularity conditions.
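In pseudocode terms, the method under study can be summarized by the following Python sketch (the gradient oracle, initial point and horizon are placeholders, and the step size \(\gamma = 1/\sqrt{T}\) merely instantiates the \(\mathcal O(1/\sqrt{T})\) scaling):

```python
import numpy as np

def averaged_inexact_gd(grad_oracle, phi0, T):
    """Inexact gradient descent (37) with uniform (Polyak-Ruppert) averaging.

    grad_oracle(t, phi) returns a possibly biased stochastic estimate
    g_t(phi) of the gradient of h at phi; the recommended candidate
    solution is the uniform average of phi_0, ..., phi_{T-1}.
    """
    gamma = 1.0 / np.sqrt(T)          # constant step size, O(1/sqrt(T))
    phi = np.array(phi0, dtype=float)
    avg = np.zeros_like(phi)
    for t in range(1, T + 1):
        avg += (phi - avg) / t        # running average of the iterates
        phi -= gamma * grad_oracle(t, phi)
    return avg
```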

Assumption 4.1

 (Regularity conditions) Different combinations of the following regularity conditions will enable us to establish different convergence guarantees for the averaged inexact gradient descent algorithm.

  1. (i)

Biased gradient oracle: There exist tolerances \(\varepsilon _t>0\), \(t\in \mathbb N\cup \{0\}\), such that

$$\begin{aligned} \left\| {\mathbb {E}}\left[ \varvec{g}_t(\varvec{\phi }_{t-1}) \big | \mathcal F_{t-1} \right] - \nabla h(\varvec{\phi }_{t-1}) \right\| \le \varepsilon _{t-1}\quad \forall t\in \mathbb N. \end{aligned}$$
  2. (ii)

    Bounded gradients: There exists \(R > 0\) such that

    $$\begin{aligned} \Vert \nabla h(\varvec{\phi }) \Vert \le R\quad \text {and} \quad \Vert \varvec{g}_t(\varvec{\phi }) \Vert \le R \quad \forall \varvec{\phi }\in {\mathbb {R}}^n,~ \forall t \in \mathbb N. \end{aligned}$$
  3. (iii)

    Generalized self-concordance: The function h is M-generalized self-concordant for some \(M > 0\), that is, h is three times differentiable, and for any \(\varvec{\phi }, \varvec{\phi }' \in {\mathbb {R}}^n\) the function \(u(s) = h(\varvec{\phi }+ s (\varvec{\phi }' - \varvec{\phi }))\) satisfies the inequality

    $$\begin{aligned} \left| \frac{\mathrm {d}^3 u(s)}{\mathrm {d}s^3} \right| \le M \Vert \varvec{\phi }- \varvec{\phi }' \Vert \, \frac{\mathrm {d}^2 u(s)}{\mathrm {d}s^2} \quad \forall s \in {\mathbb {R}}.\end{aligned}$$
  4. (iv)

    Lipschitz continuous gradient: The function h is L-smooth for some \(L > 0\), that is, we have

    $$\begin{aligned} \Vert \nabla h(\varvec{\phi }) - \nabla h(\varvec{\phi }') \Vert \le L \Vert \varvec{\phi }- \varvec{\phi }' \Vert \quad \forall \varvec{\phi }, \varvec{\phi }' \in {\mathbb {R}}^n. \end{aligned}$$
  5. (v)

    Bounded second moments: There exists \(\sigma > 0\) such that

    $$\begin{aligned} {\mathbb {E}}\left[ \left\| \varvec{g}_t(\varvec{\phi }_{t-1}) - \nabla h(\varvec{\phi }_{t-1}) \right\| ^2 | \mathcal F_{t-1} \right] \le \sigma ^2 \quad \forall t \in \mathbb N. \end{aligned}$$

The averaged gradient descent algorithm with biased gradient oracles lends itself to solving both deterministic and stochastic optimization problems. In deterministic optimization, the gradient oracles \(\varvec{g}_t\) are deterministic and output inexact gradients satisfying \(\Vert \varvec{g}_t(\varvec{\phi }) - \nabla h(\varvec{\phi }) \Vert \le \varepsilon _t\) for all \(\varvec{\phi }\in {\mathbb {R}}^n\), where the tolerances \(\varepsilon _t\) bound the errors associated with the numerical computation of the gradients. A vast body of literature on deterministic optimization focuses on exact gradient oracles for which these tolerances can be set to 0. Inexact deterministic gradient oracles with bounded error tolerances are investigated by Nedić and Bertsekas [112] and d'Aspremont [41]. In this case exact convergence to \(\varvec{\phi }^\star \) is not possible. If the error bounds decrease to 0, however, Luo and Tseng [96], Schmidt et al. [144] and Friedlander and Schmidt [63] show that adaptive gradient descent algorithms are guaranteed to converge to \(\varvec{\phi }^\star \).

In stochastic optimization, the objective function is representable as \(h(\varvec{\phi }) = {\mathbb {E}}[H(\varvec{\phi }, \varvec{x})]\), where the marginal distribution of the random vector \(\varvec{x}\) under \(\mathbb P\) is given by \(\mu \), while the integrand \(H(\varvec{\phi },\varvec{x})\) is convex and differentiable in \(\varvec{\phi }\) and \(\mu \)-integrable in \(\varvec{x}\). In this setting it is convenient to use gradient oracles of the form \(\varvec{g}_t(\varvec{\phi }) = \nabla _{\varvec{\phi }} H(\varvec{\phi }, \varvec{x}_t)\) for all \(t \in \mathbb N\), where the samples \(\varvec{x}_t\) are drawn independently from \(\mu \). As these oracles output unbiased estimates for \(\nabla h(\varvec{\phi })\), all tolerances \(\varepsilon _t\) in Assumption 4.1 (i) may be set to 0. SGD algorithms with unbiased gradient oracles date back to the seminal paper by Robbins and Monro [136]. Nowadays, averaged SGD algorithms with Polyak-Ruppert averaging figure among the most popular variants of the SGD algorithm [113, 132, 143]. For general convex objective functions the best possible convergence rate of any averaged SGD algorithm run over T iterations amounts to \({\mathcal {O}}(1 / \sqrt{T})\), but it improves to \({\mathcal {O}}(1 / T)\) if the objective function is strongly convex; see for example [50, 87, 108, 113, 116, 152, 153, 172]. While smoothness plays a critical role in achieving acceleration in deterministic optimization, it only improves the constants in the convergence rate in stochastic optimization [34, 45, 81, 88, 158]. In fact, Tsybakov [166] demonstrates that smoothness does not provide any acceleration in general, that is, the best possible convergence rate of any averaged SGD algorithm can still not be improved beyond \({\mathcal {O}}(1 / \sqrt{T})\). Nevertheless, a substantial acceleration is possible when focusing on special problem classes such as linear or logistic regression problems [14, 15, 71]. In these special cases, the improvement in the convergence rate is facilitated by a generalized self-concordance property of the objective function [13]. Self-concordance was originally introduced in the context of Newton-type interior point methods [115] and later generalized to facilitate the analysis of probabilistic models [13] and second-order optimization algorithms [159].

In the following we analyze the convergence properties of the averaged SGD algorithm when we only have access to an inexact stochastic gradient oracle, in which case the tolerances \(\varepsilon _t\) cannot be set to 0. To the best of our knowledge, inexact stochastic gradient oracles have only been considered by Cohen et al. [34], Hu et al. [76] and Ajalloeian and Stich [5]. Specifically, Hu et al. [76] use sequential semidefinite programs to analyze the convergence rate of the averaged SGD algorithm when \(\mu \) has a finite support. In contrast, we do not impose any restrictions on the support of \(\mu \). Cohen et al. [34] and Ajalloeian and Stich [5], on the other hand, study the convergence behavior of accelerated gradient descent algorithms for smooth stochastic optimization problems under the assumption that \(\varvec{\phi }\) ranges over a compact domain. The proposed algorithms necessitate a projection onto the compact feasible set in each iteration. In contrast, our convergence analysis does not rely on any compactness assumptions. We note that compactness assumptions have been critical for the convergence analysis of the averaged SGD algorithm in the context of convex stochastic optimization [28, 34, 45, 113]. By leveraging a trick due to Bach [14], however, we can relax this assumption provided that the objective function is Lipschitz continuous.

Proposition 4.2

Consider the inexact gradient descent algorithm (37) with constant step size \(\gamma > 0\). If Assumptions 4.1 (i)–(ii) hold with \(\varepsilon _t \le {{{\bar{\varepsilon }}}}/{(2\sqrt{1+t})}\) for some \({{\bar{\varepsilon }}} \ge 0\), then we have for all \( p \in \mathbb N\) that

$$\begin{aligned} {\mathbb {E}}\left[ \left( h \left( \frac{1}{T} \sum _{t=1}^{T} \varvec{\phi }_{t-1} \right) - h(\varvec{\phi }^\star ) \right) ^p \right] ^{1/p} \le \frac{\Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2}{\gamma T} + 20 \gamma \left( R + {{\bar{\varepsilon }}} \right) ^2 p. \end{aligned}$$

If additionally Assumption 4.1 (iii) holds and if \(G = \max \{ M, R + {{\bar{\varepsilon }}} \}\), then we have for all \( p \in \mathbb N\) that

$$\begin{aligned} {\mathbb {E}}\left[ \left\| \nabla h \left( \frac{1}{T} \sum _{t=1}^{T} \varvec{\phi }_{t-1} \right) \right\| ^{2p} \right] ^{1/p}&\le \frac{G^{2}}{T} \left( 10 \sqrt{p} + \frac{4p}{\sqrt{T}} + 80 G^2 \gamma \sqrt{T} p + \frac{2 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2}{\gamma \sqrt{T}} \right. \\&\quad \left. + \frac{3 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert }{G \gamma \sqrt{T}} \right) ^2. \end{aligned}$$

The proof of Proposition 4.2 relies on two lemmas. In order to state these lemmas concisely, we define the \(L_p\)-norm of a random variable \(\varvec{z} \in {\mathbb {R}}^n\) for any \(p > 0\) through \(\Vert \varvec{z} \Vert _{L_p} = \left( {\mathbb {E}}\left[ \Vert \varvec{z} \Vert ^p \right] \right) ^{1/p}\). For any random variables \(\varvec{z}, \varvec{z}' \in {\mathbb {R}}^n\) and \(p \ge 1\), Minkowski's inequality ([26], § 2.11) then states that

$$\begin{aligned} \Vert \varvec{z} + \varvec{z}' \Vert _{L_p} \le \Vert \varvec{z} \Vert _{L_p} + \Vert \varvec{z}' \Vert _{L_p}. \end{aligned}$$
(38)

Another essential tool for proving Proposition 4.2 is the Burkholder-Rosenthal-Pinelis (BRP) inequality ([130], Theorem 4.1), which we restate below without proof to keep this paper self-contained.

Lemma 4.3

(BRP inequality) Let \(\varvec{z}_t\) be an \(\mathcal F_t\)-measurable random variable for every \(t\in \mathbb N\), and assume that \(p \ge 2\). If \({\mathbb {E}}[\varvec{z}_t | \mathcal F_{t-1}] = 0 \) and \(\Vert \varvec{z}_t \Vert _{L_p}<\infty \) for all \(t \in [T]\), then we have

$$\begin{aligned} \left\| \max _{t \in [T]} \left\| \sum _{k=1}^t \varvec{z}_k \right\| \right\| _{L_p} \le \sqrt{p} \left\| \sum _{t=1}^T {\mathbb {E}}[ \Vert \varvec{z}_t \Vert ^2 | \mathcal F_{t-1}] \right\| _{L_{p/2}}^{1/2} + p \left\| \max _{t \in [T]} \Vert \varvec{z}_t \Vert \right\| _{L_p}. \end{aligned}$$

The following lemma reviews two useful properties of generalized self-concordant functions.

Lemma 4.4

(Generalized self-concordance) Assume that the objective function h of the convex optimization problem (36) is M-generalized self-concordant in the sense of Assumption 4.1 (iii) for some \(M>0\).

  1. (i)

    ([14], Appendix D.2) For any sequence \(\varvec{\phi }_0, \dots , \varvec{\phi }_{T-1} \in {\mathbb {R}}^n\), we have

    $$\begin{aligned} \left\| \nabla h \left( \frac{1}{T} \sum _{t=1}^T \varvec{\phi }_{t-1} \right) - \frac{1}{T} \sum _{t=1}^T \nabla h(\varvec{\phi }_{t-1}) \right\| \le 2 M \left( \frac{1}{T} \sum _{t=1}^T h(\varvec{\phi }_{t-1}) - h(\varvec{\phi }^\star ) \right) . \end{aligned}$$
  2. (ii)

    ([14], Lemma 9) For any \(\varvec{\phi }\in {\mathbb {R}}^n\) with \( \Vert \nabla h(\varvec{\phi }) \Vert \le 3 \kappa / (4 M) \), where \(\kappa \) is the smallest eigenvalue of \(\nabla ^2 h(\varvec{\phi }^\star )\), and \(\varvec{\phi }^\star \) is the optimizer of (36), we have \( h(\varvec{\phi }) - h(\varvec{\phi }^\star ) \le 2 {\Vert \nabla h(\varvec{\phi }) \Vert ^2}/{\kappa }.\)

Armed with Lemmas 4.3 and 4.4, we are now ready to prove Proposition 4.2.

Proof of Proposition 4.2

The first claim generalizes Proposition 5 by Bach [14] to inexact gradient oracles. By the assumed convexity and differentiability of the objective function h, we have

$$\begin{aligned} h(\varvec{\phi }_{k-1})&\le h(\varvec{\phi }^\star ) + \nabla h(\varvec{\phi }_{k-1})^\top (\varvec{\phi }_{k-1} - \varvec{\phi }^\star ) \nonumber \\&= h(\varvec{\phi }^\star ) + \varvec{g}_k(\varvec{\phi }_{k-1})^\top (\varvec{\phi }_{k-1} - \varvec{\phi }^\star ) + \left( \nabla h(\varvec{\phi }_{k-1}) - \varvec{g}_k(\varvec{\phi }_{k-1}) \right) ^\top (\varvec{\phi }_{k-1} - \varvec{\phi }^\star ). \end{aligned}$$
(39)

In addition, elementary algebra yields the recursion

$$\begin{aligned} \Vert \varvec{\phi }_{k} - \varvec{\phi }^\star \Vert ^2 = \Vert \varvec{\phi }_{k} - \varvec{\phi }_{k-1} \Vert ^2 + \Vert \varvec{\phi }_{k-1} - \varvec{\phi }^\star \Vert ^2 + 2 (\varvec{\phi }_{k} - \varvec{\phi }_{k-1})^\top (\varvec{\phi }_{k-1} - \varvec{\phi }^\star ). \end{aligned}$$

Thanks to the update rule (37), this recursion can be re-expressed as

$$\begin{aligned} \varvec{g}_k(\varvec{\phi }_{k-1})^\top (\varvec{\phi }_{k-1} - \varvec{\phi }^\star ) = \frac{1}{2 \gamma } \left( \gamma ^2 \Vert \varvec{g}_k(\varvec{\phi }_{k-1}) \Vert ^2 + \Vert \varvec{\phi }_{k-1} - \varvec{\phi }^\star \Vert ^2 - \Vert \varvec{\phi }_{k} - \varvec{\phi }^\star \Vert ^2 \right) , \end{aligned}$$

where \(\gamma > 0\) is an arbitrary step size. Combining the above identity with (39) then yields

$$\begin{aligned} h(\varvec{\phi }_{k-1}) \le&~h(\varvec{\phi }^\star ) + \frac{1}{2 \gamma } \left( \gamma ^2 \Vert \varvec{g}_k(\varvec{\phi }_{k-1}) \Vert ^2 + \Vert \varvec{\phi }_{k-1} - \varvec{\phi }^\star \Vert ^2 - \Vert \varvec{\phi }_{k} - \varvec{\phi }^\star \Vert ^2 \right) \\&+ \left( \nabla h(\varvec{\phi }_{k-1}) - \varvec{g}_k(\varvec{\phi }_{k-1}) \right) ^\top \! (\varvec{\phi }_{k-1} - \varvec{\phi }^\star ) \\ \le&~h(\varvec{\phi }^\star ) + \frac{1}{2 \gamma } \left( \gamma ^2 R^2 + \Vert \varvec{\phi }_{k-1} - \varvec{\phi }^\star \Vert ^2 - \Vert \varvec{\phi }_{k} - \varvec{\phi }^\star \Vert ^2 \right) \\&+ \left( \nabla h(\varvec{\phi }_{k-1}) - \varvec{g}_k(\varvec{\phi }_{k-1}) \right) ^\top (\varvec{\phi }_{k-1} - \varvec{\phi }^\star ), \end{aligned}$$

where the last inequality follows from Assumption 4.1 (ii). Summing this inequality over k then shows that

$$\begin{aligned} 2 \gamma \sum _{k=1}^t \big ( h ( \varvec{\phi }_{k-1}) - h(\varvec{\phi }^\star ) \big ) + \Vert \varvec{\phi }_{t} - \varvec{\phi }^\star \Vert ^2 \le A_t, \end{aligned}$$
(40)

where

$$\begin{aligned} A_t = t \gamma ^2 R^2 + \Vert \varvec{\phi }_{0} - \varvec{\phi }^\star \Vert ^2 + \sum _{k=1}^t B_k \quad \text {and} \quad B_t = 2 \gamma \left( \nabla h(\varvec{\phi }_{t-1}) - \varvec{g}_t(\varvec{\phi }_{t-1}) \right) ^\top (\varvec{\phi }_{t-1} - \varvec{\phi }^\star ) \end{aligned}$$

for all \(t \in \mathbb N\). Note that the term on the left-hand side of (40) is non-negative because \(\varvec{\phi }^\star \) is a global minimizer of h, which implies that the random variable \(A_t\) is also non-negative for all \(t\in \mathbb N\). For later use we further define \(A_0 = \Vert \varvec{\phi }_{0} - \varvec{\phi }^\star \Vert ^2\). The estimate (40) for \(t=T\) then implies via the convexity of h that

$$\begin{aligned} h \left( \frac{1}{T} \sum _{t=1}^T \varvec{\phi }_{t-1} \right) - h(\varvec{\phi }^\star ) \le \frac{A_T}{2 \gamma T }, \end{aligned}$$
(41)

where we dropped the non-negative term \(\Vert \varvec{\phi }_T-\varvec{\phi }^\star \Vert ^2/(2\gamma T)\) without invalidating the inequality. In the following we analyze the \(L_p\)-norm of \(A_T\) in order to obtain the desired bounds from the proposition statement. To do so, we distinguish three different regimes for \(p \in \mathbb N\), and we show that the \(L_p\)-norm of the non-negative random variable \(A_T\) is upper bounded by an affine function of p in each of these regimes.

Case I (\(p \ge T / 4\)): By using the update rule (37) and Assumption 4.1 (ii), one readily verifies that

$$\begin{aligned} \Vert \varvec{\phi }_k - \varvec{\phi }^\star \Vert \le \Vert \varvec{\phi }_{k-1} - \varvec{\phi }^\star \Vert + \Vert \varvec{\phi }_k - \varvec{\phi }_{k-1} \Vert \le \Vert \varvec{\phi }_{k-1} - \varvec{\phi }^\star \Vert + \gamma R. \end{aligned}$$

Iterating the above recursion k times then yields the conservative estimate \(\Vert \varvec{\phi }_k - \varvec{\phi }^\star \Vert \le \Vert \varvec{\phi }_{0} - \varvec{\phi }^\star \Vert + k \gamma R\). By the definitions of \(A_t\) and \(B_t\) for \(t\in \mathbb N\), we thus have

$$\begin{aligned} A_t&= t \gamma ^2 R^2 + \Vert \varvec{\phi }_{0} - \varvec{\phi }^\star \Vert ^2 + 2 \gamma \sum _{k=1}^t \left( \nabla h(\varvec{\phi }_{k-1}) - \varvec{g}_k(\varvec{\phi }_{k-1}) \right) ^\top (\varvec{\phi }_{k-1} - \varvec{\phi }^\star ) \\&\le t \gamma ^2 R^2 + \Vert \varvec{\phi }_{0} - \varvec{\phi }^\star \Vert ^2 + 4 \gamma R \sum _{k=1}^t \Vert \varvec{\phi }_{k-1} - \varvec{\phi }^\star \Vert \\&\le t \gamma ^2 R^2 + \Vert \varvec{\phi }_{0} - \varvec{\phi }^\star \Vert ^2 + 4 \gamma R \sum _{k=1}^t \left( \Vert \varvec{\phi }_{0} - \varvec{\phi }^\star \Vert + (k-1) \gamma R \right) \\&\le t \gamma ^2 R^2 + \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 4 t \gamma R \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert + 2 t^2 \gamma ^2 R^2 \\&\le t \gamma ^2 R^2 + \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 4 t^2 \gamma ^2 R^2 + \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 2 t^2 \gamma ^2 R^2 \\&\le 7 t^2 \gamma ^2 R^2 + 2 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2, \end{aligned}$$

where the first two inequalities follow from Assumption 4.1 (ii) and the conservative estimate derived above, respectively, while the fourth inequality holds because \(2 a b \le a^2 + b^2\) for all \(a,b\in {\mathbb {R}}\). As \(A_t \ge 0\), the random variable \(A_t\) is bounded and satisfies \(| A_t| \le 2 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 7 t^2 \gamma ^2 R^2\) for all \(t\in \mathbb N\), which implies that

$$\begin{aligned} \Vert A_T \Vert _{L_p} \le 2 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 7 T^2 \gamma ^2 R^2&\le 2 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 28 T \gamma ^2 R^2 p, \end{aligned}$$
(42)

where the last inequality holds because \(p \ge T/4\). Note that the resulting upper bound is affine in p.

Case II \(({2 \le p \le T/4})\): The subsequent analysis relies on the simple bounds

$$\begin{aligned} \max _{t \in [T]} \varepsilon _{t-1} \le \frac{{{\bar{\varepsilon }}}}{2} \quad \text {and} \quad \sum _{t=1}^T \varepsilon _{t-1} \le {{\bar{\varepsilon }}} \sqrt{T}, \end{aligned}$$
(43)

which hold because \(\varepsilon _t \le {{\bar{\varepsilon }}} / (2 \sqrt{1+t})\) by assumption and because \(\sum _{t=1}^T 1 / \sqrt{t} \le 2 \sqrt{T}\), which follows from the telescoping bound \(1/\sqrt{t} \le 2(\sqrt{t} - \sqrt{t-1})\) for every \(t \ge 1\). In addition, it proves useful to introduce the martingale differences \( {{\bar{B}}}_t = B_t - {\mathbb {E}}[B_t | \mathcal F_{t-1}]\) for all \(t\in \mathbb N\). By the definition of \(A_t\) and the subadditivity of the supremum operator, we then have

$$\begin{aligned} \max _{t \in [T+1]} A_{t-1}&= \max _{t \in [T+1]} \left\{ (t-1) \gamma ^2 R^2 + \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + \sum _{k=1}^{t-1} {\mathbb {E}}[B_k | \mathcal F_{k-1}] + \sum _{k=1}^{t-1} {{\bar{B}}}_k \right\} \\&\le T \gamma ^2 R^2 + \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + \max _{t \in [T]} \sum _{k=1}^t {\mathbb {E}}[B_k | \mathcal F_{k-1}] + \max _{t \in [T]} \sum _{k=1}^t {{\bar{B}}}_k . \end{aligned}$$

As \(p \ge 2\), Minkowski’s inequality (38) thus implies that

$$\begin{aligned} \left\| \max _{t \in [T+1]} A_{t-1} \right\| _{L_p}&\le T \gamma ^2 R^2 + \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 \nonumber \\&\quad + \left\| \max _{t \in [T]} \sum _{k=1}^t {\mathbb {E}}[B_k | \mathcal F_{k-1}] \right\| _{L_p} + \left\| \max _{t \in [T]} \sum _{k=1}^t {{\bar{B}}}_k \right\| _{L_p}. \end{aligned}$$
(44)

In order to bound the penultimate term in (44), we first note that

$$\begin{aligned} \left| {\mathbb {E}}[B_k | \mathcal F_{k-1}] \right|&= 2 \gamma \left| {\mathbb {E}}\left[ \left( \nabla h(\varvec{\phi }_{k-1}) - \varvec{g}_k(\varvec{\phi }_{k-1}) \right) | \mathcal F_{k-1} \right] ^\top (\varvec{\phi }_{k-1} - \varvec{\phi }^\star ) \right| \nonumber \\&\le 2 \gamma \Vert {\mathbb {E}}\left[ \left( \nabla h(\varvec{\phi }_{k-1}) - \varvec{g}_k(\varvec{\phi }_{k-1}) \right) | \mathcal F_{k-1} \right] \Vert \Vert \varvec{\phi }_{k-1} - \varvec{\phi }^\star \Vert \nonumber \\&\le 2 \gamma \varepsilon _{k-1} \Vert \varvec{\phi }_{k-1} - \varvec{\phi }^\star \Vert \le 2 \gamma \varepsilon _{k-1}\sqrt{ A_{k-1}} \end{aligned}$$
(45)

for all \(k\in \mathbb N\), where the first inequality follows from the Cauchy–Schwarz inequality, the second inequality holds due to Assumption 4.1 (i), and the last inequality follows from (40). This in turn implies that for all \(t \in [T]\) we have

$$\begin{aligned} \left| \sum _{k=1}^t {\mathbb {E}}[B_k | \mathcal F_{k-1}] \right| \le&\,2 \gamma \sum _{k=1}^t \varepsilon _{k-1} \sqrt{A_{k-1}} \le 2 \gamma \left( \sum _{k=1}^t \varepsilon _{k-1} \right) \left( \max _{k \in [t]} \sqrt{A_{k-1}} \right) \\ \le&\,2 \gamma {{\bar{\varepsilon }}} \sqrt{t} \max _{k \in [t]} \sqrt{A_{k-1}}, \end{aligned}$$

where the last inequality exploits (43). Therefore, the penultimate term in (44) satisfies

$$\begin{aligned} \left\| \max _{t \in [T]} \sum _{k=1}^t {\mathbb {E}}[B_k | \mathcal F_{k-1}] \right\| _{L_p} \le 2 \gamma {{\bar{\varepsilon }}} \sqrt{T} \left\| \max _{t \in [T+1]} \sqrt{A_{t-1}} \right\| _{L_p} = 2 \gamma {{\bar{\varepsilon }}} \sqrt{T} \left\| \max _{t \in [T+1]} A_{t-1} \right\| _{L_{p/2}}^{1/2}, \end{aligned}$$
(46)

where the equality follows from the definition of the \(L_p\)-norm.

Next, we bound the last term in (44) by using the BRP inequality of Lemma 4.3. To this end, note that

$$\begin{aligned} |{{\bar{B}}}_t |&\le | B_t | + | {\mathbb {E}}[B_t | \mathcal F_{t-1}] | \\&\le 2 \gamma \Vert \varvec{\phi }_{t-1} - \varvec{\phi }^\star \Vert \Vert \nabla h(\varvec{\phi }_{t-1}) - \varvec{g}_t(\varvec{\phi }_{t-1}) \Vert + 2 \gamma \varepsilon _{t-1} \sqrt{A_{t-1}} \\&\le 2 \gamma \sqrt{A_{t-1}} \left( \Vert \nabla h(\varvec{\phi }_{t-1}) \Vert + \Vert \varvec{g}_t(\varvec{\phi }_{t-1}) \Vert \right) + 2 \gamma \varepsilon _{t-1} \sqrt{A_{t-1}} \le 2 \gamma (2R + \varepsilon _{t-1}) \sqrt{A_{t-1}} \end{aligned}$$

for all \(t\in \mathbb N\), where the second inequality exploits the definition of \(B_t\) and (45), the third inequality follows from (40), and the last inequality holds because of Assumption 4.1 (ii). Hence, we obtain

$$\begin{aligned} \textstyle \left\| \max _{t \in [T]} | {{\bar{B}}}_t | \right\| _{L_p} \le&\, 2 \gamma \left( 2 R + \max _{t \in [T]} \varepsilon _{t-1} \right) \left\| \max _{t \in [T]} \sqrt{A_{t-1}} \right\| _{L_p}\\ \le&\, ( 4 \gamma R + \gamma {{\bar{\varepsilon }}}) \left\| \max _{t \in [T+1]} A_{t-1} \right\| _{L_{p/2}}^{1/2}, \end{aligned}$$

where the second inequality follows from (43) and the definition of the \(L_p\)-norm. In addition, we have

$$\begin{aligned} \left\| \sum _{t=1}^T {\mathbb {E}}[ {{\bar{B}}}_t^2 | \mathcal F_{t-1}] \right\| _{L_{p/2}}^{1/2}&= \left\| \sqrt{\sum _{t=1}^T {\mathbb {E}}[ {{\bar{B}}}_t^2 | \mathcal F_{t-1}]} \right\| _{L_p}\\&\le 2 \gamma \left\| \sqrt{ \sum _{t=1}^T (2R + \varepsilon _{t-1})^2 A_{t-1} } \right\| _{L_p} \\&\le 2 \gamma \left( \sum _{t=1}^T (2R + \varepsilon _{t-1})^2 \right) ^{1/2} \left\| \max _{t \in [T+1]} A_{t-1}^{1/2} \right\| _{L_p} \\&\le 2 \gamma \left( 2 R \sqrt{T} + \sqrt{\sum _{t=1}^T \varepsilon _{t-1}^2} \right) \left\| \max _{t \in [T+1]} A_{t-1}^{1/2} \right\| _{L_p} \\&\le \left( 4 \gamma R \sqrt{T} + \gamma {{\bar{\varepsilon }}} \sqrt{T} \right) \left\| \max _{t \in [T+1]} A_{t-1} \right\| _{L_{p/2}}^{1/2}, \end{aligned}$$

where the first inequality exploits the upper bound on \(|{{\bar{B}}}_t|\) derived above, which implies that \({\mathbb {E}}[ {{\bar{B}}}_t ^2 | \mathcal F_{t-1}] \le 4 \gamma ^2 (2R + \varepsilon _{t-1})^2 A_{t-1}\). The last three inequalities follow from the Hölder inequality, the triangle inequality for the Euclidean norm and the two inequalities in (43), respectively. Recalling that \(p \ge 2\), we may then apply the BRP inequality of Lemma 4.3 to the martingale differences \({{\bar{B}}}_t\), \(t\in [T]\), and use the bounds derived in the last two display equations in order to conclude that

$$\begin{aligned} \left\| \max _{t \in [T]} \left| \sum _{k=1}^t {{\bar{B}}}_k \right| \right\| _{L_p}&\le \left( 4 \gamma R \sqrt{pT} + \gamma {{\bar{\varepsilon }}} \sqrt{pT} + \gamma {{\bar{\varepsilon }}} p + 4 \gamma R p \right) \left\| \max _{t \in [T+1]} A_{t-1} \right\| _{L_{p/2}}^{1/2}. \end{aligned}$$
(47)

Substituting (46) and (47) into (44), we thus obtain

$$\begin{aligned} \left\| \max _{t \in [T+1]} A_{t-1} \right\| _{L_p}&\le T \gamma ^2 R^2 + \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + \left( 4 \gamma R \left( \sqrt{pT} + p \right) \right. \\&\left. \quad + \gamma {{\bar{\varepsilon }}} \left( \sqrt{pT} + p +2 \sqrt{T} \right) \right) \left\| \max _{t \in [T+1]} A_{t-1} \right\| _{L_{p/2}}^{1/2} \\&\le T \gamma ^2 R^2 + \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 6 \gamma \left( R + {{\bar{\varepsilon }}} \right) \sqrt{pT} \left\| \max _{t \in [T+1]} A_{t-1} \right\| _{L_{p/2}}^{1/2}, \end{aligned}$$

where the second inequality holds because \(p \le T/4\) by assumption, which implies that \(\sqrt{pT} + p \le 1.5 \sqrt{pT} \) and \( \sqrt{pT} + p + 2 \sqrt{T} \le 6 \sqrt{pT}\). As Jensen’s inequality ensures that \(\Vert \varvec{z} \Vert _{L_{p/2}} \le \Vert \varvec{z} \Vert _{L_p}\) for any random variable \(\varvec{z}\) and \(p > 0\), the following inequality holds for all \(2 \le p \le T/4\).

$$\begin{aligned} \left\| \max _{t \in [T+1]} A_{t-1} \right\| _{L_p}&\le T \gamma ^2 R^2 + \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 6 \gamma \left( R + {{\bar{\varepsilon }}} \right) \sqrt{pT} \left\| \max _{t \in [T+1]} A_{t-1} \right\| _{L_p}^{1/2} \end{aligned}$$

To complete the proof of Case II, we note that for any numbers \(a, b, c \ge 0\) the inequality \(c \le a + 2b \sqrt{c} \) is equivalent to \(\sqrt{c} \le b + \sqrt{b^2+a}\) and therefore also to \(c \le (b + \sqrt{b^2+a})^2 \le 4b^2 + 2a\), where the last estimate uses \((x+y)^2 \le 2x^2 + 2y^2\) for all \(x,y\in {\mathbb {R}}\). Identifying a with \(T \gamma ^2 R^2 + \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2\), b with \(3\gamma \left( R + {{\bar{\varepsilon }}} \right) \sqrt{pT}\) and c with \(\Vert \max _{t \in [T+1]} A_{t-1}\Vert _{L_p}\) then allows us to translate the inequality in the last display equation to

$$\begin{aligned} \left\| A_{T} \right\| _{L_p} \le \left\| \max _{t \in [T+1]} A_{t-1} \right\| _{L_p}&\le 2 T \gamma ^2 R^2 + 2 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 36 \gamma ^2 \left( R + {{\bar{\varepsilon }}} \right) ^2 p T. \end{aligned}$$
(48)

Thus, for any \(2 \le p \le T/4\), we have again found an upper bound on \(\Vert A_{T}\Vert _{L_p}\) that is affine in p.

Case III \(({p = 1})\): Recalling the definition of \(A_T\ge 0\), we find that

$$\begin{aligned} \Vert A_T \Vert _{L_{1}} = {\mathbb {E}}[A_T]&= T \gamma ^2 R^2 + \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + {\mathbb {E}}\left[ \, \sum _{t=1}^T {\mathbb {E}}[B_t | \mathcal F_{t-1}] \right] \\&\le T \gamma ^2 R^2 + \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + \left\| \max _{t \in [T]} \sum _{k=1}^t {\mathbb {E}}[B_k | \mathcal F_{k-1}] \right\| _{L_1} \\&\le T \gamma ^2 R^2 + \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 2 \gamma {{\bar{\varepsilon }}} \sqrt{T} \left\| \max _{t \in [T+1]} A_{t-1} \right\| ^{1/2}_{L_{1/2}} \\&\le T \gamma ^2 R^2 + \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 2 \gamma {{\bar{\varepsilon }}} \sqrt{T} \left\| \max _{t \in [T+1]} A_{t-1} \right\| ^{1/2}_{L_{2}}, \end{aligned}$$

where the second inequality follows from the estimate (46), which indeed holds for all \(p\in \mathbb N\), while the last inequality follows from Jensen’s inequality. By the second inequality in (48) for \(p=2\), we thus find

$$\begin{aligned} \Vert A_T \Vert _{L_{1}}&\le T \gamma ^2 R^2 + \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 2 {{\bar{\varepsilon }}} \gamma \sqrt{T} \cdot \sqrt{2 T \gamma ^2 R^2 + 2 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 72 \gamma ^2 (R + {{\bar{\varepsilon }}})^2 T} \end{aligned}$$
(49a)
$$\begin{aligned}&\le 2 T \gamma ^2 R^2 + 2 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 36 \gamma ^2 (R + {{\bar{\varepsilon }}})^2 T + 2 {{\bar{\varepsilon }}}^2 \gamma ^2 T , \end{aligned}$$
(49b)

where the last inequality holds because \(2ab \le 2a^2 + b^2/ 2\) for all \(a,b\in {\mathbb {R}}\).

We now combine the bounds derived in Cases I, II and III to obtain a universal bound on \(\left\| A_{T} \right\| _{L_p}\) that holds for all \(p\in \mathbb N\). Specifically, one readily verifies that the bound

$$\begin{aligned} \left\| A_{T} \right\| _{L_p}&\le 2 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 40 \gamma ^2 \left( R + {{\bar{\varepsilon }}} \right) ^2 p T, \end{aligned}$$
(50)

is more conservative than each of the bounds (42), (48) and (49), and thus it indeed holds for any \(p \in \mathbb N\). Combining this universal bound with (41) proves the first inequality from the proposition statement.

In order to prove the second inequality, we need to extend ([14], Proposition 7) to biased gradient oracles. To this end, we first note that

$$\begin{aligned} \left\| \nabla h \left( \frac{1}{T} \sum _{t=1}^{T} \varvec{\phi }_{t-1} \right) \right\|&\le \left\| \nabla h \left( \frac{1}{T} \sum _{t=1}^{T} \varvec{\phi }_{t-1} \right) - \frac{1}{T} \sum _{t=1}^T \nabla h(\varvec{\phi }_{t-1}) \right\| + \left\| \frac{1}{T} \sum _{t=1}^T \nabla h(\varvec{\phi }_{t-1}) \right\| \\&\le 2 M \left( \frac{1}{T} \sum _{t=1}^T h(\varvec{\phi }_{t-1}) - h(\varvec{\phi }^\star ) \right) + \left\| \frac{1}{T} \sum _{t=1}^T \nabla h(\varvec{\phi }_{t-1}) \right\| \\&\le \frac{M}{T \gamma } A_T + \left\| \frac{1}{T} \sum _{t=1}^T \nabla h(\varvec{\phi }_{t-1}) \right\| , \end{aligned}$$

where the second inequality follows from Lemma 4.4 (i), and the third inequality holds due to (40). By Minkowski’s inequality (38), we thus have for any \(p \ge 1\) that

$$\begin{aligned} \left\| \nabla h \left( \frac{1}{T} \sum _{t=1}^{T} \varvec{\phi }_{t-1} \right) \right\| _{L_{2p}}&\le \frac{M}{T \gamma } \Vert A_T \Vert _{L_{2p}} + \left\| \frac{1}{T} \sum _{t=1}^T \nabla h(\varvec{\phi }_{t-1}) \right\| _{L_{2p}} \\&\le \frac{2 M}{T \gamma } \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 80 M \gamma \left( R + {{\bar{\varepsilon }}} \right) ^2 p\\&\qquad + \left\| \frac{1}{T} \sum _{t=1}^T \nabla h(\varvec{\phi }_{t-1}) \right\| _{L_{2p}}, \end{aligned}$$

where the last inequality follows from the universal bound (50). In order to estimate the last term in the above expression, we recall that the update rule (37) is equivalent to \(\varvec{g}_t(\varvec{\phi }_{t-1}) = \left( \varvec{\phi }_{t-1} - \varvec{\phi }_{t} \right) / \gamma ,\) which in turn implies that \(\sum _{t=1}^T \varvec{g}_t(\varvec{\phi }_{t-1}) = \left( \varvec{\phi }_0 - \varvec{\phi }_T \right) / \gamma .\) Hence, for any \(p \ge 1\), we have

$$\begin{aligned}&\left\| \frac{1}{T} \sum _{t=1}^T \nabla h(\varvec{\phi }_{t-1}) \right\| _{L_{2p}}\\&\quad = \left\| \frac{1}{T} \sum _{t=1}^T \Big ( \nabla h(\varvec{\phi }_{t-1}) - \varvec{g}_t(\varvec{\phi }_{t-1}) \Big ) + \frac{\varvec{\phi }_0 - \varvec{\phi }^\star }{T \gamma } + \frac{\varvec{\phi }^\star - \varvec{\phi }_T}{T \gamma } \right\| _{L_{2p}} \\&\quad \le \left\| \frac{1}{T} \sum _{t=1}^T \nabla h(\varvec{\phi }_{t-1}) - \varvec{g}_t(\varvec{\phi }_{t-1}) \right\| _{L_{2p}} + \frac{1}{T \gamma } \left\| \varvec{\phi }_0 - \varvec{\phi }^\star \right\| + \frac{1}{T \gamma } \left\| \varvec{\phi }^\star - \varvec{\phi }_T \right\| _{L_{2p}} \\&\quad \le \left\| \frac{1}{T} \sum _{t=1}^T \nabla h(\varvec{\phi }_{t-1}) - \varvec{g}_t(\varvec{\phi }_{t-1}) \right\| _{L_{2p}} + \frac{1}{T \gamma } \left\| \varvec{\phi }_0 - \varvec{\phi }^\star \right\| + \frac{1}{T \gamma } \left\| A_T \right\| _{L_{p}}^{1/2} \\&\quad \le \left\| \frac{1}{T} \sum _{t=1}^T \nabla h(\varvec{\phi }_{t-1}) - \varvec{g}_t(\varvec{\phi }_{t-1}) \right\| _{L_{2p}} + \frac{1 + \sqrt{2}}{T \gamma } \left\| \varvec{\phi }_0 - \varvec{\phi }^\star \right\| \\&\qquad + \frac{2 \sqrt{10} \left( R + {{\bar{\varepsilon }}} \right) \sqrt{p}}{\sqrt{T}}, \end{aligned}$$

where the first inequality exploits Minkowski’s inequality (38), the second inequality follows from (40), which implies that \(\Vert \varvec{\phi }^\star - \varvec{\phi }_T \Vert \le \sqrt{A_T}\), and the definition of the \(L_p\)-norm. The last inequality in the above expression is a direct consequence of the universal bound (50) and the inequality \( \sqrt{a+b} \le \sqrt{a} + \sqrt{b}\) for all \(a,b\ge 0\). Next, define for any \(t\in \mathbb N\) a martingale difference of the form

$$\begin{aligned}\varvec{C}_t = \frac{1}{T} \Big ( \nabla h(\varvec{\phi }_{t-1}) - \varvec{g}_t(\varvec{\phi }_{t-1}) - {\mathbb {E}}[\nabla h(\varvec{\phi }_{t-1}) - \varvec{g}_t(\varvec{\phi }_{t-1}) | \mathcal F_{t-1}] \Big ).\end{aligned}$$

Note that these martingale differences are bounded because

$$\begin{aligned} \Vert \varvec{C}_t \Vert&\le \frac{1}{T} \Big ( \Vert \nabla h(\varvec{\phi }_{t-1}) \Vert + \Vert \varvec{g}_t(\varvec{\phi }_{t-1}) \Vert + \Vert {\mathbb {E}}[\nabla h(\varvec{\phi }_{t-1}) - \varvec{g}_t(\varvec{\phi }_{t-1}) | \mathcal F_{t-1}] \Vert \Big )\\&\le \frac{2R + \varepsilon _{t-1}}{T}\\&\le \frac{2R + {{\bar{\varepsilon }}}}{T}, \end{aligned}$$

and thus the BRP inequality of Lemma 4.3 implies that

$$\begin{aligned} \left\| \sum _{t=1}^T \varvec{C}_t \right\| _{L_{2p}} \le \sqrt{2p} \, \frac{2R + {{\bar{\varepsilon }}}}{\sqrt{T}} + 2p \, \frac{2R + {{\bar{\varepsilon }}}}{T}. \end{aligned}$$

Recalling the definition of the martingale differences \(\varvec{C}_t\), \(t\in \mathbb N\), this bound allows us to conclude that

$$\begin{aligned}&\frac{1}{T} \left\| \sum _{t=1}^T \nabla h(\varvec{\phi }_{t-1}) - \varvec{g}_t(\varvec{\phi }_{t-1}) \right\| _{L_{2p}}\\&\quad \le \left\| \sum _{t=1}^T \varvec{C}_t \right\| _{L_{2p}} + \frac{1}{T} \left\| \sum _{t=1}^T {\mathbb {E}}[\nabla h(\varvec{\phi }_{t-1}) - \varvec{g}_t(\varvec{\phi }_{t-1}) | \mathcal F_{t-1}] \right\| _{L_{2p}} \\&\quad \le \sqrt{2p} \, \frac{2R + {{\bar{\varepsilon }}}}{\sqrt{T}} + 2p \, \frac{2R + {{\bar{\varepsilon }}}}{T} + \frac{{{\bar{\varepsilon }}}}{\sqrt{T}} \le 2 \sqrt{2p} \, \frac{R + {{\bar{\varepsilon }}}}{\sqrt{T}} + 4p \, \frac{R + {{\bar{\varepsilon }}}}{T}, \end{aligned}$$

where the second inequality exploits Assumption 4.1 (i) as well as the second inequality in (43). Combining all inequalities derived above and observing that \(2\sqrt{2} + 2 \sqrt{10} < 10 \) finally yields

$$\begin{aligned}&\left\| \nabla h \left( \frac{1}{T} \sum _{t=1}^{T} \varvec{\phi }_{t-1} \right) \right\| _{L_{2p}}\\&\quad \le \frac{2 M}{T \gamma } \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 80 M \gamma \left( R + {{\bar{\varepsilon }}} \right) ^2 p + 2 \sqrt{2p} \, \frac{R + {{\bar{\varepsilon }}}}{\sqrt{T}} + 4p \, \frac{R + {{\bar{\varepsilon }}}}{T} \\&\qquad + \frac{1 + \sqrt{2}}{T \gamma } \left\| \varvec{\phi }_0 - \varvec{\phi }^\star \right\| + \frac{2 \sqrt{10} \left( R + {{\bar{\varepsilon }}} \right) \sqrt{p}}{\sqrt{T}} \\&\quad \le \frac{G}{\sqrt{T}} \left( 10 \sqrt{p} + \frac{4p}{\sqrt{T}} + 80 G^2 \gamma \sqrt{T} p + \frac{2}{\gamma \sqrt{T}} \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + \frac{3}{G \gamma \sqrt{T}} \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert \right) , \end{aligned}$$

where \(G = \max \{ M, R + {{\bar{\varepsilon }}} \}\). This proves the second inequality from the proposition statement. \(\square \)

The following corollary follows immediately from the proof of Proposition 4.2.

Corollary 4.5

Consider the inexact gradient descent algorithm (37) with constant step size \(\gamma > 0\). If Assumptions 4.1 (i)–(ii) hold with \(\varepsilon _t \le {{{\bar{\varepsilon }}}}/{(2\sqrt{1+t})}\) for some \({{\bar{\varepsilon }}} \ge 0\), then we have

$$\begin{aligned} \frac{1}{T} \sum _{t=1}^T \mathbb E \left[ \left( \nabla h(\varvec{\phi }_{t-1}) - \varvec{g}_t(\varvec{\phi }_{t-1}) \right) ^\top (\varvec{\phi }_{t-1} - \varvec{\phi }^\star ) \right] \le \frac{{{\bar{\varepsilon }}}}{\sqrt{T}} \sqrt{2 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 74 \gamma ^2 (R + {{\bar{\varepsilon }}})^2 T}. \end{aligned}$$

Proof of Corollary 4.5

Defining \(B_t\) as in the proof of Proposition 4.2, we find

$$\begin{aligned}&\frac{1}{T} \sum _{t=1}^T \mathbb E \left[ \left( \nabla h(\varvec{\phi }_{t-1}) - \varvec{g}_t(\varvec{\phi }_{t-1}) \right) ^\top (\varvec{\phi }_{t-1} - \varvec{\phi }^\star ) \right] \\&\quad = \frac{1}{2 \gamma T} {\mathbb {E}}\left[ \sum _{t=1}^T {\mathbb {E}}[B_t | \mathcal F_{t-1}] \right] \\&\quad \le \frac{{{\bar{\varepsilon }}}}{\sqrt{T}} \sqrt{2 T \gamma ^2 R^2 + 2 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 72 \gamma ^2 (R + {{\bar{\varepsilon }}})^2 T}, \end{aligned}$$

where the inequality is an immediate consequence of the reasoning in Case III in the proof of Proposition 4.2. The claim then follows from the trivial inequality \(R+ {{\bar{\varepsilon }}} \ge R\). \(\square \)

Armed with Proposition 4.2 and Corollary 4.5, we are now ready to prove the main convergence result.

Theorem 4.6

Consider the inexact gradient descent algorithm (37) with constant step size \(\gamma > 0\). If Assumptions 4.1 (i)–(ii) hold with \(\varepsilon _t \le {{{\bar{\varepsilon }}}}/{(2\sqrt{1+t})}\) for some \({{\bar{\varepsilon }}} \ge 0\), then the following statements hold.

  1. (i)

    If \(\gamma = 1 / (2 (R + {{\bar{\varepsilon }}})^2 \sqrt{T})\), then we have

    $$\begin{aligned} {\mathbb {E}}\left[ h \left( \frac{1}{T} \sum _{t=1}^{T} \varvec{\phi }_{t-1} \right) \right] - h(\varvec{\phi }^\star )&\le \frac{(R + {{\bar{\varepsilon }}})^2}{\sqrt{T}} \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + \frac{1}{4\sqrt{T}}\\&\quad + \frac{{{\bar{\varepsilon }}}}{\sqrt{T}} \sqrt{2 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + \frac{37}{2(R + {{\bar{\varepsilon }}})^2}} . \end{aligned}$$
  2. (ii)

If \(\gamma = 1 / (2 (R + {{\bar{\varepsilon }}})^2 \sqrt{T} + L)\) and Assumptions 4.1 (iv)–(v) hold in addition to the blanket assumptions mentioned above, then we have

    $$\begin{aligned} {\mathbb {E}}\left[ h \left( \frac{1}{T} \sum _{t=1}^{T} \varvec{\phi }_{t} \right) \right] - h(\varvec{\phi }^\star )&\le \frac{L}{2T}\Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + \frac{(R + {{\bar{\varepsilon }}})^2}{\sqrt{T}} \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2\\&\qquad + \frac{\sigma ^2}{4 (R+{{\bar{\varepsilon }}})^2\sqrt{T}} \\&\qquad + \frac{{{\bar{\varepsilon }}}}{\sqrt{T}} \sqrt{2 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + \frac{37}{2(R + {{\bar{\varepsilon }}})^2}}. \end{aligned}$$
  3. (iii)

    If \(\gamma = 1 / (2 G^2 \sqrt{T})\) with \(G = \max \{M, R + {{\bar{\varepsilon }}} \}\), the smallest eigenvalue \(\kappa \) of \(\nabla ^2 h(\varvec{\phi }^\star )\) is strictly positive and Assumption 4.1 (iii) holds in addition to the blanket assumptions mentioned above, then we have

    $$\begin{aligned} \mathbb E \left[ h\left( \frac{1}{T} \sum _{t=1}^{T} \varvec{\phi }_{t-1}\right) \right] - h(\varvec{\phi }^\star )&\le \frac{G^2}{\kappa T} \left( 4 G \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert + 20 \right) ^4. \end{aligned}$$
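The three step-size choices in Theorem 4.6 depend only on the constants from Assumption 4.1. A small helper along the following lines (an illustration of ours, not code from the paper) makes this dependence explicit.

```python
import numpy as np

def step_size(T, R, eps_bar, M=None, L=None, variant="i"):
    """Step sizes prescribed by Theorem 4.6 (illustrative sketch).

    variant "i":   1 / (2 (R + eps_bar)^2 sqrt(T))
    variant "ii":  1 / (2 (R + eps_bar)^2 sqrt(T) + L), requires L-smoothness
    variant "iii": 1 / (2 G^2 sqrt(T)) with G = max(M, R + eps_bar)
    """
    base = 2.0 * (R + eps_bar) ** 2 * np.sqrt(T)
    if variant == "i":
        return 1.0 / base
    if variant == "ii":
        return 1.0 / (base + L)
    G = max(M, R + eps_bar)
    return 1.0 / (2.0 * G ** 2 * np.sqrt(T))
```

Note that all three step sizes are of order \(T^{-1/2}\), yet assertion (iii) upgrades the suboptimality guarantee from \(\mathcal O(T^{-1/2})\) to \(\mathcal O(T^{-1})\) whenever \(\kappa > 0\).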

The proof of Theorem 4.6 relies on the following concentration inequalities due to Bach [14].

Lemma 4.7

(Concentration inequalities)

  1. (i)

    ([14], Lemma 11): If there exist \(a,b>0\) and a random variable \(\varvec{z} \in {\mathbb {R}}^n\) with \( \Vert \varvec{z} \Vert _{L_p} \le a + b p \) for all \(p \in \mathbb N\), then we have

    $$\begin{aligned} \mathbb P \left[ \Vert \varvec{z} \Vert \ge 3 b s + 2 a \right] \le 2 \exp (-s)\quad \forall s \ge 0. \end{aligned}$$
  2. (ii)

    ([14], Lemma 12): If there exist \(a,b,c>0\) and a random variable \(\varvec{z} \in {\mathbb {R}}^n\) with \( \Vert \varvec{z} \Vert _{L_p} \le (a \sqrt{p} + b p + c)^2 \) for all \(p \in [T]\), then we have

    $$\begin{aligned} \mathbb P \left[ \Vert \varvec{z} \Vert \ge (2 a \sqrt{s} + 2 b s + 2 c)^2 \right] \le 4 \exp (-s)\quad \forall s \le T. \end{aligned}$$

Proof of Theorem 4.6

Define \(A_t\) as in the proof of Proposition 4.2. Then, we have

$$\begin{aligned} {\mathbb {E}}\left[ h \left( \frac{1}{T} \sum _{t=0}^{T-1} \varvec{\phi }_{t} \right) - h(\varvec{\phi }^\star ) \right]&\le \frac{\mathbb E[A_T]}{2 \gamma T} = \frac{\Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2}{2 \gamma T} + \frac{\gamma R^2}{2}\nonumber \\&\quad + \frac{1}{T} \sum _{t=1}^T \mathbb E \left[ \left( \nabla h(\varvec{\phi }_{t-1}) - \varvec{g}_t(\varvec{\phi }_{t-1}) \right) ^\top (\varvec{\phi }_{t-1} - \varvec{\phi }^\star ) \right] \nonumber \\&\le \frac{\Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2}{2 \gamma T} + \frac{\gamma R^2}{2} \nonumber \\&\quad + \frac{{{\bar{\varepsilon }}}}{\sqrt{T}} \sqrt{2 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 74 \gamma ^2 (R + {{\bar{\varepsilon }}})^2 T}, \end{aligned}$$
(51)

where the two inequalities follow from (41) and from Corollary 4.5, respectively. Setting the step size to \(\gamma = 1 / ( 2 (R+ {{\bar{\varepsilon }}})^2 \sqrt{T} )\) then completes the proof of assertion (i).

Assertion (ii) generalizes ([45], Theorem 1). By the L-smoothness of \(h(\varvec{\phi })\), we have

$$\begin{aligned} h(\varvec{\phi }_{t})&\le h(\varvec{\phi }_{t-1}) + \nabla h(\varvec{\phi }_{t-1})^\top (\varvec{\phi }_{t} - \varvec{\phi }_{t-1}) + \frac{L}{2}\Vert \varvec{\phi }_{t} - \varvec{\phi }_{t-1}\Vert ^2 \nonumber \\&= h(\varvec{\phi }_{t-1}) + \varvec{g}_t(\varvec{\phi }_{t-1})^\top (\varvec{\phi }_{t} - \varvec{\phi }_{t-1}) + \left( \nabla h(\varvec{\phi }_{t-1}) - \varvec{g}_t(\varvec{\phi }_{t-1}) \right) ^\top \nonumber \\&\quad (\varvec{\phi }_{t} - \varvec{\phi }_{t-1}) + \frac{L}{2}\Vert \varvec{\phi }_{t} - \varvec{\phi }_{t-1}\Vert ^2 \nonumber \\&\le h(\varvec{\phi }_{t-1}) + \varvec{g}_t(\varvec{\phi }_{t-1})^\top (\varvec{\phi }_{t} - \varvec{\phi }_{t-1}) + \frac{\zeta }{2}\Vert \nabla h(\varvec{\phi }_{t-1}) - \varvec{g}_t(\varvec{\phi }_{t-1}) \Vert ^2\nonumber \\&\quad + \frac{L + 1/\zeta }{2}\Vert \varvec{\phi }_{t} - \varvec{\phi }_{t-1}\Vert ^2, \end{aligned}$$
(52)

where the last inequality exploits the Cauchy-Schwarz inequality together with the elementary inequality \(2ab \le \zeta a^2 + b^2 / \zeta \), which holds for all \(a,b\in {\mathbb {R}}\) and \(\zeta > 0\). Next, note that the iterates satisfy the recursion

$$\begin{aligned} \Vert \varvec{\phi }_{t-1} - \varvec{\phi }^\star \Vert ^2 = \Vert \varvec{\phi }_{t-1} - \varvec{\phi }_{t} \Vert ^2 + \Vert \varvec{\phi }_{t} - \varvec{\phi }^\star \Vert ^2 + 2 (\varvec{\phi }_{t-1} - \varvec{\phi }_{t})^\top (\varvec{\phi }_{t} - \varvec{\phi }^\star ), \end{aligned}$$

which can be re-expressed as

$$\begin{aligned} \varvec{g}_t(\varvec{\phi }_{t-1})^\top (\varvec{\phi }_{t} - \varvec{\phi }^\star ) = \frac{1}{2 \gamma } \left( \Vert \varvec{\phi }_{t-1} - \varvec{\phi }^\star \Vert ^2 - \Vert \varvec{\phi }_{t-1} - \varvec{\phi }_{t} \Vert ^2 - \Vert \varvec{\phi }_{t} - \varvec{\phi }^\star \Vert ^2 \right) \end{aligned}$$

by using the update rule (37). In the remainder of the proof we assume that \(0< \gamma < 1 / L\). Substituting the above equality into (52) and setting \(\zeta = \gamma / (1 - \gamma L)\) then yields

$$\begin{aligned} h(\varvec{\phi }_{t})&\le h(\varvec{\phi }_{t-1}) + \varvec{g}_t(\varvec{\phi }_{t-1})^\top (\varvec{\phi }^\star - \varvec{\phi }_{t-1}) + \frac{\gamma }{2(1 - \gamma L)} \Vert \nabla h(\varvec{\phi }_{t-1}) - \varvec{g}_t(\varvec{\phi }_{t-1}) \Vert ^2 \\&\qquad + \frac{1}{2 \gamma } \left( \Vert \varvec{\phi }_{t-1} - \varvec{\phi }^\star \Vert ^2 - \Vert \varvec{\phi }_{t} - \varvec{\phi }^\star \Vert ^2 \right) . \end{aligned}$$

By the convexity of h, we have \(h(\varvec{\phi }^\star ) \ge h(\varvec{\phi }_{t-1}) + \nabla h(\varvec{\phi }_{t-1})^\top (\varvec{\phi }^\star - \varvec{\phi }_{t-1})\), which finally implies that

$$\begin{aligned} h(\varvec{\phi }_{t})&\le h(\varvec{\phi }^\star ) + \left( \nabla h(\varvec{\phi }_{t-1}) - \varvec{g}_t(\varvec{\phi }_{t-1}) \right) ^\top ( \varvec{\phi }_{t-1} - \varvec{\phi }^\star )\\&\qquad + \frac{\gamma }{2(1 - \gamma L)} \Vert \nabla h(\varvec{\phi }_{t-1}) - \varvec{g}_t(\varvec{\phi }_{t-1}) \Vert ^2 \\&\qquad + \frac{1}{2\gamma } \left( \Vert \varvec{\phi }_{t-1} - \varvec{\phi }^\star \Vert ^2 - \Vert \varvec{\phi }_{t} - \varvec{\phi }^\star \Vert ^2 \right) . \end{aligned}$$

Averaging the above inequality over t and taking expectations then yields the estimate

$$\begin{aligned}&\mathbb E \left[ \frac{1}{T} \sum _{t=1}^T h(\varvec{\phi }_{t}) \right] - h(\varvec{\phi }^\star )\\&\quad \le \frac{\Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2}{2\gamma T} + \frac{\gamma }{2 (1 - \gamma L)} \mathbb E \left[ \frac{1}{T} \sum _{t=1}^T \Vert \nabla h(\varvec{\phi }_{t-1}) - \varvec{g}_t(\varvec{\phi }_{t-1}) \Vert ^2 \right] \\&\qquad + \mathbb E \left[ \frac{1}{T} \sum _{t=1}^T \left( \nabla h(\varvec{\phi }_{t-1}) - \varvec{g}_t(\varvec{\phi }_{t-1}) \right) ^\top (\varvec{\phi }_{t-1} - \varvec{\phi }^\star ) \right] \\&\quad \le \frac{\Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2}{2\gamma T} + \frac{\gamma \sigma ^2}{2 (1 - \gamma L)} + \frac{{{\bar{\varepsilon }}}}{\sqrt{T}} \sqrt{2 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 74 \gamma ^2 (R + {{\bar{\varepsilon }}})^2 T}, \end{aligned}$$

where the second inequality exploits Assumption 4.1 (v) and Corollary 4.5. Using Jensen’s inequality to move the average over t inside h, assertion (ii) then follows by setting \(\gamma = 1 / (2 (R + {{\bar{\varepsilon }}})^2 \sqrt{T} + L)\) and observing that \(\gamma / ( 1 - \gamma L) = 1 / ( 2(R+{{\bar{\varepsilon }}})^2 \sqrt{T} )\).

To prove assertion (iii), we distinguish two different cases.

Case I: Assume first that \(4 G^2 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 6 G \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert \le {\kappa \sqrt{T}}/{(8 G^2)}\), where \(G = \max \{M, R + {{\bar{\varepsilon }}} \}\) and \(\kappa \) denotes the smallest eigenvalue of \(\nabla ^2 h(\varvec{\phi }^\star )\). By a standard formula for the expected value of a non-negative random variable, we find

$$\begin{aligned}&{\mathbb {E}}\left[ h \left( \frac{1}{T} \sum _{t=1}^{T} \varvec{\phi }_{t-1} \right) - h(\varvec{\phi }^\star ) \right] \nonumber \\&\quad = \int _{0}^\infty \mathbb P \left[ h \left( \frac{1}{T} \sum _{t=1}^{T} \varvec{\phi }_{t-1} \right) - h(\varvec{\phi }^\star ) \ge u \right] \mathrm {d}u \nonumber \\&\quad = \int _{0}^{u_1} \mathbb P \left[ h \left( \frac{1}{T} \sum _{t=1}^{T} \varvec{\phi }_{t-1} \right) - h(\varvec{\phi }^\star ) \ge u \right] \mathrm {d}u \nonumber \\&\qquad + \int _{u_1}^{u_2} \mathbb P \left[ h \left( \frac{1}{T} \sum _{t=1}^{T} \varvec{\phi }_{t-1} \right) - h(\varvec{\phi }^\star ) \ge u \right] \mathrm {d}u \nonumber \\&\qquad + \int _{u_2}^{\infty } \mathbb P \left[ h \left( \frac{1}{T} \sum _{t=1}^{T} \varvec{\phi }_{t-1} \right) - h(\varvec{\phi }^\star ) \ge u \right] \mathrm {d}u, \end{aligned}$$
(53)

where \(u_1 = \frac{8 G^2}{\kappa T}(4 G^2 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 6 G \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert )^2\) and \(u_2 = \frac{8 G^2}{\kappa T}(\frac{\kappa \sqrt{T}}{4 G^2} + 4 G^2 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 6 G \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert )^2\). The first of the three integrals in (53) is trivially upper bounded by \(u_1\). Next, we investigate the third integral in (53), which is easier to bound from above than the second one. By combining the first inequality in Proposition 4.2 for \(\gamma = 1 / (2 G^2 \sqrt{T})\) with the trivial inequality \(G \ge R + {{\bar{\varepsilon }}}\), we find

$$\begin{aligned} \left\| h\left( \frac{1}{T} \sum _{t=1}^{T} \varvec{\phi }_{t-1} \right) - h(\varvec{\phi }^\star ) \right\| _{L_p} \le \frac{2G^2}{\sqrt{T}}\,\Vert \varvec{\phi }_0-\varvec{\phi }^\star \Vert ^2 + \frac{10}{\sqrt{T}} \,p\quad \forall p\in \mathbb N. \end{aligned}$$

Lemma 4.7 (i) with \(a = 2 G^2 \Vert \varvec{\phi }_0 -\varvec{\phi }^\star \Vert ^2 / \sqrt{T}\) and \(b = 10 / \sqrt{T}\) thus implies that

$$\begin{aligned} \mathbb P \left[ h \left( \frac{1}{T} \sum _{t=1}^{T} \varvec{\phi }_{t-1} \right) - h(\varvec{\phi }^\star ) \ge \frac{30}{\sqrt{T}} s + \frac{4 G^2}{\sqrt{T}} \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 \right] \le 2 \exp (-s) \quad \forall s \ge 0. \end{aligned}$$
(54)

We also have

$$\begin{aligned} u_2 - \frac{4 G^2}{\sqrt{T}} \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 \ge u_2 - \frac{\kappa }{8 G^2} \ge \frac{8 G^2}{\kappa T} \left( \frac{\kappa \sqrt{T}}{4 G^2} \right) ^2 - \frac{\kappa }{8 G^2} = \frac{3 \kappa }{8 G^2} \ge 0, \end{aligned}$$
(55)

where the first inequality follows from the basic assumption underlying Case I, while the second inequality holds due to the definition of \(u_2\). By (54) and (55), the third integral in (53) satisfies

$$\begin{aligned}&\int _{u_2}^{\infty } \mathbb P \left[ h \left( \frac{1}{T} \sum _{t=1}^{T} \varvec{\phi }_{t-1} \right) - h(\varvec{\phi }^\star ) \ge u \right] \mathrm {d}u \\&\quad =\; \int _{u_2 - \frac{4 G^2}{\sqrt{T}} \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2}^{\infty } \mathbb P \left[ h \left( \frac{1}{T} \sum _{t=1}^{T} \varvec{\phi }_{t-1} \right) - h(\varvec{\phi }^\star ) \ge u + \frac{4 G^2}{\sqrt{T}} \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 \right] \mathrm {d}u \\&\quad \le \; 2 \int _{u_2 - \frac{4 G^2}{\sqrt{T}} \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2}^\infty \exp \left( -\frac{\sqrt{T} u}{30} \right) \mathrm {d}u= \frac{60}{\sqrt{T}} \exp \left( -\frac{\sqrt{T}}{30} \left( u_2 - \frac{4 G^2}{\sqrt{T}} \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 \right) \right) \\&\quad \le \; \frac{60}{\sqrt{T}} \exp \left( -\frac{\kappa \sqrt{T}}{80 G^2} \right) \le \frac{2400 G^2}{\kappa T}, \end{aligned}$$

where the first inequality follows from the concentration inequality (54) and the insight from (55) that \(u_2 - \frac{4 G^2}{\sqrt{T}}\Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 \ge 0\). The second inequality exploits again (55), and the last inequality holds because \(\exp (-x) \le 1 / (2x)\) for all \( x > 0\). We have thus found a simple upper bound on the third integral in (53). It remains to derive an upper bound on the second integral in (53). To this end, we first observe that the second inequality in Proposition 4.2 for \(\gamma = 1 / (2 G^2 \sqrt{T})\) translates to

$$\begin{aligned}&\left\| \left\| \nabla h \left( \frac{1}{T} \sum _{t=1}^{T} \varvec{\phi }_{t-1} \right) \right\| ^2 \right\| _{L_p} \\&\quad \le \frac{G^{2}}{T} \left( 10 \sqrt{p} + \frac{4p}{\sqrt{T}} + 40 p + 4G^2 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 6G \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert \right) ^2 \quad \forall p\in \mathbb N. \end{aligned}$$

Lemma 4.7 (ii) with \(a = 10 G / \sqrt{T}\), \(b = 4 G / T + 40 G / \sqrt{T}\) and \(c = 4 G^3 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 / \sqrt{T} + 6 G^2 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert /\sqrt{T}\) thus gives rise to the concentration inequality

$$\begin{aligned}&\mathbb P \left[ \;\left\| \nabla h \left( \frac{1}{T} \sum _{t=1}^{T} \varvec{\phi }_{t-1} \!\right) \right\| ^2 \right. \\&\quad \left. \ge \! \frac{4G^2}{T} \left( 10 \sqrt{s} + \frac{4s}{\sqrt{T}} + 40 s + 4 G^2 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 6 G \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert \right) ^2 \right] \le 4 \exp (-s), \end{aligned}$$

which holds only for small deviations \(s\le T\). However, this concentration inequality can be simplified to

$$\begin{aligned}&\mathbb P \left[ \;\left\| \nabla h \left( \frac{1}{T} \sum _{t=1}^{T} \varvec{\phi }_{t-1} \right) \right\| \right. \\&\quad \left. \ge \frac{2G}{\sqrt{T}} \left( 12 \sqrt{s} + 40 s + 4 G^2 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 6 G \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert \right) \right] \\&\quad \le 4 \exp (-s), \end{aligned}$$

which remains valid for all deviations \( s\ge 0\). To see this, note that if \( s \le T/4 \), then the simplified concentration inequality holds because \( 4 s / T \le 2 \sqrt{s / T}\). Otherwise, if \( s > T/4 \), then the simplified concentration inequality holds trivially because the probability on the left-hand side vanishes. Indeed, this is an immediate consequence of Assumption 4.1 (ii), which stipulates that the norm of the gradient of h is bounded by R, and of the elementary estimate \(24 G \sqrt{s / T} > G\ge R\), which holds for all \(s > T / 4\).

In the following, we restrict attention to those deviations \(s\ge 0\) that are small in the sense that

$$\begin{aligned} \displaystyle 12 \sqrt{s} + 40 s \le \frac{ \kappa \sqrt{T}}{4G^2}. \end{aligned}$$
(56)

Assume now for the sake of argument that the event inside the probability in the simplified concentration inequality does not occur, that is, assume that

$$\begin{aligned} \left\| \nabla h \left( \frac{1}{T} \sum _{t=1}^{T} \varvec{\phi }_{t-1} \right) \right\| < \frac{2G}{\sqrt{T}} \left( 12 \sqrt{s} + 40 s + 4 G^2 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 6 G \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert \right) . \end{aligned}$$
(57)

By (56) and the assumption of Case I, (57) implies that \(\Vert \nabla h ( \frac{1}{T}\sum _{t=1}^T \varvec{\phi }_{t-1} ) \Vert< 3 \kappa / (4G) \le 3 \kappa / (4M)\). Hence, we may apply Lemma 4.4 (ii) to conclude that \(h ( \frac{1}{T}\sum _{t=1}^T \varvec{\phi }_{t-1} ) - h(\varvec{\phi }^\star ) \le \frac{2}{\kappa } \Vert \nabla h ( \frac{1}{T} \sum _{t=1}^T \varvec{\phi }_{t-1} ) \Vert ^2\). Combining this inequality with (57) then yields

$$\begin{aligned} h \left( \frac{1}{T}\sum _{t=1}^T \varvec{\phi }_{t-1} \right) - h(\varvec{\phi }^\star ) < \frac{8G^2}{\kappa T} \left( 12 \sqrt{s} + 40 s + 4 G^2 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 6 G \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert \right) ^2. \end{aligned}$$
(58)

By the simplified concentration inequality derived above, we may thus conclude that

$$\begin{aligned} 4 \exp (-s)&\ge \; \mathbb P \left[ \; \left\| \nabla h \left( \frac{1}{T} \sum _{t=1}^{T} \varvec{\phi }_{t-1} \right) \right\| \right. \nonumber \\&\left. \ge \frac{2G}{\sqrt{T}} \left( 12 \sqrt{s} + 40 s + 4 G^2 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 6 G \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert \right) \right] \nonumber \\&\ge \; \mathbb P \left[ \; h \left( \frac{1}{T} \sum _{t=1}^{T} \varvec{\phi }_{t-1} \right) - h(\varvec{\phi }^\star ) \right. \nonumber \\&\left. \ge \frac{8G^2}{\kappa T} \left( 12 \sqrt{s} + 40 s + 4 G^2 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 6 G \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert \right) ^2 \right] \end{aligned}$$
(59)

for any \(s\ge 0\) that satisfies (56), where the second inequality holds because (57) implies (58) or, equivalently, because the negation of (58) implies the negation of (57). The resulting concentration inequality (59) now enables us to construct an upper bound on the second integral in (53). To this end, we define the function

$$\begin{aligned} \ell (s) = \frac{8 G^2}{\kappa T} \left( 12 \sqrt{s} + 40 s + 4 G^2 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 6 G \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert \right) ^2 \end{aligned}$$

for all \(s\ge 0\), and set \({{\bar{s}}} = ((9/400 + \kappa \sqrt{T} / (160 G^2))^{\frac{1}{2}} - 3 / 20)^{2}\). Note that \(s\ge 0\) satisfies the inequality (56) if and only if \(s\le {{\bar{s}}}\) and that \(\ell (0) = u_1\) as well as \(\ell ({{\bar{s}}}) = u_2\). By substituting u with \( \ell (s)\) and using the concentration inequality (59) to bound the integrand, we find that the second integral in (53) satisfies

$$\begin{aligned}&\int _{u_1}^{u_2} \mathbb P \left[ h \left( \frac{1}{T} \sum _{t=1}^{T} \varvec{\phi }_{t-1} \right) - h(\varvec{\phi }^\star ) \ge u \right] \mathrm {d}u\\&\quad = \int _{0}^{{{\bar{s}}}} \mathbb P \left[ h \left( \frac{1}{T} \sum _{t=1}^{T} \varvec{\phi }_{t-1} \right) - h(\varvec{\phi }^\star ) \ge \ell (s) \right] \frac{\mathrm {d}\ell (s)}{\mathrm {d}s} \mathrm {d}s \\&\quad \le \int _{0}^{{{\bar{s}}}} 4 \mathrm {e}^{-s} \; \frac{\mathrm {d}}{\mathrm {d}s} \! \left( \frac{8 G^2}{\kappa T} \left( 12 \sqrt{s} + 40 s + \tau \right) ^2 \right) \mathrm {d}s \\&\quad \le \frac{32 G^2}{\kappa T} \int _{0}^{\infty } \mathrm {e}^{-s} \left( 144 + 3200 s + 1440 s^{1/2} + 80 \tau + 12 \tau s^{-1/2} \right) \mathrm {d}s \\&\quad = \frac{32 G^2}{\kappa T} \big ( 144 + 3200 \Gamma (2) + 1440 \Gamma (3/2) + 80 \tau + 12 \tau \Gamma (1/2) \big ) \\&\quad \le \frac{32 G^2}{\kappa T} ( 4621 + 102 \tau ), \end{aligned}$$

where \(\tau \) is a shorthand for \(4 G^2 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 6 G \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert \), and \(\Gamma \) denotes the Gamma function with \(\Gamma (2) = 1\), \(\Gamma (1/2) = \sqrt{\pi }\) and \(\Gamma (3/2) = \sqrt{\pi }/2\); see for example ([141], Chapter 8). The last inequality is obtained by rounding all fractional numbers up to the next higher integer. Combining the upper bounds for the three integrals in (53) finally yields

$$\begin{aligned} {\mathbb {E}}\left[ h \left( \frac{1}{T} \sum _{t=1}^{T} \varvec{\phi }_{t-1} \right) - h(\varvec{\phi }^\star ) \right]&\le \frac{8 G^2}{\kappa T} \left( \tau ^2 + 18484 + 408 \tau + 300 \right) \\&= \frac{8 G^2}{\kappa T} \Big ( 16 G^4 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^4 + 48 G^3 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^3 \\&\quad + 1668 G^2 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 \\&\quad + 2448 G \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert + 18784 \Big ) \\&\le \frac{G^2}{\kappa T} (4 G \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert + 20)^4. \end{aligned}$$

This completes the proof of assertion (iii) in Case I.

Case II: Assume now that \(4 G^2 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 6 G \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert > {\kappa \sqrt{T}}/{(8 G^2)}\), where G is defined as before. Since h has bounded gradients, the inequality (51) remains valid. Setting the step size to \(\gamma = 1 / (2 G^2 \sqrt{T})\) and using the trivial inequalities \(G \ge R + {{\bar{\varepsilon }}} \ge R\), we thus obtain

$$\begin{aligned} {\mathbb {E}}\left[ h \left( \frac{1}{T} \sum _{t=1}^{T} \varvec{\phi }_{t-1} \right) \right] - h(\varvec{\phi }^\star )&\le \frac{G^2}{\sqrt{T}} \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + \frac{1}{4\sqrt{T}} + \frac{{{\bar{\varepsilon }}}}{\sqrt{T}} \sqrt{2 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + \frac{37}{2G^2}} \\&\le \frac{G^2}{\sqrt{T}} \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + \frac{2G}{\sqrt{T}} \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert + \frac{5}{\sqrt{T}} , \end{aligned}$$

where the second inequality holds because \(G \ge {{\bar{\varepsilon }}}\) and \(\sqrt{a + b} \le \sqrt{a} + \sqrt{b}\) for all \(a,b\ge 0\). Multiplying the right hand side of the last inequality by \(G^2 (32 G^2 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 48 G \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ) / (\kappa \sqrt{T})\), which is strictly larger than 1 by the basic assumption underlying Case II, we then find

$$\begin{aligned}&{\mathbb {E}}\left[ h \left( \frac{1}{T} \sum _{t=1}^{T} \varvec{\phi }_{t-1} \right) \right] - h(\varvec{\phi }^\star ) \\&\quad \le \frac{G^2}{\kappa T} \left( G^2 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 2 G \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert + 5 \right) \left( 32 G^2 \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert ^2 + 48 G \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert \right) \\&\quad \le \frac{G^2}{\kappa T} (4 G \Vert \varvec{\phi }_0 - \varvec{\phi }^\star \Vert + 20)^4. \end{aligned}$$

This observation completes the proof. \(\square \)

4.2 Smooth optimal transport problems with marginal ambiguity sets

The smooth optimal transport problem (12) can be viewed as an instance of a stochastic optimization problem, that is, a concave maximization problem akin to (36), where the objective function is representable as \(h(\varvec{\phi }) = {\mathbb {E}}_{\varvec{x} \sim \mu } [ \varvec{\nu }^\top \varvec{\phi }- {{\overline{\psi }}}_c(\varvec{\phi }, \varvec{x})]\). Throughout this section we assume that the smooth (discrete) c-transform \({{\overline{\psi }}}_c(\varvec{\phi }, \varvec{x})\) defined in (11) is induced by a marginal ambiguity set of the form (26) with continuous marginal distribution functions. By Proposition 3.6, the integrand \(\varvec{\nu }^\top \varvec{\phi }- {{\overline{\psi }}}_c(\varvec{\phi }, \varvec{x})\) is therefore concave and differentiable in \(\varvec{\phi }\). We also assume that \({{\overline{\psi }}}_c(\varvec{\phi }, \varvec{x})\) is \(\mu \)-integrable in \(\varvec{x}\), that we have access to an oracle that generates independent samples from \(\mu \), and that problem (12) is solvable.
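If, for illustration, every marginal distribution function is exponential, \(F_i(s) = 1 - e^{-s/\eta }\), then the smooth c-transform reduces to a log-sum-exp function (up to an additive constant) whose gradient is a softmax vector in \(\Delta ^N\) (cf. Proposition 3.6). Under this assumption, a bare-bones implementation of averaged stochastic gradient ascent for problem (12) could look as follows; all names and interfaces are ours.

```python
import numpy as np

def smooth_c_transform_grad(phi, c_x, eta):
    """Gradient of the smooth c-transform for one sample x (softmax form).

    Assumes exponential marginals, for which the smooth c-transform equals
    eta * logsumexp((phi - c_x) / eta) up to a constant; the gradient then
    lies in the simplex.
    """
    z = (phi - c_x) / eta
    z -= z.max()                 # stabilize the exponentials
    p = np.exp(z)
    return p / p.sum()

def averaged_sga_smooth_ot(nu, cost, sample_mu, eta, gamma, T):
    """Averaged stochastic gradient ascent for the concave problem (12).

    `cost(x)` returns the vector (c(x, y_1), ..., c(x, y_N)) and `sample_mu()`
    draws an independent sample from mu; both are hypothetical interfaces.
    """
    phi = np.zeros_like(nu, dtype=float)
    running_sum = np.zeros_like(phi)
    for _ in range(T):
        running_sum += phi
        x = sample_mu()
        g = nu - smooth_c_transform_grad(phi, cost(x), eta)  # stochastic gradient of h
        phi = phi + gamma * g                                # ascent step on the concave h
    return running_sum / T
```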

The following proposition establishes several useful properties of the smooth c-transform.

Proposition 4.8

(Properties of the smooth c-transform) If \(\Theta \) is a marginal ambiguity set of the form (26) with cumulative distribution functions \(F_i\), \(i\in [N]\), then \({{\overline{\psi }}}_c(\varvec{\phi }, \varvec{x})\) has the following properties for all \(\varvec{x} \in \mathcal X\).

  1. (i)

    Bounded gradient: If \(F_i\), \(i\in [N]\), are continuous, then we have \( \Vert \nabla _{\varvec{\phi }} {{\overline{\psi }}}_c(\varvec{\phi }, \varvec{x}) \Vert \le 1 \) for all \(\varvec{\phi }\in {\mathbb {R}}^N\).

  2. (ii)

    Lipschitz continuous gradient: If \(F_i\), \(i\in [N]\), are Lipschitz continuous with Lipschitz constant \(L>0\), then \({{\overline{\psi }}}_c(\varvec{\phi }, \varvec{x})\) is L-smooth with respect to \(\varvec{\phi }\) in the sense of Assumption 4.1 (iv).

  3. (iii)

    Generalized self-concordance: If \(F_i\), \(i\in [N]\), are twice differentiable on the interiors of their respective supports and if there is \(M > 0\) with

$$\begin{aligned} \sup _{s \in F_i^{-1}(0,1)} ~ \frac{|\mathrm {d}^2F_i(s) / \mathrm {d}s^2|}{\mathrm {d}F_i(s) / \mathrm {d}s} \le M \quad \forall i \in [N], \end{aligned}$$
    (60)

    then \({{\overline{\psi }}}_c(\varvec{\phi }, \varvec{x})\) is M-generalized self-concordant with respect to \(\varvec{\phi }\) in the sense of Assumption 4.1 (iii).
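For example (a simple check of ours, not needed for the proof), if every \(F_i\) is the exponential distribution function \(F_i(s) = 1 - e^{-s/\eta }\) on \([0, \infty )\) for some \(\eta > 0\), then

$$\begin{aligned} \frac{\mathrm {d}F_i(s)}{\mathrm {d}s} = \frac{1}{\eta } e^{-s/\eta } \le \frac{1}{\eta } \quad \text {and} \quad \frac{|\mathrm {d}^2F_i(s) / \mathrm {d}s^2|}{\mathrm {d}F_i(s) / \mathrm {d}s} = \frac{1}{\eta } \qquad \forall s > 0, \end{aligned}$$

and thus all three assertions apply simultaneously with \(L = M = 1/\eta \).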

Proof

As for (i), Proposition 3.6 implies that \(\nabla _{\varvec{\phi }} {{\overline{\psi }}}_c(\varvec{\phi }, \varvec{x}) \in \Delta ^N\), and thus we have \(\Vert \nabla _{\varvec{\phi }} {{\overline{\psi }}}_c(\varvec{\phi }, \varvec{x}) \Vert \le 1\). As for (ii), note that the convex conjugate of the smooth c-transform with respect to \(\varvec{\phi }\) is given by

$$\begin{aligned} {{\overline{\psi }}}{}_c^*(\varvec{p}, \varvec{x})&= \sup _{\varvec{\phi }\in {\mathbb {R}}^N} \varvec{p}^\top \varvec{\phi }- {{\overline{\psi }}}_c(\varvec{\phi }, \varvec{x}) \\&= \sup _{\varvec{\phi }\in {\mathbb {R}}^N} \inf _{\varvec{q} \in \Delta ^N} ~ \sum _{i=1}^N p_i \phi _i - (\phi _i - c(\varvec{x}, \varvec{y_i})) q_i - \int _{1-q_i}^1 F_i^{-1}(t)\mathrm {d}t \\&= \inf _{\varvec{q} \in \Delta ^N} \sup _{\varvec{\phi }\in {\mathbb {R}}^N} ~ \sum _{i=1}^N p_i \phi _i - (\phi _i - c(\varvec{x}, \varvec{y_i})) q_i - \int _{1-q_i}^1 F_i^{-1}(t)\mathrm {d}t \\&= {\left\{ \begin{array}{ll} \;\displaystyle \sum \limits _{i=1}^N c(\varvec{x}, \varvec{y_i}) p_i - \int _{1-p_i}^1 F_i^{-1}(t)\mathrm {d}t &{} \text {if } \varvec{p} \in \Delta ^N \\ \;+\infty &{} \text {otherwise,} \end{array}\right. } \end{aligned}$$

where the second equality follows again from Proposition 3.6, and the interchange of the infimum and the supremum is allowed by Sion’s classical minimax theorem. In the following we first prove that \({{\overline{\psi }}}{}_c^*(\varvec{p}, \varvec{x})\) is 1/L-strongly convex in \(\varvec{p}\), that is, the function \({{\overline{\psi }}}{}_c^*(\varvec{p}, \varvec{x}) - \Vert \varvec{p}\Vert ^2/ (2L)\) is convex in \(\varvec{p}\) for any fixed \(\varvec{x} \in \mathcal X\). To this end, recall that \(F_i\) is assumed to be Lipschitz continuous with Lipschitz constant L. Thus, we have

$$\begin{aligned} L\!\ge \!\sup _{\begin{array}{c} s_1,s_2 \in {\mathbb {R}}\\ s_1 \ne s_2 \end{array}}\!\frac{\left| F_i (s_1) \!-\! F_i(s_2)\right| }{|s_1 - s_2|} \!=\! \sup _{\begin{array}{c} s_1, s_2 \in {\mathbb {R}}\\ s_1> s_2 \end{array}}\frac{ F_i (s_1) \!-\! F_i(s_2)}{s_1 - s_2}\!\ge \! \sup _{\begin{array}{c} p_i, q_i \in (0,1)\\ p_i > q_i \end{array}} \frac{p_i - q_i}{F_i^{-1}(p_i) \!-\! F_i^{-1}(q_i)}, \end{aligned}$$

where the second inequality follows from restricting \(s_1\) and \(s_2\) to the preimage of (0, 1) with respect to \(F_i\). Rearranging terms in the above inequality then yields

$$\begin{aligned} -F_i^{-1}(1 - q_i) - q_i/L&\le -F_i^{-1}(1-p_i)-p_i/L \end{aligned}$$

for all \(p_i, q_i \in (0, 1)\) with \(q_i < p_i\). Consequently, the function \(- F_i^{-1}(1-p_i) - {p_i}/L\) is non-decreasing and its primitive \(- \int _{1-p_i}^1 F_i^{-1}(t)\mathrm {d}t - p_i^2/(2 L)\) is convex in \(p_i\) on the interval (0, 1). This implies that

$$\begin{aligned} {{\overline{\psi }}}{}_c^*(\varvec{p}, \varvec{x}) - \frac{\Vert \varvec{p}\Vert _2^2}{2 L} = \sum _{i=1}^N c(\varvec{x}, \varvec{y_i}) p_i - \int _{1-p_i}^1 F_i^{-1}(t)\mathrm {d}t - \frac{p_i^2}{2 L} \end{aligned}$$

constitutes a sum of convex univariate functions for every fixed \(\varvec{x}\in {\mathcal {X}}\). Thus, \({{\overline{\psi }}}{}_c^*(\varvec{p}, \varvec{x})\) is 1/L-strongly convex in \(\varvec{p}\). By ([78], Theorem 6), any convex function whose conjugate is 1/L-strongly convex is guaranteed to be L-smooth. This observation completes the proof of assertion (ii). As for assertion (iii), choose any \(\varvec{\phi }, \varvec{\varphi }\in {\mathbb {R}}^N\) and \(\varvec{x} \in \mathcal X\), and introduce the auxiliary function

$$\begin{aligned} u(s)&= {{\overline{\psi }}}_c \left( \varvec{\phi }+ s (\varvec{\varphi }- \varvec{\phi }), \varvec{x} \right) = \max _{ \varvec{p} \in \Delta ^N} \displaystyle \sum \limits _{i=1}^N ~ (\phi _i + s (\varphi _i - \phi _i) - c(\varvec{x}, \varvec{y_i}))p_i \nonumber \\&\quad + \int _{1-p_i}^1 F_i^{-1}(t) \mathrm {d}t. \end{aligned}$$
(61)

For ease of exposition, in the remainder of the proof we use prime symbols to designate derivatives of univariate functions. A direct calculation then yields

$$\begin{aligned} u'(s)&= \left( \varvec{\varphi }- \varvec{\phi }\right) ^\top \nabla _{\varvec{\phi }} {{\overline{\psi }}}_c \left( \varvec{\phi }+ s (\varvec{\varphi }- \varvec{\phi }), \varvec{x} \right) \quad \text {and} \\ u''(s)&= \left( \varvec{\varphi }- \varvec{\phi }\right) ^\top \nabla _{\varvec{\phi }}^2 {{\overline{\psi }}}_c \left( \varvec{\phi }+ s (\varvec{\varphi }- \varvec{\phi }), \varvec{x} \right) \left( \varvec{\varphi }- \varvec{\phi }\right) . \end{aligned}$$

By Proposition 3.6, \(\varvec{p}^\star (s)=\nabla _{\varvec{\phi }} {{\overline{\psi }}}_c \left( \varvec{\phi }+ s (\varvec{\varphi }- \varvec{\phi }), \varvec{x} \right) \) represents the unique solution of the maximization problem in (61). In addition, by ([159], Proposition 6), the Hessian of the smooth c-transform with respect to \(\varvec{\phi }\) can be computed from the Hessian of its convex conjugate as follows.

$$\begin{aligned}&\nabla _{\varvec{\phi }}^2 {{\overline{\psi }}}_c \left( \varvec{\phi }+ s (\varvec{\varphi }- \varvec{\phi }), \varvec{x} \right) = \left( \nabla ^2_{\varvec{p}} {{\overline{\psi }}}{}_c^*(\varvec{p}^\star (s), \varvec{x}) \right) ^{-1}\\&\quad = \mathrm {diag} \left( [F_1'(F_1^{-1}(1 - p_1^\star (s))), \dots , F_N'(F_N^{-1}(1 - p_N^\star (s))) ] \right) \end{aligned}$$

Hence, the first two derivatives of the auxiliary function u(s) simplify to

$$\begin{aligned} u'(s) = \sum _{i=1}^N (\varphi _i- \phi _i) p^\star _i(s) \quad \text {and} \quad u''(s) = \sum _{i=1}^N (\varphi _i- \phi _i)^2 F_i'(F_i^{-1}(1 - p_i^\star (s))).\end{aligned}$$

Similarly, the above formula for the Hessian of the smooth c-transform can be used to show that \((p_i^\star )'(s) = (\varphi _i- \phi _i) F_i'(F_i^{-1}(1 - p_i^\star (s)))\) for all \(i \in [N]\). The third derivative of u(s) therefore simplifies to

$$\begin{aligned} u'''(s)&= - \sum _{i=1}^N (\varphi _i- \phi _i)^2 \,\frac{ F_i''(F_i^{-1}(1 - p_i^\star (s)))}{F_i'(F_i^{-1}(1 - p_i^\star (s)))}\, (p_i^\star )'(s) \\&= - \sum _{i=1}^N (\varphi _i- \phi _i)^3 F_i''(F_i^{-1}(1 - p_i^\star (s))). \end{aligned}$$

This implies via Hölder’s inequality that

$$\begin{aligned} | u'''(s) |&= \left| \sum _{i=1}^N (\varphi _i- \phi _i)^2\, F_i'(F_i^{-1}(1 - p_i^\star (s))) \, \frac{F_i''(F_i^{-1}(1 - p_i^\star (s)))}{F_i'(F_i^{-1}(1 - p_i^\star (s)))} \, (\varphi _i- \phi _i) \right| \\&\le \left( \sum _{i=1}^N (\varphi _i- \phi _i)^2\, F_i'(F_i^{-1}(1 - p_i^\star (s))) \right) \left( \max _{i \in [N]} \left| \frac{F_i''(F_i^{-1}(1 - p_i^\star (s)))}{F_i'(F_i^{-1}(1 - p_i^\star (s)))} \, (\varphi _i- \phi _i) \right| \right) . \end{aligned}$$

Notice that the first term in the above expression coincides with \(u''(s)\), and the second term satisfies

$$\begin{aligned}&\max _{i \in [N]} \left| \frac{F_i''(F_i^{-1}(1 - p_i^\star (s)))}{F_i'(F_i^{-1}(1 - p_i^\star (s)))} \, (\varphi _i- \phi _i) \right| \\&\quad \le \max _{i \in [N]} \left| \frac{F_i''(F_i^{-1}(1 - p_i^\star (s)))}{F_i'(F_i^{-1}(1 - p_i^\star (s)))} \right| \, \Vert \varvec{\varphi }- \varvec{\phi }\Vert _\infty \le M \Vert \varvec{\varphi }- \varvec{\phi }\Vert , \end{aligned}$$

where the first inequality holds because \(\max _{i \in [N]} |a_i b_i| \le \Vert \varvec{a} \Vert _{\infty } \Vert \varvec{b} \Vert _\infty \) for all \(\varvec{a}, \varvec{b} \in \mathbb R^N\), and the second inequality follows from the definition of M and the fact that the 2-norm provides an upper bound on the \(\infty \)-norm. Combining the above results shows that \(|u'''(s)|\le M \Vert \varvec{\varphi }- \varvec{\phi }\Vert u''(s)\) for all \(s\in {\mathbb {R}}\). The claim now follows because \(\varvec{\phi }, \varvec{\varphi }\in {\mathbb {R}}^N\) and \(\varvec{x} \in \mathcal X\) were chosen arbitrarily. \(\square \)

Algorithm 1: Averaged SGD for the smooth optimal transport problem (12)

In the following we use the averaged SGD algorithm of Sect. 4.1 to solve the smooth optimal transport problem (12). A detailed description of this algorithm in pseudocode is provided in Algorithm 1. This algorithm repeatedly calls a sub-routine for estimating the gradient of \({{\overline{\psi }}}_c(\varvec{\phi }, \varvec{x})\) with respect to \(\varvec{\phi }\). By Proposition 3.6, this gradient coincides with the unique solution \(\varvec{p}^\star \) of the convex maximization problem (27). In addition, from the proof of Proposition 3.6 it is clear that its components are given by

$$\begin{aligned} p^\star _i = \theta ^\star \left[ i = \min \, \mathop {\mathrm{argmax}}\limits _{j \in [N]} \phi _j - c(\varvec{x}, \varvec{y}_j) + z_j \right] \quad \forall i \in [N], \end{aligned}$$

where \(\theta ^\star \) represents an optimizer of the semi-parametric discrete choice problem (11). Therefore, \(\varvec{p}^\star \) can be interpreted as a vector of choice probabilities under the best-case probability measure \(\theta ^\star \). Sometimes these choice probabilities are available in closed form. This is the case, for instance, in the exponential distribution model of Example 3.8, which is equivalent to the generalized extreme value distribution model of Sect. 3.1. Indeed, in this case \(\varvec{p}^\star \) is given by a softmax of the utility values \(\phi _i - c(\varvec{x}, \varvec{y_i})\), \(i\in [N]\), i.e.,

$$\begin{aligned} p_i^\star = \frac{\eta _i \exp \left( ({\phi _i - c(\varvec{x}, \varvec{y_i}) )}/{\lambda }\right) }{\sum _{j=1}^N \eta _j \exp \left( ({\phi _j - c(\varvec{x},\varvec{y_j}) })/{\lambda } \right) } \quad \forall i \in [N]. \end{aligned}$$
(62)

Note that these particular choice probabilities are routinely studied in the celebrated multinomial logit choice model ([16], § 5.1). The choice probabilities are also available in closed form in the uniform distribution model of Example 3.9. As the derivation of \(\varvec{p}^\star \) is somewhat cumbersome in this case, we relegate it to Appendix D. For general marginal ambiguity sets with continuous marginal distribution functions, we propose a bisection method to compute the gradient of the smooth c-transform numerically up to any prescribed accuracy; see Algorithm 2.
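To make the closed-form oracle (62) concrete, the following minimal Python sketch evaluates it in a numerically stable way (shifting all exponents by their maximum, which leaves the softmax unchanged); the function name and the array-based interface are our own illustration and not part of the accompanying MATLAB codes.

```python
import numpy as np

def softmax_choice_probabilities(phi, cost, eta, lam):
    """Closed-form gradient (62) of the smooth c-transform for the
    exponential distribution model: a softmax of the utilities
    phi_i - c(x, y_i), weighted by eta_i."""
    utilities = (np.asarray(phi) - np.asarray(cost)) / lam
    utilities -= utilities.max()          # shift for numerical stability
    weights = np.asarray(eta) * np.exp(utilities)
    return weights / weights.sum()        # p* lies in the probability simplex
```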

Theorem 4.9

(Biased gradient oracle) If \(\Theta \) is a marginal ambiguity set of the form (26) and the cumulative distribution function \(F_i\) is continuous for every \(i\in [N]\), then, for any \(\varvec{x} \in \mathcal X\), \(\varvec{\phi }\in {\mathbb {R}}^N\) and \(\varepsilon > 0\), Algorithm 2 outputs \(\varvec{p} \in {\mathbb {R}}^N \) with \(\Vert \varvec{p} \Vert \le 1\) and \(\Vert \nabla _{\varvec{\phi }} {{\overline{\psi }}}_c(\varvec{\phi }, \varvec{x}) - {\varvec{p}} \Vert \le \varepsilon \).

Proof

Thanks to Proposition 3.6, we can recast the smooth c-transform in dual form as

$$\begin{aligned} {{\overline{\psi }}}_c(\varvec{\phi }, \varvec{x})&= \min _{\begin{array}{c} \varvec{\zeta }\in {\mathbb {R}}_+^N \\ \tau \in {\mathbb {R}} \end{array}}\;\sup _{\varvec{p} \in {\mathbb {R}}^N} ~ \sum \limits _{i=1}^N (\phi _i - c(\varvec{x},\varvec{y_i}))p_i +\sum \limits _{i=1}^N \int ^1_{1- p_i} F_i^{-1}(t)\mathrm {d}t \\&\quad + \tau \left( \sum \limits _{i=1}^N p_i - 1 \right) + \sum \limits _{i=1}^N \zeta _i p_i. \end{aligned}$$

Strong duality and dual solvability hold because we may construct a Slater point for the primal problem by setting \(p_i=1/N\), \(i\in [N]\). By the Karush-Kuhn-Tucker optimality conditions, \(\varvec{p}^\star \) and \((\tau ^\star ,\varvec{\zeta }^\star )\) are therefore optimal in the primal and dual problems, respectively, if and only if we have

$$\begin{aligned} \begin{array}{lll} \sum _{i=1}^N p^\star _i =1, ~p^\star _i \ge 0 &{} \forall i \in [N] &{} \text {(primal feasibility)}\\ \zeta ^\star _i\ge 0 &{} \forall i \in [N] &{} \text {(dual feasibility)}\\ \zeta _i^\star p_i^\star =0 &{} \forall i \in [N] &{} \text {(complementary slackness)} \\ \phi _i-c(\varvec{x},\varvec{y_i}) + F_i^{-1}(1-p^\star _i) + \tau ^\star + \zeta ^\star _i = 0 &{} \forall i \in [N] &{} \text {(stationarity)}. \end{array} \end{aligned}$$

If \(p_i^\star > 0\), then the complementary slackness and stationarity conditions imply that \(\zeta _i^\star = 0\) and that \(\phi _i-c(\varvec{x},\varvec{y_i}) + F_i^{-1}(1-p^\star _i) + \tau ^\star = 0\), respectively. Thus, we have \(p_i^\star = 1 - F_i(c(\varvec{x},\varvec{y_i})-\phi _i -\tau ^\star )\). If \(p_i^\star = 0\), on the other hand, then similar arguments show that \(\zeta _i^\star \ge 0\) and \(\phi _i-c(\varvec{x},\varvec{y_i}) + F_i^{-1}(1) + \tau ^\star \le 0\). These two inequalities are equivalent to \(1 - F_i(c(\varvec{x},\varvec{y_i})-\phi _i -\tau ^\star ) \le 0\). As all values of \(F_i\) are smaller than or equal to 1, the last inequality must in fact hold as an equality. Combining the insights gained so far thus yields \(p_i^\star = 1 - F_i(c(\varvec{x},\varvec{y_i})-\phi _i -\tau ^\star )\), which holds for all \(i\in [N]\) irrespective of the sign of \(p_i^\star \). Primal feasibility therefore ensures that \(\sum _{i=1}^N 1 - F_i(c(\varvec{x},\varvec{y_i})-\phi _i -\tau ^\star ) = 1\). Finding the unique optimizer \(\varvec{p}^\star \) of (27) (i.e., finding the gradient of \( {{\overline{\psi }}}_c(\varvec{\phi }, \varvec{x})\)) is therefore tantamount to finding a root \(\tau ^\star \) of the univariate equation

$$\begin{aligned} \sum _{i=1}^N 1 - F_i(c(\varvec{x},\varvec{y_i})-\phi _i -\tau ) = 1. \end{aligned}$$
(63)

Note that the function on the left-hand side of (63) is continuous and non-decreasing in \(\tau \) because of the continuity (by assumption) and monotonicity (by definition) of the cumulative distribution functions \(F_i\), \(i\in [N]\). Hence, the root-finding problem can be solved efficiently via bisection. To complete the proof, we first show that the interval between the constants \({\underline{\tau }}\) and \({\overline{\tau }}\) defined in Algorithm 2 is guaranteed to contain \(\tau ^\star \). Specifically, we will demonstrate that evaluating the function on the left-hand side of (63) at \({{\underline{\tau }}}\) or \({{\overline{\tau }}}\) yields a value of at most 1 or at least 1, respectively. For \(\tau ={{\underline{\tau }}}\) we have

$$\begin{aligned}&1 - F_i(c(\varvec{x},\varvec{y_i})-\phi _i -{{\underline{\tau }}})\\&\quad = 1 - F_i \left( c(\varvec{x},\varvec{y_i})-\phi _i - \min _{j \in [N]} \left\{ c \left( \varvec{x}, \varvec{y_j} \right) - \phi _j -F_j^{-1}(1-1/N) \right\} \right) \\&\quad \le 1 - F_i \left( F_i^{-1}(1-1/N) \right) = 1 / N\qquad \forall i\in [N], \end{aligned}$$

where the inequality follows from the monotonicity of \(F_i\). Summing the above inequality over all \(i\in [N]\) then yields the desired inequality \(\sum _{i =1}^N 1 - F_i(c(\varvec{x},\varvec{y_i})-\phi _i -{{\underline{\tau }}}) \le 1\). Similarly, for \(\tau ={{\overline{\tau }}}\) we have

$$\begin{aligned}&1 - F_i(c(\varvec{x},\varvec{y_i})-\phi _i -{{\overline{\tau }}})\\&\quad = 1 - F_i \left( c(\varvec{x},\varvec{y_i})-\phi _i - \max _{j \in [N]} \left\{ c \left( \varvec{x}, \varvec{y_j} \right) - \phi _j -F_j^{-1}(1-1/N) \right\} \right) \\&\quad \ge 1 - F_i \left( F_i^{-1}(1-1/N) \right) = 1/N \qquad \forall i\in [N]. \end{aligned}$$

We may thus conclude that \(\sum _{i =1}^N 1 - F_i(c(\varvec{x},\varvec{y_i})-\phi _i -{{\overline{\tau }}}) \ge 1\). Therefore, \([{{\underline{\tau }}}, {{\overline{\tau }}}]\) constitutes a valid initial search interval for the bisection algorithm. Note that the function \(1 - F_i(c(\varvec{x},\varvec{y_i})-\phi _i -\tau )\), which defines \(p_i\) in terms of \(\tau \), is uniformly continuous in \(\tau \) throughout \(\mathbb R\). This follows from ([22], Problem 14.8) and our assumption that \(F_i\) is continuous. The uniform continuity ensures that the tolerance

$$\begin{aligned} \delta (\varepsilon ) = \min _{i \in [N]} \left\{ \max _\delta \left\{ \delta : | F_i(t_1) - F_i(t_2) | \le \varepsilon / \sqrt{N} ~~ \forall t_1,t_2\in {\mathbb {R}}\text { with } | t_1 - t_2 | \le \delta \right\} \right\} \end{aligned}$$
(64)

is strictly positive for every \(\varepsilon >0\). As the length of the search interval is halved in each iteration, Algorithm 2 outputs a near-optimal solution \(\tau \) with \(| \tau - \tau ^\star | \le \delta (\varepsilon )\) after \(\lceil \log _2 (({\overline{\tau }} - {\underline{\tau }}) / \delta (\varepsilon )) \rceil \) iterations. Moreover, the construction of \(\delta (\varepsilon )\) guarantees that \(|1 - F_i(c(\varvec{x},\varvec{y_i})-\phi _i -\tau ) - p_i^\star | \le \varepsilon / \sqrt{N}\) for all \(\tau \) with \(|\tau - \tau ^\star | \le \delta (\varepsilon )\). Therefore, the output \(\varvec{p}\in {\mathbb {R}}^N_+\) of Algorithm 2 satisfies \(|p_i - p_i^\star | \le \varepsilon / \sqrt{N} \) for each \(i\in [N]\), which in turn implies that \( \Vert \varvec{p} - \varvec{p}^\star \Vert \le \varepsilon \). By construction, finally, Algorithm 2 outputs \(\varvec{p}\ge \varvec{0}\) with \(\sum _{i \in [N]} p_i < 1\), which ensures that \(\Vert \varvec{p} \Vert \le 1\). Thus, the claim follows. \(\square \)

If all cumulative distribution functions \(F_i\), \(i\in [N]\), are Lipschitz continuous with a common Lipschitz constant \(L>0\), then the uniform continuity parameter \(\delta (\varepsilon )\) required in Algorithm 2 can simply be set to \(\delta (\varepsilon ) = \varepsilon / (L \sqrt{N})\). We are now ready to prove that Algorithm 1 offers different convergence guarantees depending on the continuity and smoothness properties of the marginal cumulative distribution functions.
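Since Algorithm 2 is not reproduced here, the following Python sketch illustrates the bisection scheme from the proof of Theorem 4.9 under the simplifying assumption that all \(F_i\) are Lipschitz continuous with a known common constant L, so that \(\delta (\varepsilon ) = \varepsilon / (L \sqrt{N})\); the per-marginal interfaces `F(i, t)` and `F_inv(i, t)` are hypothetical conveniences of this sketch.

```python
import numpy as np

def bisection_gradient(phi, cost, F, F_inv, L, eps):
    """Sketch of the bisection oracle of Theorem 4.9: approximate the
    gradient p* of the smooth c-transform by locating a root tau* of
    equation (63). F(i, t) and F_inv(i, t) evaluate the i-th marginal
    cdf and its quantile function (hypothetical interfaces)."""
    N = len(phi)

    def p(tau):
        # candidate gradient implied by the stationarity condition
        return np.array([1.0 - F(i, cost[i] - phi[i] - tau) for i in range(N)])

    # initial search interval [tau_low, tau_high] constructed as in the proof
    anchors = [cost[j] - phi[j] - F_inv(j, 1.0 - 1.0 / N) for j in range(N)]
    tau_low, tau_high = min(anchors), max(anchors)

    delta = eps / (L * np.sqrt(N))      # tolerance (64) in the Lipschitz case
    while tau_high - tau_low > delta:
        tau_mid = 0.5 * (tau_low + tau_high)
        if p(tau_mid).sum() < 1.0:      # left-hand side of (63) is non-decreasing in tau
            tau_low = tau_mid
        else:
            tau_high = tau_mid
    return p(tau_low)                   # entries are nonnegative and sum to at most 1
```

Returning the candidate at the left endpoint keeps the sum of the output's entries below 1, which matches the normalization claim at the end of the proof.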

Corollary 4.10

Use \(h(\varvec{\phi }) = {\mathbb {E}}_{\varvec{x} \sim \mu } [ \varvec{\nu }^\top \varvec{\phi }- {{\overline{\psi }}}_c(\varvec{\phi }, \varvec{x})]\) as a shorthand for the objective function of the smooth optimal transport problem (12), and let \(\varvec{\phi }^\star \) be a maximizer of (12). If \(\Theta \) is a marginal ambiguity set of the form (26) with cumulative distribution functions \(F_i\), \(i\in [N]\), then for any \(T \in \mathbb N\) and \({{\bar{\varepsilon }}}\ge 0\), the output \(\bar{\varvec{\phi }}_T = \frac{1}{T} \sum _{t=1}^{T} \varvec{\phi }_{t}\) of Algorithm 1 satisfies the following inequalities.

(i) If \(\gamma = 1 / (2 (2 + {{\bar{\varepsilon }}}) \sqrt{T})\) and \(F_i\) is continuous for every \(i\in [N]\), then we have

(ii) If \(\gamma = 1 / (2 \sqrt{T} + L)\) and \(F_i\) is Lipschitz continuous with Lipschitz constant \(L>0\) for every \(i\in [N]\), then we have

$$\begin{aligned} {{\overline{W}}}_c (\mu , \nu ) - {\mathbb {E}}\left[ h \big (\bar{\varvec{\phi }}_T \big ) \right]&\le \frac{L}{2T}\Vert \varvec{\phi }^\star \Vert ^2 + \frac{(2 + {{\bar{\varepsilon }}})^2}{\sqrt{T}} \Vert \varvec{\phi }^\star \Vert ^2 \\&\quad + \frac{{{\bar{\varepsilon }}}^2 + 2}{4 (2+{{\bar{\varepsilon }}})^2\sqrt{T}} + \frac{{{\bar{\varepsilon }}}}{\sqrt{T}} \sqrt{2 \Vert \varvec{\phi }^\star \Vert ^2 + \frac{37}{2(2 + {{\bar{\varepsilon }}})^2}}. \end{aligned}$$

(iii) If \(\gamma = 1 / (2 G^2 \sqrt{T}) \) with \(G = \max \{M, 2 + {{\bar{\varepsilon }}}\}\), \(F_i\) satisfies the generalized self-concordance condition (60) with \(M> 0\) for every \(i\in [N]\), and the smallest eigenvalue \(\kappa \) of \(-\nabla ^2_{\varvec{\phi }} h(\varvec{\phi }^\star )\) is strictly positive, then we have

Proof

Recall that problem (12) can be viewed as an instance of the convex minimization problem (36) provided that its objective function is negated. Throughout the proof we denote by \(\varvec{p}_t(\varvec{\phi }_t, \varvec{x}_t)\) the inexact estimate for \(\nabla _{\varvec{\phi }} {{\overline{\psi }}}_c(\varvec{\phi }_t, \varvec{x}_t)\) output by Algorithm 2 in iteration t of the averaged SGD algorithm. Note that

$$\begin{aligned} \left\| {\mathbb {E}}\left[ \varvec{\nu }- \varvec{p}_t(\varvec{\phi }_{t-1}, \varvec{x}_t) \big | \mathcal F_{t-1} \right] - \nabla h(\varvec{\phi }_{t-1}) \right\|&= \left\| {\mathbb {E}}\left[ \varvec{p}_t(\varvec{\phi }_{t-1}, \varvec{x}_t) - \nabla _{\varvec{\phi }} {{\overline{\psi }}}_c(\varvec{\phi }_{t-1}, \varvec{x}_t)\right] \right\| \\&\le {\mathbb {E}}\left[ \left\| \varvec{p}_t(\varvec{\phi }_{t-1}, \varvec{x}_t) - \nabla _{\varvec{\phi }} {{\overline{\psi }}}_c(\varvec{\phi }_{t-1}, \varvec{x}_t) \right\| \right] \\&\le \varepsilon _{t-1} \le \frac{{{\bar{\varepsilon }}}}{2 \sqrt{t}}, \end{aligned}$$

where the two inequalities follow from Jensen's inequality and the choice of \(\varepsilon _{t-1}\) in Algorithm 1, respectively. Jensen's inequality, the triangle inequality and Proposition 4.8 (i) further imply that

$$\begin{aligned} \left\| \nabla h(\varvec{\phi }) \right\| \le {\mathbb {E}}\left[ \left\| \varvec{\nu }- \nabla _{\varvec{\phi }} {{\overline{\psi }}}_c(\varvec{\phi }, \varvec{x}) \right\| \right] \le \left\| \varvec{\nu }\right\| + {\mathbb {E}}\left[ \left\| \nabla _{\varvec{\phi }} {{\overline{\psi }}}_c(\varvec{\phi }, \varvec{x}) \right\| \right] \le 2. \end{aligned}$$

Assertion (i) thus follows from Theorem 4.6 (i) with \(R=2\). As for assertion (ii), we have

$$\begin{aligned}&\; {\mathbb {E}}\left[ \left\| \varvec{\nu }- \varvec{p}_t(\varvec{\phi }_{t-1}, \varvec{x}_t) - \nabla h(\varvec{\phi }_{t-1}) \right\| ^2 | \mathcal F_{t-1} \right] \\&\quad = {\mathbb {E}}\left[ \left\| \varvec{p}_t(\varvec{\phi }_{t-1}, \varvec{x}_t) - {\mathbb {E}}\left[ \nabla _{\varvec{\phi }} {{\overline{\psi }}}_c(\varvec{\phi }_{t-1}, \varvec{x}_t) \right] \right\| ^2 | \mathcal F_{t-1} \right] \\&\quad = {\mathbb {E}}\left[ \left\| \varvec{p}_t(\varvec{\phi }_{t-1}, \varvec{x}_t) - \nabla _{\varvec{\phi }} {{\overline{\psi }}}_c(\varvec{\phi }_{t-1}, \varvec{x}_t) + \nabla _{\varvec{\phi }} {{\overline{\psi }}}_c(\varvec{\phi }_{t-1}, \varvec{x}_t) - {\mathbb {E}}\left[ \nabla _{\varvec{\phi }} {{\overline{\psi }}}_c(\varvec{\phi }_{t-1}, \varvec{x}_t) \right] \right\| ^2 | \mathcal F_{t-1} \right] \\&\quad \le {\mathbb {E}}\left[ 2 \left\| \varvec{p}_t(\varvec{\phi }_{t-1}, \varvec{x}_t) - \nabla _{\varvec{\phi }} {{\overline{\psi }}}_c(\varvec{\phi }_{t-1}, \varvec{x}_t) \right\| ^2 + 2 \left\| \nabla _{\varvec{\phi }} {{\overline{\psi }}}_c(\varvec{\phi }_{t-1}, \varvec{x}_t) - {\mathbb {E}}\left[ \nabla _{\varvec{\phi }} {{\overline{\psi }}}_c(\varvec{\phi }_{t-1}, \varvec{x}_t) \right] \right\| ^2 | \mathcal F_{t-1} \right] \\&\quad \le 2\varepsilon _{t-1}^2 + 2 \le {{\bar{\varepsilon }}}^2 + 2, \end{aligned}$$

where the second inequality holds because \(\nabla _{\varvec{\phi }} {{\overline{\psi }}}_c(\varvec{\phi }_{t-1}, \varvec{x}_t) \in \Delta ^N\) and thus \(\Vert \nabla _{\varvec{\phi }} {{\overline{\psi }}}_c(\varvec{\phi }_{t-1}, \varvec{x}_t) \Vert _2^2 \le 1\), while the last inequality follows from the choice of \(\varepsilon _{t-1}\) in Algorithm 1. As \({{\overline{\psi }}}_c(\varvec{\phi }, \varvec{x})\) is L-smooth with respect to \(\varvec{\phi }\) by virtue of Proposition 4.8 (ii), we further have

$$\begin{aligned} \Vert \nabla h(\varvec{\phi }) - \nabla h(\varvec{\phi }') \Vert = \left\| {\mathbb {E}}\left[ \nabla _{\varvec{\phi }} {{\overline{\psi }}}_c(\varvec{\phi }, \varvec{x}) - \nabla _{\varvec{\phi }} {{\overline{\psi }}}_c(\varvec{\phi }', \varvec{x}) \right] \right\| \le L \Vert \varvec{\phi }- \varvec{\phi }' \Vert \quad \forall \varvec{\phi }, \varvec{\phi }' \in {\mathbb {R}}^N. \end{aligned}$$

Assertion (ii) thus follows from Theorem 4.6 (ii) with \(R=2\) and \(\sigma = \sqrt{{{\bar{\varepsilon }}}^2 + 2}\). As for assertion (iii), finally, we observe that h is M-generalized self-concordant thanks to Proposition 4.8 (iii). Assertion (iii) thus follows from Theorem 4.6 (iii) with \(R=2\).

\(\square \)

One can show that the objective function of the smooth optimal transport problem (12) with marginal exponential noise distributions as described in Example 3.8 is generalized self-concordant. Hence, the convergence rate of Algorithm 1 for this model is of the order \(\mathcal O(1/T)\), which improves on the state-of-the-art \(\mathcal O(1/\sqrt{T})\) guarantee established by Genevay et al. [64].
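For concreteness, the three step-size regimes of Corollary 4.10 transcribe directly into code; the following Python helper (names and interface are our own) merely collects the formulas stated in parts (i)-(iii).

```python
import numpy as np

def step_size(T, regime, eps_bar=0.0, L=None, M=None):
    """Step sizes gamma prescribed by Corollary 4.10 (i)-(iii)."""
    if regime == "continuous":          # part (i): continuous F_i
        return 1.0 / (2.0 * (2.0 + eps_bar) * np.sqrt(T))
    if regime == "lipschitz":           # part (ii): L-Lipschitz F_i
        return 1.0 / (2.0 * np.sqrt(T) + L)
    if regime == "self_concordant":     # part (iii): M-generalized self-concordant F_i
        G = max(M, 2.0 + eps_bar)
        return 1.0 / (2.0 * G**2 * np.sqrt(T))
    raise ValueError("unknown regime")
```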

5 Numerical experiments

All experiments are run on a 2.6 GHz 6-Core Intel Core i7, and all optimization problems are implemented in MATLAB R2020a. The corresponding codes are available at https://github.com/RAO-EPFL/Semi-Discrete-Smooth-OT.git.

We now aim to assess the empirical convergence behavior of Algorithm 1 and to showcase the effects of regularization in semi-discrete optimal transport. To this end, we solve the original dual optimal transport problem (10) as well as its smooth variant (12) with a Fréchet ambiguity set corresponding to the exponential distribution model of Example 3.8, to the uniform distribution model of Example 3.9 and to the hyperbolic cosine distribution model of Example 3.11. Recall from Theorem 3.7 that any Fréchet ambiguity set is uniquely determined by a marginal generating function F and a probability vector \(\varvec{\eta }\). As for the exponential distribution model of Example 3.8, we set \(F(s) = \exp (10 s - 1)\) and \(\eta _i = 1/N\) for all \(i\in [N]\). In this case problem (12) is equivalent to the regularized primal optimal transport problem (13) with an entropic regularizer, and the gradient \(\nabla _{\varvec{\phi }}{{\overline{\psi }}}_c(\varvec{\phi }, \varvec{x})\), which is known to coincide with the vector \(\varvec{p}^\star \) of optimal choice probabilities in problem (27), admits the closed-form representation (62). We can therefore solve problem (12) with a variant of Algorithm 1 that calculates \(\nabla _{\varvec{\phi }}{{\overline{\psi }}}_c(\varvec{\phi }, \varvec{x})\) exactly instead of approximately via bisection.

As for the uniform distribution model of Example 3.9, we set \(F(s) = s / 20 + 1/2\) and \(\eta _i = 1/N\) for all \(i\in [N]\). In this case problem (12) is equivalent to the regularized primal optimal transport problem (13) with a \(\chi ^2\)-divergence regularizer, and the vector \(\varvec{p}^\star \) of optimal choice probabilities can be computed exactly and highly efficiently by sorting thanks to Proposition D.1 in the appendix. We can therefore again solve problem (12) with a variant of Algorithm 1 that calculates \(\nabla _{\varvec{\phi }}{{\overline{\psi }}}_c(\varvec{\phi }, \varvec{x})\) exactly. As for the hyperbolic cosine model of Example 3.11, we set \(F(s) = \sinh (10s - k)\) with \(k=\sqrt{2} - 1 - \text {arcsinh}(1)\) and \(\eta _i = 1/N\) for all \(i \in [N]\). In this case problem (12) is equivalent to the regularized primal optimal transport problem (13) with a hyperbolic divergence regularizer. However, the vector \(\varvec{p}^\star \) is not available in closed form, and thus we use Algorithm 2 to compute \(\varvec{p}^\star \) approximately. Lastly, note that the original dual optimal transport problem (10) can be interpreted as an instance of (12) equipped with a degenerate singleton ambiguity set that only contains the Dirac measure at the origin of \({\mathbb {R}}^N\). In this case \({{\overline{\psi }}}_c(\varvec{\phi },\varvec{x}) = \psi _c(\varvec{\phi },\varvec{x})\) fails to be smooth in \(\varvec{\phi }\), but an exact subgradient \(\varvec{p}^\star \in \partial _{\varvec{\phi }} {{\overline{\psi }}}_c(\varvec{\phi },\varvec{x})\) is given by

$$\begin{aligned} p_i^\star = {\left\{ \begin{array}{ll} 1 \quad &{}\text {if } i = \min \, \mathop {\mathrm{argmax}}\limits _{i \in [N]}~\phi _i - c(\varvec{x}, \varvec{y}_i),\\ 0 &{}\text {otherwise.} \end{array}\right. } \end{aligned}$$

We can therefore solve problem (10) with a variant of Algorithm 1 that has access to exact subgradients (instead of gradients) of \({{\overline{\psi }}}_c(\varvec{\phi }, \varvec{x})\). Note that the maximizer \(\varvec{\phi }^\star \) of (10) may not be unique. In our experiments, we force Algorithm 1 to converge to the maximizer with minimal Euclidean norm by adding a vanishingly small Tikhonov regularization term to \(\psi _c(\varvec{\phi },\varvec{x})\). Thus, we set \({{\overline{\psi }}}_c(\varvec{\phi },\varvec{x}) = \psi _c(\varvec{\phi },\varvec{x}) + \varepsilon \Vert \varvec{\phi }\Vert _2^2\) for some small regularization weight \(\varepsilon > 0\), in which case \(\varvec{p}^\star +2\varepsilon \varvec{\phi }\in \partial _{\varvec{\phi }}{{\overline{\psi }}}_c(\varvec{\phi },\varvec{x})\) is an exact subgradient.
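A minimal Python sketch of this exact subgradient oracle, including the Tikhonov correction, could read as follows; the function name and interface are our own illustration.

```python
import numpy as np

def exact_subgradient(phi, x, cost_fn, ys, eps_reg=0.0):
    """Exact subgradient of the (Tikhonov-corrected) c-transform at (phi, x):
    a unit vector at the smallest index maximizing phi_i - c(x, y_i),
    plus the term 2 * eps_reg * phi from the regularizer eps * ||phi||^2."""
    scores = np.asarray(phi) - np.array([cost_fn(x, y) for y in ys])
    p = np.zeros(len(scores))
    p[int(np.argmax(scores))] = 1.0   # np.argmax picks the smallest maximizing index
    return p + 2.0 * eps_reg * np.asarray(phi)
```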

In the following we set \(\mu \) to the standard Gaussian measure on \(\mathcal X= {\mathbb {R}}^2\) and \(\nu \) to the uniform measure on 10 independent samples drawn uniformly from \(\mathcal Y=[-1,\, 1]^2\). We further set the transportation cost to \(c(\varvec{x}, \varvec{y}) = \Vert \varvec{x} - \varvec{y}\Vert _\infty \). Under these assumptions, we use Algorithm 1 to solve the original as well as the three smooth optimal transport problems approximately for \(T=1,\ldots , 10^5\). For each fixed T the step size is selected in accordance with Corollary 4.10.
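Since Algorithm 1 is stated only in pseudocode above, we sketch the experimental loop in Python below; the plain ascent update \(\varvec{\phi }_t = \varvec{\phi }_{t-1} + \gamma (\varvec{\nu }- \varvec{p}_t)\) and the generic oracle `grad_oracle` reflect our reading of the algorithm's description rather than a verbatim transcription.

```python
import numpy as np

def averaged_sgd(nu, sample_mu, grad_oracle, T, gamma, eps_bar=0.0):
    """Sketch of an averaged SGD loop for the smooth dual problem (12).

    grad_oracle(phi, x, eps) must return an eps-accurate (sub)gradient of
    the smooth c-transform at (phi, x)."""
    N = len(nu)
    phi = np.zeros(N)                          # initial dual iterate
    phi_bar = np.zeros(N)                      # running average of the iterates
    for t in range(1, T + 1):
        x = sample_mu()                        # draw a fresh sample x_t ~ mu
        eps_t = eps_bar / (2.0 * np.sqrt(t))   # oracle accuracy used in the proof of Cor. 4.10
        p = grad_oracle(phi, x, eps_t)         # estimate of grad_phi psi_c(phi, x)
        phi = phi + gamma * (nu - p)           # stochastic ascent step on the dual objective
        phi_bar += (phi - phi_bar) / t         # incremental mean, phi_bar_T = (1/T) sum phi_t
    return phi_bar
```

For the experiment above one would, for instance, pass `nu = np.ones(10) / 10`, `sample_mu = lambda: np.random.randn(2)` and one of the oracles sketched earlier as `grad_oracle`.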

We emphasize that Corollary 4.10 (i) remains valid if \({{\overline{\psi }}}_c(\varvec{\phi },\varvec{x})\) fails to be smooth in \(\varvec{\phi }\) and we only have access to subgradients; see ([116], Corollary 1). Denoting by \(\bar{\varvec{\phi }}_T\) the output of Algorithm 1, we record the suboptimality

$$\begin{aligned} {{\overline{W}}}_c(\mu , \nu ) - {\mathbb {E}}_{\varvec{x} \sim \mu } \left[ \varvec{\nu }^\top \bar{\varvec{\phi }}_T - {{\overline{\psi }}}_c(\bar{\varvec{\phi }}_T , \varvec{x})\right] \end{aligned}$$

of \(\bar{\varvec{\phi }}_T\) in (12) as well as the discrepancy \(\Vert \bar{\varvec{\phi }}_T - \varvec{\phi }^\star \Vert ^2_2\) between \(\bar{\varvec{\phi }}_T\) and the exact maximizer \(\varvec{\phi }^\star \) of problem (12) as a function of T. In order to faithfully measure the convergence rate of \(\bar{\varvec{\phi }}_T\) and its suboptimality, we need to compute \(\varvec{\phi }^\star \) as well as \({{\overline{W}}}_c(\mu , \nu )\) to within high accuracy. This is only possible if the dimension of \(\mathcal X\) is small (e.g., if \(\mathcal X= {\mathbb {R}}^2\) as in our numerical example), even though Algorithm 1 can efficiently solve optimal transport problems in high dimensions. We obtain high-quality approximations for \({{\overline{W}}}_c(\mu , \nu )\) and \(\varvec{\phi }^\star \) by solving the finite-dimensional optimal transport problem between \(\nu \) and the discrete distribution that places equal weight on \(10 \times T\) samples drawn independently from \(\mu \). Note that only the first T of these samples are used by Algorithm 1. The high-quality approximations of the entropic and \(\chi ^2\)-divergence regularized optimal transport problems are conveniently computed via Nesterov's accelerated gradient descent method, where the suboptimality gap of the \(t^{\text {th}}\) iterate is guaranteed to decay as \(\mathcal O(1/ t^2)\) under the step size rule advocated in ([114], Theorem 1). To the best of our knowledge, Nesterov's accelerated gradient descent algorithm is not guaranteed to converge with inexact gradients. For the hyperbolic divergence regularized optimal transport problem, we thus use Algorithm 1 with \(50 \times T\) iterations to obtain an approximation for \({{\overline{W}}}_c(\mu , \nu )\) and \(\varvec{\phi }^\star \). In contrast, we compute the high-quality approximation of the original optimal transport problem (10) by modeling it in YALMIP [95] and solving it with MOSEK. If this problem has multiple maximizers, we report the one with minimal Euclidean norm.
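For completeness, the marginal generating functions F of the three regularization models above, together with their elementary inverses (as used, e.g., by the bisection oracle), transcribe directly from the specifications given earlier. The dictionary layout below is our own, and we do not address the construction of the marginals \(F_i\) from F and \(\varvec{\eta }\) via Theorem 3.7.

```python
import numpy as np

K = np.sqrt(2.0) - 1.0 - np.arcsinh(1.0)   # the constant k of the hyperbolic model

MARGINAL_GENERATORS = {
    # model name: (F, inverse of F), transcribed from the experimental setup
    "exponential": (lambda s: np.exp(10.0 * s - 1.0),
                    lambda t: (1.0 + np.log(t)) / 10.0),
    "uniform":     (lambda s: s / 20.0 + 0.5,
                    lambda t: 20.0 * (t - 0.5)),
    "hyperbolic":  (lambda s: np.sinh(10.0 * s - K),
                    lambda t: (np.arcsinh(t) + K) / 10.0),
}
```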

Fig. 1: Suboptimality (a) and discrepancy to \(\varvec{\phi }^\star \) (b) of the outputs \(\bar{\varvec{\phi }}_T\) of Algorithm 1 for the original (blue), the entropic regularized (orange), the \(\chi ^2\)-divergence regularized (red) and the hyperbolic divergence regularized (purple) optimal transport problems

Figure 1 shows how the suboptimality of \(\bar{\varvec{\phi }}_T\) and the discrepancy between \(\bar{\varvec{\phi }}_T\) and the exact maximizer decay with T, both for the original as well as for the entropic, the \(\chi ^2\)-divergence and the hyperbolic divergence regularized optimal transport problems, averaged across 20 independent simulation runs. Figure 1a suggests that the suboptimality decays as \(\mathcal O(1/\sqrt{T})\) for the original optimal transport problem, which is in line with the theoretical guarantees by Nesterov and Vial ([116], Corollary 1), and as \(\mathcal O(1/ T)\) for the entropic, the \(\chi ^2\)-divergence and the hyperbolic divergence regularized optimal transport problems, which is consistent with the theoretical guarantees established in Corollary 4.10. Indeed, entropic regularization can be explained by the exponential distribution model of Example 3.8, where the exponential distribution functions \(F_i\) satisfy the generalized self-concordance condition (60) with \(M =1/ \lambda \). Similarly, \(\chi ^2\)-divergence regularization can be explained by the uniform distribution model of Example 3.9, where the uniform distribution functions \(F_i\) satisfy the generalized self-concordance condition with any \(M > 0\). Finally, hyperbolic divergence regularization can be explained by the hyperbolic cosine distribution model of Example 3.11, where the hyperbolic cosine distribution functions \(F_i\) satisfy the generalized self-concordance condition with \(M = 1/\lambda \). In all cases the smallest eigenvalue of \(-\nabla _{\varvec{\phi }}^2 {\mathbb {E}}_{\varvec{x} \sim \mu } [\varvec{\nu }^\top \varvec{\phi }^\star - {\overline{\psi }}_{c}(\varvec{\phi }^\star , \varvec{x})]\), which we estimate when solving the high-quality approximations of the three smooth optimal transport problems, is strictly positive. Therefore, Corollary 4.10 (iii) is indeed applicable and guarantees that the suboptimality gap is bounded above by \(\mathcal O (1/T)\).

Finally, Fig. 1b suggests that \(\Vert \bar{\varvec{\phi }}_T - \varvec{\phi }^\star \Vert ^2_2\) converges to 0 at rate \(\mathcal O(1/T)\) for the entropic, the \(\chi ^2\)-divergence and the hyperbolic divergence regularized optimal transport problems, which is consistent with ([14], Proposition 10).