Semi-discrete optimal transport: hardness, regularization and numerical solution

Semi-discrete optimal transport problems, which evaluate the Wasserstein distance between a discrete and a generic (possibly non-discrete) probability measure, are believed to be computationally hard. Even though such problems are ubiquitous in statistics, machine learning and computer vision, this perception has not yet received a theoretical justification. To fill this gap, we prove that computing the Wasserstein distance between a discrete probability measure supported on two points and the Lebesgue measure on the standard hypercube is already #P-hard. This insight prompts us to seek approximate solutions for semi-discrete optimal transport problems. We thus perturb the underlying transportation cost with an additive disturbance governed by an ambiguous probability distribution, and we introduce a distributionally robust dual optimal transport problem whose objective function is smoothed with the most adverse disturbance distributions from within a given ambiguity set. We further show that smoothing the dual objective function is equivalent to regularizing the primal objective function, and we identify several ambiguity sets that give rise to several known and new regularization schemes. As a byproduct, we discover an intimate relation between semi-discrete optimal transport problems and discrete choice models traditionally studied in psychology and economics. To solve the regularized optimal transport problems efficiently, we use a stochastic gradient descent algorithm with imprecise stochastic gradient oracles. A new convergence analysis reveals that this algorithm improves the best known convergence guarantee for semi-discrete optimal transport problems with entropic regularizers.


Introduction
Optimal transport theory has a long and distinguished history in mathematics dating back to the seminal work of Monge [1781] and Kantorovich [1942]. While originally envisaged for applications in civil engineering, logistics and economics, optimal transport problems provide a natural framework for comparing probability measures and have therefore recently found numerous applications in statistics and machine learning. Indeed, the minimum cost of transforming a probability measure μ on 𝒳 to some other probability measure ν on 𝒴 with respect to a prescribed cost function on 𝒳 × 𝒴 can be viewed as a measure of distance between μ and ν. If 𝒳 = 𝒴 and the cost function coincides with (the p-th power of) a metric on 𝒳 × 𝒳, then the resulting optimal transport distance represents (the p-th power of) a Wasserstein metric on the space of probability measures over 𝒳 [Villani, 2008]. In the remainder of this paper we distinguish discrete, semi-discrete and continuous optimal transport problems in which either both, only one or none of the two probability measures μ and ν are discrete, respectively.
The discrete optimal transport problem represents a tractable linear program that is susceptible to the network simplex algorithm [Orlin, 1997]. Alternatively, it can be addressed with dual ascent methods [Bertsimas and Tsitsiklis, 1997], the Hungarian algorithm for assignment problems [Kuhn, 1955] or customized auction algorithms [Bertsekas, 1981, 1992]. The currently best known complexity bound for computing an exact solution is attained by modern interior-point algorithms. Indeed, if N denotes the number of atoms in μ or in ν, whichever is larger, then the discrete optimal transport problem can be solved in time Õ(N^2.5) with an interior point algorithm by Lee and Sidford [2014], where the soft-O notation Õ(·) hides polylogarithmic factors. The need to evaluate optimal transport distances between increasingly fine-grained histograms has also motivated efficient approximation schemes. Blanchet et al. [2018] and Quanrud [2019] show that an ε-optimal solution can be found in time Õ(N²/ε) by reducing the discrete optimal transport problem to a matrix scaling or a positive linear programming problem, which can be solved efficiently by a Newton-type algorithm. Jambulapati et al. [2019] describe a parallelizable primal-dual first-order method that achieves a similar convergence rate.
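To make the assignment-problem connection mentioned above concrete, discrete optimal transport between two uniform measures on N atoms each reduces, by the Birkhoff-von Neumann theorem, to a minimum-cost assignment. A brute-force sketch over permutations for tiny N (the cost matrix below is an illustrative assumption, and in practice the Hungarian algorithm would replace the enumeration):

```python
# Discrete OT with uniform marginals on N atoms each reduces to a
# min-cost assignment; brute force over permutations for tiny N.
import itertools
import numpy as np

def ot_uniform(C):
    """(1/N) * min over permutations sigma of sum_i C[i, sigma(i)]."""
    N = C.shape[0]
    best = min(sum(C[i, s[i]] for i in range(N))
               for s in itertools.permutations(range(N)))
    return best / N

# Toy cost matrix: atoms on a line, |i - j|-type costs.
C = np.array([[0.0, 2.0, 4.0],
              [2.0, 0.0, 2.0],
              [4.0, 2.0, 0.0]])
cost = ot_uniform(C)   # identity permutation is optimal here
```

The enumeration is exponential in N; it merely illustrates the combinatorial structure that the network simplex and auction algorithms exploit at scale.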
The tractability of the discrete optimal transport problem can be improved by adding an entropy regularizer to its objective function, which penalizes the entropy of the transportation plan for morphing μ into ν. When the weight of the regularizer grows, this problem reduces to the classical Schrödinger bridge problem of finding the most likely random evolution from μ to ν [Schrödinger, 1931]. Generic linear programs with entropic regularizers were first studied by Fang [1992]. Cominetti and San Martín [1994] prove that the optimal values of these regularized problems converge exponentially fast to the optimal values of the corresponding unregularized problems as the regularization weight drops to zero. Nonasymptotic convergence rates for entropy regularized linear programs are derived by Weed [2018]. Cuturi [2013] was the first to realize that entropic penalties are computationally attractive because they make the discrete optimal transport problem susceptible to a fast matrix scaling algorithm by Sinkhorn [1967].
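Sinkhorn's matrix scaling algorithm alternates row and column rescalings of the Gibbs kernel until the marginals match. A minimal NumPy sketch (the regularization weight eta, the iteration count and the toy data are illustrative assumptions):

```python
# Sinkhorn iteration for entropy-regularized discrete OT (minimal sketch).
import numpy as np

def sinkhorn(mu, nu, C, eta=0.05, iters=2000):
    K = np.exp(-C / eta)                 # Gibbs kernel
    u = np.ones_like(mu)
    for _ in range(iters):
        u = mu / (K @ (nu / (K.T @ u)))  # alternate row/column scalings
    v = nu / (K.T @ u)
    return u[:, None] * K * v[None, :]   # regularized transportation plan

mu = np.array([0.5, 0.5])
nu = np.array([0.5, 0.5])
C = np.array([[0.0, 1.0], [1.0, 0.0]])
P = sinkhorn(mu, nu, C)
```

The returned plan matches the prescribed marginals to high accuracy, but every entry is strictly positive, which foreshadows the density issue discussed below.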
This insight has spurred widespread interest in machine learning and led to a host of new applications of optimal transport in color transfer [Chizat et al., 2018], inverse problems [Karlsson and Ringh, 2017, Adler et al., 2017], texture synthesis, the analysis of crowd evolutions [Peyré, 2015] and shape interpolation [Solomon et al., 2015], to name a few. This surge of applications inspired in turn several new algorithms for the entropy regularized discrete optimal transport problem such as a greedy dual coordinate descent method also known as the Greenkhorn algorithm [Altschuler et al., 2017, Chakrabarty and Khanna, 2020, Abid and Gower, 2018]. Dvurechensky et al. [2018] and Lin et al. [2019b] prove that both the Sinkhorn and the Greenkhorn algorithms are guaranteed to find an ε-optimal solution in time Õ(N²/ε²). In practice, however, the Greenkhorn algorithm often outperforms the Sinkhorn algorithm [Lin et al., 2019b].
The runtime guarantee of both algorithms can be improved to Õ(N^{7/3}/ε) via a randomization scheme [Lin et al., 2019a]. In addition, the regularized discrete optimal transport problem can be addressed by tailoring general-purpose optimization algorithms such as accelerated gradient descent algorithms [Dvurechensky et al., 2018], iterative Bregman projections [Benamou et al., 2015], quasi-Newton methods or stochastic average gradient descent algorithms [Genevay et al., 2016]. While the original optimal transport problem induces sparse solutions, the entropy penalty forces the optimal transportation plan of the regularized optimal transport problem to be strictly positive and thus completely dense. In applications where the interpretability of the optimal transportation plan is important, the lack of sparsity could be undesirable; examples include color transfer [Pitié et al., 2007], domain adaptation [Courty et al., 2016] or ecological inference [Muzellec et al., 2017]. Hence, there is merit in exploring alternative regularization schemes that retain the attractive computational properties of the entropic regularizer but induce sparsity.
Much like discrete optimal transport problems, the significantly more challenging semi-discrete optimal transport problems emerge in numerous applications including variational inference [Ambrogioni et al., 2018], blue noise sampling [Qin et al., 2017], computational geometry [Lévy, 2015], image quantization [De Goes et al., 2012] and deep learning with generative adversarial networks [Genevay et al., 2018, Gulrajani et al., 2017]. Semi-discrete optimal transport problems are also used in fluid mechanics to simulate incompressible fluids.
Exact solutions of a semi-discrete optimal transport problem can be constructed by solving an incompressible Euler-type partial differential equation discovered by Brenier [1991]. Any optimal solution is known to partition the support of the non-discrete measure into cells corresponding to the atoms of the discrete measure [Aurenhammer et al., 1998], and the resulting tessellation is usually referred to as a power diagram. Mirebeau [2015] uses this insight to solve Monge-Ampère equations with a damped Newton algorithm, and Kitagawa et al. [2016] show that a closely related algorithm with a global linear convergence rate lends itself for the numerical solution of generic semi-discrete optimal transport problems. In addition, Mérigot [2011] proposes a quasi-Newton algorithm for semi-discrete optimal transport, which improves a method due to Aurenhammer et al. [1998] by exploiting Lloyd's algorithm to iteratively simplify the discrete measure. If the transportation cost is quadratic, Bonnotte [2013] relates the optimal transportation plan to the Knothe-Rosenblatt rearrangement for mapping μ to ν, which is very easy to compute.
As usual, regularization improves tractability. Genevay et al. [2016] show that the dual of a semi-discrete optimal transport problem with an entropic regularizer is susceptible to an averaged stochastic gradient descent algorithm that enjoys a convergence rate of O(1/√T), T being the number of iterations. Altschuler et al. [2022] show that the optimal value of the entropically regularized problem converges to the optimal value of the unregularized problem at a quadratic rate as the regularization weight drops to zero. Improved error bounds under stronger regularity conditions are derived by Delalande [2021].
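In the entropic case, the smoothed dual objective admits cheap stochastic gradients, so an averaged SGD scheme of the kind discussed above takes only a few lines. The following sketch assumes μ = Uniform[0, 1], two atoms with quadratic cost, step sizes 1/√t and Polyak-Ruppert averaging; all of these choices are illustrative and not the paper's experimental setup.

```python
# Averaged SGD on the entropically smoothed semi-discrete dual (sketch).
import numpy as np

rng = np.random.default_rng(0)
y = np.array([0.25, 0.75])      # atoms of the discrete measure nu (assumed)
nu = np.array([0.5, 0.5])       # their probabilities
eps = 0.1                       # entropic regularization weight (assumed)

v = np.zeros(2)                 # dual potentials
v_avg = np.zeros(2)             # Polyak-Ruppert average
T = 20000
for t in range(1, T + 1):
    x = rng.random()                        # sample x ~ mu
    c = (x - y) ** 2                        # transport costs to the atoms
    z = (v - c) / eps
    p = nu * np.exp(z - z.max())
    p /= p.sum()                            # logit choice probabilities
    v += (1.0 / np.sqrt(t)) * (nu - p)      # stochastic gradient ascent step
    v_avg += (v - v_avg) / t                # running average of the iterates

# By the symmetry of this setup, the two averaged potentials nearly coincide.
```

Each step touches a single sample of μ, which is exactly what makes the stochastic approach attractive when μ is only accessible through sampling.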
Continuous optimal transport problems constitute difficult variational problems involving infinitely many variables and constraints. Benamou and Brenier [2000] recast them as boundary value problems in fluid dynamics, and discretized versions of these reformulations can be solved using first-order methods. For a comprehensive survey of the interplay between partial differential equations and optimal transport we refer to Evans [1997]. As nearly all numerical methods for partial differential equations suffer from a curse of dimensionality, current research focuses on solution schemes for regularized continuous optimal transport problems. For instance, Genevay et al. [2016] embed their duals into a reproducing kernel Hilbert space to obtain finite-dimensional optimization problems that can be solved with a stochastic gradient descent algorithm. Regularized continuous optimal transport problems can also be solved by representing the transportation plan as a multilayer neural network. This approach results in finite-dimensional optimization problems that are non-convex and offer no approximation guarantees.
However, it provides an effective means to compute approximate solutions in high dimensions. Indeed, the optimal value of the entropically regularized continuous optimal transport problem is known to converge to the optimal value of the unregularized problem at a linear rate as the regularization weight drops to zero [Chizat et al., 2020, Conforti and Tamanini, 2021, Erbar et al., 2015, Pal, 2019]. Due to a lack of efficient algorithms, applications of continuous optimal transport problems are scarce in the extant literature. Peyré and Cuturi [2019] provide a comprehensive survey of numerous applications and solution methods for discrete, semi-discrete and continuous optimal transport problems.
This paper focuses on semi-discrete optimal transport problems. Our main goal is to formally establish that these problems are computationally hard, to propose a unifying regularization scheme for improving their tractability and to develop efficient algorithms for solving the resulting regularized problems, assuming only that we have access to independent samples from the continuous probability measure μ. Our regularization scheme is based on the observation that any dual semi-discrete optimal transport problem maximizes the expectation of a piecewise affine function with N pieces, where the expectation is evaluated with respect to μ, and where N denotes the number of atoms of the discrete probability measure ν. We argue that this piecewise affine function can be interpreted as the optimal value of a discrete choice problem, which can be smoothed by adding random disturbances to the underlying utility values [Thurstone, 1927, McFadden, 1974]. As probabilistic discrete choice problems are routinely studied in economics and psychology, we can draw on a wealth of literature in choice theory to design various smooth (dual) optimal transport problems with favorable numerical properties. For maximal generality we will also study semi-parametric discrete choice models where the disturbance distribution is itself subject to uncertainty [Natarajan et al., 2009, Mishra et al., 2014, Feng et al., 2017, Ahipaşaoğlu et al., 2018]. Specifically, we aim to evaluate the best-case (maximum) expected utility across a Fréchet ambiguity set containing all disturbance distributions with prescribed marginals. Such models can be addressed with customized methods from modern distributionally robust optimization [Natarajan et al., 2009]. For Fréchet ambiguity sets, we prove that smoothing the dual objective is equivalent to regularizing the primal objective of the semi-discrete optimal transport problem.
The corresponding regularizer penalizes the discrepancy between the chosen transportation plan and the product measure μ ⊗ ν with respect to a divergence measure constructed from the marginal disturbance distributions. Connections between primal regularization and dual smoothing were previously recognized by Paty and Cuturi [2020], among others, in discrete optimal transport and by Genevay et al. [2016] in semi-discrete optimal transport. As they are constructed ad hoc or under a specific adversarial noise model, these existing regularization schemes lack the intuitive interpretation offered by discrete choice theory and emerge as special cases of our unifying scheme.
The key contributions of this paper are summarized below.
i. We study the computational complexity of semi-discrete optimal transport problems. Specifically, we prove that computing the optimal transport distance between two probability measures μ and ν on the same Euclidean space is #P-hard even if only approximate solutions are sought, even if μ is the Lebesgue measure on the standard hypercube and even if ν is supported on merely two points.
ii. We propose a unifying framework for regularizing semi-discrete optimal transport problems by leveraging ideas from distributionally robust optimization and discrete choice theory [Natarajan et al., 2009, Mishra et al., 2014, Feng et al., 2017, Ahipaşaoğlu et al., 2018]. Specifically, we perturb the transportation cost to every atom of the discrete measure ν with a random disturbance, and we assume that the vector of all disturbances is governed by an uncertain probability distribution from within a Fréchet ambiguity set that prescribes the marginal disturbance distributions. Solving the dual optimal transport problem under the least favorable disturbance distribution in the ambiguity set amounts to smoothing the dual and regularizing the primal objective function. We show that numerous known and new regularization schemes emerge as special cases of this framework, and we derive a priori approximation bounds for the resulting regularized optimal transport problems.
iii. We derive new convergence guarantees for an averaged stochastic gradient descent (SGD) algorithm that only has access to a biased stochastic gradient oracle. Specifically, we prove that this algorithm enjoys a convergence rate of O(1/√T) for Lipschitz continuous and of O(1/T) for generalized self-concordant objective functions, where T denotes the number of iterations. We also show that this algorithm lends itself to solving the smooth dual optimal transport problems obtained from the proposed regularization scheme. When the smoothing is based on a semi-parametric discrete choice model with a Fréchet ambiguity set, the algorithm's convergence rate depends on the smoothness properties of the marginal noise distributions, and its per-iteration complexity depends on our ability to compute the optimal choice probabilities. We demonstrate that these choice probabilities can indeed be computed efficiently via bisection or sorting, and in special cases they are even available in closed form. As a byproduct, we show that our algorithm can improve the state-of-the-art O(1/√T) convergence guarantee of Genevay et al. [2016] for the semi-discrete optimal transport problem with an entropic regularizer.
The rest of this paper unfolds as follows. In Section 2 we study the computational complexity of semi-discrete optimal transport problems, and in Section 3 we develop our unifying regularization scheme. In Section 4 we analyze the convergence rate of an averaged SGD algorithm with a biased stochastic gradient oracle that can be used for solving smooth dual optimal transport problems, and in Section 5 we compare its empirical convergence behavior against the theoretical convergence guarantees.

Hardness of Computing Optimal Transport Distances
If 𝒳 and 𝒴 are closed subsets of finite-dimensional Euclidean spaces and c : 𝒳 × 𝒴 → [0, +∞] is a lower-semicontinuous cost function, then the Monge-Kantorovich optimal transport distance between two probability measures μ ∈ 𝒫(𝒳) and ν ∈ 𝒫(𝒴) is defined as

W_c(μ, ν) = min_{π ∈ Π(μ,ν)} ∫_{𝒳×𝒴} c(x, y) π(dx, dy),   (1)

where Π(μ, ν) denotes the family of all couplings of μ and ν, that is, the set of all probability measures on 𝒳 × 𝒴 with marginals μ on 𝒳 and ν on 𝒴. One can show that the minimum in (1) is always attained [Villani, 2008, Theorem 4.1]. If 𝒳 = 𝒴 is a metric space with metric d : 𝒳 × 𝒳 → R₊ and the transportation cost is defined as c(x, y) = d(x, y)^p for some p ≥ 1, then W_c(μ, ν)^{1/p} is termed the p-th Wasserstein distance between μ and ν. The optimal transport problem (1) constitutes an infinite-dimensional linear program over measures and admits a strong dual linear program over functions [Villani, 2008, Theorem 5.9].
Proposition 2.1 (Kantorovich duality). The optimal transport distance between μ ∈ 𝒫(𝒳) and ν ∈ 𝒫(𝒴) admits the dual representation

W_c(μ, ν) = sup { ∫_𝒳 φ(x) μ(dx) + ∫_𝒴 ψ(y) ν(dy) : φ(x) + ψ(y) ≤ c(x, y) for all (x, y) ∈ 𝒳 × 𝒴 }.   (2)

The linear program (2) optimizes over the two Kantorovich potentials φ ∈ ℒ¹(𝒳, μ) and ψ ∈ ℒ¹(𝒴, ν), but it can be reformulated as the following non-linear program over a single potential function,

W_c(μ, ν) = sup_{ψ ∈ ℒ¹(𝒴, ν)} ∫_𝒳 ψ^c(x) μ(dx) + ∫_𝒴 ψ(y) ν(dy),   (3)

where ψ^c : 𝒳 → [−∞, +∞] is called the c-transform of ψ and is defined through

ψ^c(x) = inf_{y ∈ 𝒴} c(x, y) − ψ(y);   (4)

see Villani [2008, § 5] for details. The Kantorovich duality is the key enabling mechanism to study the computational complexity of the optimal transport problem (1).
Theorem 2.2 (Hardness of computing optimal transport distances). Computing W_c(μ, ν) is #P-hard even if 𝒳 = 𝒴 = R^d, c(x, y) = ‖x − y‖^p for some p ≥ 1, μ is the Lebesgue measure on the standard hypercube [0, 1]^d, and ν is a discrete probability measure supported on only two points.
To prove Theorem 2.2, we will show that computing the optimal transport distance W_c(μ, ν) is at least as hard as computing the volume of the knapsack polytope P(w, b) = {x ∈ [0, 1]^d : w^⊤x ≤ b} for a given w ∈ R^d₊ and b ∈ R₊, which is known to be #P-hard [Dyer and Frieze, 1988, Theorem 1]. Specifically, we will leverage the following variant of this hardness result, which establishes that approximating the volume of the knapsack polytope P(w, b) to a sufficiently high accuracy is already #P-hard. Lemma 2.3 (Hanasusanto et al. [2016, Lemma 1]). Computing the volume of the knapsack polytope P(w, b) for a given w ∈ R^d₊ and b ∈ R₊ to within an absolute accuracy of δ > 0 is #P-hard whenever δ satisfies the strict inequality (5). Fix now any knapsack polytope P(w, b) encoded by w ∈ R^d₊ and b ∈ R₊. Without loss of generality, we may assume that w ≠ 0 and b > 0. Indeed, we are allowed to exclude w = 0 because the volume of P(0, b) is trivially equal to 1. On the other hand, b = 0 can be excluded by applying a suitable rotation and translation, which are volume-preserving transformations. In the remainder, we denote by μ the Lebesgue measure on the standard hypercube [0, 1]^d and by ν_α = α δ_{x₁} + (1 − α) δ_{x₂} a family of discrete probability measures with two atoms at x₁ = 0 and x₂ = 2b w/‖w‖², respectively, whose probabilities are parameterized by α ∈ [0, 1]. The following preparatory lemma relates the volume of P(w, b) to the optimal transport problem (1) and is thus instrumental for the proof of Theorem 2.2.
Lemma 2.4. The minimization problem

min_{α ∈ [0,1]} W_c(μ, ν_α)   (8)

is solved by α⋆ = Vol(P(w, b)).

Proof. By the definition of the optimal transport distance in (1) and our choice of c(x, y), we have that W_c(μ, ν_α) equals the optimal value of a minimization problem over couplings, where the second equality holds because any coupling π of μ and ν_α can be constructed from the marginal probability measure μ of x and the probability measures π₁ and π₂ of x conditional on y = x₁ and y = x₂, respectively, that is, we may write π = α · π₁ ⊗ δ_{x₁} + (1 − α) · π₂ ⊗ δ_{x₂}. The constraint of the inner minimization problem ensures that the marginal probability measure of x under π coincides with μ. By applying the variable transformations π₁ ← α · π₁ and π₂ ← (1 − α) · π₂ to eliminate all bilinear terms, we then obtain a linear program over the measures π₁ and π₂. Observe next that the decision variable α and the two normalization constraints can be eliminated without affecting the optimal value of the resulting infinite-dimensional linear program because the Borel measures π₁ and π₂ are non-negative and because the constraint π₁ + π₂ = μ implies that π₁(R^d) + π₂(R^d) = μ(R^d) = 1. Thus, there always exists α ∈ [0, 1] such that π₁(R^d) = α and π₂(R^d) = 1 − α. The constraint π₁ + π₂ = μ also implies that π₁ and π₂ are absolutely continuous with respect to μ, and thus min_{α∈[0,1]} W_c(μ, ν_α) can be expressed in terms of the Radon-Nikodym derivatives dπ₁/dμ and dπ₂/dμ, where the second equality holds because at optimality the Radon-Nikodym derivatives must satisfy a pointwise optimality condition for μ-almost every x ∈ R^d and for every i = 1, 2.
In the second part of the proof we will demonstrate that the minimization problem min_{α∈[0,1]} W_c(μ, ν_α) is solved by α⋆ = Vol(P(w, b)). By Proposition 2.1 and the definition of the c-transform, we first note that W_c(μ, ν_{α⋆}) admits the dual representation (7). The second equality in (7) follows from the construction of ν_{α⋆} as a probability measure with only two atoms at the points x_i for i = 1, 2. Indeed, by fixing the corresponding function values ψ_i = ψ(x_i) for i = 1, 2, the expectation E_{y∼ν_{α⋆}}[ψ(y)] simplifies to α⋆ · ψ₁ + (1 − α⋆) · ψ₂, while the dual objective is maximized by setting ψ(y) to a large negative constant for all y ∉ {x₁, x₂}, which implies that (7) reduces to an unconstrained problem over ψ ∈ R². Next, we will prove that any ψ⋆ ∈ R² with ψ⋆₁ = ψ⋆₂ attains the maximum of the unconstrained convex optimization problem on the last line of (7). To see this, note that the objective of (7) is differentiable in ψ by virtue of the Reynolds theorem. Thus, the first-order optimality condition α⋆ = μ(𝒞₁(ψ)), where 𝒞_i(ψ) denotes the set of points x that are transported to atom x_i, is necessary and sufficient for global optimality. (Note that the first-order condition 1 − α⋆ = μ(𝒞₂(ψ)) for ψ₂ is redundant in view of the first-order condition α⋆ = μ(𝒞₁(ψ)) for ψ₁ because μ is the Lebesgue measure on [0, 1]^d, whereby μ(𝒞₁(ψ) ∪ 𝒞₂(ψ)) = μ(𝒞₁(ψ)) + μ(𝒞₂(ψ)) = 1.) Fix now any ψ⋆ ∈ R² with ψ⋆₁ = ψ⋆₂ and observe that α⋆ = Vol(P(w, b)) = μ(𝒞₁(ψ⋆)),
where the first and second equalities follow from the definitions of ν_{α⋆} and the knapsack polytope P(w, b), respectively, the fourth equality holds because x₁ = 0 and x₂ = 2b w/‖w‖², and the fifth equality follows from the definition of 𝒞₁(ψ⋆) and our assumption that ψ⋆₁ = ψ⋆₂. This reasoning implies that ψ⋆ attains indeed the maximum of the optimization problem on the last line of (7). Hence, we find that W_c(μ, ν_{α⋆}) can be evaluated explicitly, where the second equality holds because ψ⋆₁ = ψ⋆₂, the third equality exploits the definition of 𝒞₁(ψ⋆), and the fourth equality follows from (6). We may therefore conclude that α⋆ = Vol(P(w, b)) solves indeed the minimization problem min_{α∈[0,1]} W_c(μ, ν_α).

From the proof of Lemma 2.4 we know that the Wasserstein distance W_c(μ, ν_α) is strictly convex in α, which implies that the minimization problem (8) constitutes a one-dimensional convex program with a unique minimizer. A near-optimal solution that approximates the exact minimizer to within an absolute accuracy δ > 0, defined in Appendix B, can readily be computed with a binary search method such as Algorithm 3 described in Lemma A.1 (i), which evaluates f(α) = W_c(μ, ν_α) at exactly 2K = 2(⌈log₂(1/δ)⌉ + 1) test points. Note that δ falls within the interval (0, 1) and satisfies the strict inequality (5). Note also that 1/δ grows only polynomially with the bit length of w and b; see Appendix B for details. One readily verifies that all operations in Algorithm 3 except for the computation of W_c(μ, ν_α) can be carried out in time polynomial in the bit length of w and b. Thus, if we could compute W_c(μ, ν_α) in time polynomial in the bit length of w, b and α, then we could efficiently compute the volume of the knapsack polytope P(w, b) to within accuracy δ, which is #P-hard by Lemma 2.3. We have thus constructed a polynomial-time Turing reduction from the #P-hard problem of (approximately) computing the volume of a knapsack polytope to computing the Wasserstein distance W_c(μ, ν). By the definition of the class of #P-hard problems (see, e.g., [Van Leeuwen, 1990, Definition 1]), we may thus conclude that computing W_c(μ, ν) is #P-hard.

Proof. Assume that we have access to an inexact oracle that outputs, for any fixed α ∈ [0, 1], an approximate optimal transport distance W̃_c(μ, ν_α) with |W̃_c(μ, ν_α) − W_c(μ, ν_α)| ≤ δ. By Lemma A.1 (ii), which applies thanks to the definition of δ, we can then find a 2δ-approximation for the unique minimizer of (8) using 2K oracle calls. Note that δ′ = 2δ falls within the interval (0, 1) and satisfies the strict inequality (5).
Recall also that 1/δ grows only polynomially with the bit length of w and b; see Appendix B for details.
Thus, if we could compute W̃_c(μ, ν_α) in time polynomial in the bit length of w, b and α, then we could efficiently compute the volume of the knapsack polytope P(w, b) to within accuracy δ′, which is #P-hard by Lemma 2.3. Computing W_c(μ, ν) to within an absolute accuracy of δ is therefore also #P-hard.
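Both reductions rely on a derivative-free search (the paper's Algorithm 3, stated in its appendix) that localizes the minimizer of a one-dimensional convex function using only function evaluations. A generic ternary-search sketch of this kind of procedure (not the paper's exact algorithm):

```python
# Derivative-free search for the minimizer of a 1-D convex function on
# [lo, hi]; two function evaluations per iteration, geometric shrinkage.
def convex_min_search(f, lo=0.0, hi=1.0, tol=1e-6):
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) <= f(m2):
            hi = m2          # minimizer lies in [lo, m2]
        else:
            lo = m1          # minimizer lies in [m1, hi]
    return (lo + hi) / 2

alpha = convex_min_search(lambda a: (a - 0.3) ** 2)   # true minimizer: 0.3
```

The number of evaluations grows only logarithmically in 1/tol, mirroring the 2K-evaluation count of the binary search used in the proofs.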
The hardness of optimal transport established in Theorem 2.2 and Corollary 2.5 is predicated on the hardness of numerical integration. A popular technique to reduce the complexity of numerical integration is smoothing, whereby an initial (possibly discontinuous) integrand is approximated with a differentiable one [Dick et al., 2013]. Smoothness is also a desired property of objective functions when designing scalable optimization algorithms [Bubeck, 2015]. These observations prompt us to develop a systematic way to smooth the optimal transport problem that leads to efficient approximate numerical solution schemes.
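The link to numerical integration can be made concrete: the quantity at the heart of the preceding reduction, the volume of the knapsack polytope P(w, b) = {x ∈ [0, 1]^d : w^⊤x ≤ b}, is easy to estimate crudely by Monte Carlo; the hardness concerns computing it to high accuracy. A sketch (the test instance is an illustrative assumption):

```python
# Monte Carlo estimate of Vol(P(w, b)) as the fraction of uniform
# samples from the unit cube that satisfy the knapsack constraint.
import numpy as np

def knapsack_volume_mc(w, b, samples=200000, seed=1):
    rng = np.random.default_rng(seed)
    x = rng.random((samples, len(w)))
    return float(np.mean(x @ w <= b))

# Sanity check: for w = (1, 1), b = 1 the polytope is a triangle of area 1/2.
est = knapsack_volume_mc(np.array([1.0, 1.0]), 1.0)
```

The standard error decays only as O(1/√samples), so driving the absolute accuracy below the threshold of Lemma 2.3 this way would require exponentially many samples.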

Smooth Optimal Transport
The semi-discrete optimal transport problem evaluates the optimal transport distance (1) between an arbitrary probability measure μ supported on 𝒳 and a discrete probability measure ν = ∑_{i=1}^N q_i δ_{y_i} with atoms y₁, ..., y_N ∈ 𝒴 and corresponding probabilities q = (q₁, ..., q_N) ∈ Δ^N for some N ≥ 2. In the following, we define the discrete c-transform

ψ̄_c(ψ, x) = min_{i ∈ [N]} c(x, y_i) − ψ_i   (9)

for all ψ ∈ R^N and x ∈ 𝒳. Armed with the discrete c-transform, we can now reformulate the semi-discrete optimal transport problem as a finite-dimensional maximization problem over a single dual potential vector.
Lemma 3.1 (Discrete c-transform). The semi-discrete optimal transport problem is equivalent to

W_c(μ, ν) = max_{ψ ∈ R^N} E_{x∼μ}[ψ̄_c(ψ, x)] + ∑_{i=1}^N ψ_i q_i.   (10)

Proof. As ν = ∑_{i=1}^N q_i δ_{y_i} is discrete, the dual optimal transport problem (3) simplifies to a maximization over the function values ψ_i = ψ(y_i), i ∈ [N]. Using the definition of the standard c-transform, we can then recast the inner minimization problem as the discrete c-transform, where the first equality follows from setting ψ(y) = t for all y ∉ {y₁, ..., y_N} and letting t tend to −∞, while the second equality exploits the definition of the discrete c-transform. Thus, (10) follows.
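When μ is only accessible through samples, the dual objective in (10) can be estimated by averaging the discrete c-transform over draws from μ. A sketch for a squared-Euclidean cost with μ uniform on the unit square (both illustrative assumptions); a useful sanity check, exercised below, is that the dual objective is invariant under adding a constant to all components of ψ:

```python
# Sample-average estimate of the dual objective (10):
# E_{x~mu}[ min_i { c(x, y_i) - psi_i } ] + <psi, q>.
import numpy as np

def dual_objective(psi, y, q, samples=50000, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.random((samples, y.shape[1]))                  # x ~ Uniform[0,1]^2
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)  # c(x, y_i)
    ctrans = (cost - psi[None, :]).min(axis=1)             # discrete c-transform
    return ctrans.mean() + psi @ q

y = np.array([[0.25, 0.25], [0.75, 0.75]])   # atoms of nu (assumed)
q = np.array([0.5, 0.5])                     # atom probabilities (assumed)
val0 = dual_objective(np.zeros(2), y, q)
val1 = dual_objective(np.zeros(2) + 1.0, y, q)   # constant shift of psi
```

The shift invariance mirrors the fact that Kantorovich potentials are only determined up to an additive constant.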
The discrete c-transform (9) can be viewed as the optimal value of a discrete choice model, where a utility-maximizing agent selects one of N mutually exclusive alternatives with utilities ψ_i − c(x, y_i), i ∈ [N], respectively. Discrete choice models are routinely used for explaining the preferences of travelers selecting among different modes of transportation [Ben-Akiva and Lerman, 1985], but they are also used for modeling the choice of residential location [McFadden, 1978], the interests of end-users in engineering design [Wassenaar and Chen, 2003] or the propensity of consumers to adopt new technologies [Hackbarth and Madlener, 2013].
In practice, the preferences of decision-makers and the attributes of the different choice alternatives are invariably subject to uncertainty, and it is impossible to specify a discrete choice model that reliably predicts the behavior of multiple individuals. Psychological theory thus models the utilities as random variables [Thurstone, 1927], in which case the optimal choice becomes random, too. The theory as well as the econometric analysis of probabilistic discrete choice models were pioneered by McFadden [1974].
The availability of a wealth of elegant theoretical results in discrete choice theory prompts us to add a random noise term z_i to each deterministic utility value ψ_i − c(x, y_i) in (9). We will argue below that the expected value of the resulting maximal utility with respect to the noise distribution provides a smooth approximation for the c-transform ψ̄_c(ψ, x), which in turn leads to a smooth optimal transport problem that displays favorable numerical properties. For a comprehensive survey of additive random utility models in discrete choice theory we refer to Dubin and McFadden [1984] and Daganzo [2014]. Generalized semi-parametric discrete choice models where the noise distribution is itself subject to uncertainty are studied by Natarajan et al. [2009]. Using techniques from modern distributionally robust optimization, these models evaluate the best-case (maximum) expected utility across an ambiguity set of multivariate noise distributions. Semi-parametric discrete choice models are studied in the context of appointment scheduling [Mak et al., 2015], traffic management [Ahipaşaoğlu et al., 2016] and product line pricing [Li et al., 2019].
We now define the smooth (discrete) c-transform through a best-case expected utility of the type studied in semi-parametric discrete choice theory, that is,

ψ̄_c^Θ(ψ, x) = − sup_{θ ∈ Θ} E_θ[ max_{i ∈ [N]} { ψ_i − c(x, y_i) + z_i } ],   (11)

where z represents a random vector of perturbations that is independent of x. Specifically, we assume that z is governed by a Borel probability measure θ from within some ambiguity set Θ ⊆ 𝒫(R^N).
Note that if Θ is a singleton that contains only the Dirac measure at the origin of R^N, then the smooth c-transform collapses to the ordinary c-transform defined in (9), which is piecewise affine and thus non-smooth in ψ. For many commonly used ambiguity sets, however, we will show below that the smooth c-transform is indeed differentiable in ψ. In practice, the additive noise z_i in the transportation cost could originate, for example, from uncertainty about the position of the i-th atom y_i of the discrete distribution ν. This interpretation is justified if c(x, y) is approximately affine in y around the atoms y_i, i ∈ [N]. The smooth c-transform gives rise to the following smooth (semi-discrete) optimal transport problem in dual form,

max_{ψ ∈ R^N} E_{x∼μ}[ψ̄_c^Θ(ψ, x)] + ∑_{i=1}^N ψ_i q_i.   (12)
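A concrete special case: with i.i.d. Gumbel noise of scale eta, the best-case expected maximal utility has the classical log-sum-exp (logit) form of discrete choice theory, so the smooth c-transform becomes, up to an additive constant (the Euler-Mascheroni term), a soft-min of the costs c(x, y_i) − ψ_i. A numerically stable sketch (the test values are illustrative assumptions):

```python
# Soft-min relaxation of the discrete c-transform: the entropic
# (Gumbel-noise) smoothing, computed with the max-shift trick.
import numpy as np

def smooth_ctransform(psi, cost, eta):
    """-eta * logsumexp((psi - cost)/eta); recovers the hard min as eta -> 0."""
    z = (psi - cost) / eta
    return -eta * (z.max() + np.log(np.exp(z - z.max()).sum()))

psi = np.array([0.2, -0.1, 0.0])
cost = np.array([0.3, 0.5, 0.9])
hard = np.min(cost - psi)   # ordinary discrete c-transform
```

The soft-min always lies below the hard min and converges to it as the noise scale eta vanishes, which is the mechanism behind the a priori approximation bounds mentioned earlier.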
Note that (12) is indeed obtained from the original dual optimal transport problem (10) by replacing the original c-transform ψ̄_c(ψ, x) with the smooth c-transform ψ̄_c^Θ(ψ, x). As smooth functions are susceptible to efficient numerical integration, we expect that (12) is easier to solve than (10). A key insight of this work is that the smooth dual optimal transport problem (12) typically has a primal representation of the form

min_{π ∈ Π(μ,ν)} ∫_{𝒳×𝒴} c(x, y) π(dx, dy) + R_Θ(π),   (13)

where R_Θ(π) can be viewed as a regularization term that penalizes the complexity of the transportation plan π. In the remainder of this section we will prove (13) and derive R_Θ(π) for different ambiguity sets Θ.
We will see that this regularization term is often related to an f-divergence, where f : R₊ → R ∪ {+∞} constitutes a lower-semicontinuous convex function with f(1) = 0. If π and ρ are two Borel probability measures on a closed subset of a finite-dimensional Euclidean space, and if π is absolutely continuous with respect to ρ, then the continuous f-divergence from π to ρ is defined as

D_f(π ‖ ρ) = ∫ f(dπ/dρ) dρ,

where dπ/dρ stands for the Radon-Nikodym derivative of π with respect to ρ. By slight abuse of notation, if p and q are two probability vectors in Δ^N and if q > 0, then the discrete f-divergence from p to q is defined as

D_f(p ‖ q) = ∑_{i=1}^N q_i f(p_i/q_i).

The correct interpretation of D_f is usually clear from the context.
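The discrete f-divergence is a one-line computation; with f(t) = t log t it recovers the Kullback-Leibler divergence that underlies the entropic regularizer. A minimal sketch (the probability vectors are illustrative assumptions):

```python
# Discrete f-divergence D_f(p || q) = sum_i q_i * f(p_i / q_i), q > 0.
import numpy as np

def f_divergence(p, q, f):
    return float(np.sum(q * f(p / q)))

def kl(t):               # f(t) = t log t gives the KL divergence
    return t * np.log(t)

p = np.array([0.7, 0.3])
q = np.array([0.5, 0.5])
d = f_divergence(p, q, kl)   # = 0.7*log(1.4) + 0.3*log(0.6)
```

Since f is convex with f(1) = 0, Jensen's inequality gives D_f(p ‖ q) ≥ 0 with equality at p = q, which is exactly what makes it a sensible penalty on the deviation of a transportation plan from a reference measure.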
The following lemma shows that the smooth optimal transport problem (13) equipped with an f-divergence regularization term is equivalent to a finite-dimensional convex minimization problem. This result will be instrumental in proving the equivalence of (12) and (13) for different ambiguity sets Θ.
If ν is a discrete probability measure, then problem (13) is equivalent to (14). Proof. If no feasible solution attains a finite objective value, then (13) and (14) both evaluate to infinity, and the claim holds trivially. In the remainder of the proof we may thus assume without loss of generality that this is not the case. Using [Rockafellar and Wets, 2009, Theorem 14.6] to interchange the minimization with the expectation over x, problem (14) can first be reformulated as a problem over ℒ∞(𝒳, μ), the Banach space of all Borel-measurable functions from 𝒳 to ℝ^N that are essentially bounded with respect to μ. Interchanging the supremum with the minimum and evaluating the resulting unconstrained linear program in closed form then yields the dual problem (15). Strong duality holds for the following reasons. As the integrands are lower-semicontinuous and non-negative, we may proceed as in [Shapiro, 2017, § 3.2] to show that the dual objective function is weakly* lower semicontinuous. Similarly, as Δ_N is compact, one can use the Banach-Alaoglu theorem to show that the dual feasible set is weakly* compact. Finally, as the cost function is real-valued, the constant solution is dual feasible, and thus the dual problem is solvable and has a finite optimal value. This argument remains valid if we add a perturbation δ ∈ D = {δ′ ∈ ℝ^N : Σ_{i=1}^N δ′_i = 0} to the right-hand side vector as long as δ > −ν. The optimal value of the perturbed dual problem is thus pointwise finite as well as convex and, consequently, continuous and locally bounded in δ at the origin of D. As ν > 0, strong duality therefore follows from [Rockafellar, 1974, Theorem 17 (a)].
Any dual feasible solution ∈ ℒ∞( , ) gives rise to a Borel probability measure ∈ ( × ) defined through ( ∈ ℬ) = ( ∈ ℬ) for all Borel sets ℬ ⊆ and ( ∈ | = ) = ∫︀ ( ) (d )/ for all Borel sets ⊆ and ∈ [ ]. This follows from the law of total probability, whereby the joint distribution of and is uniquely determined if we specify the marginal distribution of and the conditional distribution of given = for every ∈ [ ]. By construction, the marginal distributions of and under are determined by and , respectively. Indeed, note that for any Borel set ⊆ we have where the first equality follows from the law of total probability, the second and the third equalities both exploit the construction of , and the fourth equality holds because ( ) ∈ Δ -almost surely due to dual feasibility. This reasoning implies that constitutes a coupling of and (that is, ∈ Π( , )) and is thus feasible in (13). Conversely, any ∈ Π( , ) gives rise to a function ∈ ℒ∞( , ) defined through By the properties of the Radon-Nikodym derivative, we have ( ) ≥ 0 -almost surely for all ∈ [ ]. In addition, for any Borel set ⊆ we have where the second equality follows from Fubini's theorem and the definition of = ∑︀ =1 , while the fourth equality exploits that the marginal distribution of under is determined by . As the above identity holds for all Borel sets ⊆ , we find that it is feasible in (15). In summary, we have thus shown that every probability measure feasible in (13) induces a function feasible in (15) and vice versa. We further find that the objective value of in (15) coincides with the objective value of the corresponding in (13). Specifically, we have where the first equality exploits the definition of the discrete f-divergence, the second equality expresses the function in terms of the corresponding probability measure, the third equality follows from Fubini's theorem and uses the definitions introduced above, and the fourth equality follows from the definition of the continuous f-divergence.
In summary, we have thus shown that (13) is equivalent to (15), which in turn is equivalent to (14). This observation completes the proof.
If ν is a discrete probability measure, then problem (13) with regularization term R_Θ(π) = D_f(π‖μ⊗ν) satisfies the following error bounds. Proof. By Lemma 3.2, problem (13) is equivalent to (14). Note that the inner optimization problem in (14) can be viewed as an f-divergence regularized linear program with optimal value ψ⊤ν − ℓ(ψ, x). Bounding the f-divergence by its minimum and its maximum over Δ_N then yields the estimates (16). Here, ψ_c(ψ, x) stands as usual for the discrete c-transform defined in (9), which can be represented as a pointwise minimum of affine functions. Multiplying (16) by −1, adding ψ⊤ν, averaging over x using the probability measure μ and maximizing over ψ ∈ ℝ^N further implies via (10) and (14) that (18) holds. As the f-divergence is convex in its first argument, its maximum is attained at a vertex of Δ_N [Hoffman, 1981, Theorem 1]. The claim then follows by substituting the resulting formula into (18) and rearranging terms.
In the following we discuss three different classes of ambiguity sets Θ for which the dual smooth optimal transport problem (12) is indeed equivalent to the primal regularized optimal transport problem (13).

Generalized Extreme Value Distributions
Assume first that the ambiguity set Θ represents a singleton that accommodates only one single Borel probability measure on ℝ^N defined through (19), where G : ℝ^N → ℝ₊ is a smooth generating function with the following properties. First, G is homogeneous of degree 1/η for some η > 0, that is, for any λ ≠ 0 and z ∈ ℝ^N we have G(λz) = λ^{1/η} G(z). In addition, G(z) tends to infinity as z_i grows for any i ∈ [N]. Finally, the k-th order partial derivative of G with respect to k distinct arguments is non-negative if k is odd and non-positive if k is even. These properties ensure that the noise vector z follows a generalized extreme value distribution in the sense of [Train, 2009, § 4.1].
Proposition 3.4 (Entropic regularization). Assume that Θ is a singleton ambiguity set that contains only a generalized extreme value distribution with G(z) = exp(−γ) Σ_{i=1}^N (ν_i z_i)^{1/η} for some η > 0 and ν ∈ Δ_N, ν > 0, where γ stands for Euler's constant. Then, the components of z follow independent Gumbel distributions with means η log(ν_i) and variances η²π²/6 for all i ∈ [N], while the smooth c-transform (11) reduces to the log-partition function (20). In addition, the smooth dual optimal transport problem (12) is equivalent to the regularized primal optimal transport problem (13). Note that the log-partition function (20) constitutes indeed a smooth approximation of the maximum function in the definition (9) of the discrete c-transform. As η decreases, this approximation becomes increasingly accurate. It is also instructive to consider the special case where μ = Σ_{j=1}^M μ_j δ_{x_j} is a discrete probability measure with atoms x_1, …, x_M and corresponding vector of probabilities μ ∈ Δ_M. In this case, any coupling π ∈ Π(μ, ν) constitutes a discrete probability measure with a matrix of probabilities P. If f(s) = s log(s), then the continuous f-divergence reduces to a finite sum, where the second equality holds because π is a coupling of μ and ν. Thus, D_f(π‖μ⊗ν) coincides with the negative entropy of the probability matrix P offset by a constant that is independent of P. For f(s) = s log(s) the choice of the reference measure has therefore no impact on the minimizer of the smooth optimal transport problem (13), and we simply recover the celebrated entropic regularization proposed by Cuturi [2013]. Turning to the proof of Proposition 3.4, note that the cumulative distribution function of the noise is available in closed form via (19), where γ stands for Euler's constant. The components of the noise vector z are thus independent, and z_i follows a Gumbel distribution with location parameter η(log(ν_i) − γ) and scale parameter η for every i ∈ [N]. Therefore, z_i has mean η log(ν_i) and variance η²π²/6.
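The log-partition function (20) can be evaluated stably by shifting with the maximum before exponentiating. The sketch below (the utilities u_i and the values of the smoothing parameter η are illustrative) shows the approximation tightening toward the true maximum as η decreases:

```python
import numpy as np

def log_partition(u, eta, nu):
    """Smoothed maximum eta * log( sum_i nu_i * exp(u_i / eta) ),
    computed with the usual max-shift for numerical stability."""
    u, nu = np.asarray(u, float), np.asarray(nu, float)
    m = u.max()
    return float(m + eta * np.log(np.sum(nu * np.exp((u - m) / eta))))

u = np.array([1.0, 0.4, 0.9])   # hypothetical utilities psi_i - c(x, y_i)
nu = np.full(3, 1/3)
for eta in (1.0, 0.1, 0.01):
    print(eta, log_partition(u, eta, nu))   # approaches max(u) = 1.0 as eta shrinks
```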
If the ambiguity set Θ contains only one single probability measure of the form (19), then Theorem 5.2 of McFadden [1981] readily implies that the smooth c-transform (11) simplifies to (21). The closed-form expression for the smooth c-transform in (20) follows immediately by substituting the explicit formula for the generating function G into (21). One further verifies that (20) can be reformulated as (22). Indeed, solving the underlying Karush-Kuhn-Tucker conditions analytically shows that the optimal value of the nonlinear program (22) coincides with (20). The equivalence of (20) and (22) has already been recognized by Anderson et al. [1988].

Chebyshev Ambiguity Sets
Assume next that Θ constitutes a Chebyshev ambiguity set comprising all Borel probability measures on ℝ^N with mean vector 0 and positive definite covariance matrix ηΣ for some Σ ≻ 0 and η > 0.
In this case, [Ahipaşaoğlu et al., 2018, Theorem 1] implies that the smooth c-transform (11) can be equivalently expressed as (23), where diag(v) ∈ ℝ^{N×N} represents the diagonal matrix with v on its main diagonal. Note that the maximum in (23) evaluates the convex conjugate of an extended real-valued regularization function. As Σ ≻ 0 and η > 0, [Ahipaşaoğlu et al., 2018, Theorem 3] implies that this regularization function is strongly convex over its effective domain Δ_N. By [Rockafellar and Wets, 2009, Proposition 12.60], the smooth discrete c-transform ψ̄_c(ψ, x) is therefore indeed differentiable in ψ for any fixed x.
It is further known that problem (23) admits an exact reformulation as a tractable semidefinite program; see [Mishra et al., 2012, Proposition 1]. If Σ = I, then the regularization function can be re-expressed in terms of a discrete f-divergence, which implies via Lemma 3.2 that the smooth optimal transport problem is equivalent to the original optimal transport problem regularized with a continuous f-divergence.
Proposition 3.5 (Chebyshev regularization). If Θ is the Chebyshev ambiguity set of all Borel probability measures with mean 0 and covariance matrix ηI with η > 0, then the smooth c-transform (11) simplifies to (24). In addition, the smooth dual optimal transport problem (12) is equivalent to the regularized primal optimal transport problem (13). Proof. The relation (24) follows directly from (23) by replacing Σ with I. Next, one readily verifies an equivalent representation of the smooth c-transform. Substituting this representation into the dual smooth optimal transport problem (12) yields (14) with the function f defined in (25). By Lemma 3.2, (12) thus reduces to the regularized primal optimal transport problem (13). Note that the function f(s) defined in (25) is indeed convex and lower-semicontinuous and satisfies f(1) = 0.
Therefore, it induces a standard f-divergence. Proposition 3.5 can be generalized to arbitrary diagonal matrices Σ, but the emerging f-divergences are rather intricate and not insightful. Hence, we do not show this generalization. Moreover, we were not able to generalize Proposition 3.5 to non-diagonal matrices Σ.

Marginal Ambiguity Sets
We now investigate the class of marginal ambiguity sets of the form (26), where F_i stands for the cumulative distribution function of the uncertain disturbance z_i, i ∈ [N]. Marginal ambiguity sets completely specify the marginal distributions of the components of the random vector z but impose no restrictions on their dependence structure (i.e., their copula). Sometimes marginal ambiguity sets are also referred to as Fréchet ambiguity sets [Fréchet, 1951]. We will argue below that marginal ambiguity sets explain most known as well as several new regularization methods for the optimal transport problem. In particular, they are more expressive than the generalized extreme value distributions as well as the Chebyshev ambiguity sets in the sense that they induce a richer family of regularization terms. Below we denote by F_i^{−1} : [0, 1] → ℝ the (left) quantile function corresponding to F_i, which is defined through F_i^{−1}(t) = inf{z ∈ ℝ : F_i(z) ≥ t}. We first prove that if Θ constitutes a marginal ambiguity set, then the smooth c-transform (11) admits an equivalent reformulation as the optimal value of a finite convex program. Specifically, (11) can be equivalently expressed as (27) for all x and ψ ∈ ℝ^N. In addition, the smooth c-transform is convex and differentiable with respect to ψ, and ∇_ψ ψ̄_c(ψ, x) represents the unique solution of the convex maximization problem (27).
Recall that the smooth c-transform (11) can be viewed as the best-case utility of a semi-parametric discrete choice model. Thus, (27) follows from [Natarajan et al., 2009, Theorem 1]. To keep this paper self-contained, we provide a new proof of Proposition 3.6, which exploits a natural connection between the smooth c-transform induced by a marginal ambiguity set and the conditional value-at-risk (CVaR).
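For intuition on the CVaR connection: the upper-tail CVaR of an empirical distribution at level α is simply the average of the largest ⌈αn⌉ outcomes, which matches the quantile integral (1/α)∫_{1−α}^1 F^{−1}(t) dt appearing in the reformulation (27). A small sketch with illustrative data (function name is ours):

```python
import numpy as np

def cvar(samples, alpha):
    """Upper-tail CVaR at level alpha for an empirical distribution:
    the average of the largest ceil(alpha * n) outcomes."""
    s = np.sort(np.asarray(samples, dtype=float))
    k = int(np.ceil(alpha * s.size))
    return float(s[-k:].mean())

outcomes = np.arange(1.0, 101.0)   # uniform empirical distribution on {1,...,100}
print(cvar(outcomes, 0.10))        # average of the top 10 outcomes = 95.5
print(cvar(outcomes, 1.00))        # alpha = 1 recovers the plain expectation = 50.5
```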
Proof of Proposition 3.6. Throughout the proof we fix x and ψ ∈ ℝ^N, and we introduce the nominal utility vector u ∈ ℝ^N with components u_i = ψ_i − c(x, y_i) in order to simplify notation. In addition, it is useful to define the binary function q_i(z) that equals 1 if and only if i is the smallest index attaining max_{j∈[N]} u_j + z_j. For any fixed θ ∈ Θ, we then have max_{i∈[N]} (u_i + z_i) = Σ_{i∈[N]} q_i(z)(u_i + z_i) almost surely with respect to θ. From now on we denote by θ_i the marginal probability distribution of the random variable z_i under θ. As θ belongs to a marginal ambiguity set of the form (26), we thus have θ_i(z_i ≤ z) = F_i(z) for all z ∈ ℝ, that is, θ_i is uniquely determined by the cumulative distribution function F_i. The above reasoning then implies the upper bound (28). The inequality can be justified as follows. One may first add the redundant expectation constraints q_i = 𝔼_{z∼θ}[q_i(z)] and the redundant θ-almost sure constraints 0 ≤ q_i(z) ≤ 1 to the maximization problem over θ without affecting the problem's optimal value. Next, one may remove the constraints that express the decision variables in terms of q_i(z). The resulting relaxation provides an upper bound on the original maximization problem. Note that all remaining expectation operators involve integrands that depend on z only through z_i for some i ∈ [N], and therefore the expectations with respect to the joint probability measure θ can all be simplified to expectations with respect to one of the marginal probability measures θ_i.
As neither the objective nor the constraints of the resulting problem depend on θ, we may finally remove θ from the list of decision variables without affecting the problem's optimal value. For any fixed q ∈ Δ_N, the upper bounding problem (28) gives rise to the following subproblems indexed by i ∈ [N]. If q_i > 0, the optimization problem (29a) over the functions q_i(·) ∈ ℒ(ℝ) can be recast as an optimization problem over probability measures θ̃ ∈ 𝒫(ℝ) that are absolutely continuous with respect to θ_i, where dθ̃/dθ_i denotes as usual the Radon-Nikodym derivative of θ̃ with respect to θ_i. Indeed, if q_i(·) is feasible in (29a), then the induced measure θ̃ is feasible in (29b) and attains the same objective function value. Conversely, if θ̃ is feasible in (29b), then q_i(z) = dθ̃/dθ_i(z) is feasible in (29a) and attains the same objective function value. Thus, (29a) and (29b) are indeed equivalent. By [Föllmer and Schied, 2004, Theorem 4.47], the optimal value of (29b) is given by (30). Note that the objective function of the upper bounding problem on the right hand side of (30) constitutes a sum of the strictly concave and differentiable univariate functions u_i q_i + ∫_{1−q_i}^1 F_i^{−1}(t) dt. Indeed, the derivative of the i-th function with respect to q_i is given by u_i + F_i^{−1}(1 − q_i), which is strictly decreasing in q_i because F_i^{−1} is strictly increasing thanks to the assumed continuity of F_i. The upper bounding problem in (30) is thus solvable as it has a compact feasible set as well as a differentiable objective function. Moreover, the solution is unique thanks to the strict concavity of the objective function. In the following we denote this unique solution by q⋆.
It remains to be shown that there exists a distribution θ⋆ ∈ Θ that attains the upper bound in (30).
To this end, we define the functions suggested by [Föllmer and Schied, 2004, Remark 4.48]. In addition, we also define the Borel measures θ⁺ and θ⁻ for all Borel sets ⊆ ℝ, respectively. By construction, θ⁺ is supported on the upper tail. The law of total probability further implies a decomposition of θ. In the remainder of the proof we will demonstrate that the maximization problem on the left hand side of (30) is solved by the mixture distribution θ⋆. This will show that the inequality in (30) is in fact an equality, which in turn implies that the smooth c-transform is given by (27). We first prove that θ⋆ ∈ Θ. To see this, note that for all i ∈ [N] we have where the second equality exploits the relation between θ, θ⁺ and θ⁻. Next, we prove that θ⋆ attains the upper bound in (30). By the definition of the binary function, we have where the third equality holds because q_i(z) = 1 if and only if i = min argmax_{j∈[N]} u_j + z_j, and the fourth equality follows from the assumed continuity of the marginal distribution functions F_i, i ∈ [N], where the first equality exploits the relation θ = q⋆θ⁺ + (1 − q⋆)θ⁻, while the second equality follows from the definition of θ⋆. The expectations in (31) can be further simplified by using the stationarity conditions of the upper bounding problem in (30), which imply that the partial derivatives of the objective function with respect to the decision variables q_i, i ∈ [N], are all equal at q = q⋆. Thus, q⋆ must satisfy the first-order condition (32). Consequently, for every z > F^{−1}(1 − q⋆) and j ≠ i we have where the first equality follows from (32), and the second equality holds because θ⁻ is supported on (−∞, F^{−1}(1 − q⋆)]. As no probability can exceed 1, the above reasoning implies that the corresponding probability under θ⁻ equals 1 for all z > F^{−1}(1 − q⋆) and j ≠ i.
Noting that ⋆ ( ) = 1 { > −1 (1− ⋆ )} represents the characteristic function of the set ( −1 (1 − ⋆ ), ∞) covering the support of + , the term (31a) can thus be simplified to Similarly, for any ≤ −1 (1 − ⋆ ) and ̸ = we have where the two equalities follow from (32) and the observation that + is supported on ( −1 (1 − ⋆ ), ∞), respectively. As probabilities are non-negative, the above implies that + ( < + − ) = 0 for all ≤ −1 (1 − ⋆ ) and ̸ = . Hence, as − is supported on (−∞, −1 (1 − ⋆ )], the term (31b) simplifies to By combining the simplified reformulations of (31a) and (31b), we finally obtain  (27).
The next theorem reveals that the smooth dual optimal transport problem (12) with a marginal ambiguity set corresponds to a regularized primal optimal transport problem of the form (13).
Theorem 3.7 (Fréchet regularization). Suppose that Θ is a marginal ambiguity set of the form (26) and that the marginal cumulative distribution functions are defined through (33) for some probability vector ν ∈ Δ_N and strictly increasing function F. Then, the smooth dual optimal transport problem (12) is equivalent to the regularized primal optimal transport problem (13) with R_Θ(π) = D_f(π‖μ⊗ν), where f(s) = ∫₀^s F^{−1}(t) dt. The function f introduced in Theorem 3.7 is smooth and convex because its derivative df(s)/ds = F^{−1}(s) is strictly increasing, and f(1) = ∫₀¹ F^{−1}(t) dt = 0 by assumption. Therefore, this function induces a standard f-divergence. From now on we will refer to F as the marginal generating function.
Proof of Theorem 3.7. By Proposition 3.6, the smooth dual optimal transport problem (12) is equivalent to a finite convex program. As F is strictly increasing, we have F_i^{−1}(t) = −F^{−1}((1 − t)/ν_i) for all t ∈ (0, 1). Thus, we find an integral representation of f, where the second equality follows from the variable substitution t ← 1 − t. This integral representation of f then allows us to reformulate the smooth dual optimal transport problem in a form that is manifestly equivalent to problem (14) thanks to the definition of the discrete f-divergence.
Lemma 3.2 finally implies that the resulting instance of (14) is equivalent to the regularized primal optimal transport problem (13) with regularization term R_Θ(π) = D_f(π‖μ⊗ν). Hence, the claim follows. Then the smooth dual optimal transport problem (12) is equivalent to the regularized optimal transport problem (13) with an entropic regularizer of the form R_Θ(π) = D_f(π‖μ⊗ν), where f(s) = s log(s), while the smooth c-transform (11) reduces to the log-partition function (20). This example shows that entropic regularizers are not only induced by singleton ambiguity sets containing a generalized extreme value distribution (see Section 3.1) but also by marginal ambiguity sets with exponential marginals. Such regularizers were previously investigated in earlier works under the additional assumption that ν_i is independent of i ∈ [N], yet their intimate relation to noise models with uniform marginals remained undiscovered until now. In addition, the smooth c-transform (11) satisfies a representation in terms of the sparse maximum operator 'spmax' inspired by Martins and Astudillo [2016], which is defined through (35). The envelope theorem [De la Fuente, 2000, Theorem 2.16] ensures that spmax_{i∈[N]} is smooth and that its gradient with respect to u is given by the unique solution p⋆ of the maximization problem on the right hand side of (35). We note that p⋆ has many zero entries due to the sparsity-inducing nature of the problem's simplicial feasible set. In addition, we have lim_{η↓0} η · spmax_{i∈[N]} u_i/η = max_{i∈[N]} u_i. Thus, the sparse maximum can indeed be viewed as a smooth approximation of the ordinary maximum. In marked contrast to the more widely used LogSumExp function, however, the sparse maximum has a sparse gradient. Proposition D.1 in Appendix D shows that p⋆ can be computed efficiently by sorting.
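A sketch of the sorting-based computation alluded to above, in the spirit of Martins and Astudillo [2016]: the gradient of the sparse maximum is the Euclidean projection of the utility vector onto the probability simplex. The function name and test values are ours, and the paper's Proposition D.1 may differ in scaling details:

```python
import numpy as np

def sparsemax(u):
    """Euclidean projection of u onto the probability simplex via sorting;
    many coordinates of the result are exactly zero."""
    u = np.asarray(u, dtype=float)
    z = np.sort(u)[::-1]                 # sort in decreasing order
    css = np.cumsum(z)
    k = np.arange(1, u.size + 1)
    support = z + (1.0 - css) / k > 0    # coordinates that stay positive
    k_star = k[support][-1]
    tau = (css[k_star - 1] - 1.0) / k_star
    return np.maximum(u - tau, 0.0)

p = sparsemax([1.2, 0.3, 1.1, -0.5])
print(p)   # sparse gradient: two coordinates are exactly zero, unlike a softmax
```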
Example 3.10 (Pareto distribution model). Suppose that Θ is a marginal ambiguity set with (shifted) Pareto distributed marginals of the form (33) induced by the generating function F(t) = ((q − 1)t/(ηq) + 1/q)^{1/(q−1)} with η, q > 0. Then the smooth dual optimal transport problem (12) is equivalent to the regularized optimal transport problem (13) with a Tsallis divergence regularizer of the form R_Θ(π) = D_f(π‖μ⊗ν), where f(s) = η(s^q − s)/(q − 1). Such regularizers were investigated by Muzellec et al. [2017] under the additional assumption that ν_i is independent of i ∈ [N]. The Pareto distribution model encapsulates the exponential model (in the limit q → 1) and the uniform distribution model (for q = 2) as special cases. The smooth c-transform admits no simple closed-form representation under this model. Suppose finally that Θ is a marginal ambiguity set with marginals of the form (33) whose generating function is parametrized by some η > 0. In this case one can show that all marginals constitute t-distributions with 2 degrees of freedom. In addition, one can show that the smooth dual optimal transport problem (12) is equivalent to the Chebyshev regularized optimal transport problem described in Proposition 3.5.
To close this section, we remark that different regularization schemes differ as to how well they approximate the original (unregularized) optimal transport problem. Proposition 3.3 provides simple error bounds that may help in selecting suitable regularizers. For the entropic regularization scheme associated with the exponential distribution model of Example 3.8, for example, the error bound evaluates to η max_{i∈[N]} log(1/ν_i), while for the χ²-divergence regularization scheme associated with the uniform distribution model of Example 3.9, the error bound is given by η max_{i∈[N]} (1/ν_i − 1). In both cases, the error is minimized by setting ν_i = 1/N for all i ∈ [N]. Thus, the error bound grows logarithmically with N for entropic regularization and linearly with N for χ²-divergence regularization. Different regularization schemes also differ with regard to their computational properties, which will be discussed in Section 4.
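The two bounds can be compared numerically. In the sketch below η is an illustrative noise intensity and ν is chosen uniform, which minimizes both bounds; the logarithmic-versus-linear growth in N is immediately visible:

```python
import numpy as np

eta = 0.1                                # noise intensity (illustrative value)
for N in (10, 100, 1000):
    nu = np.full(N, 1.0 / N)             # uniform weights minimize both bounds
    entropic = eta * np.log(1.0 / nu).max()   # eta * log N
    chi2 = eta * (1.0 / nu - 1.0).max()       # eta * (N - 1)
    print(N, entropic, chi2)             # entropic bound grows far more slowly
```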

Numerical Solution of Smooth Optimal Transport Problems
The smooth semi-discrete optimal transport problem (12) constitutes a stochastic optimization problem and can therefore be addressed with a stochastic gradient descent (SGD) algorithm. In Section 4.1 we first derive new convergence guarantees for an averaged gradient descent algorithm that only has access to a biased stochastic gradient oracle. This algorithm outputs the uniform average of the iterates (instead of the last iterate) as the recommended candidate solution. We prove that if the objective function is Lipschitz continuous, then the suboptimality of this candidate solution is of the order 𝒪(1/√T), where T stands for the number of iterations. An improvement in the non-leading terms is possible if the objective function is additionally smooth. We further prove that a convergence rate of 𝒪(1/T) can be obtained for generalized self-concordant objective functions. In Section 4.2 we then show that the algorithm of Section 4.1 can be used to efficiently solve the smooth semi-discrete optimal transport problem (12) corresponding to a marginal ambiguity set of the type (26). As a byproduct, we prove that the convergence rate of the averaged SGD algorithm for the semi-discrete optimal transport problem with entropic regularization is of the order 𝒪(1/T), which improves the 𝒪(1/√T) guarantee of Genevay et al. [2016].

Averaged Gradient Descent Algorithm with Biased Gradient Oracles
Consider a general convex minimization problem of the form (36), where the objective function h : ℝ^n → ℝ is convex and differentiable. We assume that problem (36) admits a minimizer x⋆. We study the convergence behavior of the inexact gradient descent algorithm (37), where γ > 0 is a fixed step size, x₀ is a given deterministic initial point and the function g_t : ℝ^n → ℝ^n is an inexact gradient oracle that returns for every fixed x ∈ ℝ^n a random estimate of the gradient of h at x.
Note that we allow the gradient oracle g_t to depend on the iteration counter t, which allows us to account for increasingly accurate gradient estimates. In contrast to the previous sections, we henceforth model all random objects as measurable functions on an abstract filtered probability space (Ω, ℱ, (ℱ_t)_{t≥0}, ℙ), where ℱ₀ = {∅, Ω} represents the trivial σ-field, while the gradient oracle g_t(x) is ℱ_t-measurable for all t ∈ ℕ and x ∈ ℝ^n. In order to avoid clutter, we use 𝔼[·] to denote the expectation operator with respect to ℙ, and all inequalities and equalities involving random variables are understood to hold ℙ-almost surely.
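The update rule (37) combined with uniform iterate averaging can be sketched as follows. The quadratic objective and the decaying oracle bias below are illustrative choices of ours, not the paper's setting; they merely demonstrate that averaging tames both noise and a mild bias:

```python
import numpy as np

def averaged_sgd(grad_oracle, x0, n_steps, step):
    """Inexact gradient descent x_t = x_{t-1} - step * g_t(x_{t-1}) that
    returns the uniform average of all iterates instead of the last one."""
    x = np.array(x0, dtype=float)
    avg = x.copy()
    for t in range(1, n_steps + 1):
        x = x - step * grad_oracle(t, x)
        avg += (x - avg) / (t + 1)   # running uniform average of x_0, ..., x_t
    return avg

# Toy objective h(x) = 0.5 * ||x||^2, whose exact gradient is x. The oracle
# adds zero-mean Gaussian noise and a deterministic bias decaying like 1/t.
rng = np.random.default_rng(1)
oracle = lambda t, x: x + rng.normal(0.0, 1.0, x.shape) + 0.5 / t

T = 20_000
x_bar = averaged_sgd(oracle, np.ones(5), T, step=1.0 / np.sqrt(T))
print(0.5 * x_bar @ x_bar)   # far below the initial value h(x_0) = 2.5
```

The constant step size of order 1/√T matches the regime analyzed in this section.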
In the following we analyze the effect of averaging in inexact gradient descent algorithms. We will show that after T iterations with a constant step size γ = 𝒪(1/√T), the objective function value of the uniform average of all iterates generated by (37) converges to the optimal value of (36) at a sublinear rate.
Specifically, we will prove that the rate of convergence varies between 𝒪(1/√T) and 𝒪(1/T) depending on the properties of the objective function. Our convergence analysis will rely on several regularity conditions. Assumption 4.1 (Regularity conditions). Different combinations of the following regularity conditions will enable us to establish different convergence guarantees for the averaged inexact gradient descent algorithm.
(iii) Generalized self-concordance: The function h is M-generalized self-concordant for some M > 0, that is, h is three times differentiable, and for any x, x′ ∈ ℝ^n the function φ(t) = h(x + t(x′ − x)) satisfies the inequality |φ‴(t)| ≤ M‖x′ − x‖φ″(t). (iv) Lipschitz continuous gradient: The function h is L-smooth for some L > 0, that is, we have ‖∇h(x) − ∇h(x′)‖ ≤ L‖x − x′‖ for all x, x′ ∈ ℝ^n. (v) Bounded second moments: There exists σ > 0 such that the conditional second moments of the gradient oracles are bounded by σ². The averaged gradient descent algorithm with biased gradient oracles lends itself to solving both deterministic and stochastic optimization problems [Ruppert, 1988, Polyak and Juditsky, 1992, Nemirovski et al., 2009]. For general convex objective functions the best possible convergence rate of any averaged SGD algorithm run over T iterations amounts to 𝒪(1/√T), but it improves to 𝒪(1/T) if the objective function is strongly convex; see for example [Nesterov and Vial, 2008, Nemirovski et al., 2009, Duchi and Singer, 2009, Xiao, 2009, Moulines and Bach, 2011, Shalev-Shwartz et al., 2011, Lacoste-Julien et al., 2012]. While smoothness plays a critical role in achieving acceleration in deterministic optimization, it only improves the constants in the convergence rate in stochastic optimization [Srebro et al., 2010, Dekel et al., 2012, Lan, 2012, Cohen et al., 2018, Kavis et al., 2019]. In fact, Tsybakov [2003] demonstrates that smoothness does not provide any acceleration in general, that is, the best possible convergence rate of any averaged SGD algorithm still cannot be improved beyond 𝒪(1/√T). Nevertheless, a substantial acceleration is possible when focusing on special problem classes such as linear or logistic regression problems [Bach, 2014, Bach and Moulines, 2013, Hazan et al., 2014]. In these special cases, the improvement in the convergence rate is facilitated by a generalized self-concordance property of the objective function [Bach, 2010].
Self-concordance was originally introduced in the context of Newton-type interior point methods [Nesterov and Nemirovskii, 1994] and later generalized to facilitate the analysis of probabilistic models [Bach, 2010] and second-order optimization algorithms [Sun and Tran-Dinh, 2019].
In the following we analyze the convergence properties of the averaged SGD algorithm when we only have access to an inexact stochastic gradient oracle, in which case the bias tolerances cannot be set to 0. To the best of our knowledge, inexact stochastic gradient oracles have only been considered by Cohen et al. [2018], Hu et al. [2020] and Ajalloeian and Stich [2020]. Specifically, Hu et al. [2020] use sequential semidefinite programs to analyze the convergence rate of the averaged SGD algorithm when the underlying randomness has a finite support. In contrast, we do not impose any restrictions on the support. Cohen et al. [2018] and Ajalloeian and Stich [2020], on the other hand, study the convergence behavior of accelerated gradient descent algorithms for smooth stochastic optimization problems under the assumption that the decision variable ranges over a compact domain. The proposed algorithms necessitate a projection onto the compact feasible set in each iteration. In contrast, our convergence analysis does not rely on any compactness assumptions. We note that compactness assumptions have been critical for the convergence analysis of the averaged SGD algorithm in the context of convex stochastic optimization [Nemirovski et al., 2009, Dekel et al., 2012, Bubeck, 2015, Cohen et al., 2018]. By leveraging a trick due to Bach [2014], however, we can relax this assumption provided that the objective function is Lipschitz continuous.
If additionally Assumption 4.1 (iii) holds and if the constant is defined as the indicated maximum, then we have for all T ∈ ℕ that the stated bound holds. The proof of Proposition 4.2 relies on two lemmas. In order to state these lemmas concisely, we define the p-norm of a random variable X ∈ ℝ^n for any p > 0 through ‖X‖_p = (𝔼[‖X‖^p])^{1/p}. For any random variables X, X′ ∈ ℝ^n and p ≥ 1, Minkowski's inequality [Boucheron et al., 2013, § 2.11] then states that ‖X + X′‖_p ≤ ‖X‖_p + ‖X′‖_p. Another essential tool for proving Proposition 4.2 is the Burkholder-Rosenthal-Pinelis (BRP) inequality [Pinelis, 1994, Theorem 4.1], which we restate below without proof to keep this paper self-contained.
Armed with Lemmas 4.3 and 4.4, we are now ready to prove Proposition 4.2.
Proof of Proposition 4.2. The first claim generalizes Proposition 5 by Bach [2014] to inexact gradient oracles. By the assumed convexity and differentiability of the objective function ℎ, we have In addition, elementary algebra yields the recursion Thanks to the update rule (37), this recursion can be re-expressed as where > 0 is an arbitrary step size. Combining the above identity with (39) then yields where the last inequality follows from Assumption 4.1 (ii). Summing this inequality over then shows that for all ∈ N. Note that the term on the left hand side of (40) is non-negative because ⋆ is a global minimizer of ℎ, which implies that the random variable is also non-negative for all ∈ N. For later use we further define 0 = ‖ 0 − ⋆ ‖ 2 . The estimate (40) for = then implies via the convexity of ℎ that where we dropped the non-negative term ‖ − ⋆ ‖ 2 /(2 ) without invalidating the inequality. In the following we analyze the -norm of in order to obtain the desired bounds from the proposition statement. To do so, we distinguish three different regimes for ∈ N, and we show that the -norm of the non-negative random variable is upper bounded by an affine function of in each of these regimes.
By the definitions of the relevant quantities, for all T ∈ ℕ we thus have where the first two inequalities follow from Assumption 4.1 (ii) and the conservative estimate derived above, respectively, while the fourth inequality holds because 2ab ≤ a² + b² for all a, b ∈ ℝ. As the underlying quantities are non-negative, the random variable is bounded and satisfies | | ≤ 2‖ 0 − ⋆ ‖ 2 + 7 2 2 2 for all ∈ N, which implies that where the last inequality holds because ≥ /4. Note that the resulting upper bound is affine in .
Case II (2 ≤ ≤ /4): The subsequent analysis relies on the simple bounds As ≥ 2, Minkowski's inequality (38) thus implies that In order to bound the penultimate term in (44), we first note that for all ∈ N, where the second inequality holds due to Assumption 4.1 (i), and the last inequality follows from (40). This in turn implies that for all ∈ [ ] we have where the last inequality exploits (43). Therefore, the penultimate term in (44) satisfies where the equality follows from the definition of the -norm.
Next, we bound the last term in (44) by using the BRP inequality of Lemma 4.3. To this end, note that for all ∈ N, where the second inequality exploits the definition of and (45), the third inequality follows from (40), and the last inequality holds because of Assumption 4.1 (ii). Hence, we obtain where the second inequality follows from (43) and the definition of the -norm. In addition, we have where the first inequality exploits the upper bound on |¯| derived above, which implies that E[¯2|ℱ −1 ] ≤ 4 2 (2 + −1 ) 2 −1 . The last three inequalities follow from the Hölder inequality, the triangle inequality for the Euclidean norm and the two inequalities in (43), respectively. Recalling that ≥ 2, we may then apply the BRP inequality of Lemma 4.3 to the martingale differences¯, ∈ [ ], and use the bounds derived in the last two display equations in order to conclude that Substituting (46) and (47) into (44), we thus obtain where the second inequality holds because ≤ /4 by assumption, which implies that √ + ≤ 1.5 √ and √ + + 2 √ ≤ 6 √ . As Jensen's inequality ensures that ‖ ‖ /2 ≤ ‖ ‖ for any random variable and > 0, the following inequality holds for all 2 ≤ ≤ /4.
Case III ( = 1): Recalling the definition of ≥ 0, we find that where the second inequality follows from the estimate (46), which holds indeed for all ∈ N, while the last inequality follows from Jensen's inequality. By the second inequality in (48) for = 2, we thus find where the last inequality holds because 2 ≤ 2 2 + 2 /2 for all , ∈ R.
We now combine the bounds derived in Cases I, II and III to obtain a universal bound on ‖ ‖ that holds for all ∈ N. Specifically, one readily verifies that the bound is more conservative than each of the bounds (42), (48) and (49), and thus it holds indeed for any ∈ N.
Combining this universal bound with (41) proves the first inequality from the proposition statement.
In order to prove the second inequality, we need to extend [Bach, 2014, Proposition 7] to biased gradient oracles. To this end, we first note that where the second inequality follows from Lemma 4.4 (i), and the third inequality holds due to (40). By Minkowski's inequality (38), we thus have for any ≥ 1 that where the last inequality follows from the universal bound (50). In order to estimate the last term in the above expression, we recall that the update rule (37) is equivalent to ( −1 ) = ( −1 − ) / , which in turn implies that ∑︀ =1 ( −1 ) = ( 0 − ) / . Hence, for any ≥ 1, we have where the first inequality exploits Minkowski's inequality (38), the second inequality follows from (40), which implies that ‖ ⋆ − ‖ ≤ √ , and the definition of the -norm. The last inequality in the above expression is a direct consequence of the universal bound (50) and the inequality √ + ≤ √ + √ for all , ≥ 0. Next, define for any ∈ N a martingale difference of the form Note that these martingale differences are bounded because and thus the BRP inequality of Lemma 4.3 implies that Recalling the definition of the martingale differences , ∈ N, this bound allows us to conclude that where the second inequality exploits Assumption 4.1 (i) as well as the second inequality in (43). Combining all inequalities derived above and observing that 2 √ 2 + 2 √ 10 < 10 finally yields where = max{ , +¯}. This proves the second inequality from the proposition statement.
The following corollary follows immediately from the proof of Proposition 4.2.
Corollary 4.5. Consider the inexact gradient descent algorithm (37) with constant step size > 0. If Assumptions 4.1 (i)-(ii) hold with ≤ ¯/(2 √ 1 + ) for some ¯ ≥ 0, then we have

Proof of Corollary 4.5. Defining as in the proof of Proposition 4.2, we find where the inequality is an immediate consequence of the reasoning in Case III in the proof of Proposition 4.2. The claim then follows from the trivial inequality + ¯ ≥ .
Armed with Proposition 4.2 and Corollary 4.5, we are now ready to prove the main convergence result.
To prove assertion (iii), we distinguish two different cases.
Case I: Assume first that 4 2 ‖ 0 − ⋆ ‖ 2 + 6 ‖ 0 − ⋆ ‖ ≤ √ /(8 2 ), where = max{ , +¯} and denotes the smallest eigenvalue of ∇ 2 ℎ( ⋆ ). By a standard formula for the expected value of a non-negative random variable, we find The first of the three integrals in (53) is trivially upper bounded by 1 . Next, we investigate the third integral in (53), which is easier to bound from above than the second one. By combining the first inequality in Proposition 4.2 for = 1/(2 2 √ ) with the trivial inequality ≥ +¯, we find Lemma 4.7 (i) with = 2 2 ‖ 0 − ⋆ ‖ 2 / √ and = 10/ √ thus implies that We also have where the first inequality follows from the basic assumption underlying Case I, while the second inequality holds due to the definition of 2 . By (54) and (55), the third integral in (53) satisfies where the first inequality follows from the concentration inequality (54) and the insight from (55) that 2 − 4 2 √ ‖ 0 − ⋆ ‖ 2 ≥ 0. The second inequality exploits again (55), and the last inequality holds because exp(− ) ≤ 1/(2 ) for all > 0. We have thus found a simple upper bound on the third integral in (53).
It remains to derive an upper bound on the second integral in (53). To this end, we first observe that the second inequality in Proposition 4.2 for = 1/(2 2 √ ) translates to Lemma 4.7 (ii) with = 10 / √ , = 4 / + 40 / √ and = 4 3 ‖ 0 − ⋆ ‖ 2 / √ + 6 2 ‖ 0 − ⋆ ‖/ √ thus gives rise to the concentration inequality which holds only for small deviations ≤ . However, this concentration inequality can be simplified to In the following, we restrict attention to those deviations ≥ 0 that are small in the sense that Assume now for the sake of argument that the event inside the probability in the simplified concentration inequality does not occur, that is, assume that By (56) and the assumption of Case I, (57) implies that ‖∇ℎ( 1 ∑︀ =1 −1 )‖ < 3 /(4 ) < 3 /(4 ). Hence, we may apply Lemma 4.4 (ii) to conclude that ℎ( 1 ∑︀ Combining this inequality with (57) then yields By the simplified concentration inequality derived above, we may thus conclude that for any ≥ 0 that satisfies (56), where the second inequality holds because (57) implies (58) or, equivalently, because the negation of (58) implies the negation of (57). The resulting concentration inequality (59) now enables us to construct an upper bound on the second integral in (53). To this end, we define the function for all ≥ 0, and set ¯ = ((9/400 + √ /(160 2 )) 1 2 − 3/20) 2 . Note that ≥ 0 satisfies the inequality (56) if and only if ≤ ¯ and that ℓ(0) = 1 as well as ℓ(¯) = 2 . By substituting with ℓ( ) and using the concentration inequality (59) to bound the integrand, we find that the second integral in (53) satisfies where is a shorthand for 4 2 ‖ 0 − ⋆ ‖ 2 + 6 ‖ 0 − ⋆ ‖, and Γ denotes the Gamma function with Γ(2) = 1, Γ(1/2) = √π and Γ(3/2) = √π/2; see for example [Rudin, 1964, Chapter 8]. The last inequality is obtained by rounding all fractional numbers up to the next higher integer. Combining the upper bounds for the three integrals in (53) finally yields This completes the proof of assertion (iii) in Case I.
, where is defined as before.

Smooth Optimal Transport Problems with Marginal Ambiguity Sets
The smooth optimal transport problem (12) can be viewed as an instance of a stochastic optimization problem, that is, a convex maximization problem akin to (36) whose objective function is representable as an expectation. Throughout this section we assume that the smooth (discrete) -transform ( , ) defined in (11) is induced by a marginal ambiguity set of the form (26) with continuous marginal distribution functions. By Proposition 3.6, the integrand ⊤ − ( , ) is therefore concave and differentiable in . We also assume that ( , ) is -integrable in , that we have access to an oracle that generates independent samples from , and that problem (12) is solvable.
The following proposition establishes several useful properties of the smooth -transform.
then ( , ) is -generalized self-concordant with respect to in the sense of Assumption 4.1 (iii).
Proof. As for (i), Proposition 3.6 implies that ∇ ( , ) ∈ Δ , and thus we have ‖∇ ( , )‖ ≤ 1. As for (ii), note that the convex conjugate of the smooth -transform with respect to is given by where the second equality follows again from Proposition 3.6, and the interchange of the infimum and the supremum is allowed by Sion's classical minimax theorem. In the following we first prove that * ( , ) is 1/ -strongly convex in , that is, the function * ( , ) − ‖ ‖ 2 /(2 ) is convex in for any fixed ∈ .
To this end, recall that is assumed to be Lipschitz continuous with Lipschitz constant . Thus, we have ≥ sup where the second inequality follows from restricting 1 and 2 to the preimage of (0, 1) with respect to .
Rearranging terms in the above inequality then yields is convex in on the interval (0, 1). This implies that constitutes a sum of convex univariate functions for every fixed ∈ . Thus, * ( , ) is 1/ -strongly convex in . By [Kakade et al., 2009, Theorem 6], however, any convex function whose conjugate is 1/ -strongly convex is guaranteed to be -smooth. This observation completes the proof of assertion (ii). As for assertion (iii), choose any , ∈ R and ∈ , and introduce the auxiliary function For ease of exposition, in the remainder of the proof we use prime symbols to designate derivatives of univariate functions. A direct calculation then yields By Proposition 3.6, ⋆ ( ) = ∇ ( + ( − ), ) represents the unique solution of the maximization problem in (61). In addition, by [Sun and Tran-Dinh, 2019, Proposition 6], the Hessian of the smooth -transform with respect to can be computed from the Hessian of its convex conjugate as follows.
Hence, the first two derivatives of the auxiliary function ( ) simplify to Similarly, the above formula for the Hessian of the smooth -transform can be used to show that ( ⋆ ) ′ ( ) = The third derivative of ( ) therefore simplifies to This implies via Hölder's inequality that Notice that the first term in the above expression coincides with ′′ ( ), and the second term satisfies where the first inequality holds because max ∈[ ] | | ≤ ‖ ‖ ∞ ‖ ‖ ∞ for all , ∈ R , and the second inequality follows from the definition of and the fact that the 2-norm provides an upper bound on the ∞-norm. Combining the above results shows that | ′′′ ( )| ≤ ‖ − ‖ ′′ ( ) for all ∈ R. The claim now follows because , ∈ R and ∈ were chosen arbitrarily.
In the following we use the averaged SGD algorithm of Section 4.1 to solve the smooth optimal transport problem (12). A detailed description of this algorithm in pseudocode is provided in Algorithm 1.
This algorithm repeatedly calls a sub-routine for estimating the gradient of ( , ) with respect to .
By Proposition 3.6, this gradient coincides with the unique solution ⋆ of the convex maximization problem (27). In addition, from the proof of Proposition 3.6 it is clear that its components are given by where ⋆ represents an optimizer of the semi-parametric discrete choice problem (11). Therefore, ⋆ can be interpreted as a vector of choice probabilities under the best-case probability measure ⋆ . Sometimes these choice probabilities are available in closed form. This is the case, for instance, in the exponential distribution model of Example 3.8. Note that these particular choice probabilities are routinely studied in the celebrated multinomial logit choice model [Ben-Akiva and Lerman, 1985, § 5.1]. The choice probabilities are also available in closed form in the uniform distribution model of Example 3.9. As the derivation of ⋆ is somewhat cumbersome in this case, we relegate it to Appendix D. For general marginal ambiguity sets with continuous marginal distribution functions, we propose a bisection method to compute the gradient of the smooth -transform numerically up to any prescribed accuracy; see Algorithm 2.
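For the exponential (entropic) model, the choice probabilities take the familiar multinomial-logit form. The following sketch assumes the standard entropic parametrization p_i ∝ μ_i exp((φ_i − c_i)/η), with a hypothetical regularization parameter η; the function name and the numbers are ours, not the paper's. It evaluates the probabilities with the log-sum-exp trick so that large exponents cannot overflow.

```python
import math

def logit_choice_probabilities(phi, costs, mu, eta):
    """Multinomial-logit choice probabilities p_i proportional to
    mu_i * exp((phi_i - c_i) / eta), evaluated with the log-sum-exp
    trick for numerical stability."""
    logw = [(f - c) / eta + math.log(m) for f, c, m in zip(phi, costs, mu)]
    shift = max(logw)                      # stabilizing shift before exponentiating
    w = [math.exp(v - shift) for v in logw]
    total = sum(w)
    return [v / total for v in w]

# choice probabilities for three atoms with uniform weights mu
p = logit_choice_probabilities([0.3, -0.1, 0.4], [0.5, 0.2, 0.9], [1 / 3] * 3, 0.1)
```

As η shrinks, the probabilities concentrate on the atom with the largest φ_i − c_i, recovering the unregularized argmax.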
Proof. Thanks to Proposition 3.6, we can recast the smooth -transform in dual form as .
Strong duality and dual solvability hold because we may construct a Slater point for the primal problem by setting = 1/ , ∈ [ ]. By the Karush-Kuhn-Tucker optimality conditions, ⋆ and ( ⋆ , ⋆ ) are therefore optimal in the primal and dual problems, respectively, if and only if we have Finding the unique optimizer ⋆ of (27) (i.e., finding the gradient of ( , )) is therefore tantamount to finding a root ⋆ of the univariate equation Note that the function on the left-hand side of (63) is continuous and non-decreasing in because of the continuity (by assumption) and monotonicity (by definition) of the cumulative distribution functions. Hence, the root-finding problem can be solved efficiently via bisection. To complete the proof, we first show that the interval between the constants and defined in Algorithm 2 is guaranteed to contain ⋆ . Specifically, we will demonstrate that evaluating the function on the left-hand side of (63) at or yields a number that is not larger or not smaller than 1, respectively. For = we have where the inequality follows from the monotonicity of . Summing the above inequality over all ∈ [ ] then yields the desired inequality We may thus conclude that constitutes a valid initial search interval for the bisection algorithm. Note that the function 1 − ( ( , ) − − ), which defines in terms of , is uniformly continuous in throughout R. This follows from [Billingsley, 1995, Problem 14.8] and our assumption that is continuous. The uniform continuity ensures that the tolerance is strictly positive for every > 0. As the length of the search interval is halved in each iteration, Algorithm 2 outputs a near-optimal solution with | − ⋆ | ≤ ( ) after ⌈log 2 (( − )/ ( ))⌉ iterations.
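The bisection step underlying Algorithm 2 can be sketched generically: given a continuous non-decreasing function with a sign change on an initial bracket, halving the bracket localizes the root to any prescribed tolerance. The logistic marginals and the numbers below are illustrative assumptions standing in for the marginal distribution functions, not the paper's data.

```python
import math

def bisect_root(g, lo, hi, tol=1e-10):
    """Root of a continuous non-decreasing g with g(lo) <= 0 <= g(hi);
    the search interval is halved in every iteration, as in Algorithm 2."""
    assert g(lo) <= 0 <= g(hi)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) <= 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Illustrative analogue of the optimality condition (63): find tau with
# sum_i F(a_i - tau) = 1 for hypothetical logistic marginal CDFs F.
F = lambda t: 1.0 / (1.0 + math.exp(-t))
a = [0.4, -0.2, 0.1]
tau = bisect_root(lambda t: 1.0 - sum(F(ai - t) for ai in a), -10.0, 10.0)
```

The number of iterations needed to reach tolerance tol is ⌈log2((hi − lo)/tol)⌉, matching the iteration count stated in the proof.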
Moreover, the construction of ( ) guarantees that |1 −

Corollary 4.10. (i) If = 1/(2(2 +¯) √ ) and is continuous for every ∈ [ ], then we have

(ii) If = 1/(2 √ + ) and is Lipschitz continuous with Lipschitz constant > 0 for every ∈ [ ], then we have

(iii) If = 1/(2 2 √ ) with = max{ , 2 +¯}, satisfies the generalized self-concordance condition (60) with > 0 for every ∈ [ ], and the smallest eigenvalue of −∇ 2 ℎ( ⋆ ) is strictly positive, then we have

Proof. Recall that problem (12) can be viewed as an instance of the convex minimization problem (36) provided that its objective function is inverted. Throughout the proof we denote by ( , ) the inexact estimate for ∇ ( , ) output by Algorithm 2 in iteration of the averaged SGD algorithm. Note that where the two inequalities follow from Jensen's inequality and the choice of −1 in Algorithm 1, respectively.
One can show that the objective function of the smooth optimal transport problem (12) with marginal exponential noise distributions as described in Example 3.8 is generalized self-concordant. Hence, the convergence rate of Algorithm 1 for this exponential distribution model is of the order O(1/t), which improves the state-of-the-art O(1/√t) guarantee established by Genevay et al. [2016].

Numerical Experiments
All experiments are run on a 2.6 GHz 6-Core Intel Core i7, and all optimization problems are implemented in MATLAB R2020a. The corresponding codes are available at https://github.com/RAO-EPFL/ Semi-Discrete-Smooth-OT.git.
We now aim to assess the empirical convergence behavior of Algorithm 1 and to showcase the effects of regularization in semi-discrete optimal transport. To this end, we solve the original dual optimal transport problem (10) as well as its smooth variant (12) with a Fréchet ambiguity set corresponding to the exponential distribution model of Example 3.8, to the uniform distribution model of Example 3.9 and to the hyperbolic cosine distribution model of Example 3.11. Recall from Theorem 3.7 that any Fréchet ambiguity set is uniquely determined by a marginal generating function and a probability vector . As for the exponential distribution model of Example 3.8, we set ( ) = exp(10 −1) and = 1/ for all ∈ [ ].
In this case problem (12) is equivalent to the regularized primal optimal transport problem (13) with an entropic regularizer, and the gradient ∇ ( , ), which is known to coincide with the vector ⋆ of optimal choice probabilities in problem (27), admits the closed-form representation (62). We can therefore solve problem (12) with a variant of Algorithm 1 that calculates ∇ ( , ) exactly instead of approximately via bisection. As for the uniform distribution model of Example 3.9, we set ( ) = /20+1/2 and = 1/ for all ∈ [ ]. In this case problem (12) is equivalent to the regularized primal optimal transport problem (13) with a 2 -divergence regularizer, and the vector ⋆ of optimal choice probabilities can be computed exactly and highly efficiently by sorting thanks to Proposition D.1 in the appendix. We can therefore again solve problem (12) with a variant of Algorithm 1 that calculates ∇ ( , ) exactly. As for the hyperbolic cosine model of Example 3.11, we set ( ) = sinh(10 − ) with = √ 2 − 1 − arcsinh(1) and = 1/ for all ∈ [ ]. In this case problem (12) is equivalent to the regularized primal optimal transport problem (13) with a hyperbolic divergence regularizer. However, the vector ⋆ is not available in closed form, and thus we use Algorithm 2 to compute ⋆ approximately. Lastly, note that the original dual optimal transport problem (10) can be interpreted as an instance of (12) equipped with a degenerate singleton ambiguity set that only contains the Dirac measure at the origin of R . In this case ( , ) = ( , ) fails to be smooth in , but an exact subgradient ⋆ ∈ ( , ) is given by We can therefore solve problem (10) with a variant of Algorithm 1 that has access to exact subgradients (instead of gradients) of ( , ). Note that the maximizer ⋆ of (10) may not be unique. In our experiments, we force Algorithm 1 to converge to the maximizer with minimal Euclidean norm by adding a vanishingly small Tikhonov regularization term to ( , ). 
Thus, we set ( , ) = ( , ) + ‖ ‖ 2 2 for some small regularization weight > 0, in which case ⋆ + 2 ∈ ( , ) is an exact subgradient.
In the following we set to the standard Gaussian measure on = R 2 and to the uniform measure on 10 independent samples drawn uniformly from = [−1, 1] 2 . We further set the transportation cost to ( , ) = ‖ − ‖ ∞ . Under these assumptions, we use Algorithm 1 to solve the original as well as the three smooth optimal transport problems approximately for = 1, . . . , 10^5. For each fixed the step size is selected in accordance with Corollary 4.10. We emphasize that Corollary 4.10 (i) remains valid if ( , ) fails to be smooth in and we only have access to subgradients; see [Nesterov and Vial, 2008, Corollary 1]. Denoting by ¯ the output of Algorithm 1, we record the suboptimality of ¯ in (12) as well as the discrepancy ‖ ¯ − ⋆ ‖ 2 2 of ¯ to the exact maximizer ⋆ of problem (12) as a function of . In order to faithfully measure the convergence rate of ¯ and its suboptimality, we need to compute ⋆ as well as ( , ) to within high accuracy. This is only possible if the dimension of is small (e.g., if = R 2 as in our numerical example), even though Algorithm 1 can efficiently solve optimal transport problems in high dimensions. We obtain high-quality approximations for ( , ) and ⋆ by solving the finite-dimensional optimal transport problem between and the discrete distribution that places equal weight on 10 × samples drawn independently from . Note that only the first of these samples are used by Algorithm 1. The proposed high-quality approximations of the entropic and 2 -divergence regularized optimal transport problems are conveniently solved via Nesterov's accelerated gradient descent method, where the suboptimality gap of the t-th iterate is guaranteed to decay as O(1/t^2) under the step size rule advocated in [Nesterov, 1983, Theorem 1]. To the best of our knowledge, Nesterov's accelerated gradient descent algorithm is not guaranteed to converge with inexact gradients.
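The experimental pipeline above can be sketched in miniature for the entropic model: sample x from the Gaussian measure ν, evaluate the choice probabilities under the sup-norm cost, and take an averaged SGD step on the dual potentials. The step size, regularization level and atom locations below are illustrative assumptions, and the sketch does not reproduce the paper's Algorithm 1 in full detail.

```python
import math
import random

def choice_probs(phi, x, zs, eta):
    # p_i proportional to exp((phi_i - c(x, z_i)) / eta) with c the sup-norm
    # cost; uniform atom weights mu cancel inside the softmax
    logw = [(phi[i] - max(abs(x[0] - z[0]), abs(x[1] - z[1]))) / eta
            for i, z in enumerate(zs)]
    shift = max(logw)
    w = [math.exp(v - shift) for v in logw]
    s = sum(w)
    return [v / s for v in w]

def averaged_sgd(zs, mu, eta, steps, gamma, seed=0):
    """Averaged SGD on the smooth dual: the stochastic gradient at phi is
    mu - p(x), and the running average of the iterates is returned."""
    rng = random.Random(seed)
    n = len(zs)
    phi = [0.0] * n
    avg = [0.0] * n
    for t in range(1, steps + 1):
        x = (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))  # sample x from nu
        p = choice_probs(phi, x, zs, eta)
        phi = [phi[i] + gamma * (mu[i] - p[i]) for i in range(n)]
        avg = [avg[i] + (phi[i] - avg[i]) / t for i in range(n)]
    return avg

zs = [(0.0, 0.0), (0.5, 0.5), (-0.5, 0.3)]
phi_bar = averaged_sgd(zs, [1 / 3] * 3, eta=0.5, steps=2000, gamma=0.1)
```

Because each stochastic gradient mu − p(x) sums to zero, the iterates (and hence their average) stay on the hyperplane of zero-sum potentials, which fixes the additive dual degree of freedom.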
For the hyperbolic divergence regularized optimal transport problem, we thus use Algorithm 1 with 50 × iterations to obtain an approximation for ( , ) and ⋆ . In contrast, we model the high-quality approximation of the original optimal transport problem (10) in YALMIP [Löfberg, 2004] and solve it with MOSEK. If this problem has multiple maximizers, we report the one with minimal Euclidean norm. Figure 1 shows how the suboptimality of ¯ and the discrepancy between ¯ and the exact maximizer decay with , both for the original as well as for the entropic, the 2 -divergence and the hyperbolic divergence regularized optimal transport problems, averaged across 20 independent simulation runs. Figure 1a suggests that the suboptimality decays as O(1/√t) for the original optimal transport problem, which is in line with the theoretical guarantees by Nesterov and Vial [2008, Corollary 1], and as O(1/t) for the smooth optimal transport problems. Moreover, the smallest eigenvalue of −∇ 2 ℎ( ⋆ ), which we estimate when solving the high-quality approximations of the two smooth optimal transport problems, is strictly positive. Therefore, Corollary 4.10 (iii) is indeed applicable and guarantees that the suboptimality gap is bounded above by O(1/t). Finally, Figure 1b suggests that ‖ ¯ − ⋆ ‖ 2 2 converges to 0 at rate O(1/t) for the entropic, the 2 -divergence and the hyperbolic divergence regularized optimal transport problems, which is consistent with [Bach, 2014, Proposition 10].
Acknowledgements. This research was supported by the Swiss National Science Foundation under the NCCR Automation, grant agreement 51NF40_180545. The research of the second author is supported by an Early Postdoc.Mobility Fellowship, grant agreement P2ELP2_195149.
(i) Given an oracle that evaluates exactly, we can compute a -approximation for ⋆ with 2 oracle calls.
(ii) Given an oracle that evaluates inexactly to within an absolute accuracy = (1/4) min , we can compute a 2 -approximation for ⋆ with 2 oracle calls.

Figure 1: Suboptimality (a) and discrepancy to ⋆ (b) of the outputs ¯ of Algorithm 1 for the original (blue), the entropic regularized (orange), the 2 -divergence regularized (red) and the hyperbolic divergence regularized (purple) optimal transport problems.
Proof. Consider the uniform grid { 0 , . . . , 2 }, and note that the difference 2 − between consecutive grid points is strictly smaller than . Next, introduce a piecewise affine function ¯ : [0, 1] → R + that linearly interpolates between consecutive grid points. By construction, ¯ is affine on the interval [ −1 , ] with slope / and = ( ) − ( −1 ) for all ∈ [2 ]. In addition, ¯ is continuous and inherits convexity from . As is strictly convex, it is easy to verify that ¯ also has a kink at every inner grid point for ∈ [2 − 1], and therefore the distance between the unique minimizer ⋆ of and any minimizer of ¯ is at most 2 − < . In order to compute a -approximation for ⋆ , it thus suffices to find a minimizer of ¯.
Next, define = ( 0 , . . . , 2 ) with 0 = −∞. As ¯ has a kink at every inner grid point, we may conclude that the array is sorted in ascending order, that is, > −1 for all ∈ [2 ]. This implies that at most one element of can vanish. In the following, define ⋆ = max{ ∈ {0} ∪ [2 ] : ≤ 0}. If ⋆ = 0, then ¯ is uniquely minimized by ⋆ = 0, and ⋆ must fall within the interval [ 0 , 1 ]. If ⋆ > 0 and ⋆ < 0, on the other hand, then ¯ is uniquely minimized by ⋆ , and ⋆ must fall within the interval [ ⋆ −1 , ⋆ +1 ]. If ⋆ > 0 and ⋆ = 0, finally, then every point in [ ⋆ −1 , ⋆ ] minimizes ¯, and ⋆ must also fall within [ ⋆ −1 , ⋆ ]. In any case, ⋆ provides a -approximation for ⋆ . In the remainder of the proof we show that the index ⋆ can be computed efficiently by Algorithm 3. This bisection scheme maintains lower and upper bounds and on the sought index ⋆ , respectively, and reduces the search interval between and by a factor of two in each iteration. Thus, Algorithm 3 computes ⋆ in exactly iterations [Cormen et al., 2009, § 12].

Algorithm 3 Binary search algorithm
Input: An array ∈ R 2 sorted in ascending order.

We remark that is guaranteed to be an integer and thus represents a valid index in each iteration of the algorithm because and are initialized as 0 and 2 , respectively. Note also that if we have access to an oracle for evaluating exactly, then any element of the array can be computed with merely two oracle calls. Algorithm 3 thus computes ⋆ with 2 oracle calls in total. Hence, assertion (i) follows.
As for assertion (ii), assume now that we have only access to an inexact oracle that outputs, for any fixed ∈ . This reasoning reveals that ˜ has the same sign as for every ∈ [2 ] with ̸ = 0. In addition, it implies that ˜⋆ approximates ⋆ irrespective of whether or not the array has a vanishing element. Indeed, if no element of vanishes, then ˜ has the same sign as for all ∈ [2 ]. As Algorithm 3 only checks signs, this implies that ˜⋆ = ⋆ and that ˜⋆ provides a -approximation for ⋆ as in assertion (i). If one element of vanishes, on the other hand, then ˜ has the same sign as for all ∈ [2 ] with ̸ = ⋆ . As Algorithm 3 only checks signs, this implies that |˜⋆ − ⋆ | ≤ 1. Recalling that | ⋆ − ⋆ | ≤ , we thus have In either case, ˜⋆ provides a 2 -approximation for ⋆ . As any element of the array ˜ can be evaluated with two oracle calls, Algorithm 3 computes ˜⋆ with 2 oracle calls in total. Hence, assertion (ii) follows.
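Stripped of the oracle bookkeeping, Algorithm 3 is a standard binary search for the largest non-positive element of a sorted array. A minimal sketch (the function name and the 1-based return convention are ours, chosen to match the proof's indexing):

```python
def last_nonpositive(s):
    """Largest 1-based index i with s[i-1] <= 0, or 0 if every element is
    positive; assumes s is sorted in ascending order. Only the signs of the
    entries are inspected, and the search interval is halved each round."""
    lo, hi = 0, len(s)               # invariant: the answer lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi + 1) // 2     # round up so the loop always progresses
        if s[mid - 1] <= 0:
            lo = mid
        else:
            hi = mid - 1
    return lo
```

Because only signs are checked, the routine is insensitive to an inexact oracle that preserves the sign of every nonzero entry, which is exactly the robustness property exploited in assertion (ii).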

B. Efficiency of Binary Search
We adopt the conventions of Schrijver [1998, § 2.1] to measure the size of a computational problem, which is needed to reason about the problem's complexity. Specifically, the size of a scalar ∈ R with = / for relatively prime ∈ Z and ∈ N is defined as size( ) = 1 + ⌈log 2 (| | + 1)⌉ + ⌈log 2 ( + 1)⌉, where we reserve one bit to encode the sign of . The size of a real vector is defined as the sum of the sizes of its components plus its dimension. Thus, the input size of an instance ∈ R + and ∈ R + of the knapsack problem described in Lemma 2.3 amounts to size( , ) = + 1 + ∑︁ =1 size( ) + size( ).
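The size convention can be made concrete with a small helper; the function names are ours, while the formulas follow Schrijver's definitions above.

```python
import math

def scalar_size(p, q):
    """Bit size of the rational p/q (assumed to be in lowest terms, q >= 1):
    one sign bit plus the encoding lengths of |p| and q."""
    return 1 + math.ceil(math.log2(abs(p) + 1)) + math.ceil(math.log2(q + 1))

def vector_size(components):
    # size of a vector = sum of the component sizes plus its dimension
    return len(components) + sum(scalar_size(p, q) for p, q in components)
```

For instance, the rational 3/2 has size 1 + ⌈log2 4⌉ + ⌈log2 3⌉ = 5.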
Using a similar reasoning, we find

The vector ⋆ constructed in this way constitutes an optimal solution for the maximization problem that defines ( , ). Evaluating the objective function value of ⋆ in this problem finally confirms that the smooth -transform coincides with the log-partition function (20).

D. The Sparse Maximum Function
The following proposition, which is a simple extension of [Martins and Astudillo, 2016, Proposition 1], suggests that the solution of (35) can be computed by a simple sorting algorithm.  A simple reordering, dividing both sides of the above inequality by ∑︀ =1 ( ) , and using the definition of ⋆ then yields ( +1) ≤ ⋆ . In addition, by the definition of the permutation , we have ( ) ≤ ( +1) for all > . Hence, we conclude that ( ) ≤ ⋆ for all > . One can then show that
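The sorting procedure of [Martins and Astudillo, 2016] computes the Euclidean projection onto the probability simplex in closed form. A minimal sketch of that algorithm (a sketch of the cited procedure, not of the paper's exact notation):

```python
def sparsemax(z):
    """Euclidean projection of z onto the probability simplex via sorting
    (Martins and Astudillo, 2016): find the support size k as the largest j
    with 1 + j * z_(j) > sum of the j largest entries, then threshold."""
    zs = sorted(z, reverse=True)
    cumsum, k, k_cumsum = 0.0, 0, 0.0
    for j, v in enumerate(zs, start=1):
        cumsum += v
        if 1 + j * v > cumsum:     # the condition holds exactly for j <= k
            k, k_cumsum = j, cumsum
    tau = (k_cumsum - 1.0) / k     # the common threshold
    return [max(v - tau, 0.0) for v in z]
```

Unlike the softmax, this projection can return exact zeros, so the resulting choice probabilities are typically sparse.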