## 1 Introduction

Optimal transport and Wasserstein metrics are nowadays among the major tools for analyzing complex data. Theoretical advances in the last decades characterize existence, uniqueness, representation and smoothness properties of optimal transport plans in a variety of different settings. Recent algorithmic advances (Peyré and Cuturi 2018) make it possible to compute exact transport plans and Wasserstein distances between discrete measures on regular grids of tens of thousands of support points, see e.g. Schmitzer (2016, Sect. 6), and to approximate such distances (to some extent) on larger and/or irregular structures, see Altschuler et al. (2017) and references therein. The development of new methodology for data analysis based on optimal transport is a booming research topic in statistics and machine learning, see e.g. Sommerfeld and Munk (2018), Schmitz et al. (2018), Arjovsky et al. (2017), Genevay et al. (2018), and Flamary et al. (2018). Applications are abundant throughout all of the applied sciences, including biomedical sciences (e.g. microscopy or tomography images; Basua et al. 2014, Gramfort et al. 2015), geography (e.g. remote sensing; Courty et al. 2016, Guo et al. 2017), and computer science (e.g. image processing and computer graphics; Nicolas 2016, Solomon et al. 2015). In brief: whenever data of a sufficiently complex structure that can be thought of as a mass distribution is available, optimal transport offers an effective, intuitively reasonable and robust tool for analysis.

More formally, for measures $$\mu$$ and $$\nu$$ on $$\mathbb {R}^d$$ with $$\mu (\mathbb {R}^d)=\nu (\mathbb {R}^d) < \infty$$ the Wasserstein distance of order $$p \ge 1$$ is defined as

\begin{aligned} W_p(\mu ,\nu ) = \biggl ( \min _{\pi } \int _{\mathbb {R}^d \times \mathbb {R}^d} \Vert x-y\Vert ^p \; \pi (dx,dy) \biggr )^{1/p}, \end{aligned}
(1)

where the minimum is taken over all transport plans (couplings) $$\pi$$ between $$\mu$$ and $$\nu$$, i.e. measures $$\pi$$ on $$\mathbb {R}^d \times \mathbb {R}^d$$ with marginals

\begin{aligned} \pi (A \times \mathbb {R}^d) = \mu (A) \quad \text {and} \quad \pi (\mathbb {R}^d \times A) = \nu (A) \end{aligned}

for every Borel set $$A \subset \mathbb {R}^d$$. The minimum exists by Villani (2009, Theorem 4.1) and it is readily verified, see e.g. Villani (2009, after Example 6.3), that the map $$W_p$$ is a $$[0,\infty ]$$-valued metric on the space of measures with fixed finite mass. The constraint linear minimization problem (1) is known as Monge–Kantorovich problem (Kantorovich 1942; Villani 2009). From an intuitive point of view, a minimizing $$\pi$$ describes how the mass of $$\mu$$ is to be associated with the mass of $$\nu$$ in order to make the overall transport cost minimal.

A transport map from $$\mu$$ to $$\nu$$ is a measurable map $$T :\mathbb {R}^d \rightarrow \mathbb {R}^d$$ satisfying $$T_{\#}\mu =\nu$$, where $$T_{\#}$$ denotes the push-forward, i.e. $$(T_{\#} \mu ) (A) = \mu (T^{-1}(A))$$ for every Borel set $$A \subset \mathbb {R}^d$$. We say that T induces the coupling $$\pi =\pi _T$$ if

\begin{aligned} \pi _T(A \times B) = \mu (A \cap T^{-1}(B)) \end{aligned}

for all Borel sets $$A,B \subset \mathbb {R}^d$$, and call the coupling $$\pi$$deterministic in that case. It is easily seen that the support of $$\pi _T$$ is contained in the graph of T. Intuitively speaking, we associate with each location in the domain of the measure $$\mu$$ exactly one location in the domain of the measure $$\nu$$ to which positive mass is moved, i.e. the mass of $$\mu$$ is not split.

The generally more difficult (non-linear) problem of finding (the p-th root of)

\begin{aligned} \inf _{T} \int _{\mathbb {R}^d} \Vert x-T(x)\Vert ^p \; \mu (dx) = \inf _{T} \int _{\mathbb {R}^d \times \mathbb {R}^d} \Vert x-y\Vert ^p \; \pi _T(dx,dy), \end{aligned}
(2)

where the infima are taken over all transport maps T from $$\mu$$ to $$\nu$$ (and are in general not attained) is known as Monge’s problem (Monge 1781; Villani 2009).

In practical applications, based on discrete measurement and/or storage procedures, we often face discrete measures $$\mu = \sum _{i=1}^m \mu _i \delta _{x_i}$$ and $$\nu = \sum _{j=1}^n \nu _j \delta _{y_j}$$, where $$\{x_1,\ldots ,x_m\}$$, $$\{y_1,\ldots ,y_n\}$$ are finite collections of support points, e.g. grids of pixel centers in a grayscale image. The Monge–Kantorovich problem (1) is then simply the discrete transport problem from classical linear programming (Luenberger and Ye 2008):

\begin{aligned} W_p(\mu ,\nu ) = \biggl (\min _{(\pi _{ij})} \,\sum _{i=1}^m \sum _{j=1}^n d_{ij} \pi _{ij} \biggr )^{1/p}, \end{aligned}
(3)

where $$d_{ij} = \Vert x_i-y_j\Vert ^p$$ and any measure $$\pi = \sum _{i=1}^m \sum _{j=1}^n \pi _{ij} \delta _{(x_i,y_j)}$$ is represented by the $$m \times n$$ matrix $$(\pi _{ij})_{i,j}$$ with nonnegative entries $$\pi _{ij}$$ satisfying

\begin{aligned} \sum _{j=1}^n \pi _{ij} = \mu _i \text { for } 1 \le i \le m \quad \text {and} \quad \sum _{i=1}^m \pi _{ij} = \nu _j \text { for } 1 \le j \le n. \end{aligned}

Due to the sheer size of m and n in typical applications this is still computationally a very challenging problem; we have e.g. $$m=n=10^6$$ for $$1000 \times 1000$$ grayscale images, which is far beyond the performance of a standard transportation simplex or primal-dual algorithm. Recently many dedicated algorithms have been developed, such as (Schmitzer 2016), which can give enormous speed-ups mainly if $$p=2$$ and can compute exact solutions for discrete transportation problems with $$10^5$$ support points in seconds to a few minutes, but still cannot deal with $$10^6$$ or more points. Approximative solutions can be computed for this order of magnitude and $$p=2$$ by variants of the celebrated Sinkhorn algorithm (Cuturi 2013; Schmitzer 2019; Altschuler et al. 2017), but it has been observed that these approximations have their limitations (Schmitzer 2019; Klatt et al. 2019).

The main advantage of using $$p=2$$ is that we can decompose the cost function as $$\Vert x-y\Vert ^2 = \Vert x\Vert ^2 + \Vert y\Vert ^2 - 2x^\top y$$ and hence formulate the Monge–Kantorovich problem equivalently as $$\max _{\pi } \int _{\mathbb {R}^d \times \mathbb {R}^d} x^{\top } y \; \pi (dx,dy)$$. For the discrete problem (3) this decomposition is used in Schmitzer (2016) to construct particularly simple so-called shielding neighborhoods. But also if one or both of $$\mu$$ and $$\nu$$ are assumed absolutely continuous with respect to Lebesgue measure, this decomposition for $$p=2$$ has clear computational advantages. For example if the measures $$\mu$$ and $$\nu$$ are assumed to have densities f and g, respectively, the celebrated Brenier’s theorem, which yields an optimal transport map that is the gradient of a convex function u (McCann 1995), allows to solve Monge’s problem by finding a numerical solution u to the Monge-Ampère equation $$\det (D^2 u(x)) = f(x) \big / g(\nabla u(x))$$; see Santambrogio (2015, Sect. 6.3) and the references given there.

In the rest of this article we focus on the semi-discrete setting, meaning here that the measure $$\mu$$ is absolutely continuous with respect to Lebesgue measure and the measure $$\nu$$ has finite support. This terminology was recently used in Wolansky (2015), Kitagawa et al. (2019), Genevay et al. (2016) and Bourne et al. (2018) among others. In the semi-discrete setting we can represent a solution to Monge’s problem as a partition of $$\mathbb {R}^d$$, where each cell is the pre-image of a support point of $$\nu$$ under the optimal transport map. We refer to such a partition as optimal transport partition.

In the case $$p=2$$ this setting is well studied. It was shown in Aurenhammer et al. (1998) that an optimal transport partition always exists, is essentially unique, and takes the form of a Laguerre tessellation, a.k.a. power diagram. The authors proved further that the right tessellation can be found numerically by solving a (typically high dimensional) unconstrained convex optimization problem. Since Laguerre tessellations are composed of convex polytopes, the evaluation of the objective function can be done very precisely and efficiently. Mérigot (2011) elaborates details of this algorithm and combines it with a powerful multiscale idea. In Kitagawa et al. (2019) a damped Newton algorithm is presented for the same objective function and the authors are able to show convergence with optimal rates.

In this article we present the corresponding theory for the case $$p=1$$. It is shown in Sect. 2.3 of Crippa et al. (2009) and independently in Geiß et al. (2013), which both treat more general cost functions, that an optimal transport partition always exists, is essentially unique and takes the form of a weighted Voronoi tessellation, or more precisely an Apollonius diagram. We extend this result somewhat within the case $$p=1$$ in Theorems 1 and 2 below. We prove then in Theorem 3 that the right tessellation can be found by optimizing an objective function corresponding to that from the case $$p=2$$. Since the cell boundaries in an Apollonius diagram in 2d are segments of hyperbolas, computations are more involved and we use a new strategy for computing integrals over cells and for performing line search in the optimization method. Details of the algorithm are given in Sect. 4 and the complete implementation can be downloaded from GithubFootnote 1 and is included in the latest version of the transport-package (Schuhmacher et al. 2019) for the statistical computing environment R (R Core Team 2017). Up to Sect. 4 the present paper is a condensed version of the thesis (Hartmann 2016), to which we refer from time to time for more details. In the remainder we evaluate the performance of our algorithm on a set of test cases (Sect. 5), give a number of applications (Sect. 6), and provide a discussion and open questions for further research (Sect. 7).

At the time of finishing the present paper, it has come to our attention that Theorem 2.1 of Kitagawa et al. (2019), which is for very general cost funtions including the Euclidean distance (although the remainder of the paper is not), has a rather large overlap with our Theorem 3. Within the case of Euclidean cost it assumes somewhat stronger conditons than our Theorem 3, namely a compact domain $$\mathscr {X}$$ and a bounded density for $$\mu$$. In addition the statement is somewhat weaker as it does not contain our statement (c). We also believe that due to the simpler setting of $$p=1$$ our proof is accessible to a wider audience and it is more clearly visible that the additional restrictions on $$\mathscr {X}$$ and $$\mu$$ are in fact not needed.

We end this introduction by providing some first motivation for studying the semi-discrete setting for $$p=1$$. This will be further substantiated in the application Sect. 6.

### 1.1 Why semi-discrete?

The semi-discrete setting appears naturally in problems of allocating a continuously distributed resource to a finite number of sites. Suppose for example that a fast-food chain introduces a home delivery service. Based on a density map of expected orders (the “resource”), the management would like to establish delivery zones for each branch (the “sites”). We assume that each branch has a fixed capacity (at least in the short run), that the overall capacity matches the total number of orders (peak time scenario), and that the branches are not too densely distributed, so that the Euclidean distance is actually a reasonable approximation to the actual travel distance; see Boscoe et al. (2012). We take up this example in Sect. 6.2. A somewhat different model that adds waiting time costs to the distance-based costs instead of using capacity constraints was studied theoretically in Crippa et al. (2009).

An important general class that builds on resource allocation are location-allocation problems: where to position a number of sites (branches, service stations, etc.) in such a way that the sum of the resource allocation cost plus maybe further costs for installation, maintainance and waiting times is minimized, possibly under capacity and/or further constraints. See e.g. Mallozzi et al. (2019) for a flexible model, which was algorithmically solved via discretizing the continuous domain. Positioning of sites can also be competitive, involving different agents (firms), such as in Núñez and Scarsini (2016).

A special case of location-allocation is the quantization problem, which consists in finding positions and capacities of sites that minimize the resulting resource allocation cost. See Bourne et al. (2018, Sect. 4) for a recent discussion using incomplete transport and $$p=2$$.

As a further application we propose in Sect. 6.3 optimal transport partitions as a simple visual tool for investigating local deviations from a continuous probability distribution based on a finite sample.

Since the computation of the semi-discrete optimal transport is linear in the resolution at which we consider the continuous measure (for computational purposes), it can also be attractive to use the semi-discrete setting as an approximation of either the fully continuous setting (if $$\nu$$ is sufficiently simple) or the fully discrete setting (if $$\mu$$ has a large number of support points). This will be further discussed in Sect. 2.

### 1.2 Why $$p=1$$?

The following discussion highlights some of the strengths of optimal transport based on an unsquared Euclidean distance ($$p=1$$), especially in the semi-discrete setting, and contrasts $$p=1$$ with $$p=2$$.

From a computational point of view the case $$p=2$$ can often be treated more efficiently, mainly due to the earlier mentioned decomposability, leading e.g. to the algorithms in Schmitzer (2016) in the discrete and Aurenhammer et al. (1998), Mérigot (2011) in the semi-discrete setting. The case $$p=1$$ has the advantage that the Monge–Kantorovich problem has a particularly simple dual (Villani 2009, Particular Case 5.16), which is equivalent to Beckmann’s problem (Beckmann 1952; Santambrogio 2015, Theorem 4.6). If we discretize the measures (if necessary) to a common mesh of n points, the latter is an optimization problem in n variables rather than the $$n^2$$ variables needed for the general discrete transport formulation (3). Algorithms that make use of this reduction have been described in  Solomon et al. (2014) (for general discrete surfaces) and in Schmitzer and Wirth (2019, Sect. 4) (for general incomplete transport), but their performance in a standard situation, e.g. complete optimal transport on a regular grid in $$\mathbb {R}^d$$, remains unclear. In particular we are not aware of any performance comparisons between $$p=1$$ and $$p=2$$.

In the present paper we do not make use of this reduction, but keep the source measure $$\mu$$ truly continuous except for an integral approximation that we perform for numerical purposes. We describe an algorithm for the semi-discrete problem with $$p=1$$ that is reasonably fast, but cannot quite reach the performance of the algorithm for $$p=2$$ in Mérigot (2011). This is again mainly due to the nice decomposition property of the cost function for $$p=2$$ or, more blatantly, the fact that we minimize for $$p=2$$ over partitions formed by line rather than hyperbola segments.

From an intuitive point of view $$p=1$$ and $$p=2$$ have both nice interpretations and depending on the application setting either the one or the other may be more justified. The difference is between thinking in terms of transportation logistics or in terms of fluid mechanics. If $$p=1$$, the optimal transport plan minimizes the cumulative distance by which mass is transported. This is (up to a factor that would not change the transport plan) the natural cost in the absence of fixed costs or any other savings on long-distance transportation. If $$p=2$$, the optimal transport plan is determined by a pressureless potential flow from $$\mu$$ to $$\nu$$ as seen from the kinetic energy minimization formulation of Benamou and Brenier (2000), Villani (2009, Chapter 7).

The different behaviors in the two cases can be illustrated by the discrete toy example in Fig. 1. Each point along the incomplete circle denotes the location of one unit of mass of $$\mu$$ (blue x-points) and/or $$\nu$$ (red o-points). The unique solution for $$p=1$$ moves one unit of mass from one end of the circular structur to the other. This is how we would go about carrying boxes around to get from the blue scenario to the red scenario. The unique solution for $$p=2$$ on the other hand is to transport each unit a tiny bit further to the next one, corresponding to a (discretized) flow along the circle. It is straightforward to adapt this toy example for the semi-discrete or the continuous setting. A more complex semi-discrete example is given in Sect. 6.1.

One argument in favour of the metric $$W_1$$ is its nice invariant properties that are not shared by the other $$W_p$$. In particular, considering finite measures $$\mu ,\nu ,\alpha$$ on $$\mathbb {R}^d$$ satisfying $$\mu (\mathbb {R}^d) = \nu (\mathbb {R}^d)$$, $$p \ge 1$$ and $$c > 0$$, we have

\begin{aligned} W_1(\alpha + \mu , \alpha + \nu )&= W_1(\mu , \nu ), \end{aligned}
(4)
\begin{aligned} W_1(c \mu , c \nu )&= c W_1(\mu , \nu ). \end{aligned}
(5)

The first result is in general not true for any other p, the second result holds with a factor $$c^{1/p}$$ on the right hand side. We prove these statements in the appendix. These invariance properties have important implications for image analysis, where it is quite common to adjust for differring levels of brightness (in grayscale images) by affine transformations. While the above equalities show that it is safe to do so for $$p=1$$, it may change the resulting Wasserstein distance and the optimal transport plan dramatically for other p; see Appendix and Sect. 6.1.

It is sometimes considered problematic that optimal transport plans for $$p=1$$ are in general not unique. But this is not so in the semi-discrete case, as we will see in Sect. 2: the minimal transport cost in (1) is realized by a unique coupling $$\pi$$, which is always deterministic. The same is true for $$p=2$$. A major difference in the case $$p=1$$ is that for $$d>1$$ each cell of the optimal transport partition contains the support point of the target measure $$\nu$$ that it assigns its mass to. This can be seen as a consequence of cyclical monotonicity (Villani 2009, beginning of Chapter 8). In contrast, for $$p=2$$, optimal transport cells can be separated by many other cells from their support points, which can make the resulting partition hard to interpret without drawing corresponding arrows for the assignment; see the bottom panels of Fig. 5. For this reason we prefer to use $$p=1$$ for the goodness-of-fit partitions considered in Sect. 6.3.

## 2 Semi-discrete optimal transport

We first concretize the semi-discrete setting and introduce some additional notation. Let now $$\mathscr {X}$$ and $$\mathscr {Y}$$ be Borel subsets of $$\mathbb {R}^d$$ and let $$\mu$$ and $$\nu$$ be probability measures on $$\mathscr {X}$$ and $$\mathscr {Y}$$, respectively. This is just for notational convenience and does not change the set of admissible measures in an essential way: we may always set $$\mathscr {X}=\mathscr {Y}=\mathbb {R}^d$$ and any statement about $$\mu$$ and $$\nu$$ we make can be easily recovered for $$c\mu$$ and $$c\nu$$ for arbitrary $$c>0$$.

For the rest of the article it is tacitly assumed that $$d \ge 2$$ to avoid certain pathologies of the one-dimensional case that would lead to a somewhat tedious distinction of cases in various results for a case that is well-understood anyway. Moreover, we always require $$\mu$$ to be absolutely continuous with density $$\varrho$$ with respect to d-dimensional Lebesgue measure $$\mathrm {Leb}^d$$ and to satisfy

\begin{aligned} \int _{\mathscr {X}} \Vert x\Vert \; \mu (dx) < \infty . \end{aligned}
(6)

We assume further that $$\nu = \sum _{j=1}^n \nu _j \delta _{y_j}$$, where $$n \in \mathbb {N}$$, $$y_1, \ldots , y_n \in \mathscr {Y}$$ and $$\nu _1, \ldots \nu _n \in (0,1]$$. Condition (6) guarantees that

\begin{aligned} W_1(\mu ,\nu ) \le \int _{\mathscr {X}} \Vert x\Vert \; \mu (dx) + \int _{\mathscr {Y}} \Vert y\Vert \; \nu (dy) =: C < \infty , \end{aligned}
(7)

which simplifies certain arguments.

The set of Borel subsets of $$\mathscr {X}$$ is denoted by $$\mathscr {B}_{\mathscr {X}}$$. Lebesgue mass is denoted by absolute value bars, i.e. $$|A| = \mathrm {Leb}^d(A)$$ for every $$A \in \mathscr {B}_{\mathscr {X}}$$.

We call a partition $$\mathfrak {C}= (C_j)_{1 \le j \le n}$$ of $$\mathscr {X}$$ into Borel sets satisfying $$\mu (C_j) = \nu _j$$ for every j a transport partition from $$\mu$$ to $$\nu$$. Any such partition characterizes a transport map T from $$\mu$$ to $$\nu$$, where we set $$T_{\mathfrak {C}}(x) = \sum _{j=1}^n y_j 1\{x \in C_j\}$$ for a given transport partition $$\mathfrak {C}= (C_j)_{1 \le j \le n}$$ and $$\mathfrak {C}_T = (T^{-1}(y_j))_{1 \le j \le n}$$ for a given transport map T. Monge’s problem for $$p=1$$ can then be equivalently formulated as finding

\begin{aligned} \inf _{\mathfrak {C}} \int _{\mathscr {X}} \Vert x-T_{\mathfrak {C}}(x)\Vert \; \mu (dx) = \inf _{\mathfrak {C}} \sum _{j=1}^n \int _{C_j} \Vert x-y_j\Vert \; \mu (dx), \end{aligned}
(8)

where the infima are taken over all transport partitions $$\mathfrak {C}= (C_j)_{1 \le j \le n}$$ from $$\mu$$ to $$\nu$$. Contrary to the difficulties encountered for more general measures $$\mu$$ and $$\nu$$ when considering Monge’s problem with Euclidean costs, we can give a clear-cut existence and uniqueness theorem in the semi-discrete case, without any further restrictions.

### Theorem 1

In the semi-discrete setting with Euclidean costs (always including $$d\ge 2$$ and (6)) there is a $$\mu$$-a.e. unique solution $$T_*$$ to Monge’s problem. The induced coupling $$\pi _{T_*}$$ is the unique solution to the Monge–Kantorovich problem, yielding

\begin{aligned} W_1(\mu ,\nu ) = \int _{\mathscr {X}} \Vert x-T_{*}(x)\Vert \; \mu (dx). \end{aligned}
(9)

### Proof

The part concerning Monge’s problem is a consequence of the concrete construction in Sect. 3; see Theorem 2.

Clearly $$\pi _{T_*}$$ is an admissible transport plan for the Monge–Kantorovich problem. Since $$\mu$$ is non-atomic and the Euclidean cost function is continuous, Theorem B in Pratelli (2007) implies that the minimum in the Monge–Kantorovich problem is equal to the infimum in the Monge problem, so $$\pi _{T*}$$ must be optimal.

For the uniqueness of $$\pi _{T*}$$ in the Monge–Kantorovich problem, let $$\pi$$ be an arbitrary optimal transport plan. Define the measures $$\tilde{\pi }_i$$ on $$\mathscr {X}$$ by $$\tilde{\pi }_i(A){:=}\pi (A\times \{y_i\})$$ for all $$A\in \mathscr {B}_{\mathscr {X}}$$ and $$1 \le i \le n$$. Since $$\sum _i \pi _i = \mu$$, all $$\pi _i$$ are absolutely continuous with respect to $$\mathrm {Leb}^d$$ with densities $$\tilde{\rho }_i$$ satisfying $$\sum \tilde{\rho }_i = \rho$$. Set then $$S_i{:=}\lbrace x\in \mathscr {X}\vert \tilde{\rho }_i>0 \rbrace$$.

Assume first that there exist $$i,j \in \{1,\ldots ,n\}$$, $$i \ne j$$, such that $$|S_i\cap S_j|>0$$. Define $$H_{<}^{i,j}(q){:=}\lbrace x\in \mathscr {X}\vert \Vert x-y_i\Vert < \Vert x-y_j\Vert + q \rbrace$$ and $$H_{>}^{i,j}(q)$$, $$H_{=}^{i,j}(q)$$ analogously. There exists a $$q\in \mathbb {R}$$ for which both $$S_i \cap S_j \cap H_{<}^{i,j}(q)$$ and $$S_i \cap S_j \cap H_{>}^{i,j}(q)$$ have positive Lebesgue measure: choose $$q_1, q_2\in \mathbb {R}$$ such that $$|S_i \cap S_j \cap H_{<}^{i,j}(q_1)| > 0$$ and $$|S_i \cap S_j \cap H_{>}^{i,j}(q_2)| > 0$$; using binary search between $$q_1$$ and $$q_2$$, we find the desired q in finitely many steps, because otherwise there would have to exist a $$q_0$$ such that $$|S_i \cap S_j \cap H_{=}^{i,j}(q_0)| > 0$$, which is not possible. By the definition of $$S_i$$ and $$S_j$$ we thus have $$\alpha = \pi _i(S_i \cap S_j \cap H_{>}^{i,j}(q)) > 0$$ and $$\beta = \pi _j(S_i \cap S_j \cap H_{<}^{i,j}(q)) > 0$$. Switching i and j if necessary, we may assume $$\alpha \le \beta$$. Define then

\begin{aligned} \begin{aligned} \pi _i'&= \pi _i - \pi _i\vert _{S_i \cap S_j \cap H_{>}^{i,j}(q)} + \frac{\alpha }{\beta } \pi _j\vert _{S_i \cap S_j \cap H_{<}^{i,j}(q)}, \\ \pi _j'&= \pi _j + \pi _i\vert _{S_i \cap S_j \cap H_{>}^{i,j}(q)} - \frac{\alpha }{\beta } \pi _j\vert _{S_i \cap S_j \cap H_{<}^{i,j}(q)} \end{aligned} \end{aligned}

and $$\pi _k' = \pi _k$$ for $$k \not \in \{i,j\}$$. It can be checked immediately that the measure $$\pi '$$ given by $$\pi '(A \times \{y_i\}) = \pi '_i(A)$$ for all $$A \in \mathscr {B}_{\mathscr {X}}$$ and all $$i \in \{1,2,\ldots ,n\}$$ is a transport plan from $$\mu$$ to $$\nu$$ again. It satisfies

\begin{aligned} \begin{aligned} \int _{\mathscr {X}\times \mathscr {Y}} \Vert x-y\Vert&\; \pi '(dx, dy) - \int _{\mathscr {X}\times \mathscr {Y}} \Vert x-y\Vert \; \pi (dx, dy) \\&= \int _{S_i \cap S_j \cap H_{>}^{i,j}(q)} \bigl ( -\Vert x-y_i\Vert + \Vert x-y_j\Vert \bigr ) \; \pi _i(dx) \\&\quad + \frac{\alpha }{\beta } \int _{S_i \cap S_j \cap H_{<}^{i,j}(q)} \bigl ( \Vert x-y_i\Vert - \Vert x-y_j\Vert \bigr ) \; \pi _j(dx) \\&< 0, \end{aligned} \end{aligned}

because the integrands are strictly negative on the sets over which we integrate. But this contradicts the optimality of $$\pi$$.

We thus have proved that $$|S_i\cap S_j| = 0$$ for all pairs with $$i \ne j$$. This implies that we can define a transport map T inducing $$\pi$$ in the following way. If $$x\in S_i\setminus (\cup _{j\ne i} S_j)$$ for some i, set $$T(x){:=}y_i$$. Since the intersections $$S_i\cap S_j$$ are Lebesgue null sets, the value of T on them does not matter. So we can for example set $$T(x){:=}y_{1}$$ or $$T(x){:=}y_{i_0}$$ for $$x\in \bigcap _{i \in I} S_{i} \setminus \bigcap _{i \in I^c} S_{i}$$, where $$I \subset \{1,\ldots ,n\}$$ contains at least two elements and $$i_0 = \min (I)$$. It follows that $$\pi _T = \pi$$. But by the optimality of $$\pi$$ and Theorem 2 we obtain $$T=T_*$$$$\mu$$-almost surely, which implies $$\pi = \pi _T = \pi _{T_*}$$. $$\square$$

It will be desirable to know in what way we may approximate the continuous and discrete Monge–Kantorovich problems by the semi-discrete problem we investigate here.

In the fully continuous case, we have a measure $$\tilde{\nu }$$ on $$\mathscr {Y}$$ with density $$\tilde{\varrho }$$ with respect to $$\mathrm {Leb}^d$$ instead of the discrete measure $$\nu$$. In the fully discrete case, we have a discrete measure $$\tilde{\mu }= \sum _{i=1}^m \tilde{\mu }_i \delta _{x_i}$$ instead of the absolutely continuous measure $$\mu$$, where $$m \in \mathbb {N}$$, $$x_1, \ldots , x_m \in \mathscr {X}$$ and $$\tilde{\mu }_1, \ldots \tilde{\mu }_m \in (0,1]$$. In both cases existence of an optimal transport plan is still guaranteed by Villani (2009, Theorem 4.1), however we lose to some extent the uniqueness property.

One reason for this is that mass transported within the same line segment can be reassigned at no extra cost; see the discussion on transport rays in Sect. 6 of Ambrosio and Pratelli (2003). In the continuous case this is the only reason, and uniqueness can be restored by minimizing a secondary functional (e.g. total cost with respect to $$p>1$$) over all optimal transport plans; see Theorem 7.2 in Ambrosio and Pratelli (2003).

In the discrete case uniqueness depends strongly on the geometry of the support points of $$\tilde{\mu }$$ and $$\nu$$. In addition to collinearity of support points, equality of interpoint distances can also lead to non-unique solutions. While uniqueness can typically be achieved when the support points are in sufficiently general position, we are not aware of any precise result to this effect.

When approximating the continuous problem with measures $$\mu$$ and $$\tilde{\nu }$$ by a semi-discrete problem, we quantize the measure $$\tilde{\nu }$$ into a discrete measure $$\nu = \sum _{j=1}^n \nu _j \delta _{y_j}$$, where $$\nu _j = \tilde{\nu }(N_j)$$ for a partition $$(N_j)$$ of $${{\,\mathrm{supp}\,}}(\tilde{\nu })$$. The error we commit in Wasserstein distance by discretization of $$\tilde{\nu }$$ is bounded by the quantization error, i.e.

\begin{aligned} \bigl |W_1(\mu ,\tilde{\nu }) - W_1(\mu ,\nu )\bigr | \le W_1(\tilde{\nu },\nu ) \le \sum _{j=1}^n \int _{N_j} \Vert y-y_j\Vert \; \tilde{\nu }(dy). \end{aligned}
(10)

We can compute $$W_1(\tilde{\nu },\nu )$$ exactly by solving another semi-discrete transport problem, using the algorithm described in Sect. 4 to compute an optimal partition $$(N_j)$$ for the second inequality above. However, choosing $$\nu$$ for given n in such a way that $$W_1(\tilde{\nu },\nu )$$ is minimal is usually practically infeasible. So we would use an algorithm that makes $$W_1(\tilde{\nu },\nu )$$ reasonably small, such as a suitable version of Lloyd’s algorithm; see Sect. 4.1 below.

When approximating the discrete problem with measures $$\tilde{\mu }$$ and $$\nu$$ by a semi-discrete problem, we blur each mass $$\tilde{\mu }_i$$ of $$\tilde{\mu }$$ over a neighborhood of $$x_i$$ using a probability density $$f_i$$, to obtain a measure $$\mu$$ with density $$\varrho (x) = \sum _{i=1}^m \tilde{\mu }_i f_i(x)$$. Typical examples use $$f_i(x) = \frac{1}{h^d} \varphi \bigl (\frac{x-x_i}{h}\bigr )$$, where $$\varphi$$ is the standard normal density and the bandwidth $$h>0$$ is reasonably small, or $$f_i(x) = \frac{1}{|M_i|} \mathbb {1}_{M_i}(x)$$, where $$M_i$$ is some small neighborhood of $$x_i$$. In practice, discrete measures are often available in the form of images, where the support points $$x_i$$ form a fine rectangular grid; then the latter choice of $$f_i$$s is very natural, where the $$M_i$$s are just adjacent squares, each with an $$x_i$$ at the center. There are similar considerations for the approximation error as in the fully continuous case above. In particular the error we commit in Wasserstein distance is bounded by the blurring error

\begin{aligned} \bigl |W_1(\tilde{\mu },\nu ) - W_1(\mu ,\nu )\bigr | \le W_1(\tilde{\mu },\mu ) \le \sum _{i=1}^m \tilde{\mu }_i \int _{\mathbb {R}^d} \Vert x-x_i\Vert f_i(x) \; dx. \end{aligned}
(11)

The right hand side is typically straightforward to compute exactly, e.g. in the normal density and grid cases described above. It can be made small by choosing the bandwidth h very small or picking sets $$M_i$$ of small radius $$r = \sup _{x \in M_i} \Vert x-x_i\Vert$$.

What about the approximation properties of the optimal transport plans obtained by the semi-discrete setting? Theorem 5.20 in Villani (2009) implies for $$\nu ^{(k)} \rightarrow \tilde{\nu }$$ weakly and $$\mu ^{(k)} \rightarrow \tilde{\mu }$$ weakly that every subsequence of the sequence of optimal transport plans $$\pi ^{(k)}_*$$ between $$\mu ^{(k)}$$ and $$\nu ^{(k)}$$ has a further subsequence that converges weakly to an optimal transport plan $$\pi _*$$ between $$\mu$$ and $$\nu$$. This implies that for every $$\varepsilon >0$$ there is a $$k_0 \in \mathbb {N}$$ such that for any $$k \ge k_0$$ the plan $$\pi ^{(k)}$$ is within distance $$\varepsilon$$ (in any fixed metrization of the weak topology) of some optimal transport plan between $$\mu$$ and $$\nu$$, which is the best we could have hoped for in view of the non-uniqueness of optimal transport plans we have in general. If (in the discrete setting) there is a unique optimal transport plan $$\pi _*$$, this yields that $$\pi _*^{(k)} \rightarrow \pi _*$$ weakly.

## 3 Optimal transport maps via weighted Voronoi tessellations

As shown for bounded $$\mathscr {X}$$ in Geiß et al. (2013), the solution to the semi-discrete transport problem has a nice geometrical interpretation, which is similar to the well-known result in Aurenhammer et al. (1998): we elaborate below that the sets $$C^{*}_j$$ of the optimal transport partition are the cells of an additively weighted Voronoi tessellation of $$\mathscr {X}$$ around the support points of $$\nu$$.

For the finite set of points $$\{y_1, \dots , y_n\}$$ and a vector $$w\in \mathbb {R}^n$$ that assigns to each $$y_j$$ a weight $$w_j$$, the additively weighted Voronoi tessellation is the set of cells

\begin{aligned} {{\,\mathrm{Vor}\,}}_w(j)= & {} \lbrace x\in \mathscr {X}\vert \Vert x-y_j\Vert - w_j \\\le & {} \Vert x-y_k\Vert - w_k \quad \text {for all } k \ne j \rbrace , \quad j=1, \dots , n. \end{aligned}

Note that adjacent cells $${{\,\mathrm{Vor}\,}}_w(j)$$ and $${{\,\mathrm{Vor}\,}}_w(k)$$ have disjoint interiors. The intersection of their boundaries is a subset of $$H = \lbrace x\in \mathscr {X}\vert \Vert x-y_j\Vert - \Vert x-y_k\Vert = w_j - w_k \rbrace$$, which is easily seen to have Lebesgue measure (and hence $$\mu$$-measure) zero. If $$d=2$$, the set H is a branch of a hyperbola with foci at $$y_j$$ and $$y_k$$. It may also be interpreted as the set of points that have the same distance from the spheres $$S(y_j,w_j)$$ and $$S(y_k,w_k)$$, where $$S(y,w) = \lbrace x \in \mathscr {X}\vert \Vert x-y\Vert = w \rbrace$$. See Fig. 2 for an illustration of these properties.

Of course not all weighted Voronoi tessellations are valid transport partitions from $$\mu$$ to $$\nu$$. But suppose we can find a weight vector w such that the resulting Voronoi tessellation satisfies indeed $$\mu ({{\,\mathrm{Vor}\,}}_w(j)) = \nu _j$$ for every $$j \in \{1,\ldots ,n\}$$; we call such a w adapted to $$(\mu ,\nu )$$. Then this partition is automatically optimal.

### Theorem 2

If $$w\in \mathbb {R}^n$$ is adapted to $$(\mu , \nu )$$, then $$({{\,\mathrm{Vor}\,}}_w(j))_{1 \le j \le n}$$ is the $$\mu$$-almost surely unique optimal transport partition from $$\mu$$ to $$\nu$$.

A proof was given in Geiß et al. (2013), Theorem 2 for more general distance functions, but required $$\mathscr {X}$$ to be bounded. For the Euclidean distance we consider here, we can easily extend it to unbounded $$\mathscr {X}$$; see Hartmann (2016, Theorem 3.2).

Having identified this class of optimal transport partitions, it remains to show that for each pair $$(\mu , \nu )$$ we can find an adapted weight vector. We adapt the approach of Aurenhammer et al. (1998) to the case $$p=1$$, which gives us a constructive proof that forms the basis for the algorithm in Sect. 4. Our key tool is the function $$\Phi$$ defined below.

### Theorem 3

Let $$\Phi : \mathbb {R}^n \rightarrow \mathbb {R}$$,

\begin{aligned} \Phi (w) = \sum _{j=1}^n\left( -\nu _j w_j - \int _{{{\,\mathrm{Vor}\,}}_w(j)} \left( \Vert x - y_j\Vert - w_j\right) \; \mu (dx)\right) . \end{aligned}

Then

1. a.

$$\Phi$$ is convex;

2. b.

$$\Phi$$ is continuously differentiable with partial derivatives

\begin{aligned} \frac{\partial \Phi }{\partial w_j}(w) = -\nu _j+\mu ({{\,\mathrm{Vor}\,}}_w(j)); \end{aligned}
3. c.

$$\Phi$$ takes a minimum in $$\mathbb {R}^n$$.

### Remark 1

Let $$w^* \in \mathbb {R}^n$$ be a minimizer of $$\Phi$$. Then by Theorem 3b)

\begin{aligned} \mu ({{\,\mathrm{Vor}\,}}_{w^*}(j))-\nu _j = \frac{\partial \Phi }{\partial w_j}(w^*) = 0 \quad \text {for every}\,j \in \{1,\ldots ,n\}, \end{aligned}

i.e. $$w_{*}$$ is adapted to $$(\mu , \nu )$$. Theorem 2 yields that $$(Vor_{w^*}(j))_{1 \le j \le n}$$ is the $$\mu$$-almost surely unique optimal transport partition from $$\mu$$ to $$\nu$$.

### Proof (of Theorem 3)

We take a few shortcuts; for full technical details see Chapter 3 of Hartmann (2016).

Part a) relies on the observation that $$\Phi$$ can be written as

\begin{aligned} \Phi (w) = \sum _j(-\nu _j w_j) - \Psi (w) \end{aligned}

where

\begin{aligned} \Psi (w) = \int _{\mathscr {X}} (\Vert x - T^w(x)\Vert - w_{T^w(x)}) \; \mu (dx), \end{aligned}

$$T^w$$ denotes the transport map induced by the Voronoi tessellation with weight vector w and we write $$w_{y_j}$$ instead of $$w_j$$ for convenience. By definition of the weighted Voronoi tessellation $$\Psi$$ is the infimum of the affine functions

\begin{aligned} \Psi _f :\mathbb {R}^n \rightarrow \mathbb {R}, \ w \mapsto \int _{\mathscr {X}} (\Vert x - f(x)\Vert - w_{f(x)}) \; \mu (dx) \end{aligned}

over all measurable maps f from $$\mathscr {X}$$ to $$\mathscr {Y}$$. Since pointwise infima of affine functions are concave and the first summand of $$\Phi$$ is linear, we see that $$\Phi$$ is convex.

By geometric arguments it can be shown that $$[w \mapsto \mu ({{\,\mathrm{Vor}\,}}_w(j))]$$ is continuous; see Hartmann (2016, Lemma 3.3). A short computation involving the representation $$\Psi (w) = \inf _f \Psi _f(w)$$ used above yields for the difference quotient of $$\Psi$$, writing $$e_j$$ for the j-th standard basis vector and letting $$h \ne 0$$,

\begin{aligned} \biggl | \frac{\Psi (w+he_j)-\Psi (w)}{h} + \mu ({{\,\mathrm{Vor}\,}}_w(j)) \biggr | \le \bigl | -\mu ({{\,\mathrm{Vor}\,}}_{w+he_j}(j)) + \mu ({{\,\mathrm{Vor}\,}}_w(j)) \bigr | \longrightarrow 0 \end{aligned}

as $$h \rightarrow 0$$. This implies that $$\Psi$$ is differentiable with continuous j-th partial derivative $$-\mu ({{\,\mathrm{Vor}\,}}_w(j))$$ and hence statement b) follows.

Finally, for the existence of a minimizer of $$\Phi$$ we consider an arbitrary sequence $$(w^{(k)})_{k\in \mathbb {N}}$$ of weight vectors in $$\mathbb {R}^n$$ with

\begin{aligned} \lim _{k\rightarrow \infty } \Phi (w^{(k)}) = \inf _{w\in \mathbb {R}^n} \Phi (w). \end{aligned}

We show below that a suitably shifted version of $$(w^{(k)})_{k\in \mathbb {N}}$$ that has the same $$\Phi$$-values contains a bounded subsequence. This subsequence then has a further subsequence $$(u^{(k)})$$ which converges towards some $$u \in \mathbb {R}^n$$. Continuity of $$\Phi$$ yields

\begin{aligned} \Phi (u) = \lim _{k\rightarrow \infty } \Phi (u^{(k)}) = \inf _{w\in \mathbb {R}^n} \Phi (w) \end{aligned}

and thus statement c).

To obtain the bounded subsequence, note first that adding to each weight the same constant neither affects the Voronoi tessellation nor the value of $$\Phi$$. We may therefore assume $$w_j^{(k)} \ge 0$$, $$1 \le j \le n$$, for all $$k \in \mathbb {N}$$. Choosing an entry i and an infinite set $$K\subset \mathbb {N}$$ appropriately leaves us with a sequence $$(w^{(k)})_{k\in K}$$ satisfying $$w_i^{(k)} \ge w_j^{(k)}$$ for all j and k. Taking a further subsequence $$(w^{(l)})_{l \in L}$$ for some infinite $$L\subset K$$ allows the choice of an $$R \ge 0$$ and the partitioning of $$\{1,\dots ,n\}$$ into two sets A and B such that for every $$l \in L$$

1. i)

$$\displaystyle 0 \le w_i^{(l)} - w_j^{(l)} \le R \quad \text {if } j \in A,$$

2. ii)

$$\displaystyle w_i^{(l)} - w_j^{(l)} \ge {{\,\mathrm{index}\,}}(l) \quad \text {if } j \in B,$$

where $${{\,\mathrm{index}\,}}(l)$$ denotes the rank of l in L, in the sense that l is the $${{\,\mathrm{index}\,}}(l)$$-th smallest element of L.

Assume that $$B \ne \emptyset$$. The Voronoi cells with indices in B will at some point be shrunk to measure zero, meaning there exists an $$N \in L$$ such that

\begin{aligned} \sum _{j\in A} \mu \bigl ({{\,\mathrm{Vor}\,}}_{w^{(l)}}(j)\bigr ) = 1 \quad \text {for all } l \ge N. \end{aligned}

Write

\begin{aligned} \underline{w}_A^{(l)} = \min _{i \in A} w_i^{(l)} \quad \text {and} \quad \overline{w}_B^{(l)} = \max _{i \in B} w_i^{(l)}, \end{aligned}

and recall the constant C from (7), which may clearly serve as an upper bound for the transport cost under an arbitrary plan. We then obtain for every $$l \ge N$$

\begin{aligned} \begin{aligned} \Phi (w^{(l)})&= \sum _{j=1}^n \biggl ( -\nu _j w_j^{(l)} - \int _{{{\,\mathrm{Vor}\,}}_{w^{(l)}}(j)} \bigl ( \Vert x-y_j\Vert - w_j^{(l)} \bigr ) \; \mu (dx) \biggr ) \\&\ge -C + \sum _{j=1}^n w_j^{(l)} \Bigl ( \mu \bigl ({{\,\mathrm{Vor}\,}}_{w^{(l)}}(j)\bigr ) - \nu _j \Bigr ) \\&= -C + \sum _{j \in A} w_j^{(l)} \Bigl ( \mu \bigl ({{\,\mathrm{Vor}\,}}_{w^{(l)}}(j)\bigr ) - \nu _j \Bigr ) - \sum _{j \in B} w_j^{(l)} \nu _j \\&\ge -C-R + \underline{w}_A^{(l)} \biggl ( 1 - \sum _{j \in A} \nu _j \biggr ) - \overline{w}_B^{(l)} \sum _{j \in B} \nu _j \\&\ge -C-2R + {{\,\mathrm{index}\,}}(l), \end{aligned} \end{aligned}

which contradicts the statement $$\lim _{k\rightarrow \infty } \Phi (w^{(k)}) = \inf _{w\in \mathbb {R}^n} \Phi (w) < \infty$$. Thus we have $$B=\emptyset$$.

We can then simply turn $$(w^{(l)})_{l\in L}$$ into a bounded sequence by substracting the minimal entry $$\underline{w}^{(l)} = \min _{1 \le i \le n} w_i^{(l)}$$ from each $$w_j^{(l)}$$ for all $$l \in L$$. $$\square$$

## 4 The algorithm

The previous section provides the theory needed to compute the optimal transport partition. It is sufficient to find a vector $$w^*$$ where $$\Phi$$ is locally optimal. By convexity, $$w^*$$ is then a global minimizer of $$\Phi$$ and Remark 1 identifies the $$\mu$$-a.e. unique optimal transport partition as $$({{\,\mathrm{Vor}\,}}_{w^*}(j))_{1 \le j \le n}$$.

For the optimization process we can choose from a variety of methods thanks to knowing the gradient $$\nabla \Phi$$ of $$\Phi$$ analytically from Theorem 3. We consider iterative methods that start at an initial weight vector $$w^{(0)}$$ and apply steps of the form

\begin{aligned} w^{(k+1)} = w^{(k)} + t_k \Delta w^{(k)}, \quad k \ge 0, \end{aligned}

where $$\Delta w^{(k)}$$ denotes the search direction and $$t_k$$ the step size.

Newton’s method would use $$\Delta w^{(k)} = -\bigl ( D^2 \Phi (w^{(k)}) \bigr )^{-1} \nabla \Phi (w^{(k)})$$, but the Hessian matrix $$D^2 \Phi (w^{(k)})$$ is not available to us. We therefore use a quasi-Newton method that makes use of the gradient. Just like Mérigot (2011) for the case $$p=2$$, we have obtained many good results using L-BFGS (Nocedal 1980), the limited-memory variant of the Broyden–Fletcher–Goldfarb–Shanno algorithm, which uses the value of the gradient at the current as well as at preceding steps for approximating the Hessian. The limited-memory variant works without storing the whole Hessian of size $$n\times n$$, which is important since in applications our n is typically large.

To determine a suitable step size $$t_k$$ for L-BFGS, we use the Armijo rule (Armijo 1966), which has proven to be well suited for our problem. It considers different values for $$t_k$$ until it arrives at one that sufficiently decreases $$\Phi (w^{(k)})$$: the step size $$t_k$$ needs to fulfill $$\Phi (w^{(k)} + t_k \Delta w^{(k)}) \le \Phi (w^{(k)}) + c t_k \nabla \Phi (w^{(k)})^T \Delta w^{(k)}$$ for a small fixed c with $$0<c<1$$. We use the default value $$c=10^{-4}$$ of the L-BFGS library (Okazaki and Nocedal 2010) employed by our implementation, which is also given as an example by Nocedal and Wright (1999). An alternative that could be investigated is to use a non-monotone line search such as the one proposed in Grippo et al. (1986). There the above condition is relaxed by admitting a step whenever it sufficiently decreases a function value from one of the previous K iterations, for some $$K\ge 1$$. This might lead to fewer function evaluations and also to convergence in fewer steps. We also considered replacing the Armijo rule with the strong Wolfe conditions (1969, 1971) as done in Mérigot (2011), which contain an additional decrease requirement on the gradient. In our case, however, this requirement could often not be fulfilled because of the pixel splitting method used for computing the gradient (cf. Sect. 4.2), which made it less suited.

### 4.1 Multiscale approach to determine starting value

To find a good starting value $$w^{(0)}$$ we use a multiscale method similar to the one proposed in Mérigot (2011). We first create a decomposition of $$\nu$$, i.e. a sequence $$\nu = \nu ^{(0)}, \dots , \nu ^{(L)}$$ of measures with decreasing cardinality of the support. Here $$\nu ^{(l)}$$ is obtained as a coarsening of $$\nu ^{(l-1)}$$ by merging the masses of several points into one point.

It seems intuitively reasonable to choose $$\nu ^{(l)}$$ in such a way that $$W_1(\nu ^{(l)},\nu ^{(l-1)})$$ is as small as possible, since the latter bounds $$|W_1(\mu ,\nu ^{(l)}) - W_1(\mu ,\nu ^{(l-1)})|$$. This comes down to a capacitated location-allocation problem, which is NP-hard even in the one-dimensional case; see Sherali and Nordai (1988). Out of speed concerns and since we only need a reasonably good starting value for our algorithm, we decided to content ourselves with the same weighted K-means clustering algorithm used by Mérigot (2011) (referred to as Lloyd’s algorithm), which iteratively improves an initial aggregation of the support of $$\nu ^{(l-1)}$$ into $$|{{\,\mathrm{supp}\,}}(\nu ^{(l)})|$$ clusters towards local optimality with respect to the squared Euclidean distance. The resulting $$\nu ^{(l)}$$ is then the discrete measure with the cluster centers as its support points and as weights the summed up weights of the points of $$\nu ^{(l-1)}$$ contained in each cluster; see Algorithm 3 in Hartmann (2016). The corresponding weighted K-median clustering algorithm, based on alternating between assignment of points to clusters and recomputation of cluster centers as the median of all weighted points in the cluster, should intuitively give a $$\nu ^{(l)}$$ based on which we obtain a better starting solution. This may sometimes compensate for the much longer time needed for performing K-median clustering.

Having created the decomposition $$\nu = \nu ^{(0)}, \dots , \nu ^{(L)}$$, we minimize $$\Phi$$ along the sequence of these coarsened measures, beginning at $$\nu ^{(L)}$$ with the initial weight vector $$w^{(L,0)} = 0\in \mathbb {R}^{|{{\,\mathrm{supp}\,}}(\nu ^{(L)})|}$$ and computing the optimal weight vector $$w^{(L,*)}$$ for the transport from $$\mu$$ to $$\nu ^{(L)}$$. Every time we pass to a finer measure $$\nu ^{(l-1)}$$ from the coarser measure $$\nu ^{(l)}$$, we generate the initial weight vector $$w^{(l-1,0)}$$ from the last optimal weight vector $$w^{(l,*)}$$ by assigning the weight of each support point of $$\nu ^{(l)}$$ to all the support points of $$\nu ^{(l-1)}$$ from whose merging the point of $$\nu ^{(l)}$$ originated; see also Algorithm 2 in Hartmann (2016).

### 4.2 Numerical computation of $$\Phi$$ and $$\nabla \Phi$$

For practical computation we assume here that $$\mathscr {X}$$ is a bounded rectangle in $$\mathbb {R}^2$$ and that the density of the measure $$\mu$$ is of the form

\begin{aligned} \varrho (x) = \sum _{i \in I} a_{i} \mathbb {1}_{Q_{i}}(x) \end{aligned}

for $$x \in \mathscr {X}$$, where we assume that I is a finite index set and $$(Q_i)_{i \in I}$$ is a partition of the domain $$\mathscr {X}$$ into (small) squares, called pixels, of equal side length. This is natural if $$\varrho$$ is given as a grayscale image and we would then typically index the pixels $$Q_i$$ by their centers $$i \in I \subset \mathbb {Z}^2$$. It may also serve as an approximation for arbitrary $$\varrho$$. It is however easy enough to adapt the following considerations to more general (not necessarily congruent) tiles and to obtain better approximations if the function $$\varrho$$ is specified more generally than piecewise constant.

The optimization procedure requires the non-trivial evaluation of $$\Phi$$ at a given weight vector w. This includes the integration over Voronoi cells and therefore the construction of a weighted Voronoi diagram. The latter task is solved by the package 2D Apollonius Graphs as part of the Computational Geometry Algorithms Library (CGAL 2015). The integrals we need to compute are

\begin{aligned} \int _{{{\,\mathrm{Vor}\,}}_w(j)} \rho (x) \; dx \quad \text {and}\quad \int _{{{\,\mathrm{Vor}\,}}_w(j)} \Vert x - y_j\Vert \rho (x) \; dx. \end{aligned}

By definition the boundary of a Voronoi cell $${{\,\mathrm{Vor}\,}}_w(j)$$ is made up of hyperbola segments, each between $$y_j$$ and one of the other support points of $$\nu$$. The integration could be performed by drawing lines from $$y_j$$ to the end points of those segments and integrating over the resulting triangle-shaped areas separately. This would be executed by applying an affinely-linear transformation that moves the hyperbola segment onto the hyperbola $$y=1/x$$ to both the area and the function we want to integrate. The required transformation can be found in Hartmann (2016, Sect. 5.6).

However, we take a somewhat more crude, but also more efficient path here, because it is a quite time-consuming task to decide which pixels intersect which weighted Voronoi cells and then to compute the (areas of the) intersections. We therefore approximate the intersections by splitting the pixels into a quadratic number of subpixels (unless the former are already very small) and assuming that each of them is completely contained in the Voronoi cell in which its center lies. This reduces the problem from computing intersections to determining the corresponding cell for each center, which the data structure used for storing the Voronoi diagram enables us to do in roughly $$\mathscr {O}(\log n)$$ time; see Karavelas and Yvinec (2002). The operation can be performed even more efficiently: when considering a subpixel other than the very first one, we already know the cell that one of the neighboring subpixel’s center belongs to. Hence, we can begin our search at this cell, which is either already the cell we are looking for or lies very close to it.

The downside of this approximation is that it can make the L-BFGS algorithm follow search directions along which the value of $$\Phi$$ cannot be sufficiently decreased even though there exist different directions that allow a decrease. This usually only happens near a minimizing weight vector and can therefore be controlled by choosing a not too strict stopping criterion for a given subpixel resolution, see the next subsection.

### 4.3 Our implementation

Implementing the algorithm described in this section requires two technical choices: the number of subpixels every pixel is being split into and the stopping criterion for the minimization of $$\Phi$$. We found that choosing the number of subpixels to be the smallest square number such that their total number is larger than or equal to 1000n gives a good compromise between performance and precision.

The stopping criterion is implemented as follows: we terminate the optimization process once $$\Vert \nabla \Phi (w)\Vert _1/2 \le \varepsilon$$ for some $$\varepsilon > 0$$. Due to Theorem 3b) this criterion yields an intuitive interpretation: $$\Vert \nabla \Phi (w)\Vert _1/2$$ is the amount of mass that is being mistransported, i.e. the total amount of mass missing or being in surplus at some $$\nu$$-location $$y_i$$ when transporting according to the current tessellation. In our experience this mass is typically rather proportionally distributed among the different cells and tends to be assigned in a close neighborhood of the correct cell rather than far away. So even with somewhat larger $$\varepsilon$$, the computed Wasserstein distance and the overall visual impression of the optimal transport partition remain mostly the same. In the numerical examples in Sects. 5 and 6 we choose the value of $$\varepsilon = 0.05$$.

We implemented the algorithm in C++ and make it available on GitHubFootnote 2 under the MIT license. Our implementation uses libLBFGS (Okazaki and Nocedal 2010) for the L-BFGS procedure and the geometry library CGAL (CGAL 2015) for the construction and querying of weighted Voronoi tessellations. The repository also contains a Matlab script to visualize such tessellations. Our implementation is also included in the latest version of the transport-package (Schuhmacher et al. 2019) for the statistical computing environment R (R Core Team 2017).

## 5 Performance evaluation

We evaluated the performance of our algorithm by randomly generating measures $$\mu$$ and $$\nu$$ with varying features and computing the optimal transport partitions between them. The measure $$\mu$$ was generated by simulating its density $$\varrho$$ as a Gaussian random field with Matérn covariance function on the rectangle $$[0,1] \times [0,0.75]$$, applying a quadratic function and normalizing the result to a probability density. Corresponding images were produced at resolution $$256\times 196$$ pixels and were further divided into 25 subpixels each to compute integrals over Voronoi cells. In addition to a variance parameter, which we kept fixed, the Matérn covariance function has parameters for the scale $$\gamma$$ of the correlations, which we varied among 0.05, 0.15 and 0.5, and the smoothness s of the generated surface, which we varied between 0.5 and 2.5 corresponding to a continuous surface and a $$C^2$$-surface, respectively. The simulation mechanism is similar to the ones for classes 2–5 in the benchmark DOTmark proposed in Schrieber et al. (2017), but allows to investigate the influence of individual parameters more directly. Figure 3 shows one realization for each parameter combination. For the performance evaluation we generated 10 realizations each.

The measures $$\nu$$ have n support points generated uniformly at random on $$[0,1] \times [0,0.75]$$, where we used $$n=250$$ and $$n=1000$$. We then assigned either mass 1 or mass $$\varrho (x)$$ to each point x and normalized to obtain probability measures. We generated 20 independent $$\nu$$-measures of the first kind (unit mass) and computed the optimal transport from each of the $$10 \times 6$$ $$\mu$$-measures for each of the 6 parameter combinations. We further generated for each of the $$10 \times 6$$ $$\mu$$-measures 20 corresponding $$\nu$$-measures of the second kind (masses from $$\mu$$) and computed again the corresponding optimal transports. The stopping criterion for the optimization was an amount of $$\le 0.05$$ of mistransported mass.

The results for $$n=250$$ support points of $$\nu$$ are shown in Fig. 4a, b, those for $$n=1000$$ support points in Fig. 4c, d. Each bar shows the mean of the runtimes on one core of a mobile Intel Core i7 across the 200 experiments for the respective parameter combination; the blue bars are for the $$\nu$$ measures with uniform masses, the red bars for the measures with masses selected from the corresponding $$\mu$$ measure. The lines indicate the standard deviations.

We observe that computation times stay more or less the same between parameter choices (with some sampling variation) if the $$\nu$$-masses are taken from the corresponding $$\mu$$-measure. In this case mass can typically be assigned (very) locally, and slightly more so if $$\rho$$ has fewer local fluctuations (higher $$\gamma$$ and/or s).

This seems a plausible explanation for the relatively small computation times.

In contrast, if all $$\nu$$-masses are the same, the computation times are considerably higher and increase substantially with increasing $$\gamma$$ and somewhat with increasing smoothness. This seems consistent with the hypothesis that the more the optimal transport problem can be solved by assigning mass locally the lower the computation times. For larger scales many of the support points of $$\nu$$ compete strongly for the assignment of mass and a solution can only be found globally. A lower smoothness may alleviate the problem somewhat, because it creates locally more variation in the available mass.

In addition to the runtimes, we also recorded how many update steps for the weight vector w were performed until convergence. We only investigate the update steps for the transport to the original measure $$\nu$$, not the coarsenings $$\nu ^{l},\ l>0$$, because the former dominates the runtime, and also has a different dimensionality than the coarsenings. We have computed the Pearson and Spearman correlation coefficients between the numbers of update steps and the runtimes. Both for $$n=250$$ and $$n=1000$$ support points of $$\nu$$, these correlation coefficients are larger than 0.99, indicating very high correlation. This strongly suggests that the differences in runtimes are not due to intricacies of the line search procedure or Voronoi cell computations, but rather due to differences in the structures of the simulated problem instances.

We would like to note that to the best of our knowledge the present implementation is the first one for computing the optimal transport in the semi-discrete setting for the case $$p=1$$, which means that fair performance comparisons with other algorithms are not easily possible.

## 6 Applications

We investigate three concrete problem settings in order to better understand the workings and performance of our algorithm as well as to illustrate various theoretical and practical aspects pointed out in the paper.

### 6.1 Optimal transport between two normal distributions

We consider the two bivariate normal distributions $$\mu = \mathrm {MVN}_2(a, \sigma ^2 \mathrm {I}_2)$$ and $$\nu = \mathrm {MVN}_2(b, \sigma ^2 \mathrm {I}_2)$$, where $$a = 0.8 \cdot \mathbb {1}$$, $$b = 2.2 \cdot \mathbb {1}$$ and $$\sigma ^2 = 0.1$$, i.e. they both have the same spherical covariance matrix such that one distribution is just a displacement of the other. For computations we have truncated both measures to the set $$\mathscr {X}= [0,3]^2$$. By discretization (quantization) a measure $$\tilde{\nu }$$ is obtained from $$\nu$$. We then compute the optimal transport partition and the Wasserstein distances between $$\mu$$ and $$\tilde{\nu }$$ for both $$p=1$$ and $$p=2$$. Computations and plots for $$p=2$$ are obtained with the package transport (Schuhmacher et al. 2019) for the statistical computing environment R (R Core Team 2017). For $$p=1$$ we use our implementation presented in the previous section.

Note that for the original problem of optimal transport from $$\mu$$ to $$\nu$$ the solution is known exactly, so we can use this example to investigate the correct working of our implementation. In fact, for any probability measure $$\mu '$$ on $$\mathbb {R}^d$$ and its displacement $$\nu ' = T_{\#} \mu '$$, where $$T :\mathbb {R}^2 \rightarrow \mathbb {R}^2, x \mapsto x + (b-a)$$ for some vector $$b-a \in \mathbb {R}^d$$, it is immediately clear that the translation T induces an optimal transport plan for (1) and that $$W_p(\mu ',\nu ') = \Vert b-a\Vert$$ for arbitrary $$p \ge 1$$. This holds because we obtain by Jensen’s inequality $$(\mathbb {E}\Vert X-Y\Vert ^p)^{1/p} \ge \Vert \mathbb {E}(X-Y)\Vert = \Vert b-a\Vert$$ for $$X \sim \mu '$$, $$Y \sim \nu '$$; therefore $$W_p(\mu ',\nu ') \ge \Vert b-a\Vert$$ and T is clearly a transport map from $$\mu '$$ to $$\nu '$$ that achieves this lower bound. For $$p=2$$ Theorem 9.4 in Villani (2009) yields that T is the unique optimal transport map and the induced plan $$\pi _T$$ is the unique optimal transport plan. In the case $$p=1$$ neither of these objects is unique due to the possibility to rearrange mass transported within the same line segment at no extra cost.

Discretization was performed by applying the weighted K-means algorithm based on the discretization of $$\mu$$ to a fine grid and an initial configuration of cluster centers drawn independently from distribution $$\nu$$ and equipped with the corresponding density values of $$\nu$$ as weights. The number of cluster centers was kept to $$n=300$$ for better visibility in the plots below. We write $$\tilde{\nu }= \sum _{i=1}^n \delta _{y_i}$$ for the discretized measure. The discretization error can be computed numerically by solving another semi-discrete transport problem, see the third column of Table 1 below.

The first column of Fig. 5 depicts the measures $$\mu$$ and $$\tilde{\nu }$$ and the resulting optimal transport partitions for $$p=1$$ and $$p=2$$. In the case $$p=1$$ the nuclei of the weighted Voronoi tessellation are always contained in their cells, whereas for $$p=2$$ this need not be the case. We therefore indicate the relation by a gray arrow pointing from the centroid of the cell to its nucleus whenever the nucleus is outside the cell. The theory for the case $$p=2$$, see e.g. Merigot (2011, Sect. 2), identifies the tessellation as a Laguerre tessellation (or power diagram), which consists of convex polygons.

The partitions obtained for $$p=1$$ and $$p=2$$ look very different, but they both capture optimal transports along the direction $$b-a$$ very closely. For $$p=2$$ we clearly see a close approximation of the optimal transport map T introduced above. For $$p=1$$ we see an approximation of an optimal transport plan $$\pi$$ that collects the mass for any $$y \in \mathscr {Y}$$ somewhere along the way in the direction $$b-a$$.

The second column of Table 1 gives the Wasserstein distances computed numerically based on these partitions. Both of them are very close to the theoretical value of $$\Vert b-a\Vert = \sqrt{2} \cdot 1.4 \approx 1.979899$$, and in particular they are well inside the boundaries set by the approximation error.

We also investigate the effect of adding a common measure to both $$\mu$$ and $$\nu$$: let $$\alpha = \mathrm {Leb}^d\vert _{\mathscr {X}}$$ and proceed in the same way as above for the measures $$\mu ' = \mu +\alpha$$ and $$\nu '=\nu +\alpha$$, calling the discretized measure $$\tilde{\nu }'$$. Note that the discretization error (sixth column of Table 1) is considerably higher, on the one hand due to the fact that the $$n=300$$ support points of $$\tilde{\nu }'$$ have to be spread far wider, on the other hand because the total mass of each measure is 10 now compared to 1 before.

The second column of Fig. 5 depicts the measures $$\mu '$$ and $$\tilde{\nu }'$$ and the resulting optimal transport partitions for $$p=1$$ and $$p=2$$. Both partitions look very different from their counterparts when no $$\alpha$$ is added. However the partition for $$p=1$$ clearly approximates a transport plan along the direction of $$b-a$$ again. Note that the movement of mass is much more local now, meaning the approximated optimal transport plan is not just obtained by keeping measure $$\alpha$$ in place and moving the remaining measure $$\mu$$ according to the optimal transport plan $$\pi$$ approximated in Fig. 5c, but a substantial amount of mass available from $$\alpha$$ is moved as well. Furthermore, Fig. 5d gives the impression of a slightly curved movement of mass. We attribute this to a combination of a boundary effect from trimming the Lebesgue measure to $$\mathscr {X}$$ and numerical error based on the coarse discretization and a small amount of mistransported mass.

The computed $$W_1$$-value for this new example (last column of Table 1) lies in the vicinity of the theoretical value again if one allows for the rather large discretization error.

The case $$p=2$$ exhibits the distinctive curved behavior that goes with the fluid mechanics interpretation discussed in Sect. 1.2, see also Fig. 1. Various of the other points mentioned in Sect. 1.2 can be observed as well, e.g. the numerically computed Wasserstein distance is much smaller than for $$p=1$$, which illustrates the lack of invariance and seems plausible in view of the example in Remark 2 in the appendix.

### 6.2 A practical resource allocation problem

We revisit the delivery problem mentioned in the introduction. A fast-food delivery service has 32 branches throughout a city area, depicted by the black dots on the map in Fig. 6. For simplicity of representation we assume that most branches have the same fixed production capacity and a few individual ones (marked by an extra circle around the dot) have twice that capacity. We assume further that the expected orders at peak times have a spatial distribution as indicated by the heatmap (where yellow to white means higher number of orders) and a total volume that matches the total capacity of the branches. The task of the fast-food chain is to partition the map into 32 delivery zones, matching expected orders in each zone with the capacity of the branches, in such a way that the expected cost in form of the travel distance between branch and customer is minimal. We assume here the Euclidean distance, either because of a street layout that comes close to it, see e.g. Boscoe et al. (2012), or because the deliveries are performed by drones. The desired partition, computed by our implementation described in Sect. 4.3, is also displayed in Fig. 6. A number of elongated cells in the western and central parts of the city area suggest that future expansions of the fast-food chain should concentrate on the city center in the north.

### 6.3 A visual tool for detecting deviations from a density map

Very recently, asymptotic theory has been developed that allows, among other things, to test based on the Wasserstein metric $$W_p$$ whether a sample in $$\mathbb {R}^d$$ comes from a given multivariate probability distribution Q. More precisely, assuming independent and identically distributed random vectors $$X_1,\ldots ,X_n$$ with distribution P, limiting distributions have been derived for suitable standardizations of $$W_p(\frac{1}{n} \sum _{i=1}^n \delta _{X_i}, Q)$$ both if $$P=Q$$ and if $$P \ne Q$$. Based on an observed value $$W_p(\frac{1}{n} \sum _{i=1}^n \delta _{x_i}, Q)$$, where $$x_1,\ldots ,x_n \in \mathbb {R}^d$$, these distributions allow to assign a level of statistical certainty (p-value) to statements of $$P=Q$$ and $$P \ne Q$$, respectively. See Sommerfeld and Munk (2018), which uses general $$p \ge 1$$, but requires discrete distributions P and Q; and del Barrio and Loubes (2018), which is constraint to $$p=2$$, but allows for quite general distributions (P and Q not both discrete).

We propose here the optimal transport partition between an absolutely continuous Q and $$\frac{1}{n} \sum _{i=1}^n \delta _{x_i}$$ as a simple but useful tool for assessing the hypothesis $$P=Q$$. We refer to this tool as goodness-of-fit (GOF) partition. If $$d=2$$, relevant information may be gained from a simple plot of this partition in a similar way as residual plots are used for assessing the fit of linear models. As a general rule of thumb the partition is consistent with the hypothesis $$P=Q$$ if it consists of many “round” cells that contain their respective P-points roughly in their middle. The size of cells may vary according to local densities and there are bound to be some elongated cells due to sampling error (i.e. the fact that we can only sample from P and do not know it exactly), but a local accumulation of many elongated cells should give rise to the suspicion that $$P=Q$$ may be violated in a specific way. Thus GOF partitions provide the data scientist both with a global impression for the plausibility of $$P=Q$$ and with detailed local information about the nature of potential deviations of P from Q. Of course they are a purely explorative tool and do not give any quantitative guarantees.

We give here an example for illustration. Suppose we have data as given in the left panel of Fig. 7 and a distribution Q as represented by the heat map in the right panel. Fig. 8 shows the optimal transport partition for this situation on the left hand side. The partition indicates that the global fit of the data is quite good. However it also points out some deviations that might be spurious, but might also well be worth further investigation: one is the absence of points close to the two highest peaks in the density map, another one that there are some points too many in the cluster on the very left of the plot. Both of them are quite clearly visible as accumulations of elongated cells.

As an example of a globally bad fit we show in the right panel of Fig. 8 the GOF partition when taking as Q the uniform measure on the square.

For larger d direct visual inspection becomes impossible. However, a substantial amount of information may still be extracted, either by considering statistics of the GOF partition in d dimensions that are able to detect local regions of common orientation and high eccentricity of cells, or by applying dimension reduction methods, such as (Flamary et al. 2018), before applying the GOF partition.

## 7 Discussion and outlook

We have given a comprehensive account on semi-discrete optimal transport for the Euclidean cost function, arguing that there are sometimes good reasons to prefer Euclidean over squared Euclidean cost and showing that for the Euclidean case the semi-discrete setting is particularly nice because we obtain a unique solution to the Monge–Kantorovich problem that is induced by a transport map. We have provided a reasonably fast algorithm that is similar to the AHA-algorithm described in detail in Mérigot (2011) but adapted in various aspects to the current situation of $$p=1$$.

Our algorithm converges towards the optimal partition subject to the convergence conditions for the L-BFGS algorithm; see e.g. Nocedal (1980). Very loosely, such conditions state that we start in a region around the minimizer where the objective function $$\Phi$$ shows to some extent quadratic behavior. Similar to the AHA-algorithm in Mérigot (2011), a proof of such conditions is not available. In practice, the algorithm has converged in all the experiments and examples given in the present paper.

There are several avenues for further research, both with regard to improving speed and robustness of the algorithm and for solving more complicated problems where our algorithm may be useful. Some of them are:

• As mentioned earlier, it may well be that the choice of our starting value is too simplistic and that faster convergence is obtained more often if the sequence $$\nu = \nu ^{(0)}, \dots , \nu ^{(L)}$$ of coarsenings is e.g. based on the K-median algorithm or a similar method. The difficulty lies in finding $$\nu ^{(l-1)}$$ that makes $$W_1(\nu ^{(l)},\nu ^{(l-1)})$$ substantially smaller without investing too much time in its computation.

• We currently keep the threshold $$\varepsilon$$ in the stopping criterion of the multiscale approach in Sect. 4.1 fixed. Another alleviation of the computational burden may be obtained by choosing a suitable sequence $$\varepsilon _L, \ldots , \varepsilon _0$$ of thresholds for the various scales. It seems particularly attractive to use for the threshold at the coarser scale a value $$\varepsilon _l > 0$$ that is smaller than the value $$\varepsilon _{l-1}$$ at the finer scale, especially for the last step, where $$l=1$$. The rationale is that at the coarser scale we do not run easily into numerical problems and still reach the stricter $$\varepsilon _l$$-target efficiently. The obtained weight vector is expected to result in a better starting solution for the finer problem that reaches the more relaxed threshold $$\varepsilon _{l-1}$$ more quickly than a starting solution stemming from an $$\varepsilon _{l-1}$$-target at the coarser scale.

• The L-BFGS algorithm used for the optimization process may be custom-tailored to our discretization of $$\mu$$ in order to reach the theoretically maximal numerical precision that the discretization allows. It could e.g. use simple gradient descent from the point on where L-BFGS cannot minimize $$\Phi$$ any further since even in the discretized case the gradient always points in a descending direction.

• Approximating the intersections of $$\mu$$-pixels with weighted Voronoi cells by splitting pixels into very small subpixels has shown good results. However, as mentioned in Sect. 4.2, higher numerical stability and precision could be obtained by computing the intersections between the Voronoi cells and the pixels of $$\mu$$ exactly. Currently we are only able to do this at the expense of a large increase in the overall computation time. It is of considerable interest to have a more efficient method at hand.

• One of the reviewers pointed out that there are recent formulae available for the Hessian of the function $$\Phi$$ in Theorem 3. Indeed, based on Theorem 1 in De Gournay et al. (2019) we formally obtain in our setting a Hessian matrix with entries

\begin{aligned} \frac{\partial ^2 \Phi }{\partial w_j \partial w_k} (w) = - \int _{{{\,\mathrm{Vor}\,}}_w(j) \cap {{\,\mathrm{Vor}\,}}_w(k)} \left\Vert \frac{x - y_j}{\Vert x-y_j\Vert } - \frac{x - y_k}{\Vert x-y_k\Vert } \right\Vert ^{-1} \varrho (x) \; \sigma _{d-1}(dx) \end{aligned}
(12)

for $$j \ne k$$, where $$\sigma _{d-1}$$ denotes $$(d-1)$$-dimensional Hausdorff measure, and

\begin{aligned} \frac{\partial ^2 \Phi }{\partial w_j^2} (w) = - \sum _{k \ne j} \frac{\partial ^2 \Phi }{\partial w_j \partial w_k} (w). \end{aligned}

Unfortunately, condition (Diff-2-a) required for this theorem is not satisfied for the unsquared Euclidean cost, since the norm term in (12) (without taking the inverse) goes to 0 as $$x \rightarrow \infty$$ along the boundary set $$H = \lbrace x\in \mathbb {R}^d \vert \Vert x-y_j\Vert - \Vert x-y_k\Vert = w_j - w_k \rbrace$$. We conjecture that the second derivative of $$\Phi$$ at w still exists and is of the above form if the integrals in (12) are finite (maybe under mild additional conditions). If this can be established, we may in principle use a Newton method (with appropriate step size correction) for optimizing $$\Phi$$. It remains to be seen, however, if the advantage from using the Hessian rather than performing a quasi Newton method outweighs the considerably higher computational cost due to computing the above integrals numerically. Another goal could be to establish global convergence of such a Newton algorithm under similar conditions as in Theorem 1.5 in Kitagawa et al. (2019), which is quite general, but requires higher regularity of the cost function.

• Semi-discrete optimal transport may be used as an auxiliary step in a number of algorithms for more complicated problems. The most natural example is a simple alternating scheme for the capacitated location-allocation (or transportation-location) problem; see Cooper (1972). Suppose that our fast-food chain from Sect. 6.2 has not entered the market yet and would like to open n branches anywhere in the city and divide up the area into delivery zones in such a way that (previously known) capacity constraints of the branches are met and the expected cost in terms of travel distance is minimized again. A natural heuristic algorithm would start with a random placement of n branches and alternate between capacitated allocation of expected orders (the continuous measure $$\mu$$) using our algorithm described in Sect. 4 and the relocation of branches to the spatial medians of the zones. The latter can be computed by discrete approximation, see e.g. Croux et al. (2012), and possibly by continuous techniques, see Fekete et al. (2005) for a vantage point.