1 Introduction

Multimarginal Optimal Transport (\(\textsf {MOT}\)) is the problem of linear programming over joint probability distributions with fixed marginal distributions. In this way, \(\textsf {MOT}\) generalizes the classical Kantorovich formulation of Optimal Transport from 2 marginal distributions to an arbitrary number \(k \geqslant 2\) of them.

More precisely, an \(\textsf {MOT}\) problem is specified by a cost tensor C in the k-fold tensor product space \((\mathbb {R}^n)^{\otimes k}= \mathbb {R}^n \otimes \cdots \otimes \mathbb {R}^n\), and k marginal distributions \(\mu _1, \dots , \mu _k\) in the simplex \(\varDelta _n = \{v \in \mathbb {R}_{\geqslant 0}^n : \sum _{i=1}^n v_i = 1 \}\). The \(\textsf {MOT}\) problem is to compute

$$\begin{aligned} \min _{P \in \mathcal {M}(\mu _1,\dots ,\mu _k)} \langle P, C \rangle \end{aligned}$$
(MOT)

where \(\mathcal {M}(\mu _1,\dots ,\mu _k)\) is the “transportation polytope” consisting of all entrywise non-negative tensors \(P \in (\mathbb {R}^{n})^{\otimes k}\) satisfying the marginal constraints \(\sum _{j_1,\dots ,j_{i-1}, j_{i+1}, \dots , j_{k}} P_{j_1, \dots , j_{i-1}, j, j_{i+1}, \dots , j_k} = [\mu _i]_j\) for all \(i \in \{1, \dots , k\}\) and \(j \in \{1, \dots , n\}\).
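To make the linear program concrete, the following is a minimal brute-force sketch in Python (our illustration, not the paper's method; the name `solve_mot_bruteforce` and the use of `scipy.optimize.linprog` are assumptions). It materializes all \(n^k\) variables, which is precisely the exponential blowup that the algorithms in this paper avoid.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def solve_mot_bruteforce(C, marginals):
    """Solve (MOT) as an explicit LP with n^k variables (illustration only)."""
    k, n = C.ndim, C.shape[0]
    # Equality row (i, j): entries of P whose i-th index equals j sum to [mu_i]_j.
    A_eq = np.zeros((n * k, n ** k))
    for col, jvec in enumerate(itertools.product(range(n), repeat=k)):
        for i in range(k):
            A_eq[i * n + jvec[i], col] = 1.0
    b_eq = np.concatenate(marginals)
    res = linprog(C.reshape(-1), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.fun, res.x.reshape(C.shape)

# Tiny instance: k = 3 marginals, each uniform on n = 2 points.
C = np.random.default_rng(0).random((2, 2, 2))
value, P = solve_mot_bruteforce(C, [np.full(2, 0.5)] * 3)
```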

This \(\textsf {MOT}\) problem has many applications throughout machine learning, computer science, and the natural sciences since it arises in tasks that require “stitching” together aggregate measurements. For instance, applications of \(\textsf {MOT}\) include inference from collective dynamics [38, 49], information fusion for Bayesian learning [80], averaging point clouds [2, 36], the n-coupling problem [76], quantile aggregation [60, 75], matching for teams [28, 31], image processing [74, 79], random combinatorial optimization [1, 50, 61, 65, 68, 89, 93], Distributionally Robust Optimization [30, 62, 66], simulation of incompressible fluids [17, 26], and Density Functional Theory [16, 27, 32].

However, in most applications, the success of \(\textsf {MOT}\) is severely limited by the lack of efficient algorithms. Indeed, in general, \(\textsf {MOT}\) requires exponential time in the number of marginals k and their support sizes n. For instance, applying a linear program solver out-of-the-box takes \(n^{\varTheta (k)}\) time because \(\textsf {MOT}\) is a linear program with \(n^k\) variables, \(n^k\) non-negativity constraints, and nk equality constraints. Specialized algorithms in the literature such as the Sinkhorn algorithm yield similar \(n^{\varTheta (k)}\) runtimes. Such runtimes currently limit the applicability of \(\textsf {MOT}\) to tiny-scale problems (e.g., \(n=k=10\)).

Polynomial-time algorithms for \({{\textsf {\textit{MOT}}}}\). This paper develops polynomial-time algorithms for \(\textsf {MOT}\), where here and henceforth “polynomial” means in the number of marginals k and their support sizes n—and possibly also \(C_{\max }/\varepsilon \) for \(\varepsilon \)-additive approximation, where \(C_{\max }\) is a bound on the entries of C.

At first glance, this may seem impossible for at least two “trivial” reasons. One is that it takes exponential time to read the input cost C since it has \(n^k\) entries. We circumvent this issue by considering costs C with \(\mathrm {poly}(n,k)\)-size implicit representations, which encompasses essentially all \(\textsf {MOT}\) applications. A second obvious issue is that it takes exponential time to write the output variable P since it has \(n^k\) entries. We circumvent this issue by returning solutions P with \(\mathrm {poly}(n,k)\)-size implicit representations, for instance sparse solutions.

But, of course, circumventing these issues of input/output size is not enough to actually solve \(\textsf {MOT}\) in polynomial time. See [6] for examples of \(\mathsf {NP}\)-hard \(\textsf {MOT}\) problems with costs that have \(\mathrm {poly}(n,k)\)-size implicit representations.

Remarkably, for several \(\textsf {MOT}\) problems, there are specially-tailored algorithms that run in polynomial time—notably, for \(\textsf {MOT}\) problems with graphically-structured costs of constant treewidth [47, 49, 82], variational mean-field games [15], computing generalized Euler flows [14], computing low-dimensional Wasserstein barycenters [7, 14, 29], and filtering and estimation tasks based on target tracking [38, 46, 47, 49, 77]. However, the number of \(\textsf {MOT}\) problems that are known to be solvable in polynomial time is small, and it is unknown if these techniques can be extended to the many other \(\textsf {MOT}\) problems arising in applications. This motivates the central question driving this paper:

Are there general “structural properties” that make \(\textsf {MOT}\) solvable in \(\mathrm {poly}(n,k)\) time?

This paper is conceptually divided into two parts. In the first part of the paper, we develop a unified algorithmic framework for \(\textsf {MOT}\) that characterizes the structure required for different algorithms to solve \(\textsf {MOT}\) in \(\mathrm {poly}(n,k)\) time, in terms of simple variants of the dual feasibility oracle. This enables us to prove that some algorithms can solve \(\textsf {MOT}\) problems in polynomial time whenever any algorithm can, whereas the popular Sinkhorn algorithm cannot. Moreover, this algorithmic framework makes it significantly easier to design a \(\mathrm {poly}(n,k)\)-time algorithm for a given \(\textsf {MOT}\) problem (when possible) because it then suffices to solve the dual feasibility oracle—and this is much more amenable to standard algorithmic techniques. In the second part of the paper, we demonstrate the ease-of-use of our algorithmic framework by applying it to three general classes of \(\textsf {MOT}\) cost structures.

Below, we detail these two parts of the paper in Sects. 1.1 and 1.2, respectively.

1.1 Contribution 1: unified algorithmic framework for \(\textsf {MOT}\)

In order to understand what structural properties make \(\textsf {MOT}\) solvable in polynomial time, we first lay a more general groundwork. The purpose of this is to understand the following fundamental questions:

  1. Q1. What are reasonable candidate algorithms for solving structured \(\textsf {MOT}\) problems in polynomial time?

  2. Q2. What structure must an \(\textsf {MOT}\) problem have for these algorithms to have polynomial runtimes?

  3. Q3. Is the structure required by a given algorithm more restrictive than the structure required by a different algorithm (or by any algorithm)?

  4. Q4. How can one check whether this structure occurs for a given \(\textsf {MOT}\) problem?

We detail our answers to these four questions below in Sects. 1.1.1 to 1.1.4, and then briefly discuss practical tradeoffs beyond polynomial-time solvability in Sect. 1.1.5; see Table 1 for a summary. We expect that this general groundwork will prove useful in future investigations of tractable \(\textsf {MOT}\) problems.

Table 1 These \(\textsf {MOT}\) algorithms have polynomial runtime except for a bottleneck “oracle”

1.1.1 Answer to Q1: candidate \(\mathrm {poly}(n,k)\)-time algorithms

We consider three algorithms for \(\textsf {MOT}\) whose exponential runtimes can be isolated into a single bottleneck—and thus can be implemented in polynomial time whenever that bottleneck can. These algorithms are the Ellipsoid algorithm \(\texttt {ELLIPSOID}\) [44], the Multiplicative Weights Update algorithm \(\texttt {MWU}\) [91], and the natural multidimensional analog of Sinkhorn’s scaling algorithm \(\texttt {SINKHORN}\) [14, 70]. \(\texttt {SINKHORN}\) is specially tailored to \(\textsf {MOT}\) and is currently the predominant algorithm for it. To foreshadow our answer to Q3, the reason that we restrict to these candidate algorithms is that we show \(\texttt {ELLIPSOID}\) and \(\texttt {MWU}\) can solve an \(\textsf {MOT}\) problem in polynomial time if and only if any algorithm can.

1.1.2 Answer to Q2: structure necessary to run candidate algorithms

These three algorithms only access the cost tensor C through polynomially many calls of their respective bottlenecks. Thus the structure required to implement these candidate algorithms in polynomial time is equivalent to the structure required to implement their respective bottlenecks in polynomial time.

In Sect. 4, we show that the bottlenecks of these three algorithms are polynomial-time equivalent to natural analogs of the feasibility oracle for the dual LP to \(\textsf {MOT}\). Namely, given weights \(p_1, \dots , p_k \in \mathbb {R}^n\), compute

$$\begin{aligned} \min _{(j_1,\dots ,j_k) \in \{1, \dots , n\}^k} C_{j_1,\dots ,j_k}- \sum _{i=1}^k [p_i]_{j_i} \end{aligned}$$
(1.1)

either exactly for \(\texttt {ELLIPSOID}\), approximately for \(\texttt {MWU}\), or with the “min” replaced by a “softmin” for \(\texttt {SINKHORN}\). We call these three tasks the \(\textsf {MIN}\), \(\textsf {AMIN}\), and \(\textsf {SMIN}\) oracles, respectively. See Remark 3.4 for the interpretation of these oracles as variants of the dual feasibility oracle.

These three oracles take \(n^k\) time to implement in general. However, for a wide range of structured cost tensors C they can be implemented in \(\mathrm {poly}(n,k)\) time, see Sect. 1.2 below. For such structured costs C, our oracle abstraction immediately implies that the \(\textsf {MOT}\) problem with cost C and any input marginals \(\mu _1, \dots , \mu _k\) can be (approximately) solved in polynomial time by any of the three respective algorithms.

Our characterization of the algorithms’ bottlenecks as variations of the dual feasibility oracle has two key benefits—which are the answers to Q3 and Q4, described below.

1.1.3 Answer to Q3: characterizing what \(\textsf {MOT}\) problems each algorithm can solve

A key benefit of our characterization of the algorithms’ bottlenecks as variations of the dual feasibility oracles is that it enables us to establish whether the structure required by a given \(\textsf {MOT}\) algorithm is more restrictive than the structure required by a different algorithm (or by any algorithm).

In particular, this enables us to answer the natural question: why restrict to just the three algorithms described above? Can other algorithms solve \(\textsf {MOT}\) in \(\mathrm {poly}(n,k)\) time in situations where these algorithms cannot? Critically, the answer is no: restricting ourselves to the \(\texttt {ELLIPSOID}\) and \(\texttt {MWU}\) algorithms comes at no loss of generality.

Theorem 1.1

(Informal statement of part of Theorems 4.1 and 4.7) For any family of costs \(C \in (\mathbb {R}^n)^{\otimes k}\):

  • \(\texttt {ELLIPSOID}\) computes an exact solution for \(\textsf {MOT}\) in \(\mathrm {poly}(n,k)\) time if and only if any algorithm can.

  • \(\texttt {MWU}\) computes an \(\varepsilon \)-approximate solution for \(\textsf {MOT}\) in \(\mathrm {poly}(n,k,C_{\max }/\varepsilon )\) time if and only if any algorithm can.

The statement for \(\texttt {ELLIPSOID}\) is implicit from classical results about LP [44] combined with arguments from [7], see the previous work section Sect. 1.3. The statement for \(\texttt {MWU}\) is new to this paper.

The oracle abstraction helps us show Theorem 1.1 because it reduces the question of what structure is needed for the algorithms to solve \(\textsf {MOT}\) in polynomial time to the question of what structure is needed to solve their respective bottlenecks in polynomial time. Thus Theorem 1.1 is a consequence of the following result. (The “if” part of this result is a contribution of this paper; the “only if” part was shown in [6].)

Theorem 1.2

(Informal statement of part of Theorems 4.1 and 4.7) For any family of costs \(C \in (\mathbb {R}^n)^{\otimes k}\):

  • \(\textsf {MOT}\) can be exactly solved in \(\mathrm {poly}(n,k)\) time if and only if \(\textsf {MIN}\) can.

  • \(\textsf {MOT}\) can be \(\varepsilon \)-approximately solved in \(\mathrm {poly}(n,k,C_{\max }/\varepsilon )\) time if and only if \(\textsf {AMIN}\) can.

Interestingly, a further consequence of our unified algorithm-to-oracle abstraction is that it enables us to show that \(\texttt {SINKHORN}\)—which is currently the most popular algorithm for \(\textsf {MOT}\) by far—requires strictly more structure to solve an \(\textsf {MOT}\) problem than other algorithms require. This is in sharp contrast to the complete generality of the other two algorithms shown in Theorem 1.1.

Theorem 1.3

(Informal statement of Theorem 4.19) Under standard complexity-theoretic assumptions, there exists a family of \(\textsf {MOT}\) problems that can be solved exactly in \(\mathrm {poly}(n,k)\) time using \(\texttt {ELLIPSOID}\), however it is impossible to implement a single iteration of \(\texttt {SINKHORN}\) (even approximately) in \(\mathrm {poly}(n,k)\) time.

The reason that our unified algorithm-to-oracle abstraction helps us show Theorem 1.3 is that it puts \(\texttt {SINKHORN}\) on equal footing with the other two classical algorithms in terms of their reliance on variants of the dual feasibility oracle. This reduces proving Theorem 1.3 to showing the following separation between the \(\textsf {SMIN}\) oracle and the other two oracles.

Theorem 1.4

(Informal statement of Lemma 3.7) Under standard complexity-theoretic assumptions, there exists a family of cost tensors \(C \in (\mathbb {R}^n)^{\otimes k}\) such that there are \(\mathrm {poly}(n,k)\)-time algorithms for \(\textsf {MIN}\) and \(\textsf {AMIN}\), however it is impossible to solve \(\textsf {SMIN}\) (even approximately) in \(\mathrm {poly}(n,k)\) time.

1.1.4 Answer to Q4: ease-of-use for checking if \(\textsf {MOT}\) is solvable in polynomial time

The second key benefit of this oracle abstraction is that it makes it easier to show that a given \(\textsf {MOT}\) problem (whose cost C is input implicitly through some concise representation) is solvable in polynomial time, because it reduces \(\textsf {MOT}\), without loss of generality, to solving any one of the three corresponding oracles in polynomial time. The upshot is that these oracles are more directly amenable to standard algorithmic techniques since they are phrased as more conventional combinatorial-optimization problems. In the second part of the paper, we illustrate this ease-of-use via applications to three general classes of structured \(\textsf {MOT}\) problems; for an overview see Sect. 1.2.

1.1.5 Practical algorithmic tradeoffs beyond polynomial-time solvability

From a theoretical perspective, the most important aspect of an algorithm is whether it can solve \(\textsf {MOT}\) in polynomial time if and only if any algorithm can. As we have discussed, this is true for \(\texttt {ELLIPSOID}\) and \(\texttt {MWU}\) (Theorem 1.1) but not for \(\texttt {SINKHORN}\) (Theorem 1.3). Nevertheless, for a wide range of \(\textsf {MOT}\) cost structures, all three oracles can be implemented in polynomial time, which means that all three algorithms \(\texttt {ELLIPSOID}\), \(\texttt {MWU}\), and \(\texttt {SINKHORN}\) can be implemented in polynomial time. Which algorithm is best in practice depends on the relative importance of the following considerations for the particular application.

  • Error. \(\texttt {ELLIPSOID}\) computes exact solutions, whereas \(\texttt {MWU}\) and \(\texttt {SINKHORN}\) only compute low-precision solutions due to \(\mathrm {poly}(1/\varepsilon )\) runtime dependence.

  • Solution sparsity. \(\texttt {ELLIPSOID}\) and \(\texttt {MWU}\) output solutions with polynomially many non-zero entries (roughly nk), whereas \(\texttt {SINKHORN}\) outputs fully dense solutions with \(n^k\) non-zero entries (through a polynomial-size implicit representation, see Sect. 4.3). Solution sparsity enables interpretability, visualization, and efficient downstream computation—benefits which are helpful in diverse applications, for example ranging from computer graphics [20, 71, 79] to facility location problems [10] to machine learning [7, 33] to ecological inference [64] to fluid dynamics (see Sect. 5.3), and more. Furthermore, in Sect. 7.4, we show that sparse solutions for \(\textsf {MOT}\) (a.k.a. linear optimization over the transportation polytope) enable efficiently solving certain non-linear optimization problems over the transportation polytope.

  • Practical runtime. Although all three algorithms enjoy polynomial runtime guarantees, the polynomials are smaller for some algorithms than for others. In particular, \(\texttt {SINKHORN}\) has remarkably good scalability in practice as long as the error \(\varepsilon \) is not too small and its bottleneck oracle \(\textsf {SMIN}\) is practically implementable. By Theorems 1.1 and 1.3, \(\texttt {MWU}\) can solve strictly more \(\textsf {MOT}\) problems in polynomial time than \(\texttt {SINKHORN}\); however, it is less scalable in practice when both \(\texttt {MWU}\) and \(\texttt {SINKHORN}\) can be implemented. \(\texttt {ELLIPSOID}\) is not practical and is used solely as a proof of concept that problems are tractable to solve exactly; in practice, we use Column Generation (see, e.g., [18, §6.1]) rather than \(\texttt {ELLIPSOID}\) as it has better empirical performance, yet still has the same bottleneck oracle \(\textsf {MIN}\); see Sect. 4.1.3. Column Generation is not as practically scalable as \(\texttt {SINKHORN}\) in n and k but has the benefit of computing exact, sparse solutions.

To summarize: which algorithm is best in practice depends on the application. For example, Column Generation produces the qualitatively best solutions for the fluid dynamics application in Sect. 5.3, \(\texttt {SINKHORN}\) is the most scalable for the risk estimation application in Sect. 7.3, and \(\texttt {MWU}\) is the most scalable for the network reliability application in Sect. 6.3 (for that application there is no known implementation of \(\texttt {SINKHORN}\) that is practically efficient).

1.2 Contribution 2: applications to general classes of structured \(\textsf {MOT}\) problems

In the second part of the paper, we illustrate the algorithmic framework developed in the first part of the paper by applying it to three general classes of \(\textsf {MOT}\) cost structures:

  1. Graphical structure (in Sect. 5).

  2. Set-optimization structure (in Sect. 6).

  3. Low-rank plus sparse structure (in Sect. 7).

Specifically, if the cost C is structured in any of these three ways, then \(\textsf {MOT}\) can be (approximately) solved in \(\mathrm {poly}(n,k)\) time for any input marginals \(\mu _1, \dots , \mu _k\).

Previously, it was known how to solve \(\textsf {MOT}\) problems with structure (1) using \(\texttt {SINKHORN}\) [49, 82], but this only computes solutions that are dense (with \(n^k\) non-zero entries) and low-precision (due to \(\mathrm {poly}(1/\varepsilon )\) runtime dependence). We therefore provide the first solutions that are sparse and exact for structure (1). For structures (2) and (3), we provide the first polynomial-time algorithms, even for approximate computation. These three structures are incomparable: in general, a problem falling under one of the three structures cannot be modeled in a non-trivial way using any of the others; for details see Remarks 6.7 and 7.3. This means that the new structures (2) and (3) enable capturing a wide range of new applications.

Below, we detail these structures individually in Sects. 1.2.1, 1.2.2, and 1.2.3. See Table 2 for a summary.

Table 2 In the second part of the paper, we illustrate the ease-of-use of our algorithmic framework by applying it to three general classes of \(\textsf {MOT}\) cost structures. These structures encompass many—if not most—current applications of MOT

1.2.1 Graphical structure

In Sect. 5, we apply our algorithmic framework to \(\textsf {MOT}\) problems with graphical structure, a broad class of \(\textsf {MOT}\) problems that have been previously studied [47, 49, 82]. Briefly, an \(\textsf {MOT}\) problem has graphical structure if its cost tensor C decomposes as

$$\begin{aligned} C_{j_1, \dots , j_k} = \sum _{S \in \mathcal {S}} f_S(\vec {j}_S), \end{aligned}$$

where \(f_S(\vec {j}_S)\) are arbitrary “local interactions” that depend only on tuples \(\vec {j}_S := \{j_i\}_{i \in S}\) of the k variables.

In order to derive efficient algorithms, it is necessary to restrict how local the interactions are because otherwise \(\textsf {MOT}\) is \(\mathsf {NP}\)-hard (even if all interaction sets \(S \in \mathcal {S}\) have size 2) [6]. We measure the locality of the interactions via the standard complexity measure of the “treewidth” of the associated graphical model. See Sect. 5.1 for formal definitions. While the runtimes of our algorithms (and all previous algorithms) depend exponentially on the treewidth, we emphasize that the treewidth is a very small constant (either 1 or 2) in all current applications of \(\textsf {MOT}\) falling under this framework; see the related work section.

We show that for \(\textsf {MOT}\) cost tensors that have graphical structure of constant treewidth, all three oracles can be implemented in \(\mathrm {poly}(n,k)\) time. We accomplish this by leveraging the known connection between graphically structured \(\textsf {MOT}\) and graphical models shown in [49]. In particular, the \(\textsf {MIN}\), \(\textsf {AMIN}\), and \(\textsf {SMIN}\) oracles are respectively equivalent to the mode, approximate mode, and log-partition function of an associated graphical model. Thus we can implement all oracles in \(\mathrm {poly}(n,k)\) time by simply applying classical algorithms from the graphical models community [55, 88].
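To see why small treewidth helps, consider the simplest case of a chain-structured cost \(C_{j_1,\dots ,j_k} = \sum _{i=1}^{k-1} f_i(j_i, j_{i+1})\) (treewidth 1). The sketch below (our names and code, using the textbook dynamic-programming and log-partition recursions rather than the paper's pseudocode) implements the \(\textsf {MIN}\) and \(\textsf {SMIN}\) oracles in \(O(kn^2)\) time.

```python
import numpy as np
from scipy.special import logsumexp

def min_oracle_chain(f, p):
    """MIN oracle for C_{j_1..j_k} = sum_i f[i][j_i, j_{i+1}] in O(k n^2) time.
    f: list of k-1 (n x n) arrays of pairwise interactions; p: k weight vectors."""
    m = -p[0]                                   # best prefix value ending at j_1
    for i in range(len(f)):
        # extend the prefix: minimize over j_i, land on coordinate j_{i+1}
        m = np.min(m[:, None] + f[i] - p[i + 1][None, :], axis=0)
    return float(m.min())

def smin_oracle_chain(f, p, eta):
    """SMIN oracle: the same recursion with min replaced by softmin_eta,
    i.e. the log-partition function of the associated chain graphical model."""
    a = -p[0]
    for i in range(len(f)):
        a = -logsumexp(-eta * (a[:, None] + f[i] - p[i + 1][None, :]), axis=0) / eta
    return float(-logsumexp(-eta * a) / eta)
```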

Theorem 1.5

(Informal statement of Theorem 5.5) Let \(C \in (\mathbb {R}^n)^{\otimes k}\) have graphical structure of constant treewidth. Then the \(\textsf {MIN}\), \(\textsf {AMIN}\), and \(\textsf {SMIN}\) oracles can be computed in \(\mathrm {poly}(n,k)\) time.

It is an immediate corollary of Theorem 1.5 and our algorithms-to-oracles reduction described in Sect. 1.1 that one can implement \(\texttt {ELLIPSOID}\), \(\texttt {MWU}\), and \(\texttt {SINKHORN}\) in polynomial time. Below, we record the theoretical guarantee of \(\texttt {ELLIPSOID}\) since it is the best of the three algorithms as it computes exact, sparse solutions.

Theorem 1.6

(Informal statement of Corollary 5.6) Let \(C \in (\mathbb {R}^n)^{\otimes k}\) have graphical structure of constant treewidth. Then an exact, sparse solution for \(\textsf {MOT}\) can be computed in \(\mathrm {poly}(n,k)\) time.

Previously, it was known how to solve such \(\textsf {MOT}\) problems [49, 82] using \(\texttt {SINKHORN}\), but this only computes a solution that is fully dense (with \(n^k\) non-zero entries) and low-precision (due to \(\mathrm {poly}(1/\varepsilon )\) runtime dependence); see the related work section for details. Our result improves over this state-of-the-art algorithm by producing solutions that are exact and sparse in \(\mathrm {poly}(n,k)\) time.

In Sect. 5.3, we demonstrate the benefit of Theorem 1.6 on the application of computing generalized Euler flows, which was historically the motivation of \(\textsf {MOT}\) and has received significant attention, e.g., [14, 17, 23,24,25,26]. While there is a specially-tailored version of the \(\texttt {SINKHORN}\) algorithm for this problem that runs in polynomial time [14, 17], it produces solutions that are approximate and fully dense. Our algorithm produces exact, sparse solutions which lead to sharp visualizations rather than blurry ones (see Fig. 4).

1.2.2 Set-optimization structure

In Sect. 6, we apply our algorithmic framework to \(\textsf {MOT}\) problems whose cost tensors C take value 0 or 1 in each entry. That is, costs C of the form

$$\begin{aligned} C_{j_1,\dots ,j_k} = \mathbb {1}[(j_1,\dots ,j_k) \notin S], \end{aligned}$$

for some subset \(S \subseteq [n]^k\). Such \(\textsf {MOT}\) problems arise naturally in applications where one seeks to minimize the probability that some event S occurs, given marginal probabilities on each variable \(j_i\); see Example 6.1.

In order to derive efficient algorithms, it is necessary to restrict the (otherwise arbitrary) set S. We parametrize the complexity of such \(\textsf {MOT}\) problems via the complexity of finding the minimum-weight object in S. This opens the door to combinatorial applications of \(\textsf {MOT}\) because finding the minimum-weight object in S is well-known to be polynomial-time solvable for many “combinatorially-structured” sets S of interest—e.g., the set S of cuts in a graph, or the set S of independent sets in a matroid.
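As a concrete instance of finding the minimum-weight object in S, suppose each tuple \(\vec {j} \in \{0,1\}^k\) indicates a subset of the k edges of a connected graph and S is the set of spanning trees; then the minimum-weight object is computed by Kruskal's algorithm in polynomial time. A minimal sketch under this encoding (the encoding and names are our illustration, not the paper's notation):

```python
def min_weight_spanning_tree(num_vertices, edges, weight):
    """Kruskal's algorithm: min-weight object in S = {spanning trees},
    where a tuple j in {0,1}^k indicates which of the k edges are used.
    edges: list of (u, v) pairs; weight: list of k edge weights.
    Assumes the graph is connected."""
    parent = list(range(num_vertices))

    def find(x):                       # union-find with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    total, chosen = 0.0, [0] * len(edges)
    for e in sorted(range(len(edges)), key=lambda e: weight[e]):
        ru, rv = find(edges[e][0]), find(edges[e][1])
        if ru != rv:                   # adding edge e creates no cycle
            parent[ru] = rv
            total += weight[e]
            chosen[e] = 1
    return total, chosen               # 'chosen' is the minimizing tuple j

# Triangle graph: 3 vertices, k = 3 edges.
val, j = min_weight_spanning_tree(3, [(0, 1), (1, 2), (0, 2)], [1.0, 2.0, 3.0])
# val == 3.0, j == [1, 1, 0]
```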

We show that for \(\textsf {MOT}\) cost tensors with this structure, all three oracles can be implemented efficiently.

Theorem 1.7

(Informal statement of Theorem 6.8) Let \(C \in (\mathbb {R}^n)^{\otimes k}\) have set-optimization structure. Then the \(\textsf {MIN}\), \(\textsf {AMIN}\), and \(\textsf {SMIN}\) oracles can be computed in \(\mathrm {poly}(n,k)\) time.

It is an immediate corollary of Theorem 1.7 and our algorithms-to-oracles reduction described in Sect. 1.1 that one can implement \(\texttt {ELLIPSOID}\), \(\texttt {MWU}\), and \(\texttt {SINKHORN}\) in polynomial time. Below, we record the theoretical guarantee for \(\texttt {ELLIPSOID}\) since it is the best of these three algorithms as it computes exact, sparse solutions.

Theorem 1.8

(Informal statement of Corollary 6.9) Let \(C \in (\mathbb {R}^n)^{\otimes k}\) have set-optimization structure. Then an exact, sparse solution for \(\textsf {MOT}\) can be computed in \(\mathrm {poly}(n,k)\) time.

This is the first polynomial-time algorithm for this class of \(\textsf {MOT}\) problems. We note that a more restrictive class of \(\textsf {MOT}\) problems was studied in [93] under the additional restriction that S is upwards-closed.

In Sect. 6.3, we show how this general class of set-optimization structure captures, for example, the classical application of computing the extremal reliability of a network with stochastic edge failures. Network reliability is a fundamental topic in network science and engineering [12, 13, 41] which is often studied in an average-case setting where each edge fails independently with some given probability [52, 63, 72, 85]. The application investigated here is a robust notion of network reliability in which edge failures may be maximally correlated (e.g., by an adversary) or minimally correlated (e.g., by a network maintainer) subject to a marginal constraint on each edge’s failure probability, a setting that dates back to the 1980s [89, 93]. We show how to express both the minimally and maximally correlated network reliability problems as \(\textsf {MOT}\) problems with set-optimization structure, recovering as a special case of our general framework the known polynomial-time algorithms in [89, 93] as well as more practical polynomial-time algorithms that scale to input sizes that are an order-of-magnitude larger.

1.2.3 Low-rank and sparse structure

In Sect. 7, we apply our algorithmic framework to \(\textsf {MOT}\) problems whose cost tensors C decompose as

$$\begin{aligned} C = R + S, \end{aligned}$$

where R is a constant-rank tensor, and S is a polynomially-sparse tensor. We assume that R is represented in factored form, and that S is represented through its non-zero entries, which overall yields a \(\mathrm {poly}(n,k)\)-size representation of C.
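As a small illustration of this implicit representation, the sketch below (our names) evaluates a single entry of \(C = R + S\) in \(O(rk)\) time from the factored and sparse representations, without materializing the \(n^k\) tensor.

```python
import numpy as np

def entry(jvec, rank_factors, sparse_entries):
    """Evaluate C[jvec] for C = R + S given implicitly.
    rank_factors: list of r terms, each a list of k vectors (u_1, ..., u_k),
                  so that R = sum_l u_1^(l) x ... x u_k^(l).
    sparse_entries: dict mapping index tuples to the non-zero values of S."""
    r_val = sum(np.prod([u[j] for u, j in zip(term, jvec)])
                for term in rank_factors)
    return r_val + sparse_entries.get(tuple(jvec), 0.0)

# Rank-1 plus one sparse correction, n = 3, k = 4.
u = np.ones(3)
C_entry = entry((0, 1, 2, 0), [[u, u, u, u]], {(0, 0, 0, 0): -5.0})  # -> 1.0
```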

We show that for \(\textsf {MOT}\) cost tensors with low-rank plus sparse structure, the \(\textsf {AMIN}\) and \(\textsf {SMIN}\) oracles can be implemented in polynomial time. This may be of independent interest because, by taking all oracle inputs \(p_i = 0\) in (1.1), this generalizes the previously open problem of approximately computing the smallest entry of a constant-rank tensor with \(n^k\) entries in \(\mathrm {poly}(n,k)\) time.

Theorem 1.9

(Informal statement of Theorem 7.4) Let \(C \in (\mathbb {R}^n)^{\otimes k}\) have low-rank plus sparse structure. Then the \(\textsf {AMIN}\) and \(\textsf {SMIN}\) oracles can be computed in \(\mathrm {poly}(n,k,C_{\max }/\varepsilon )\) time.

It is an immediate corollary of Theorem 1.9 and our algorithms-to-oracles reduction described in Sect. 1.1 that one can implement \(\texttt {MWU}\) and \(\texttt {SINKHORN}\) in polynomial time. Of these two algorithms, \(\texttt {MWU}\) computes sparse solutions, yielding the following theorem.

Theorem 1.10

(Informal statement of Corollary 7.5) Let \(C \in (\mathbb {R}^n)^{\otimes k}\) have low-rank plus sparse structure. Then a sparse, \(\varepsilon \)-approximate solution for \(\textsf {MOT}\) can be computed in \(\mathrm {poly}(n,k,C_{\max }/\varepsilon )\) time.

This is the first polynomial-time result for this class of \(\textsf {MOT}\) problems. We note that the runtime of our \(\textsf {MOT}\) algorithm depends exponentially on the rank r of R, which is why we take r to be constant. Nevertheless, such a restriction on the rank is unavoidable: unless \(\mathsf {P}= \mathsf {NP}\), there does not exist an algorithm with runtime that is jointly polynomial in n, k, and the rank r [6].

We demonstrate this polynomial-time algorithm concretely on two applications. First, in Sect. 7.3 we consider the risk estimation problem of computing an investor's expected profit in the worst case over all future prices that are consistent with given marginal distributions. We show that this is equivalent to an \(\textsf {MOT}\) problem with a low-rank cost tensor, and thereby provide the first efficient algorithm for it.

Second, in Sect. 7.4, we consider the fundamental problem of projecting a joint distribution Q onto the transportation polytope. We provide the first polynomial-time algorithm for this task when Q decomposes into a constant-rank component plus a sparse component, which models mixtures of product distributions with polynomially many corruptions. This application illustrates the versatility of our algorithmic results beyond polynomial-time solvability of \(\textsf {MOT}\), since this projection problem is a quadratic optimization over the transportation polytope rather than a linear optimization (a.k.a. \(\textsf {MOT}\)). In order to achieve this, we develop a simple quadratic-to-linear reduction tailored to this problem that crucially exploits the sparsity of the \(\textsf {MOT}\) solutions enabled by the \(\texttt {MWU}\) algorithm.

1.3 Related work

1.3.1 \(\textsf {MOT}\) algorithms

\(\textsf {MOT}\) algorithms fall into two categories. One category consists of general-purpose algorithms that do not depend on the specific \(\textsf {MOT}\) cost. For example, this includes running an LP solver out-of-the-box, or running the Sinkhorn algorithm where in each iteration one sums over all \(n^k\) entries of the cost tensor to implement the marginalization bottleneck [40, 58, 84]. These approaches are robust in the sense that they do not need to be changed based on the specific \(\textsf {MOT}\) problem. However, they are impractical beyond tiny input sizes (e.g., \(n=k=10\)) because their runtimes scale as \(n^{\varOmega (k)}\).

The second category consists of algorithms that are much more scalable but require extra structure of the \(\textsf {MOT}\) problem. Specifically, these are algorithms that somehow exploit the structure of the relevant cost tensor C in order to (approximately) solve an \(\textsf {MOT}\) problem in \(\mathrm {poly}(n,k)\) time [1, 7, 14, 15, 17, 29, 30, 38, 46, 47, 49, 50, 61, 62, 65,66,67,68, 77, 82, 89, 93]. Such a \(\mathrm {poly}(n,k)\) runtime is far more tractable—but it is not well understood for which \(\textsf {MOT}\) problems such a runtime is possible. The purpose of this paper is to clarify this question.

Contextualizing our answer within the rapidly growing literature requires further splitting this second category of algorithms.

Sinkhorn algorithm. Currently, the predominant approach in the second category is to solve an entropically regularized version of \(\textsf {MOT}\) with the Sinkhorn algorithm, a.k.a. Iterative Proportional Fitting, Iterative Bregman Projections, the RAS algorithm, or the Iterative Scaling algorithm, see e.g., [15,16,17, 47, 49, 67, 82]. Recent work has shown that a polynomial number of iterations of this algorithm suffices [40, 58, 84]. However, the bottleneck is that each iteration requires \(n^{k}\) operations in general because it requires marginalizing a tensor with \(n^k\) entries. The critical question is therefore: what structure of an \(\textsf {MOT}\) problem enables implementing this marginalization bottleneck in polynomial time?
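For reference, a generic multimarginal Sinkhorn sketch is below (our implementation of the standard iteration, not code from the cited works). The marginalization line touches all \(n^k\) entries of the kernel, which is exactly the bottleneck under discussion.

```python
import numpy as np

def sinkhorn_mot(C, mus, eta, iters=200):
    """Generic multimarginal Sinkhorn on the dense kernel K = exp(-eta * C).
    The solution is represented implicitly as P = K * (u_1 x ... x u_k)."""
    k, n = C.ndim, C.shape[0]
    K = np.exp(-eta * C)
    us = [np.ones(n) for _ in range(k)]
    for _ in range(iters):
        for i in range(k):
            P = K.copy()                    # rebuild P = K * (u_1 x ... x u_k)
            for a, u in enumerate(us):
                P = P * u.reshape([n if d == a else 1 for d in range(k)])
            # marginalization bottleneck: n^k operations on a dense tensor
            m_i = P.sum(axis=tuple(d for d in range(k) if d != i))
            us[i] = us[i] * (mus[i] / m_i)  # rescale to match the i-th marginal
    return us                               # poly-size implicit representation
```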

This paper makes two contributions to this question. First, we identify new broad classes of \(\textsf {MOT}\) problems for which this bottleneck can be implemented in polynomial time, and thus \(\texttt {SINKHORN}\) can be implemented in polynomial time (see Sect. 1.2). Second, we propose other algorithms that require strictly less structure than \(\texttt {SINKHORN}\) does in order to solve an \(\textsf {MOT}\) problem in polynomial time (Theorem 4.19).

Ellipsoid algorithm. The Ellipsoid algorithm is among the most classical algorithms for implicit LP [43, 44, 53], however it has taken a back seat to the \(\texttt {SINKHORN}\) algorithm in the vast majority of the \(\textsf {MOT}\) literature.

In Sect. 4.1, we make explicit the fact that the variant of \(\texttt {ELLIPSOID}\) from [7] can solve \(\textsf {MOT}\) exactly in \(\mathrm {poly}(n,k)\) time if and only if any algorithm can (Theorem 4.1). This is implicit from combining several known results [6, 7, 44]. In the process of making this result explicit, we exploit the special structure of the \(\textsf {MOT}\) LP to significantly simplify the reduction from the dual violation oracle to the dual feasibility oracle. The previously known reduction is highly impractical as it requires an indirect “back-and-forth” use of the Ellipsoid algorithm [44,  page 107]. In contrast, our reduction is direct and simple; this is critical for implementing our practical alternative to \(\texttt {ELLIPSOID}\), namely \(\texttt {COLGEN}\), with the dual feasibility oracle.

Multiplicative Weights Update algorithm. This algorithm, first introduced by [91], has been studied in the context of optimal transport when \(k=2\) [19, 73], in which case implicit LP is not necessary for a polynomial runtime. \(\texttt {MWU}\) lends itself to implicit LP [91], but is notably absent from the \(\textsf {MOT}\) literature.

In Sect. 4.2, we show that \(\texttt {MWU}\) can be applied to \(\textsf {MOT}\) in polynomial time if and only if the approximate dual feasibility oracle can be solved in polynomial time. To do this, we show that in the special case of \(\textsf {MOT}\), the well-known “softmax-derivative” bottleneck of \(\texttt {MWU}\) is polynomial-time equivalent to the approximate dual feasibility oracle. Since it is known that the approximate dual feasibility oracle is polynomial-time reducible to approximate \(\textsf {MOT}\) [6], we therefore establish that \(\texttt {MWU}\) can solve \(\textsf {MOT}\) approximately in polynomial time if and only if any algorithm can (Theorem 4.7).

1.3.2 Graphically structured \(\textsf {MOT}\) problems with constant treewidth

We isolate here graphically structured costs with constant treewidth because this framework encompasses all \(\textsf {MOT}\) problems that were previously known to be tractable in polynomial time [49, 82], with the exceptions of the fixed-dimensional Wasserstein barycenter problem and \(\textsf {MOT}\) problems related to combinatorial optimization—both of which are described below in Sect. 1.3.3. This family of graphically structured costs with treewidth 1 (a.k.a. “tree-structured costs” [47]) includes applications in economics such as variational mean-field games [15], interpolating histograms on trees [3], and matching for teams [29, 67], and encompasses applications in filtering and estimation for collective dynamics such as target tracking [38, 46, 47, 49, 77] and Wasserstein barycenters in the case of fixed support [14, 29, 38, 67]. With treewidth 2, this family of costs also includes dynamic multi-commodity flow problems [48], as well as the application of computing generalized Euler flows in fluid dynamics [14, 17, 67], which was historically the original motivation of \(\textsf {MOT}\) [23,24,25,26].

Previous polynomial-time algorithms for graphically structured \({{\textsf {\textit{MOT}}}}\) compute approximate, dense solutions. Implementing \(\texttt {SINKHORN}\) for graphically structured \(\textsf {MOT}\) problems by using belief propagation to efficiently implement the marginalization bottleneck was first proposed twenty years ago in [82]. There have been recent advancements in understanding connections of this algorithm to the Schrödinger bridge problem in the case of trees [47], as well as developing more practically efficient single-loop variations [49].

All of these works prove theoretical runtime guarantees only in the case of tree structure (i.e., treewidth 1). However, this graphical model perspective for efficiently implementing \(\texttt {SINKHORN}\) readily extends to any constant treewidth: simply implement the marginalization bottleneck using junction trees. This, combined with the iteration complexity of \(\texttt {SINKHORN}\) which is known to be polynomial [40, 58, 84], immediately yields an overall polynomial runtime. This is why we cite [49, 82] throughout this paper regarding the fact that \(\texttt {SINKHORN}\) can be implemented in polynomial time for graphical structure with any constant treewidth.

While the use of \(\texttt {SINKHORN}\) for graphically structured \(\textsf {MOT}\) is mathematically elegant and can be impressively scalable in practice, it has two drawbacks. The first drawback of this algorithm is that it computes (implicit representations of) solutions that are fully dense with \(n^k\) non-zero entries. Indeed, it is well-known that \(\texttt {SINKHORN}\) finds the unique optimal solution to the entropically regularized \(\textsf {MOT}\) problem \(\min _{P \in \mathcal {M}(\mu _1,\dots ,\mu _k)} \langle P,C \rangle - \eta ^{-1}H(P)\), and that this solution is fully dense [70]. For example, in the simple case of cost \(C = 0\), uniform marginals \(\mu _i\), and any strictly positive regularization parameter \(\eta > 0\), this solution P has value \(1/n^k\) in each entry.

The second drawback of this algorithm is that it only computes solutions that are low-precision due to \(\mathrm {poly}(1/\varepsilon )\) runtime dependence on the accuracy \(\varepsilon \). This is because the number of \(\texttt {SINKHORN}\) iterations is known to scale polynomially in the entropic regularization parameter \(\eta \) even in the matrix case \(k=2\) [59, §1.2], and it is known that \(\eta = \varOmega (\varepsilon ^{-1} k \log n)\) is necessary for the converged solution of \(\texttt {SINKHORN}\) to be an \(\varepsilon \)-approximate solution to the (unregularized) original \(\textsf {MOT}\) problem [58].

Improved algorithms for graphically structured \({{\textsf {\textit{MOT}}}}\) problems. The contribution of this paper to the study of graphically structured \(\textsf {MOT}\) problems is that we give the first \(\mathrm {poly}(n,k)\) time algorithms that can compute solutions which are exact and sparse (Corollary 5.6). Our framework also directly recovers all known results about \(\texttt {SINKHORN}\) for graphically structured \(\textsf {MOT}\) problems—namely that it can be implemented in polynomial time for trees [47, 82] and for constant treewidth [49, 82].

1.3.3 Tractable \(\textsf {MOT}\) problems beyond graphically structured costs

The two new classes of \(\textsf {MOT}\) problems studied in this paper—namely, set-optimization structure and low-rank plus sparse structure—are incomparable to each other as well as to graphical structure; see Remarks 6.7 and 7.3 for details. This lets us handle a wide range of new \(\textsf {MOT}\) problems that could not be handled before.

There are two other classes of \(\textsf {MOT}\) problems studied in the literature which do not fall under the three structures studied in this paper. We elaborate on both below.

Remark 1.11

(Low-dimensional Wasserstein barycenter) This \(\textsf {MOT}\) problem has cost \(C_{j_1,\dots ,j_k} = \sum _{i,i'=1}^k \Vert x_{i,j_i} - x_{i',j_{i'}}\Vert ^2\) where \(x_{i,j} \in \mathbb {R}^d\) denotes the j-th atom in the distribution \(\mu _i\). Clearly this cost is not a graphically structured cost of constant treewidth—indeed, representing it through the lens of graphical structure requires the complete graph of interactions, which means a maximal treewidth of \(k-1\). This problem also does not fall under the set-optimization or constant-rank structures. Nevertheless, this \(\textsf {MOT}\) problem can be solved in \(\mathrm {poly}(n,k)\) time for any fixed dimension d by exploiting the low-dimensional geometric structure of the points \(\{x_{i,j}\}\) that implicitly define the cost [7].

Remark 1.12

(Random combinatorial optimization) \(\textsf {MOT}\) problems have also appeared in the random combinatorial optimization literature since the 1970s, see e.g., [50, 61, 65, 89, 93], although under a different name and in a different community. These papers consider \(\textsf {MOT}\) problems with costs of the form \(C(x) = \min _{v \in V} \langle x, v\rangle \) for polytopes \(V \subseteq \{0,1\}^k\) given through a list of their extreme points. Applications include PERT (Program Evaluation and Review Technique), extremal network reliability, and scheduling. Recently, applications to Distributionally Robust Optimization were investigated in [30, 62, 66], which consider general polytopes \(V \subset \mathbb {R}^k\); in [68], which considers \(\textsf {MOT}\) costs of the related form \(C(x) = \mathbb {1}[\min _{v \in V} \langle x, v \rangle \geqslant t]\); and in [1], which considers other combinatorial costs C such as sub/supermodular functions. These papers show that these random combinatorial optimization problems are in general intractable, and give sufficient conditions under which they can be solved in polynomial time. In general, these families of \(\textsf {MOT}\) problems are different from the three structures studied in this paper, although some \(\textsf {MOT}\) applications fall under multiple umbrellas (e.g., extremal network reliability). It is an interesting question to understand to what extent these structures can be reconciled (as well as the algorithms, which in these papers sometimes use extended formulations).

1.3.4 Intractable \(\textsf {MOT}\) problems

These algorithmic results beg the question: what are the fundamental limitations of this line of work on polynomial-time algorithms for structured \(\textsf {MOT}\) problems? To this end, the recent paper [6] provides a systematic investigation of \(\mathsf {NP}\)-hardness results for structured \(\textsf {MOT}\) problems, including converses to several results in this paper. In particular, [6, Propositions 4.1 and 4.2] justify the constant-rank regime studied in Sect. 7 by showing that unless \(\mathsf {P}=\mathsf {NP}\), there does not exist an algorithm with runtime that is jointly polynomial in the rank r and the input parameters n and k. Similarly, [6, Propositions 5.1 and 5.2] justify the constant-treewidth regime for graphically structured costs studied in Sect. 5 and all previous work by showing that unless \(\mathsf {P}=\mathsf {NP}\), there does not exist an algorithm with polynomial runtime even for the seemingly simple class of \(\textsf {MOT}\) costs that decompose into pairwise interactions \(C_{j_1,\dots ,j_k} = \sum _{i \ne i' \in [k]} c_{i,i'}(j_i,j_{i'})\). The paper [6] also shows \(\mathsf {NP}\)-hardness for several \(\textsf {MOT}\) problems with repulsive costs, including for example the \(\textsf {MOT}\) formulation of Density Functional Theory with the Coulomb-Buckingham potential. It is an open problem whether the Coulomb potential, studied in [16, 27, 32], also leads to an \(\mathsf {NP}\)-hard \(\textsf {MOT}\) problem [6, Conjecture 6.4].

1.3.5 Variants of \(\textsf {MOT}\)

The literature has studied several other variants of the \(\textsf {MOT}\) problem, notably with entropic regularization and/or with constraints on a subset of the k marginals, see, e.g., [14,15,16,17, 38, 46,47,48,49, 58, 77]. Our techniques readily apply with little change. Briefly, to handle entropic regularization, simply use the \(\textsf {SMIN}\) oracle and \(\texttt {SINKHORN}\) algorithm with fixed regularization parameter \(1/\eta > 0\) (rather than \(1/\eta \) of vanishing size \(\varTheta (\varepsilon / \log n)\)) as described in Sect. 4.3. And to handle partial marginal constraints, essentially the only change is that in the \(\textsf {MIN}\), \(\textsf {AMIN}\), and \(\textsf {SMIN}\) oracles, the potentials \(p_i\) are zero for all indices \(i \in [k]\) corresponding to unconstrained marginals \(m_i(P)\). Full details are omitted for brevity since they are straightforward modifications of our main results.

1.3.6 Optimization over joint distributions

Optimization problems over exponential-size joint distributions appear in many domains. For instance, they arise in game theory when computing correlated equilibria [69]; however, in that case the optimization has different constraints which lead to different algorithms. Such problems also arise in variational inference [87]; however, the optimization there typically constrains this distribution to ensure tractability (e.g., mean-field approximation restricts to product distributions). The different constraints in these optimization problems over joint distributions versus \(\textsf {MOT}\) lead to significant differences in computational complexity, and thus also necessitate different algorithmic techniques.

1.4 Organization

In Sect. 2 we recall preliminaries about \(\textsf {MOT}\) and establish notation. The first part of the paper then establishes our unified algorithmic framework for \(\textsf {MOT}\). Specifically, in Sect. 3 we define and compare three variants of the dual feasibility oracle; and in Sect. 4 we characterize the structure that \(\textsf {MOT}\) algorithms require for polynomial-time implementation in terms of these three oracles. For an overview of these results, see Sect. 1.1. The second part of the paper applies this algorithmic framework to three general classes of \(\textsf {MOT}\) cost structures: graphical structure (Sect. 5), set-optimization structure (Sect. 6), and low-rank plus sparse structure (Sect. 7). For an overview of these results, see Sect. 1.2. These three application sections are independent of each other and can be read separately. We conclude in Sect. 8.

2 Preliminaries

General notation. The set \(\{1, \dots , n\}\) is denoted by [n]. For shorthand, we write \(\mathrm {poly}(t_1, \dots , t_m)\) to denote a function that grows at most polynomially fast in those parameters. Throughout, we assume for simplicity of exposition that all entries of the input C and \(\mu _1, \dots , \mu _k\) have bit complexity at most \(\mathrm {poly}(n,k)\), and similarly for the components defining C in structured settings. As such, runtimes throughout refer to the number of arithmetic operations. The set \(\mathbb {R}\cup \{-\infty \}\) is denoted by \(\bar{\mathbb {R}}\); note that the value \(-\infty \) can be represented efficiently by adding a single flag bit. We use the standard \(O(\cdot )\) and \(\varOmega (\cdot )\) notation, and use \(\tilde{O}(\cdot )\) and \(\tilde{\varOmega }(\cdot )\) to denote that polylogarithmic factors may be omitted.

Tensor notation. The k-fold tensor product space \(\mathbb {R}^n \otimes \cdots \otimes \mathbb {R}^n\) is denoted by \((\mathbb {R}^n)^{\otimes k}\), and similarly for \((\mathbb {R}_{\geqslant 0}^n)^{\otimes k}\). Let \(P \in (\mathbb {R}^n)^{\otimes k}\). Its i-th marginal, \(i \in [k]\), is denoted by \(m_i(P) \in \mathbb {R}^n\) and has entries \([m_i(P)]_j := \sum _{j_1,\dots ,j_{i-1},j_{i+1}, \dots , j_k} P_{j_1,\dots ,j_{i-1},j,j_{i+1}, \dots , j_k}\). For shorthand, we often denote an index \((j_1,\dots ,j_k)\) by \(\vec {j}\). The sum of P’s entries is denoted by \(m(P) = \sum _{\vec {j}} P_{\vec {j}}\). The maximum absolute value of P’s entries is denoted by \(\Vert P\Vert _{\max } := \max _{\vec {j}} |P_{\vec {j}}|\), or simply \(P_{\max }\) for short. For \(\vec {j}\in [n]^k\), we write \(\delta _{\vec {j}}\) to denote the tensor with value 1 at entry \(\vec {j}\), and 0 elsewhere. The operations \(\odot \) and \(\otimes \) respectively denote the entrywise product and the Kronecker product. The notation \(\otimes _{i=1}^k d_i\) is shorthand for \(d_1 \otimes \cdots \otimes d_k\). A non-standard notation we use throughout is that f[P] denotes a function \(f : \mathbb {R}\rightarrow \mathbb {R}\) (typically \(\exp \), \(\log \), or a polynomial) applied entrywise to a tensor P.
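In code, these conventions look as follows (a NumPy sketch for concreteness; note the 0-based indexing, versus the paper's 1-based indexing):

```python
import numpy as np

P = np.random.default_rng(0).random((4, 4, 4))
P /= P.sum()                                   # now m(P) = 1

def marginal(P, i):
    """i-th marginal m_i(P): sum out every axis except i."""
    return P.sum(axis=tuple(d for d in range(P.ndim) if d != i))

mu = marginal(P, 1)        # vector in R^4 with entries [m_i(P)]_j
P_max = np.abs(P).max()    # ||P||_max
logP = np.log(P)           # f[P] with f = log, applied entrywise
```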

2.1 Multimarginal optimal transport

The transportation polytope between measures \(\mu _1, \dots , \mu _k \in \varDelta _n\) is

$$\begin{aligned} \mathcal {M}(\mu _1,\dots ,\mu _k):= \left\{ P \in (\mathbb {R}_{\geqslant 0}^n)^{\otimes k}\; : \; m_i(P) = \mu _i, \; \forall i \in [k] \right\} . \end{aligned}$$
(2.1)

For a fixed cost \(C \in (\mathbb {R}^n)^{\otimes k}\), the \(\textsf {MOT}_C\) problem is to solve the following linear program, given input measures \(\mu = (\mu _1, \dots , \mu _k) \in (\varDelta _n)^k\):

$$\begin{aligned} \min _{P \in \mathcal {M}(\mu _1,\dots ,\mu _k)} \langle P, C \rangle . \end{aligned}$$
(MOT)

In the \(k=2\) matrix case, (MOT) is the Kantorovich formulation of \(\textsf {OT}\) [86]. Its dual LP is

$$\begin{aligned}&\max _{p_1,\ldots ,p_k \in \mathbb {R}^n} \sum _{i=1}^k \langle p_i, \mu _i \rangle \quad \text {subject to} \quad C_{j_1,\dots ,j_k} - \sum _{i=1}^k [p_i]_{j_i} \geqslant 0, \; \forall (j_1,\dots ,j_k) \in [n]^k. \end{aligned}$$
(MOT-D)

A basic, folklore fact about \(\textsf {MOT}\) is that it always has a sparse optimal solution (e.g., [10,  Lemma 3]). This follows from elementary facts about standard-form LP; we provide a short proof for completeness.

Lemma 2.1

(Sparse solutions for \(\textsf {MOT} \)) For any cost \(C \in (\mathbb {R}^n)^{\otimes k}\) and any marginals \(\mu _1, \dots , \mu _k \in \varDelta _n\), there exists an optimal solution P to \(\textsf {MOT}_C(\mu )\) that has at most \(nk-k+1\) non-zero entries.

Proof

Since (MOT) is an LP over a compact domain, it has an optimal solution at a vertex [18, Theorem 2.7]. Since (MOT) is a standard-form LP, these vertices are in correspondence with basic solutions, thus their sparsity is bounded by the number of linearly independent constraints defining \(\mathcal {M}(\mu _1,\dots ,\mu _k)\) [18, Theorem 2.4]. We bound this quantity by \(nk-k+1\) via two observations. First, \(\mathcal {M}(\mu _1,\dots ,\mu _k)\) is defined by nk equality constraints \([m_i(P)]_j = [\mu _i]_j\) in (2.1), one for each coordinate \(j \in [n]\) of each marginal constraint \(i \in [k]\). Second, at least \(k-1\) of these constraints are linearly dependent: the k linear combinations \(\sum _{j \in [n]} [m_i(P)]_j = \sum _{j \in [n]} [\mu _i]_j\), one for each marginal \(i \in [k]\), all simplify to the same constraint \(m(P) = 1\), and thus are redundant with each other. \(\square \)

Definition 2.2

(\(\varepsilon \)-approximate \({{\textsf {\textit{MOT}}}}\) solution) P is an \(\varepsilon \)-approximate solution to \(\textsf {MOT}_C(\mu )\) if P is feasible (i.e., \(P \in \mathcal {M}(\mu _1,\dots ,\mu _k)\)) and \(\langle C, P \rangle \) is at most \(\varepsilon \) more than the optimal value.

2.2 Regularization

We introduce two standard regularization operators. First is the Shannon entropy \(H(P) := -\sum _{\vec {j}} P_{\vec {j}} \log P_{\vec {j}}\) of a tensor \(P \in (\mathbb {R}_{\geqslant 0}^n)^{\otimes k}\) with entries summing to \(m(P) = 1\). We adopt the standard notational convention that \(0 \log 0 = 0\). Second is the softmin operator, which is defined for parameter \(\eta > 0\) as

$$\begin{aligned} \mathop {{{\,\mathrm{{\textstyle {smin_\eta }}}\,}}}\limits _{i \in [m]} a_i := -\frac{1}{\eta } \log \left( \sum _{i=1}^m e^{-\eta a_i} \right) . \end{aligned}$$
(2.2)

This softmin operator naturally extends to \(a_i \in \mathbb {R}\cup \{\infty \}\) by adopting the standard notational conventions that \(e^{-\infty } = 0\) and \(\log 0 = -\infty \).
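A numerically stable implementation of (2.2), including the extension to \(\infty \) entries, can be sketched as follows (`smin` is our name, and the use of `scipy.special.logsumexp` is an implementation choice):

```python
import numpy as np
from scipy.special import logsumexp

def smin(a, eta):
    """softmin_eta(a) = -(1/eta) * log(sum_i exp(-eta * a_i)).
    Entries equal to +inf contribute exp(-inf) = 0, as in the text."""
    return -logsumexp(-eta * np.asarray(a, dtype=float)) / eta
```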

We make use of the following folklore fact, which bounds the error between the \(\min \) and \({{\,\mathrm{smin}\,}}\) operators based on the regularization and the number of points. For completeness, we provide a short proof.

Lemma 2.3

(Softmin approximation bound) For any \(a_1, \dots , a_m \in \mathbb {R}\cup \{\infty \}\) and \(\eta > 0\),

$$\begin{aligned} \min _{i \in [m]} a_i \geqslant \mathop {{{\,\mathrm{{\textstyle {smin_\eta }}}\,}}}\limits _{i \in [m]} a_i \geqslant \min _{i \in [m]} a_i - \frac{\log m}{\eta }. \end{aligned}$$

Proof

Assume without loss of generality that all \(a_i\) are finite, since any infinite \(a_i\) can be dropped (if all \(a_i = \infty \) then the claim is trivial). For shorthand, denote \(\min _{i \in [m]} a_i\) by \(a_{\min }\). For the first inequality, use the non-negativity of the exponential function to bound

$$\begin{aligned} {\mathop {{{\,\mathrm{{\textstyle {smin_\eta }}}\,}}}\limits _{i \in [m]}} a_i = -\frac{1}{\eta } \log \left( \sum _{i=1}^m e^{-\eta a_i} \right) \leqslant -\frac{1}{\eta } \log \left( e^{-\eta a_{\min }} \right) = a_{\min }. \end{aligned}$$

For the second inequality, use the fact that each \(a_i \geqslant a_{\min }\) to bound

$$\begin{aligned} {\mathop {{{\,\mathrm{{\textstyle {smin_\eta }}}\,}}}\limits _{i \in [m]}} a_i = -\frac{1}{\eta } \log \left( \sum _{i=1}^m e^{-\eta a_i} \right) \geqslant -\frac{1}{\eta } \log \left( m e^{-\eta a_{\min }} \right) = a_{\min } -\frac{\log m}{\eta }. \end{aligned}$$

\(\square \)
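A quick numeric check of both bounds, reusing the `smin` sketch from above (illustration only):

```python
import numpy as np

a, eta = np.array([0.3, 0.7, 1.2]), 50.0
s = smin(a, eta)                     # smin as defined in the sketch above
assert s <= a.min() <= s + np.log(len(a)) / eta
```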

The entropically regularized MOT problem (\(\textsf {RMOT}\) for short) is the convex optimization problem

$$\begin{aligned} \min _{P \in \mathcal {M}(\mu _1,\dots ,\mu _k)} \langle P, C \rangle - \eta ^{-1} H(P). \end{aligned}$$
(RMOT)

This is the natural multidimensional analog of entropically regularized \(\textsf {OT}\), which has a rich literature in statistics [57] and transportation theory [90], and has recently attracted significant interest in machine learning [35, 70]. The convex dual of (RMOT) is the convex optimization problem

$$\begin{aligned} \max _{p_1, \dots , p_k \in \mathbb {R}^n} \sum _{i=1}^k \langle p_i, \mu _i \rangle + {\mathop {{{\,\mathrm{{\textstyle {smin_\eta }}}\,}}}\limits _{\vec {j}\in [n]^k}}\left( C_{\vec {j}} - \sum _{i=1}^k [p_i]_{j_i} \right) . \end{aligned}$$
(RMOT-D)

In contrast to \(\textsf {MOT}\), there is no analog of Lemma 2.1 for \(\textsf {RMOT}\): the unique optimal solution to \(\textsf {RMOT}\) is dense. Further, this solution may not even be “approximately” sparse. For example, when \(C = 0\), all \(\mu _i\) are uniform, and \(\eta > 0\) is any positive number, the solution is fully dense with all entries equal to \(1/n^k\).

We define P to be an \(\varepsilon \)-approximate \(\textsf {RMOT}\) solution in the analogous way as in Definition 2.2. A basic, folklore fact about \(\textsf {RMOT}\) is that if the regularization \(\eta \) is sufficiently large, then \(\textsf {RMOT}\) and \(\textsf {MOT}\) are equivalent in terms of approximate solutions.

Lemma 2.4

(\(\textsf {MOT} \) and \(\textsf {RMOT} \) are close for large regularization \(\eta \)) Let \(P \in \mathcal {M}(\mu _1,\dots ,\mu _k)\), \(\varepsilon > 0\), and \( \eta \geqslant \varepsilon ^{-1} k \log n\). If P is an \(\varepsilon \)-approximate solution to (RMOT), then P is also a \((2\varepsilon )\)-approximate solution to (MOT); and vice versa.

Proof

Since a discrete distribution supported on \(n^k\) atoms has entropy at most \(k \log n\) [34], the objectives of (MOT) and (RMOT) differ pointwise by at most \(\eta ^{-1} k \log n \leqslant \varepsilon \). Since (MOT) and (RMOT) also have the same feasible sets, their optimal values therefore differ by at most \(\varepsilon \). \(\square \)

3 Oracles

Here we define the three oracle variants described in the introduction and discuss their relations. In the definitions below, let \(C \in (\mathbb {R}^n)^{\otimes k}\) be a cost tensor.

Definition 3.1

(\({{\textsf {\textit{MIN}}}}\) oracle) For weights \(p = (p_1,\ldots ,p_k) \in \mathbb {R}^{n \times k}\), \(\textsf {MIN}_C(p)\) returns

$$\begin{aligned} \min _{\vec {j}\in [n]^k} C_{\vec {j}} - \sum _{i=1}^k [p_i]_{j_i}. \end{aligned}$$

Definition 3.2

(\({{\textsf {\textit{AMIN}}}}\) oracle) For weights \(p = (p_1,\ldots ,p_k) \in \mathbb {R}^{n \times k}\) and accuracy \(\varepsilon > 0\), \(\textsf {AMIN}_C(p,\varepsilon )\) returns \(\textsf {MIN}_C(p)\) up to additive error \(\varepsilon \).

Definition 3.3

(\({{\textsf {\textit{SMIN}}}}\) oracle) For weights \(p = (p_1,\ldots ,p_k) \in \bar{\mathbb {R}}^{n \times k}\) and regularization parameter \(\eta > 0\), \(\textsf {SMIN}_C(p, \eta )\) returns

$$\begin{aligned} \displaystyle {\mathop {{{\,\mathrm{{\textstyle {smin_\eta }}}\,}}}\limits _{\vec {j}\in [n]^k}} C_{\vec {j}} - \sum _{i=1}^k [p_i]_{j_i}. \end{aligned}$$

An algorithm is said to "solve" or "implement" \(\textsf {MIN}_C\) if given input p, it outputs \(\textsf {MIN}_C(p)\); similarly for \(\textsf {AMIN}_C\) and \(\textsf {SMIN}_C\). Note that the weights p that are input to \(\textsf {SMIN}\) have values inside \(\bar{\mathbb {R}}= \mathbb {R}\cup \{-\infty \}\); this simplifies the notation in the treatment of the \(\texttt {SINKHORN}\) algorithm below, and increases the bit-complexity by at most one bit, since the value \(-\infty \) can be encoded with an extra flag.
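For reference, here are brute-force implementations of the three oracles (assuming numpy/scipy, an explicit cost array, and weights of shape (n, k) with column i equal to \(p_i\); the function names are ours). They take \(n^k\) time by enumerating \([n]^k\), and serve only to pin down the definitions; the point of this paper is to avoid such enumeration for structured costs:

```python
# Brute-force reference implementations of the three oracles, assuming the
# cost is an explicit numpy array of shape (n,)*k. These take n^k time.
import itertools
import numpy as np
from scipy.special import logsumexp

def MIN(C, p):
    n, k = p.shape
    return min(C[j] - sum(p[j[i], i] for i in range(k))
               for j in itertools.product(range(n), repeat=k))

def AMIN(C, p, eps):
    return MIN(C, p)   # an exact answer is in particular eps-approximate

def SMIN(C, p, eta):
    n, k = p.shape
    vals = np.array([C[j] - sum(p[j[i], i] for i in range(k))
                     for j in itertools.product(range(n), repeat=k)])
    return -logsumexp(-eta * vals) / eta
```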

Remark 3.4

(Interpretation as variants of the dual feasibility oracle) These three oracles can be viewed as variants of the feasibility oracle for (MOT-D). For \(\textsf {MIN}_C(p)\), this relationship is exact: \(p \in \mathbb {R}^{n \times k}\) is feasible for (MOT-D) if and only if \(\textsf {MIN}_C(p)\) is non-negative. For \(\textsf {AMIN}_C\) and \(\textsf {SMIN}_C\), this relationship is approximate, with the approximation depending on how small \(\varepsilon \) is and how large \(\eta \) is, respectively.

Since these oracles form the respective bottlenecks of all algorithms from the \(\textsf {MOT}\) and implicit linear programming literatures (see the overview in the introduction Sect. 1.1), an important question is: if one oracle can be implemented in \(\mathrm {poly}(n,k)\) time, does this imply that the other can be too?

Two reductions are straightforward: the \(\textsf {AMIN}\) oracle can be implemented in \(\mathrm {poly}(n,k)\) time whenever either the \(\textsf {MIN}\) oracle or the \(\textsf {SMIN}\) oracle can be implemented in \(\mathrm {poly}(n,k)\) time. We record these simple observations in remarks for easy recall.

Remark 3.5

(\(\textsf {MIN} \) implies \(\textsf {AMIN} \)) For any accuracy \(\varepsilon > 0\), the \(\textsf {MIN}_C(p)\) oracle provides a valid answer to the \(\textsf {AMIN}_C(p,\varepsilon )\) oracle by definition.

Remark 3.6

(\(\textsf {SMIN} \) implies \(\textsf {AMIN} \)) For any \(p \in \mathbb {R}^{n \times k}\) and regularization \(\eta \geqslant \varepsilon ^{-1} k\log n\), the \(\textsf {SMIN}_C(p,\eta )\) oracle provides a valid answer to the \(\textsf {AMIN}_C(p,\varepsilon )\) oracle due to the approximation property of the \({{\,\mathrm{smin}\,}}\) operator (Lemma 2.3).

In the remainder of this section, we show a separation between the \(\textsf {SMIN}\) oracle and both the \(\textsf {MIN}\) and \(\textsf {AMIN}\) oracles, by exhibiting a family of cost tensors C for which there exist polynomial-time algorithms for \(\textsf {MIN}\) and \(\textsf {AMIN}\), yet there is no polynomial-time algorithm for \(\textsf {SMIN}\). The non-existence of a polynomial-time algorithm of course requires a complexity-theoretic assumption; our result holds under \(\#\)BIS-hardness, a by-now standard assumption introduced in [37] which posits that there is no polynomial-time algorithm for counting the number of independent sets in a bipartite graph.

Lemma 3.7

(Restrictiveness of the \(\textsf {SMIN} \) oracle) There exists a family of costs \(C \in (\mathbb {R}^n)^{\otimes k}\) for which \(\textsf {MIN}_C\) and \(\textsf {AMIN}_C\) can be solved in \(\mathrm {poly}(n,k)\) time, however \(\textsf {SMIN}_C\) is #BIS-hard.

Proof

In order to prove hardness for general n, it suffices to exhibit such a family of cost tensors when \(n=2\). Since \(n=2\), it is convenient to abuse notation slightly by indexing a cost tensor \(C \in (\mathbb {R}^n)^{\otimes k}\) by \(\vec {j}\in \{-1,1\}^k\) rather than by \(\vec {j}\in \{1,2\}^k\). The family we exhibit is \( \{ C(A,b) : A \in \mathbb {R}_{\geqslant 0}^{k \times k}, \, b \in \mathbb {R}^k \}\), where the cost tensors C(A, b) are parameterized by a non-negative square matrix A and a vector b, and have entries of the form

$$\begin{aligned} C_{\vec {j}}(A,b) = -\langle \vec {j}, A \vec {j} \rangle - \langle b, \vec {j} \rangle , \qquad \vec {j}\in \{-1,1\}^k. \end{aligned}$$

Polynomial-time algorithm for \({{\textsf {\textit{MIN}}}}\) and \({{\textsf {\textit{AMIN}}}}\). We show that given a matrix \(A \in \mathbb {R}_{\geqslant 0}^{k \times k}\), vector \(b \in \mathbb {R}^k\), and weights \(p \in \mathbb {R}^{2 \times k}\), it is possible to compute \(\textsf {MIN}_C(p)\) on the cost tensor C(A, b) in \(\mathrm {poly}(k)\) time. Clearly this also implies a \(\mathrm {poly}(k)\)-time algorithm for \(\textsf {AMIN}_C(p,\varepsilon )\) for any \(\varepsilon > 0\) (see Remark 3.5).

To this end, we first re-write the \(\textsf {MIN}_C(p)\) problem on C(A, b) in a more convenient form that enables us to "ignore" the weights p. Recall that \(\textsf {MIN}_C(p)\) is the problem of

$$\begin{aligned} \min _{\vec {j}\in \{-1,1\}^k} C_{\vec {j}}(A,b) - \sum _{i=1}^k [p_i]_{j_i}. \end{aligned}$$

Note that the linear part of the cost is equal to

$$\begin{aligned} \langle b, \vec {j}\rangle + \sum _{i=1}^k [p_i]_{j_i} = \langle \ell , \vec {j}\rangle + d, \end{aligned}$$
(3.1)

where \(\ell \in \mathbb {R}^k\) is the vector with entries \(\ell _i = b_i + \tfrac{1}{2}([p_i]_1 - [p_i]_{-1})\), and d is the scalar \(d = \tfrac{1}{2}\sum _{i=1}^k ([p_i]_1 + [p_i]_{-1})\). Thus, since d is clearly computable in O(k) time, the \(\textsf {MIN}_C\) problem is equivalent to solving

$$\begin{aligned} \min _{\vec {j}\in \{-1,1\}^k} -\langle \vec {j}, A \vec {j} \rangle - \langle \ell , \vec {j} \rangle \end{aligned}$$
(3.2)

when given as input a non-negative matrix \(A \in \mathbb {R}_{\geqslant 0}^{k \times k}\) and a vector \(\ell \in \mathbb {R}^k\).

To show that this task is solvable in \(\mathrm {poly}(k)\) time, note that the objective in (3.2) is a submodular function because it is a quadratic whose Hessian \(-A\) has non-positive off-diagonal terms [11,  Proposition 6.3]. Therefore (3.2) is a submodular optimization problem, and thus can be solved in \(\mathrm {poly}(k)\) time using classical algorithms from combinatorial optimization [44,  Chapter 10.2].

\({{\textsf {\textit{SMIN}}}}\) oracle is \(\#\)BIS-hard. On the other hand, by using the definition of the \(\textsf {SMIN}\) oracle, the re-parameterization (3.1), and then the definition of the softmin operator,

$$\begin{aligned} \textsf {SMIN}_C(p,\eta ) = {\mathop {{{\,\mathrm{{\textstyle {smin_\eta }}}\,}}}\limits _{\vec {j}\in \{-1,1\}^k}} \left( C_{\vec {j}} - \sum _{i=1}^k [p_i]_{j_i} \right) = -\eta ^{-1} \log Z - d, \end{aligned}$$

where \(Z = \sum _{\vec {j}\in \{-1,1\}^k} Q(\vec {j})\) is the partition function of the ferromagnetic Ising model with inconsistent external fields given by

$$\begin{aligned} Q(\vec {j}) = \exp \left( \eta \langle \vec {j}, A \vec {j} \rangle + \eta \langle \ell , \vec {j}\rangle \right) . \end{aligned}$$

Because it is \(\#\)BIS hard to compute the partition function Z of a ferromagnetic Ising model with inconsistent external fields [42], it is \(\#\)BIS hard to compute the value \(-\eta ^{-1} \log Z - d\) of the oracle \(\textsf {SMIN}_C(p,\eta )\). \(\square \)
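The chain of identities above can be checked numerically. The sketch below (assuming numpy; brute force over \(\{-1,1\}^k\), feasible only for small k) evaluates the softmin directly and via the partition function Z, using the reparameterization (3.1):

```python
# Brute-force check (small k only) of the identity SMIN_C(p, eta) =
# -eta^{-1} log Z - d for the cost family C(A, b), assuming numpy; the
# rows of p hold the weight values at indices -1 and +1 respectively.
import itertools
import numpy as np

k, eta = 4, 1.3
rng = np.random.default_rng(0)
A, b = rng.random((k, k)), rng.standard_normal(k)
p = rng.standard_normal((2, k))    # p[0, i] = [p_i]_{-1}, p[1, i] = [p_i]_{+1}

ell = b + 0.5 * (p[1] - p[0])      # reparameterization (3.1)
d = 0.5 * (p[0] + p[1]).sum()

vals, Z = [], 0.0
for j in itertools.product([-1, 1], repeat=k):
    j = np.array(j)
    C_j = -j @ A @ j - b @ j       # entry of C(A, b)
    vals.append(C_j - sum(p[(s + 1) // 2, i] for i, s in enumerate(j)))
    Z += np.exp(eta * (j @ A @ j + ell @ j))

smin = -np.log(np.exp(-eta * np.array(vals)).sum()) / eta
assert np.isclose(smin, -np.log(Z) / eta - d)
```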

Remark 3.8

(The restrictiveness of \(\textsf {SMIN} \) extends to approximate computation) The separation between the oracles shown in Lemma 3.7 further extends to approximate computation of the \(\textsf {SMIN}\) oracle under the assumption that \(\#\)BIS is hard to approximate, since under this assumption it is hard to approximate the partition function of a ferromagnetic Ising model with inconsistent external fields [42].

4 Algorithms to oracles

In this section, we consider three algorithms for \(\textsf {MOT}\). Each is iterative and requires only polynomially many iterations. The key issue for each algorithm is the per-iteration runtime, which is in general exponential (roughly \(n^k\)). We isolate the respective bottlenecks of these three algorithms into the three variants of the dual feasibility oracle defined in Sect. 3. See Sect. 1.1 and Table 1 for a high-level overview of this section’s results.

4.1 The Ellipsoid algorithm and the \(\textsf {MIN}\) oracle

Among the most classical algorithms for implicit LP is the Ellipsoid algorithm [43, 44, 53]. However, it has taken a back seat to the \(\texttt {SINKHORN}\) algorithm in the vast majority of the \(\textsf {MOT}\) literature. The very recent paper [7], which focuses on the specific \(\textsf {MOT}\) application of computing low-dimensional Wasserstein barycenters, develops a variant of the classical Ellipsoid algorithm specialized to \(\textsf {MOT}\); henceforth this is called \(\texttt {ELLIPSOID}\) (see Sect. 4.1.1 for a description of this algorithm). The objective of this section is to analyze \(\texttt {ELLIPSOID}\) in the context of general \(\textsf {MOT}\) problems in order to prove the following.

Theorem 4.1

For any family of cost tensors \(C \in (\mathbb {R}^n)^{\otimes k}\), the following are equivalent:

  1. (i)

    \(\texttt {ELLIPSOID}\) takes \(\mathrm {poly}(n,k)\) time to solve the \(\textsf {MOT}_C\) problem. (Moreover, it outputs a vertex solution represented as a sparse tensor with at most \(nk-k+1\) non-zeros.)

  2. (ii)

    There exists a \(\mathrm {poly}(n,k)\) time algorithm that solves the \(\textsf {MOT}_C\) problem.

  3. (iii)

    There exists a \(\mathrm {poly}(n,k)\) time algorithm that solves the \(\textsf {MIN}_C\) problem.

Interpretation of results. In words, the equivalence “(i) \(\Longleftrightarrow \) (ii)” establishes that \(\texttt {ELLIPSOID}\) can solve any \(\textsf {MOT}\) problem in polynomial time that any other algorithm can. Thus from a theoretical perspective, this paper’s restriction to \(\texttt {ELLIPSOID}\) is at no loss of generality for developing polynomial-time algorithms that exactly solve \(\textsf {MOT}\). In words, the equivalence “(ii) \(\Longleftrightarrow \) (iii)” establishes that the \(\textsf {MOT}\) and \(\textsf {MIN}\) problems are polynomial-time equivalent. Thus we may investigate when \(\textsf {MOT}\) is tractable by instead investigating the more amenable question of when \(\textsf {MIN}\) is tractable (see Sect. 1.1.4) at no loss of generality.

As stated in the related work section, Theorem 4.1 is implicit from combining several known results [6, 7, 44]. Our contribution here is to make this result explicit, since this allows us to unify algorithms from the implicit LP literature with the \(\texttt {SINKHORN}\) algorithm. We also significantly simplify part of the implication “(iii) \(\Longrightarrow \) (i)”, which is crucial for making an algorithm that relies on the \(\textsf {MIN}\) oracle practical—namely, the Column Generation algorithm discussed below.

Organization of Sect. 4.1. In Sect. 4.1.1, we recall this \(\texttt {ELLIPSOID}\) algorithm and how it depends on the violation oracle for (MOT-D). In Sect. 4.1.2, we give a significantly simpler proof that the violation and feasibility oracles are polynomial-time equivalent in the case of (MOT-D), and use this to prove Theorem 4.1. In Sect. 4.1.3, we describe a practical implementation that replaces the \(\texttt {ELLIPSOID}\) outer loop with Column Generation.

4.1.1 Algorithm

A key component of the proof of Theorem 4.1 is the \(\texttt {ELLIPSOID}\) algorithm introduced in [7] for \(\textsf {MOT}\), which we describe below. In order to present this, we first define a variant of the \(\textsf {MIN}\) oracle that returns a minimizing tuple rather than the minimizing value.

Definition 4.2

(Violation oracle for (MOT-D)) Given weights \(p = (p_1,\dots ,p_k) \in \mathbb {R}^{n \times k}\), \(\textsf {ARGMIN}_C\) returns the minimizing solution \(\vec {j}\) and value of \(\min _{\vec {j}\in [n]^k} C_{\vec {j}} - \sum _{i=1}^k [p_i]_{j_i}\).

\(\textsf {ARGMIN}_C\) can be viewed as a violation oracle for the decision set to (MOT-D). This is because, given \(p = (p_1,\dots ,p_k) \in \mathbb {R}^{n \times k}\), the tuple \(\vec {j}\) output by \(\textsf {ARGMIN}_C(p)\) either provides a violated constraint if \(C_{\vec {j}} - \sum _{i=1}^k [p_i]_{j_i} < 0\), or otherwise certifies p is feasible. In [7] it is proved that \(\textsf {MOT}\) can be solved with polynomially many calls to the \(\textsf {ARGMIN}_C\) oracle.

Theorem 4.3

(\(\texttt {ELLIPSOID} \) guarantee; Proposition 12 of [7]) Algorithm 1 finds an optimal vertex solution for \(\textsf {MOT}_C(\mu )\) using \(\mathrm {poly}(n,k)\) calls to the \(\textsf {ARGMIN}_C\) oracle and \(\mathrm {poly}(n,k)\) additional time. The solution is returned as a sparse tensor with at most \(nk-k+1\) non-zero entries.

Algorithm 1: \(\texttt {ELLIPSOID}\) (pseudocode figure)

Sketch of algorithm. Full details and a proof are in [7]. We give a brief overview here for completeness. First, recall from the implicit LP literature that the classical Ellipsoid algorithm can be implemented in polynomial time for an LP with arbitrarily many constraints so long as it has polynomially many variables and the violation oracle for its decision set is solvable in polynomial time [44]. This does not directly apply to the LP (MOT) because that LP has \(n^k\) variables. However, it can apply to the dual LP (MOT-D) because that LP only has nk variables.

This suggests a natural two-step algorithm for \(\textsf {MOT}\). First, compute an optimal dual solution by directly applying the Ellipsoid algorithm to (MOT-D). Second, use this dual solution to construct a sparse primal solution. Although this dual-to-primal conversion does not extend to arbitrary LP [18, Exercise 4.17], the paper [7] provides a solution by exploiting the standard-form structure of \(\textsf {MOT}\). The procedure is to solve

$$\begin{aligned} \min _{\begin{array}{c} P \in \mathcal {M}(\mu _1,\dots ,\mu _k)\\ \text {s.t. } P_{\vec {j}} = 0, \; \forall \vec {j}\notin S \end{array}} \langle C, P \rangle \end{aligned}$$
(4.1)

which is the \(\textsf {MOT}\) problem restricted to sparsity pattern S, where S is the set of tuples \(\vec {j}\) returned by the violation oracle during the execution of step one of Algorithm 1. This second step takes \(\mathrm {poly}(n,k)\) time using a standard LP solver, because running the Ellipsoid algorithm in the first step only calls the violation oracle \(\mathrm {poly}(n,k)\) times, and thus S has \(\mathrm {poly}(n,k)\) size, and therefore the LP (4.1) has \(\mathrm {poly}(n,k)\) variables and constraints. In [7] it is proved that this produces a primal vertex solution to the original \(\textsf {MOT}\) problem.
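As an illustration of the second step, the following sketch (assuming scipy; the function name restricted_mot is ours) sets up and solves the restricted LP (4.1) given the collected sparsity pattern S:

```python
# Sketch of the dual-to-primal conversion (4.1), assuming scipy: solve the
# MOT LP restricted to the sparsity pattern S collected by the Ellipsoid run.
import numpy as np
from scipy.optimize import linprog

def restricted_mot(C_vals, S, mus):
    # C_vals[t] is the cost of tuple S[t]; S is a list of k-tuples in [n]^k;
    # mus is a list of k marginal vectors of length n.
    n, k = len(mus[0]), len(mus)
    A_eq = np.zeros((n * k, len(S)))
    for t, j in enumerate(S):
        for i in range(k):
            A_eq[i * n + j[i], t] = 1.0   # tuple j contributes to [m_i(P)]_{j_i}
    return linprog(c=np.asarray(C_vals), A_eq=A_eq, b_eq=np.concatenate(mus),
                   bounds=(0, None), method="highs")
```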

4.1.2 Equivalence of bottleneck to \(\textsf {MIN}\)

Although Theorem 4.3 shows that \(\texttt {ELLIPSOID}\) can solve \(\textsf {MOT}\) in \(\mathrm {poly}(n,k)\) time using the \(\textsf {ARGMIN}\) oracle, this is not sufficient to prove the implication "(iii)\(\implies \)(i)" in Theorem 4.1. Proving that implication requires also showing the polynomial-time equivalence between \(\textsf {MIN}\) and \(\textsf {ARGMIN}\).

Lemma 4.4

(Equivalence of \(\textsf {MIN} \) and \(\textsf {ARGMIN} \)) Each of the oracles \(\textsf {MIN}_C\) and \(\textsf {ARGMIN}_C\) can be implemented using \(\mathrm {poly}(n,k)\) calls of the other oracle and \(\mathrm {poly}(n,k)\) additional time.

This equivalence follows from classical results about the equivalence of violation and feasibility oracles [92]. However, the known proof of that general result requires an involved and indirect argument based on “back-and-forth” applications of the Ellipsoid algorithm [44,  §4.3]. Here we exploit the special structure of \(\textsf {MOT}\) to give a direct and elementary proof. This is essential to practical implementations (see Sect. 4.1.3).

Proof

It is obvious how the \(\textsf {MIN}_C\) oracle can be implemented via a single call of the \(\textsf {ARGMIN}_C\) oracle; we now show the converse. Specifically, given \(p_1,\dots ,p_k \in \mathbb {R}^n\), we show how to compute a solution \(\vec {j}= (j_1,\dots ,j_k) \in [n]^k\) for \(\textsf {ARGMIN}_C([p_1, \dots , p_k])\) using nk calls to the \(\textsf {MIN}_C\) oracle and polynomial additional time. We use the first n calls to compute the first index \(j_1\) of the solution, the next n calls to compute the next index \(j_2\), and so on.

Formally, for \(s \in [k]\), let us say that \((j_1^*, \dots ,j_{s}^*) \in [n]^s\) is a “partial solution” of size s if there exists a solution \(j \in [n]^k\) for \(\textsf {ARGMIN}_C([p_1, \dots , p_k])\) that satisfies \(j_i = j_i^*\) for all \(i \in [s]\). Then it suffices to show that for every \(s \in [k]\), it is possible to compute a partial solution \((j_1^*,\dots ,j_{s}^*)\) of size s from a partial solution \((j_1^*,\dots ,j_{s-1}^*)\) of size \(s-1\) using n calls to the \(\textsf {MIN}_C\) oracle and polynomial additional time.

The simple but key observation enabling this is the following. Below, for \(i \in [k]\) and \(j \in [n]\), define \(q_{i,j}\) to be the vector in \(\mathbb {R}^n\) with value \([p_i]_j\) on entry j, and value \(-M\) on all other entries. In words, the following observation states that if the constant M is sufficiently large, then for any indices \(j_i'\), replacing the vectors \(p_i\) with the vectors \(q_{i,j_i'}\) in a \(\textsf {MIN}\) oracle query effectively performs a \(\textsf {MIN}\) oracle query on the original input \(p_1,\dots ,p_k\) except that now the minimization is only over \(\vec {j}\in [n]^k\) satisfying \(j_i = j_i'\).

Observation 4.5

Set \(M := 2C_{\max }+ 2\sum _{i=1}^k \Vert p_i\Vert _{\max } + 1\). Then for any \(s \in [k]\) and any \((j_1', \dots , j_s') \in [n]^s\),

$$\begin{aligned} \textsf {MIN}_C([q_{1,j_1'}, \dots , q_{s,j_s'}, p_{s+1},\dots , p_k]) = \min _{\begin{array}{c} \vec {j} \in [n]^k \\ \text {s.t. } j_1 = j_1', \dots , j_s = j_s' \end{array}} C_{\vec {j}} - \sum _{i=1}^k [p_i]_{j_i} . \end{aligned}$$

Proof

By definition of the \(\textsf {MIN}\) oracle,

$$\begin{aligned} \textsf {MIN}_C([q_{1,j_1'}, \dots , q_{s,j_s'}, p_{s+1},\dots , p_k])&= \min _{\vec {j}\in [n]^k} C_{\vec {j}} - \sum _{i=1}^s [q_{i,j_i'}]_{j_i} - \sum _{i=s+1}^k [p_{i}]_{j_i} \end{aligned}$$

It suffices to prove that every minimizing tuple \(\vec {j}\in [n]^k\) for the right hand side satisfies \(j_i = j_i'\) for all \(i \in [s]\). Suppose not for sake of contradiction. Then there exists a minimizing tuple \(\vec {j}\in [n]^k\) for which \(j_{\ell } \ne j_{\ell }'\) for some \(\ell \in [s]\). But then \([q_{\ell ,j_\ell '}]_{j_\ell } = -M\), so the objective value of \(\vec {j}\) is at least

$$\begin{aligned} C_{\vec {j}} - \sum _{i=1}^s [q_{i,j_i'}]_{j_i} - \sum _{i=s+1}^k [p_{i}]_{j_i} \geqslant M - C_{\max }- \sum _{i=1}^k \Vert p_i\Vert _{\max } = C_{\max }+ \sum _{i=1}^k \Vert p_i\Vert _{\max } + 1. \end{aligned}$$

But this is strictly larger (by at least 1) than the value of any tuple with prefix \((j_1',\dots ,j_{s}')\), contradicting the optimality of \(\vec {j}\). \(\square \)

Thus, given a partial solution \((j_1^*,\dots ,j_{s-1}^*)\) of length \(s-1\), we construct a partial solution \((j_1^*,\dots ,j_s^*)\) of length s by setting \(j_s^*\) to be a minimizer of

$$\begin{aligned} \min _{j_s' \in [n]} \textsf {MIN}_C([q_{1,j_1^*}, \dots , q_{s-1,j_{s-1}^*}, q_{s,j_s'}, p_{s+1},\dots , p_k]). \end{aligned}$$
(4.2)

The runtime bound is clear; it remains to show correctness. To this end, note that

$$\begin{aligned} \min _{\begin{array}{c} \vec {j}\in [n]^k \\ \text {s.t. }j_1 = j_1^*, \dots , j_s = j_s^* \end{array}} C_{\vec {j}} - \sum _{i=1}^k [p_i]_{j_i}&= \textsf {MIN}_C([q_{1,j_1^*}, \dots ,q_{s,j_s^*}, p_{s+1},\dots , p_k])\\&= \min _{j_s' \in [n]} \textsf {MIN}_C([q_{1,j_1^*}, \dots , q_{s-1,j_{s-1}^*}, q_{s,j_s'}, p_{s+1},\dots , p_k])\\&= \min _{j_s' \in [n]} \min _{\begin{array}{c} \vec {j}\in [n]^k \\ \text {s.t. } j_1 = j_1^*, \dots , j_{s-1} = j_{s-1}^*, j_s = j_s' \end{array}} C_{\vec {j}} - \sum _{i=1}^k [p_i]_{j_i}\\&= \min _{\begin{array}{c} \vec {j}\in [n]^k \\ \text {s.t. }j_1 = j_1^*, \dots , j_{s-1} = j_{s-1}^* \end{array}} C_{\vec {j}} - \sum _{i=1}^k [p_i]_{j_i}\\&= \textsf {MIN}_{C}([p_1,\dots ,p_k]), \end{aligned}$$

where above the first and third steps are by Observation 4.5, the second step is by construction of \(j_s^*\), the fourth step is by simplifying, and the final step is by definition of \((j_1^*,\dots ,j_{s-1}^*)\) being a partial solution of size \(s-1\). We conclude that \((j_1^*,\dots ,j_s^*)\) is a partial solution of size s, as desired. \(\square \)
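The reduction in this proof is directly implementable. Below is a sketch (assuming numpy and a MIN oracle taking a weight matrix of shape (n, k); the function name is ours) that recovers an optimal tuple coordinate-by-coordinate with nk oracle calls:

```python
# Direct transcription of the reduction in Lemma 4.4, assuming a MIN oracle
# with calling convention MIN(p) for p of shape (n, k).
import numpy as np

def argmin_from_min(MIN, p, C_max):
    n, k = p.shape
    M = 2 * C_max + 2 * np.abs(p).max(axis=0).sum() + 1
    q = p.copy()
    j_star, best_val = [], None
    for s in range(k):
        best_j, best_val = None, np.inf
        for js in range(n):
            trial = q.copy()
            trial[:, s] = -M           # q_{s, js}: -M on every entry ...
            trial[js, s] = p[js, s]    # ... except entry js keeps [p_s]_{js}
            val = MIN(trial)
            if val < best_val:
                best_j, best_val = js, val
        j_star.append(best_j)
        q[:, s] = -M                   # freeze coordinate s at its chosen index
        q[best_j, s] = p[best_j, s]
    return tuple(j_star), best_val
```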

We can now conclude the proof of the main result of Sect. 4.1.

Proof of Theorem 4.1

The implication “(i) \(\implies \) (ii)” is trivial, and the implication “(ii) \(\implies \) (iii)” is shown in [6]. It therefore suffices to show the implication “(iii) \(\implies \) (i)”. This follows from combining the fact that \(\texttt {ELLIPSOID}\) solves \(\textsf {MOT}_C\) in polynomial time given an efficient implementation of \(\textsf {ARGMIN}_C\) (Theorem 4.3), with the fact that the \(\textsf {MIN}_C\) and \(\textsf {ARGMIN}_C\) oracles are polynomial-time equivalent (Lemma 4.4). \(\square \)

4.1.3 Practical implementation via Column Generation

Although \(\texttt {ELLIPSOID}\) enjoys powerful theoretical runtime guarantees, it is slow in practice because the classical Ellipsoid algorithm is. Nevertheless, whenever \(\texttt {ELLIPSOID}\) is applicable (i.e., whenever the \(\textsf {MIN}_C\) oracle can be efficiently implemented), we can use an alternative practical algorithm, namely the delayed Column Generation method \(\texttt {COLGEN}\), to compute exact, sparse solutions to \(\textsf {MOT}\).

For completeness, we briefly recall the idea behind \(\texttt {COLGEN}\); for further details see the standard textbook [18,  §6.1]. \(\texttt {COLGEN}\) runs the Simplex method, keeping only basic variables in the tableau. Each time that \(\texttt {COLGEN}\) needs to find a Simplex variable on which to pivot, it solves the “pricing problem” of finding a variable with negative reduced cost. This is the key subroutine in \(\texttt {COLGEN}\). In the present context of the \(\textsf {MOT}\) LP, this pricing problem is equivalent to a call to the \(\textsf {ARGMIN}\) violation oracle (see [18,  Definition 3.2] for the definition of reduced costs). By the polynomial-time equivalence of the \(\textsf {ARGMIN}\) and \(\textsf {MIN}\) oracles shown in Lemma 4.4, this bottleneck subroutine in \(\texttt {COLGEN}\) can be computed in polynomial time whenever the \(\textsf {MIN}\) oracle can. For easy recall, we summarize this discussion as follows.

Theorem 4.6

(Standard guarantee for \(\texttt {COLGEN} \); Section 6.1 of [18]) For any \(T > 0\), one can implement T iterations of \(\texttt {COLGEN}\) in \(\mathrm {poly}(n,k,T)\) time and calls to the \(\textsf {MIN}_C\) oracle. When \(\texttt {COLGEN}\) terminates, it returns an optimal vertex solution, which is given as a sparse tensor with at most \(nk-k+1\) non-zero entries.

Note that \(\texttt {COLGEN}\) does not have a theoretical guarantee stating that it terminates after a polynomial number of iterations. But it often performs well in practice and terminates after a small number of iterations, leading to much better empirical performance than \(\texttt {ELLIPSOID}\).
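To illustrate, here is a compact \(\texttt {COLGEN}\) loop (assuming scipy's HiGHS backend, which exposes equality-constraint duals via res.eqlin.marginals, and an ARGMIN oracle as in Definition 4.2 returning a tuple and its value; the initial column set S0 is assumed to index a feasible coupling). This is a simplified sketch, not the paper's tuned implementation:

```python
# Sketch of delayed Column Generation for MOT (Sect. 4.1.3), assuming scipy.
import numpy as np
from scipy.optimize import linprog

def colgen(ARGMIN, cost_of, mus, S0, max_iters=1000):
    n, k = len(mus[0]), len(mus)
    S = list(S0)                                  # current set of columns (tuples)
    for _ in range(max_iters):
        A_eq = np.zeros((n * k, len(S)))
        for t, j in enumerate(S):
            for i in range(k):
                A_eq[i * n + j[i], t] = 1.0       # column j hits constraint (i, j_i)
        res = linprog(c=[cost_of(j) for j in S], A_eq=A_eq,
                      b_eq=np.concatenate(mus), bounds=(0, None), method="highs")
        p = res.eqlin.marginals.reshape(k, n).T   # duals, as weights in R^{n x k}
        j_new, reduced_cost = ARGMIN(p)           # pricing = violation oracle
        if reduced_cost >= -1e-9:                 # dual feasible: current S optimal
            return res, S
        S.append(j_new)
    raise RuntimeError("column generation did not converge")
```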

4.2 The Multiplicative Weights Update and the \(\textsf {AMIN}\) oracle

The second classical algorithm for solving implicitly-structured LPs that we study in the context of \(\textsf {MOT}\) is the Multiplicative Weights Update algorithm \(\texttt {MWU}\) [91]. The objective of this section is to prove the following guarantees for its specialization to \(\textsf {MOT}\).

Theorem 4.7

For any family of cost tensors \(C \in (\mathbb {R}^n)^{\otimes k}\), the following are equivalent:

  1. (i)

    For any \(\varepsilon > 0\), \(\texttt {MWU}\) takes \(\mathrm {poly}(n,k,C_{\max }/\varepsilon )\) time to solve the \(\textsf {MOT}_C\) problem \(\varepsilon \)-approximately. (Moreover, it outputs a sparse solution with at most \(\mathrm {poly}(n,k,C_{\max }/\varepsilon )\) non-zero entries.)

  2. (ii)

    There exists a \(\mathrm {poly}(n,k,C_{\max }/\varepsilon )\)-time algorithm that solves the \(\textsf {MOT}_C\) problem \(\varepsilon \)-approximately for any \(\varepsilon > 0\).

  3. (iii)

    There exists a \(\mathrm {poly}(n,k,C_{\max }/\varepsilon )\)-time algorithm that solves the \(\textsf {AMIN}_C\) problem \(\varepsilon \)-approximately for any \(\varepsilon > 0\).

Interpretation of results. Similarly to the analogous Theorem 4.1 for \(\texttt {ELLIPSOID}\), the equivalence “(i) \(\Longleftrightarrow \) (ii)” establishes that \(\texttt {MWU}\) can approximately solve any \(\textsf {MOT}\) problem in polynomial time that any other algorithm can. Thus, from a theoretical perspective, restricting to \(\texttt {MWU}\) for approximately solving \(\textsf {MOT}\) problems is at no loss of generality. In words, the equivalence “(ii) \(\Longleftrightarrow \) (iii)” establishes that approximating \(\textsf {MOT}\) and approximating \(\textsf {MIN}\) are polynomial-time equivalent. Thus we may investigate when \(\textsf {MOT}\) is tractable to approximate by instead investigating the more amenable question of when \(\textsf {MIN}\) is tractable (see Sect. 1.1.4) at no loss of generality.

Theorem 4.7 is new to this work. In particular, equivalences between problems with polynomially small error do not fall under the purview of classical LP theory, which deals with exponentially small error [44]. Our use of the \(\texttt {MWU}\) algorithm exploits a simple reduction of \(\textsf {MOT}\) to a mixed packing-covering LP that has appeared in the \(k=2\) matrix case of Optimal Transport in [19, 73], where implicit LP is not necessary for polynomial runtime.

Organization of Sect. 4.2. In Sect. 4.2.1 we present the specialization of Multiplicative Weights Update to \(\textsf {MOT}\), and recall how it runs in polynomial time and calls to a certain bottleneck oracle. In Sect. 4.2.2, we show that this bottleneck oracle is equivalent to the \(\textsf {AMIN}\) oracle, and then use this to prove Theorem 4.7.

4.2.1 Algorithm

Here we present the \(\texttt {MWU}\) algorithm, which combines the generic Multiplicative Weights Update algorithm of [91] specialized to \(\textsf {MOT}\), along with a final rounding step that ensures feasibility of the solution.

In order to present \(\texttt {MWU}\), it is convenient to assume that the cost C has entries in the range \([1,2] \subset \mathbb {R}\), which is at no loss of generality by simply translating and rescaling the cost (see Sect. 4.2.2), and can be done implicitly given a bound on \(C_{\max }\). This is why in the rest of this subsection, every runtime dependence on \(\varepsilon \) is polynomial in \(1/\varepsilon \) for costs in the range [1, 2]; after transformation back to \([-C_{\max },C_{\max }]\), this is polynomial dependence in the standard scale-invariant quantity \(C_{\max }/\varepsilon \).

Since the cost C is assumed to have non-negative entries, for any \(\lambda \in [1,2]\), the polytope

$$\begin{aligned} K(\lambda ) = \{P \in \mathcal {M}(\mu _1,\dots ,\mu _k): \langle C, P \rangle \leqslant \lambda \} \end{aligned}$$

of couplings with cost at most \(\lambda \) is a mixed packing-covering polytope (i.e., all variables are non-negative and all constraints have non-negative coefficients). Note that \(K(\lambda )\) is non-empty if and only if \(\textsf {MOT}_C(\mu )\) has value at most \(\lambda \). Thus, modulo a binary search on \(\lambda \), this reduces computing the value of \(\textsf {MOT}_C(\mu )\) to the task of detecting whether \(K(\lambda )\) is empty. Since \(K(\lambda )\) is a mixed packing-covering polytope, the Multiplicative Weights Update algorithm of [91] determines whether \(K(\lambda )\) is empty, and runs in polynomial time apart from one bottleneck, which we now define.

In order to define the bottleneck, we first define a potential function. For this, we define the softmax analogously to the softmin as

$$\begin{aligned} {{\,\mathrm{smax}\,}}(a_1,\ldots ,a_t) = -{{\,\mathrm{smin}\,}}(-a_1,\ldots ,-a_t) = \log \left( \sum _{i=1}^t e^{a_i}\right) . \end{aligned}$$

Here we use regularization parameter \(\eta = 1\) for simplicity, since this suffices for analyzing \(\texttt {MWU}\), and thus we have dropped this index \(\eta \) for shorthand.

Definition 4.8

(Potential function for \({{\texttt {\textit{MWU}}}}\) ) Fix a cost \(C \in (\mathbb {R}^n)^{\otimes k}\), target marginals \(\mu \in (\varDelta _n)^k\), and target value \(\lambda \in \mathbb {R}\). Define the potential function \(\varPhi := \varPhi _{C,\mu ,\lambda } : (\mathbb {R}_{\geqslant 0}^n)^{\otimes k}\rightarrow \mathbb {R}\) by

$$\begin{aligned}\varPhi (P) = {{\,\mathrm{smax}\,}}\left( \frac{\langle C, P \rangle }{\lambda }, \frac{m_1(P) }{ \mu _1}, \ldots , \frac{m_k(P) }{ \mu _k}\right) . \end{aligned}$$

The softmax in the above expression is interpreted as a softmax over the \(nk+1\) values in the concatenation of vectors and scalars in its input. (This slight abuse of notation significantly reduces notational overhead.)
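For concreteness, the potential can be evaluated as follows (a sketch assuming numpy/scipy and an explicit tensor P; brute-force marginals stand in for whatever structured access one has in the implicit setting):

```python
# Evaluating the MWU potential of Definition 4.8, assuming numpy/scipy.
import numpy as np
from scipy.special import logsumexp

def potential(P, C, mus, lam):
    k = P.ndim
    terms = [np.array([(C * P).sum() / lam])]
    for i in range(k):
        axes = tuple(a for a in range(k) if a != i)
        terms.append(P.sum(axis=axes) / mus[i])   # m_i(P) / mu_i, entrywise
    return logsumexp(np.concatenate(terms))       # smax over the nk+1 values
```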

Given this potential function, we now define the bottleneck operation for \(\texttt {MWU}\): find a direction \(\vec {j} \in [n]^k\) in which P can be increased such that the potential is increased as little as possible.

Definition 4.9

(Bottleneck oracle for \({{\texttt {\textit{MWU}}}}\) ) Given iterate \(P \in (\mathbb {R}_{\geqslant 0}^n)^{\otimes k}\), target marginals \(\mu \in (\varDelta _n)^k\), target value \(\lambda \in \mathbb {R}\), and accuracy \(\varepsilon > 0\), \(\textsf {MWU\_BOTTLENECK}_C(P,\mu ,\lambda ,\varepsilon )\) either:

  • Outputs “null”, certifying that \(\min _{\vec {j}\in [n]^k} \frac{\partial }{\partial h} \varPhi (P + h \cdot \delta _{\vec {j}}) \mid _{h = 0} > 1\), or

  • Outputs \(\vec {j}\in [n]^k\) such that \(\frac{\partial }{\partial h} \varPhi (P + h \cdot \delta _{\vec {j}}) \mid _{h = 0} \leqslant 1 + \varepsilon \).

(If \(\min _{\vec {j}\in [n]^k} \frac{\partial }{\partial h} \varPhi (P + h \cdot \delta _{\vec {j}}) \mid _{h = 0}\) is within \((1,1+\varepsilon ]\), then either return behavior is a valid output.)

Pseudocode for the \(\texttt {MWU}\) algorithm is given in Algorithm 2. We prove that \(\texttt {MWU}\) runs in polynomial time given access to this bottleneck oracle.

Theorem 4.10

Let the entries of the cost C lie in the range [1, 2]. Given \(\lambda \in \mathbb {R}\) and accuracy parameter \(\varepsilon > 0\), \(\texttt {MWU}\) either certifies that \(\textsf {MOT}_C(\mu ) > \lambda \), or returns a \(\mathrm {poly}(n,k,1/\varepsilon )\)-sparse \(P \in \mathcal {M}(\mu _1,\dots ,\mu _k)\) satisfying \(\langle C,P \rangle \leqslant \lambda + 8\varepsilon \).

Furthermore, the loop in Step 1 runs in \(\tilde{O}(nk/\varepsilon ^2)\) iterations, and Step 2 runs in \(\mathrm {poly}(n,k,1/\varepsilon )\) time.

The \(\texttt {MWU}\) algorithm can be used to output an \(O(\varepsilon )\)-approximate solution for \(\textsf {MOT}\) via an outer loop that performs binary search over \(\lambda \); this incurs only \(O(\log (1 / \varepsilon ))\) multiplicative overhead in the runtime.

Algorithm 2: \(\texttt {MWU}\) (pseudocode figure)

Proof

We analyze Step 1 (Multiplicative Weights Update) and Step 2 (rounding) of \(\texttt {MWU}\) separately.

Lemma 4.11

(Correctness of Step 1) Step 1 of Algorithm 2 runs in \(\tilde{O}(nk /\varepsilon ^2)\) iterations. It either returns (i) “infeasible”, certifying that \(K(\lambda )\) is empty; or (ii) finds a \(\mathrm {poly}(n,k,1/\varepsilon )\)-sparse tensor \(P \in (\mathbb {R}_{\geqslant 0}^n)^{\otimes k}\) that is approximately in \(K(\lambda )\), i.e., P satisfies:

$$\begin{aligned} m(P) \geqslant 1 - 4\varepsilon , \qquad \langle C, P \rangle \leqslant \lambda , \qquad \text {and} \qquad m_i(P) \leqslant \mu _i \;\; \text { for all } i \in [k]. \end{aligned}$$

Step 1 is the Multiplicative Weights Update algorithm of [91] applied to the polytope \(K(\lambda )\), so correctness follows from the analysis of [91]. We briefly recall the main idea behind this algorithm for the convenience of the reader, and provide a full proof in Appendix A.1 for completeness. The idea is that on each iteration, \(\vec {j}\in [n]^k\) is chosen so that the increase in the potential \(\varPhi (P)\) is approximately bounded by the increase in the total mass m(P). If this is impossible, then the bottleneck oracle returns null, which means \(K(\lambda )\) is empty. So assume otherwise. Then once the total mass has increased to \(m(P) = \eta + O(\varepsilon )\), the potential \(\varPhi (P)\) must be bounded by \(\eta (1 + O(\varepsilon ))\). By exploiting the inequality between the max and the softmax, this means that \(\max (\langle C, P \rangle / \lambda , \max _{i \in [k], j \in [n]} [m_i(P)]_j / [\mu _i]_j) \leqslant \varPhi (P) \leqslant \eta (1 + O(\varepsilon ))\) as well. Thus, rescaling P by \(1 / (\eta (1 + O(\varepsilon )))\) in Line 6 satisfies \(m(P) \geqslant 1 - O(\varepsilon )\), \(\langle C, P \rangle / \lambda \leqslant 1\), and \(m_i(P)/\mu _i \leqslant 1\). See Appendix A.1 for full details and a proof of the runtime and sparsity claims.

Lemma 4.12

(Correctness of Step 2) Step 2 of Algorithm 2 runs in \(\mathrm {poly}(n,k,1/\varepsilon )\) time and returns \(P \in \mathcal {M}(\mu _1,\dots ,\mu _k)\) satisfying \(\langle C, P \rangle \leqslant \lambda + 8\varepsilon \). Furthermore, P only has \(\mathrm {poly}(n,k,1/\varepsilon )\) non-zero entries.

Proof of Lemma 4.12

By Lemma 4.11, P satisfies \(m_i(P) \leqslant \mu _i\) for all \(i \in [k]\). Observe that this is an invariant that holds throughout the execution of Step 2. This, along with the fact that \(\sum _{j=1}^n [m_i(P)]_j = m(P)\) for every i, implies that the indices \((j_1,\dots ,j_k)\) found in Line 8 satisfy \([\mu _i]_{j_i} - [m_i(P)]_{j_i} > 0\) for each \(i \in [k]\). Thus in particular \(\alpha > 0\) in Line 9. It follows that Line 10 makes at least one more constraint satisfied (in particular the constraint "\([\mu _i]_{j_i} = [m_i(P)]_{j_i}\)" where i is the minimizer in Line 9). Since there are nk constraints total to be satisfied, Step 2 terminates in at most nk iterations. Each iteration increases the number of non-zero entries in P by at most one, thus P is \(\mathrm {poly}(n,k,1/\varepsilon )\) sparse throughout. That P is sparse also implies that each iteration can be performed in \(\mathrm {poly}(n,k,1/\varepsilon )\) time, thus Step 2 takes \(\mathrm {poly}(n,k,1/\varepsilon )\) time overall.

Finally, we establish the quality guarantee on \(\langle C, P \rangle \). By Lemma 4.11, this is at most \(\lambda \) before starting Step 2. During Step 2, the total mass added to P is equal to \(1 - m(P)\). This is upper bounded by \(4\varepsilon \) by Lemma 4.11. Since \(C_{\max }\leqslant 2\), we conclude that the value of \(\langle C, P \rangle \) is increased by at most \(8\varepsilon \) in Step 2. \(\square \)

Combining Lemmas 4.11 and 4.12 concludes the proof of Theorem 4.10. \(\square \)

4.2.2 Equivalence of bottleneck to \(\textsf {AMIN}\)

In order to prove Theorem 4.7, we show that the \(\texttt {MWU}\) algorithm can be implemented in polynomial time and calls to the \(\textsf {AMIN}\) oracle. First, we prove this fact for the \(\textsf {ARGAMIN}\) oracle, which differs from the \(\textsf {AMIN}\) oracle in that it also returns a tuple \(\vec {j}\in [n]^k\) that is an approximate minimizer.

Definition 4.13

(Approximate violation oracle for (MOT-D)) Given weights \(p = (p_1,\ldots ,p_k) \in \mathbb {R}^{n \times k}\) and accuracy \(\varepsilon > 0\), \(\textsf {ARGAMIN}_C\) returns a tuple \(\vec {j}\in [n]^k\) that minimizes \(C_{\vec {j}} - \sum _{i=1}^k [p_i]_{j_i}\) up to additive error \(\varepsilon \), along with the value of this tuple up to additive error \(\varepsilon \).

Lemma 4.14

Let the entries of the cost C lie in the range [1, 2]. The \(\texttt {MWU}\) algorithm (Algorithm 2) can be implemented in \(\mathrm {poly}(n,k,1/\varepsilon )\) time and calls to the \(\textsf {ARGAMIN}_C\) oracle with accuracy parameter \(\varepsilon ' = \varTheta (\varepsilon ^2 / (nk))\).

Proof

We show that on each iteration of Step 1 of Algorithm 2 we can emulate the call to the \(\textsf {MWU\_BOTTLENECK}\) oracle with one call to the \(\textsf {ARGAMIN}\) oracle. Recall that \(\textsf {MWU\_BOTTLENECK}_C(P,\mu ,\lambda ,\varepsilon )\) seeks to find \(\vec {j}\in [n]^k \) such that

$$\begin{aligned} V_{\vec {j}} := \frac{\partial }{\partial h} \varPhi (P + h \delta _{\vec {j}}) \Big |_{h=0} \end{aligned}$$

is at most \(1 + \varepsilon \), or to certify that for all \(\vec {j}\) it is greater than 1. By explicit computation,

$$\begin{aligned} V_{\vec {j}}&= \frac{\partial }{\partial h} \log \left( \exp \left( \left( \langle C, P \rangle + h C_{\vec {j}}\right) /\lambda \right) + \sum _{s=1}^k \sum _{t=1}^n \exp \left( \left( [m_s(P)]_t + h \, \delta _{t,j_s}\right) /[\mu _s]_t\right) \right) \Bigg |_{h=0} \nonumber \\&= \left( C_{\vec {j}} - \sum _{i=1}^k [p_i]_{j_i}\right) \frac{\exp (\langle C, P \rangle / \lambda ) / \lambda }{\exp (\langle C, P \rangle / \lambda ) + \sum _{s=1}^k \sum _{t=1}^n \exp ([m_s(P)]_t / [\mu _s]_t)}, \end{aligned}$$
(4.3)

where the weights \(p = (p_1,\ldots ,p_k) \in \mathbb {R}^{n \times k}\) in the last line are defined as

$$\begin{aligned} [p_i]_j = -\frac{\lambda }{ \exp (\langle C, P \rangle / \lambda )} \cdot \frac{\exp ([m_i(P)]_{j} / [\mu _i]_{j})}{ [\mu _i]_{j}}, \qquad \forall i \in [k], j \in [n].\end{aligned}$$

Note that the second term in the product in (4.3) is positive and does not depend on \(\vec {j}\). This suggests that in order to minimize (4.3), it suffices to compute \(\vec {j}\leftarrow \textsf {ARGAMIN}_C(p,\varepsilon ')\) for some accuracy parameter \(\varepsilon ' > 0\).

The main technical difficulty with formalizing this intuitive approach is that the weights p are not necessarily efficiently computable. Nevertheless, using \(\mathrm {poly}(n,k)\) extra time on each iteration, we can compute the marginals \(m_1(P),\ldots ,m_k(P)\). Since the \(\textsf {ARGAMIN}\) oracle returns an \(\varepsilon '\)-additive approximation of the cost, we can also compute a running estimate \(\tilde{c}\) of the cost such that, on iteration T,

$$\begin{aligned} \tilde{c} - T\varepsilon ' \leqslant \langle C, P \rangle \leqslant \tilde{c} + T\varepsilon '. \end{aligned}$$

Therefore, we define weights \(\tilde{p} \in \mathbb {R}^{n \times k}\), which approximate p and which can be computed in \(\mathrm {poly}(n,k)\) time on each iteration:

$$\begin{aligned} [\tilde{p}_i]_j = -\frac{\lambda }{\exp (\tilde{c} / \lambda )} \cdot \frac{\exp ([m_i(P)]_j / [\mu _i]_j)}{[\mu _i]_j}, \qquad \forall i \in [k], j \in [n]. \end{aligned}$$

We also define the approximate value for any \(\vec {j}\in [n]^k\):

$$\begin{aligned} \tilde{V}_{\vec {j}} := \left( C_{\vec {j}} - \sum _{i=1}^k [\tilde{p}_i]_{j_i}\right) \frac{\exp (\tilde{c}/\lambda )/\lambda }{\exp (\tilde{c}/\lambda ) + \sum _{s=1}^k \sum _{t=1}^n \exp ([m_s(P)]_t / [\mu _s]_t)} \end{aligned}$$

It holds that \(\textsf {ARGAMIN}_C(\tilde{p}, \varepsilon ')\) returns a \(\vec {j}\in [n]^k\) that minimizes \(C_{\vec {j}} - \sum _{i=1}^k [\tilde{p}_i]_{j_i}\) up to multiplicative error \(1/(1-\varepsilon ')\), because the entries of the cost C are lower-bounded by 1, and \([\tilde{p}_i]_j \leqslant 0\) for all \(i \in [k], j \in [n]\). In particular, \(\textsf {ARGAMIN}_C(\tilde{p}, \varepsilon ')\) minimizes \(\tilde{V}_{\vec {j}}\) up to multiplicative error \(1/(1 - \varepsilon ')\). We prove the following claim relating \(V_{\vec {j}}\) and \(\tilde{V}_{\vec {j}}\):

Claim 4.15

For any \(\vec {j}\in [n]^k\), on iteration T, we have \(V_{\vec {j}} / \tilde{V}_{\vec {j}} \in [\exp (-2T\varepsilon '/\lambda ), \exp (2T\varepsilon '/\lambda )]\).

By the above claim, \(\textsf {ARGAMIN}_C(\tilde{p},\varepsilon ')\) therefore minimizes \(V_{\vec {j}}\) up to multiplicative error \(\exp (4T\varepsilon '/\lambda ) / (1 - \varepsilon ') \leqslant 1 + \varepsilon /3\) if we choose \(\varepsilon ' = \varTheta (\lambda \varepsilon / T)\). Thus one can implement \(\textsf {MWU\_BOTTLENECK}_C(p,\mu ,\lambda ,\varepsilon )\) by returning the value of \(\textsf {ARGAMIN}_C(\tilde{p},\varepsilon ')\) if its value is estimated to be at most \(1 + \varepsilon /3\), and returning "null" otherwise. The bound on the accuracy \(\varepsilon ' = \tilde{\varTheta }(\varepsilon ^2 / (nk))\) follows since \(\lambda \in [1,2]\) and \(T = \tilde{O}(nk/\varepsilon ^2)\) by Theorem 4.10.

Proof of Claim

We compare the expressions for \(V_{\vec {j}}\) and \(\tilde{V}_{\vec {j}}\). Each of these is a product of two terms. Since \(C_{\vec {j}} \geqslant 0\), and \([\tilde{p}_i]_{j_i}, [p_i]_{j_i} \leqslant 0\) for all i, the ratio of the first terms is

$$\begin{aligned} \frac{C_{\vec {j}} - \sum _{i=1}^k [\tilde{p}_i]_{j_i}}{C_{\vec {j}} - \sum _{i=1}^k [p_i]_{j_i}} \in [\min _{i} [\tilde{p}_i]_{j_i} / [p_i]_{j_i}, \max _{i} [\tilde{p}_i]_{j_i} / [p_i]_{j_i}] \subset [\exp (-T\varepsilon '/\lambda ), \exp (T\varepsilon '/\lambda )], \end{aligned}$$

where we have used that, for all \(i \in [k]\),

$$\begin{aligned} [\tilde{p}_i]_{j_i} / [p_i]_{j_i} = \exp (\langle C, P \rangle / \lambda ) / \exp (\tilde{c} / \lambda ) \in [\exp (-T\varepsilon '/\lambda ), \exp (T\varepsilon '/\lambda )].\end{aligned}$$

Similarly the ratio of the second terms in the expression for \(V_{\vec {j}}\) and \(\tilde{V}_{\vec {j}}\) is also in the range \([\exp (-T\varepsilon '/\lambda ), \exp (T\varepsilon '/\lambda )]\). This concludes the proof of the claim. \(\square \)
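Before moving on, formula (4.3) itself is easy to sanity-check numerically: the sketch below (assuming numpy/scipy; brute-force tensors with small n, k, for illustration only) compares it against a central finite difference of the potential:

```python
# Finite-difference sanity check of formula (4.3), assuming numpy/scipy.
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(1)
n, k, lam, h = 2, 3, 1.5, 1e-6
C = 1 + rng.random((n,) * k)                  # cost entries in [1, 2]
mus = [np.full(n, 1 / n) for _ in range(k)]
P = 0.1 * rng.random((n,) * k)

def marg(P, i):
    return P.sum(axis=tuple(a for a in range(k) if a != i))

def Phi(P):
    terms = [np.array([(C * P).sum() / lam])]
    terms += [marg(P, i) / mus[i] for i in range(k)]
    return logsumexp(np.concatenate(terms))

j = (0, 1, 1)
E = np.zeros((n,) * k)
E[j] = 1.0
numeric = (Phi(P + h * E) - Phi(P - h * E)) / (2 * h)

cost_term = np.exp((C * P).sum() / lam)
F0 = cost_term + sum(np.exp(marg(P, i) / mus[i]).sum() for i in range(k))
p = np.stack([-lam / cost_term * np.exp(marg(P, i) / mus[i]) / mus[i]
              for i in range(k)], axis=1)     # the weights defined above
V = (C[j] - sum(p[j[i], i] for i in range(k))) * (cost_term / lam) / F0
assert abs(numeric - V) < 1e-5
```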

Finally, we show that the \(\textsf {ARGAMIN}\) oracle can be reduced to the \(\textsf {AMIN}\) oracle, which completes the proof that \(\texttt {MWU}\) can be run with \(\textsf {AMIN}\).

Lemma 4.16

(Equivalence of \(\textsf {AMIN} \) and \(\textsf {ARGAMIN} \)) Each of the oracles \(\textsf {AMIN}_C\) and \(\textsf {ARGAMIN}_C\) with accuracy parameter \(\varepsilon > 0\) can be implemented using \(\mathrm {poly}(n,k)\) calls of the other oracle with accuracy parameter \(\varTheta (\varepsilon / k)\) and \(\mathrm {poly}(n,k)\) additional time.

It is worth remarking that the equivalence that we show between \(\textsf {AMIN}\) and \(\textsf {ARGAMIN}\) is not known to hold for the feasibility and separation oracles of general LPs, since the known result for general LPs requires exponentially small error in nk [44,  §4.3]. However, in the case of \(\textsf {MOT}\) the equivalence follows from a direct and practical reduction, similar to the proof of the equivalence of the exact oracles (Lemma 4.4). The main difference is that some care is needed to bound the propagation of the errors of the approximate oracles. For completeness, we provide the full proof of Lemma 4.16 in Appendix A.

We conclude by proving Theorem 4.7.

Proof of Theorem 4.7

The implication "(i) \(\implies \) (ii)" is trivial, and the implication "(ii) \(\implies \) (iii)" is shown in [6]. It therefore suffices to show the implication "(iii) \(\implies \) (i)". For costs C with entries in the range [1, 2], this follows from combining the fact that \(\texttt {MWU}\) can be implemented to solve \(\textsf {MOT}_C\) in \(\mathrm {poly}(n,k,1/\varepsilon )\) time given an efficient implementation of \(\textsf {ARGAMIN}_C\) with polynomially-sized accuracy parameter \(\varepsilon ' = \mathrm {poly}(1/n,1/k,\varepsilon )\) (Lemma 4.14), along with the fact that the \(\textsf {AMIN}_C\) and \(\textsf {ARGAMIN}_C\) oracles are polynomial-time equivalent with polynomially-sized accuracy parameter (Lemma 4.16).

The assumption that C has entries within the range [1, 2] can be removed with no loss by translating and rescaling the original cost \(C' \leftarrow (C + 3C_{\max })/(2C_{\max })\) and running Algorithm 2 on \(C'\) with approximation parameter \(\varepsilon ' \leftarrow \varepsilon / (2C_{\max })\). Each \(\tau '\)-approximate query to the \(\textsf {AMIN}_{C'}\) oracle can be simulated by a \(\tau \)-approximate query to the \(\textsf {AMIN}_C\) oracle, where \(\tau = 2C_{\max }\tau '\). \(\square \)

Remark 4.17

(Practical optimizations) Our numerical implementation of \(\texttt {MWU}\) has two modifications that provide practical speedups. One is maintaining a cached list of the tuples \(\vec {j}\in [n]^k\) previously returned by calls to \(\textsf {MWU\_BOTTLENECK}\). Whenever \(\textsf {MWU\_BOTTLENECK}\) is called, we first check whether any tuple \(\vec {j}\) in the cache satisfies the desideratum \(\frac{\partial }{\partial h} \varPhi (P + h \cdot \delta _{\vec {j}}) \mid _{h = 0} \leqslant 1 + \varepsilon \), in which case we use this \(\vec {j}\) to answer the oracle query. Otherwise, we answer the oracle query using \(\textsf {AMIN}\) as explained above. In practice, this cache allows us to avoid many calls to the potentially expensive \(\textsf {AMIN}\) bottleneck. Our second optimization is that, at each iteration of \(\texttt {MWU}\), we check whether the current iterate P can be rescaled in order to satisfy the guarantees in Lemma 4.11 required from Step 1. If so, we stop Step 1 early and use this rescaled P.

4.3 The Sinkhorn algorithm and the \(\textsf {SMIN}\) oracle

The Sinkhorn algorithm (\(\texttt {SINKHORN}\)) is specially tailored to \(\textsf {MOT}\), and does not apply to general exponential-size LP. Currently it is by far the most popular algorithm in the \(\textsf {MOT}\) literature (see Sect. 1.3). However, in general each iteration of \(\texttt {SINKHORN}\) takes exponential time \(n^{\varTheta (k)}\), and it is not well-understood when it can be implemented in polynomial-time. The objective of this section is to show that this bottleneck is polynomial-time equivalent to the \(\textsf {SMIN}\) oracle, and in doing so put \(\texttt {SINKHORN}\) on equal footing with classical implicit LP algorithms in terms of their reliance on variants of the dual feasibility oracle for \(\textsf {MOT}\). Concretely, this lets us establish the following two results.

First, \(\texttt {SINKHORN}\) can solve \(\textsf {MOT}\) in polynomial time whenever \(\textsf {SMIN}\) can be solved in polynomial time.

Theorem 4.18

For any family of cost tensors \(C \in (\mathbb {R}^n)^{\otimes k}\) and accuracy \(\varepsilon > 0\), \(\texttt {SINKHORN}\) solves \(\textsf {MOT}_C\) to \(\varepsilon \) accuracy in \(\mathrm {poly}(n,k,C_{\max }/\varepsilon )\) time and \(\mathrm {poly}(n,k,C_{\max }/\varepsilon )\) calls to the \(\textsf {SMIN}_C\) oracle with regularization \(\eta = (2 k \log n)/\varepsilon \). (The solution is output through a polynomial-size implicit representation, see Sect. 4.3.1.)

Second, we show that \(\texttt {SINKHORN}\) requires strictly more structure than other algorithms do to solve an \(\textsf {MOT}\) problem. This is why the results about \(\texttt {ELLIPSOID}\) (Theorem 4.1) and \(\texttt {MWU}\) (Theorem 4.7) state that those algorithms solve an \(\textsf {MOT}\) problem whenever possible, whereas Theorem 4.18 cannot be analogously extended to such an “if and only if” characterization.

Theorem 4.19

There is a family of cost tensors \(C \in (\mathbb {R}^n)^{\otimes k}\) for which \(\texttt {ELLIPSOID}\) solves \(\textsf {MOT}_C\) exactly in \(\mathrm {poly}(n,k)\) time, however it is \(\#\)BIS-hard to implement a single iteration of \(\texttt {SINKHORN}\) in \(\mathrm {poly}(n,k)\) time.

Organization of Sect. 4.3. In Sect. 4.3.1, we recall this \(\texttt {SINKHORN}\) algorithm and how it depends on a certain marginalization oracle. In Sect. 4.3.2, we show that this marginalization oracle is polynomial-time equivalent to the \(\textsf {SMIN}\) oracle, and use this to prove Theorems 4.18 and 4.19.

4.3.1 Algorithm

Here we recall \(\texttt {SINKHORN}\) and its known guarantees. To do this, we first define the following oracle. While this oracle does not have an interpretation as a dual feasibility oracle, we show below that it is polynomial-time equivalent to \(\textsf {SMIN}\), which is a specific type of approximate dual feasibility oracle (Remark 3.6).

Definition 4.20

(\({{{\textsf {\textit{MARG}}}}}\)) Given scalings \(d = (d_1, \dots , d_k) \in \mathbb {R}_{\geqslant 0}^{n \times k}\), regularization \(\eta > 0\), and an index \(i \in [k]\), the marginalization oracle \(\textsf {MARG}_C(d,\eta ,i)\) returns the vector \(m_i((\otimes _{i'=1}^k d_{i'}) \odot \exp [-\eta C]) \in \mathbb {R}_{\geqslant 0}^{n}\).

It is known that \(\texttt {SINKHORN}\) can solve \(\textsf {MOT}\) with only polynomially many calls to this oracle [58]. The approximate solution that \(\texttt {SINKHORN}\) computes is a fully dense tensor with \(n^k\) non-zero entries, but it is output implicitly in O(nk) space through “scaling vectors” and “rounding vectors”, described below.

Theorem 4.21

(\(\texttt {SINKHORN} \) guarantee, [58]) Algorithm 3 computes an \(\varepsilon \)-approximate solution to \(\textsf {MOT}_C(\mu )\) using \(\mathrm {poly}(n,k,C_{\max }/\varepsilon )\) calls to the \(\textsf {MARG}_C\) oracle with parameter \(\eta = (2 k \log n)/\varepsilon \), and \(\mathrm {poly}(n,k,C_{\max }/\varepsilon )\) additional time. The solution is of the form

$$\begin{aligned} P =\left( \otimes _{i=1}^k d_i\right) \odot \exp [-\eta C] + \left( \otimes _{i=1}^k v_i\right) , \end{aligned}$$
(4.4)

and is output implicitly via the scaling vectors \(d_1, \dots , d_k \in \mathbb {R}_{\geqslant 0}^n\) and rounding vectors \(v_1, \dots , v_k \in \mathbb {R}_{\geqslant 0}^n\).

Algorithm 3: \(\texttt {SINKHORN}\) (pseudocode figure)

Sketch of algorithm. Full details and a proof are in [58]. We give a brief overview here for completeness. The main idea of \(\texttt {SINKHORN}\) is to solve \(\textsf {RMOT}\), the entropically regularized variant of \(\textsf {MOT}\) described in Sect. 2.2. On one hand, this provides an \(\varepsilon \)-approximate solution to \(\textsf {MOT}\) by taking the regularization parameter \(\eta = \varTheta (\varepsilon ^{-1} k \log n)\) sufficiently high (Lemma 2.4). On the other hand, solving \(\textsf {RMOT}\) rather than \(\textsf {MOT}\) enables exploiting the first-order optimality conditions of \(\textsf {RMOT}\), which imply that the unique solution to \(\textsf {RMOT}\) is the unique tensor in \(\mathcal {M}(\mu _1,\dots ,\mu _k)\) of the form

$$\begin{aligned} P^* = (\otimes _{i=1}^k d_i^*) \odot K, \end{aligned}$$
(4.5)

where K denotes the entrywise exponentiated tensor \(\exp [-\eta C]\), and \(d_1^*, \dots , d_k^* \in \mathbb {R}_{\geqslant 0}^n\) are non-negative vectors. The \(\texttt {SINKHORN}\) algorithm approximately computes this solution in two steps.

The first and main step of Algorithm 3 is the natural multimarginal analog of the Sinkhorn scaling algorithm [78]. It computes an approximate solution \(P = (\otimes _{i=1}^k d_i) \odot K\) by finding \(d_1, \dots , d_k\) such that P is nearly feasible in the sense that \(m_i(P) \approx \mu _i\) for each \(i \in [k]\). Briefly, it does this via alternating optimization: initialize \(d_i\) to the all-ones vector \(\mathbf {1}\in \mathbb {R}^n\), and then iteratively update one \(d_i\) so that the i-th marginal \(m_i(P)\) of the current scaled iterate \(P= (\otimes _{i=1}^k d_i) \odot K\) is \(\mu _i\). Although correcting one marginal can detrimentally affect the others, this algorithm nevertheless converges—in fact, in a polynomial number of iterations [58].

The second step of Algorithm 3 is the natural multimarginal analog of the rounding algorithm [5,  Algorithm 2]. It rounds the solution \(P = (\otimes _{i=1}^k d_i) \odot K\) found in step one to the transportation polytope \(\mathcal {M}(\mu _1,\dots ,\mu _k)\). Briefly, it performs this by scaling each marginal \(m_i(P)\) to be entrywise less than the desired \(\mu _i\), and then adding mass back to P so that all marginals constraints are exactly satisfied. The former adjustment is done by adjusting the diagonal scalings \(d_1, \dots , d_k\), and the latter adjustment is done by adding a rank-1 term \(\otimes _{i=1}^k v_i\).

Critically, note that Algorithm 3 takes polynomial time except for possibly the calls to the \(\textsf {MARG}_C\) oracle. In the absence of structure in the cost tensor C, evaluating this \(\textsf {MARG}_C\) oracle takes exponential time because it requires computing marginals of a tensor with \(n^k\) entries.
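For concreteness, here is a minimal multimarginal Sinkhorn sketch (assuming numpy) in which brute-force marginalization plays the role of the \(\textsf {MARG}_C\) oracle; for structured costs, this is exactly the step one would replace with a polynomial-time oracle:

```python
# Minimal multimarginal SINKHORN sketch (Step 1 scaling only, no rounding),
# assuming numpy; brute-force tensor operations play the role of MARG_C.
import numpy as np

def sinkhorn_mot(C, mus, eta, n_iters=500):
    n, k = C.shape[0], C.ndim
    K = np.exp(-eta * C)                 # entrywise kernel exp[-eta C]
    d = [np.ones(n) for _ in range(k)]   # scaling vectors d_1, ..., d_k
    for it in range(n_iters):
        i = it % k                       # round-robin updates (Remark 4.22)
        scaled = K.copy()
        for s in range(k):               # form the scaled tensor explicitly
            shape = [1] * k
            shape[s] = n
            scaled = scaled * d[s].reshape(shape)
        axes = tuple(a for a in range(k) if a != i)
        m_i = scaled.sum(axis=axes)      # brute-force MARG_C(d, eta, i)
        d[i] *= mus[i] / m_i             # fix the i-th marginal exactly
    return d
```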

We conclude this discussion with several remarks about \(\texttt {SINKHORN}\).

Remark 4.22

(Choice of update index in \(\texttt {SINKHORN} \)) In line 3 there are several ways to choose update indices, all of which lead to the polynomial iteration complexity we desire. Iteration-complexity bounds are shown for a greedy choice in [40, 58]. Similar bounds can be shown for random and round-robin choices by adapting the techniques of [9]. These latter two choices do not incur the overhead of k \(\textsf {MARG}\) computations per iteration required by the greedy choice, which is helpful in practice. Empirically, we observe that round-robin works quite well, and we use this in our experiments.

Remark 4.23

(Alternative implementations of \(\texttt {SINKHORN} \)) For simplicity, Algorithm 3 provides pseudocode for the “vanilla” version of \(\texttt {SINKHORN}\) as it performs well in practice and it achieves the polynomial iteration complexity we desire. There are several variants in the literature, including accelerated versions and first rounding small entries of the marginals—these variants have iteration-complexity bounds with better polynomial dependence on \(\varepsilon \) and k, albeit sometimes at the expense of larger polynomial factors in n [58, 84].

Remark 4.24

(Output of \(\texttt {SINKHORN} \) and efficient downstream tasks) While the output P of \(\texttt {SINKHORN}\) is fully dense with \(n^k\) non-zero entries, its specific form (4.4) enables performing downstream tasks in polynomial time. This is conditional on a polynomial-time \(\textsf {MARG}_C\) oracle, which is at no loss of generality since that is required for running \(\texttt {SINKHORN}\) in polynomial time in the first place. The basic idea is that P is a mixture of two simple distributions (modulo normalization). The first is \(\left( \otimes _{i=1}^k d_i\right) \odot \exp [-\eta C]\), which is marginalizable using \(\textsf {MARG}_C\). The second is \(\otimes _{i=1}^k v_i\), which is easily marginalizable since it is a product distribution (as the \(v_i\) are non-negative). Together, this enables efficient marginalization of P. By recursively marginalizing on conditional distributions, this enables efficiently sampling from P. This in turn enables efficient estimation of bounded statistics of P (e.g., the cost \(\langle C, P\rangle \)) by Hoeffding’s inequality.
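A sketch of this marginalization (assuming numpy; brute-force tensors for illustration, with the \(\textsf {MARG}\)-oracle term computed by explicit summation):

```python
# Marginalizing the implicit SINKHORN output (4.4), assuming numpy; the
# first term is exactly a MARG call, the second is a product distribution.
import numpy as np

def marginal_of_output(C, d, v, eta, i):
    n, k = C.shape[0], C.ndim
    scaled = np.exp(-eta * C)
    for s in range(k):
        shape = [1] * k
        shape[s] = n
        scaled = scaled * d[s].reshape(shape)
    axes = tuple(a for a in range(k) if a != i)
    term1 = scaled.sum(axis=axes)    # m_i of the scaled kernel: MARG_C(d, eta, i)
    term2 = v[i] * np.prod([v[s].sum() for s in range(k) if s != i])
    return term1 + term2             # m_i(P) for P of the form (4.4)
```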

4.3.2 Equivalence of bottleneck to \(\textsf {SMIN}\)

Although Theorem 4.21 shows that \(\texttt {SINKHORN}\) solves \(\textsf {MOT}\) in polynomial time using the \(\textsf {MARG}\) oracle, this is not sufficient to prove Theorem 4.18, which is stated in terms of the \(\textsf {SMIN}\) oracle, nor to prove Theorem 4.19. In order to prove these results, we show that \(\textsf {SMIN}\) and \(\textsf {MARG}\) are polynomial-time equivalent.

Lemma 4.25

(Equivalence of \(\textsf {MARG} \) and \(\textsf {SMIN} \)) For any regularization \(\eta > 0\), each of the oracles \(\textsf {MARG}_C\) and \(\textsf {SMIN}_C\) can be implemented using \(\mathrm {poly}(n)\) calls of the other oracle and \(\mathrm {poly}(n,k)\) additional time.

Proof

Reduction from \({{\textsf {\textit{SMIN}}}}\) to \({{\textsf {\textit{MARG}}}}\). First, we show how to compute \(\textsf {SMIN}_C(p,\eta )\) using one call to the marginalization oracle and O(n) additional time. Consider the entrywise exponentiated matrix \(d = \exp [\eta p] \in \mathbb {R}_{\geqslant 0}^{n \times k}\), and let \(\mu _1 = m_1((\otimes _{i=1}^k d_i) \odot \exp [-\eta C])\) be the answer to \(\textsf {MARG}_C(d,\eta ,1)\). Observe that

$$\begin{aligned} -\eta ^{-1} \log \left( \sum _{j_1=1}^n [\mu _1]_{j_1} \right)&= -\eta ^{-1} \log \left( \sum _{j_1=1}^n \sum _{j_2,\dots ,j_k \in [n]} \prod _{i=1}^k [d_i]_{j_i} e^{-\eta C_{\vec {j}}} \right) \\&= -\eta ^{-1} \log \left( \sum _{\vec {j}\in [n]^k} e^{-\eta (C_{\vec {j}} - \sum _{i=1}^k [p_i]_{j_i})} \right) \\&= \mathop {{{\,\mathrm{{\textstyle {smin_\eta }}}\,}}}\limits _{\vec {j}\in [n]^k} \left( C_{\vec {j}} - \sum _{i=1}^k [p_i]_{j_i}\right) , \end{aligned}$$

where above the first step is by definition of \(\mu _1\), the second step is by definition of d and combining the sums, and the third step is by definition of \({{\,\mathrm{smin}\,}}\). We conclude that \(-\eta ^{-1} \log \sum _{j_1=1}^n [\mu _1]_{j_1}\) is a valid answer to \(\textsf {SMIN}_C(p,\eta )\). Since this is clearly computable from \(\mu _1\) in O(n) time, this establishes the claimed reduction.

Reduction from \({{\textsf {\textit{MARG}}}}\) to \({{\textsf {\textit{SMIN}}}}\). Next, we show that for any marginalization index \(i \in [k]\) and entry \(\ell \in [n]\), it is possible to compute the \(\ell \)-th entry of the vector \(\textsf {MARG}_C(d,\eta ,i)\) using one call to the \(\textsf {SMIN}_C\) oracle and \(\mathrm {poly}(n,k)\) additional time. Define \(v \in \mathbb {R}^n\) to be the vector with \(\ell \)-th entry equal to \([d_i]_{\ell }\), and all other entries 0. Define the matrix \(p = \eta ^{-1} \log [d_1,\ldots ,d_{i-1},v,d_{i+1},\ldots ,d_k] \in \bar{\mathbb {R}}^{n \times k}\), where recall that \(\log 0 = -\infty \) (see Sect. 2). Let \(s\in \mathbb {R}\) denote the answer to \(\textsf {SMIN}_C(p, \eta )\). Observe that

$$\begin{aligned} e^{-\eta s}&= \sum _{\vec {j}\in [n]^k} e^{-\eta (C_{\vec {j}} - \sum _{i=1}^k [p_i]_{j_i})} = \sum _{\vec {j}\in [n]^k \; : \; \vec {j}_{i} = \ell } \prod _{i=1}^k [d_i]_{j_i} e^{-\eta C_{\vec {j}}} \\&= \left[ m_i\left( (\otimes _{i=1}^k d_i) \odot \exp [-\eta C] \right) \right] _{\ell }, \end{aligned}$$

where above the first step is by definition of s, the second step is by definition of p and v, and the third step is by definition of the marginalization notation \(m_i(\cdot )\). We conclude that \(\exp (-\eta s)\) is a valid answer for the \(\ell \)-th entry of the vector \(\textsf {MARG}_C(d,\eta ,i)\). This establishes the claimed reduction since we may repeat this procedure n times to compute all n entries of the vector \(\textsf {MARG}_C(d,\eta ,i)\). \(\square \)
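Both directions of Lemma 4.25 are short enough to state in code. The following sketch validates them on a tiny random instance; the brute-force oracles enumerate all \(n^k\) entries and are exponential-time stand-ins used only for testing.

import itertools

import numpy as np

def brute_marg(C, d, eta, i):
    # m_i((⊗ d_i) ⊙ exp[-ηC]) by explicit enumeration (for testing only).
    n, k = d.shape
    mu = np.zeros(n)
    for j in itertools.product(range(n), repeat=k):
        mu[j[i]] += np.prod([d[j[t], t] for t in range(k)]) * np.exp(-eta * C[j])
    return mu

def brute_smin(C, p, eta):
    # smin_η over j of C_j - Σ_i [p_i]_{j_i} by explicit enumeration.
    n, k = p.shape
    vals = [C[j] - sum(p[j[t], t] for t in range(k))
            for j in itertools.product(range(n), repeat=k)]
    return -np.log(np.sum(np.exp(-eta * np.asarray(vals)))) / eta

def smin_from_marg(marg, C, p, eta):
    # First reduction: one MARG_C call answers SMIN_C(p, η).
    mu1 = marg(C, np.exp(eta * p), eta, 0)
    return -np.log(mu1.sum()) / eta

def marg_from_smin(smin, C, d, eta, i):
    # Second reduction: n SMIN_C calls answer MARG_C(d, η, i).
    n, k = d.shape
    out = np.zeros(n)
    for ell in range(n):
        dmask = d.copy()
        dmask[:, i] = 0.0
        dmask[ell, i] = d[ell, i]
        with np.errstate(divide='ignore'):
            p = np.log(dmask) / eta          # log 0 = -∞, as in Sect. 2
        out[ell] = np.exp(-eta * smin(C, p, eta))
    return out

rng = np.random.default_rng(0)
n, k, eta = 3, 3, 2.0
C, p = rng.random((n,) * k), rng.random((n, k))
d = np.exp(eta * p)
assert np.isclose(smin_from_marg(brute_marg, C, p, eta), brute_smin(C, p, eta))
assert np.allclose(marg_from_smin(brute_smin, C, d, eta, 1), brute_marg(C, d, eta, 1))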

We can now conclude the proofs of the main results of Sect. 4.3.

Proof of Theorem 4.18

This follows from the fact that \(\texttt {SINKHORN}\) approximates \(\textsf {MOT}_C\) in polynomial time given an efficient implementation of \(\textsf {MARG}_C\) (Theorem 4.21), combined with the fact that the \(\textsf {MARG}_C\) and \(\textsf {SMIN}_C\) oracles are polynomial-time equivalent (Lemma 4.25).

Proof of Theorem 4.19

Consider the family of cost tensors in Lemma 3.7 for which the \(\textsf {MIN}_C\) oracle admits a polynomial-time algorithm, but for which the \(\textsf {SMIN}_C\) oracle is #BIS-hard. Then on one hand, the \(\texttt {ELLIPSOID}\) algorithm solves \(\textsf {MOT}_C\) in polynomial time by Theorem 4.1. And on the other hand, it is #BIS-hard to implement a single iteration of \(\texttt {SINKHORN}\) because that requires implementing the \(\textsf {MARG}_C\) oracle, which is polynomial-time equivalent to the \(\textsf {SMIN}_C\) oracle by Lemma 4.25. \(\square \)

5 Application: \(\textsf {MOT}\) problems with graphical structure

In this section, we illustrate our algorithmic framework on \(\textsf {MOT}\) problems with graphical structure. Although a polynomial-time algorithm is already known for this particular structure [49, 82], that algorithm computes solutions that are approximate and dense; see the related work section for details. By combining our algorithmic framework developed above with classical facts about graphical models, we show that it is possible to compute solutions that are exact and sparse in polynomial time.

The section is organized as follows. In Sect. 5.1, we recall the definition of graphical structure. In Sect. 5.2, we show that the \(\textsf {MIN}\), \(\textsf {AMIN}\), and \(\textsf {SMIN}\) oracles can be implemented in polynomial time for cost tensors with graphical structure; from this it immediately follows that all of the \(\textsf {MOT}\) algorithms discussed in part 1 of this paper can be implemented in polynomial time. Finally, in Sect. 5.3, we demonstrate our results on the popular application of computing generalized Euler flows, which was the original motivation of \(\textsf {MOT}\). Numerical simulations demonstrate how the exact, sparse solutions produced by our new algorithms provide qualitatively better solutions than previously possible in polynomial time.

5.1 Setup

We begin by recalling preliminaries about undirected graphical models, a.k.a., Markov Random Fields. We recall only the relevant background; for further details we refer the reader to the textbooks [55, 88].

In words, graphical models provide a way of encoding the independence structure of a collection of random variables in terms of a graph. The formal definition is as follows. Below, all graphs are undirected, and the notation \(2^V\) means the power set of V (i.e., the set of all subsets of V).

Definition 5.1

(Graphical model structure) Let \(\mathcal {S}\subset 2^{[k]}\). The graphical model structure corresponding to \(\mathcal {S}\) is the graph \(G_{\mathcal {S}} = (V,E)\) with vertices \(V = [k]\) and edges \(E = \{(i,j) : i,j \in S, \text { for some } S \in \mathcal {S}\}\).

Definition 5.2

(Graphical model) Let \(\mathcal {S}\subset 2^{[k]}\). A probability distribution P over \(\{X_i\}_{i \in [k]}\) is a graphical model with structure \(\mathcal {S}\) if there exist functions \(\{\psi _S\}_{S \in \mathcal {S}}\) and normalizing constant Z such that

$$\begin{aligned} P\Big (\{x_i\}_{i \in [k]}\Big ) = \frac{1}{Z} \prod _{S \in \mathcal {S}} \psi _S\Big (\{x_i\}_{i \in S}\Big ). \end{aligned}$$

A standard measure of complexity for graphical models is the treewidth of the underlying graphical model structure \(G_{\mathcal {S}}\) because this captures not just the storage complexity, but also the algorithmic complexity of performing fundamental tasks such as computing the mode, log-partition function, and marginal distributions [55, 88]. There are a number of equivalent definitions of treewidth [22]. Each requires defining intermediate combinatorial concepts. We recall here the definition that is based on the concept of a junction tree because this is perhaps the most standard definition in the graphical models community.

Definition 5.3

(Junction tree, treewidth) A junction tree \(T = (V_T, E_T, \{B_u\}_{u \in V_T})\) for a graph \(G = (V,E)\) consists of a tree \((V_T,E_T)\) and a set of bags \(\{B_u \subseteq V\}_{u \in V_T}\) satisfying:

  • For each variable \(i \in V\), the set of nodes \(U_i = \{u \in V_T : i \in B_u\}\) induces a subtree of T.

  • For each edge \(e \in E\), there is some bag \(B_u\) containing both endpoints of e.

The width of the junction tree is one less than the size of the largest bag, i.e., is \(\max _{u \in V_T} |B_u| - 1\). The treewidth of a graph is the width of its minimum-width junction tree.

See Figs. 1, 2, and 3 for illustrated examples.

We now formally recall the definition of graphical structure for \(\textsf {MOT}\).

Fig. 1

The path graph (left) has treewidth 1 because the corresponding junction tree (right) has bags of size at most 2

Fig. 2

The graph that has an edge between all pairs of vertices at distance at most two when ordered sequentially (left) has treewidth 2 because the corresponding junction tree (right) has bags of size at most 3

Fig. 3

The cycle graph (left) has treewidth 2 because the corresponding junction tree (right) has bags of size at most 3

Definition 5.4

(Graphical structure for \({{\textsf {\textit{MOT}}}}\) ) An \(\textsf {MOT}\) cost tensor \(C \in (\mathbb {R}^n)^{\otimes k}\) has graphical structure with treewidth \(\omega \) if there is a graphical model structure \(\mathcal {S}\subset 2^{[k]}\) and functions \(\{f_S\}_{S \in \mathcal {S}}\) such that

$$\begin{aligned} C_{\vec {j}} = \sum _{S \in \mathcal {S}} f_S\Big (\{j_i\}_{i \in S}\Big ), \qquad \forall \vec {j}:= (j_1,\dots ,j_k) \in [n]^k, \end{aligned}$$
(5.1)

and such that the graph \(G_{\mathcal {S}}\) has treewidth \(\omega \).

We make three remarks about this structure. First, note that the functions \(\{f_S\}_{S \in \mathcal {S}}\) can be arbitrary so long as the corresponding graphical model structure has treewidth at most \(\omega \).

Second, if Definition 5.4 did not constrain the treewidth \(\omega \), then every tensor C would trivially have graphical structure with maximal treewidth \(\omega = k -1\) (take \(\mathcal {S}\) to be the singleton containing [k], \(G_{\mathcal {S}}\) to be the complete graph, and \(f_{[k]}\) to be C). Just like all previous algorithms, our algorithms have runtimes that depend exponentially (only) on the treewidth of \(G_{\mathcal {S}}\). This is optimal in the sense that unless \(\mathsf {P}= \mathsf {NP}\), there is no algorithm with jointly polynomial runtime in the input size and treewidth [6]. We also point out that in all current applications of graphically structured \(\textsf {MOT}\), the treewidth is either 1 or 2, see Sect. 1.3.

Third, as in all previous work on graphically structured \(\textsf {MOT}\), we make the natural assumptions that the cost C is input implicitly through the functions \(\{f_S\}_{S \in \mathcal {S}}\), and that each function \(f_S\) can be evaluated in polynomial time, since otherwise graphical structure is useless for designing polynomial-time algorithms. In all applications in the literature, these two basic assumptions are always satisfied. Note also that if the treewidth of the graphical structure is constant, then there is a linear-time algorithm to compute the treewidth and a corresponding minimum-width junction tree [21].
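To make this implicit input format concrete, here is one possible (hypothetical) encoding in Python: only the functions \(\{f_S\}_{S \in \mathcal {S}}\) are stored, and any single entry \(C_{\vec {j}}\) is evaluated on demand without ever materializing the \(n^k\) tensor.

import numpy as np

def make_graphical_cost(fs):
    # fs: dict mapping each subset S (a sorted tuple of indices in [k])
    # to an array of shape (n,) * |S| holding the values of f_S.
    def C(j):  # j = (j_1, ..., j_k); lazy evaluation of one entry of C
        return sum(table[tuple(j[i] for i in S)] for S, table in fs.items())
    return C

# Example: a path-structured cost on k = 4 marginals with n = 5 (treewidth 1).
n, k = 5, 4
rng = np.random.default_rng(0)
fs = {(i, i + 1): rng.random((n, n)) for i in range(k - 1)}
C = make_graphical_cost(fs)
print(C((0, 3, 1, 2)))  # one entry of the n^k tensor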

5.2 Polynomial-time algorithms

By our oracle reductions in Sect. 4, in order to design polynomial-time algorithms for \(\textsf {MOT}\) with graphical structure, it suffices to design polynomial-time algorithms for the \(\textsf {MIN}\), \(\textsf {AMIN}\), or \(\textsf {SMIN}\) oracles. This follows directly from classical algorithmic results in the graphical models literature [55].

Theorem 5.5

(Polynomial-time algorithms for the \(\textsf {MIN} \), \(\textsf {AMIN} \), and \(\textsf {SMIN} \) oracles for costs with graphical structure) Let \(C \in (\mathbb {R}^n)^{\otimes k}\) be a cost tensor that has graphical structure with constant treewidth \(\omega \) (see Definition 5.4). Then the \(\textsf {MIN}_C\), \(\textsf {AMIN}_C\), and \(\textsf {SMIN}_C\) oracles can be computed in \(\mathrm {poly}(n,k)\) time.

[Algorithm 4 and Algorithm 5: pseudocode for the \(\textsf {MIN}_C\) and \(\textsf {SMIN}_C\) oracles for graphically structured costs; see the proof below]

Proof

Consider input p for the oracles. Let P denote the probability distribution on \([n]^k\) given by

$$\begin{aligned} P(\vec {j}) = \frac{1}{Z} \exp \left( -\eta \left( C_{\vec {j}} - \sum _{i=1}^k [p_i]_{j_i}\right) \right) , \qquad \forall \vec {j}\in [n]^k, \end{aligned}$$
(5.2)

where \(Z = \sum _{\vec {j}\in [n]^k} \exp (-\eta (C_{\vec {j}} - \sum _{i=1}^k [p_i]_{j_i}))\) ensures P is normalized. Observe that the \(\textsf {MIN}_C\) oracle amountsFootnote 6 to computing the mode of the distribution P because \(\textsf {MIN}_C(p) = C_{\vec {j}} - \sum _{i=1}^k [p_i]_{j_i}\), where \(\vec {j}\in [n]^k\) is a maximizer of \(P_{\vec {j}}\). Also, the \(\textsf {SMIN}_C\) oracle amounts to computing the partition function Z because \(\textsf {SMIN}_C(p) = - \eta ^{-1} \log Z\). Thus it suffices to compute the mode and partition function of P in polynomial time. (The \(\textsf {AMIN}_C\) oracle follows from the \(\textsf {MIN}_C\) oracle by Remark 3.5.)

To this end, observe that by assumption on C, there is a graphical model structure \(\mathcal {S}\subset 2^{[k]}\) and functions \(\{f_S\}_{S \in \mathcal {S}}\) such that the corresponding graph \(G_{\mathcal {S}}\) has treewidth \(\omega \) and the distribution P factors as

$$\begin{aligned} P(\vec {j}) = \frac{1}{Z} \exp \left( -\eta \left( \sum _{S \in \mathcal {S}} f_S\left( \{j_i\}_{i \in S} \right) - \sum _{i=1}^k [p_i]_{j_i}\right) \right) . \end{aligned}$$

It follows that P is a graphical model with respect to the same graphical model structure \(\mathcal {S}\) because the “vertex potentials” \(\exp (\eta [p_i]_{j_i})\) do not affect the underlying graphical model structure. Thus P is a graphical model with constant treewidth \(\omega \), so we may compute the mode and partition function of P in \(\mathrm {poly}(n,k)\) time using, respectively, the classical max-product and sum-product algorithms [55,  Chapters 13.3 and 10.2]. For convenience, pseudocode summarizing this discussion is provided in Algorithms 4 and 5. \(\square \)
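To make the proof concrete, the following sketch carries out both computations in the simplest case of a path-structured cost \(C_{\vec {j}} = \sum _{t=1}^{k-1} f_t(j_t, j_{t+1})\) (treewidth 1, cf. Fig. 1), where sum-product reduces to \(k-1\) matrix–vector products and max-product (written here in min-sum form) is the analogous dynamic program; general junction trees replace this chain by a tree of bags. This is an illustrative sketch under those assumptions, not a transcription of Algorithms 4 and 5.

import numpy as np

def smin_path(fs, p, eta):
    # SMIN_C(p, η) for C_j = Σ_t f_t(j_t, j_{t+1}) via sum-product:
    # the partition function Z is accumulated by k-1 matrix-vector products.
    n, k = p.shape
    m = np.exp(eta * p[:, 0])
    for t in range(k - 1):
        m = (m @ np.exp(-eta * fs[t])) * np.exp(eta * p[:, t + 1])
    return -np.log(m.sum()) / eta      # = -η^{-1} log Z

def min_path(fs, p):
    # MIN_C(p) for the same cost via the min-sum (max-product) recursion.
    n, k = p.shape
    g = -p[:, 0]
    for t in range(k - 1):
        g = np.min(g[:, None] + fs[t], axis=0) - p[:, t + 1]
    return g.min()

rng = np.random.default_rng(1)
n, k, eta = 4, 5, 3.0
fs = [rng.random((n, n)) for _ in range(k - 1)]  # each f_t as an n x n table
p = rng.random((n, k))
print(min_path(fs, p), smin_path(fs, p, eta))  # smin approaches min as η grows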

An immediate consequence of Theorem 5.5 combined with our oracle reductions is that all candidate \(\textsf {MOT}\) algorithms in Sect. 4 can be efficiently implemented for \(\textsf {MOT}\) problems with graphical structure. From a theoretical perspective, \(\texttt {ELLIPSOID}\) gives the best guarantee since it produces an exact, sparse solution.

Corollary 5.6

(Polynomial-time algorithms for \(\textsf {MOT} \) problems with graphical structure) Let \(C \in (\mathbb {R}^n)^{\otimes k}\) be a cost tensor that has graphical structure with constant treewidth \(\omega \) (see Definition 5.4). Then:

  • The \(\texttt {ELLIPSOID}\) algorithm in Sect. 4.1 computes an exact solution to \(\textsf {MOT}_C\) in \(\mathrm {poly}(n,k)\) time.

  • The \(\texttt {MWU}\) algorithm in Sect. 4.2 computes an \(\varepsilon \)-approximate solution to \(\textsf {MOT}_C\) in \(\mathrm {poly}(n,k,C_{\max }/\varepsilon )\) time.

  • The \(\texttt {SINKHORN}\) algorithm in Sect. 4.3 computes an \(\varepsilon \)-approximate solution to \(\textsf {MOT}_C\) in \(\mathrm {poly}(n,k,C_{\max }/\varepsilon )\) time.

  • The \(\texttt {COLGEN}\) algorithm in Sect. 4.1.3 can be run for T iterations in \(\mathrm {poly}(n,k,T)\) time.

Moreover, \(\texttt {ELLIPSOID}\), \(\texttt {MWU}\), and \(\texttt {COLGEN}\) output a polynomially sparse tensor, whereas \(\texttt {SINKHORN}\) outputs a fully dense tensor through the implicit representation described in Sect. 4.3.1.

Proof

Combine the polynomial-time implementations of the oracles in Theorem 5.5 with the polynomial-time algorithm-to-oracle reductions in Theorems 4.1, 4.7, 4.18, and 4.6, respectively. \(\square \)

5.3 Application vignette: Fluid dynamics

In this section, we numerically demonstrate our new results for graphically structured \(\textsf {MOT}\)—namely the ability to compute exact, sparse solutions in polynomial time (Corollary 5.6). We illustrate this on the problem of computing generalized Euler flows—an \(\textsf {MOT}\) application which has received significant interest and which was historically the motivation of \(\textsf {MOT}\), see e.g., [14, 16, 23–26]. This \(\textsf {MOT}\) problem is already known to be tractable via a popular, specially-tailored modification of \(\texttt {SINKHORN}\) [14]—which can be interpreted as implementing \(\texttt {SINKHORN}\) using graphical structure [49, 82]. However, that algorithm is based on \(\texttt {SINKHORN}\) and thus unavoidably produces solutions that are low-precision (due to \(\mathrm {poly}(1/\varepsilon )\) runtime dependence), fully dense (with \(n^k\) non-zero entries), and subject to well-documented numerical precision issues. We offer the first polynomial-time algorithm for computing exact and/or sparse solutions.

We briefly recall the premise of this \(\textsf {MOT}\) problem; for further background see [14, 26]. An incompressible fluid (e.g., water) is modeled by n particles which are uniformly distributed in space (due to incompressibility) at all times \(t \in \{1, \dots ,k+1\}\). We observe each particle’s location at initial time \(t=1\) and final time \(t=k+1\). The task is to infer the particles’ locations at all intermediate times \(t \in \{2, \dots , k\}\), and this is modeled by an \(\textsf {MOT}\) problem as follows.

Specifically, the locations of the fluid particles are discretized to points \(\{x_{j}\}_{j \in [n]} \subset \mathbb {R}^d\), and \(\sigma \) is a known permutation on this set that encodes the relation between each initial location \(x_j\) at time \(t=1\) and final location \(\sigma (x_j)\) at time \(t = k+1\). The total movement of a particle that takes the trajectory \(x_{j_1},x_{j_2},\ldots ,x_{j_k},\sigma (x_{j_1})\) is given by

$$\begin{aligned} C_{j_1,\ldots ,j_k} = \Vert \sigma (x_{j_1}) - x_{j_k}\Vert ^2 + \sum _{t=1}^{k-1} \Vert x_{j_{t+1}} - x_{j_t}\Vert ^2. \end{aligned}$$
(5.3)

By the principle of least action, the generalized Euler flow problem of inferring the most likely trajectories of the fluid particles is given by the solution to the \(\textsf {MOT}\) problem with this cost C and uniform marginals \(\mu _t = \mathbf {1}_n/n \in \varDelta _n\) which impose the constraint that the fluid is incompressible.

Corollary 5.7

(Exact, sparse solutions for generalized Euler flows) The \(\textsf {MOT}\) problem with cost (5.3) can be solved in \(d \cdot \mathrm {poly}(n,k)\) time. The solution is returned as a sparse tensor with at most \(nk-k+1\) non-zeros.

Proof

This cost tensor C can be expressed in graphical form \(C_{\vec {j}} = \sum _{S \in \mathcal {S}} f_S(\{j_i\}_{i \in S})\) where \(\mathcal {S}\) consists of the sets \(\{1,2\}, \dots , \{k-1,k\}\) of adjacent time points as well as the set \(\{1,k\}\). Moreover, each function \(f_S : [n]^2 \rightarrow \mathbb {R}\) can be computed in \(O(dn^2)\) time since this simply requires computing \(\Vert x_j - x_{j'}\Vert ^2\) for \(n^2\) pairs of points \(x_j,x_{j'} \in \mathbb {R}^d\). Once this graphical representation is computed, Corollary 5.6 implies a \(\mathrm {poly}(n,k)\) time algorithm for this \(\textsf {MOT}\) problem because the graphical model structure \(\mathcal {S}\) is a cycle graph and thus has treewidth 2 (cf., Fig. 3). \(\square \)
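For concreteness, the pairwise functions in this decomposition can be assembled as follows (a minimal sketch; x is an \(n \times d\) array of discretized locations and sigma is an integer array encoding the permutation \(\sigma \)).

import numpy as np

def euler_flow_cost_functions(x, sigma):
    # Pairwise functions of (5.3) on the cycle structure: adjacent timesteps
    # {t, t+1} share f_adjacent, and the closing set {1, k} uses f_closing.
    sq = lambda A, B: ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)  # n x n
    f_adjacent = sq(x, x)         # f(j, j') = ||x_{j'} - x_j||^2
    f_closing = sq(x[sigma], x)   # f(j, j') = ||σ(x_j) - x_{j'}||^2
    return f_adjacent, f_closing  # each computed in O(d n^2) time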

Fig. 4

Transport maps computed by the fast implementation of \(\texttt {SINKHORN}\) [14] (left) and our \(\texttt {COLGEN}\) implementation (right) on a standard fluid dynamics benchmark problem in dimension \(d = 1\) [26]. The pairwise transport maps between successive timesteps are plotted with opacity proportional to the mass. The \(\texttt {SINKHORN}\) algorithm is run at the highest precision (i.e., smallest regularization) before serious numerical precision issues (NaNs). It returns a dense, approximate solution in 2.25 seconds. (All experiments in this paper are run on a standard-issue Apple MacBook Pro 2020 laptop with an M1 Chip.) \(\texttt {COLGEN}\) returns an exact, sparse solution in 9.52 seconds. Furthermore, in this particular problem instance, the \(\texttt {COLGEN}\) method returns a Monge solution, i.e., the sparsity is n so that the particles never split in the computed trajectories

Figure 4 illustrates how the exact, sparse solutions found by our new algorithm provide visually sharper estimates than the popular modification of \(\texttt {SINKHORN}\) in [14], which blurs the trajectories. The latter is the state-of-the-art algorithm in the literature and in particular is the only previously known non-heuristic algorithm that has polynomial-time guarantees. Note that this algorithm is identical to implementing \(\texttt {SINKHORN}\) by exploiting the graphical structure to perform exact marginalization efficiently [49, 82].

The numerical simulation is on a standard benchmark problem used in the literature (see e.g., [14,  Figure 9] and [26,  Figure 2]) in which the particle at initial location \(x \in [0,1]\) moves to final location \(\sigma (x) = x + {{\,\mathrm{\frac{1}{2}}\,}}\pmod {1}\). This is run with \(k = 6\) and marginals \(\mu _1 = \dots = \mu _k\) uniformly supported on \(n = 51\) positions in [0, 1]. See Appendix B for numerics on other standard benchmark instances. Note that this amounts to solving an \(\textsf {MOT}\) LP with \(n^k = 51^6 \approx 1.8 \times 10^{10}\) variables, which is infeasible for standard LP solvers. Our algorithm is the first to compute exact solutions for problem instances of this scale.

Two important remarks. First, since this \(\textsf {MOT}\) problem is a discretization of the underlying PDE, an exact solution is of course not necessary; however, there is an important—even qualitative—difference between low-precision solutions (computable with \(\mathrm {poly}(1/\varepsilon )\) runtime) and high-precision solutions (computable with \(\mathrm {polylog}(1/\varepsilon )\) runtime) for the discretized problem. Second, a desirable feature of \(\texttt {SINKHORN}\) that should be emphasized is its practical scalability, which might make it advantageous for problems where very fine discretization is required. It is an interesting direction of practical relevance to develop algorithms that can compute high-precision solutions at a similarly large scale in practice (see the discussion in Sect. 8).

6 Application: \(\textsf {MOT}\) problems with set-optimization structure

In this section, we consider \(\textsf {MOT}\) problems whose cost tensors C take values 0 and 1—or more generally any two values, by a straightforward reductionFootnote 7. Such \(\textsf {MOT}\) problems arise naturally in applications where one seeks to minimize or maximize the probability that some event occurs given marginal probabilities on each variable (see Example 6.1). We establish that this general class of \(\textsf {MOT}\) problems can be solved in polynomial time under a condition on the sparsity pattern of C that is often simple to check due its connection to classical combinatorial optimization problems.

The section is organized as follows. In Sect. 6.1 we formally describe this setup and discuss why it is incomparable to all other structures discussed in this paper. In Sect. 6.2, we show that for costs with this structure, the \(\textsf {MIN}\), \(\textsf {AMIN}\), and \(\textsf {SMIN}\) oracles can be implemented in polynomial time; from this it immediately follows that the \(\texttt {ELLIPSOID}\), \(\texttt {MWU}\), \(\texttt {SINKHORN}\), and \(\texttt {COLGEN}\) algorithms discussed in part 1 of this paper can be implemented in polynomial time. In Sect. 6.3, we illustrate our results via a case study on network reliability.

6.1 Setup

Example 6.1

(Motivation for binary-valued \(\textsf {MOT} \) costs: minimizing/maximizing probability of an event) Let \(S \subset [n]^k\). If \(C_{\vec {j}} = \mathbb {1}[\vec {j}\in S]\), then the \(\textsf {MOT}_C\) problem amounts to minimizing the probability that event S occurs, given marginals on each variable. On the other hand, if \(C_{\vec {j}} = \mathbb {1}[\vec {j}\notin S]\), then the \(\textsf {MOT}_C\) problem amounts to maximizing the probability that event S occurs since

$$\begin{aligned} \textsf {MOT}_C(\mu _1,\dots ,\mu _k) = \min _{P \in \mathcal {M}(\mu _1,\dots ,\mu _k)} \mathbb {P}_{\vec {j}\sim P}[\vec {j}\notin S] = 1 - \max _{P \in \mathcal {M}(\mu _1,\dots ,\mu _k)} \mathbb {P}_{\vec {j}\sim P}[\vec {j}\in S]. \end{aligned}$$

Even if the cost is binary-valued, there is no hope to solve \(\textsf {MOT}\) in polynomial time without further assumptions—essentially because in the worst case, any algorithm must query all \(n^k\) entries if C is a completely arbitrary \(\{0,1\}\)-valued tensor.

We show that \(\textsf {MOT}\) is polynomial-time solvable under the general and often simple-to-check condition that the \(\textsf {MIN}\), \(\textsf {AMIN}\), and \(\textsf {SMIN}\) oracles introduced in Sect. 3 are polynomial-time solvable when restricted to the set S of indices \(\vec {j}\in [n]^k\) for which \(C_{\vec {j}} = 0\). For simplicity, our definition of these set oracles removes the cost \(C_{\vec {j}}\) as it is constant on S. Of course it is also possible to remove the negative sign in \(-p\) by re-parameterizing the inputs as \(w = -p\); however, we keep this notation in order to parallel the original oracles.

Definition 6.2

(\({{\textsf {\textit{MIN}}}}\) set oracle) Let \(S \subset [n]^k\). For weights \(p = (p_1,\ldots ,p_k) \in \mathbb {R}^{n \times k}\), \(\textsf {MIN}_{C,S}(p)\) returns

$$\begin{aligned} \min _{\vec {j}\in S} - \sum _{i=1}^k [p_i]_{j_i}. \end{aligned}$$

Definition 6.3

(\({{\textsf {\textit{AMIN}}}}\) set oracle) Let \(S \subset [n]^k\). For weights \(p = (p_1,\ldots ,p_k) \in \mathbb {R}^{n \times k}\) and accuracy \(\varepsilon > 0\), \(\textsf {AMIN}_{C,S}(p,\varepsilon )\) returns \(\textsf {MIN}_{C,S}(p)\) up to additive error \(\varepsilon \).

Definition 6.4

(\({{\textsf {\textit{SMIN}}}}\) set oracle) Let \(S \subset [n]^k\). For weights \(p = (p_1,\ldots ,p_k) \in \bar{\mathbb {R}}^{n \times k}\) and regularization parameter \(\eta > 0\), \(\textsf {SMIN}_{C,S}(p, \eta )\) returns

$$\begin{aligned} \displaystyle \mathop {{{\,\mathrm{{\textstyle {smin_\eta }}}\,}}}\limits _{\vec {j}\in S} - \sum _{i=1}^k [p_i]_{j_i}. \end{aligned}$$

The key motivation behind these set oracle definitions (aside from the syntactic similarity to the original oracles) is that they encode the problem of (approximately) finding the min-weight object in S. This opens the door to combinatorial applications of \(\textsf {MOT}\) because finding the min-weight object in S is well-known to be polynomial-time solvable for many “combinatorial-structured” sets S of interest—e.g., the set S of cuts in a graph, or the set S of independent sets in a matroid. See Sect. 6.3 for fully-detailed applications.

Definition 6.5

(Set-optimization structure for \({{\textsf {\textit{MOT}}}}\) ) An \(\textsf {MOT}\) cost tensor \(C \in (\mathbb {R}^n)^{\otimes k}\) has exact, approximate, or soft set-optimization structure of complexity \(\beta \) if

$$\begin{aligned} C_{\vec {j}} = \mathbb {1}[\vec {j}\notin S] \end{aligned}$$

for a set \(S \subset [n]^k\) for which there is an algorithm solving \(\textsf {MIN}_{C,S}\), \(\textsf {AMIN}_{C,S}\), or \(\textsf {SMIN}_{C,S}\), respectively, in \(\beta \) time.

We make two remarks about this structure.

Remark 6.6

(Only require set oracle for \(C^{-1}(0)\), not for \(C^{-1}(1)\)) Note that Definition 6.5 only requires the set oracles for the set S of entries where C is 0, and does not need the set oracles for the set \([n]^k \setminus S\) where C is 1. The fact that both set oracles are not needed makes set-optimization structure easier to check than the original oracles in Sect. 3, because those effectively require optimization over both S and \([n]^k \setminus S\).

Remark 6.7

(Set-optimization structure is incomparable to graphical and low-rank plus sparse structure) Costs C that satisfy Definition 6.5 in general do not have non-trivial graphical structure or low-rank plus sparse structure. Specifically, there are costs C that satisfy Definition 6.5, yet require maximal treewidth \(k-1\) to model via graphical structure, and super-constant rank or exponential sparsity to model via low-rank plus sparse structure. (A concrete example is the network reliability application in Sect. 6.3.) Because of the \(\mathsf {NP}\)-hardness of \(\textsf {MOT}\) problems with \((k-1)\)-treewidth graphical structure or super-constant rank [6], simply modeling such problems with graphical structure or low-rank plus sparse structure is therefore useless for the purpose of designing polynomial-time \(\textsf {MOT}\) algorithms.

6.2 Polynomial-time algorithms

By our oracle reductions in part 1 of this paper, in order to design polynomial-time algorithms for \(\textsf {MOT}\) with set-optimization structure, it suffices to design polynomial-time algorithms for the \(\textsf {MIN}\), \(\textsf {AMIN}\), or \(\textsf {SMIN}\) oracles. We show how to do this for all three oracles in a straightforward way by exploiting the set-optimization structure.

Theorem 6.8

(Polynomial-time algorithms for the \(\textsf {MIN} \), \(\textsf {AMIN} \), and \(\textsf {SMIN} \) oracles for costs with set-optimization structure) If \(C \in (\mathbb {R}^n)^{\otimes k}\) is a cost tensor with exact, approximate, or soft set-optimization structure of complexity \(\beta \) (see Definition 6.5), then the \(\textsf {MIN}_C\), \(\textsf {AMIN}_C\), and \(\textsf {SMIN}_C\) oracles, respectively, can be computed in \(\beta + \mathrm {poly}(n,k)\) time.

[Algorithm 6 and Algorithm 7: pseudocode for the \(\textsf {MIN}_C\) and \(\textsf {SMIN}_C\) oracles given the corresponding set oracles; see the proof below]

Proof

Polynomial-time algorithm for \({{\textsf {\textit{MIN}}}}\). We first claim that Algorithm 6 implements the \(\textsf {MIN}_C(p)\) oracle. To this end, define

$$\begin{aligned} a := \textsf {MIN}_{C,S}(p) = \min _{\begin{array}{c} \vec {j}\in [n]^k \\ \text {s.t. } C_{\vec {j}} = 0 \end{array}} - \sum _{i=1}^k [p_i]_{j_i} \qquad \text {and} \qquad b := \min _{\begin{array}{c} \vec {j}\in [n]^k \\ \text {s.t. } C_{\vec {j}} = 1 \end{array}} - \sum _{i=1}^k [p_i]_{j_i}. \end{aligned}$$
(6.1)

By re-arranging the sum and max, it follows that

$$\begin{aligned} x := -\sum _{i=1}^k \max _{j \in [n]} [p_i]_{j} = - \max _{\vec {j}\in [n]^k} \sum _{i=1}^k [p_i]_{j_i} = \min _{\vec {j}\in [n]^k} \sum _{i=1}^k - [p_i]_{j_i} = \min (a,b). \end{aligned}$$
(6.2)

Therefore

$$\begin{aligned} \textsf {MIN}_C(p) = \min _{\vec {j}\in [n]^k} C_{\vec {j}} - \sum _{i=1}^k [p_i]_{j_i} = \min (a,1+b)&= {\left\{ \begin{array}{ll} a &{} \text { if } a \leqslant b \\ \min (a,1+\min (a,b)) &{} \text { if } a> b \end{array}\right. }\nonumber \\&= {\left\{ \begin{array}{ll} a &{} \text { if } a \leqslant x \\ \min (a,1+x) &{} \text { if } a > x \end{array}\right. }, \end{aligned}$$
(6.3)

where above the first step is by definition of \(\textsf {MIN}_C\); the second step is by partitioning the minimization over \(\vec {j}\in [n]^k\) into \(\vec {j}\) such that \(C_{\vec {j}} = 0\) or \(C_{\vec {j}} = 1\), and then plugging in the definitions of a and b; the third step is by manipulating \(\min (a,1+b)\) in both cases; and the last step is because \(x = \min (a,b)\) as shown above. We conclude that Algorithm 6 correctly outputs \(\textsf {MIN}_C(p)\). Since the algorithm uses one call to the \(\textsf {MIN}_{C,S}\) oracle and O(nk) additional time, the claim is proven.
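In code, the logic of Algorithm 6 is only a few lines. A minimal sketch, where min_set stands in for any \(\textsf {MIN}_{C,S}\) implementation and p is an \(n \times k\) array:

import numpy as np

def min_oracle(min_set, p):
    # MIN_C(p) for C = 1[j ∉ S], given a MIN_{C,S} oracle (Algorithm 6's logic).
    a = min_set(p)             # min over j in S of -Σ_i [p_i]_{j_i}
    x = -p.max(axis=0).sum()   # min over all j in [n]^k, cf. (6.2)
    return a if a <= x else min(a, 1 + x)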

Polynomial-time algorithm for \({{\textsf {\textit{AMIN}}}}\). Next, we claim that the same Algorithm 6, now run with the approximate oracle \(\textsf {AMIN}_{C,S}(p,\varepsilon )\) in the first step instead of the exact oracle \(\textsf {MIN}_{C,S}(p)\), computes a valid solution to \(\textsf {AMIN}_C(p,\varepsilon )\). To prove this, let a, b, and x be as defined in (6.1) and (6.2) for the \(\textsf {MIN}\) analysis, and let \(\tilde{a} = \textsf {AMIN}_{C,S}(p,\varepsilon )\). By the same logic as in (6.3), except now reversed, the output

$$\begin{aligned} {\left\{ \begin{array}{ll} \tilde{a} &{} \text { if } \tilde{a} \leqslant x \\ \min (\tilde{a},1+x) &{} \text { if } \tilde{a} > x \end{array}\right. } \end{aligned}$$

is equal to \(\min (\tilde{a},1+b)\). Now because \(\tilde{a}\) is within additive \(\varepsilon \) error of a (by definition of the \(\textsf {AMIN}_{C,S}\) oracle), it follows that the above output is within \(\varepsilon \) additive error of

$$\begin{aligned} \min (a,1+b) = \min _{\vec {j}\in [n]^k} C_{\vec {j}} - \sum _{i=1}^k [p_i]_{j_i} = \textsf {MIN}_C(p). \end{aligned}$$

Thus the output is a valid answer to \(\textsf {AMIN}_C(p,\varepsilon )\), establishing correctness. The runtime claim is obvious.

Polynomial-time algorithm for \({{\textsf {\textit{SMIN}}}}\). Finally, we claim that Algorithm 7 implements the \(\textsf {SMIN}_C(p,\eta )\) oracle. To this end, define

$$\begin{aligned} a := e^{-\eta \cdot \textsf {SMIN}_{C,S}(p,\eta )} = \sum _{\begin{array}{c} \vec {j}\in [n]^k \\ \text {s.t. } C_{\vec {j}} = 0 \end{array}} e^{\eta \sum _{i=1}^k [p_i]_{j_i}} \qquad \text { and } \qquad b := \sum _{\begin{array}{c} \vec {j}\in [n]^k \\ \text {s.t. } C_{\vec {j}} = 1 \end{array}} e^{\eta \sum _{i=1}^k [p_i]_{j_i}}. \end{aligned}$$

By re-arranging products and sums, it follows that

$$\begin{aligned} x := \prod _{i=1}^k \sum _{j=1}^n e^{\eta [p_i]_{j}} = \sum _{\vec {j}\in [n]^k} \prod _{i=1}^k e^{\eta [p_i]_{j_i}} = a + b. \end{aligned}$$

Therefore

$$\begin{aligned} \textsf {SMIN}_C(p,\eta )= & {} -\frac{1}{\eta } \log \left( \sum _{\vec {j}\in [n]^k} e^{-\eta (C_{\vec {j}} - \sum _{i=1}^k [p_i]_{j_i})} \right) = -\frac{1}{\eta } \log \left( a + e^{-\eta }b \right) \\= & {} -\frac{1}{\eta } \log \left( e^{-\eta } x + (1-e^{-\eta })a \right) , \end{aligned}$$

where above the first step is by definition of \(\textsf {SMIN}_C\); the second step is by partitioning the sum over \(\vec {j}\in [n]^k\) into \(\vec {j}\) such that \(C_{\vec {j}} = 0\) or \(C_{\vec {j}} = 1\), and then plugging in the definitions of a and b; and the third step is because \(x = a+b\) as shown above. We conclude that Algorithm 7 correctly outputs \(\textsf {SMIN}_C(p,\eta )\). Since the algorithm uses one call to the \(\textsf {SMIN}_{C,S}\) oracle and O(nk) additional time, the claim is proven. \(\square \)
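Algorithm 7 is equally short in code. A minimal sketch, where smin_set stands in for any \(\textsf {SMIN}_{C,S}\) implementation:

import numpy as np

def smin_oracle(smin_set, p, eta):
    # SMIN_C(p, η) for C = 1[j ∉ S], given a SMIN_{C,S} oracle (Algorithm 7's logic).
    a = np.exp(-eta * smin_set(p, eta))       # total mass of the entries in S
    x = np.exp(eta * p).sum(axis=0).prod()    # total mass over all of [n]^k
    return -np.log(np.exp(-eta) * x + (1 - np.exp(-eta)) * a) / eta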

An immediate consequence of Theorem 6.8 combined with our oracle reductions is that all of the candidate \(\textsf {MOT}\) algorithms described in Sect. 4 can be efficiently implemented for \(\textsf {MOT}\) problems with set-optimization structure. From a theoretical perspective, the \(\texttt {ELLIPSOID}\) algorithm gives the best guarantee since it produces an exact, sparse solution in polynomial time.

Corollary 6.9

(Polynomial-time algorithms for \(\textsf {MOT}\) problems with set-optimization structure) Let \(C \in (\mathbb {R}^n)^{\otimes k}\) be a cost tensor that has set-optimization structure with \(\mathrm {poly}(n,k)\) complexity (see Definition 6.5).

  • Exact set-optimization structure. The \(\texttt {ELLIPSOID}\) algorithm in Sect. 4.1 computes an exact solution to \(\textsf {MOT}_C\) in \(\mathrm {poly}(n,k)\) time. Also, the \(\texttt {COLGEN}\) algorithm in Sect. 4.1.3 can be run for T iterations in \(\mathrm {poly}(n,k,T)\) time.

  • Approximate set-optimization structure. The \(\texttt {MWU}\) algorithm in Sect. 4.2 computes an \(\varepsilon \)-approximate solution to \(\textsf {MOT}_C\) in \(\mathrm {poly}(n,k,C_{\max }/\varepsilon )\) time.

  • Soft set-optimization structure. The \(\texttt {SINKHORN}\) algorithm in Sect. 4.3 computes an \(\varepsilon \)-approximate solution to \(\textsf {MOT}_C\) in \(\mathrm {poly}(n,k,C_{\max }/\varepsilon )\) time.

Moreover, \(\texttt {ELLIPSOID}\), \(\texttt {MWU}\), and \(\texttt {COLGEN}\) output a polynomially sparse tensor, whereas \(\texttt {SINKHORN}\) outputs a fully dense tensor through the implicit representation described in Sect. 4.3.1.

Proof

Combine the polynomial-time implementations of the oracles in Theorem 6.8 with the polynomial-time algorithm-to-oracle reductions in Theorems 4.1, 4.6, 4.7, and 4.18, respectively. \(\square \)

6.3 Application vignette: Network reliability with correlations

In this section, we illustrate this class of \(\textsf {MOT}\) structures via an application to network reliability, a central topic in network science, engineering, and operations research, see e.g., the textbooks [12, 13, 41]. The basic network reliability question is: given an undirected graph \(G = (V,E)\) where each edge \(e \in E\) is reliable with some probability \(q_e\) and fails with probability \(1-q_e\), what is the probability that all vertices are reachable from all others? This connectivity is desirable in applications, e.g., if G is a computer cluster, the vertices are the machines, and the edges are communication links, then connectivity corresponds to the reachability of all machines. See the aforementioned textbooks for many other applications.

Of course, the above network reliability question is not yet well-defined since the edge failures are only prescribed up to their marginal distributions; the choice of joint distribution greatly impacts the answer.

The most classical setup posits that edge failures are independent [63]. Denote the network reliability probability for this setting by \(\rho ^{\mathrm {ind}}\). This quantity \(\rho ^{\mathrm {ind}}\) is \(\#\)P-complete [72, 85] and thus \(\mathsf {NP}\)-hard to compute, but there exist fully polynomial randomized approximation schemes (a.k.a. FPRAS) for multiplicatively approximating both the connection probability \(\rho ^{\mathrm {ind}}\) [52] and the failure probability \(1 - \rho ^{\mathrm {ind}}\) [45].

Here we investigate the setting of coordinated edge failures, which dates back to the 1980s [89, 93]. This coordination may optimize for disconnection (e.g., by an adversary) or for connection (e.g., to maximize the time a network is connected while performing maintenance on each edge e during a \(1 - q_e\) fraction of the time). We define these notions below; see also Fig. 5 for an illustration. Below, \({{\,\mathrm{Ber}\,}}(q_e)\) denotes the Bernoulli distribution with parameter \(q_e\).

Definition 6.10

(Network reliability with correlations) For an undirected graph \(G = (V,E)\) and edge reliability probabilities \(\{q_e\}_{e \in E}\):

  • The worst-case network reliability is

    $$\begin{aligned} \rho ^{\min }:= \min _{P \in \mathcal {M}(\{{{\,\mathrm{Ber}\,}}(q_e)\}_{e \in E})} \mathbb {P}_{H \sim P}\left[ H\text { is a connected subgraph of } G\right] . \end{aligned}$$
  • The best-case network reliability is

    $$\begin{aligned} \rho ^{\max }:= \max _{P \in \mathcal {M}(\{{{\,\mathrm{Ber}\,}}(q_e)\}_{e \in E})} \mathbb {P}_{H \sim P}[H\text { is a connected subgraph of }G]. \end{aligned}$$
Fig. 5

Optimal decompositions for the worst-case (top) and best-case (bottom) reliability problems on the same graph G and edge reliability probabilities \(q_e\) (left). Coordinating edge failures yields significantly different connection probabilities: \(\rho ^{\min }= 40\%\), \(\rho ^{\mathrm {ind}}\approx 60\%\), and \(\rho ^{\max }= 90\%\)

Clearly \(\rho ^{\min }\leqslant \rho ^{\mathrm {ind}}\leqslant \rho ^{\max }\). These gaps can be large (e.g., see Fig. 5), which promises large opportunities for applications in which coordination is possible. However, in order to realize such an opportunity requires being able to compute \(\rho ^{\min }\) and \(\rho ^{\max }\), and both of these problems require solving an exponentially large LP with \(2^{|E|}\) variables. Below we show how to use set-optimization structure to compute these quantities in \(\mathrm {poly}(|E|)\) time, thereby recovering as a special case of our general framework the known polynomial-time algorithms for this particular problem in [89, 93], as well as more practical polynomial-time algorithms that scale to input sizes that are an order-of-magnitude larger.

Corollary 6.11

(Polynomial-time algorithm for network reliability with correlations) The worst-case and best-case network reliability can both be computed in \(\mathrm {poly}(|E|)\) time.

Proof

By the observation in Example 6.1, the optimization problems defining \(\rho ^{\min }\) and \(1 - \rho ^{\max }\) are instances of \(\textsf {MOT}\) in which \(k=|E|\), \(n=2\), \(\mu _e = {{\,\mathrm{Ber}\,}}(q_e)\), and each entry of the \(\{0,1\}\)-valued cost tensor \(C \in (\mathbb {R}^2)^{\otimes |E|}\) is the indicator of whether the corresponding subset of edges is a connected or disconnected subgraph of G, respectively. It therefore suffices to show that both of these \(\textsf {MOT}\) cost tensors satisfy exact set-optimization structure (Definition 6.5) since that implies a polynomial-time algorithm for exactly solving \(\textsf {MOT}\) (Corollary 6.9).

Set-optimization structure for \(1-\rho ^{\max }\). In this case, S is the set of connected subgraphs of G. Thus the \(\textsf {MIN}_{C,S}\) problem is: given weights \(p \in \mathbb {R}^{2 \times |E|}\), compute

$$\begin{aligned} \min _{\text {connected subgraph } H \text { of }G} - \sum _{e \in H} p_{2,e} - \sum _{e \notin H} p_{1,e}. \end{aligned}$$

Note that this objective is equal to \(\sum _{e \in H} x_e - \sum _{e \in E} p_{1,e}\) where \(x_e := p_{1,e} - p_{2,e}\). Since the latter sum is independent of H, the \(\textsf {MIN}_{C,S}\) problem therefore reduces to the problem of finding a minimum-weight connected subgraph in G; that is, given edge weights \(x \in \mathbb {R}^{|E|}\), compute

$$\begin{aligned} \min _{\text {connected subgraph } H \text { of }G} \sum _{e \in H} x_e. \end{aligned}$$
(6.4)

We first show how to solve this in polynomial time in the case that all edge weights \(x_e\) are positive. In this case, the optimal solution H is a minimum-weight spanning tree of G. This can be found by Kruskal’s algorithm in \(O(|E| \log |E|)\) time [56].

For the general case of arbitrary edge weights, note that the edges e with non-positive weight \(x_e \leqslant 0\) can be added to any solution without worsening the cost or feasibility. Thus these edges are without loss of generality included in every solution H, and so it suffices to solve the same problem (6.4) on the graph \(G'\) obtained by contracting these non-positively-weighted edges in G. This reduces (6.4) to the same problem of finding a minimum-weight connected subgraph, except now in the special case that all edge weights are positive. Since we have already shown how to solve this case in polynomial time, the proof is complete.
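This reduction is straightforward to implement: seed a union-find structure with the non-positive edges (which are taken for free), then run Kruskal on the remaining positive edges. A minimal sketch, assuming the graph is given as a vertex count and a list of (u, v, x_e) triples and that G itself is connected:

def min_connected_subgraph(num_vertices, edges):
    # min over connected spanning subgraphs H of G of Σ_{e ∈ H} x_e.
    parent = list(range(num_vertices))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra == rb:
            return False
        parent[ra] = rb
        return True

    total = 0.0
    for u, v, w in edges:        # take every non-positive edge for free
        if w <= 0:
            total += w
            union(u, v)
    positive = sorted((e for e in edges if e[2] > 0), key=lambda e: e[2])
    for u, v, w in positive:     # Kruskal on the contracted graph
        if union(u, v):
            total += w
    return total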

Set-optimization structure for \(\rho ^{\min }\). In this case, S is the set of disconnected subgraphs of G. We may simplify the \(\textsf {MIN}_{C,S}\) problem for \(\rho ^{\min }\) by re-parameterizing the input \(p \in \mathbb {R}^{2 \times |E|}\) to edge weights \(x \in \mathbb {R}^{|E|}\) as done above in (6.4) for \(1-\rho ^{\max }\). Thus the \(\textsf {MIN}_{C,S}\) problem for \(\rho ^{\min }\) is: given weights \(x \in \mathbb {R}^{|E|}\), compute

$$\begin{aligned} \min _{\text {disconnected subgraph } H \text { of }G} \sum _{e \in H} x_e. \end{aligned}$$
(6.5)

We first show how to solve this in the case that all edge weights \(x_e\) are negative. In that case, the optimal solution is of the form \(H = E \setminus C\), where C is a maximum-weight cut of the graph G with weights \(x_e\). Equivalently, by negating all edge weights, C is a minimum-weight cut of the graph G with weights \(-x_e\). Since a minimum-weight cut of a graph with positive weights can be found in polynomial time [81], the problem (6.5) can be solved in polynomial time when all \(x_e\) are negative.

Now in the general case of arbitrary edge weights, note that the edges e with non-negative weight \(x_e \geqslant 0\) can be removed from any solution without worsening the cost or feasibility. Thus these edges are without loss of generality excluded from every solution H, and so it suffices to solve the same problem (6.5) on the graph \(G'\) obtained by deleting these non-negatively-weighted edges in G. This reduces (6.5) to the same problem of finding a minimum-weight disconnected subgraph, except now in the special case that all edge weights are negative. Since we have already shown how to solve this case in polynomial time, the proof is complete. \(\square \)
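The oracle for \(\rho ^{\min }\) admits a similarly short sketch, here using the Stoer–Wagner global minimum cut routine from the networkx library (same input conventions as above); the early return handles the case where the negative-weight edges alone already form a disconnected spanning subgraph.

import networkx as nx

def min_disconnected_subgraph(num_vertices, edges):
    # min over disconnected spanning subgraphs H of G of Σ_{e ∈ H} x_e.
    G = nx.Graph()
    G.add_nodes_from(range(num_vertices))
    negative = [(u, v, w) for (u, v, w) in edges if w < 0]
    total = sum(w for _, _, w in negative)
    G.add_weighted_edges_from((u, v, -w) for (u, v, w) in negative)
    if not nx.is_connected(G):
        return total                   # all negative edges already disconnect G
    cut_value, _ = nx.stoer_wagner(G)  # min cut w.r.t. the weights -x_e > 0
    return total + cut_value           # drop the cut edges from H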

Fig. 6

Top: comparison of the runtime (left) and accuracy (right) of the algorithms described in the main text, for the worst-case reliability of a clique graph on t vertices and \(k = \left( {\begin{array}{c}t\\ 2\end{array}}\right) \) edges with reliability probabilities \(q_e = 0.99\). Bottom: same, but for best-case reliability and reliability probabilities \(q_e = 0.01\). For worst-case reliability, the algorithms compute an upper bound, so a smaller value is better; the reverse holds for best-case reliability. The algorithms are cut off at 2 minutes, denoted by an “x”. \(\texttt {SINKHORN}\) is run at the highest precision (i.e., highest \(\eta \)) before numerical precision issues. The \(\texttt {COLGEN}\) algorithm that our framework recovers computes exact solutions an order-of-magnitude faster than the other algorithms, and the new \(\texttt {MWU}\) algorithm computes reasonably approximate solutions for \(k = 400\), which amounts to an \(\textsf {MOT}\) LP with \(n^k = 2^{400} \approx 2.6 \times 10^{120}\) variables

In Fig. 6, we compare the numerical performance of the algorithms in Corollary 6.11 (\(\texttt {COLGEN}\) and \(\texttt {MWU}\), with polynomial-time implementations of their bottlenecks) against the fastest previous algorithms for both best-case and worst-case network reliability. Previously, the fastest algorithms that apply to this problem are (1) out-of-the-box LP solvers run on \(\textsf {MOT}\), (2) the brute-force implementation of \(\texttt {SINKHORN}\) which marginalizes over all \(n^k = 2^{|E|}\) entries in each iteration, and (3) the \(\texttt {COLGEN}\) algorithm that we recover [89, 93]. It is unknown whether there is a practically efficient implementation of the \(\textsf {SMIN}_{C,S}\) oracle (and thus of \(\texttt {SINKHORN}\)) for either best-case or worst-case reliability. Since the previous algorithms (1) and (2) have exponential runtime that scales as \(n^{\varOmega (k)} = 2^{\varOmega (|E|)}\), they do not scale past tiny input sizes. In contrast, the algorithms in Corollary 6.11 scale to much larger inputs. Indeed, the \(\texttt {COLGEN}\) algorithm that our framework recovers can compute exact solutions roughly an order-of-magnitude faster than the other algorithms, and the new \(\texttt {MWU}\) algorithm computes reasonably approximate solutions beyond \(k = 400\), which amounts to an \(\textsf {MOT}\) LP with \(n^k = 2^{400} \approx 2.6 \times 10^{120}\) variables.

7 Application: \(\textsf {MOT}\) problems with low-rank plus sparse structure

In this section, we consider \(\textsf {MOT}\) problems whose cost tensors C decompose into low-rank and sparse components. We propose the first polynomial-time algorithms for this general class of \(\textsf {MOT}\) problems.

The section is organized as follows. In Sect. 7.1 we formally describe this setup and discuss why it is incomparable to all other structures discussed in this paper. In Sect. 7.2, we show that for costs with this structure, the \(\textsf {AMIN}\) and \(\textsf {SMIN}\) oracles can be implemented in polynomial time; from this it immediately follows that \(\texttt {MWU}\) and \(\texttt {SINKHORN}\) can be implemented in polynomial time. Finally, in Sect. 7.3 and Sect. 7.4, we provide two illustrative applications of these algorithms. The former regards portfolio risk management and is a direct application of our result for \(\textsf {MOT}\) with low-rank cost tensors. The latter regards projecting mixture distributions onto the transportation polytope and illustrates the versatility of our algorithmic results since this problem is quadratic optimization over the transportation polytope rather than linear (a.k.a. \(\textsf {MOT}\)).

7.1 Setup

We begin by recalling the definition of tensor rank. It is the direct analog of the standard concept of matrix rank. See the survey [54] for further background.

Definition 7.1

(Tensor rank) A rank-r factorization of a tensor \(R \in (\mathbb {R}^n)^{\otimes k}\) is a collection of rk vectors \(\{u_{i,\ell }\}_{i \in [k], \ell \in [r]} \subset \mathbb {R}^n\) satisfying

$$\begin{aligned} R = \sum _{\ell =1}^r \bigotimes _{i=1}^k u_{i,\ell }. \end{aligned}$$

The rank of a tensor is the minimal r for which there exists a rank-r factorization.

In this section we consider \(\textsf {MOT}\) problems with the following “low-rank plus sparse” structure.

Definition 7.2

(Low-rank plus sparse structure for \({{\textsf {\textit{MOT}}}}\) ) An \(\textsf {MOT}\) cost tensor \(C \in (\mathbb {R}^n)^{\otimes k}\) has low-rank plus sparse structure of rank r and sparsity s if it decomposes as

$$\begin{aligned} C = R + S, \end{aligned}$$
(7.1)

where R is a rank-r tensor and S is an s-sparse tensor.

Throughout, we make the natural assumption that S is input through its s non-zero entries, and that R is input through a rank-r factorization. We also make the natural assumption that the entries of both R and S are of size \(O(C_{\max })\)—this rules out the case of having extremely large entries of R and S, one positive and one negative, which cancel to yield a small entry of \(C = R + S\).

Remark 7.3

(Neither low-rank structure nor sparse structure can be modeled by graphical structure or set-optimization structure) In general, both rank-1 costs and polynomially sparse costs do not have non-trivial graphical structure. Specifically, modeling these costs with graphical structure requires the complete graph (a.k.a., maximal treewidth of \(k-1\))—and because \(\textsf {MOT}\) problems with graphical structure of treewidth \(k-1\) are \(\mathsf {NP}\)-hard to solve in the absence of further structure [6], modeling such problems with graphical structure is useless for the purpose of designing polynomial-time \(\textsf {MOT}\) algorithms. It is also clear that neither low-rank structure nor sparse structure can be modeled by set-optimization structure because in general, neither R nor S nor \(R+S\) has binary-valued entries.

7.2 Polynomial-time algorithms

From a technical perspective, the main result of this section is that there is a polynomial-time algorithm for approximating the minimum entry of a tensor that decomposes into constant-rank and sparse components. Previously, this was not known even for constant-rank tensors. This result may be of independent interest. We remark that this result is optimal in the sense that unless \(\mathsf {P}= \mathsf {NP}\), there does not exist an algorithm with runtime that is jointly polynomial in the input size and the rank r [6].

Theorem 7.4

(Polynomial-time algorithm solving \(\textsf {AMIN} \) and \(\textsf {SMIN} \) for low-rank + sparse costs) Consider cost tensors \(C \in (\mathbb {R}^n)^{\otimes k}\) that have low-rank plus sparse structure of rank r and sparsity s (see Definition 7.2). For any fixed r, Algorithm 8 runs in \(\mathrm {poly}(n, k, s, C_{\max }/\varepsilon )\) time and solves the \(\varepsilon \)-approximate \(\textsf {AMIN}_C\) oracle. Furthermore, it also solves the \(\textsf {SMIN}_{\tilde{C}}\) oracle for \(\eta = (2k \log n)/\varepsilon \) on some cost tensor \(\tilde{C} \in (\mathbb {R}^n)^{\otimes k}\) satisfying \(\Vert C - \tilde{C}\Vert _{\max } \leqslant \varepsilon /2\).

We make three remarks about Theorem 7.4. First, we are unaware of any polynomial-time implementation of \(\textsf {SMIN}_C\) for the cost C. Instead, Theorem 7.4 solves the \(\textsf {SMIN}_{\tilde{C}}\) oracle for an \(O(\varepsilon )\)-approximate cost tensor \(\tilde{C}\) since this is sufficient for implementing \(\texttt {SINKHORN}\) on the original cost tensor C (see Corollary 7.5 below). Second, it is an interesting open question whether the \(\mathrm {poly}(n,k,C_{\max }/\varepsilon )\) runtime for the \(\varepsilon \)-approximate \(\textsf {AMIN}_C\) oracle can be improved to \(\mathrm {poly}(n,k,\log (C_{\max }/\varepsilon ))\), as this would imply a \(\mathrm {poly}(n,k)\) runtime for the \(\textsf {MIN}_C\) oracle and thus for this class of \(\textsf {MOT}\) problems (see also Footnote 4 in the introduction). Third, regarding practical efficiency: the runtime of Algorithm 8 is not just polynomial in s and n, but in fact linear in s and near-linear in n. However, since this improved runtime is not needed for the theoretical results in the sequel, we do not pursue this further.

Combining the efficient oracle implementations in Theorem 7.4 with our algorithm-to-oracle reductions in Sect. 4 implies the first polynomial-time algorithms for \(\textsf {MOT}\) problems with costs that have constant-rank plus sparse structure. This is optimal in the sense that unless \(\mathsf {P}= \mathsf {NP}\), there does not exist an algorithm with runtime that is jointly polynomial in the input size and the rank r [6].

Corollary 7.5

(Polynomial-time algorithms solving \(\textsf {MOT} \) for low-rank + sparse costs) Consider cost tensors \(C \in (\mathbb {R}^n)^{\otimes k}\) that have low-rank plus sparse structure of constant rank r and \(\mathrm {poly}(n,k)\) sparsity s (see Definition 7.2). For any \(\varepsilon > 0\):

  • The \(\texttt {MWU}\) algorithm in Sect. 4.2 computes an \(\varepsilon \)-approximate solution to \(\textsf {MOT}_C\) in \(\mathrm {poly}(n,k,C_{\max }/\varepsilon )\) time.

  • The \(\texttt {SINKHORN}\) algorithm in Sect. 4.3 computes an \(\varepsilon \)-approximate solution to \(\textsf {MOT}_C\) in \(\mathrm {poly}(n,k,C_{\max }/\varepsilon )\) time.

Moreover, \(\texttt {MWU}\) outputs a polynomially sparse tensor, whereas \(\texttt {SINKHORN}\) outputs a fully dense tensor through the implicit representation described in Sect. 4.3.1.

Proof

For \(\texttt {MWU}\), simply combine the polynomial-time reduction to the \(\textsf {AMIN}_C\) oracle (Theorem 4.7) with the polynomial-time algorithm for the \(\textsf {AMIN}\) oracle (Theorem 7.4). For \(\texttt {SINKHORN}\), combining the polynomial-time reduction to the \(\textsf {SMIN}_{\tilde{C}}\) oracle (Theorem 4.18) with the polynomial-time algorithm for the \(\textsf {SMIN}_{\tilde{C}}\) oracle (Theorem 7.4) yields a \(\mathrm {poly}(n,k,C_{\max }/\varepsilon )\) algorithm for \(\varepsilon /2\)-approximating the \(\textsf {MOT}\) problem with cost tensor \(\tilde{C}\). It therefore suffices to show that the values of the \(\textsf {MOT}\) problems with cost tensors C and \(\tilde{C}\) differ by at most \(\varepsilon /2\), that is,

$$\begin{aligned} \left|\min _{P \in \mathcal {M}(\mu _1,\dots ,\mu _k)} \langle P, C \rangle - \min _{P \in \mathcal {M}(\mu _1,\dots ,\mu _k)} \langle P, \tilde{C} \rangle \right| \leqslant \varepsilon /2. \end{aligned}$$

But this holds because both \(\textsf {MOT}\) problems have the same feasible set, and for any feasible \(P \in \mathcal {M}(\mu _1,\dots ,\mu _k)\) it follows from Hölder’s inequality that the objectives of the two \(\textsf {MOT}\) problems differ by at most

$$\begin{aligned} \left|\langle P, C \rangle - \langle P, \tilde{C} \rangle \right| \leqslant \Vert P\Vert _1 \Vert C - \tilde{C}\Vert _{\max } \leqslant \varepsilon /2. \end{aligned}$$

\(\square \)

Below, we describe the algorithm in Theorem 7.4. Specifically, in Sect. 7.2.1, we give four helper lemmas which form the technical core of our algorithm; and then in Sect. 7.2.2, we combine these ingredients to design the algorithm and prove its correctness. Throughout, recall that we use the bracket notation f[A] to denote the entrywise application of a univariate function f (e.g., \(\exp \), \(\log \), or a polynomial) to A.

7.2.1 Technical ingredients

At a high level, our approach to designing the algorithm in Theorem 7.4 is to approximately compute the \(\textsf {SMIN}\) oracle in polynomial time by synthesizing four facts:

  1. By expanding the softmin and performing simple operations, it suffices to compute the total sum of all \(n^k\) entries of the entrywise exponentiated tensor \(\exp [-\eta R]\) (modulo simple transforms).

  2. Although \(\exp [-\eta R]\) is in general a full-rank tensor, we can exploit the fact that R is a low-rank tensor in order to approximate \(\exp [-\eta R]\) by a low-rank tensor L. (Moreover, we can efficiently compute a low-rank factorization of L in closed form.)

  3. There is a simple algorithm for computing the sum of all \(n^k\) entries of L in polynomial time because L is low-rank. (And thus we may approximate the sum of all \(n^k\) entries of \(\exp [-\eta R]\) as desired in step 1.)

  4. This approximation is sufficient for computing both the \(\textsf {AMIN}\) and \(\textsf {SMIN}\) oracles in Theorem 7.4.

Of these four steps, the main technical step is the low-rank approximation in step two. Below, we formalize these four steps individually in Lemmas 7.67.77.8, and 7.9. Further detail on how to synthesize these four steps is then provided afterwards, in the proof of Theorem 7.4.

It is convenient to write the first lemma in terms of an approximate tensor \(\tilde{C} = \tilde{R}+S\) rather than the original cost \(C = R + S\).

Lemma 7.6

(Softmin for cost with sparse component) Let \(\tilde{C} = \tilde{R} + S\) and \(p_1,\dots ,p_k \in \mathbb {R}^n\). Then

$$\begin{aligned} \mathop {{{\,\mathrm{{\textstyle {smin_\eta }}}\,}}}\limits _{\vec {j}\in [n]^k} \tilde{C}_{\vec {j}} - \sum _{i=1}^k [p_i]_{j_i} = -\eta ^{-1} \log (a + b), \end{aligned}$$

where \(d_i := \exp [\eta p_i] \in \mathbb {R}_{\geqslant 0}^n\),

$$\begin{aligned} a := \sum _{\begin{array}{c} \vec {j}\in [n]^k \\ \text {s.t. } S_{\vec {j}} \ne 0 \end{array}} \prod _{i=1}^k [d_i]_{j_i} \cdot e^{-\eta \tilde{R}_{\vec {j}}} \cdot (e^{-\eta S_{\vec {j}}} - 1) \end{aligned}$$
(7.2)

and

$$\begin{aligned} b := \sum _{\vec {j}\in [n]^k} \prod _{i=1}^k [d_i]_{j_i} \cdot e^{-\eta \tilde{R}_{\vec {j}}}. \end{aligned}$$
(7.3)

Proof

By expanding the definition of softmin, and then using \([d_i]_{j_i} = e^{\eta [p_i]_{j_i}}\) and \(\tilde{C} = \tilde{R} + S\),

$$\begin{aligned} {\mathop {{{\,\mathrm{{\textstyle {smin_\eta }}}\,}}}\limits _{\vec {j}\in [n]^k}} \tilde{C}_{\vec {j}} - \sum _{i=1}^k [p_i]_{j_i}&= -\frac{1}{\eta } \log \left( \sum _{\vec {j}\in [n]^k} e^{\eta \sum _{i=1}^k [p_i]_{j_i}} e^{-\eta \tilde{C}_{\vec {j}}} \right) \\&= -\frac{1}{\eta } \log \left( \sum _{\vec {j}\in [n]^k} \prod _{i=1}^k [d_i]_{j_i} \cdot e^{-\eta \tilde{R}_{\vec {j}}} e^{-\eta S_{\vec {j}}} \right) . \end{aligned}$$

By simple manipulations, we conclude that the above quantity is equal to the desired quantity:

$$\begin{aligned} \cdots =&-\frac{1}{\eta } \log \left( \sum _{\begin{array}{c} \vec {j}\in [n]^k \\ \text {s.t. } S_{\vec {j}} \ne 0 \end{array}} \prod _{i=1}^k [d_i]_{j_i} \cdot e^{-\eta \tilde{R}_{\vec {j}}} e^{-\eta S_{\vec {j}}} + \sum _{\begin{array}{c} \vec {j}\in [n]^k \\ \text {s.t. } S_{\vec {j}} = 0 \end{array}} \prod _{i=1}^k [d_i]_{j_i} \cdot e^{-\eta \tilde{R}_{\vec {j}}} \right) \\ =&-\frac{1}{\eta } \log \left( \sum _{\begin{array}{c} \vec {j}\in [n]^k \\ \text {s.t. } S_{\vec {j}} \ne 0 \end{array}} \prod _{i=1}^k [d_i]_{j_i} \cdot e^{-\eta \tilde{R}_{\vec {j}}} \left( e^{-\eta S_{\vec {j}}} - 1\right) + \sum _{\vec {j}\in [n]^k} \prod _{i=1}^k [d_i]_{j_i} \cdot e^{-\eta \tilde{R}_{\vec {j}}} \right) \\ =&- \frac{1}{\eta } \log (a + b). \end{aligned}$$

Above, the first step is by partitioning the sum over \(\vec {j}\in [n]^k\) based on if \(S_{\vec {j}} = 0\), the second step is by adding and subtracting \(\sum _{\vec {j}\in [n]^k \text { s.t. } S_{\vec {j}} \ne 0} \prod _{i=1}^k [d_i]_{j_i} \cdot e^{-\eta \tilde{R}_{\vec {j}}}\), and the last step is by definition of a and b. \(\square \)

Lemma 7.7

(Low-rank approximation of the exponential of a low-rank tensor) There is an algorithm that given \(R \in (\mathbb {R}^n)^{\otimes k}\) in rank-r factored form, \(\eta > 0\), and a precision \(\tilde{\varepsilon }< e^{-\eta R_{\max }}\), takes \(n \cdot \mathrm {poly}(k,\tilde{r})\) time to compute a rank-\(\tilde{r}\) tensor \(L \in (\mathbb {R}^n)^{\otimes k}\) in factored form satisfying \(\Vert L - \exp [-\eta R]\Vert _{\max } \leqslant \tilde{\varepsilon }\), where

$$\begin{aligned} \tilde{r} \leqslant \left( {\begin{array}{c}r + O(\log \tfrac{1}{\tilde{\varepsilon }})\\ r\end{array}}\right) . \end{aligned}$$
(7.4)

Proof

By classical results from approximation theory (see, e.g., [83]), there exists a polynomial q of degree \(m = O(\log 1/\tilde{\varepsilon })\) satisfying

$$\begin{aligned} \left|\exp (-\eta x) - q(x)\right| \leqslant \tilde{\varepsilon }, \qquad \forall x \in [-R_{\max },R_{\max }]. \end{aligned}$$

For instance, the Taylor or Chebyshev expansion of \(x \mapsto \exp (-\eta x)\) suffices. Thus the tensor L with entries

$$\begin{aligned} L_{\vec {j}} = q(R_{\vec {j}}) \end{aligned}$$

approximates \(\exp [-\eta R]\) to error

$$\begin{aligned} \Vert L - \exp [-\eta R]\Vert _{\max } \leqslant \tilde{\varepsilon }. \end{aligned}$$

We now show that L has rank \(\tilde{r} \leqslant \left( {\begin{array}{c}r + m \\ r\end{array}}\right) \), and moreover that a rank-\(\tilde{r}\) factorization can be computed in \(n \cdot \mathrm {poly}(k,\tilde{r})\) time. Denote \(q(x) = \sum _{t=0}^m a_t x^t\) and \(R = \sum _{\ell =1}^r \otimes _{i=1}^k u_{i,\ell }\). By definition of L, definition of q and R, and then the Multinomial Theorem,

$$\begin{aligned} L_{\vec {j}} = q(R_{\vec {j}}) = \sum _{t=0}^m a_t \left( \sum _{\ell =1}^r \prod _{i=1}^k [u_{i,\ell }]_{j_i} \right) ^t = \sum _{\alpha \in \mathbb {N}_0^r \, : \, |\alpha | \leqslant m} \left( {\begin{array}{c}|\alpha |\\ \alpha \end{array}}\right) a_{|\alpha |} \prod _{\ell =1}^r \prod _{i=1}^k [u_{i,\ell }]_{j_i}^{\alpha _\ell }, \end{aligned}$$

where the sum is over r-tuples \(\alpha \) with non-negative entries summing to at most m. Thus

$$\begin{aligned} L = \sum _{\alpha \in \mathbb {N}_0^r \, : \, |\alpha | \leqslant m} \bigotimes _{i=1}^kv_{i,\alpha }, \end{aligned}$$

where \(v_{i,\alpha } \in \mathbb {R}^n\) denotes the vector with j-th entry \( \left( {\begin{array}{c}|\alpha |\\ \alpha \end{array}}\right) a_{|\alpha |} \prod _{\ell =1}^r [u_{i,\ell }]_j^{\alpha _\ell }\) for \(i=1\), and \(\prod _{\ell =1}^r [u_{i,\ell }]_j^{\alpha _\ell }\) for \(i > 1\). This yields the desired low-rank factorization of L because

$$\begin{aligned} \tilde{r} \leqslant \# \{\alpha \in \mathbb {N}_0^r \; : \; |\alpha | \leqslant m \} = \left( {\begin{array}{c}r+m\\ r\end{array}}\right) . \end{aligned}$$

Finally, since each of the \(k\tilde{r}\) vectors \(v_{i,\alpha }\) in the factorization of L can be computed efficiently from the closed-form expression above, the desired runtime follows. \(\square \)
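The factorization constructed in this proof is explicit enough to implement directly. Below is a Python sketch of the construction (a hypothetical helper, not code from the paper): given the rank-r factors \(u_{i,\ell }\) as an array U of shape (k, r, n) and the coefficients of q, it returns the factors \(v_{i,\alpha }\), with the scalar \(\left( {\begin{array}{c}|\alpha |\\ \alpha \end{array}}\right) a_{|\alpha |}\) absorbed into the \(i=1\) factor as in the proof:

```python
import itertools
import math
import numpy as np

def lowrank_poly(U, coeffs):
    """Rank-r_tilde factors of L = q(R), where R = sum_l u_{1,l} x ... x u_{k,l}
    is given by U of shape (k, r, n) and q(x) = sum_t coeffs[t] * x^t.
    Returns V of shape (k, r_tilde, n) with L = sum_a V[0,a] x ... x V[k-1,a],
    following the multinomial expansion in the proof of Lemma 7.7."""
    k, r, n = U.shape
    m = len(coeffs) - 1
    alphas = [a for a in itertools.product(range(m + 1), repeat=r)
              if sum(a) <= m]          # exactly binom(r+m, r) tuples
    V = np.empty((k, len(alphas), n))
    for idx, alpha in enumerate(alphas):
        t = sum(alpha)
        multinom = math.factorial(t) // math.prod(math.factorial(a) for a in alpha)
        for i in range(k):
            v = np.prod([U[i, l] ** alpha[l] for l in range(r)], axis=0)
            if i == 0:   # absorb the scalar coefficient into the first factor
                v = multinom * coeffs[t] * v
            V[i, idx] = v
    return V

# Sanity check against brute force on tiny sizes (k = 3 here).
rng = np.random.default_rng(1)
k, r, n = 3, 2, 4
U = 0.3 * rng.standard_normal((k, r, n))
coeffs = [1.0, -1.0, 0.5]   # e.g., a degree-2 Taylor approximation of exp(-x)
V = lowrank_poly(U, coeffs)
R = sum(np.einsum('i,j,k->ijk', *U[:, l]) for l in range(r))
L = sum(np.einsum('i,j,k->ijk', *V[:, a]) for a in range(V.shape[1]))
assert np.allclose(L, sum(c * R ** t for t, c in enumerate(coeffs)))
```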

Lemma 7.8

(Marginalizing a scaled low-rank tensor) Given vectors \(d_1, \dots , d_k \in \mathbb {R}^n\) and a tensor \(L \in (\mathbb {R}^n)^{\otimes k}\) through a rank-\(\tilde{r}\) factorization, we can compute \(m((\otimes _{i=1}^k d_i) \odot L)\) in \(O(nk\tilde{r})\) time.

Proof

Denote the factorization of L by \(L = \sum _{\ell =1}^{\tilde{r}} \otimes _{i=1}^k v_{i,\ell }\). Then

$$\begin{aligned} m((\otimes _{i=1}^k d_i) \odot L )&= \sum _{\vec {j}\in [n]^k} \left[ (\otimes _{i=1}^k d_i) \odot L \right] _{\vec {j}} = \sum _{\vec {j}\in [n]^k} \sum _{\ell =1}^{\tilde{r}} \prod _{i=1}^k [d_i]_{j_i} [v_{i,\ell }]_{j_i} \\&= \sum _{\ell =1}^{\tilde{r}} \prod _{i=1}^k \sum _{j=1}^n [d_i]_{j} [v_{i,\ell }]_{j} = \sum _{\ell =1}^{\tilde{r}} \prod _{i=1}^k \langle d_i, v_{i,\ell } \rangle , \end{aligned}$$

where the first step is by definition of the \(m(\cdot )\) operation that sums over all entries, the second step is by definition of L, and the third step is by swapping products and sums. Thus computing the desired quantity amounts to computing \(\tilde{r}k\) inner products of n-dimensional vectors, which can be done in \(O(nk\tilde{r})\) time. \(\square \)
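This computation is a few lines in practice. A minimal Python sketch (with the same array conventions as the previous snippet) follows; note that it performs exactly the \(\tilde{r}k\) inner products from the proof:

```python
import numpy as np

def marginalize_scaled_lowrank(d, V):
    """m((d_1 x ... x d_k) * L) for L given by factors V of shape
    (k, r_tilde, n) and scalings d of shape (k, n), in O(n*k*r_tilde) time."""
    inner = np.einsum('in,iln->il', d, V)   # <d_i, v_{i,l}> for all i, l
    return inner.prod(axis=0).sum()         # sum_l prod_i <d_i, v_{i,l}>
```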

Lemma 7.9

(Precision of the low-rank approximation) Let \(\varepsilon \leqslant 1\). Suppose \(L \in (\mathbb {R}^n)^{\otimes k}\) satisfies \(\Vert L - \exp [-\eta R]\Vert _{\max } \leqslant \tfrac{\varepsilon }{3} e^{-\eta R_{\max }}\). Then the tensor \(\tilde{C} := - \frac{1}{\eta } \log [L] + S\) satisfies

$$\begin{aligned} \Vert \tilde{C} - C\Vert _{\max } \leqslant \frac{\varepsilon }{2}. \end{aligned}$$
(7.5)

Proof

Observe that the minimum entry of L is at least

$$\begin{aligned} e^{-\eta R_{\max }} - \tfrac{\varepsilon }{3} e^{-\eta R_{\max }} \geqslant \tfrac{2}{3}e^{-\eta R_{\max }}. \end{aligned}$$
(7.6)

Since this is strictly positive, the tensor \(\tilde{R} := -\eta ^{-1} \log [L]\) is well defined. Furthermore,

$$\begin{aligned} \Vert \eta \tilde{R} - \eta R\Vert _{\max } = \max _{\vec {j}\in [n]^k} \left|\eta \tilde{R}_{\vec {j}} - \eta R_{\vec {j}}\right| \leqslant \max _{\vec {j}\in [n]^k} \frac{\left|L_{\vec {j}} - e^{-\eta R_{\vec {j}}}\right|}{\min (L_{\vec {j}}, e^{-\eta R_{\vec {j}}}) } \leqslant \frac{\tfrac{\varepsilon }{3} e^{-\eta R_{\max }}}{\tfrac{2}{3} e^{-\eta R_{\max }}} = \frac{\varepsilon }{2}, \end{aligned}$$

where above the first step is by definition of the max norm; the second step is by the elementary inequality \(|\log x - \log y| \leqslant |x-y|/\min (x,y)\) which holds for positive scalars x and y [4,  Lemma K]; and the third step is by (7.6) and the approximation bound of L. Since \(\eta \geqslant 1\), we therefore conclude that \(\Vert \tilde{R} - R\Vert _{\max } \leqslant \varepsilon /2\). By adding and subtracting S, this implies \(\Vert \tilde{C} - C\Vert _{\max } = \Vert \tilde{R} - R\Vert _{\max } \leqslant \varepsilon /2\). \(\square \)

7.2.2 Proof of Theorem 7.4

We are now ready to state the algorithm in Theorem 7.4. Pseudocode is in Algorithm 8. Note that \(\tilde{R} = -\eta ^{-1} \log [L]\) and \(\tilde{C} = \tilde{R} + S\) are never explicitly computed because in both Lines 3 and 4, the algorithm performs the relevant operations only through the low-rank tensor L and the sparse tensor S.

[Algorithm 8: pseudocode figure not reproduced here.]
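Since the pseudocode figure is not reproduced above, the following Python sketch (hypothetical function names; array conventions as in the previous snippets) assembles Lemmas 7.6–7.8 in the manner described in the proof below: the factorization V of L is precomputed as in Lemma 7.7 (Line 2), the term a is computed by enumerating the non-zero entries of S (Line 3), and the term b is computed by the marginalization of Lemma 7.8 (Line 4):

```python
import numpy as np

def smin_oracle(p, eta, V, S_indices, S_values):
    """Sketch of Algorithm 8 (hypothetical names): SMIN for the approximate
    cost C_tilde = R_tilde + S, where exp(-eta * R_tilde) equals the tensor L
    with precomputed rank-r_tilde factors V (Lemma 7.7, Line 2).
    p has shape (k, n); S is given by its non-zero entries."""
    k = p.shape[0]
    d = np.exp(eta * p)                          # d_i = exp(eta * p_i)
    # Line 3: the correction term a of (7.2); each entry L_j of the factored
    # tensor is evaluated in O(k * r_tilde) time.
    a = 0.0
    for j, s in zip(S_indices, S_values):
        L_j = np.prod([V[i, :, j[i]] for i in range(k)], axis=0).sum()
        d_j = np.prod([d[i, j[i]] for i in range(k)])
        a += d_j * L_j * (np.exp(-eta * s) - 1.0)
    # Line 4: the term b of (7.3), via the marginalization of Lemma 7.8.
    b = np.einsum('in,iln->il', d, V).prod(axis=0).sum()
    return -np.log(a + b) / eta
```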

Proof of Theorem 7.4

Proof of correctness for \({{\textsf {\textit{SMIN}}}}\). Consider any oracle inputs \(p = (p_1, \dots , p_k) \in \bar{\mathbb {R}}^{n \times k}\). By Lemma 7.9, the tensor \(\tilde{C} = \tilde{R}+ S = -\eta ^{-1} \log L + S\) satisfies \(\Vert \tilde{C} - C\Vert _{\max } \leqslant \varepsilon /2\). Therefore it suffices to show that Algorithm 8 correctly computes \(\textsf {SMIN}_{\tilde{C}}(p,\eta )\). This is true because that quantity is equal to \(-\eta ^{-1} \log (a + b)\) by Lemma 7.6.

Proof of correctness for \({{\textsf {\textit{AMIN}}}}\). We have just established that Algorithm 8 computes \(\textsf {SMIN}_{\tilde{C}}(p,\eta )\). Because \(\eta = (2k \log n)/\varepsilon \), and because \(\textsf {SMIN}\) with this choice of \(\eta \) implements \(\textsf {AMIN}\) (Remark 3.6), it follows that \(\textsf {SMIN}_{\tilde{C}}(p,\eta )\) is within additive accuracy \(\varepsilon /2\) of \(\textsf {MIN}_{\tilde{C}}(p)\). Therefore, by the triangle inequality, it suffices to show that \(\textsf {MIN}_{\tilde{C}}(p)\) is within \(\varepsilon /2\) additive accuracy of \(\textsf {MIN}_{C}(p)\). That is, it suffices to show that

$$\begin{aligned} \left|\min _{\vec {j}\in [n]^k} \left( C_{\vec {j}} - \sum _{i=1}^k [p_i]_{j_i}\right) - \min _{\vec {j}\in [n]^k} \left( \tilde{C}_{\vec {j}} - \sum _{i=1}^k [p_i]_{j_i}\right) \right| \leqslant \varepsilon /2. \end{aligned}$$

But this is true because \(\Vert C - \tilde{C}\Vert _{\max } \leqslant \varepsilon /2\) by Lemma 7.9, and thus the quantities \(C_{\vec {j}} - \sum _{i=1}^k [p_i]_{j_i}\) and \(\tilde{C}_{\vec {j}} - \sum _{i=1}^k [p_i]_{j_i}\) are within additive accuracy \(\varepsilon /2\) for each \(\vec {j}\in [n]^k\).

Proof of runtime. We prove the claimed runtime bound simultaneously for the \(\textsf {AMIN}\) and \(\textsf {SMIN}\) computation because we use the same algorithm for both. To this end, we first bound the rank \(\tilde{r}\) of the low-rank approximation L computed in Lemma 7.7. Note that since \(\tilde{\varepsilon } = \tfrac{\varepsilon }{3} e^{-\eta R_{\max }}\) and since it is assumed that \(R_{\max } = O(C_{\max })\), we have \(\log 1/\tilde{\varepsilon } = O(\tfrac{C_{\max }}{\varepsilon }k \log n )\). Therefore

$$\begin{aligned} \tilde{r} \leqslant \left( {\begin{array}{c}r + O(\log 1/\tilde{\varepsilon })\\ r\end{array}}\right) = O(\log 1/\tilde{\varepsilon })^r = O(\tfrac{C_{\max }}{\varepsilon }k \log n )^r = \mathrm {poly}(\log n,k,C_{\max }/\varepsilon ). \end{aligned}$$

Above, the first step is by Lemma 7.7, the second step uses the bound \(\left( {\begin{array}{c}r+m\\ r\end{array}}\right) \leqslant (r+m)^r = O(m)^r\) for constant r (where \(m = O(\log 1/\tilde{\varepsilon })\)), the third step is by the above bound on \(\log 1/\tilde{\varepsilon }\), and the final step is again because r is assumed constant.

Therefore Line 2 in Algorithm 8 takes polynomial time by Lemma 7.7, Line 3 takes polynomial time by simply enumerating over the s non-zero entries of S, and Line 4 takes polynomial time by Lemma 7.8. \(\square \)

7.3 Application vignette: Risk estimation

Here we consider an application to portfolio risk management. For simplicity of exposition, let us first describe the setting of 1 financial instrument (“stock”). Consider investing in one unit of a stock for k years. For \(i \in \{0, \dots , k\}\), let \(X_i\) denote the price of the stock at year i. Suppose that the return \(\rho _i = X_{i}/X_{i-1}\) of the stock between years \(i-1\) and i is believed to follow some distribution \(\rho _i \sim \mu _i\). A fundamental question about the riskiness of this stock is to compute the investor's expected profit in the worst case over all joint probability distributions on future returns \((\rho _1,\dots ,\rho _k)\) that are consistent with the modeled marginal distributions \((\mu _1,\dots ,\mu _k)\). This is an \(\textsf {MOT}\) problem with cost C given by

$$\begin{aligned} C(\rho _1,\dots ,\rho _k) = \prod _{i \in [k]} \rho _i, \end{aligned}$$

where here we view C as a function rather than a tensor for notational simplicity. If each return \(\rho _i\) has n possible values (e.g., after quantization), then the cost C is equivalently represented as a rank-1 tensor in \((\mathbb {R}^n)^{\otimes k}\) (by assigning an index to each of the n possible values of each \(\rho _i\)). Therefore our result Corollary 7.5 provides a polynomial-time algorithm for solving this \(\textsf {MOT}\) problem defining the investor’s worst-case profit.

Rather than formalizing this argument for 1 stock, we directly treat the general case of investing in \(r \geqslant 1\) stocks. This is essentially identical to the case of \(r=1\) stock, modulo additional notation.

Corollary 7.10

(Polynomial-time algorithm for expected profit given marginals on the returns) Suppose an investor holds 1 unit of each of r stocks for k years. For each stock \(\ell \in [r]\) and each year \(i \in [k]\), let \(\rho _{i,\ell }\) denote the relative price of stock \(\ell \) between years \(i-1\) and i. Suppose \(\rho _{i,\ell }\) has distribution \(\mu _{i,\ell }\), and that each \(\mu _{i,\ell }\) has at most n atoms. Let \(R_{\max }= \max _{\{\rho _{i,\ell }\}} \sum _{\ell =1}^r \prod _{i=1}^k \rho _{i,\ell }\) denote the maximal possible return. For any constant number of stocks r, there is a \(\mathrm {poly}(n,k,R_{\max }/\varepsilon )\)-time algorithm for \(\varepsilon \)-approximating the expected profit in the worst case over all futures that are consistent with the returns' marginal distributions.

Proof

This is the optimization problem

$$\begin{aligned} \min _{P \in \mathcal {M}(\{\mu _{i,\ell }\}_{i \in [k], \ell \in [r]} )} \mathbb {E}_{ \{\rho _{i,\ell }\}_{i \in [k], \ell \in [r]} \sim P} \left[ \sum _{\ell =1}^r \prod _{i=1}^k \rho _{i,\ell } \right] \end{aligned}$$

over all joint distributions P on the returns \(\{\rho _{i,\ell }\}_{i \in [k], \ell \in [r]}\) that are consistent with the marginal distributions \(\{\mu _{i,\ell }\}_{i \in [k], \ell \in [r]}\). This is an \(\textsf {MOT}\) problem with \(k' = rk\) marginals, each over n atoms, with cost function

$$\begin{aligned} C\Big (\{ \rho _{i,\ell } \}_{i \in [k], \ell \in [r]}\Big ) = \sum _{\ell ' \in [r]} \prod _{(i,\ell ) \in [k] \times [r] \cong [k']} \big (\rho _{i,\ell } \cdot \mathbb {1}[\ell = \ell '] + \mathbb {1}[\ell \ne \ell ']\big ). \end{aligned}$$
(7.7)

By viewing this cost function C as a cost tensor in the natural way (i.e., assigning an index to each of the n possible values of \(\rho _{i,\ell }\)), this representation (7.7) shows that the corresponding cost tensor \(C \in (\mathbb {R}^n)^{\otimes k'}\) has rank r. Moreover, observe that the maximum entry of the cost is \(R_{\max }\). Therefore we may appeal to our polynomial-time \(\textsf {MOT}\) algorithms in Corollary 7.5 for costs with constant rank. \(\square \)
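For concreteness, the rank-r factorization implicit in (7.7) can be written down directly. The following Python sketch (a hypothetical helper; the ordering of the \(k' = rk\) marginals as pairs \((i,\ell )\) is an arbitrary convention) builds the factors: component \(\ell '\) places the vector of possible values of \(\rho _{i,\ell }\) on the marginals with \(\ell = \ell '\), and the all-ones vector elsewhere:

```python
import numpy as np

def risk_cost_factors(returns):
    """Rank-r factors of the cost tensor (7.7). `returns[i][l]` is the
    length-n array of possible values of rho_{i,l}; the k' = r*k marginals
    are ordered as pairs (i, l). Returns U of shape (k', r, n): component l'
    uses the values of rho_{i,l} on marginals with l == l', and the all-ones
    vector on all other marginals."""
    k, r = len(returns), len(returns[0])
    n = len(returns[0][0])
    U = np.ones((k * r, r, n))
    for i in range(k):
        for l in range(r):
            U[i * r + l, l] = returns[i][l]
    return U
```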

The algorithm readily generalizes, e.g., to the case where the investor holds multiple units of a stock, or holds a stock for a different number of years. The former is modeled simply by adding an extra year in which the return of stock \(\ell \) is equal to the number of units held, with probability 1. The latter is modeled simply by setting the return of stock \(\ell \) to be 1 for all years after it is no longer held, with probability 1.

In Fig. 7, we provide a numerical illustration comparing our new polynomial-time algorithms for this risk estimation task with the previous fastest algorithms. Previously, the fastest algorithms that applied to this problem were out-of-the-box LP solvers run on \(\textsf {MOT}\), and the brute-force implementation of \(\texttt {SINKHORN}\), which marginalizes over all \(n^k\) entries in each iteration. Since both of these algorithms have exponential runtimes that scale as \(n^{\varOmega (k)}\), they do not scale beyond tiny input sizes (e.g., \(n=10\) and \(k=8\)) even with two minutes of computation time. In contrast, our new polynomial-time algorithms compute high-quality solutions for problems that are orders of magnitude larger. For example, our polynomial-time implementation of \(\texttt {SINKHORN}\) takes less than a second to solve an \(\textsf {MOT}\) LP with \(n^k = 10^{30}\) variables.

Details for this numerical experiment: we consider \(r=1\) stock over k timesteps, where each marginal distribution \(\mu _i\) is uniform on \([1,1+1/k]\), discretized with \(n = 10\) atoms. We implement the \(\textsf {AMIN}\) and \(\textsf {SMIN}\) oracles efficiently by using the above algorithm to exploit the rank-1 structure of the cost tensor; in particular, the polynomial approximation of \(\exp [-\eta C]\) used here is the degree-5 Taylor approximation (cf. Lemma 7.7). This lets us run \(\texttt {SINKHORN}\) and \(\texttt {MWU}\) in polynomial time, as described above. In the numerical experiment, we also implement an approximate version of \(\texttt {COLGEN}\) using our polynomial-time implementation of the approximate violation oracle \(\textsf {AMIN}\). Since the algorithms compute an upper bound, a lower value is better in the right plot of Fig. 7. We observe that \(\texttt {MWU}\) yields the loosest approximation for this application, whereas our implementations of \(\texttt {SINKHORN}\) and \(\texttt {COLGEN}\) produce high-quality approximations, as is evident from comparison with the exact LP solver in the regime where the latter is tractable to run.
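For reference, the experimental setup just described amounts to the following data (a sketch; the variable names are illustrative, and the value of \(\eta \) below is a placeholder since the algorithms set it from the target accuracy). These factors and Taylor coefficients can then be fed to the low-rank approximation and oracle sketches given earlier:

```python
import math
import numpy as np

k, n = 30, 10
# Marginals: each mu_i is uniform on [1, 1 + 1/k], discretized to n atoms.
atoms = np.linspace(1.0, 1.0 + 1.0 / k, n)
mus = [np.full(n, 1.0 / n) for _ in range(k)]
# Rank-1 cost factors for C_j = prod_i atoms[j_i]  (r = 1 stock).
U = np.broadcast_to(atoms, (k, 1, n)).copy()
# Degree-5 Taylor coefficients of x -> exp(-eta * x), cf. Lemma 7.7.
eta = 1.0   # placeholder; in the experiments eta depends on the accuracy
coeffs = [(-eta) ** t / math.factorial(t) for t in range(6)]
```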

Fig. 7 (figure not reproduced) Comparison of the runtime (left) and accuracy (right) of the fastest existing algorithms (the naive LP solver and naive \(\texttt {SINKHORN}\), both of which have exponential runtimes that scale as \(n^{\varOmega (k)}\)) with our algorithms (\(\texttt {SINKHORN}\), \(\texttt {MWU}\), and \(\texttt {COLGEN}\) with polynomial-time implementations of their bottlenecks) for the risk estimation problem described in the main text. The algorithms are cut off at 2 minutes, denoted by an “x”. Our new polynomial-time implementation of \(\texttt {SINKHORN}\) returns high-quality solutions for problems that are orders of magnitude larger than previously possible: e.g., it takes less than a second to solve the problem for \(k = 30\), which amounts to an \(\textsf {MOT}\) LP with \(10^{30}\) variables

7.4 Application vignette: Projection to the transportation polytope

Here we consider the fundamental problem of projecting a joint probability distribution Q onto the transportation polytope \(\mathcal {M}(\mu _1,\dots ,\mu _k)\), i.e.,

$$\begin{aligned} {\mathop {{{\,\mathrm{argmin}\,}}}\limits _{P \in \mathcal {M}(\mu _1,\dots ,\mu _k)}} \sum _{\vec {j}} (P_{\vec {j}} - Q_{\vec {j}})^2. \end{aligned}$$
(7.8)

We provide the first polynomial-time algorithm for solving this problem in the case where Q is a distribution that decomposes into a low-rank component plus a sparse component. The low-rank component enables modeling mixtures of product distributions (e.g., mixtures of isotropic Gaussians), which arise frequently in statistics and machine learning; see, e.g., [39]. In such applications, the number of product distributions in the mixture corresponds to the tensor rank. The sparse component further enables modeling arbitrary corruptions to the distribution in polynomially many entries.

We emphasize that this projection problem (7.8) is not an \(\textsf {MOT}\) problem since the objective is quadratic rather than linear. This illustrates the versatility of our algorithmic results. Our algorithm is based on a reduction from quadratic optimization to linear optimization over \(\mathcal {M}(\mu _1,\dots ,\mu _k)\) that is tailored to this problem. Crucial to this reduction is the fact that the \(\textsf {MOT}\) algorithms in Sect. 4 can compute sparse solutions. In particular, this reduction does not work with \(\texttt {SINKHORN}\) because \(\texttt {SINKHORN}\) cannot compute sparse solutions.

Corollary 7.11

(Efficient projection to the transportation polytope) Let \(Q = R + S\in (\mathbb {R}_{\geqslant 0}^n)^{\otimes k}\), where R has constant rank and S is polynomially sparse. Suppose that \(R_{\max }\) and \(S_{\max }\) are O(1). Given R in factored form, S through its non-zero entries, measures \(\mu _1, \dots , \mu _k \in \varDelta _n\), and accuracy \(\varepsilon > 0\), we can compute in \(\mathrm {poly}(n,k,1/\varepsilon )\) time a feasible \(P \in \mathcal {M}(\mu _1,\dots ,\mu _k)\) that has \(\varepsilon \)-suboptimal cost for the projection problem (7.8). This solution P is a sparse tensor output through its \(\mathrm {poly}(n,k,1/\varepsilon )\) non-zero entries.

Proof

We apply the Frank-Wolfe algorithm (a.k.a. conditional gradient descent) to solve (7.8), using approximate LP solutions for the descent direction as in [51, Algorithm 2]. By the known convergence guarantee of this algorithm [51, Theorem 1.1], if each LP is solved to \(\varepsilon ' = O(\varepsilon )\) accuracy, then \(T = O(1/\varepsilon )\) Frank-Wolfe iterations suffice to obtain an \(\varepsilon \)-suboptimal solution to (7.8).

The crux, therefore, is to show that each Frank-Wolfe iteration can be computed efficiently, and that the final solution is sparse. Initialize \(P^{(0)}\) to be an arbitrary vertex of \(\mathcal {M}(\mu _1,\dots ,\mu _k)\). Then \(P^{(0)}\) is feasible and is polynomially sparse (see Sect. 2.1). Let \(P^{(t)} \in (\mathbb {R}_{\geqslant 0}^n)^{\otimes k}\) denote the t-th Frank-Wolfe iterate. Performing the next iteration requires two computations:

  1.

    Approximately solve the following LP to \(\varepsilon '\) accuracy:

    $$\begin{aligned} D^{(t)} \leftarrow {\mathop {{{\,\mathrm{argmin}\,}}}\limits _{P \in \mathcal {M}(\mu _1,\dots ,\mu _k)}} \langle P, P^{(t)} - Q \rangle . \end{aligned}$$
    (7.9)
  2.

    Update \(P^{(t+1)} \leftarrow (1 - \gamma _t)P^{(t)} + \gamma _t D^{(t)}\), where \(\gamma _t = 2/(t+2)\) is the current stepsize.

For the first iteration \(t = 0\), note that the LP (7.9) is an \(\textsf {MOT}\) problem with cost \(C^{(0)} = P^{(0)} - Q = P^{(0)} - R - S\) which decomposes into a polynomially sparse tensor \(P^{(0)}-S\) plus a constant-rank tensor \(-R\). Therefore the algorithm in Corollary 7.5 can solve the LP (7.9) to \(\varepsilon ' = O(\varepsilon )\) additive accuracy in \(\mathrm {poly}(n,k,1/\varepsilon )\) time, and it outputs a solution \(D^{(0)}\) that is \(\mathrm {poly}(n,k,1/\varepsilon )\) sparse. It follows that \(P^{(1)}\) can be computed in \(\mathrm {poly}(n,k,1/\varepsilon )\) time and moreover is \(\mathrm {poly}(n,k,1/\varepsilon )\) sparse since it is a convex combination of the similarly sparse tensors \(P^{(0)}\) and \(D^{(0)}\). By repeating this argument identically for \(T = O(1/\varepsilon )\) iterations, it follows that each iteration takes \(\mathrm {poly}(n,k,1/\varepsilon )\) time, and that each iterate \(P^{(t)}\) is \(\mathrm {poly}(n,k,1/\varepsilon )\) sparse. \(\square \)
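A minimal sketch of this reduction, assuming a hypothetical interface approx_mot_solve for the \(\varepsilon '\)-approximate sparse \(\textsf {MOT}\) solver of Corollary 7.5, is as follows; sparse tensors are represented as dictionaries from index tuples to values:

```python
def frank_wolfe_projection(P0, approx_mot_solve, T):
    """Sketch of the reduction in Corollary 7.11 (hypothetical interface).
    P0 is a sparse vertex of M(mu_1, ..., mu_k); approx_mot_solve(P) stands
    for the eps'-approximate MOT solver of Corollary 7.5 applied to the cost
    C = (P - S) - R, which is polynomially sparse plus constant rank, and
    returns a sparse solution as a dict."""
    P = dict(P0)
    for t in range(T):                  # T = O(1/eps) iterations
        D = approx_mot_solve(P)         # step 1: the LP (7.9)
        gamma = 2.0 / (t + 2.0)         # step 2: Frank-Wolfe stepsize
        P = {j: (1.0 - gamma) * v for j, v in P.items()}
        for j, v in D.items():
            P[j] = P.get(j, 0.0) + gamma * v
    return P   # feasible, sparse, eps-suboptimal for (7.8)
```

Note that the iterate stays sparse precisely because each D returned by the solver is sparse, which is why this reduction works with the algorithms of Sect. 4 but not with \(\texttt {SINKHORN}\).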

8 Discussion

In this paper, we investigated what structure enables \(\textsf {MOT}\), an LP with \(n^k\) variables, to be solved in \(\mathrm {poly}(n,k)\) time. We developed a unified algorithmic framework for \(\textsf {MOT}\) by characterizing the “structure” that different algorithms require in order to solve \(\textsf {MOT}\) in polynomial time, in terms of simple variants of the dual feasibility oracle. On the one hand, this enabled us to show that \(\texttt {ELLIPSOID}\) and \(\texttt {MWU}\) solve \(\textsf {MOT}\) in polynomial time whenever any algorithm can, whereas \(\texttt {SINKHORN}\) requires strictly more structure. On the other hand, it made the design of polynomial-time algorithms for \(\textsf {MOT}\) much simpler, as we illustrated on three general classes of \(\textsf {MOT}\) cost structures.

Our results suggest several natural directions for future research. One exciting direction is to identify further tractable classes of \(\textsf {MOT}\) cost structures beyond the three studied in this paper, since this may enable new applications of \(\textsf {MOT}\). Our results help guide this search because they make it significantly easier to identify whether a given \(\textsf {MOT}\) problem is polynomial-time solvable (see Sect. 1.1.4).

Another important direction is practicality. While the focus of this paper is to characterize when \(\textsf {MOT}\) problems can be solved in polynomial time, in practice there is of course a difference between small and large polynomial runtimes. It is therefore of practical significance to improve our “proof of concept” polynomial-time algorithms by designing algorithms with smaller polynomial runtimes. Our theoretical results help guide this search for practical algorithms because they make it significantly easier to determine whether an \(\textsf {MOT}\) problem is polynomial-time solvable in the first place.

In order to develop more practical algorithms, recall that, roughly speaking, our approach for designing \(\textsf {MOT}\) algorithms consisted of three parts:

  • An “outer loop” algorithm such as \(\texttt {ELLIPSOID}\), \(\texttt {MWU}\), or \(\texttt {SINKHORN}\) that solves \(\textsf {MOT}\) in polynomial time, given a polynomial-time implementation of a certain bottleneck oracle.

  • An “intermediate” algorithm that reduces this bottleneck oracle to polynomially many calls to a variant of the dual feasibility oracle.

  • An “inner loop” algorithm that solves the relevant variant of the dual feasibility oracle for the structured \(\textsf {MOT}\) problem at hand.

Obtaining a smaller polynomial runtime for any of these three parts immediately implies smaller polynomial runtimes for the overall \(\textsf {MOT}\) algorithm. Another approach is to design altogether different algorithms that avoid the polynomial blow-up of the runtime that arises from composing these three parts. Understanding how to solve an \(\textsf {MOT}\) problem more “directly” in this way is an interesting question.