1 Introduction

Linear programs arise naturally in many applications and have become ubiquitous in topics such as operations research, control theory, economics, physics, mathematics and statistics (see the textbooks by Dantzig, 1963; Bertsimas & Tsitsiklis, 1997; Luenberger & Ye, 2008; Galichon, 2018 and the references therein). Their solid mathematical foundation dates back to the mid-twentieth century, with seminal works by Kantorovich (1939), Hitchcock (1941), Dantzig (1948) and Koopmans (1949), and their algorithmic computation remains an active topic of research to this day. A linear program in standard form writes

$$\begin{aligned} \nu (b):= \min _{x\in \mathbb {R}^d}\; c^Tx \quad \text {subject to} \quad Ax=b,\; x\ge 0, \end{aligned}$$
(\(\text {P}_{b}\))

with \((A,b,c)\in \mathbb {R}^{m\times d}\times \mathbb {R}^m\times \mathbb {R}^d\) and matrix A of full rank \(m\le d\). For the purpose of the paper, the lower subscript b in (\(\text {P}_{b}\)) emphasizes the dependence on the vector b. Associated with the primal program (\(\text {P}_{b}\)) is its corresponding dual program

$$\begin{aligned} \max _{\lambda \in \mathbb {R}^m}\; b^T\lambda \quad \text {subject to} \quad A^T\lambda \le c. \end{aligned}$$
(\(\text {D}_{b}\))

At the heart of linear programming and fundamental to our work is the observation that if the primal program (\(\text {P}_{b}\)) attains a finite value, the optimum is attained at one of a finite set of candidates termed basic solutions. Each basic solution (possibly infeasible) is identified by a basis \(I\subset \{1,\ldots ,d\}\) indexing m linearly independent columns of the constraint matrix A. The basis I also defines a basic solution for the dual (\(\text {D}_{b}\)). In fact, the simplex algorithm (Dantzig, 1948) is specifically designed to move from one primal feasible basic solution to another while checking whether the corresponding basis induces a dual feasible basic solution.

Shortly after the first algorithmic approaches and theoretical results became available, the need to incorporate uncertainty in the parameters became apparent (see Dantzig, 1955; Beale, 1955; Ferguson & Dantzig, 1956 for early contributions). In fact, apart from its relevance to numerical stability, in many applications the parameters reflect practical quantities (budgets, prices or capacities) that are not known exactly. This has opened the door to a wealth of approaches accounting for randomness in linear programs. Common to all these formulations is the assumption that some parameters in (\(\text {P}_{b}\)) are random and follow a known probability distribution. Important contributions in this regard are chance constrained linear programs, two- and multiple-stage programming as well as the theory of stochastic linear programs (see Shapiro et al., 2021 for a general overview). Specifically relevant to this paper is the so-called distribution problem characterizing the distribution of the random variable \(\nu (X)\), where the right-hand side b (and possibly A and c) in (\(\text {P}_{b}\)) is replaced by a random variable X following a specific law (Tintner, 1960; Prékopa, 1966; Wets, 1980).

In this paper, we take a related route and focus on statistical aspects of the standard linear program (\(\text {P}_{b}\)) when the right-hand side b is replaced by a consistent estimator \(b_n\) indexed by \(n\in \mathbb {N}\), e.g., based on n observations. Different from the aforementioned approaches, we only assume that the random quantity \(r_n(b_n-b)\) converges weakly (denoted by \(\xrightarrow []{D}\)) to some limit law G as n tends to infinity. Our main goal is to characterize the asymptotic distributional limit of the empirical optimal solution

$$\begin{aligned} x^\star (b_n) \in {{\,\mathrm{arg\,min}\,}}_{Ax=b_n,\, x\ge 0} c^Tx \end{aligned}$$
(1)

around its population counterpart after proper standardization. For the sake of exposition, suppose that \(x^\star (b)\) in (1) is unique. The main results in Theorems 3.1 and 3.3 state that under suitable assumptions on (\(\text {P}_{b}\)) it holds, as n tends to infinity, that

$$\begin{aligned} r_n\left( x^{\star }(b_n) - x^\star (b)\right) \xrightarrow []{D} M(G), \end{aligned}$$
(2)

where \(M:\mathbb {R}^m\rightarrow \mathbb {R}^d\) is given in Theorem 3.1. The function M in (2) is possibly random, and its explicit form is driven by the amount of degeneracy present in the primal and dual optimal solutions. The simplest case occurs if \(x^\star (b)\) is non-degenerate. The function M is then a linear transformation depending on the corresponding unique optimal basis, so that the limit law M(G) is Gaussian if G is Gaussian. If \(x^\star (b)\) is degenerate but all dual optimal (basic) solutions for (\(\text {D}_{b}\)) are non-degenerate, then M is a sum of deterministic linear transformations defined on closed and convex cones indexed by the collection of dual optimal bases. Specifically, the number of summands in M is equal to the number of dual optimal basic solutions for (\(\text {D}_{b}\)). A more complicated situation arises if both \(x^\star (b)\) and some dual optimal basic solutions are degenerate. In this case, the function M is still a sum of linear transformations defined on closed and convex cones, but these transformations are potentially random and indexed by certain subsets of the set of optimal bases. The latter setting reflects the complex geometric and combinatorial nature in linear programs under degeneracy.
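In the simplest non-degenerate case, the limit in (2) is a fixed linear image of G and hence Gaussian whenever G is. The following minimal simulation sketches this behaviour; it uses scipy's linprog (HiGHS solver), and the toy program A, b, c below is a hypothetical choice of ours, not part of the theory.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)

# Hypothetical standard-form program: minimize c^T x s.t. Ax = b, x >= 0.
# Its optimal solution x*(b) = (0.5, 0, 0.5, 0) is unique and non-degenerate.
A = np.array([[1.0, 1.0, 1.0, 0.0],
              [0.0, 1.0, 2.0, 1.0]])
b = np.array([1.0, 1.0])
c = np.array([1.0, 2.0, 0.0, 1.0])

def solve(bvec):
    # linprog enforces x >= 0 by default
    res = linprog(c, A_eq=A, b_eq=bvec, method="highs")
    return res.x if res.success else None

x_star = solve(b)

# Perturb b by Gaussian noise of scale 1/sqrt(n) and rescale the fluctuations.
n, fluctuations = 10_000, []
for _ in range(500):
    x_n = solve(b + rng.normal(scale=0.5, size=2) / np.sqrt(n))
    if x_n is not None:
        fluctuations.append(np.sqrt(n) * (x_n - x_star))
fluctuations = np.asarray(fluctuations)

# In this non-degenerate case M(G) is linear in G, hence Gaussian:
print("mean ~ 0:", fluctuations.mean(axis=0).round(3))
print("only coordinates 1 and 3 (the optimal basis) fluctuate:",
      fluctuations.std(axis=0).round(3))
```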

Let us mention at once that limiting distributions for empirical optimal solutions in the form of (2) have been studied for a long time in the more general setting of (potentially) non-linear optimization problems; see for example Dupačová (1987), Dupačová & Wets (1988), Shapiro (1991, 1993, 2000) and King & Rockafellar (1993). Regularity assumptions such as strong convexity of the objective function near the (unique) optimizer allow for either explicit asymptotic expansions of optimal values and optimal solutions or applications of implicit function theorems and generalizations thereof. These conditions usually do not hold for the linear programs considered in this paper.

To the best of our knowledge, our results are the first that cover limit laws for empirical optimal solutions to standard linear programs beyond the non-degenerate case and without assuming uniqueness of optimizers. However, our proof technique relies on well-known concepts from parametric optimization and sensitivity analysis for linear programs (Guddat et al., 1974; Greenberg, 1986; Ward & Wendell, 1990; Hadigheh & Terlaky, 2006). Indeed, our approach is based on a careful study of the collection of dual optimal bases. An early contribution in this regard is the basis decomposition theorem by Walkup & Wets (1969a) analyzing the behavior of \(\nu (b)\) in (\(\text {P}_{b}\)) as a function of b (see also Remark 3.2). Each dual feasible basis defines a so-called decision region over which the optimal value \(\nu (b)\) is linear. Integration over the collection of all these regions yields closed-form expressions for the distribution problem (Bereanu, 1963; Ewbank et al., 1974). Further related stability results are found in the works of Walkup & Wets (1967), Böhm (1975), Bereanu (1976) and Robinson (1977). In algebraic geometry, decision regions are closely related to cone-triangulations of the primal feasible optimization region (Sturmfels & Thomas, 1997; De Loera et al., 2010). We emphasize that rather than working with decision regions directly, our analysis is tailored to cones of feasible perturbations. In particular, we are interested in regions capturing feasible directions, as our problem setting is based on the random perturbation \(\sqrt{n}\left( b_n-b\right) \). These regions turn out to be closed, convex cones and appear as indicator functions in the (random) function M in (2).

Our proof technique allows us to recover some related known results for random linear programs (see Sect. 3). These include convergence of the optimality sets in Hausdorff distance (Proposition 3.7), and a limit law for the optimal value

$$\begin{aligned} r_n\left( \nu (b_n)-\nu (b)\right) \xrightarrow []{D} \max _{k\in [K]}G^T\lambda (I_k) \end{aligned}$$
(3)

as n tends to infinity. Indeed, (3) is a simple consequence of general results in constrained optimization (Shapiro, 2000; Bonnans & Shapiro, 2000), and the optimality set convergence follows from Walkup & Wets (1969b).

Our statistical analysis for random linear programs in standard form is motivated by recent findings in statistical optimal transport (OT). More precisely, while there exists a thorough theory of limit laws for empirical OT costs on discrete spaces (Sommerfeld & Munk, 2018; Tameling et al., 2019), related statements for the empirical OT solutions remain open. An exception is Klatt et al. (2020), who provide limit laws for empirical (entropy) regularized OT solutions, thus modifying the underlying linear program to be strictly convex, non-linear and, most importantly, non-degenerate in the sense that every regularized OT solution is strictly positive in each coordinate. Hence, an implicit function theorem approach in conjunction with a delta method allows one to conclude Gaussian limits in this case. This stands in stark contrast to the non-regularized OT considered in this paper, where the degenerate case is generic rather than the exception in most practical situations. Only if the OT solution is unique and non-degenerate do we observe a Gaussian fluctuation on the support set, i.e., on all entries with positive values. If the OT solution is degenerate (or not unique), then the asymptotic limit law (2) is usually not Gaussian anymore. Degeneracy in OT easily occurs as soon as certain subsets of demand and supply sum up to the same quantity. In particular, we encounter the largest degree of degeneracy if individual demand equals individual supply. Additionally, we obtain necessary and sufficient conditions on the cost function for the dual OT to be non-degenerate. These may be of independent interest and allow us to prove almost sure uniqueness results for quite general cost functions.

Our distributional results can be viewed as a basis for uncertainty quantification and other statistical inference procedures concerning solutions to linear programs. For brevity, we mention such applications in passing and do not elaborate further on them, leaving a detailed study of statistical consequences such as testing or confidence statements as an important avenue for further research.

The outline of the paper is as follows. We recap the basics of linear programming in Sect. 2, where we also introduce the deterministic and stochastic assumptions for our general theory. Our main results are summarized in Sect. 3, followed by their proofs in Sect. 4. The assumptions are discussed in more detail in Sect. 5. Section 6 focuses on OT and gives limit laws for empirical OT solutions.

2 Preliminaries and assumptions

This section introduces the notation and assumptions required to state the main results of the paper. Along the way, we recall basic facts of linear programming and refer to Bertsimas & Tsitsiklis (1997) and Luenberger & Ye (2008) for details.

Linear Programs and Duality. Let the columns of a matrix \(A\in \mathbb {R}^{m\times d}\) be enumerated by the set \([d]:= \{1,\ldots ,d\}\). Consider for a subset \(I\subseteq [d]\) the sub-matrix \(A_I\in \mathbb {R}^{m\times \vert I \vert }\) formed by the corresponding columns indexed by I. Similarly, \(x_I\in \mathbb {R}^{|I|}\) denotes the coordinates of \(x\in \mathbb {R}^d\) corresponding to I. Since A has full rank, there always exists an index set I of cardinality m such that \(A_I\in \mathbb {R}^{m\times m}\) is invertible. An index set I with this property is termed a basis and induces a primal and a dual basic solution

$$\begin{aligned} x(I,b):= \text {Aug}_I\left[ \left( A_I\right) ^{-1}b \right] \in \mathbb {R}^d,\quad \lambda (I):= (A_I)^{-T}c_I \in \mathbb {R}^m, \end{aligned}$$

respectively. Herein, and in order to match dimensions (a solution for (\(\text {P}_{b}\)) has dimension d instead of \(m\le d\)), the linear operator \(\mathrm {Aug}_I:\mathbb {R}^m\rightarrow \mathbb {R}^d\) augments zeroes in the coordinates that are not in I. If \(\lambda (I)\) (resp. \(x(I,b)\)) is feasible for (\(\text {D}_{b}\)) (resp. (\(\text {P}_{b}\))), then it constitutes a dual (resp. primal) feasible basic solution with dual (resp. primal) feasible basis I. Moreover, \(\lambda (I)\) (resp. \(x(I,b)\)) is termed a dual (resp. primal) optimal basic solution if it is feasible and optimal for (\(\text {D}_{b}\)) (resp. (\(\text {P}_{b}\))). Indeed, as long as (\(\text {D}_{b}\)) admits a feasible (optimal) solution, there exists a dual feasible (optimal) basic solution, and vice versa for (\(\text {P}_{b}\)).
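In code, \(x(I,b)\) and \(\lambda (I)\) amount to one linear solve each, with \(\mathrm {Aug}_I\) realized by padding zeros outside I. A minimal sketch (numpy; the data A, b, c are a hypothetical toy example):

```python
import numpy as np

A = np.array([[1.0, 1.0, 1.0, 0.0],   # hypothetical toy data, m = 2, d = 4
              [0.0, 1.0, 2.0, 1.0]])
b = np.array([1.0, 1.0])
c = np.array([1.0, 2.0, 0.0, 1.0])

def basic_solutions(I):
    """Primal x(I,b) = Aug_I[(A_I)^{-1} b] and dual lambda(I) = (A_I)^{-T} c_I."""
    I = list(I)
    A_I = A[:, I]
    x = np.zeros(A.shape[1])
    x[I] = np.linalg.solve(A_I, b)        # (A_I)^{-1} b, padded by Aug_I
    lam = np.linalg.solve(A_I.T, c[I])    # (A_I)^{-T} c_I
    return x, lam

x, lam = basic_solutions((0, 2))          # basis I = {1, 3} in 1-based indexing
print("x(I,b) =", x, "| primal feasible:", bool(np.all(x >= 0)))
print("lambda(I) =", lam, "| dual feasible:", bool(np.all(A.T @ lam <= c + 1e-12)))
```

At the heart of linear programming is the strong duality statement.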

Fact 2.1

Consider the primal linear program (\(\text {P}_{b}\)) and its dual (\(\text {D}_{b}\)).

(i)

    If either of the linear programs (\(\text {P}_{b}\)) or (\(\text {D}_{b}\)) has a finite optimal solution, so does the other and the corresponding optimal values are the same.

(ii)

If for a basis \(I\subseteq [d]\) the vector \(\lambda (I)\) is dual feasible and \(x(I,b)\) is primal feasible, then \(x(I,b)\) and \(\lambda (I)\) are primal and dual optimal basic solutions, respectively.

(iii)

If (\(\text {P}_{b}\)) and (\(\text {D}_{b}\)) are feasible, then there always exists a basis \(I\subseteq [d]\) such that \(x(I,b)\) and \(\lambda (I)\) are primal and dual optimal basic solutions, respectively.

We introduce the feasibility and optimality set for the primal (\(\text {P}_{b}\)) by

$$\begin{aligned} \mathcal {P}(b):= \left\{ x\in \mathbb {R}^d \mid \, Ax=b, x\ge 0\right\} ,\quad \mathcal {P}^\star (b):= \left\{ x^{\star }\in \mathcal {P}(b) \mid \, c^T x^{\star }=\inf _{x\in \mathcal {P}(b)}c^Tx\right\} , \end{aligned}$$
(4)

respectively. Notably, in the theory to follow, A and c are assumed to be fixed and only the dependence of these sets on the parameter b is emphasized. We introduce our first assumption:

(\(\mathbf{A1} \)) The optimality set \(\mathcal {P}^\star (b)\) of (\(\text {P}_{b}\)) is non-empty and bounded.

In view of the strong duality statement in Fact 2.1, solving a linear program may be carried out by focusing on the collection of all dual feasible bases. We partition this collection into two subsets depending on their feasibility for the primal program.

Remark 2.2

(Splitting of the Bases Collection) Let \(I_1,\dots ,I_{N}\) enumerate all dual feasible bases, and let \(1\le K\le N\) be such that

$$\begin{aligned} x(I_k,b) \text { is feasible,} \, 1\le k\le K; \qquad x(I_k,b) \text { is infeasible, } \, K<k\le N. \end{aligned}$$

Notably, by Fact 2.1 the primal basic solution \(x(I_k,b)\) is optimal for all \(k\le K\). Recall that the convex hull \(\mathcal {C}\left( x_1,\ldots ,x_K\right) \) of a collection of points \(\{x_1,\ldots ,x_K\}\subset \mathbb {R}^d\) is the set of all possible convex combinations of them.
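Numerically, the splitting of Remark 2.2 can be carried out by brute force over all index sets of cardinality m. A minimal sketch (numpy; the toy data are hypothetical and the same as in the sketches above):

```python
import itertools
import numpy as np

A = np.array([[1.0, 1.0, 1.0, 0.0],   # hypothetical toy data from above
              [0.0, 1.0, 2.0, 1.0]])
b = np.array([1.0, 1.0])
c = np.array([1.0, 2.0, 0.0, 1.0])
m, d = A.shape

optimal, infeasible = [], []          # I_1..I_K and I_{K+1}..I_N of Remark 2.2
for I in itertools.combinations(range(d), m):
    A_I = A[:, I]
    if abs(np.linalg.det(A_I)) < 1e-10:
        continue                      # columns not linearly independent: no basis
    lam = np.linalg.solve(A_I.T, c[list(I)])
    if np.any(A.T @ lam > c + 1e-9):
        continue                      # basis is not dual feasible
    x_I = np.linalg.solve(A_I, b)
    (optimal if np.all(x_I >= -1e-9) else infeasible).append(I)

print("K =", len(optimal), "primal feasible (hence optimal) bases:", optimal)
print("N =", len(optimal) + len(infeasible), "dual feasible bases in total")
```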

Fact 2.3

Consider the primal linear program (\(\text {P}_{b}\)) and assume (\(\mathbf{A1} \)) holds. Then for any right-hand side \(\tilde{b}\in \mathbb {R}^m\) exactly one of the following statements holds.

(i)

    The feasible set \(\mathcal {P}\left( \tilde{b}\right) \) is empty.

(ii)

    The optimality set \(\mathcal {P}^\star \left( \tilde{b}\right) \) is non-empty, bounded and equal to the convex hull

    $$\begin{aligned} \mathcal {P}^\star \left( \tilde{b}\right) = \mathcal {C}\left( \left\{ x(I,\tilde{b})\,\mid \, I \text { primal and dual feasible basis for } (\text {P}_{\tilde{b}}) \text { and } (\text {D}_{\tilde{b}})\right\} \right) . \end{aligned}$$

The restriction of the convex hull to basic solutions induced by primal and dual optimal bases in Fact 2.3 is well-known. A straightforward argument is based on the simplex method, which, if set up with appropriate pivoting rules, always terminates. If (\(\mathbf{A1} \)) holds and there exists a unique optimal basis \((K=1)\), then the primal program attains a unique solution. Uniqueness of solutions to linear programs is related to degeneracy of corresponding dual solutions. A dual feasible basic solution \(\lambda (I)\) is degenerate if more than m of the d inequalities \(\lambda (I)^TA\le c^T\) hold as equalities. Similarly, a primal feasible basic solution \(x(I,b)\) is degenerate if fewer than m of its coordinates are nonzero.

Fact 2.4

Consider the linear program (\(\text {P}_{b}\)) and its dual (\(\text {D}_{b}\)).

(i)

    If (\(\text {P}_{b}\)) (resp. (\(\text {D}_{b}\))) has a non-degenerate optimal basic solution, then (\(\text {D}_{b}\)) (resp. (\(\text {P}_{b}\))) has a unique solution.

(ii)

    If (\(\text {P}_{b}\)) (resp. (\(\text {D}_{b}\))) has a unique non-degenerate optimal basic solution, then (\(\text {D}_{b}\)) (resp. (\(\text {P}_{b}\))) has a unique non-degenerate optimal solution.

(iii)

    If (\(\text {P}_{b}\)) (resp. (\(\text {D}_{b}\))) has a unique degenerate optimal basic solution, then (\(\text {D}_{b}\)) (resp. (\(\text {P}_{b}\))) has multiple solutions.

For a proof of Fact 2.4, we refer to Gal & Greenberg (2012, Lemma 6.2) in combination with strict complementary slackness (Goldman & Tucker, 1956, Corollary 2A), which states that for a feasible pair of primal and dual linear programs there exists a pair \((x,\lambda )\) of primal and dual optimal solutions such that either \(x_j>0\) or \(\lambda ^TA_j<c_j\) for all \(1\le j \le d\). In addition to uniqueness statements, many results in linear programming simplify when degeneracy is excluded. Related to degeneracy but slightly weaker is the assumption

(\(\mathbf{A2} \)) The dual optimal basic solutions \(\lambda (I_1),\ldots ,\lambda (I_K)\) are pairwise distinct.

Indeed, if \(\mathcal {P}^\star (b)\) is non-empty and bounded, assumption (\(\mathbf{A2} \)) characterizes non-degeneracy of all dual optimal basic solutions.

Lemma 2.5

Suppose assumption (\(\mathbf{A1} \)) holds. Then assumption (\(\mathbf{A2} \)) is equivalent to non-degeneracy of all dual optimal basic solutions.

To see that (\(\mathbf{A1} \)) is necessary, let \(D_m\in \mathbb {R}^{m\times m}\) with \(m\ge 2\) be the identity matrix. Suppose that \(A=(D_m, -D_m)\in \mathbb {R}^{m\times 2m}\), \(c\in \mathbb {R}^{2m}_+\) is strictly positive except that \(c_1=c_{m+1}=0\), and \(b=(1,0,0,\dots ,0)\in \mathbb {R}^m\). Then there are \(K=2^{m-1}\) optimal bases defining K distinct (degenerate) dual solutions, so that assumption (\(\mathbf{A2} \)) holds while non-degeneracy fails. Note that \(\mathcal {P}^\star (b)\) is unbounded and contains the optimal ray \((b^T,b^T)\), so that (\(\mathbf{A1} \)) fails.
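This counterexample is easily verified numerically. The sketch below (numpy, with the hypothetical choice \(c=(0,1,0,1)\) for \(m=2\)) recovers the \(K=2\) optimal bases, their distinct dual solutions and the degeneracy (three tight constraints, exceeding \(m=2\)):

```python
import itertools
import numpy as np

m = 2
A = np.hstack([np.eye(m), -np.eye(m)])   # A = (D_m, -D_m)
c = np.array([0.0, 1.0, 0.0, 1.0])       # positive except c_1 = c_{m+1} = 0
b = np.array([1.0, 0.0])

for I in itertools.combinations(range(2 * m), m):
    A_I = A[:, I]
    if abs(np.linalg.det(A_I)) < 1e-10:
        continue                         # not a basis
    lam = np.linalg.solve(A_I.T, c[list(I)])
    x_I = np.linalg.solve(A_I, b)
    if np.all(A.T @ lam <= c + 1e-9) and np.all(x_I >= -1e-9):
        tight = int(np.sum(np.isclose(A.T @ lam, c)))
        print("optimal basis", I, "| lambda =", lam, "| tight constraints:", tight)

# Output: two optimal bases with distinct dual solutions (0, 1) and (0, -1),
# each with 3 > m tight constraints, i.e., degenerate.
```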

Random Linear Programs. Introducing randomness into problems (\(\text {P}_{b}\)) and (\(\text {D}_{b}\)), we suppose that we have incomplete knowledge of \(b\in \mathbb {R}^m\), and replace it by a (consistent) estimator \(b_n\), e.g., based on a sample of size n independently drawn from a distribution with mean b. This defines empirical primal and dual counterparts \((\text {P}_{b_n})\) and \((\text {D}_{b_n})\), respectively. We allow the more general case that only the first \(m_0\in \{0,\dots ,m\}\) coordinates of b are unknown and assume the existence of a sequence of random vectors \(b_n=(b_n^{m_0},[b]_{m-m_0})^T\in \mathbb {R}^{m_0}\times \mathbb {R}^{m-m_0}\) converging to b at rate \(r_n^{-1} \rightarrow 0\) as n tends to infinity:

(\(\mathbf{B1} \)) \(r_n\left( b_n-b\right) \xrightarrow []{D} G:=(G^{m_0},0_{m-m_0})\) with \(G^{m_0}\) absolutely continuous on \(\mathbb {R}^{m_0}\),

where \(\xrightarrow []{D}\) denotes convergence in distribution. In a typical central limit theorem scenario, \(r_n=\sqrt{n}\) and \(G^{m_0}\) is a centred Gaussian random vector in \(\mathbb {R}^{m_0}\), assumed to have a non-singular covariance matrix. Assumption (\(\mathbf{B1} \)) implies that \(b_n\rightarrow b\) in probability. In order to avoid pathological cases, we impose the last assumption that asymptotically an optimal solution \(x^\star (b_n)\) for the primal \((\text {P}_{b_n})\) exists:

(\(\mathbf{B2} \)) \(\lim _{n\rightarrow \infty } \mathbb {P}\left( \mathcal {P}^\star (b_n) \ne \emptyset \right) = 1.\)

Further discussions on the assumptions are deferred to Sect. 5.

3 Main results

According to Fact 2.3, in the presence of (\(\mathbf{A1} \)) any optimal solution \(x^{\star }(b_n)\in \mathcal P^\star (b_n)\) takes the form

$$\begin{aligned} x^{\star }(b_n) =\sum _{k\in \mathcal K} [\alpha _n^\mathcal {K}]_k x(I_k,b_n) := \alpha _n^\mathcal {K}\otimes x(I_\mathcal {K},b_n), \end{aligned}$$

where \(\mathcal {K}\) is a non-empty subset of \([N]:= \{1,\dots ,N\}\) and \(\alpha _n^\mathcal {K}\in \mathbb {R}^N\) is a random vector in the (essentially \(|\mathcal {K}|\)-dimensional) unit simplex \(\Delta _\mathcal {K}:= \left\{ \alpha \in \mathbb {R}_+^{N}\, \mid \,\Vert \alpha \Vert _1=1,\alpha _k=0\ \forall k\notin \mathcal {K}\right\} \). The main result of the paper states the following asymptotic behaviour for the empirical optimal solution.

Theorem 3.1

Suppose assumptions (\(\mathbf{A1} \)), (\(\mathbf{B1} \)), and (\(\mathbf{B2} \)) hold, and let \(x^{\star }(b_n)\in \mathcal {P}^\star (b_n)\) be any (measurable) choice of an optimal solution. Further, assume that for all non-empty \(\mathcal {K}\subseteq [K]\), the random vectors \(\left( \alpha _n^\mathcal {K}, G_n\right) \) converge jointly in distribution, as n tends to infinity, to \((\alpha ^\mathcal {K},G)\) on \(\Delta _{\vert \mathcal {K}\vert }\times \mathbb {R}^m\). Then there exist closed convex cones \(H_1^{m_0},\ldots ,H_K^{m_0}\subseteq \mathbb {R}^{m_0}\) and random vectors \(Y_n\in \mathcal {P}^\star (b)\) such that

$$\begin{aligned} r_n\left( x^{\star }(b_n) - Y_n\right) \xrightarrow []{D} M(G):= \sum _\mathcal {K}\mathbbm {1}_{\left\{ G^{m_0}\in H_\mathcal {K}^{m_0}\setminus \bigcup _{k\in [K]\setminus \mathcal {K}}H_k^{m_0}\right\} }\,\alpha ^\mathcal {K}\otimes x(I_\mathcal {K},G) \, \in \mathbb {R}^d, \end{aligned}$$

where the sum runs over non-empty subsets \(\mathcal {K}\) of [K] and \(H_\mathcal {K}^{m_0}=\cap _{k\in \mathcal {K}}H_k^{m_0}\).

Remark 3.2

Underlying Theorem 3.1 is the well-known approach of partitioning \(\mathbb R^m\) into (closed convex) cones. Indeed, the union of the closed convex cones

$$\begin{aligned} \widetilde{H}_k:=\left\{ g\in \mathbb {R}^m\mid \, x(I_k,g)\ge 0\right\} ,\qquad k\in [N], \end{aligned}$$

is the feasibility set \(A_+:=\{Ax:x\ge 0\}\subseteq \mathbb {R}^m\) and on each cone the optimal solution is an affine function of b (e.g., Walkup & Wets, 1969a; Guddat et al., 1974). The cones \(\widetilde{H}_k\) depend only on A and c. In contrast, our cones \(H_k^{m_0}\) also depend on b and define directions of perturbations of b that keep \(\lambda (I_k)\) optimal for the perturbed problem for a given \(k\le K\). Assume for simplicity that \(m_0=m\) and write \(H_k\) instead of \(H_k^{m_0}\). If \(b=0\), then \(K=N\) and the cones coincide, \(H_k=\widetilde{H}_k\); otherwise \(H_k\) is in general a strict super-set of \(\widetilde{H}_k\), as the corresponding representation (7) of \(\widetilde{H}_k\) requires non-negativity on all coordinates. This is also in line with the observation that there are fewer cones \(H_k\) (namely K) than cones \(\widetilde{H}_k\) (namely N), and that the union of the \(H_k\)'s is a set that is at least as large as \(A_+\) (since \(b+A_+\subseteq A_+\) because \(b\in A_+\)), typically \(\mathbb R^m\). As an extreme example, suppose that (\(\text {P}_{b}\)) has a unique non-degenerate optimal solution \(x(I_1,b)\). Then \(K=1\) and \(H_1=\mathbb R^m\), but the \(\widetilde{H}_k\)'s are strict subsets of \(\mathbb R^m\) unless \(N=1\).

In Sect. 5, we discuss sufficient conditions for the joint distributional convergence of the random vectors \(\left( \alpha _n^\mathcal {K},G_n\right) \). In short, for any standard linear program solver such joint distributional convergence appears reasonable. If the optimal basis is unique (\(K=1\)) with \(x^\star (b)=x(I_1,b)\) non-degenerate, then \(\lambda (I_1)\) is non-degenerate, and the proof shows that \(H_1^{m_0}=\mathbb {R}^{m_0}\). The distributional limit theorem then takes the simple form

$$\begin{aligned} r_n(x^\star (b_n)-x^\star (b)) \xrightarrow []{D} x(I_1,G) \in \mathbb {R}^d. \end{aligned}$$

In general, when \(K>1\), the number of summands in the limiting random variable in Theorem 3.1 might grow exponentially in K. In between these two cases is the situation that assumption (\(\mathbf{A2} \)) holds, which implies all dual optimal basic solutions for (\(\text {D}_{b}\)) are non-degenerate (see Lemma 2.5). The limiting random variable then simplifies, as the subsets \(\mathcal {K}\) must be singletons.

Theorem 3.3

Suppose assumptions (\(\mathbf{A1} \)), (\(\mathbf{A2} \)), and (\(\mathbf{B2} \)) hold, and that \(r_n(b_n - b)\xrightarrow []D G\). Then any (measurable) choice of \(x^{\star }(b_n)\in \mathcal {P}^\star (b_n)\) satisfies

$$\begin{aligned} r_n\left( x^{\star }(b_n) - Y_n\right) \xrightarrow []{D} \sum _{k=1}^K \mathbbm {1}_{\left\{ G^{m_0}\in H_k^{m_0}\setminus \cup _{j<k} H_j^{m_0}\right\} }\,x\left( I_k,G\right) \, \in \mathbb {R}^d \end{aligned}$$

with the closed and convex cones \(H_k^{m_0}\) as given in Theorem 3.1.

Remark 3.4

With respect to Theorem 3.1, assumption (\(\mathbf{B1} \)) is weakened in Theorem 3.3, as absolute continuity of G (or \(G^{m_0}\)) is not required. Indeed, it can be arbitrary, and Theorem 3.3 thus accommodates, e.g., Poisson limit distributions. The proof shows that if G is absolutely continuous (i.e., \(m_0=m\)), then the indicator functions of \(\{G\in H_k^m\setminus \cup _{j<k} H_j^m\}\) simplify to \(\{G\in H_k^m\}\), because the intersections \(H_k^m\cap H_j^m\) have Lebesgue measure zero. The distributional limit theorem then reads as

$$\begin{aligned} r_n\left( x^{\star }(b_n) - Y_n\right) \xrightarrow []{D} \sum _{k=1}^K \mathbbm {1}_{\left\{ G^{m}\in H_k^{m}\right\} }\,x\left( I_k,G\right) \, \in \mathbb {R}^d. \end{aligned}$$
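This simplified limit law can be sampled directly once the optimal bases and the cones of (7) are known. The following sketch (numpy) does so for a small degenerate program of our own construction: \(x^\star (b)=(0,0,1)\) is degenerate, \(K=2\) with non-degenerate dual solutions \(\lambda (I_1)=(1,0)\) and \(\lambda (I_2)=(0,1)\), and the cones are the half-planes \(H_1=\{g_1\ge g_2\}\) and \(H_2=\{g_2\ge g_1\}\).

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical degenerate example: minimize x1 + x2 + x3 subject to
# x1 + x3 = 1, x2 + x3 = 1, x >= 0; unique degenerate solution (0, 0, 1).
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
b = np.array([1.0, 1.0])
bases = [(0, 2), (1, 2)]                 # the K = 2 optimal bases I_1, I_2

def x_of(I, v):
    x = np.zeros(3)
    x[list(I)] = np.linalg.solve(A[:, list(I)], v)
    return x

def sample_limit(G):
    """One draw from the limit law of Theorem 3.3 (first k with G in H_k)."""
    for I in bases:
        DP = [i for i in I if np.isclose(x_of(I, b)[i], 0.0)]  # degeneracy set
        xI = x_of(I, G)
        if np.all(xI[DP] >= 0):          # membership G in H_k via (7)
            return xI
    raise AssertionError("the cones H_k cover R^m")

for G in rng.normal(size=(5, 2)):        # G absolutely continuous, m_0 = m = 2
    print(G.round(2), "->", sample_limit(G).round(2))
# G1 >= G2 yields (G1 - G2, 0, G2); otherwise (0, G2 - G1, G1): a non-Gaussian law.
```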

If the optimal solution of the limiting problem is unique, Theorem 3.1 can be formulated in a set-wise sense. The Hausdorff distance between two closed nonempty sets \(A,B\subseteq \mathbb {R}^d\) is

$$\begin{aligned} d_H(A,B) =\max \left( \sup _{a\in A}\inf _{b\in B}\Vert a-b\Vert ,\sup _{b\in B}\inf _{a\in A}\Vert a-b\Vert \right) \in [0,\infty ]. \end{aligned}$$
(5)

The collection of closed subsets of \({\mathbb {R}^d}\) equipped with \(d_H\) is a metric space (with possibly infinite distance) and convergence in distribution is defined as usual by integrals of continuous real-valued bounded functions; see for example King (1989), where the delta method is developed in this context. Recall that \(\mathcal {C}\) stands for convex hull.

Theorem 3.5

Suppose assumptions (\(\mathbf{A1} \)), (\(\mathbf{B1} \)), and (\(\mathbf{B2} \)) hold, and that \(\mathcal P^{\star }(b)=\{x^\star (b)\}\) is a singleton. On the collection of closed subsets of \(\mathbb {R}^d\) with the Hausdorff distance \(d_H\) it holds that

$$\begin{aligned} r_n\left( \mathcal P^{\star }(b_n) - x^\star (b)\right) \xrightarrow []{D} \sum _\mathcal {K}\mathbbm {1}_{\left\{ G^{m_0}\in H_\mathcal {K}^{m_0}\setminus \bigcup _{k\in [K]\setminus \mathcal {K}}H_k^{m_0}\right\} } \mathcal {C}(x(I_\mathcal {K},G)), \end{aligned}$$

where \(H_k^{m_0}\) and \(H_\mathcal {K}^{m_0}\) are as defined in Theorem 3.1.

We conclude this section by giving two further consequences of our proof techniques: a limit law for the empirical optimal value \(\nu (b_n)\) of \((\text {P}_{b_n})\), and convergence in probability of optimality sets. Since the former is well-known and holds even for more general, infinite-dimensional convex programs, we omit the proof details and instead refer to Shapiro (2000), Bonnans & Shapiro (2000) and the results by Sommerfeld & Munk (2018) and Tameling et al. (2019) tailored to OT.

Proposition 3.6

Under assumptions (\(\mathbf{A1} \)), (\(\mathbf{B1} \)), and (\(\mathbf{B2} \)) it holds that

$$\begin{aligned} r_n[ \nu (b_n) -\nu (b) ] \xrightarrow []{D} \max _{k\in [K]}G^T\lambda (I_k). \end{aligned}$$
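For the degenerate toy program sketched after Remark 3.4, \(\lambda (I_1)=(1,0)\) and \(\lambda (I_2)=(0,1)\), so Proposition 3.6 predicts the limit \(\max (G_1,G_2)\); a brief Monte Carlo check (scipy, hypothetical data as before):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(2)

A = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])  # degenerate toy example
b = np.array([1.0, 1.0])
c = np.array([1.0, 1.0, 1.0])
nu_b, n = 1.0, 10_000                             # nu(b) = 1 at x*(b) = (0,0,1)

deviations = []
for _ in range(2_000):
    G = rng.normal(size=2)
    res = linprog(c, A_eq=A, b_eq=b + G / np.sqrt(n), method="highs")
    deviations.append(np.sqrt(n) * (res.fun - nu_b) - max(G[0], G[1]))

# For this toy example nu(b_n) - nu(b) equals max(G_1, G_2)/sqrt(n) exactly,
# so the deviations vanish up to solver tolerance:
print("max |deviation| =", np.max(np.abs(deviations)))
```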

Another consequence of our basis-driven approach underlying the proof of Theorem 3.1 is that the Hausdorff distance

$$\begin{aligned} d_H\left( \mathcal {P}^\star (b_n),\mathcal {P}^\star (b)\right) := \max \left\{ \sup _{x\in \mathcal {P}^\star (b_n)}\inf _{y\in \mathcal {P}^\star (b)}\Vert x-y\Vert , \sup _{x\in \mathcal {P}^\star (b)}\inf _{y\in \mathcal {P}^\star (b_n)}\Vert x-y\Vert \right\} \end{aligned}$$

between \(\mathcal {P}^\star (b_n)\) and \(\mathcal {P}^\star (b)\) is of order \(O_\mathbb {P}(r_n^{-1})\). A different and considerably shorter argument relies on Walkup & Wets (1969b) and proves the following result.

Proposition 3.7

Suppose assumptions (\(\mathbf{A1} \)) and (\(\mathbf{B2} \)) hold. If \(\Vert b_n - b\Vert =O_\mathbb {P}(r_n^{-1})\), then it follows that \(d_H\left( \mathcal {P}^\star (b_n),\mathcal {P}^\star (b)\right) =O_\mathbb {P}(r_n^{-1})\).

We also refer to the work by Robinson (1977) for a similar result when the primal and dual optimality sets are both bounded.

4 Proofs for the main results

To simplify the notation, we assume that all random vectors in the paper are defined on a common generic probability space \((\Omega ,\mathcal F,\mathbb {P})\). This is no loss of generality by the Skorokhod representation theorem.

Preliminary steps. Recall from Remark 2.2 that the bases \(I_1,\dots ,I_K\) are feasible for (\(\text {P}_{b}\)) and (\(\text {D}_{b}\)) and hence optimal, whereas the bases \(I_{K+1},\dots ,I_N\) are feasible for (\(\text {D}_{b}\)) but not for (\(\text {P}_{b}\)). For a set \(\mathcal {K}\subseteq [N]\) define the events, i.e., subsets of the underlying probability space,

$$\begin{aligned} A_n^\mathcal {K}&:=\left\{ \omega \in \Omega \,\Big \vert \, \Big [x(I_k,b_n(\omega ))\ge 0 \text { and } A^T\lambda (I_k)\le c\Big ] \iff k\in \mathcal {K}\right\} \subseteq \Omega ,\\ B_n^\mathcal {K}&:=\left\{ \omega \in \Omega \,\Big \vert \, \Big [x(I_k,b_n(\omega ))\ge 0 \text { and } A^T\lambda (I_k)\le c\Big ] \Longleftarrow k\in \mathcal {K}\right\} \subseteq \Omega . \end{aligned}$$

By strong duality (Fact 2.1 (ii)), the set \(A_n^{\mathcal {K}}\) is the event that the bases indexed by \(\mathcal {K}\) are precisely those that are optimal for \((\text {P}_{b_n})\) and \((\text {D}_{b_n})\). We have \(A_n^\mathcal {K}\subseteq B_n^\mathcal {K}\), and \(B_n^\mathcal {K}\subseteq B_n^{\{k\}}\) for all \(k\in \mathcal {K}\). We start with two important observations, the first stating that only subsets of [K] matter asymptotically.

Lemma 4.1

Suppose that \(b_n\xrightarrow []D b\).

(i)

    It holds that \(\mathbb P\left( B_n^{\{k\}}\right) \rightarrow 0\) as \(n\rightarrow \infty \) for all \(k>K\).

(ii)

If assumptions (\(\mathbf{A1} \)) and (\(\mathbf{B2} \)) hold, then with high probability \(\mathcal P^\star (b_n)\) is non-empty and bounded.

Proof

For (i), observe that for \(k>K\) there exists an index \(i\in [d]\) such that \(x_i(I_k,b)<0\). The same inequality holds for \(b_n\) whenever it is sufficiently close to b, which happens with high probability. For (ii), non-emptiness with high probability follows from assumption (\(\mathbf{B2} \)), so we only prove boundedness. Indeed, assumption (\(\mathbf{A1} \)) implies that the recession cone \( \{x\ge 0\mid Ax=0,c^T x=0\} \) is trivial and equals \(\{0\}\). This property does not depend on \(b_n\), which yields the result. \(\square \)

The event \(A_n^\emptyset \) is equivalent to \((\text {P}_{b_n})\) being either infeasible or unbounded, and this has probability o(1) by (\(\mathbf{B2} \)). Combining this with the previous lemma and the fact that the sets \((A_n^\mathcal {K})_\mathcal {K}\) form a partition of the probability space \(\Omega \), we deduce

$$\begin{aligned} x^\star (b_n) =\sum _{\emptyset \subset \mathcal {K}\subseteq [K]} \mathbbm {1}_{A_n^\mathcal {K}}(\omega )\, \alpha _n^\mathcal {K}(\omega ) \otimes x(I_\mathcal {K},b_n(\omega )) + o_\mathbb {P}(1), \end{aligned}$$

where \(\mathbbm {1}_{A}(\omega )\) denotes the usual indicator function of the set A. Defining the random vector

$$\begin{aligned} Y_n =\sum _{\emptyset \subset \mathcal {K}\subseteq [K]} \mathbbm {1}_{A_n^\mathcal {K}}(\omega )\, \alpha _n^\mathcal {K}(\omega ) \otimes x(I_\mathcal {K},b) \end{aligned}$$

that lies in \(\mathcal {P}^\star (b)\) (because \(\mathcal {K}\subseteq [K]\)), we obtain

$$\begin{aligned} r_n[x^\star (b_n) - Y_n] =\sum _{\emptyset \subset \mathcal {K}\subseteq [K]} \mathbbm {1}_{A_n^\mathcal {K}}(\omega )\, \alpha _n^\mathcal {K}(\omega ) \otimes x(I_\mathcal {K},G_n(\omega )) + o_\mathbb {P}(1). \end{aligned}$$
(6)

We next investigate the indicator functions \(\mathbbm {1}_{A_n^\mathcal {K}}(\omega )\) appearing in (6). Omitting the dependence of \(b_n\) on \(\omega \), we rewrite

$$\begin{aligned} \begin{aligned} B_n^\mathcal {K}&=\bigcap _{k\in \mathcal {K}}\bigcap _{i\in I_k}\left\{ x_i(I_k,b_n)\ge 0 \right\} =\bigcap _{k\in \mathcal {K}}\bigcap _{i\in [d]}\{x_i(I_k,G_n)\ge -r_nx_i(I_k,b)\}. \end{aligned} \end{aligned}$$

At the last internal intersection in the above display we can, with high probability, restrict to those i in the primal degeneracy set \(\mathrm {DP}_k:=\{i\in I_k \mid x_i(I_k,b)=0\}\). Indeed, for \(i\notin I_k\), the inequality reads \(0\ge 0\), whereas for \(i\in I_k\setminus \mathrm {DP}_k\) the right-hand side goes to \(-\infty \) and the left-hand side is bounded in probability. In other words \(\mathbb {P}(B_n^\mathcal {K})=o(1)+\mathbb {P}(G_n^{m_0}\in \cap _{k\in \mathcal {K}} H_k^{m_0})\), where

$$\begin{aligned} H_k^{m_0}&=\{g^{m_0}\in \mathbb {R}^{m_0}:[x(I_k,(g^{m_0},0_{m-m_0}))]_{\mathrm {DP}_k}\ge 0\}. \end{aligned}$$
(7)

For \(\emptyset \subset \mathcal {K}\subseteq [K]\) define \(H_{\mathcal {K}}^{m_0}=\cap _{k\in \mathcal {K}}H_k^{m_0}\), and write

$$\begin{aligned} A_n^\mathcal {K}=\left( B_n^\mathcal {K}\setminus \bigcup _{k\in [K]\setminus \mathcal {K}}B_n^{\{k\}}\right) \setminus \bigcup _{k>K}B_n^{\{k\}}, \end{aligned}$$

where the union over \(k>K\) can be neglected by Lemma 4.1. Thus we conclude that

$$\begin{aligned} \mathbbm {1}_{A_n^\mathcal {K}} =\mathbbm {1}_{\left\{ G_n^{m_0}\in H_\mathcal {K}^{m_0}\setminus \bigcup _{k\in [K]\setminus \mathcal {K}}H_k^{m_0}\right\} } +o_\mathbb {P}(1). \end{aligned}$$
(8)

With these preliminary statements at our disposal, we are ready to prove the main results.

Proof

(Theorem 3.1) The goal is to replace \(G_n^{m_0}\) by \(G^{m_0}\) in the indicator function in (8) in the limit as n tends to infinity. By the Portmanteau theorem (Billingsley, 1999, Theorem 2.1) and elementary arguments, it suffices to show that the \(m_0\)-dimensional boundary of each \(H_k^{m_0}\) has Lebesgue measure zero. This is indeed the case, as these are convex sets. Define the function \(T^\mathcal {K}:\mathbb {R}^{|\mathcal {K}|}\times \mathbb {R}^m\rightarrow \mathbb {R}^d\) by

$$\begin{aligned} T^\mathcal {K}(\alpha ,v) =\mathbbm {1}_{\left\{ v_{[m_0]}\in H_\mathcal {K}^{m_0}\setminus \bigcup _{j\in [K]\setminus \mathcal {K}} H_j^{m_0}\right\} } \sum _{k\in \mathcal {K}} \alpha _k x(I_k,v). \end{aligned}$$

This function is continuous for all \(\alpha \in \mathbb {R}^{|\mathcal {K}|}\) and all vectors \(v\in \mathbb {R}^m\) such that \(v_{[m_0]}\notin \partial [H_\mathcal {K}^{m_0}\setminus \bigcup _{k\in [K]\setminus \mathcal {K}}H_k^{m_0}]\). In particular, the continuity set is of full measure with respect to the law of \((\alpha ^{\mathcal {K}},G)\). As there are finitely many possible subsets \(\mathcal {K}\), denoted by \(\mathcal {K}_1,\dots ,\mathcal {K}_B\), the function \(T=\left( T^{\mathcal {K}_1},\dots ,T^{\mathcal {K}_B}\right) :\mathbb {R}^{\sum _{i=1}^B|\mathcal {K}_i|}\times \mathbb {R}^m \rightarrow (\mathbb {R}^d)^B\) defined by

$$\begin{aligned} T\left( \alpha ^{\mathcal {K}_1},\dots ,\alpha ^{\mathcal {K}_B},v\right) =\left( T^{\mathcal {K}_1}(\alpha ^{\mathcal {K}_1},v),\dots ,T^{\mathcal {K}_B}(\alpha ^{\mathcal {K}_B},v)\right) \end{aligned}$$

is continuous G-almost surely. The continuous mapping theorem together with the assumed joint distributional convergence of the random vectors \((\alpha _n^\mathcal {K},G_n)\) yields that

$$\begin{aligned} \begin{aligned} \sum _{\emptyset \subset \mathcal {K}\subseteq [K]} T^\mathcal {K}\left( \alpha ^{\mathcal {K}}_n,G_n\right) \xrightarrow []{D} \sum _{\emptyset \subset \mathcal {K}\subseteq [K]} T^\mathcal {K}\left( \alpha ^{\mathcal {K}},G\right) \end{aligned} \end{aligned}$$

which completes the proof of Theorem 3.1. \(\square \)

Proof

(Theorem 3.3) With high probability, (\(\mathbf{A1} \)) and (\(\mathbf{A2} \)) hold for \(b_n\) (by Lemma 4.1 for the former and trivially for the latter), which implies that \(\mathcal {P}^\star (b_n)\) is a singleton (Lemma 2.5 and Fact 2.4). Hence, regardless of the choice of \(\alpha _n^\mathcal {K}\), it holds that \(\mathbbm {1}_{A_n^\mathcal {K}}x^\star (b_n)=\mathbbm {1}_{A_n^\mathcal {K}}x(I_{\min \mathcal {K}},b_n)\). In particular, we may assume without loss of generality that the \(\alpha _n^\mathcal {K}\) are deterministic and do not depend on n. Thus the joint convergence in Theorem 3.1 holds, and (6) simplifies to

$$\begin{aligned} r_n[x^\star (b_n) - Y_n] =\sum _{\emptyset \subset \mathcal {K}\subseteq [K]} \mathbbm {1}_{A_n^\mathcal {K}}(\omega )\, x(I_{\min \mathcal {K}},G_n(\omega )) +o_\mathbb {P}(1) =x(I_{K(\omega )},G_n(\omega ))+o_\mathbb {P}(1), \end{aligned}$$

where \(K(\omega )\) is the minimal \(k\le K\) such that \(\omega \in B_n^{\{k\}}\). Since \(B_n^{\{k\}}\) is asymptotically \(\{G\in H_k^{m_0}\}\), Theorem 3.3 follows. Let us now show that M(G) simplifies to \(\sum _{k=1}^K \mathbbm {1}_{\left\{ G\in H_k^m\right\} }\,x\left( I_k,G\right) \) if G is absolutely continuous. It suffices to show that the intersections \(H_j^m\cap H_k^m\) with \(j<k\le K\) have Lebesgue measure zero. If \(v\in H_j^m\cap H_k^m\), then there exists \(\eta >0\) such that \(x(I_j,b+\eta v)\ge 0\) and \(x(I_k,b+\eta v)\ge 0\). Since \(\lambda (I_k)\) and \(\lambda (I_j)\) are dual feasible, they must be optimal with respect to \(b+\eta v\). Thus it holds that

$$\begin{aligned} 0=\frac{1}{\eta }(b+\eta v)^T[\lambda (I_k) - \lambda (I_j)] =v^T[\lambda (I_k) - \lambda (I_j)]. \end{aligned}$$

By (\(\mathbf{A2} \)) the vector \(\lambda (I_k)-\lambda (I_j)\) is nonzero and hence v is contained in its orthogonal complement, which indeed has Lebesgue measure zero. \(\square \)

Proof

(Theorem 3.5) We consider the optimality sets \(\mathcal P^\star (b_n)\) as elements of the power set of \(\mathbb {R}^d\) endowed with the Hausdorff distance \(d_H\). Then, for all \(\mathcal {K}\subseteq [N]\) the mapping \(v\mapsto \mathcal {C}(x(I_\mathcal {K},v))\) is Lipschitz since, without loss of generality, \(\mathcal {K}\ne \emptyset \) and

$$\begin{aligned} d_H\left( \mathcal {C}(x(I_\mathcal {K},u)),\mathcal {C}(x(I_\mathcal {K},v))\right) \le \Vert u-v\Vert \max _{k\in \mathcal {K}}\Vert A_{I_k}^{-1}\Vert _\infty . \end{aligned}$$

It follows that

$$\begin{aligned} \mathcal P^\star (b_n) =\sum _{\mathcal {K}\subseteq [N]} \mathbbm {1}_{A_n^\mathcal {K}}\mathcal {C}(x(I_\mathcal {K}, b_n)), \qquad (\mathcal {C}(\emptyset )=\emptyset ) \end{aligned}$$

is a measurable random subset of \(\mathbb {R}^d\). According to Fact 2.3, in the presence of (\(\mathbf{A1} \)) and by the preceding computations,

$$\begin{aligned} r_n\mathcal P^\star (b_n) =o_{\mathbb P}(1)+\sum _{\emptyset \ne \mathcal {K}\subseteq [K]} \mathbbm {1}_{\left\{ G_n^{m_0}\in H_\mathcal {K}^{m_0}\setminus \bigcup _{k\in [K]\setminus \mathcal {K}}H_k^{m_0}\right\} }\mathcal {C}(x(I_\mathcal {K}, r_nb_n)). \end{aligned}$$

If \(\mathcal P^\star (b)=\{x^\star (b)\}\) is a singleton, then \(\mathcal {C}(x(I_{[K]},b))=\{x^\star (b)\}\) and therefore

$$\begin{aligned} r_n(\mathcal P^\star (b_n) - x^\star (b)) =o_{\mathbb P}(1)+&\sum _{\emptyset \ne \mathcal {K}\subseteq [K]} \mathbbm {1}_{\left\{ G_n^{m_0}\in H_\mathcal {K}^{m_0}\setminus \bigcup _{k\in [K]\setminus \mathcal {K}}H_k^{m_0}\right\} }\mathcal {C}(x(I_\mathcal {K}, G_n)) \\ \xrightarrow []{D}&\sum _{\emptyset \ne \mathcal {K}\subseteq [K]} \mathbbm {1}_{\left\{ G^{m_0}\in H_\mathcal {K}^{m_0}\setminus \bigcup _{k\in [K]\setminus \mathcal {K}}H_k^{m_0}\right\} }\mathcal {C}(x(I_\mathcal {K}, G)) \end{aligned}$$

by the continuous mapping theorem. \(\square \)

Proof

(Proposition 3.7) Let \(K=\mathbb R_+^d\) and define the linear map \(\tau :\mathbb R^d\rightarrow \mathbb R^{m+1}\) by \(\tau (x)=(Ax,c^Tx)\). For each b such that the linear program is feasible, let \(v_b\in \mathbb R\) denote the optimal objective value. If \(\tau \) is injective, then the optimality sets are singletons and the result holds trivially. We thus assume that \(\tau \) is not injective, and observe that

$$\begin{aligned} K\cap \tau ^{-1}\{(b,v_b)\} =\{x\ge 0:Ax=b,\,c^Tx=v_b\} =\mathcal {P}^\star (b). \end{aligned}$$

Since K is a polyhedron and \(\tau \) is neither identically zero (A has full rank) nor injective, we can apply the main theorem of Walkup & Wets (1969b). We obtain

$$\begin{aligned} d_H(\mathcal {P}^\star (b_n),\mathcal {P}^\star (b)) \le B\sqrt{\Vert b-b_n\Vert ^2 + |v_b-v_{b_n}|^2} =O_\mathbb {P}(r_n^{-1}),\qquad B=B(A,c)<\infty , \end{aligned}$$

because the optimal values satisfy \(v_b-v_{b_n}=O_\mathbb {P}(r_n^{-1})\) by Proposition 3.6. \(\square \)

5 On the assumptions

We start by collecting some well-known facts from parametric optimization (see Walkup & Wets (1969a) and Guddat et al. (1974) for details). To this end, denote the dual feasible set by \(\mathcal {N}:=\left\{ \lambda \in \mathbb {R}^m\mid \, A^T\lambda \le c\right\} \). Further, define the set of feasible parameters by \(\mathcal {M}:= \left\{ b\in \mathbb {R}^m \mid \, \mathcal {P}(b)\ne \emptyset \right\} \) and the set of solvable parameters by \(\mathcal {M}^\star := \left\{ b\in \mathbb {R}^m \mid \, \mathcal {P}^\star (b)\ne \emptyset \right\} \).

Lemma 5.1

If for some \(b_0\in \mathcal {M}\) the set \(\mathcal {P}(b_0)\) is bounded (resp. unbounded) then \(\mathcal {P}(b)\) is bounded (resp. unbounded) for all \(b\in \mathcal {M}\). Similarly, if for some \(b_0\in \mathcal {M}^\star \) the set \(\mathcal {P}^\star (b_0)\) is bounded (resp. unbounded) then \(\mathcal {P}^\star (b)\) is bounded (resp. unbounded) for all \(b\in \mathcal {M}^\star \). Moreover, it holds that

(i)

    the set \(\mathcal {M}\) is non-empty and equal to an m-dimensional convex cone.

(ii)

    if the dual set \(\mathcal {N}\) is non-empty then it holds that \(\mathcal {M}=\mathcal {M}^\star \).

(iii)

    if the dual set \(\mathcal {N}\) is non-empty and bounded then \(\mathcal {M}=\mathcal {M}^\star =\mathbb {R}^m\).

The following discussion of the assumptions is a consequence of Lemma 5.1. We first collect sufficient conditions for assumption (\(\mathbf{A1} \)).

Corollary 5.2

(Sufficiency for (\(\mathbf{A1} \))) The following statements hold.

(i)

If \(\mathcal {N}\) is non-empty and \(\mathcal {P}(b)\) is bounded for some \(b\in \mathcal {M}\), then assumption (\(\mathbf{A1} \)) holds for all \(b\in \mathcal {M}\).

(ii)

If \(\mathcal {N}\) is non-empty and bounded and \(\mathcal {P}^\star (b)\) is bounded for some \(b\in \mathbb {R}^m\), then assumption (\(\mathbf{A1} \)) holds for all \(b\in \mathbb {R}^m\).

Certainly, if \(\mathcal {P}^\star (b)\ne \emptyset \), then (\(\mathbf{A1} \)) is equivalent to \(\mathcal {P}^\star (b)\) being bounded. The latter property is independent of b and equivalent to the recession cone \(\left\{ x\in \mathbb {R}^d\mid \, Ax=0,\,x\ge 0,\, c^Tx=0\right\} \) being trivial, i.e., equal to \(\{0\}\). A sufficient condition for this is boundedness of \(\mathcal {P}(b)\), which can easily be checked in certain settings.

Lemma 5.3

(Sufficiency for \(\mathcal {P}(b)\) bounded) Suppose that A has non-negative entries and no column of A equals \(0\in \mathbb {R}^m\). Then \(\mathcal {P}(b)\) is bounded (possibly empty) for all \(b\in \mathbb {R}^m\).

It is noteworthy that if the dual feasible set \(\mathcal {N}\) is non-empty and bounded, then \(\mathcal {P}^\star (b)\ne \emptyset \) for all \(b\in \mathbb {R}^m\), but \(\mathcal {P}(b)\) is necessarily unbounded (Clark, 1961). Thus, \(\mathcal N\) is unbounded under the conditions of Lemma 5.3. We emphasize that assumption (\(\mathbf{A2} \)) is neither easy to verify nor expected to hold for most structured linear programs. Indeed, under (\(\mathbf{A1} \)) assumption (\(\mathbf{A2} \)) is equivalent to all dual optimal basic solutions being non-degenerate (Lemma 2.5). However, degeneracy in linear programs is often the rule rather than the exception (Bertsimas & Tsitsiklis, 1997). Notably, if (\(\mathbf{A1} \)) and (\(\mathbf{A2} \)) are satisfied, the set \(\mathcal {P}^\star (b)\) is a singleton.

Assumption (\(\mathbf{B1} \)) has to be checked in each particular case and can usually be verified by an application of the central limit theorem (for a particular example see Sect. 6). Assumption (\(\mathbf{B2} \)) is obviously necessary for the limiting distribution to exist. If the dual feasible set \(\mathcal {N}\) is non-empty and bounded and (\(\mathbf{B1} \)) holds, then (\(\mathbf{B2} \)) is always satisfied. A more refined statement is the following.

Lemma 5.4

(Sufficiency for (\(\mathbf{B2} \))) Consider the set \(\mathcal {P}(b_0)\), assumed to be non-empty. Then \(\mathcal {P}(b)\) is non-empty for all b sufficiently close to \(b_0\) if either of the following conditions holds.

(i)

    the set \(\mathcal {P}(b_0)\) contains a non-degenerate feasible basic solution.

(ii)

Slater’s constraint qualification holds.

In particular, if the dual feasible set \(\mathcal {N}\) is non-empty and (\(\mathbf{B1} \)) holds, then each of the conditions (i) and (ii) is sufficient for (\(\mathbf{B2} \)).

Joint convergence. Our goal here is to state useful conditions under which the random vector \(\left( \alpha _n^\mathcal {K}, G_n\right) \) converges jointly in distribution to some limit random variable \(\left( \alpha ^\mathcal {K},G\right) \) on the space \( \Delta _{\vert \mathcal {K}\vert }\times \mathbb {R}^m\). By assumption (\(\mathbf{B1} \)), \(G_n\rightarrow G\) in distribution, and a necessary condition for the joint distributional convergence of \((\alpha _n^\mathcal {K},G_n)\) is that \(\alpha _n^\mathcal {K}\) has a distributional limit \(\alpha ^\mathcal {K}\). There is no reason to expect \(\alpha _n^\mathcal {K}\) and \(G_n\) to be independent, as discussed at the end of this section. We give a condition weaker than independence that is formulated in terms of the conditional distribution of \(\alpha _n^\mathcal {K}\) given \(G_n\) (or, equivalently, given \(b_n=b+G_n/r_n\)). These conditions are natural in the sense that if \(b_n=g\), then the choice of solution \(x^\star (g)\), as encapsulated by the \(\alpha _n^\mathcal {K}\)'s, is determined by the specific linear program solver in use.

Treating conditional distributions rigorously requires some care and machinery. Let \(\mathcal Z=\mathcal Z^\mathcal {K}=\Delta _{|\mathcal {K}|}\times \mathbb {R}^m\) and for \(\varphi :\mathcal Z\rightarrow \mathbb {R}\) denote

$$\begin{aligned} \Vert \varphi \Vert _\infty = \sup _{z}|\varphi (z)|, \qquad \Vert \varphi \Vert _{\mathrm {Lip}}=\sup _{z_1\ne z_2}\frac{|\varphi (z_1)-\varphi (z_2)|}{\Vert z_1-z_2\Vert }, \qquad \Vert \varphi \Vert _{\mathrm {BL}}=\Vert \varphi \Vert _\infty + \Vert \varphi \Vert _{\mathrm {Lip}}. \end{aligned}$$

We say that \(\varphi \) is bounded Lipschitz if it belongs to \(\mathrm {BL}(\mathcal Z)=\{\varphi :\mathcal Z\rightarrow \mathbb {R}\,\vert \,\Vert \varphi \Vert _{\mathrm {BL}}\le 1\}\). The bounded Lipschitz metric

$$\begin{aligned} \mathrm {BL}(\mu _1,\mu _2) :=\sup _{\varphi \in \mathrm {BL}(\mathcal Z)}\left| \int _\mathcal Z\varphi (z)\, d\left( \mu _1 - \mu _2\right) (z)\right| \end{aligned}$$
(9)

is well-known to metrize convergence in distribution of (probability) measures on \(\mathcal Z\) (Dudley, 1966, Theorems 6 and 8). According to the disintegration theorem (see Kallenberg, 1997, Theorem 5.4; Dudley, 2002, Section 10.2; or Chang & Pollard, 1997 for details), we may write the joint distribution of \(\left( \alpha _n^\mathcal {K},b_n\right) \) as an integral of conditional distributions \(\mu _{n,g}^\mathcal {K}\) that represent the distribution of \(\alpha _n^\mathcal {K}\) given that \(b_n=g\). More precisely, \(g\mapsto \mu _{n,g}^\mathcal {K}\) is measurable from \(\mathbb {R}^m\) to the metric space of probability measures on \(\Delta _{|\mathcal {K}|}\) with the bounded Lipschitz metric, so that for any \(\varphi \in \mathrm {BL}(\mathcal Z)\) it holds that

$$\begin{aligned} \mathbb {E}\varphi (\alpha _n^\mathcal {K},b_n) =\mathbb {E}\psi _n(b_n), \qquad \psi _n(g)=\int _{\Delta _{|\mathcal {K}|}}\varphi (\alpha ,g)d\mu ^\mathcal {K}_{n,g}(\alpha ), \end{aligned}$$

where \(\psi _n:\mathbb {R}^m\rightarrow \mathbb {R}\) is a measurable function. The joint distribution of \((\alpha _n^\mathcal {K},G_n)\) is determined by the collection of expectations

$$\begin{aligned} \mathbb {E}\varphi (\alpha _n^\mathcal {K},G_n) =\mathbb {E}\psi _n(G_n) =\mathbb {E}\psi _n\left( r_n(b_n-b)\right) ,\qquad \varphi \in \mathrm {BL}(\mathcal Z). \end{aligned}$$

Our sufficient condition for joint convergence is given by the following lemma. It is noteworthy that the spaces \(\mathbb {R}^m\) and \(\Delta _{|\mathcal {K}|}\) can be replaced with arbitrary Polish spaces, and even more general spaces, as long as the disintegration theorem is valid.

Lemma 5.5

Let \(\{\mu ^\mathcal {K}_{g}\}_{g\in \mathbb {R}^m}\) be a collection of probability measures on \(\Delta _{|\mathcal {K}|}\) such that the map \(g\mapsto \mu ^\mathcal {K}_{g}\) is continuous at G-almost every g, and suppose that \(\mu ^\mathcal {K}_{n,g}\rightarrow \mu ^\mathcal {K}_{g}\) uniformly in g with respect to the bounded Lipschitz metric \(\mathrm {BL}\). Then \((\alpha _n^\mathcal {K},G_n)\) converges in distribution to a random vector \((\alpha ^\mathcal {K},G)\) satisfying

$$\begin{aligned} \mathbb {E}\varphi (\alpha ^\mathcal {K},G) =\mathbb {E}_{G}\int _{\Delta _{|\mathcal {K}|}}\varphi (\alpha ,G)d\mu _{G}^\mathcal {K}(\alpha ) :=\mathbb {E}\psi (G) \end{aligned}$$

for any \(\varphi \in \mathrm {BL}(\mathcal Z)\) (this determines the distribution of the random vector \((\alpha ^\mathcal {K},G)\) completely). Moreover, if \(\mathcal L\) denotes the distribution of a random vector, then the rate of convergence can be quantified as

$$\begin{aligned} \mathrm {BL}(\mathcal L[(\alpha _n^\mathcal {K},G_n)],\mathcal L[(\alpha ^\mathcal {K},G)]) \le \sup _{g} \mathrm {BL}(\mu ^\mathcal {K}_{n,g},\mu ^\mathcal {K}_{g})+(1+L)\mathrm {BL}(\mathcal L[G_n],\mathcal L[G]), \end{aligned}$$

where \(L:=\sup _{g_1\ne g_2}\mathrm {BL}(\mu ^\mathcal {K}_{g_1},\mu ^\mathcal {K}_{g_2})/\Vert g_1-g_2\Vert \in [0,\infty ]\). The supremum with respect to g can be replaced by an essential supremum.

The conditions of Lemma 5.5 (and hence the joint convergence in Theorem 3.1) will be satisfied in many practical situations. For example, given \(b_n\) and an initial basis for the simplex method, its output is determined by the pivoting rule (for a general overview see Terlaky & Zhang, 1993 and references therein). Deterministic pivoting rules lead to degenerate conditional distributions of \(\alpha _n^\mathcal {K}\) given \(b_n=g\), whereas random pivoting rules may lead to non-degenerate conditional distributions. In both cases these conditional distributions do not depend on n at all, but only on the input vector g. In particular, the uniform convergence in Lemma 5.5 is trivially fulfilled (the supremum is equal to zero). It is reasonable to assume that these conditional distributions depend continuously on g except for some boundary values that are contained in a lower-dimensional space (which will have measure zero under the absolutely continuous random vector G).

6 Optimal transport

Optimal transport (OT) dates back to the French mathematician and engineer Monge (1781). Roughly speaking, it seeks to transport objects from one collection of locations to another in the most economical manner. Apart from the work of Appell (1887), much of the progress in OT began in the mid-twentieth century, not least due to its practical relevance in economics. Indeed, much of the theory of linear programming, including the simplex algorithm, has been motivated by findings for OT, with early contributions by Hitchcock (1941), Kantorovich (1942), Dantzig (1948) and Koopmans (1951). Since then a surprisingly rich theory has emerged, with important contributions by Kantorovich & Rubinstein (1958), Zolotarev (1976), Sudakov (1979), Kellerer (1984), Rachev (1985), Brenier (1987), Smith & Knott (1987), McCann (1997), Jordan et al. (1998), Ambrosio et al. (2008) and Lott & Villani (2009), among many others. We also refer to the excellent monographs by Rachev & Rüschendorf (1998), Villani (2008) and Santambrogio (2015) for further details. In fact, OT has recently gained renewed interest, especially as computational progress paves the way to explore novel fields of applications such as imaging (Rubner et al., 2000; Solomon et al., 2015), machine learning (Frogner et al., 2015; Arjovsky et al., 2017; Peyré & Cuturi, 2019), and statistical data analysis (Chernozhukov et al., 2017; Sommerfeld & Munk, 2018; del Barrio et al., 2019; Panaretos & Zemel, 2019).

On a finite space \(\mathcal {X}=\{x_1,\ldots ,x_N\}\) equipped with an underlying cost function \(c:\mathcal {X}\times \mathcal {X}\rightarrow \mathbb {R}\), OT between two probability measures \(r,s\in \Delta _N:=\{ r\in \mathbb {R}^N\, \vert \, \mathbf {1}_N^Tr=1,\, r_i\ge 0 \}\) is equal to the linear program

$$\begin{aligned} OT(r,s):=\min _{\pi \in \Pi (r,s)}\; \sum _{i,j=1}^N c_{ij}\,\pi _{ij}, \end{aligned}$$
(\(\text {OT}\))

where \(c_{ij}=c(x_i,x_j)\) and the set \(\Pi (r,s)\) denotes all non-negative \(N\times N\) matrices with row and column sums equal to r and s, respectively. OT poses the challenge of finding an optimal solution, termed OT coupling, \(\pi ^\star (r,s)\) between r and s such that the integrated cost is minimal among all possible couplings. We denote by \(\Pi ^\star (r,s)\) the set of all OT couplings. The dual problem is

$$\begin{aligned} \max _{\lambda =(\lambda ^1,\lambda ^2)\in \mathbb {R}^N\times \mathbb {R}^N}\; r^T\lambda ^1+s^T\lambda ^2 \quad \text {subject to} \quad \lambda ^1_i+\lambda ^2_j\le c_{ij} \;\text { for all } i,j\in [N]. \end{aligned}$$
(\(\text {DOT}\))

In our context, reflecting many practical situations (Tameling et al., 2021), the measures r and s are unknown and need to be estimated from data. To this end, we assume access to independent and identically distributed (i.i.d.) \(\mathcal {X}\)-valued random variables \(X_1,\ldots ,X_n\sim r\), where a reasonable proxy for the measure r is its empirical version \(\hat{r}_n:= \frac{1}{n}\sum _{i=1}^n \delta _{X_i}\). As an illustration of our general theory, we focus on limit theorems that asymptotically \((n\rightarrow \infty )\) characterize the fluctuations of an estimated coupling \(\pi ^\star (\hat{r}_n,s)\) around \(\pi ^\star (r,s)\). For the sake of readability, we focus primarily on the one-sample case, where only r is replaced by \(\hat{r}_n\), but include a short account of the case where both measures are estimated.
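Computationally, the empirical coupling \(\pi ^\star (\hat{r}_n,s)\) is simply a solution of (\(\text {OT}\)) with perturbed marginal. A minimal sketch (numpy/scipy with the HiGHS solver; the three-point ground space, cost and measures are hypothetical), anticipating the constraint matrix (11) below:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(3)

# Hypothetical ground space {0, 1, 2} with cost c(x, y) = |x - y|.
xs = np.array([0.0, 1.0, 2.0])
N = len(xs)
cost = np.abs(xs[:, None] - xs[None, :]).ravel()   # c_ij, row-major
r = np.array([0.2, 0.3, 0.5])
s = np.array([0.4, 0.4, 0.2])

# Marginal constraints: row sums equal r, column sums equal s (cf. (11)).
A_eq = np.vstack([np.kron(np.eye(N), np.ones((1, N))),
                  np.kron(np.ones((1, N)), np.eye(N))])

def ot_coupling(r_vec, s_vec):
    res = linprog(cost, A_eq=A_eq, b_eq=np.concatenate([r_vec, s_vec]),
                  method="highs")
    return res.x.reshape(N, N)

r_hat = rng.multinomial(500, r) / 500              # empirical measure, n = 500
print("pi*(r_hat_n, s):\n", ot_coupling(r_hat, s).round(3))
print("pi*(r, s):\n", ot_coupling(r, s).round(3))
```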

A few words regarding the assumptions from Sect. 2 in the OT context are in order. Assumption (\(\mathbf{A1} \)) always holds, since \(\Pi (r,s)\subseteq [0,1]^{N^2}\) is bounded and contains the independence coupling \(rs^T\). Assumption (\(\mathbf{A2} \)), which according to Lemma 2.5 is equivalent to all dual optimal basic solutions for (\(\text {DOT}\)) being non-degenerate, however, does not always hold. Sufficient conditions for (\(\mathbf{A2} \)) to hold in OT are given in Subsect. 6.1. Concerning the probabilistic assumptions, we notice that (\(\mathbf{B2} \)) always holds, as for any (possibly random) pair of measures \((\hat{r}_n,s)\) the set \(\Pi (\hat{r}_n,s)\) is non-empty and bounded. Assumption (\(\mathbf{B1} \)) is easily verified by an application of the multivariate central limit theorem. Indeed, the multinomial process of empirical frequencies \(\sqrt{n}(r-\hat{r}_n)\) converges weakly to a centered Gaussian random vector \(G(r)\sim \mathcal {N}(0,\Sigma (r))\) with covariance matrix

$$\begin{aligned} \Sigma (r):= \begin{bmatrix} r_1(1-r_1) &{} -r_1r_2 &{} \ldots &{} -r_1r_N\\ -r_1r_2 &{} r_2(1-r_2) &{} \ldots &{} -r_2r_N\\ \vdots &{} &{} \ddots &{} \vdots \\ -r_1r_N &{} -r_2r_N &{}\ldots &{} r_N(1-r_N) \end{bmatrix}\, . \end{aligned}$$
(10)
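In code, (10) is \(\mathrm {diag}(r)-rr^T\), and the convergence can be checked by simulation; a short sketch (numpy, with a hypothetical measure r):

```python
import numpy as np

rng = np.random.default_rng(4)

r = np.array([0.2, 0.3, 0.5])                 # hypothetical measure, N = 3
Sigma = np.diag(r) - np.outer(r, r)           # covariance matrix (10)

n, reps = 2_000, 5_000
r_hat = rng.multinomial(n, r, size=reps) / n  # reps independent empirical measures
fluct = np.sqrt(n) * (r_hat - r)
print("empirical covariance of sqrt(n)(r_hat - r):\n", np.cov(fluct.T).round(3))
print("Sigma(r):\n", Sigma.round(3))
```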

Notably, \(\Sigma (r)\) is singular and G(r) fails to be absolutely continuous with respect to Lebesgue measure. A slight modification allows us to circumvent this issue. The constraint matrix in OT,

$$\begin{aligned} A=\begin{pmatrix} \mathbf {1}_N^T\\ &{} \ddots &{}\\ &{} &{} \mathbf {1}_N^T\\ \mathbf {I}_N &{} \dots &{} \mathbf {I}_N \end{pmatrix} \, \in \mathbb {R}^{2N\times N^2}, \end{aligned}$$
(11)

has rank \(2N-1\). Letting \(r_\dagger =r_{[N-1]}\in \mathbb {R}^{N-1}\) denote the first \(N-1\) coordinates of \(r\in \mathbb {R}^N\) and \(A_\dagger \in \mathbb {R}^{(2N-1)\times N^2}\) denote A with its N-th row removed, it holds that

$$\begin{aligned} \Pi (r,s)= \left\{ \pi \in \mathbb {R}^{N^2}\,\vert \, A_{\dagger }\pi =\begin{bmatrix} r_{\dagger }\\ s \end{bmatrix},\,\pi \ge 0 \right\} . \end{aligned}$$
(12)
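The matrices A and \(A_\dagger \) are conveniently assembled via Kronecker products; the following sketch (numpy) verifies the rank statement for \(N=3\):

```python
import numpy as np

N = 3
A = np.vstack([np.kron(np.eye(N), np.ones((1, N))),   # N row-sum constraints
               np.kron(np.ones((1, N)), np.eye(N))])  # N column-sum constraints
A_dagger = np.delete(A, N - 1, axis=0)                # drop the N-th row

print(A.shape, "rank", np.linalg.matrix_rank(A))            # (6, 9) rank 5
print(A_dagger.shape, "rank", np.linalg.matrix_rank(A_dagger))  # (5, 9) rank 5
```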

The limiting random variable for \(\sqrt{n}\big ((\hat{r}_n)_\dagger -r_\dagger \big )\), as n tends to infinity, is equal to \(G(r_\dagger )\), which follows an absolutely continuous distribution if and only if \(r_\dagger >0\) and \(\Vert r_\dagger \Vert _1<1\). Equivalently, r is in the relative interior of \(\Delta _N\) (denoted \(\mathrm {ri}(\Delta _N)\)), i.e., \(0<r\in \Delta _N\). Under this condition, (\(\mathbf{A1} \)), (\(\mathbf{B1} \)) and (\(\mathbf{B2} \)) hold and from the main result in Theorem 3.1 we immediately deduce the limiting distribution of optimal OT couplings.

Theorem 6.1

(Distributional limit law for OT couplings) Consider the optimal transport problem (\(\text {OT}\)) between two probability measures \(r,s\in \mathrm {ri}(\Delta _N)\) and let \(\hat{r}_n=\frac{1}{n}\sum _{i=1}^n \delta _{X_i}\) be the empirical measure derived from i.i.d. random variables \(X_1,\ldots ,X_n\sim r\). As the sample size n tends to infinity, there exists a sequence \(\pi _n^\star (r,s)\in \Pi ^\star (r,s)\) such that

$$\begin{aligned} \sqrt{n}\left( \pi ^\star (\hat{r}_n,s)-\pi _n^\star (r,s)\right) \xrightarrow []{D} \sum _\mathcal {K}\mathbbm {1}_{\left\{ G(r_\dagger )\in H_\mathcal {K}\setminus \cup _{k\notin \mathcal {K}}H_k\right\} }\,\alpha ^\mathcal {K}\otimes \pi (I_\mathcal {K},[G(r_\dagger ),0_N]) \end{aligned}$$
(13)

with \(G(r_\dagger )=(G^1(r_\dagger ),0)\). If further assumption (\(\mathbf{A2} \)) holds, then \(\Pi ^\star (r,s)=\left\{ \pi ^\star (r,s)\right\} \) and

$$\begin{aligned} \sqrt{n}\left( \pi ^\star (\hat{r}_n,s)-\pi ^\star (r,s)\right) \xrightarrow []{D} \sum _{k=1}^K \mathbbm {1}_{\left\{ G(r_\dagger )\in H_k\right\} }\, \pi (I_k,[G(r_\dagger ),0_N]). \end{aligned}$$
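To see the theorem in action, the following Monte Carlo sketch (NumPy/SciPy; the support points, measures and the strictly convex cost with p=2 are illustrative choices, for which the coupling is unique by Lemma 6.5) simulates the rescaled fluctuations appearing on the left-hand side:

```python
# Sketch: simulate sqrt(n) * (pi*(r_hat_n, s) - pi*(r, s)), whose distribution
# Theorem 6.1 characterizes.
import numpy as np
from scipy.optimize import linprog

def ot_coupling(r, s, C):
    N = len(r)
    A_eq = np.vstack([np.kron(np.eye(N), np.ones(N)),    # row-sum constraints
                      np.kron(np.ones(N), np.eye(N))])   # column-sum constraints
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([r, s]),
                  bounds=(0, None), method="highs")
    return res.x.reshape(N, N)

rng = np.random.default_rng(1)
x = np.array([0.0, 1.0, 2.5])                  # illustrative support
C = np.abs(np.subtract.outer(x, x)) ** 2       # cost |x - y|^2
r = np.array([0.2, 0.5, 0.3]); s = np.array([0.3, 0.3, 0.4])
pi_star = ot_coupling(r, s, C)

n = 5000
fluct = [np.sqrt(n) * (ot_coupling(np.bincount(rng.choice(3, n, p=r),
                                               minlength=3) / n, s, C) - pi_star)
         for _ in range(200)]
print(np.mean(fluct, axis=0))                  # Monte Carlo mean of the fluctuations
```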

Remark 6.2

The two-sample case presents an additional challenge. By the multivariate central limit theorem, we have for \(\min (m,n)\rightarrow \infty \) and \(\frac{m}{n+m}\rightarrow \lambda \in (0,1)\) that

$$\begin{aligned} \sqrt{\frac{nm}{n+m}}\left( \begin{bmatrix} \widehat{r_\dagger }_n\\ \hat{s}_m \end{bmatrix}-\begin{bmatrix} r_\dagger \\ s \end{bmatrix} \right) \xrightarrow []{D} G_\lambda \left( r_\dagger ,s\right) :=\left( \sqrt{\lambda }G^1(r_\dagger ),\sqrt{1-\lambda }G^2(s)\right) \end{aligned}$$
(14)

with \(G^1(r_\dagger )\) and \(G^2(s)\) independent, so that the compound limit law follows a centered Gaussian distribution with block diagonal covariance matrix, whose two blocks are given by (10) (restricted to the first \(N-1\) coordinates for the \(r_\dagger \)-block), respectively. However, the limit law fails to be absolutely continuous. Nevertheless, the distributional limit theorem for OT couplings remains valid in this case, and there exists a sequence \(\pi _{n,m}^\star (r,s)\in \Pi ^\star (r,s)\) such that

$$\begin{aligned} \sqrt{\frac{nm}{n+m}}\left( \pi ^\star (\hat{r}_n,\hat{s}_m)-\pi _{n,m}^\star (r,s)\right) \xrightarrow []{D} \sum _\mathcal {K}\mathbbm {1}_{\left\{ G_\lambda \left( r_\dagger ,s\right) \in H_\mathcal {K}\setminus \cup _{k\notin \mathcal {K}}H_k\right\} }\,\alpha ^\mathcal {K}\otimes \pi (I_\mathcal {K},G_\lambda \left( r_\dagger ,s\right) ). \end{aligned}$$

We provide further details in Appendix 1.
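For concreteness, the block diagonal covariance of \(G_\lambda (r_\dagger ,s)\) in (14) can be assembled as follows (NumPy/SciPy; \(\lambda \) and the measures are illustrative):

```python
# Sketch: covariance of the two-sample limit in (14), block diagonal with
# blocks lambda * Sigma(r_dagger) and (1 - lambda) * Sigma(s).
import numpy as np
from scipy.linalg import block_diag

def multinomial_cov(p):                  # the covariance from (10)
    return np.diag(p) - np.outer(p, p)

r = np.array([0.2, 0.5, 0.3]); s = np.array([0.3, 0.3, 0.4]); lam = 0.5
Sigma_lam = block_diag(lam * multinomial_cov(r)[:-1, :-1],   # r_dagger block
                       (1 - lam) * multinomial_cov(s))       # s block
print(Sigma_lam.shape)                   # (2N - 1, 2N - 1) = (5, 5)
```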

We emphasize that once a limit law for the OT coupling is available, one can derive limit laws for sufficiently smooth functionals thereof; examples include the OT curve (Klatt et al., 2020) and OT geodesics (McCann, 1997). The details are omitted for brevity; instead, we provide an illustration of the distributional limit theorem (Theorem 6.1).

Example 6.3

We consider a ground space \(\mathcal {X}=\left\{ x_1<x_2<x_3\right\} \subset \mathbb {R}\) consisting of \(N=3\) points with cost \(c=\left( 0,\vert x_1-x_2\vert ,\vert x_1-x_3\vert , \vert x_2-x_1\vert ,0,\vert x_2-x_3\vert , \vert x_3-x_1\vert ,\vert x_3-x_2\vert ,0\right) \in \mathbb {R}^9\) for which OT then reads as

$$\begin{aligned} \min _{\pi \in \mathbb {R}^9} \quad c^T\pi \quad \text {s.t.} \quad A_{\dagger }\pi =\begin{bmatrix} r_{\dagger }\\ s \end{bmatrix},\,\pi \ge 0 \end{aligned}$$

with constraint matrix \(A_\dagger \in \mathbb {R}^{5\times 9}\). A basis I is a subset of cardinality five of the column index set \(\{1,\ldots ,9\}\) such that \((A_\dagger )_I\) has full rank. For OT it is convenient to think of a feasible solution in terms of a transport matrix \(\pi \in \mathbb {R}^{3\times 3}\), with \(\pi _{ij}\) encoding the mass transported from source i to destination j. For instance, the basis \(I=\{1,2,3,5,9\}\) corresponds to the transport scheme

$$\begin{aligned} TS(\{1,2,3,5,9\}):=\begin{pmatrix} *& *& *\\ & *& \\ & & *\end{pmatrix}, \end{aligned}$$

where each possible non-zero entry is marked by a star and specific values depend on the measures r and s. In particular, to basis I corresponds the (possibly infeasible) basic solution \(\pi (I,(r_\dagger ,s))=(A_\dagger )_I^{-1}(r_\dagger ,s)\) that we illustrate in terms of its transport scheme by

$$\begin{aligned} \pi \left( I,(r_\dagger ,s)\right) =\begin{pmatrix} s_1 & s_2-r_2 & r_1+r_2-s_1-s_2\\ & r_2 & \\ & & s_1+s_2+s_3-r_1-r_2 \end{pmatrix}=\begin{pmatrix} s_1 & s_2-r_2 & s_3-r_3\\ & r_2 & \\ & & r_3 \end{pmatrix}, \end{aligned}$$

where \(r=(r_\dagger ,1-\Vert r_\dagger \Vert _1)\in \mathbb {R}^3\) and the second equality uses that r and s sum to one. Clearly, \(\pi (I,(r_\dagger ,s))\) is feasible if and only if \(s_2\ge r_2\) and \(s_3\ge r_3\). Suppose now that the measures are equal, \(r=s\). Then the transport problem attains a unique solution supported on the diagonal, i.e., all mass remains at its current location. A straightforward computation yields \(K=8\) primal and dual optimal bases

$$\begin{aligned} \begin{array}{cccc} TS(I_1)=\begin{pmatrix} *& *& \\ & *& \\ & *& *\end{pmatrix}, & TS(I_2)=\begin{pmatrix} *& & \\ *& *& *\\ & & *\end{pmatrix}, & TS(I_3)=\begin{pmatrix} *& & \\ *& *& \\ *& & *\end{pmatrix}, & TS(I_4)=\begin{pmatrix} *& & *\\ & *& *\\ & & *\end{pmatrix}, \\ TS(I_5)=\begin{pmatrix} *& *& *\\ & *& \\ & & *\end{pmatrix}, & TS(I_6)=\begin{pmatrix} *& & \\ & *& \\ *& *& *\end{pmatrix}, & TS(I_7)=\begin{pmatrix} *& & \\ *& *& \\ & *& *\end{pmatrix}, & TS(I_8)=\begin{pmatrix} *& *& \\ & *& *\\ & & *\end{pmatrix}. \end{array} \end{aligned}$$

For example, the transport scheme \(TS(I_1)\) corresponds to the basis \(I_1=\{1,2,5,8,9\}\) and induces an invertible matrix \((A_\dagger )_{I_1}\). Omitting the superscript \(m_0=5\) for clarity, the respective closed convex cones \(H_k\) for \(1\le k\le K\) as defined in (7) are

$$\begin{aligned} H_{1}&=\left\{ v\in \mathbb {R}^5\, \mid \, v_1\ge v_3,\, v_1+v_2\le v_3+v_4\right\} ,\quad H_{2}=\left\{ v\in \mathbb {R}^5\, \mid \, v_1\le v_3,\, v_1+v_2\ge v_3+v_4\right\} ,\\ H_{3}&=\left\{ v\in \mathbb {R}^5\, \mid \, v_2\ge v_4,\, v_1+v_2\le v_3+v_4\right\} ,\quad H_{4}=\left\{ v\in \mathbb {R}^5\, \mid \, v_1\ge v_3,\, v_2\ge v_4\right\} ,\\ H_{5}&=\left\{ v\in \mathbb {R}^5\, \mid \, v_2\le v_4,\, v_1+v_2\ge v_3+v_4\right\} ,\quad H_{6}=\left\{ v\in \mathbb {R}^5\, \mid \, v_1\le v_3,\, v_2\le v_4\right\} ,\\ H_{7}&=\left\{ v\in \mathbb {R}^5\, \mid \, v_1\le v_3,\, v_1+v_2\le v_3+v_4\right\} , \quad H_{8}=\left\{ v\in \mathbb {R}^5\, \mid \, v_1\ge v_3,\, v_1+v_2\ge v_3+v_4\right\} . \end{aligned}$$

Each of these cones is an intersection of two half-spaces. Some of the cones exhibit non-trivial intersections, and in particular (\(\mathbf{A2} \)) fails to hold. Such cases arise for the pairs \(\{I_3,I_7\}\), \(\{I_6,I_7\}\), \(\{I_4,I_8\}\) and \(\{I_5,I_8\}\); the intersections of the corresponding cones are given by

$$\begin{aligned} H_3\cap H_7&=\left\{ v\in \mathbb {R}^5\, \mid \, v_2\ge v_4,\, v_1+v_2\le v_3+v_4\right\} ,&H_6\cap H_7=\left\{ v\in \mathbb {R}^5\, \mid \, v_1\le v_3,\, v_2\le v_4\right\} ,\\ H_5\cap H_8&=\left\{ v\in \mathbb {R}^5\, \mid \, v_2\le v_4, \, v_1+v_2\ge v_3+v_4 \right\} ,&H_4\cap H_8=\left\{ v\in \mathbb {R}^5\,\mid \, v_1\ge v_3,\, v_2\ge v_4 \right\} . \end{aligned}$$

The weak convergence in (14), together with the cones derived above for OT with \(p=1\) and \(r=s\), then leads to the corresponding distributional limit law for OT couplings.

Although \(K=8\), there are only four distinct dual solutions: \(\lambda (I_1)\), \(\lambda (I_2)\), \(\lambda (I_7)\) and \(\lambda (I_8)\). A small numerical companion to this example is sketched below.
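As a numerical companion to Example 6.3 (a sketch with hypothetical measures chosen so that the basic solution is feasible), the closed form of \(\pi (I,(r_\dagger ,s))\) for \(I=\{1,2,3,5,9\}\) can be verified by solving the corresponding \(5\times 5\) basis system:

```python
# Sketch: compute the basic solution pi(I, (r_dagger, s)) = (A_dagger)_I^{-1} (r_dagger, s)
# for the basis I = {1, 2, 3, 5, 9} of Example 6.3.
import numpy as np

N = 3
A = np.vstack([np.kron(np.eye(N), np.ones(N)),   # row-sum constraints
               np.kron(np.ones(N), np.eye(N))])  # column-sum constraints
A_dagger = np.delete(A, N - 1, axis=0)           # shape (5, 9)

r = np.array([0.5, 0.2, 0.3])                    # hypothetical measures with
s = np.array([0.2, 0.4, 0.4])                    # s_2 >= r_2 and s_3 >= r_3
b = np.concatenate([r[:-1], s])                  # (r_dagger, s)

I = np.array([1, 2, 3, 5, 9]) - 1                # basis as zero-based column indices
pi = np.zeros(9)
pi[I] = np.linalg.solve(A_dagger[:, I], b)       # basic variables
print(pi.reshape(3, 3))  # [[s1, s2-r2, s3-r3], [0, r2, 0], [0, 0, r3]]
```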

6.1 Degeneracy and uniqueness in optimal transport

This subsection provides sufficient conditions for assumption (\(\mathbf{A2} \)) to hold. In view of Lemma 2.5, and since for OT assumption (\(\mathbf{A1} \)) is always satisfied, assumption (\(\mathbf{A2} \)) is equivalent to non-degeneracy of all dual optimal basic solutions. Notably, this implies uniqueness of the OT coupling. Conversely, if for a given cost the OT coupling is unique for all \(r,s\in \Delta _N\), then (\(\mathbf{A2} \)) holds. We begin with a sufficient criterion for (\(\mathbf{A2} \)) depending only on the cost.

Lemma 6.4

Suppose that the following holds for the cost function c: for any \(n\ge 2\) and any family of indices \(\left\{ (i_k,j_k)\right\} _{1\le k \le n}\) with all \(i_k\) pairwise distinct and all \(j_k\) pairwise distinct, it holds that

$$\begin{aligned} \sum _{k=1}^n c_{i_k j_k}\ne \sum _{k=1}^n c_{i_k j_{k-1}}, \quad j_0:= j_n. \end{aligned}$$
(15)

Then all dual basic solutions are non-degenerate. In particular, (\(\mathbf{A2} \)) holds and the optimal OT coupling is unique for any pair of measures \(r,s\in \Delta _N\).
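For small N, condition (15) can be checked by brute force. The sketch below (NumPy; the point configurations are illustrative, and the enumeration deliberately overcounts cyclically equivalent families) tests a generic random cost, for which the finitely many candidate equalities fail almost surely, and the cost \(\vert x-y\vert \) on collinear points, where the triangle equality produces a violating family:

```python
# Sketch: brute-force check of condition (15) for a small cost matrix C.
import numpy as np
from itertools import permutations

def satisfies_15(C, tol=1e-12):
    N = C.shape[0]
    for n in range(2, N + 1):
        for rows in permutations(range(N), n):      # pairwise distinct i_k
            for cols in permutations(range(N), n):  # pairwise distinct j_k
                lhs = sum(C[rows[k], cols[k]] for k in range(n))
                rhs = sum(C[rows[k], cols[k - 1]] for k in range(n))  # j_0 := j_n
                if abs(lhs - rhs) < tol:
                    return False
    return True

rng = np.random.default_rng(0)
print(satisfies_15(rng.random((3, 3))))               # True for a generic cost
x = np.array([0.0, 1.0, 2.5])
print(satisfies_15(np.abs(np.subtract.outer(x, x))))  # False: p = 1 on the line
```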

We are unaware of an explicit reference for condition (15), which is reminiscent of the well-known cyclic monotonicity property (Rüschendorf, 1996). Further, (15) can be thought of as dual to the condition of Klee & Witzgall (1968) that for any proper subsets \(A,B\subset [N]\), not both empty,

$$\begin{aligned} \sum _{i\in A} r_i \ne \sum _{j\in B} s_j \end{aligned}$$
(16)

which guarantees that every primal basic solution is non-degenerate. Notably, (15) is satisfied for OT on the real line with cost \(c(x,y)=\vert x-y\vert ^p\) and measures with at least \(N=3\) support points if and only if \(p>0\) and \(p\ne 1\). If the underlying space involves too many symmetries, such as a regular grid with cost defined by the underlying grid structure, (15) typically fails to hold. An alternative condition ensuring (\(\mathbf{A2} \)) is the strict Monge condition, namely that the cost c satisfies

$$\begin{aligned} c_{ij}+c_{i^{'}j^{'}}<c_{ij^{'}}+c_{i^{'}j}\,, \quad \forall \, i<i^{'}, j<j^{'}, \end{aligned}$$
(17)

possibly after relabelling the indices (Dubuc et al., 1999); a direct numerical check is sketched below. On the real line, this condition translates into easily interpretable statements.
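The strict Monge condition (17) is easy to verify numerically; a sketch (NumPy; the support points are an illustrative choice), contrasting the strictly convex case p=2 with the borderline case p=1:

```python
# Sketch: check the strict Monge condition (17) for a cost matrix on an
# ordered support x_1 < ... < x_N.
import numpy as np

def is_strict_monge(C):
    N = C.shape[0]
    return all(C[i, j] + C[k, l] < C[i, l] + C[k, j]
               for i in range(N) for k in range(i + 1, N)
               for j in range(N) for l in range(j + 1, N))

x = np.array([0.0, 1.0, 2.5])
print(is_strict_monge(np.abs(np.subtract.outer(x, x)) ** 2))  # True: p = 2
print(is_strict_monge(np.abs(np.subtract.outer(x, x))))       # False: p = 1
```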

Lemma 6.5

Let \(\mathcal {X}:=\{x_1<\ldots < x_N\}\) be a set of N distinct ordered points on the real line. Suppose that the cost takes the form \(c(x,y)=f(\vert x-y\vert )\) with \(f:\mathbb {R}_+\rightarrow \mathbb {R}_+\) such that \(f(0)=0\) and either

$$\begin{aligned} (i)\, f \text { is strictly convex,}\quad (ii)\, f \text { is strictly concave.} \end{aligned}$$

Then assumption (\(\mathbf{A2} \)) holds.

The first statement follows by employing the Monge condition (see also McCann, 1999, Proposition A2 for an alternative approach). The second case is more delicate, and indeed the description of the unique optimal solution is more involved (see Appendix 1). In fact, in both cases the unique transport coupling can be computed by the Northwest corner algorithm (Hoffman, 1963); a sketch follows below. Typical costs covered by Lemma 6.5 are \(c(x,y)=\vert x-y \vert ^p\) for any \(p> 0\) with \(p\ne 1\). Indeed, for \(p=0\) or \(p=1\), uniqueness often fails (see Remark 6.7).
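For illustration, a minimal sketch of the Northwest corner rule (NumPy; the input measures are hypothetical): starting from the top-left cell, it greedily exhausts row and column masses and produces the monotone coupling, which is the optimal one in case (i) of Lemma 6.5:

```python
# Sketch of the Northwest corner rule: scan the transport matrix from its
# top-left corner and always move as much mass as possible.
import numpy as np

def northwest_corner(r, s):
    r, s = r.astype(float), s.astype(float)   # work on copies
    pi = np.zeros((len(r), len(s)))
    i = j = 0
    while i < len(r) and j < len(s):
        mass = min(r[i], s[j])
        pi[i, j] = mass
        r[i] -= mass
        s[j] -= mass
        if r[i] <= s[j]:   # row i exhausted: move down
            i += 1
        else:              # column j exhausted: move right
            j += 1
    return pi

print(northwest_corner(np.array([0.5, 0.2, 0.3]), np.array([0.2, 0.4, 0.4])))
```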

In a general linear program (\(\text {P}_{b}\)), the set of costs c for which (\(\mathbf{A2} \)) fails to hold has Lebesgue measure zero (e.g., Bertsimas & Tsitsiklis, 1997). Here we provide a result in the same flavour for OT.

Proposition 6.6

Let \(\mu \) and \(\nu \) be absolutely continuous on \(\mathbb {R}^D\), with \(D\ge 2\), and let \(c(x,y)=\Vert x-y\Vert ^p_q\), where \(p\in \mathbb {R}\setminus \{0\}\) and \(q\in (0,\infty ]\) are such that if \(p=1\) then \(q\notin \{1,\infty \}\). For probability vectors \(r,s\in \Delta _N\) define the probability measures \(r(\mathbf {X})=\sum _{k=1}^N r_k\delta _{X_k}\) and \(s(\mathbf {Y})=\sum _{k=1}^N s_k\delta _{Y_k}\) with two independent collections of i.i.d. \(\mathbb {R}^D\)-valued random variables \(X_1,\ldots ,X_N\sim \mu \) and \(Y_1,\ldots ,Y_N\sim \nu \). Then (15) holds almost surely for the optimal transport (\(\text {OT}\)). In particular, with probability one for any \(r,s\in \Delta _N\) and pair of marginals \(r(\mathbf {X})\) and \(s(\mathbf {Y})\), the corresponding optimal transport coupling is unique.

See Wang et al. (2013) for a related result for \(p=q=2\) and fixed marginals r, s. Note that Proposition 6.6 includes the Coulomb case \((p=-1)\), which has applications in physics (Cotar et al., 2013). As detailed in the proof, the result remains valid for piece-wise analytic (non-constant) functions.

Remark 6.7

(Non-uniqueness) Let \(\mu \) be uniform on \([0,1]^D\) and \(\nu \) be uniform on \([1,2]^D+(2,0,0,\dots ,0)\). Then, with probability one, all transport couplings bear the same cost if \(p=0\) or if \(p=1\) and \(q\in \{1,\infty \}\). Thus, for \(N\ge 2\) uniqueness fails.