Limit laws for empirical optimal solutions in random linear programs

We consider a general linear program in standard form whose right-hand side constraint vector is subject to random perturbations. For the corresponding random linear program, we characterize under general assumptions the random fluctuations of the empirical optimal solutions around their population quantities after standardization by a distributional limit theorem. Our approach is geometric in nature and further relies on duality and the collection of dual feasible basic solutions. The limiting random variables are driven by the amount of degeneracy inherent in linear programming. In particular, if the corresponding dual linear program is degenerate the asymptotic limit law might not be unique and is determined from the way the empirical optimal solution is chosen. Furthermore, we include consistency and convergence rates of the Hausdorff distance between the empirical and the true optimality sets as well as a limit law for the empirical optimal value involving the set of all dual optimal basic solutions. Our analysis is motivated from statistical optimal transport that is of particular interest here and distributional limit laws for empirical optimal transport plans follow by a simple application of our general theory. The corresponding limit distribution is usually non-Gaussian which stands in strong contrast to recent finding for empirical entropy regularized optimal transport solutions.


Introduction
Linear programs arise naturally in many applications and have become ubiquitous in topics such as operations research, control theory, economics, physics, mathematics and statistics (see the textbooks by Dantzig, 1963;Bertsimas & Tsitsiklis, 1997;Luenberger & Ye, 2008;Galichon, 2018 and the references therein). Their solid mathematical foundation dates back to the mid-twentieth century, to mention the seminal works of Kantorovich (1939), Hitchcock (1941), Dantzig (1948) and Koopmans (1949), and its algorithmic computation is an active topic of research until today 1 . A linear program in standard form writes with (A, b, c) ∈ R m×d ×R m ×R d and matrix A of full rank m ≤ d. For the purpose of the paper, the lower subscript b in (P b ) emphasizes the dependence on the vector b. Associated with the primal program (P b ) is its corresponding dual program At the heart of linear programming and fundamental to our work is the observation that if the primal program (P b ) attains a finite value, the optimum is attained at one of a finite set of candidates termed basic solutions. Each basic solution (possibly infeasible) is identified by a basis I ⊂ {1, . . . , d} indexing m linearly independent columns of the constraint matrix A. The bases I also defines a basic solution for the dual (D b ). In fact, the simplex algorithm (Dantzig, 1948) is specifically designed to move from one primal feasible basic solution to another while checking if the corresponding basis induces a dual feasible basic solution.
Shortly after first algorithmic approaches and theoretical results became available, the need to incorporate uncertainty in the parameters has become apparent (see Dantzig, 1955;Beale, 1955;Ferguson and Dantzig, 1956 for early contributions). In fact, apart from its relevance in numerical stability issues, in many applications the parameters reflect practical needs (budget, prices or capacities) but are not available exactly. This has opened a wealth of approaches to account for randomness in linear programs. Common to all formulations is their general assumption that some parameters in (P b ) are random and follow a known probability distribution. Important contributions in this regard are chance constrained linear programs, two-and multiple-stage programming as well as the theory of stochastic linear programs (see Shapiro et al., 2021 for a general overview). Specifically relevant to this paper is the so-called distribution problem characterizing the distribution of the random variable ν(X ), where the right-hand side b (and possibly A and c) in (P b ) is replaced by a random variable X following a specific law (Tintner, 1960;Prékopa, 1966;Wets, 1980).
In this paper, we take a related route and focus on statistical aspects of the standard linear program (P b ) if the right-hand side b is replaced by a consistent estimator b n indexed by n ∈ N, e.g., based on n observations. Different to aforementioned attempts, we only assume the random quantity r n (b n − b) to converge weakly (denoted by D − →) to some limit law G as n tends to infinity 2 . Our main goal is to characterize the asymptotic distributional limit of the empirical optimal solution x (b n ) ∈ arg min Ax=b n , x≥0 c T x (1) around its population quantities after proper standardization. For the sake of exposition, suppose that x (b) in (1) is unique 3 . The main results in Theorem 3.1 and Theorem 3.3 state that under suitable assumptions on (P b ) it holds, as n tends to infinity, that where M : R m → R d is given in Theorem 3.1. The function M in (2) is possibly random, and its explicit form is driven by the amount of degeneracy present in the primal and dual optimal solutions. The simplest case occurs if x (b) is non-degenerate. The function M is then a linear transformation depending on the corresponding unique optimal basis, so that the limit law M(G) is Gaussian if G is Gaussian. If x (b) is degenerate but all dual optimal (basic) solutions for (D b ) are non-degenerate, then M is a sum of deterministic linear transformations defined on closed and convex cones indexed by the collection of dual optimal bases. Specifically, the number of summands in M is equal to the number of dual optimal basic solutions for (D b ). A more complicated situation arises if both x (b) and some dual optimal basic solutions are degenerate. In this case, the function M is still a sum of linear transformations defined on closed and convex cones, but these transformations are potentially random and indexed by certain subsets of the set of optimal bases. The latter setting reflects the complex geometric and combinatorial nature in linear programs under degeneracy.
Let us mention at once that limiting distributions for the empirical optimal solution in the form of (2) have been studied for a long time in a more general setting of (potentially) nonlinear optimisation problems; see for example Dupačová (1987), Dupačová & Wets (1988), Shapiro (1991), Shapiro (1993), Shapiro (2000), King & Rockafellar (1993). Regularity assumptions such as strong convexity of the objective function near the (unique) optimizer allow for either explicit asymptotic expansions of optimal values and optimal solutions or applications of implicit function theorems and generalizations thereof. These conditions usually do not hold for the linear programs considered in this paper.
To the best of our knowledge, our results are the first that cover limit laws for empirical optimal solutions to standard linear programs even beyond the non-degenerate case and without assuming uniqueness of optimizers. However, our proof technique relies on wellknown concepts from parametric optimization and sensitivity analysis for linear programs (Guddat et al., 1974;Greenberg, 1986;Ward & Wendell, 1990;Hadigheh & Terlaky, 2006). Indeed, our approach is based on a careful study of the collection of dual optimal bases. An early contribution in this regard is the basis decomposition theorem by Walkup & Wets, (1969a) analyzing the behavior of ν(b) in (P b ) as a function of b (see also Remark 3.2). Each dual feasible basis defines a so called decision region over which the optimal value ν(b) is linear. The integration over the collection of all these regions yields closed form expressions for the distribution problem (Bereanu, 1963;Ewbank et al., 1974). Further, related stability results are also found in the work by Walkup & Wets (1967), Böhm (1975), Bereanu (1976) and Robinson (1977). In algebraic geometry, decision regions are closely related to cone-triangulations of the primal feasible optimization region (Sturmfels & Thomas, 1997;De Loera et al., 2010). We emphasize that rather than working with decision regions directly, our analysis is tailored to cones of feasible perturbations. In particular, we are interested in regions capturing feasible directions as our problem settings is based on the random . These regions turn out to be closed, convex cones and appear as indicator functions in the (random) function M in (2).
Our proof technique allows to recover some related known results for random linear programs (see Sect. 3). These include convergence of the optimality sets in Hausdorff distance (Proposition 3.7), and a limit law for the optimal value as n tends to infinity. Indeed, (3) is a simple consequence of general results in constrained optimization (Shapiro, 2000;Bonnans & Shapiro, 2000), and the optimality set convergence follows from Walkup & Wets (1969b).
Our statistical analysis for random linear programs in standard form is motivated by recent findings in statistical optimal transport (OT). More precisely, while there exists a thorough theory for limit laws on empirical OT costs on discrete spaces (Sommerfeld & Munk, 2018;Tameling et al., 2019), related statements for their empirical OT solutions remain open. An exception is Klatt et al. (2020), who provide limit laws for empirical (entropy) regularized OT solutions, thus modifying the underlying linear program to be strictly convex, non-linear and most importantly non-degenerate in the sense that every regularized OT solution is strictly positive in each coordinate. Hence, an implicit function theorem approach in conjunction with a delta method allows to conclude for Gaussian limits in this case. This stands in stark contrast to the non-regularized OT considered in this paper, where the degenerate case is generic rather than the exception for most practical situations. Only if the OT solution is unique and non-degenerate, then we observe a Gaussian fluctuation on the support set, i.e., on all entries with positive values. If the OT solution is degenerate (or not unique), then the asymptotic limit law (2) is usually not Gaussian anymore. Degeneracy in OT easily occurs as soon as certain subsets of demand and supply sum up to the same quantity. In particular, we encounter the largest degree of degeneracy if individual demand is equal to individual supply. Additionally, we obtain necessary and sufficient conditions on the cost function in order for the dual OT to be non-degenerate. These may be of independent interest, and allow to prove almost sure uniqueness results for quite general cost functions.
Our distributional results can be viewed as a basis for uncertainty quantification and other statistical inference procedures concerning solutions to linear programs. For brevity, we mention such applications in passing and do not elaborate further on them, leaving a detailed study of statistical consequences such as testing or confidence statements as an important avenue for further research.
The outline of the paper is as follows. We recap basics for linear programming in Sect. 2 also introducing deterministic and stochastic assumptions for our general theory. Our main results are summarized in Sect. 3, followed by their proofs in Sect. 4. The assumptions are discussed in more detail in Sect. 5. Section 6 focuses on OT and gives limit laws for empirical OT solutions.

Preliminaries and assumptions
This section introduces notation and assumptions required to state the main results of the paper. Along the way, we recall basic facts of linear programming and refer to Bertsimas & Tsitsiklis, (1997) and Luenberger & Ye, (2008) for details.
Linear Programs and Duality. Let the columns of a matrix A ∈ R m×d be enumerated by the set [d] := {1, . . . , d}. Consider for a subset I ⊆ [d] the sub-matrix A I ∈ R m×|I | formed by the corresponding columns indexed by I . Similarly, x I ∈ R |I | denotes the coordinates of x ∈ R d corresponding to I . By full rank of A in (D b ), there always exists an index set I with cardinality m such that A I ∈ R m×m is one-to-one. An index set I with that property is termed basis and induces a primal and dual basic solution respectively. Herein, and in order to match dimensions (a solution for (P b ) has dimension d instead of m ≤ d) the linear operator Aug I : R m → R d augments zeroes in the coordinates that are not in I . If λ(I ) (resp. x(I , b)) is feasible for (D b ) (resp. (P b )) then it constitutes a dual (resp. primal) feasible basic solution with dual (resp. primal) feasible basis I . Moreover, λ(I ) (resp. x(I , b)) is termed dual (resp. primal) optimal basic solution if it is feasible and optimal for (D b ) (resp. (P b )). Indeed, as long as (D b ) admits a feasible (optimal) solution then there exists a dual feasible (optimal) basic solution and vice versa for (P b ). At the heart of linear programming is the strong duality statement. and λ(I ) are primal and dual optimal basic solutions, respectively.
We introduce the feasibility and optimality set for the primal (P b ) by respectively. Notably, in our theory to follow A and c are generally assumed to be fixed and only the dependence of these sets with respect to parameter b is emphasized. We introduce our first assumption: The set P (b) is non-empty and bounded.
In view of the strong duality statement in Fact 2.1, solving a linear program might be carried out focusing on the collection of all dual feasible bases. We partition this collection into two subsets depending on their feasibility for the primal program.
Remark 2.2 (Splitting of the Bases Collection) Let I 1 , . . . , I N enumerate all dual feasible bases, and let 1 ≤ K ≤ N be such that Notably, by Fact 2.1 the primal basic solution x(I k , b) is optimal for all k ≤ K . Recall that the convex hull C (x 1 , . . . , x K ) of a collection of points {x 1 , . . . , x K } ⊂ R d is the set of all possible convex combinations of them.

Fact 2.3
Consider the primal linear program (P b ) and assume (A1) holds. Then for any right hand sideb ∈ R m either one of the following statements is correct.
(i) The feasible set P b is empty.
(ii) The optimality set P b is non-empty, bounded and equal to the convex hull The restriction of the convex hull to basic solutions induced by primal and dual optimal bases in Fact 2.3 is well-known. A straightforward argument is based on the simplex method that if set up with appropriate pivoting rules always terminates. If (A1) holds and there exists a unique basis (K = 1) then the primal program attains a unique solution. Uniqueness of solutions to linear programs is related to degeneracy of corresponding dual solutions.  (i) If (P b ) (resp. (D b )) has a non-degenerate optimal basic solution, then (D b ) (resp. (P b )) has a unique solution. (ii) If (P b ) (resp. (D b )) has a unique non-degenerate optimal basic solution, then (D b ) (resp. (P b )) has a unique non-degenerate optimal solution. (iii) If (P b ) (resp. (D b )) has a unique degenerate optimal basic solution, then (D b ) (resp. (P b )) has multiple solutions.
For a proof of Fact 2.4, we refer to Gal & Greenberg (2012)[Lemma 6.2] in combination with strict complementary slackness from Goldman & Tucker (1956)[Corollary 2A] stating that for feasible primal and dual linear program there exists a pair (x, λ) of primal and dual optimal solution such that either x j > 0 or λ T A j < c j for all 1 ≤ j ≤ d. In addition to uniqueness statements, many results in linear programming simplify when degeneracy is excluded. Related to degeneracy but slightly weaker is the assumption Indeed, if P (b) is non-empty and bounded assumption (A2) characterizes non-degeneracy of all dual basic solution.

Lemma 2.5 Suppose assumption (A1) holds. Then assumption (A2) is equivalent to nondegeneracy of all dual optimal basic solutions.
To see that (A1) is necessary, let D m ∈ R m×m with m ≥ 2 be the identity matrix. Suppose Then there are K = 2 m−1 optimal bases defining K distinct (degenerate) dual solutions, so that assumption (A2) holds but dual degeneracy fails. Note that P (b) is unbounded and contains the optimal ray (b T , b T ).
Random Linear Programs. Introducing randomness in problems (P b ) and (D b ), we suppose to have incomplete knowledge of b ∈ R m , and replace it by a (consistent) estimator b n , e.g., based on a sample of size n independently drawn from a distribution with mean b. This defines empirical primal and dual counterparts (P b ) and (D b ), respectively. We allow the more general case that only the first m 0 ∈ {0, . . . , m} coordinates of b are unknown 4 and assume the existence of a sequence of random vectors n → 0 as n tends to infinity where D − → denotes convergence is distribution. In a typical central limit theorem type scenario, r n = √ n and G m 0 is a centred Gaussian random vector in R m 0 , assumed to have a non-singular covariance matrix. Assumption (B1) implies that b n → b in probability. In order to avoid pathological cases, we impose the last assumption that asymptotically an optimal Further discussions on the assumptions are deferred to Sect. 5.

Main results
According to Fact 2.3, in presence of (A1) any optimal solution where K is a non-empty subset of [N ] := {1, . . . , N } and α K n ∈ R N is a random vector in the (essentially |K|-dimensional) unit simplex K := α ∈ R N + | α 1 = 1, α k = 0 ∀k / ∈ K . The main result of the paper states the following asymptotic behaviour for the empirical optimal solution.
Theorem 3.1 Suppose assumptions (A1), (B1), and (B2) hold, and let x (b n ) ∈ P (b n ) be any (measurable) choice of an optimal solution. Further, assume that for all non-empty K ⊆ [K ], the random vectors α K n , G n converge jointly in distribution as n tends to infinity to (α K , G) on |K| × R m . Then there exist closed convex cones H m 0 1 , . . . , where the sum runs over non-empty subsets K of [K ] and H m 0 Remark 3.2 Underlying Theorem 3.1 is the well-known approach of partitioning R m into (closed convex) cones. Indeed, the union of the closed convex cones is the feasibility set A + := {Ax : x ≥ 0} ⊆ R m and on each cone the optimal solution is an affine function of b (e.g., Walkup & Wets, 1969a;Guddat et al., 1974, ). The cones H k depend only on A and c. In contrast, our cones H m 0 k also depend on b and define directions of perturbations of b that keep λ(I k ) optimal for the perturbed problem for a given k ≤ K . Assume for simplicity that m 0 = m and write H k instead of H m 0 k . If b = 0, then K = N and cones coincide H k = H k , but otherwise H k is a strict super-set of H k as the corresponding representation (7) of H k requires non-negativity on all coordinates. This is also in line with the observation that there are fewer (K ) cones H k than there are H k , namely N , and the union of the H k 's is a space that is at least as large as As an extreme example, suppose that (P b ) has a unique non-degenerate optimal solution x(I 1 , b). Then K = 1 and H 1 = R m but the H k 's are strict subsets of R m unless N = 1.
In Sect. 5, we discuss sufficient conditions for the joint distributional convergence of the random vector α K n , G n . In short, if we use any linear program solver, such joint distributional convergence appears to be reasonable. If the optimal basis is unique (K = 1) with is non-degenerate, and the proof shows that H m 0 k = R m 0 . The distributional limit theorem then takes the simple form In general, when K > 1, the number of summands in the limiting random variable in Theorem 3.1 might grow exponentially in K . In between these two cases is the situation that assumption (A2) holds, which implies all dual optimal basic solutions for (D b ) are nondegenerate (see Lemma 2.5). The limiting random variable then simplifies, as the subsets K must be singletons.

Theorem 3.3 Suppose assumptions (A1), (A2), and (B2) hold, and that r n
with the closed and convex cones H m 0 k as given in Theorem 3.1. Remark 3.4 With respect to Theorem 3.1, assumption (B1) is weakened in Theorem 3.3 as absolute continuity of G (or G m 0 ) is not required. Indeed, it can be arbitrary, and Theorem 3.3 thus accommodates, e.g., Poisson limit distributions. The proof shows that if G is absolutely continuous (i.e., m 0 = m) then the indicator functions of G ∈ H m k \ ∪ j<k H m j simplify to G ∈ H m k , because intersections H m k ∩ H m j have Lebesgue measure zero. The distributional limit theorem then reads as If the optimal solution of the limiting problem is unique, Theorem 3.1 can be formulated in a set-wise sense. The Hausdorff distance between two closed nonempty sets A, B ⊆ R d is The collection of closed subsets of R d equipped with d H is a metric space (with possibly infinite distance) and convergence in distribution is defined as usual by integrals of continuous real-valued bounded functions; see for example King (1989), where the delta method is developed in this context. Recall that C stands for convex hull.

Theorem 3.5 Suppose assumptions (A1), (B1), and (B2) hold, and that
where H m 0 k and H m 0 K are as defined in Theorem 3.1. We conclude this section by giving two further consequences of our proof techniques: a limit law for the objective value ν(b) for (P b ), and convergence in probability of optimality sets. Since the former is well-known and holds in more general, infinite-dimensional convex programs, we omit the proof details and instead refer to Shapiro (2000), Bonnans & Shapiro (2000) and results by Sommerfeld & Munk (2018) Another consequence of our bases driven approach underlying the proof of Theorem 3.1 is that the convergence of the Hausdorff distance A different and considerably shorter argument relies on Walkup & Wets (1969b) and proves the following result.

Proposition 3.7 Suppose assumptions (A1) and (B2) hold. If b n
We also refer to the work by Robinson (1977) for a similar result when the primal and dual optimality sets are both bounded.

Proofs for the main results
To simplify the notation, we assume that all random vectors in the paper are defined on a common generic probability space ( , F , P). This is no loss of generality by the Skorokhod representation theorem.
Preliminary steps. Recall from Remark 2.2 that bases I 1 , . . . , I K are feasible for (P b ) and (D b ) and hence optimal. The bases I K +1 , . . . , I N are only feasible for (D b ) but not for (P b ). For a set K ⊆ [N ] define the events, i.e., subsets of the underlying probability space By strong duality (Fact 2.1 (ii)), the set A K n is the event that the bases indexed by K are precisely those that are optimal for (P b ) and (D b ). We have A K n ⊆ B K n , and B K n ⊆ B {k} n for all k ∈ K. We start with two important observations, the first stating that only subsets of [K ] asymptotically matter.
(ii) If assumptions (A1) and (B2) hold, then with high probability P * (b n ) is bounded and non-empty.
The same inequality holds for b n if sufficiently close to b, which happens with high probability. For (ii), non-emptiness with high probability follows from assumption (B2), so we only prove boundedness. Indeed, assumption (A1) implies that the recession cone {x ≥ 0 | Ax = 0, c T x = 0} is trivial and equals {0}. This property does not depend on b n , which yields the result.
The event A ∅ n is equivalent to (P b ) being either infeasible or unbounded, and this has probability o(1) by (B2). Combining this with the previous lemma and the sets (A K n ) K forming a partition of the probability space , we deduce where 1 A (ω) denotes the usual indicator function of the set A. Defining the random vector We next investigate the indicator functions 1 A K n (ω) appearing in (6). Omitting the dependence of b n on ω, we rewrite At the last internal intersection in the above display we can, with high probability, restrict to those i in the primal degeneracy set DP k := {i ∈ I k | x i (I k , b) = 0}. Indeed, for i / ∈ I k , the inequality reads 0 ≥ 0, whereas for i ∈ I k \ DP k the right-hand side goes to −∞ and the lefthand side is bounded in probability. In other words P( For where the union over k > K can be neglected by Lemma 4.1. Thus we conclude that With these preliminary statements at our disposal, we are ready to prove the main results. Proof (Theorem 3.1) The goal is to replace G m 0 n by G m 0 in the indicator function in (8) at the limit as n tends to infinity. By the Portmanteau theorem ( Billingsley, 1999, Theorem 2.1) and elementary arguments 6 it suffices to show that the m 0 -dimensional boundary of each H m 0 k has Lebesgue measure zero. This is indeed the case, as they are convex sets. Define the function T K :

This function is continuous for all
In particular, the continuity set is of full measure with respect to (α K , G). As there are finitely many possible subsets K denoted by K 1 , . . . , K B , the function is continuous G-almost surely. The continuous mapping theorem together with the assumed joint distributional convergence of the random vector (α K n , G n ) yield that which completes the proof of Theorem 3.1.
Proof (Theorem 3.3) With high probability (A1) and (A2) hold for b n (by Lemma 4.1 for the former and trivially for the latter), which implies that P (b n ) is a singleton (Lemma 2.5 and Fact 2.4). Hence, regardless of the choice of α K n , it holds that 1 A K n x (b n ) = x(I min K , b n ). In particular, we may assume without loss of generality that α K n are deterministic and do not depend on n. Thus the joint convergence in Theorem 3.1 holds, and (6) simplifies to Since λ(I k ) and λ(I j ) are dual feasible, they must be optimal with respect to b + ηv. Thus it holds By (A2) the vector λ(I k ) − λ(I j ) is nonzero and hence v is contained in its orthogonal complement, which indeed has Lebesgue measure zero. C(x(I K ), v) is Lipschitz since without loss of generality K = ∅ and It follows that is a measurable random subset of R d . According to Fact 2.3 in presence of (A1) and the preceding computations by the continuous mapping theorem.
Proof (Proposition 3.7) Let K = R d + and define the linear map τ : R d → R m+1 by τ (x) = (Ax, c t x). For each b such that the linear program is feasible, let v b ∈ R be the optimal objective value. If τ is injective, then the optimality sets are singletons and the result holds trivially. We thus assume that τ is not injective, and observe that Since K is a polyhedron and τ is neither identically zero (A has full rank) nor injective, we can apply the main theorem of Walkup & Wets (1969b). We obtain because the optimal values satisfy v b − v b n = O P (r −1 n ) by Proposition 3.6.

(iii) if the dual set N is non-empty and bounded then
The following discussion on the assumptions is a consequence of Lemma 5.1. We first collect sufficient conditions for assumption (A1).

Corollary 5.2 (Sufficiency for (A1)) The following statements hold.
(i) If N is non-empty and P(b) is bounded for some b ∈ M then assumption (A1) holds for all b ∈ M. (ii) If N is non-empty, bounded and P (b) is bounded for some b ∈ R m then assumption (A1) holds for all b ∈ R m .
Certainly, if P (b) = ∅ then (A1) is equivalent to P (b) being bounded. The latter property is independent on b and equivalent to the set x ∈ R d | Ax = 0, x ≥ 0, c t x = 0 being empty. A sufficient condition for that is boundedness of P(b) that can be easily checked in certain settings.

Lemma 5.3 (Sufficiency for P(b) bounded) Suppose that A has non-negative entries and no column of A equals 0 ∈ R m . Then P(b) is bounded (possibly infeasible) for all b ∈ R m .
It is noteworthy that if the dual feasible set N is non-empty and bounded, then P (b) = ∅ for all b ∈ R m , but P(b) is necessarily unbounded (Clark, 1961). Thus, N is unbounded under the conditions of Lemma 5.3. We emphasize that assumption (A2) is neither easy to verify nor expected to hold for most structured linear programs. Indeed, under (A1) assumption (A2) is equivalent to all dual basic solutions being non-degenerate (Lemma 2.5). However, degeneracy in linear programs is often the case rather the exception (Bertsimas & Tsitsiklis, 1997). Notably, if (A2) and (A1) are satisfied the set P (b) is singleton.
The assumption (B1) has to be checked for each particular case and can usually be verified by an application of the central limit theorem (for a particular example see Sect. 6). Assumption (B2) is obviously necessary for the limiting distribution to exist. If the dual feasible set N is non-empty and bounded and (B1) holds then (B2) is always satisfied. A more refined statement is the following.

In particular, if the dual feasible set N is non-empty and (B1) holds then both conditions (i) and (ii) are sufficient for (B2).
Joint convergence. Our goal here is to state useful conditions such that the random vector α K n , G n jointly converges 9 in distribution to some limit random variable α K , G on the space |K| × R m . By assumption (B1), G n → G in distribution, and a necessary condition for the joint distributional convergence of (α K n , G n ) is that α K n has a distributional limit α K . There is no reason to expect α K n and G n to be independent, as discussed at the end of this section. We give a weaker condition than independence that is formulated in terms of the 8 The feasible set P(b 0 ) contains a positive element x ∈ (0, ∞) d . 9 Recall that the α K n represent random weights (summing up to one) for each optimal basis I k , k ∈ K for the case that A K n occurs, i.e., that several bases yield primal optimal solutions and hence any convex combination is also optimal. conditional distribution of α K n given G n (or, equivalently, given b n = b + G n /r n ). These conditions are natural in the sense that if b n = g, then the choice of solution x (g), as encapsulated by the α K n 's, is determined by the specific linear program solver in use. Treating conditional distributions rigorously requires some care and machinery. Let Z = Z K = |K| × R m and for ϕ : We say that ϕ is bounded Lipschitz if it belongs to BL(Z) = {ϕ : Z → R | ϕ BL ≤ 1}. The bounded Lipschitz metric is well-known to metrize convergence in distribution of (probability) measures on Z Dudley (1966) [Theorems 6 and 8]. According to the disintegration theorem (see Kallenberg, 1997[Theorem 5.4], Dudley, 2002[Section 10.2] or Chang & Pollard, 1997 for details), we may write the joint distribution of α K n , b n as an integral of conditional distributions μ K n,g that represent the distribution of α K n given that b n = g. More precisely, g → μ K n,g is measurable from R m to the metric space of probability measures on |K| with the bounded Lipschitz metric, so that for any ϕ ∈ BL(Z) it holds that where ψ n : R m → R is a measurable function. The joint distribution of (α K n , G n ) is determined by the collection of expectations Our sufficient condition for joint convergence is given by the following lemma. It is noteworthy that the spaces R m and |K| can be replaced with arbitrary Polish spaces, and even more general spaces, as long as the disintegration theorem is valid.
Lemma 5.5 Let {μ K g } g∈R m be a collection of probability measures on |K| such that the map g → μ K g is continuous at G-almost any g, and suppose that μ K n,g → μ K g uniformly with respect to the bounded Lipschitz metric BL. Then (α K n , G n ) converges in distribution to a random vector (α K , G) satisfying for any continuous bounded function ϕ ∈ BL(Z) (this determines the distribution of the random vector (α K , G) completely). Moreover, if L denotes the distribution of a random vector, then the rate of convergence can be quantified as where L := sup g 1 =g 2 BL(μ K g 1 , μ K g 2 )/ (g 1 − g 2 ) ∈ [0, ∞]. The supremum with respect to g can be replaced by an essential supremum.
The conditions of Lemma 5.5 (and hence the joint convergence in Theorem 3.1) will be satisfied in many practical situations. For example, given b n and an initial basis for the simplex method, its output is determined by the pivoting rule (for a general overview see Terlaky & Zhang, 1993 and references therein). Deterministic pivoting rules lead to degenerate conditional distributions of α K n given b n = g, whereas random pivoting rules may lead to non-degenerate conditional distributions. In both cases these conditional distributions do not depend on n at all, but only on the input vector g. In particular, the uniform convergence in Lemma 5.5 is trivially fulfilled (the supremum is equal to zero). It is reasonable to assume that these conditional distributions depend continuously on g except for some boundary values that are contained in a lower-dimensional space (which will have measure zero under the absolutely continuous random vector G).
On a finite space X = {x 1 , . . . , x N } equipped with some underlying cost c : X × X → R OT between two probability measures r , s ∈ N := {r ∈ R N | 1 T N r = 1, r i ≥ 0} is equal to the linear program where c i j = c(x i , x j ) and the set (r , s) denotes all non-negative matrices with row and column sum equal to r and s, respectively. OT comprises the challenge to find an optimal solution termed OT coupling π (r , s) between r and s such that the integrated cost is minimal among all possible couplings. We denote by (r , s) the set of all OT couplings. The dual problem is In our context reflecting many practical situations (Tameling et al., 2021), the measures r and s are unknown and need to be estimated from data. To this end, we assume to have access to independent and identically distributed (i.i.d.) X -valued random variables X 1 , . . . , X n ∼ r , where a reasonable proxy for the measure r is its empirical versionr n := 1 n n i=1 δ X i . As an illustration of our general theory, we focus on limit theorems that asymptotically (n → ∞) characterize the fluctuations of an estimated coupling π (r n , s) around π (r , s). For the sake of readability, we focus primarily on the one sample case, where only r is replaced byr n but include a short account on the case that both measures are estimated.
A few words regarding the assumptions from Sect. 2 in the OT context are in order. Assumption (A1) always holds, since (r , s) ⊆ [0, 1] N 2 is bounded and contains the independence coupling rs T . Assumption (A2) that according to Lemma 2.5 is equivalent to all dual feasible basic solutions for (DOT) being non-degenerate, however, does not always hold. Sufficient conditions for (A2) to hold in OT are given in Subsect. 6.1. Concerning the probabilistic assumptions, we notice that (B2) always holds as for any (possibly random) pair of measures (r n , s) the set (r n , s) is non-empty and bounded. Assumption (B1) is easily verified by an application of the multivariate central limit theorem. Indeed, the multinomial process of empirical frequencies √ n(r −r n ) converges weakly to a centered Gaussian random vector G(r ) ∼ N (0, (r )) with covariance Notably, (r ) is singular and G(r ) fails to be absolutely continuous with respect to Lebesgue measure. A slight modification allows to circumvent this issue. The constraint matrix in OT, has rank 2N − 1. Letting r † = r [N −1] ∈ R N −1 denote the first N − 1 coordinates of r ∈ R N and A † ∈ R (2N −1)×N 2 denote A with its N -th row removed, it holds that The limiting random variable for √ n(r † −r † n ), as n tends to infinity, is equal to G(r † ) following an absolutely continuous distribution if and only if r † > 0 and r † 1 < 1. Equivalently, r is in the relative interior of N (denoted ri( N )), i.e., 0 < r ∈ N . Under this condition (A1), (B1) and (B2) hold and from the main result in Theorem 3.1 we immediately deduce the limiting distribution of optimal OT couplings.

Remark 6.2
The two sample case presents an additional challenge. By the multivariate central limit theorem we have for min(m, n) → ∞ and m n+m → λ ∈ (0, 1) that nm n + m with G 1 (r † ) and G 2 (s) independent and the compound limit law following a centered Gaussian distribution with block diagonal covariance matrix, where the two blocks are given by (10), respectively. However, the limit law fails to be absolutely continuous. Nevertheless, the distributional limit theorem for OT couplings remains valid in this case and there exists a sequence π n,m (r , s) ∈ (r , s) such that nm n + m π (r n ,ŝ m ) − π n,m (r , s) We provide further details in Appendix 1.
We emphasize that once a limit law for the OT coupling is available, one can derive limit laws for sufficiently smooth functionals thereof. As examples let us mention the OT curve (Klatt et al., 2020) andOT geodesics (McCann, 1997). The details are omitted for brevity and instead, we provide an illustration of the distributional limit theorem (Theorem 6.1).

Example 6.3
We consider a ground space X = {x 1 < x 2 < x 3 } ⊂ R consisting of N = 3 points with cost c = (0, |x 1 − x 2 |, |x 1 − x 3 |, |x 2 − x 1 |, 0, |x 2 − x 3 |, |x 3 − x 1 |, |x 3 − x 2 |, 0) ∈ R 9 for which OT then reads as min π ∈R 9 c T π s.t. A † π = r † s , π ≥ 0 with constraint matrix A † ∈ R 5×9 . A basis I is a subset of cardinality five out of the column index set {1, . . . , 9} such that (A † ) I is of full rank. For OT it is convenient to think of a feasible solution in terms of a transport matrix π ∈ R 3×3 with π i j encoding mass transport from source i to destination j. For instance, the basis I = {1, 2, 3, 5, 9} corresponds to the transport scheme where each possible non-zero entry is marked by a star and specific values depend on the measures r and s. In particular, to basis I corresponds the (possibly infeasible) basic solution π(I , (r † , s)) = (A † ) −1 I (r † , s) that we illustrate in terms of its transport scheme by where r = (r † , 1 − r † 1 ) ∈ R 3 and the second equality employs that r and s sum up to one. Obviously, π(I , (r † , s)) is feasible if and only if s 2 ≥ r 2 and s 3 ≥ r 3 . Suppose that the Then all dual basic solutions are non-degenerate. In particular, (A2) holds and the optimal OT coupling is unique for any pair of measures r, s ∈ N .
We are unaware of an explicit reference for condition (15) that is reminiscent to the wellknown cyclic monotonicity property (Rüschendorf, 1996). Further, (15) can be thought of as dual to the condition of Klee & Witzgall (1968) that guarantees every primal basic solution to be non-degenerate. Notably, (15) is satisfied for OT on the real line with cost c(x, y) = |x − y| p and measures with at least N = 3 support points if and only if p > 0 and p = 1. If the underlying space involves too many symmetries, such as a regular grid with cost defined by the underlying grid structure, it typically fails to hold. An alternative condition that ensures (A2) is the strict Monge condition that the cost c satisfies possibly after relabelling the indices (Dubuc et al., 1999). This translates to easily interpretable statements on the real line. Then assumption (A2) holds.
The first statement follows by employing the Monge condition (see also McCann, 1999[Proposition A2] for an alternative approach). The second case is more delicate, and indeed, the description of the unique optimal solution is more complicated (see Appendix 1). In fact, in both cases the unique transport coupling can be computed by the Northwest corner algorithm (Hoffman, 1963). Typical costs covered by Lemma 6.5 are c(x, y) = |x − y| p for any p > 0 with p = 1. Indeed, for p = 0 or p = 1, uniqueness often fails (see Remark 6.7).
In a general linear program (P b ), the set of costs c for which (A2) fails to hold has Lebesgue measure zero (e.g., Bertsimas & Tsitsiklis, 1997, ). Here we provide a result in the same flavour for OT. Proposition 6.6 Let μ and ν be absolutely continuous on R D , with D ≥ 2, and let c(x, y) = x − y p q , where p ∈ R \ {0} and q ∈ (0, ∞] are such that if p = 1 then q / ∈ {1, ∞}. For probability vectors r , s ∈ N define the probability measures r (X) = N k=1 r k δ X k and s(Y) = N k=1 s k δ Y k with two independent collections of i.i.d. R D -valued random variables X 1 , . . . , X N ∼ μ and Y 1 , . . . , Y N ∼ ν. Then (15) holds almost surely for the optimal transport (OT). In particular, with probability one for any r, s ∈ N and pair of marginals r (X) and s(Y), the corresponding optimal transport coupling is unique.

A Omitted proofs
Proof (Lemma 2.5) Dual non-degeneracy obviously implies (A2), so we only show the converse (in presence of (A1)). Suppose that λ(I j ) is degenerate 1 ≤ j ≤ K . Then the index set L of active constraints in the dual, i.e., the set of indices such that [λ T (I j )A] l = c l , is such that I j ⊂ L. Let Pos j ⊆ I j be the set of positive entries of the optimal primal basic solution x(I j , b). Then Since the columns of A I j form a basis of R m , each other column a z writes a z = i∈Pos y z i a i + s∈I j \Pos y z s a s , z ∈ L \ I j .
Suppose there exists some index z ∈ L \ I j and s ∈ I j \ Pos such that y z s = 0. Then we can define a new basis I := I j \ {s} ∪ {z} such that λ( I ) = λ(I j ) and as Pos j ⊆ I we conclude that x( I , b) = x(I j , b). This contradicts (A2). Hence y z s = 0 for all s ∈ I j \ Pos j in (18). Now suppose that y z i > 0 for some i ∈ Pos j , so that i 0 ∈ arg min i|y z i >0 x i y z i is well defined and the minimum is strictly positive. Expressing a i 0 as a linear combination of A {z}∪Pos j \{i 0 } , we find that for some proper choice of x i . By definition of i 0 we find that x i are non-negative, so that I is a primal and dual optimal basis. Moreover, λ( I ) = λ(I j ) that again contradicts (A2).
We deduce for the representation in (18)  By definition w ≥ 0, Aw = 0. and c T w = 0, so that w = 0 is a primal optimal ray, in contradiction of (A1). In total we see that if any basis I j for 1 ≤ j ≤ K yields a degenerate dual basic solution we can modify basis I j to some I i with i = j and 1 ≤ i ≤ K such that λ(I j ) = λ(I j ).
It is in principle possible that λ(I l ) is dual optimal K + 1 ≤ l ≤ N but x(I l , b) is not primal optimal. Let us show that this cannot happen under assumption (A2). Consider any optimal primal basic solution x(I j , b) for 1 ≤ j ≤ K and denote by Pos j its positivity set. Optimality of λ(I l ) implies that its active set L contains I l ∪ Pos j . As I l is not a primal optimal basis, it holds that Pos j I l , so that |L| > m and λ(I l ) is degenerate. But then we can modify basis I l to some primal and dual optimal basis I i for 1 ≤ i ≤ K such that λ(I i ) = λ(I l ) is degenerate, in contradiction with (A2). Hence, any optimal dual basic solution is non-degenerate and induced by some primal and dual optimal basis I .
Proof (Lemma 5.5) We need to show that for any ϕ ∈ BL(Z) we have that vanishes as n → ∞. To bound the first term notice that for any fixed g it holds that Hence, we find E|ψ n (G n ) − ψ(G n )| ≤ sup g BL(μ K n,g , μ K g ) that tends to zero by assumption. Notice that the supremum can be an essential supremum, i.e., taken on set of full measure with respect to both (G n ) and (G) instead of the whole of R m . For the second term observe that ψ ∞ ≤ ϕ ∞ and that Hence, we conclude that Dividing ψ by its bounded Lipschitz norm, we find This completes the proof for the quantitative statement. Joint convergence still follows if g → μ K g is only continuous G-almost surely (but not Lipschitz). In fact, ψ is still continuous and bounded G-almost surely so that Eψ(G n ) → Eψ(G). Therefore, Eϕ(α K n , G n ) → Eϕ(α K , G) for all ϕ ∈ BL(Z), which implies that (α K n , G n ) → (α K , G) in distribution.

B Optimal transport
Proof (Theorem 6.1, two-sample) The only part where absolute continuity of G = (G 1 (r † ), G 2 (s)) was required is when showing that the boundaries of the cones defined in (7) have zero probability with respect to G. We shall show that this is still the case, despite the singularity of G 2 (s).
The cones under consideration take the form is viewed as an N 2 -dimensional vector), and their boundaries satisfy It suffices to show that for any pair of sets R ⊆ {1, . . . , N − 1} and S ⊂ {1, . . . , N } that are not both empty, the support could contain countably many intervals as long as there is "clear" starting point a 0 below; but M could be infinite.

Proof
There is nothing to prove if μ = ν = 0, so we assume μ = ν. It follows from the assumptions that there exists a finite sequence of M + 1 ≥ 3 real numbers −∞ ≤ a 0 < a 1 < a 2 < a 3 < · · · < a M ≤ ∞ We now claim that in any optimal coupling π between μ and ν, the μ-mass of [a 0 , a 1 ] must go to [a 1 , a ]. Indeed, suppose that a positive μ-mass from [a 0 , a 1 ] goes strictly beyond a . Then some mass from the support of μ but not in [a 0 , a 1 ] has to go to [a 1 , a ]. Such a coupling gives positive measure to the set for some > 0. Strict monotonicity of the cost function makes this sub-optimal, since this coupling entails sending mass from x 1 to y 1 and from x 2 to y 2 with x 1 < y 2 < min(x 2 , y 1 ), (see Gangbo & McCann, 1996[Theorem 2.3] for a rigorous proof). Hence the claim is proved. Let μ 1 be the restriction of μ to [a 0 , a 1 ] and ν 1 be the restriction of ν to [a 1 , a ] with mass m 0 , namely ν 1 (B) = ν(B) if B ⊆ [a 1 , a ), ν 1 ({a }) = m 0 − ν ([a 1 , a )) and ν(B) = 0 if B ∩ [a 1 , a ] = ∅. By definition of a , ν 1 is a measure (i.e., ν 1 ({a }) ≥ 0) and ν 1 and μ 1 have the same total mass m 0 . Each of these measures is supported on an interval and these intervals are (almost) disjoint. Strict concavity of the cost function entails that any optimal coupling between μ 1 and ν 1 must be non-increasing (in a set-valued sense). Since there is only one such coupling, the coupling is unique.
By the preceding paragraph and the above claim, we know that π must be non-increasing from [a 0 , a 1 ] to [a 1 , a ], which determines π uniquely on that part. After this transport is carried out, we are left with the measures ν − ν 1 and μ − μ 1 , where the latter is supported on one less interval, namely the interval [a 0 , a 1 ] disappears.
If instead μ 0 ([a 0 , a 1 ]) > ν([a 1 , a 2 ]), we can use the same construction with  1 , a 2 ]) then both the intervals [a 0 , a 1 ] and [a 1 , a 2 ] disappear when considering μ − μ 1 and ν − ν 1 . In all three cases we can continue inductively and construct π in a unique way. Since there are finitely many intervals, the procedure is guaranteed to terminate. Thus π is unique. on which f is defined can be partitioned into finitely many (less than 6 n D ) open connected components U 1 , . . . , U L according to the signs of x k − y k , e i and x k − y k−1 , e i . On each such component f |U i is analytic. It follows that P (X, Y) ∈ f −1 |U l ({0}) = 0 unless f |U l is identically zero Dang, (2015)[Lemma 1.2]. To exclude the latter possibility, consider for any point (x, y) ∈ U l and ∈ R the function f |U l ( ) = f |U l (x 1 + e i , x 2 , . . . , x n , y 1 , . . . , y n ) with derivative at = 0 given by where x i j denotes the jth entry of the ith vector. If this derivative is nonzero, then clearly f is not identically zero. If the derivative is zero then we shall show that there exists another point in U l for which this derivative is nonzero. Since U l is open, we can add δe j to y n for small δ and any 1 ≤ j ≤ D. If p = q then, taking j = i (which is possible because D ≥ 2) only modifies the term x 1 − y n in (20), and for small δ the derivative will not be zero. If p = q = 1 then the norms do not appear in (20) and taking j = i would yield a nonzero derivative. Hence, if p and q are not both equal to one, f is not identically zero on each piece U l , which is what we needed to prove. A similar idea works in case q = ∞ and p = 1.
The argument only depends on the positions of the random support points of the probability measures r = n k=1 r k δ X k and s = n k=1 s k δ Y k and hence is uniform in their probability weights. Recall further Proposition 2.4 that if the dual problem admits a non-degenerate optimal solution the primal optimal solution is unique. We conclude that almost surely the optimal transport coupling is unique.