Abstract
We consider a general linear program in standard form whose right-hand side constraint vector is subject to random perturbations. For the corresponding random linear program, we characterize under general assumptions the random fluctuations of the empirical optimal solutions around their population quantities after standardization by a distributional limit theorem. Our approach is geometric in nature and further relies on duality and the collection of dual feasible basic solutions. The limiting random variables are driven by the amount of degeneracy inherent in linear programming. In particular, if the corresponding dual linear program is degenerate the asymptotic limit law might not be unique and is determined by the way the empirical optimal solution is chosen. Furthermore, we include consistency and convergence rates of the Hausdorff distance between the empirical and the true optimality sets as well as a limit law for the empirical optimal value involving the set of all dual optimal basic solutions. Our analysis is motivated by statistical optimal transport, which is of particular interest here, and distributional limit laws for empirical optimal transport plans follow by a simple application of our general theory. The corresponding limit distribution is usually non-Gaussian, which stands in strong contrast to recent findings for empirical entropy-regularized optimal transport solutions.
1 Introduction
Linear programs arise naturally in many applications and have become ubiquitous in topics such as operations research, control theory, economics, physics, mathematics and statistics (see the textbooks by Dantzig, 1963; Bertsimas & Tsitsiklis, 1997; Luenberger & Ye, 2008; Galichon, 2018 and the references therein). Their solid mathematical foundation dates back to the mid-twentieth century, to mention the seminal works of Kantorovich (1939), Hitchcock (1941), Dantzig (1948) and Koopmans (1949), and their algorithmic computation remains an active topic of research to this day^{Footnote 1}. A linear program in standard form writes
with \((A,b,c)\in \mathbb {R}^{m\times d}\times \mathbb {R}^m\times \mathbb {R}^d\) and matrix A of full rank \(m\le d\). For the purpose of the paper, the lower subscript b in (\(\text {P}_{b}\)) emphasizes the dependence on the vector b. Associated with the primal program (\(\text {P}_{b}\)) is its corresponding dual program
At the heart of linear programming and fundamental to our work is the observation that if the primal program (\(\text {P}_{b}\)) attains a finite value, the optimum is attained at one of a finite set of candidates termed basic solutions. Each basic solution (possibly infeasible) is identified by a basis \(I\subset \{1,\ldots ,d\}\) indexing m linearly independent columns of the constraint matrix A. The basis I also defines a basic solution for the dual (\(\text {D}_{b}\)). In fact, the simplex algorithm (Dantzig, 1948) is specifically designed to move from one primal feasible basic solution to another while checking if the corresponding basis induces a dual feasible basic solution.
Shortly after the first algorithmic approaches and theoretical results became available, the need to incorporate uncertainty in the parameters became apparent (see Dantzig, 1955; Beale, 1955; Ferguson and Dantzig, 1956 for early contributions). In fact, apart from its relevance in numerical stability issues, in many applications the parameters reflect practical needs (budget, prices or capacities) but are not available exactly. This has opened a wealth of approaches to account for randomness in linear programs. Common to all these formulations is the assumption that some parameters in (\(\text {P}_{b}\)) are random and follow a known probability distribution. Important contributions in this regard are chance constrained linear programs, two- and multiple-stage programming as well as the theory of stochastic linear programs (see Shapiro et al., 2021 for a general overview). Specifically relevant to this paper is the so-called distribution problem characterizing the distribution of the random variable \(\nu (X)\), where the right-hand side b (and possibly A and c) in (\(\text {P}_{b}\)) is replaced by a random variable X following a specific law (Tintner, 1960; Prékopa, 1966; Wets, 1980).
In this paper, we take a related route and focus on statistical aspects of the standard linear program (\(\text {P}_{b}\)) if the right-hand side b is replaced by a consistent estimator \(b_n\) indexed by \(n\in \mathbb {N}\), e.g., based on n observations. In contrast to the aforementioned approaches, we only assume the random quantity \(r_n(b_n-b)\) to converge weakly (denoted by \(\xrightarrow []{D}\)) to some limit law G as n tends to infinity^{Footnote 2}. Our main goal is to characterize the asymptotic distributional limit of the empirical optimal solution
around its population quantities after proper standardization. For the sake of exposition, suppose that \(x^\star (b)\) in (1) is unique^{Footnote 3}. The main results in Theorem 3.1 and Theorem 3.3 state that under suitable assumptions on (\(\text {P}_{b}\)) it holds, as n tends to infinity, that
where \(M:\mathbb {R}^m\rightarrow \mathbb {R}^d\) is given in Theorem 3.1. The function M in (2) is possibly random, and its explicit form is driven by the amount of degeneracy present in the primal and dual optimal solutions. The simplest case occurs if \(x^\star (b)\) is nondegenerate. The function M is then a linear transformation depending on the corresponding unique optimal basis, so that the limit law M(G) is Gaussian if G is Gaussian. If \(x^\star (b)\) is degenerate but all dual optimal (basic) solutions for (\(\text {D}_{b}\)) are nondegenerate, then M is a sum of deterministic linear transformations defined on closed and convex cones indexed by the collection of dual optimal bases. Specifically, the number of summands in M is equal to the number of dual optimal basic solutions for (\(\text {D}_{b}\)). A more complicated situation arises if both \(x^\star (b)\) and some dual optimal basic solutions are degenerate. In this case, the function M is still a sum of linear transformations defined on closed and convex cones, but these transformations are potentially random and indexed by certain subsets of the set of optimal bases. The latter setting reflects the complex geometric and combinatorial nature in linear programs under degeneracy.
Let us mention at once that limiting distributions for the empirical optimal solution in the form of (2) have been studied for a long time in the more general setting of (potentially) nonlinear optimization problems; see for example Dupačová (1987), Dupačová & Wets (1988), Shapiro (1991, 1993, 2000), King & Rockafellar (1993). Regularity assumptions such as strong convexity of the objective function near the (unique) optimizer allow for either explicit asymptotic expansions of optimal values and optimal solutions or applications of implicit function theorems and generalizations thereof. These conditions usually do not hold for the linear programs considered in this paper.
To the best of our knowledge, our results are the first that cover limit laws for empirical optimal solutions to standard linear programs even beyond the nondegenerate case and without assuming uniqueness of optimizers. However, our proof technique relies on well-known concepts from parametric optimization and sensitivity analysis for linear programs (Guddat et al., 1974; Greenberg, 1986; Ward & Wendell, 1990; Hadigheh & Terlaky, 2006). Indeed, our approach is based on a careful study of the collection of dual optimal bases. An early contribution in this regard is the basis decomposition theorem by Walkup & Wets (1969a) analyzing the behavior of \(\nu (b)\) in (\(\text {P}_{b}\)) as a function of b (see also Remark 3.2). Each dual feasible basis defines a so-called decision region over which the optimal value \(\nu (b)\) is linear. The integration over the collection of all these regions yields closed-form expressions for the distribution problem (Bereanu, 1963; Ewbank et al., 1974). Further, related stability results are also found in the work by Walkup & Wets (1967), Böhm (1975), Bereanu (1976) and Robinson (1977). In algebraic geometry, decision regions are closely related to cone triangulations of the primal feasible optimization region (Sturmfels & Thomas, 1997; De Loera et al., 2010). We emphasize that rather than working with decision regions directly, our analysis is tailored to cones of feasible perturbations. In particular, we are interested in regions capturing feasible directions as our problem setting is based on the random perturbation \(\sqrt{n}\left( b_n-b\right) \). These regions turn out to be closed, convex cones and appear as indicator functions in the (random) function M in (2).
Our proof technique allows us to recover some related known results for random linear programs (see Sect. 3). These include convergence of the optimality sets in Hausdorff distance (Proposition 3.7), and a limit law for the optimal value
as n tends to infinity. Indeed, (3) is a simple consequence of general results in constrained optimization (Shapiro, 2000; Bonnans & Shapiro, 2000), and the optimality set convergence follows from Walkup & Wets (1969b).
Our statistical analysis for random linear programs in standard form is motivated by recent findings in statistical optimal transport (OT). More precisely, while there exists a thorough theory for limit laws on empirical OT costs on discrete spaces (Sommerfeld & Munk, 2018; Tameling et al., 2019), related statements for their empirical OT solutions remain open. An exception is Klatt et al. (2020), who provide limit laws for empirical (entropy) regularized OT solutions, thus modifying the underlying linear program to be strictly convex, nonlinear and, most importantly, nondegenerate in the sense that every regularized OT solution is strictly positive in each coordinate. Hence, an implicit function theorem approach in conjunction with a delta method allows one to conclude Gaussian limits in this case. This stands in stark contrast to the nonregularized OT considered in this paper, where the degenerate case is generic rather than the exception for most practical situations. Only if the OT solution is unique and nondegenerate do we observe a Gaussian fluctuation on the support set, i.e., on all entries with positive values. If the OT solution is degenerate (or not unique), then the asymptotic limit law (2) is usually not Gaussian anymore. Degeneracy in OT easily occurs as soon as certain subsets of demand and supply sum up to the same quantity. In particular, we encounter the largest degree of degeneracy if individual demand is equal to individual supply. Additionally, we obtain necessary and sufficient conditions on the cost function in order for the dual OT to be nondegenerate. These may be of independent interest, and allow us to prove almost sure uniqueness results for quite general cost functions.
Our distributional results can be viewed as a basis for uncertainty quantification and other statistical inference procedures concerning solutions to linear programs. For brevity, we mention such applications in passing and do not elaborate further on them, leaving a detailed study of statistical consequences such as testing or confidence statements as an important avenue for further research.
The outline of the paper is as follows. We recap basics for linear programming in Sect. 2 also introducing deterministic and stochastic assumptions for our general theory. Our main results are summarized in Sect. 3, followed by their proofs in Sect. 4. The assumptions are discussed in more detail in Sect. 5. Section 6 focuses on OT and gives limit laws for empirical OT solutions.
2 Preliminaries and assumptions
This section introduces notation and assumptions required to state the main results of the paper. Along the way, we recall basic facts of linear programming and refer to Bertsimas & Tsitsiklis (1997) and Luenberger & Ye (2008) for details.
Linear Programs and Duality. Let the columns of a matrix \(A\in \mathbb {R}^{m\times d}\) be enumerated by the set \([d]:= \{1,\ldots ,d\}\). Consider for a subset \(I\subseteq [d]\) the submatrix \(A_I\in \mathbb {R}^{m\times \vert I \vert }\) formed by the corresponding columns indexed by I. Similarly, \(x_I\in \mathbb {R}^{I}\) denotes the coordinates of \(x\in \mathbb {R}^d\) corresponding to I. By full rank of A in (\(\text {P}_{b}\)), there always exists an index set I with cardinality m such that \(A_I\in \mathbb {R}^{m\times m}\) is one-to-one. An index set I with that property is termed a basis and induces a primal and a dual basic solution
respectively. Herein, and in order to match dimensions (a solution for (\(\text {P}_{b}\)) has dimension d instead of \(m\le d\)), the linear operator \(\mathrm {Aug}_I:\mathbb {R}^m\rightarrow \mathbb {R}^d\) augments zeroes in the coordinates that are not in I. If \(\lambda (I)\) (resp. x(I, b)) is feasible for (\(\text {D}_{b}\)) (resp. (\(\text {P}_{b}\))) then it constitutes a dual (resp. primal) feasible basic solution with dual (resp. primal) feasible basis I. Moreover, \(\lambda (I)\) (resp. x(I, b)) is termed a dual (resp. primal) optimal basic solution if it is feasible and optimal for (\(\text {D}_{b}\)) (resp. (\(\text {P}_{b}\))). Indeed, as long as (\(\text {D}_{b}\)) admits a feasible (optimal) solution there exists a dual feasible (optimal) basic solution, and vice versa for (\(\text {P}_{b}\)). At the heart of linear programming is the strong duality statement.
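To make these definitions concrete, the following minimal numpy sketch (the data A, b, c and the basis I are illustrative choices, not taken from the paper) computes \(x(I,b)=\mathrm {Aug}_I(A_I^{-1}b)\) and \(\lambda (I)=(A_I^T)^{-1}c_I\) and checks primal and dual feasibility:

```python
import numpy as np

def basic_solutions(A, b, c, I):
    """Primal basic solution x(I, b) and dual basic solution lambda(I)
    for a basis I (list of m column indices with A_I invertible)."""
    A_I = A[:, I]
    x_I = np.linalg.solve(A_I, b)            # solve A_I x_I = b
    x = np.zeros(A.shape[1])
    x[I] = x_I                               # Aug_I: pad zeros outside I
    lam = np.linalg.solve(A_I.T, c[I])       # lambda(I) = (A_I^T)^{-1} c_I
    return x, lam

# Toy standard-form program: minimize c^T x subject to Ax = b, x >= 0
A = np.array([[1.0, 1.0, 1.0, 0.0],
              [0.0, 1.0, 2.0, 1.0]])
b = np.array([4.0, 3.0])
c = np.array([2.0, 1.0, 3.0, 0.0])

I = [0, 1]                                   # candidate basis
x, lam = basic_solutions(A, b, c, I)
primal_feasible = np.all(x >= 0)             # x(I, b) >= 0 ?
dual_feasible = np.all(lam @ A <= c + 1e-9)  # lambda^T A <= c^T ?
print(x, lam, primal_feasible, dual_feasible)
```

Both checks succeed here, so by the strong duality statement below the basis is optimal and the primal and dual objective values coincide, \(c^Tx = b^T\lambda = 5\).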
Fact 2.1
Consider the primal linear program (\(\text {P}_{b}\)) and its dual (\(\text {D}_{b}\)).

(i)
If either of the linear programs (\(\text {P}_{b}\)) or (\(\text {D}_{b}\)) has a finite optimal solution, so does the other and the corresponding optimal values are the same.

(ii)
If for a basis \(I\subseteq [d]\) the vector \(\lambda (I)\) is dual feasible and x(I, b) is primal feasible, then both are primal and dual optimal basic solutions, respectively.

(iii)
If (\(\text {P}_{b}\)) and (\(\text {D}_{b}\)) are feasible then there always exists a basis \(I\subseteq [d]\) such that x(I, b) and \(\lambda (I)\) are primal and dual optimal basic solutions, respectively.
We introduce the feasibility and optimality sets for the primal (\(\text {P}_{b}\)) by
respectively. Notably, in the theory to follow, A and c are assumed to be fixed and only the dependence of these sets on the parameter b is emphasized. We introduce our first assumption:
In view of the strong duality statement in Fact 2.1, solving a linear program might be carried out focusing on the collection of all dual feasible bases. We partition this collection into two subsets depending on their feasibility for the primal program.
Remark 2.2
(Splitting of the Bases Collection) Let \(I_1,\dots ,I_{N}\) enumerate all dual feasible bases, and let \(1\le K\le N\) be such that
Notably, by Fact 2.1 the primal basic solution \(x(I_k,b)\) is optimal for all \(k\le K\). Recall that the convex hull \(\mathcal {C}\left( x_1,\ldots ,x_K\right) \) of a collection of points \(\{x_1,\ldots ,x_K\}\subset \mathbb {R}^d\) is the set of all possible convex combinations of them.
Fact 2.3
Consider the primal linear program (\(\text {P}_{b}\)) and assume (\(\mathbf{A1} \)) holds. Then for any right-hand side \(\tilde{b}\in \mathbb {R}^m\) exactly one of the following statements holds.

(i)
The feasible set \(\mathcal {P}\left( \tilde{b}\right) \) is empty.

(ii)
The optimality set \(\mathcal {P}^\star \left( \tilde{b}\right) \) is nonempty, bounded and equal to the convex hull
$$\begin{aligned} \mathcal {P}^\star \left( \tilde{b}\right) = \mathcal {C}\left( \left\{ x(I,\tilde{b}) \,\mid \, I \text { primal and dual feasible basis for } (\text {P}_{\tilde{b}}) \text { and } (\text {D}_{\tilde{b}})\right\} \right) . \end{aligned}$$
The restriction of the convex hull to basic solutions induced by primal and dual optimal bases in Fact 2.3 is well-known. A straightforward argument is based on the simplex method, which, if set up with appropriate pivoting rules, always terminates. If (\(\mathbf{A1} \)) holds and there exists a unique basis \((K=1)\), then the primal program attains a unique solution. Uniqueness of solutions to linear programs is related to degeneracy of the corresponding dual solutions. A dual feasible basic solution \(\lambda (I)\) is degenerate if more than m of the d inequalities \(\lambda (I)^TA\le c^T\) hold as equalities. Similarly, a primal feasible basic solution x(I, b) is degenerate if fewer than m of its coordinates are nonzero.
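The two degeneracy notions just defined are easy to check numerically. In the sketch below (the data and helper names are illustrative, and a small tolerance stands in for exact arithmetic) we count nonzero coordinates and tight dual inequalities:

```python
import numpy as np

def is_primal_degenerate(x, m, tol=1e-9):
    # x(I, b) is degenerate if fewer than m coordinates are nonzero
    return bool(np.sum(np.abs(x) > tol) < m)

def is_dual_degenerate(lam, A, c, tol=1e-9):
    # lambda(I) is degenerate if more than m of the d inequalities
    # lambda^T A <= c^T hold with equality
    m = A.shape[0]
    return bool(np.sum(np.abs(lam @ A - c) < tol) > m)

A = np.array([[1.0, 1.0, 1.0, 0.0],
              [0.0, 1.0, 2.0, 1.0]])
c = np.array([2.0, 1.0, 3.0, 0.0])
x = np.array([1.0, 3.0, 0.0, 0.0])    # basic solution for the basis {0, 1}
lam = np.array([2.0, -1.0])

pd = is_primal_degenerate(x, A.shape[0])
dd = is_dual_degenerate(lam, A, c)
print(pd, dd)   # -> False False: both basic solutions are nondegenerate
```

Here x has exactly \(m=2\) nonzero coordinates and \(\lambda \) makes exactly m inequalities tight, matching the nondegenerate situation described above.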
Fact 2.4
Consider the linear program (\(\text {P}_{b}\)) and its dual (\(\text {D}_{b}\)).

(i)
If (\(\text {P}_{b}\)) (resp. (\(\text {D}_{b}\))) has a nondegenerate optimal basic solution, then (\(\text {D}_{b}\)) (resp. (\(\text {P}_{b}\))) has a unique solution.

(ii)
If (\(\text {P}_{b}\)) (resp. (\(\text {D}_{b}\))) has a unique nondegenerate optimal basic solution, then (\(\text {D}_{b}\)) (resp. (\(\text {P}_{b}\))) has a unique nondegenerate optimal solution.

(iii)
If (\(\text {P}_{b}\)) (resp. (\(\text {D}_{b}\))) has a unique degenerate optimal basic solution, then (\(\text {D}_{b}\)) (resp. (\(\text {P}_{b}\))) has multiple solutions.
For a proof of Fact 2.4, we refer to Gal & Greenberg (2012, Lemma 6.2) in combination with strict complementary slackness from Goldman & Tucker (1956, Corollary 2A), stating that for a feasible primal and dual pair of linear programs there exists a pair \((x,\lambda )\) of primal and dual optimal solutions such that either \(x_j>0\) or \(\lambda ^TA_j<c_j\) for all \(1\le j \le d\). In addition to uniqueness statements, many results in linear programming simplify when degeneracy is excluded. Related to degeneracy but slightly weaker is the assumption
Indeed, if \(\mathcal {P}^\star (b)\) is nonempty and bounded, assumption (\(\mathbf{A2} \)) characterizes nondegeneracy of all dual optimal basic solutions.
Lemma 2.5
Suppose assumption (\(\mathbf{A1} \)) holds. Then assumption (\(\mathbf{A2} \)) is equivalent to nondegeneracy of all dual optimal basic solutions.
To see that (\(\mathbf{A1} \)) is necessary, let \(D_m\in \mathbb {R}^{m\times m}\) with \(m\ge 2\) be the identity matrix. Suppose that \(A=(D_m, -D_m)\in \mathbb {R}^{m\times 2m}\), \(c\in \mathbb {R}^{2m}_+\) is strictly positive except that \(c_1=c_{m+1}=0\), and \(b=(1,0,0,\dots ,0)\in \mathbb {R}^m\). Then there are \(K=2^{m-1}\) optimal bases defining K distinct (degenerate) dual solutions, so that assumption (\(\mathbf{A2} \)) holds while nondegeneracy of the dual optimal basic solutions fails. Note that \(\mathcal {P}^\star (b)\) is unbounded and contains the optimal ray \((b^T,b^T)\).
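This counterexample can be verified numerically. The sketch below (pure numpy with \(m=2\); the enumeration helper is ours) lists the bases picking column i or m+i per row, keeps the primal and dual feasible ones, and counts tight dual inequalities:

```python
import numpy as np
from itertools import product

m = 2
D = np.eye(m)
A = np.hstack([D, -D])                 # A = (D_m, -D_m)
c = np.array([0.0, 1.0, 0.0, 1.0])     # strictly positive except c_1 = c_{m+1} = 0
b = np.zeros(m); b[0] = 1.0            # b = (1, 0, ..., 0)

# Enumerate bases choosing column i or m + i per row; keep those that are
# both primal and dual feasible, i.e., the optimal bases.
optimal_duals = []
for choice in product([0, 1], repeat=m):
    I = [i + m * s for i, s in enumerate(choice)]
    x_I = np.linalg.solve(A[:, I], b)
    lam = np.linalg.solve(A[:, I].T, c[I])
    if np.all(x_I >= 0) and np.all(lam @ A <= c + 1e-9):
        optimal_duals.append(lam)

K = len(optimal_duals)                 # expected: 2^{m-1} optimal bases
tight = [int(np.sum(np.abs(lam @ A - c) < 1e-9)) for lam in optimal_duals]
ray = np.concatenate([b, b])           # the optimal ray (b^T, b^T)^T
print(K, tight, A @ ray, c @ ray)      # each dual has m + 1 > m tight inequalities
```

The run confirms \(K=2^{m-1}\) distinct dual solutions, each with \(m+1>m\) tight inequalities (hence degenerate), while the ray direction satisfies \(A(b^T,b^T)^T=0\) at zero cost, so \(\mathcal {P}^\star (b)\) is unbounded.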
Random Linear Programs. Introducing randomness in problems (\(\text {P}_{b}\)) and (\(\text {D}_{b}\)), we suppose that we have incomplete knowledge of \(b\in \mathbb {R}^m\), and replace it by a (consistent) estimator \(b_n\), e.g., based on a sample of size n independently drawn from a distribution with mean b. This defines empirical primal and dual counterparts (\(\text {P}_{b_n}\)) and (\(\text {D}_{b_n}\)), respectively. We allow the more general case that only the first \(m_0\in \{0,\dots ,m\}\) coordinates of b are unknown^{Footnote 4} and assume the existence of a sequence of random vectors \(b_n=(b_n^{m_0},[b]_{m-m_0})^T\in \mathbb {R}^{m_0}\times \mathbb {R}^{m-m_0}\) converging to b at rate \(r_n^{-1} \rightarrow 0\) as n tends to infinity
where \(\xrightarrow []{D}\) denotes convergence in distribution. In a typical central limit theorem type scenario, \(r_n=\sqrt{n}\) and \(G^{m_0}\) is a centred Gaussian random vector in \(\mathbb {R}^{m_0}\), assumed to have a nonsingular covariance matrix. Assumption (\(\mathbf{B1} \)) implies that \(b_n\rightarrow b\) in probability. In order to avoid pathological cases, we impose the last assumption that asymptotically an optimal solution \(x^\star (b_n)\) for the primal \((\text {P}_{b_n})\) exists
Further discussions on the assumptions are deferred to Sect. 5.
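The central limit theorem type scenario behind (\(\mathbf{B1} \)) is easy to simulate. In the toy sketch below (the Gaussian sampling model, the value of b and the sample size are arbitrary illustrative choices) \(b_n\) is an empirical mean and \(r_n=\sqrt{n}\):

```python
import numpy as np

rng = np.random.default_rng(0)
b = np.array([4.0, 3.0])                  # population right-hand side (toy values)
n = 10_000

# b_n: empirical mean of n i.i.d. observations with mean b
X = rng.normal(loc=b, scale=1.0, size=(n, 2))
b_n = X.mean(axis=0)

fluct = np.sqrt(n) * (b_n - b)            # approximately N(0, I_2) by the CLT
print(b_n, fluct)
```

The standardized fluctuation \(\sqrt{n}(b_n-b)\) is then approximately standard Gaussian, matching the limit law G with a nonsingular covariance matrix.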
3 Main results
According to Fact 2.3, in the presence of (\(\mathbf{A1} \)) any optimal solution \(x^{\star }(b_n)\in \mathcal P^\star (b_n)\) takes the form
where \(\mathcal {K}\) is a nonempty subset of \([N]:= \{1,\dots ,N\}\) and \(\alpha _n^\mathcal {K}\in \mathbb {R}^N\) is a random vector in the (essentially \(\mathcal {K}\)dimensional) unit simplex \(\Delta _\mathcal {K}:= \left\{ \alpha \in \mathbb {R}_+^{N}\, \mid \,\Vert \alpha \Vert _1=1,\alpha _k=0\ \forall k\notin \mathcal {K}\right\} \). The main result of the paper states the following asymptotic behaviour for the empirical optimal solution.
Theorem 3.1
Suppose assumptions (\(\mathbf{A1} \)), (\(\mathbf{B1} \)), and (\(\mathbf{B2} \)) hold, and let \(x^{\star }(b_n)\in \mathcal {P}^\star (b_n)\) be any (measurable) choice of an optimal solution. Further, assume that for all nonempty \(\mathcal {K}\subseteq [K]\), the random vectors \(\left( \alpha _n^\mathcal {K}, G_n\right) \) converge jointly in distribution as n tends to infinity to \((\alpha ^\mathcal {K},G)\) on \(\Delta _{\vert \mathcal {K}\vert }\times \mathbb {R}^m\). Then there exist closed convex cones \(H_1^{m_0},\ldots ,H_K^{m_0}\subseteq \mathbb {R}^{m_0}\) and random vectors \(Y_n\in \mathcal {P}^\star (b)\) such that
where the sum runs over nonempty subsets \(\mathcal {K}\) of [K] and \(H_\mathcal {K}^{m_0}=\cap _{k\in \mathcal {K}}H_k^{m_0}\).
Remark 3.2
Underlying Theorem 3.1 is the wellknown approach of partitioning \(\mathbb R^m\) into (closed convex) cones. Indeed, the union of the closed convex cones
is the feasibility set \(A_+:=\{Ax:x\ge 0\}\subseteq \mathbb {R}^m\) and on each cone the optimal solution is an affine function of b (e.g., Walkup & Wets, 1969a; Guddat et al., 1974, ). The cones \(\widetilde{H}_k\) depend only on A and c. In contrast, our cones \(H_k^{m_0}\) also depend on b and define directions of perturbations of b that keep \(\lambda (I_k)\) optimal for the perturbed problem for a given \(k\le K\). Assume for simplicity that \(m_0=m\) and write \(H_k\) instead of \(H_k^{m_0}\). If \(b=0\), then \(K=N\) and cones coincide \(H_k=\widetilde{H}_k\), but otherwise \(H_k\) is a strict superset of \(\widetilde{H}_k\) as the corresponding representation (7) of \(\widetilde{H}_k\) requires nonnegativity on all coordinates. This is also in line with the observation that there are fewer (K) cones \(H_k\) than there are \(\widetilde{H}_k\), namely N, and the union of the \(H_k\)’s is a space that is at least as large as \(A_+\) (since \(b+A_+\subseteq A_+\) because \(b\in A_+\)), typically \(\mathbb R^m\). As an extreme example, suppose that (\(\text {P}_{b}\)) has a unique nondegenerate optimal solution \(x(I_1,b)\). Then \(K=1\) and \(H_1=\mathbb R^m\) but the \(\widetilde{H}_k\)’s are strict subsets of \(\mathbb R^m\) unless \(N=1\).
In Sect. 5, we discuss sufficient conditions for the joint distributional convergence of the random vector \(\left( \alpha _n^\mathcal {K},G_n\right) \). In short, if we use any linear program solver, such joint distributional convergence appears to be reasonable. If the optimal basis is unique (\(K=1\)) with \(x^\star (b)=x(I_1,b)\) nondegenerate, then \(\lambda (I_1)\) is nondegenerate, and the proof shows that \(H_1^{m_0}=\mathbb {R}^{m_0}\). The distributional limit theorem then takes the simple form
In general, when \(K>1\), the number of summands in the limiting random variable in Theorem 3.1 might grow exponentially in K. In between these two cases is the situation that assumption (\(\mathbf{A2} \)) holds, which implies all dual optimal basic solutions for (\(\text {D}_{b}\)) are nondegenerate (see Lemma 2.5). The limiting random variable then simplifies, as the subsets \(\mathcal {K}\) must be singletons.
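The simple form of the limit law in the nondegenerate case can be illustrated by simulation. The sketch below uses a brute-force basis-enumeration solver (our own toy helper, adequate only for tiny instances; the data are illustrative) and checks that the standardized fluctuation of the empirical solution equals the linear image \(\mathrm {Aug}_{I_1}(A_{I_1}^{-1}G_n)\) of the right-hand-side fluctuation:

```python
import numpy as np
from itertools import combinations

def solve_lp(A, b, c):
    """Brute-force LP solver: enumerate all bases and return the cheapest
    primal feasible basic solution (only sensible for tiny instances)."""
    m, d = A.shape
    best, best_val = None, np.inf
    for I in combinations(range(d), m):
        A_I = A[:, I]
        if abs(np.linalg.det(A_I)) < 1e-12:
            continue
        x_I = np.linalg.solve(A_I, b)
        if np.all(x_I >= -1e-12):
            x = np.zeros(d); x[list(I)] = x_I
            if c @ x < best_val:
                best, best_val = x, c @ x
    return best

A = np.array([[1.0, 1.0, 1.0, 0.0],
              [0.0, 1.0, 2.0, 1.0]])
b = np.array([4.0, 3.0])
c = np.array([2.0, 1.0, 3.0, 0.0])
I = [0, 1]                                  # the unique, nondegenerate optimal basis

x_star = solve_lp(A, b, c)
rng = np.random.default_rng(1)
n = 10_000
b_n = b + rng.normal(size=2) / np.sqrt(n)   # small random perturbation of b
r_n = np.sqrt(n)

lhs = r_n * (solve_lp(A, b_n, c) - x_star)  # empirical fluctuation
G_n = r_n * (b_n - b)
rhs = np.zeros(4); rhs[I] = np.linalg.solve(A[:, I], G_n)
print(np.allclose(lhs, rhs))                # -> True: M is the linear map Aug_I(A_I^{-1} .)
```

For small perturbations the optimal basis does not change, so the fluctuation of the solution is exactly the linear transformation of \(G_n\), consistent with the Gaussian limit in the nondegenerate case.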
Theorem 3.3
Suppose assumptions (\(\mathbf{A1} \)), (\(\mathbf{A2} \)), and (\(\mathbf{B2} \)) hold, and that \(r_n(b_n - b)\xrightarrow []D G\). Then any^{Footnote 5} (measurable) choice of \(x^{\star }(b_n)\in \mathcal {P}^\star (b_n)\) satisfies
with the closed and convex cones \(H_k^{m_0}\) as given in Theorem 3.1.
Remark 3.4
With respect to Theorem 3.1, assumption (\(\mathbf{B1} \)) is weakened in Theorem 3.3 as absolute continuity of G (or \(G^{m_0}\)) is not required. Indeed, it can be arbitrary, and Theorem 3.3 thus accommodates, e.g., Poisson limit distributions. The proof shows that if G is absolutely continuous (i.e., \(m_0=m\)) then the indicator functions of \(\{G\in H_k^m\setminus \cup _{j<k} H_j^m\}\) simplify to \(\{G\in H_k^m\}\), because the intersections \(H_k^m\cap H_j^m\) have Lebesgue measure zero. The distributional limit theorem then reads as
If the optimal solution of the limiting problem is unique, Theorem 3.1 can be formulated in a setwise sense. The Hausdorff distance between two closed nonempty sets \(A,B\subseteq \mathbb {R}^d\) is
The collection of closed subsets of \({\mathbb {R}^d}\) equipped with \(d_H\) is a metric space (with possibly infinite distance) and convergence in distribution is defined as usual by integrals of continuous realvalued bounded functions; see for example King (1989), where the delta method is developed in this context. Recall that \(\mathcal {C}\) stands for convex hull.
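For finite point sets, such as sets of basic solutions, the Hausdorff distance can be computed directly from the pairwise distance matrix; a minimal numpy sketch (the point sets are illustrative):

```python
import numpy as np

def hausdorff(P, Q):
    """Hausdorff distance between two finite point sets in R^d,
    given as arrays of shape (n_points, d)."""
    D = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=-1)
    # max over P of the distance to Q, and vice versa
    return max(D.min(axis=1).max(), D.min(axis=0).max())

P = np.array([[0.0, 0.0], [1.0, 0.0]])
Q = np.array([[0.0, 0.0], [1.0, 1.0]])
print(hausdorff(P, Q))   # -> 1.0
```

This is the quantity whose convergence rate is controlled in Proposition 3.7 below.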
Theorem 3.5
Suppose assumptions (\(\mathbf{A1} \)), (\(\mathbf{B1} \)), and (\(\mathbf{B2} \)) hold, and that \(\mathcal P^{\star }(b)=\{x^\star (b)\}\) is a singleton. On the collection of closed subsets of \(\mathbb {R}^d\) with the Hausdorff distance \(d_H\) it holds that
where \(H_k^{m_0}\) and \(H_\mathcal {K}^{m_0}\) are as defined in Theorem 3.1.
We conclude this section by giving two further consequences of our proof techniques: a limit law for the objective value \(\nu (b)\) for (\(\text {P}_{b}\)), and convergence in probability of optimality sets. Since the former is well-known and holds for more general, infinite-dimensional convex programs, we omit the proof details and instead refer to Shapiro (2000), Bonnans & Shapiro (2000) and results by Sommerfeld & Munk (2018), Tameling et al. (2019) tailored to OT.
Proposition 3.6
Under assumptions (\(\mathbf{A1} \)), (\(\mathbf{B1} \)), and (\(\mathbf{B2} \)) it holds that
Another consequence of our basis-driven approach underlying the proof of Theorem 3.1 is that the convergence of the Hausdorff distance
between \(\mathcal {P}^\star (b_n)\) and \(\mathcal {P}^\star (b)\) is of order \(O_\mathbb {P}(r_n^{-1})\). A different and considerably shorter argument relies on Walkup & Wets (1969b) and proves the following result.
Proposition 3.7
Suppose assumptions (\(\mathbf{A1} \)) and (\(\mathbf{B2} \)) hold. If \(\Vert b_n - b\Vert =O_\mathbb {P}(r_n^{-1})\), then it follows that \(d_H\left( \mathcal {P}^\star (b_n),\mathcal {P}^\star (b)\right) =O_\mathbb {P}(r_n^{-1})\).
We also refer to the work by Robinson (1977) for a similar result when the primal and dual optimality sets are both bounded.
4 Proofs for the main results
To simplify the notation, we assume that all random vectors in the paper are defined on a common generic probability space \((\Omega ,\mathcal F,\mathbb {P})\). This is no loss of generality by the Skorokhod representation theorem.
Preliminary steps. Recall from Remark 2.2 that bases \(I_1,\dots ,I_K\) are feasible for (\(\text {P}_{b}\)) and (\(\text {D}_{b}\)) and hence optimal. The bases \(I_{K+1},\dots ,I_N\) are only feasible for (\(\text {D}_{b}\)) but not for (\(\text {P}_{b}\)). For a set \(\mathcal {K}\subseteq [N]\) define the events, i.e., subsets of the underlying probability space
By strong duality (Fact 2.1 (ii)), the set \(A_n^{\mathcal {K}}\) is the event that the bases indexed by \(\mathcal {K}\) are precisely those that are optimal for (\(\text {P}_{b_n}\)) and (\(\text {D}_{b_n}\)). We have \(A_n^\mathcal {K}\subseteq B_n^\mathcal {K}\), and \(B_n^\mathcal {K}\subseteq B_n^{\{k\}}\) for all \(k\in \mathcal {K}\). We start with two important observations, the first stating that only subsets of [K] matter asymptotically.
Lemma 4.1
Suppose that \(b_n\xrightarrow []D b\).

(i)
It holds that \(\mathbb P\left( B_n^{\{k\}}\right) \rightarrow 0\) as \(n\rightarrow \infty \) for all \(k>K\).

(ii)
If assumptions (\(\mathbf{A1} \)) and (\(\mathbf{B2} \)) hold, then with high probability \(\mathcal P^\star (b_n)\) is bounded and nonempty.
Proof
For (i), observe that for \(k>K\) there exists an index \(i\in [d]\) such that \(x_i(I_k,b)<0\). The same inequality holds for \(b_n\) if sufficiently close to b, which happens with high probability. For (ii), nonemptiness with high probability follows from assumption (\(\mathbf{B2} \)), so we only prove boundedness. Indeed, assumption (\(\mathbf{A1} \)) implies that the recession cone \( \{x\ge 0\mid Ax=0,c^T x=0\} \) is trivial, i.e., equals \(\{0\}\). This property does not depend on \(b_n\), which yields the result. \(\square \)
The event \(A_n^\emptyset \) is equivalent to (\(\text {P}_{b_n}\)) being either infeasible or unbounded, and this has probability o(1) by (\(\mathbf{B2} \)). Combining this with the previous lemma and the fact that the sets \((A_n^\mathcal {K})_\mathcal {K}\) form a partition of the probability space \(\Omega \), we deduce
where \(\mathbbm {1}_{A}(\omega )\) denotes the usual indicator function of the set A. Defining the random vector
that lies in \(\mathcal {P}^\star (b)\) (because \(\mathcal {K}\subseteq [K]\)), we obtain
We next investigate the indicator functions \(\mathbbm {1}_{A_n^\mathcal {K}}(\omega )\) appearing in (6). Omitting the dependence of \(b_n\) on \(\omega \), we rewrite
At the last internal intersection in the above display we can, with high probability, restrict to those i in the primal degeneracy set \(\mathrm {DP}_k:=\{i\in I_k \mid x_i(I_k,b)=0\}\). Indeed, for \(i\notin I_k\), the inequality reads \(0\ge 0\), whereas for \(i\in I_k\setminus \mathrm {DP}_k\) the right-hand side tends to infinity and the left-hand side is bounded in probability. In other words, \(\mathbb {P}(B_n^\mathcal {K})=o(1)+\mathbb {P}(G_n^{m_0}\in \cap _{k\in \mathcal {K}} H_k^{m_0})\), where
For \(\emptyset \subset \mathcal {K}\subseteq [K]\) define \(H_{\mathcal {K}}^{m_0}=\cap _{k\in \mathcal {K}}H_k^{m_0}\), and write
where the union over \(k>K\) can be neglected by Lemma 4.1. Thus we conclude that
With these preliminary statements at our disposal, we are ready to prove the main results.
Proof
(Theorem 3.1) The goal is to replace \(G_n^{m_0}\) by \(G^{m_0}\) in the indicator function in (8) in the limit as n tends to infinity. By the Portmanteau theorem (Billingsley, 1999, Theorem 2.1) and elementary arguments^{Footnote 6} it suffices to show that the \(m_0\)-dimensional boundary of each \(H_k^{m_0}\) has Lebesgue measure zero. This is indeed the case, since they are convex sets and the boundary of a convex set has Lebesgue measure zero. Define the function \(T^\mathcal {K}:\mathbb {R}^{\mathcal {K}}\times \mathbb {R}^m\rightarrow \mathbb {R}^d\)
This function is continuous for all \(\alpha \in \mathbb {R}^{\mathcal {K}}\) and all vectors \(v\in \mathbb {R}^m\) such that \(v_{[m_0]}\notin \partial [H_\mathcal {K}^{m_0}\setminus \bigcup _{k\in [K]\setminus \mathcal {K}}H_k^{m_0}]\). In particular, the continuity set is of full measure with respect to \((\alpha ^{\mathcal {K}},G)\). As there are finitely many possible subsets \(\mathcal {K}\), denoted by \(\mathcal {K}_1,\dots ,\mathcal {K}_B\), the function \(T=\left( T^{\mathcal {K}_1},\dots ,T^{\mathcal {K}_B}\right) :\mathbb {R}^{\sum _{i=1}^B\vert \mathcal {K}_i\vert }\times \mathbb {R}^m \rightarrow (\mathbb {R}^d)^B\) defined by
is continuous Galmost surely. The continuous mapping theorem together with the assumed joint distributional convergence of the random vector \((\alpha _n^\mathcal {K},G_n)\) yield that
which completes the proof of Theorem 3.1. \(\square \)
Proof
(Theorem 3.3) With high probability (\(\mathbf{A1} \)) and (\(\mathbf{A2} \)) hold for \(b_n\) (by Lemma 4.1 for the former and trivially for the latter), which implies that \(\mathcal {P}^\star (b_n)\) is a singleton (Lemma 2.5 and Fact 2.4). Hence, regardless of the choice of \(\alpha _n^\mathcal {K}\), it holds that \(\mathbbm {1}_{A_n^\mathcal {K}}x^\star (b_n)=x(I_{\min \mathcal {K}},b_n)\). In particular, we may assume without loss of generality that \(\alpha _n^\mathcal {K}\) are deterministic and do not depend on n. Thus the joint convergence in Theorem 3.1 holds, and (6) simplifies to
where \(K(\omega )\) is the minimal \(k\le K\) such that \(B_n^{\{k\}}\) holds. Since \(B_n^{\{k\}}\) is asymptotically \(\{G\in H_k^{m_0}\}\), Theorem 3.3 follows. Let us now show that M(G) simplifies to \(\sum _{k=1}^K \mathbbm {1}_{\left\{ G\in H_k^m\right\} }\,x\left( I_k,G\right) \) if G is absolutely continuous. It suffices to show that intersections \(H_j^m\cap H_k^m\) with \(j<k\le K\) have Lebesgue measure zero. If \(v\in H_j^m\cap H_k^m\) then there exists \(\eta >0\) such that \(x(I_j,b+\eta v)\ge 0\) and \(x(I_k,b+\eta v)\ge 0\). Since \(\lambda (I_k)\) and \(\lambda (I_j)\) are dual feasible, they must be optimal with respect to \(b+\eta v\). Thus it holds
By (\(\mathbf{A2} \)) the vector \(\lambda (I_k)-\lambda (I_j)\) is nonzero, and hence v is contained in its orthogonal complement, which indeed has Lebesgue measure zero. \(\square \)
Proof
(Theorem 3.5) We consider the optimality sets \(\mathcal P^\star (b_n)\) as elements of the power set of \(\mathbb {R}^d\) endowed with the Hausdorff distance \(d_H\).^{Footnote 7} Then, for all \(\mathcal {K}\subseteq [N]\) the mapping \(v\mapsto \mathcal {C}(x(I_\mathcal {K}),v)\) is Lipschitz, since without loss of generality \(\mathcal {K}\ne \emptyset \) and
It follows that
is a measurable random subset of \(\mathbb {R}^d\). According to Fact 2.3 in presence of (\(\mathbf{A2} \)) and the preceding computations
If \(\mathcal P^\star (b)=\{x^\star (b)\}\) is a singleton, then \(\mathcal {C}(x(I_{[K]}),b)=\{x^\star (b)\}\) and therefore
by the continuous mapping theorem. \(\square \)
Proof
(Proposition 3.7) Let \(K=\mathbb R_+^d\) and define the linear map \(\tau :\mathbb R^d\rightarrow \mathbb R^{m+1}\) by \(\tau (x)=(Ax,c^tx)\). For each b such that the linear program is feasible, let \(v_b\in \mathbb R\) be the optimal objective value. If \(\tau \) is injective, then the optimality sets are singletons and the result holds trivially. We thus assume that \(\tau \) is not injective, and observe that
Since K is a polyhedron and \(\tau \) is neither identically zero (A has full rank) nor injective, we can apply the main theorem of Walkup & Wets (1969b). We obtain
because the optimal values satisfy \(v_b-v_{b_n}=O_\mathbb {P}(r_n^{-1})\) by Proposition 3.6. \(\square \)
5 On the assumptions
We start by collecting some well-known facts from parametric optimization (see Walkup & Wets, 1969a; Guddat et al., 1974 for details). To this end, denote the dual feasible set by \(\mathcal {N}:=\left\{ \lambda \in \mathbb {R}^m\mid \, A^t\lambda \le c\right\} \). Further, define the set of feasible parameters by \(\mathcal {M}:= \left\{ b\in \mathbb {R}^m \mid \, \mathcal {P}(b)\ne \emptyset \right\} \) and the set of parameters admitting an optimal solution by \(\mathcal {M}^\star := \left\{ b\in \mathbb {R}^m \mid \, \mathcal {P}^\star (b)\ne \emptyset \right\} \).
Lemma 5.1
If for some \(b_0\in \mathcal {M}\) the set \(\mathcal {P}(b_0)\) is bounded (resp. unbounded) then \(\mathcal {P}(b)\) is bounded (resp. unbounded) for all \(b\in \mathcal {M}\). Similarly, if for some \(b_0\in \mathcal {M}^\star \) the set \(\mathcal {P}^\star (b_0)\) is bounded (resp. unbounded) then \(\mathcal {P}^\star (b)\) is bounded (resp. unbounded) for all \(b\in \mathcal {M}^\star \). Moreover, it holds that

(i)
the set \(\mathcal {M}\) is nonempty and equal to an \(m\)-dimensional convex cone.

(ii)
if the dual set \(\mathcal {N}\) is nonempty then it holds that \(\mathcal {M}=\mathcal {M}^\star \).

(iii)
if the dual set \(\mathcal {N}\) is nonempty and bounded then \(\mathcal {M}=\mathcal {M}^\star =\mathbb {R}^m\).
The following discussion on the assumptions is a consequence of Lemma 5.1. We first collect sufficient conditions for assumption (\(\mathbf{A1} \)).
Corollary 5.2
(Sufficiency for (\(\mathbf{A1} \))) The following statements hold.

(i)
If \(\mathcal {N}\) is nonempty and \(\mathcal {P}(b)\) is bounded for some \(b\in \mathcal {M}\), then assumption (\(\mathbf{A1} \)) holds for all \(b\in \mathcal {M}\).

(ii)
If \(\mathcal {N}\) is nonempty and bounded and \(\mathcal {P}^\star (b)\) is bounded for some \(b\in \mathbb {R}^m\), then assumption (\(\mathbf{A1} \)) holds for all \(b\in \mathbb {R}^m\).
Certainly, if \(\mathcal {P}^\star (b)\ne \emptyset \) then (\(\mathbf{A1} \)) is equivalent to \(\mathcal {P}^\star (b)\) being bounded. The latter property is independent of b and equivalent to the recession cone \(\left\{ x\in \mathbb {R}^d\mid \, Ax=0,\,x\ge 0,\, c^tx=0\right\} \) containing only the origin. A sufficient condition for this is boundedness of \(\mathcal {P}(b)\), which can easily be checked in certain settings.
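The recession-cone criterion above can be tested algorithmically: the cone \(\{x\mid Ax=0,\,x\ge 0,\,c^tx=0\}\) contains a nonzero element if and only if an auxiliary bounded linear program has a strictly positive value. The following is a minimal sketch under the assumption that SciPy's HiGHS-based `linprog` is available; the function name and the input matrices are illustrative placeholders, not part of the paper.

```python
import numpy as np
from scipy.optimize import linprog

def optimality_set_bounded(A, c, tol=1e-9):
    """Check whether the recession cone {x : Ax = 0, x >= 0, c^t x = 0}
    contains only the origin, which (given a nonempty optimality set)
    is equivalent to boundedness of the optimality set.

    Since the set is a cone, it contains a nonzero ray iff the
    auxiliary bounded LP
        max 1^t x  s.t.  Ax = 0, c^t x = 0, 0 <= x <= 1
    has a strictly positive optimal value."""
    A = np.asarray(A, dtype=float)
    c = np.asarray(c, dtype=float)
    d = A.shape[1]
    A_eq = np.vstack([A, c.reshape(1, -1)])  # stack c^t x = 0 below Ax = 0
    b_eq = np.zeros(A.shape[0] + 1)
    res = linprog(-np.ones(d), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0.0, 1.0)] * d, method="highs")
    return res.status == 0 and -res.fun <= tol
```

For instance, \(A=(1,\,-1)\) with \(c=0\) admits the nonzero ray \(x_1=x_2\ge 0\) (unbounded optimality set), whereas \(A=(1,\,1)\) with \(c=(1,1)\) forces \(x=0\).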
Lemma 5.3
(Sufficiency for \(\mathbf {\mathcal {P}(b)}\) bounded) Suppose that A has nonnegative entries and no column of A equals \(0\in \mathbb {R}^m\). Then \(\mathcal {P}(b)\) is bounded (possibly empty) for all \(b\in \mathbb {R}^m\).
It is noteworthy that if the dual feasible set \(\mathcal {N}\) is nonempty and bounded, then \(\mathcal {P}^\star (b)\ne \emptyset \) for all \(b\in \mathbb {R}^m\), but \(\mathcal {P}(b)\) is necessarily unbounded (Clark, 1961). Thus, \(\mathcal N\) is unbounded under the conditions of Lemma 5.3. We emphasize that assumption (\(\mathbf{A2} \)) is neither easy to verify nor expected to hold for most structured linear programs. Indeed, under (\(\mathbf{A1} \)) assumption (\(\mathbf{A2} \)) is equivalent to all dual basic solutions being nondegenerate (Lemma 2.5). However, degeneracy in linear programs is often the rule rather than the exception (Bertsimas & Tsitsiklis, 1997). Notably, if (\(\mathbf{A1} \)) and (\(\mathbf{A2} \)) are satisfied, the set \(\mathcal {P}^\star (b)\) is a singleton.
The assumption (\(\mathbf{B1} \)) has to be checked for each particular case and can usually be verified by an application of the central limit theorem (for a particular example see Sect. 6). Assumption (\(\mathbf{B2} \)) is obviously necessary for the limiting distribution to exist. If the dual feasible set \(\mathcal {N}\) is nonempty and bounded and (\(\mathbf{B1} \)) holds then (\(\mathbf{B2} \)) is always satisfied. A more refined statement is the following.
Lemma 5.4
(Sufficiency for (\(\mathbf{B2} \))) Suppose that the set \(\mathcal {P}(b_0)\) is nonempty. Then \(\mathcal {P}(b)\) is nonempty for all b sufficiently close to \(b_0\) if one of the following holds:

(i)
the set \(\mathcal {P}(b_0)\) contains a nondegenerate feasible basic solution.

(ii)
Slater’s constraint qualification^{Footnote 8} holds.
In particular, if the dual feasible set \(\mathcal {N}\) is nonempty and (\(\mathbf{B1} \)) holds then both conditions (i) and (ii) are sufficient for (\(\mathbf{B2} \)).
Joint convergence. Our goal here is to state useful conditions such that the random vector \(\left( \alpha _n^\mathcal {K}, G_n\right) \) jointly converges^{Footnote 9} in distribution to some limit random variable \(\left( \alpha ^\mathcal {K},G\right) \) on the space \( \Delta _{\vert \mathcal {K}\vert }\times \mathbb {R}^m\). By assumption (\(\mathbf{B1} \)), \(G_n\rightarrow G\) in distribution, and a necessary condition for the joint distributional convergence of \((\alpha _n^\mathcal {K},G_n)\) is that \(\alpha _n^\mathcal {K}\) has a distributional limit \(\alpha ^\mathcal {K}\). There is no reason to expect \(\alpha _n^\mathcal {K}\) and \(G_n\) to be independent, as discussed at the end of this section. We give a weaker condition than independence that is formulated in terms of the conditional distribution of \(\alpha _n^\mathcal {K}\) given \(G_n\) (or, equivalently, given \(b_n=b+G_n/r_n\)). These conditions are natural in the sense that if \(b_n=g\), then the choice of solution \(x^\star (g)\), as encapsulated by the \(\alpha _n^\mathcal {K}\)’s, is determined by the specific linear program solver in use.
Treating conditional distributions rigorously requires some care and machinery. Let \(\mathcal Z=\mathcal Z^\mathcal {K}=\Delta _{\mathcal {K}}\times \mathbb {R}^m\) and for \(\varphi :\mathcal Z\rightarrow \mathbb {R}\) denote
We say that \(\varphi \) is bounded Lipschitz if it belongs to \(\mathrm {BL}(\mathcal Z)=\{\varphi :\mathcal Z\rightarrow \mathbb {R}\,\vert \,\Vert \varphi \Vert _{\mathrm {BL}}\le 1\}\). The bounded Lipschitz metric
is well-known to metrize convergence in distribution of (probability) measures on \(\mathcal Z\) (Dudley, 1966, Theorems 6 and 8). According to the disintegration theorem (see Kallenberg, 1997, Theorem 5.4; Dudley, 2002, Section 10.2; or Chang & Pollard, 1997 for details), we may write the joint distribution of \(\left( \alpha _n^\mathcal {K},b_n\right) \) as an integral of conditional distributions \(\mu _{n,g}^\mathcal {K}\) that represent the distribution of \(\alpha _n^\mathcal {K}\) given that \(b_n=g\). More precisely, \(g\mapsto \mu _{n,g}^\mathcal {K}\) is measurable from \(\mathbb {R}^m\) to the metric space of probability measures on \(\Delta _{\mathcal {K}}\) with the bounded Lipschitz metric, so that for any \(\varphi \in \mathrm {BL}(\mathcal Z)\) it holds that
where \(\psi _n:\mathbb {R}^m\rightarrow \mathbb {R}\) is a measurable function. The joint distribution of \((\alpha _n^\mathcal {K},G_n)\) is determined by the collection of expectations
Our sufficient condition for joint convergence is given by the following lemma. It is noteworthy that the spaces \(\mathbb {R}^m\) and \(\Delta _{\mathcal {K}}\) can be replaced with arbitrary Polish spaces, and even more general spaces, as long as the disintegration theorem is valid.
Lemma 5.5
Let \(\{\mu ^\mathcal {K}_{g}\}_{g\in \mathbb {R}^m}\) be a collection of probability measures on \(\Delta _{\mathcal {K}}\) such that the map \(g\mapsto \mu ^\mathcal {K}_{g}\) is continuous at Galmost any g, and suppose that \(\mu ^\mathcal {K}_{n,g}\rightarrow \mu ^\mathcal {K}_{g}\) uniformly with respect to the bounded Lipschitz metric \(\mathrm {BL}\). Then \((\alpha _n^\mathcal {K},G_n)\) converges in distribution to a random vector \((\alpha ^\mathcal {K},G)\) satisfying
for any \(\varphi \in \mathrm {BL}(\mathcal Z)\) (this determines the distribution of the random vector \((\alpha ^\mathcal {K},G)\) completely). Moreover, if \(\mathcal L\) denotes the distribution of a random vector, then the rate of convergence can be quantified as
where \(L:=\sup _{g_1\ne g_2}\mathrm {BL}(\mu ^\mathcal {K}_{g_1},\mu ^\mathcal {K}_{g_2})/\Vert g_1-g_2\Vert \in [0,\infty ]\). The supremum with respect to g can be replaced by an essential supremum.
The conditions of Lemma 5.5 (and hence the joint convergence in Theorem 3.1) will be satisfied in many practical situations. For example, given \(b_n\) and an initial basis for the simplex method, its output is determined by the pivoting rule (for a general overview see Terlaky & Zhang, 1993 and references therein). Deterministic pivoting rules lead to degenerate conditional distributions of \(\alpha _n^\mathcal {K}\) given \(b_n=g\), whereas random pivoting rules may lead to nondegenerate conditional distributions. In both cases these conditional distributions do not depend on n at all, but only on the input vector g. In particular, the uniform convergence in Lemma 5.5 is trivially fulfilled (the supremum is equal to zero). It is reasonable to assume that these conditional distributions depend continuously on g except for some boundary values that are contained in a lower-dimensional space (which will have measure zero under the absolutely continuous random vector G).
6 Optimal transport
Optimal transport (OT) dates back to the French mathematician and engineer Monge (1781). Roughly speaking, it seeks to transport objects from one collection of locations to another in the most economical manner. Apart from the work of Appell (1887), much of the progress of OT began in the mid-twentieth century, initially due to its practical relevance in economics. Indeed, much of the theory of linear programming, including the simplex algorithm, has been motivated by findings for OT, with early contributions by Hitchcock (1941), Kantorovich (1942), Dantzig (1948) and Koopmans (1951). Since then a surprisingly rich theory has emerged, with important contributions by Kantorovich & Rubinstein (1958), Zolotarev (1976), Sudakov (1979), Kellerer (1984), Rachev (1985), Brenier (1987), Smith & Knott (1987), McCann (1997), Jordan et al. (1998), Ambrosio et al. (2008) and Lott & Villani (2009), among many others. We also refer to the excellent monographs by Rachev & Rüschendorf (1998), Villani (2008) and Santambrogio (2015) for further details. In fact, OT has recently gained renewed interest, especially as computational progress paves the way to explore novel fields of applications such as imaging (Rubner et al., 2000; Solomon et al., 2015), machine learning (Frogner et al., 2015; Arjovsky et al., 2017; Peyré & Cuturi, 2019), and statistical data analysis (Chernozhukov et al., 2017; Sommerfeld & Munk, 2018; del Barrio et al., 2019; Panaretos & Zemel, 2019).
On a finite space \(\mathcal {X}=\{x_1,\ldots ,x_N\}\) equipped with some underlying cost \(c:\mathcal {X}\times \mathcal {X}\rightarrow \mathbb {R}\), OT between two probability measures \(r,s\in \Delta _N:=\{ r\in \mathbb {R}^N\, \vert \, \mathbf {1}_N^Tr=1,\, r_i\ge 0 \}\) is given by the linear program
where \(c_{ij}=c(x_i,x_j)\) and the set \(\Pi (r,s)\) denotes all nonnegative matrices with row and column sum equal to r and s, respectively. OT comprises the challenge to find an optimal solution termed OT coupling \(\pi ^\star (r,s)\) between r and s such that the integrated cost is minimal among all possible couplings. We denote by \(\Pi ^\star (r,s)\) the set of all OT couplings. The dual problem is
In our context, reflecting many practical situations (Tameling et al., 2021), the measures r and s are unknown and need to be estimated from data. To this end, we assume to have access to independent and identically distributed (i.i.d.) \(\mathcal {X}\)-valued random variables \(X_1,\ldots ,X_n\sim r\), where a reasonable proxy for the measure r is its empirical version \(\hat{r}_n:= \frac{1}{n}\sum _{i=1}^n \delta _{X_i}\). As an illustration of our general theory, we focus on limit theorems that asymptotically \((n\rightarrow \infty )\) characterize the fluctuations of an estimated coupling \(\pi ^\star (\hat{r}_n,s)\) around \(\pi ^\star (r,s)\). For the sake of readability, we focus primarily on the one-sample case, where only r is replaced by \(\hat{r}_n\), but include a short account of the case in which both measures are estimated.
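To make this estimation setting concrete, the finite-dimensional program (OT) can be handed to any LP solver. The following sketch computes a coupling \(\pi^\star(\hat r_n,s)\) for an empirical marginal, assuming SciPy's HiGHS-based `linprog` is available; the support points, measures, and cost are illustrative placeholders, not taken from the paper.

```python
import numpy as np
from scipy.optimize import linprog

def ot_coupling(r, s, C):
    """Solve min <C, pi> over nonnegative matrices pi whose row sums
    equal r and column sums equal s. Returns an optimal (N, N) matrix."""
    N = len(r)
    A_eq = np.zeros((2 * N, N * N))
    for i in range(N):
        A_eq[i, i * N:(i + 1) * N] = 1.0   # row i of pi sums to r_i
    for j in range(N):
        A_eq[N + j, j::N] = 1.0            # column j of pi sums to s_j
    b_eq = np.concatenate([r, s])          # one redundant row: rank 2N - 1
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * (N * N), method="highs")
    return res.x.reshape(N, N)

# One-sample setting: r is replaced by the empirical measure r_hat.
rng = np.random.default_rng(0)
x = np.array([0.0, 1.0, 3.0])
r = np.array([0.2, 0.3, 0.5])
s = np.array([0.4, 0.4, 0.2])
C = np.abs(x[:, None] - x[None, :]) ** 2   # cost |x_i - x_j|^2
r_hat = np.bincount(rng.choice(3, size=1000, p=r), minlength=3) / 1000
pi_hat = ot_coupling(r_hat, s, C)
```

Note that the redundant marginal constraint (the rows of the constraint matrix sum to the same value twice) is handled without difficulty by the solver.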
A few words regarding the assumptions from Sect. 2 in the OT context are in order. Assumption (\(\mathbf{A1} \)) always holds, since \(\Pi (r,s)\subseteq [0,1]^{N^2}\) is bounded and contains the independence coupling \(rs^T\). Assumption (\(\mathbf{A2} \)), which according to Lemma 2.5 is equivalent to all dual feasible basic solutions for (\(\text {DOT}\)) being nondegenerate, however, does not always hold. Sufficient conditions for (\(\mathbf{A2} \)) to hold in OT are given in Subsect. 6.1. Concerning the probabilistic assumptions, we note that (\(\mathbf{B2} \)) always holds, as for any (possibly random) pair of measures \((\hat{r}_n,s)\) the set \(\Pi (\hat{r}_n,s)\) is nonempty and bounded. Assumption (\(\mathbf{B1} \)) is easily verified by an application of the multivariate central limit theorem. Indeed, the multinomial process of empirical frequencies \(\sqrt{n}(r-\hat{r}_n)\) converges weakly to a centered Gaussian random vector \(G(r)\sim \mathcal {N}(0,\Sigma (r))\) with covariance
Notably, \(\Sigma (r)\) is singular and G(r) fails to be absolutely continuous with respect to Lebesgue measure. A slight modification allows to circumvent this issue. The constraint matrix in OT,
has rank \(2N-1\). Letting \(r_\dagger =r_{[N-1]}\in \mathbb {R}^{N-1}\) denote the first \(N-1\) coordinates of \(r\in \mathbb {R}^N\) and \(A_\dagger \in \mathbb {R}^{(2N-1)\times N^2}\) denote A with its Nth row removed, it holds that
The limiting random variable for \(\sqrt{n}(r_\dagger -\hat{r}_{\dagger ,n})\), as n tends to infinity, is \(G(r_\dagger )\), which follows an absolutely continuous distribution if and only if \(r_\dagger >0\) and \(\Vert r_\dagger \Vert _1<1\). Equivalently, r is in the relative interior of \(\Delta _N\) (denoted \(\mathrm {ri}(\Delta _N)\)), i.e., \(0<r\in \Delta _N\). Under this condition (\(\mathbf{A1} \)), (\(\mathbf{B1} \)) and (\(\mathbf{B2} \)) hold, and from the main result in Theorem 3.1 we immediately deduce the limiting distribution of optimal OT couplings.
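Both covariance claims are easy to check numerically. The following sketch uses the standard multinomial covariance \(\Sigma (r)=\mathrm {diag}(r)-rr^T\) (a textbook fact, assumed here to match (10)): the full matrix is always singular, while the restriction to the first \(N-1\) coordinates is nonsingular exactly when r lies in the relative interior of the simplex.

```python
import numpy as np

def multinomial_cov(r):
    """Covariance of the Gaussian limit of the multinomial process:
    Sigma(r) = diag(r) - r r^T (standard multivariate CLT)."""
    r = np.asarray(r, dtype=float)
    return np.diag(r) - np.outer(r, r)

r = np.array([0.2, 0.3, 0.5])
Sigma = multinomial_cov(r)
# Sigma is singular: Sigma @ 1 = r - r = 0, since frequencies sum to one.
print(np.linalg.matrix_rank(Sigma))        # rank N - 1 = 2
# Dropping the last coordinate gives a nonsingular covariance whenever
# r > 0, i.e., r lies in the relative interior of the simplex.
Sigma_dagger = Sigma[:-1, :-1]
print(np.linalg.det(Sigma_dagger) > 0)
```

If some coordinate of r vanishes (or \(\Vert r_\dagger \Vert _1=1\)), the restricted covariance degenerates as well, matching the equivalence stated above.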
Theorem 6.1
(Distributional limit law for OT couplings) Consider the optimal transport problem (\(\text {OT}\)) between two probability measures \(r,s\in \mathrm {ri}(\Delta _N)\) and let \(\hat{r}_n=\frac{1}{n}\sum _{i=1}^n \delta _{X_i}\) be the empirical measure derived from i.i.d. random variables \(X_1,\ldots ,X_n\sim r\). As the sample size n tends to infinity, there exists a sequence \(\pi _n^\star (r,s)\in \Pi ^\star (r,s)\) such that
with \(G(r_\dagger )=(G^1(r_\dagger ),0)\). If further assumption (\(\mathbf{A2} \)) holds, then \(\Pi ^\star (r,s)=\left\{ \pi ^\star (r,s)\right\} \) and
Remark 6.2
The two sample case presents an additional challenge. By the multivariate central limit theorem we have for \(\min (m,n)\rightarrow \infty \) and \(\frac{m}{n+m}\rightarrow \lambda \in (0,1)\) that
with \(G^1(r_\dagger )\) and \(G^2(s)\) independent and the compound limit law following a centered Gaussian distribution with block diagonal covariance matrix, where the two blocks are given by (10), respectively. However, the limit law fails to be absolutely continuous. Nevertheless, the distributional limit theorem for OT couplings remains valid in this case and there exists a sequence \(\pi _{n,m}^\star (r,s)\in \Pi ^\star (r,s)\) such that
We provide further details in Appendix 1.
We emphasize that once a limit law for the OT coupling is available, one can derive limit laws for sufficiently smooth functionals thereof. As examples let us mention the OT curve (Klatt et al., 2020) and OT geodesics (McCann, 1997). The details are omitted for brevity and instead, we provide an illustration of the distributional limit theorem (Theorem 6.1).
Example 6.3
We consider a ground space \(\mathcal {X}=\left\{ x_1<x_2<x_3\right\} \subset \mathbb {R}\) consisting of \(N=3\) points with cost \(c=\left( 0,\vert x_1-x_2\vert ,\vert x_1-x_3\vert , \vert x_2-x_1\vert ,0,\vert x_2-x_3\vert , \vert x_3-x_1\vert ,\vert x_3-x_2\vert ,0\right) \in \mathbb {R}^9\), for which OT then reads as
with constraint matrix \(A_\dagger \in \mathbb {R}^{5\times 9}\). A basis I is a subset of cardinality five out of the column index set \(\{1,\ldots ,9\}\) such that \((A_\dagger )_I\) is of full rank. For OT it is convenient to think of a feasible solution in terms of a transport matrix \(\pi \in \mathbb {R}^{3\times 3}\) with \(\pi _{ij}\) encoding mass transport from source i to destination j. For instance, the basis \(I=\{1,2,3,5,9\}\) corresponds to the transport scheme
where each possible nonzero entry is marked by a star and the specific values depend on the measures r and s. In particular, to basis I corresponds the (possibly infeasible) basic solution \(\pi (I,(r_\dagger ,s))=(A_\dagger )_I^{-1}(r_\dagger ,s)\) that we illustrate in terms of its transport scheme by
where \(r=(r_\dagger ,1-\Vert r_\dagger \Vert _1)\in \mathbb {R}^3\) and the second equality employs that r and s each sum to one. Obviously, \(\pi (I,(r_\dagger ,s))\) is feasible if and only if \(s_2\ge r_2\) and \(s_3\ge r_3\). Suppose that the measures are equal, \(r=s\). Then the transport problem attains a unique solution supported on the diagonal, i.e., all mass remains at its current location. A straightforward computation yields \(K=8\) primal and dual optimal bases
For example, the transport scheme \(TS(I_1)\) corresponds to basis \(I_1=\{1,2,5,8,9\}\) and induces an invertible matrix \(A_{I_1}\). Omitting the superscript \(m_0=5\) for clarity, the respective closed convex cones \(H_k\) for \(1\le k\le K\) as defined in (7) are
Each of these cones is an intersection of two proper halfspaces. Some of them exhibit nontrivial intersections, and in particular (\(\mathbf{A2} \)) fails to hold. Such cases arise for the pairs \(\{I_3,I_7\}\), \(\{I_6,I_7\}\), \(\{I_4,I_8\}\) and \(\{I_5,I_8\}\). The intersections of the corresponding cones are given by
The weak convergence in (14) together with OT for \(p=1\) and \(r=s\) then leads to the distributional limit law for OT couplings
Although \(K=8\), there are only four distinct dual solutions: \(\lambda (I_1)\), \(\lambda (I_2)\), \(\lambda (I_7)\) and \(\lambda (I_8)\).
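The primal side of Example 6.3 is easy to reproduce numerically: for \(r=s\) the objective value is zero and the diagonal coupling is the unique optimizer, so any LP solver must return it. A self-contained check under the assumption that SciPy is available (the ground points and the common marginal are illustrative choices):

```python
import numpy as np
from scipy.optimize import linprog

# Three ordered points on the line with cost c(x, y) = |x - y|.
x = np.array([0.0, 1.0, 2.5])
C = np.abs(x[:, None] - x[None, :])
r = np.array([0.3, 0.3, 0.4])                      # r = s

# Marginal constraints for the flattened 3 x 3 transport matrix pi.
A_eq = np.zeros((6, 9))
for i in range(3):
    A_eq[i, 3 * i:3 * i + 3] = 1.0                 # row sums equal r
    A_eq[3 + i, i::3] = 1.0                        # column sums equal s
res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([r, r]),
              bounds=[(0, None)] * 9, method="highs")
pi = res.x.reshape(3, 3)
# The cost vanishes only on the diagonal, so the unique optimal
# coupling keeps all mass in place.
print(np.allclose(pi, np.diag(r), atol=1e-8))      # True
```

The degeneracy discussed in the example lives entirely on the dual side: several dual optimal bases coexist even though the primal solution is unique.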
6.1 Degeneracy and uniqueness in optimal transport
This subsection provides sufficient conditions for assumption (\(\mathbf{A2} \)) to hold. In view of Lemma 2.5, and since for OT assumption (\(\mathbf{A1} \)) is always satisfied, assumption (\(\mathbf{A2} \)) is equivalent to nondegeneracy of all dual optimal basic solutions. Notably, this implies uniqueness of the OT coupling. Conversely, if for a given cost the OT coupling is unique for all \(r,s\in \Delta _N\), then (\(\mathbf{A2} \)) holds. We begin with a sufficient criterion for (\(\mathbf{A2} \)) depending only on the cost.
Lemma 6.4
Suppose that the following holds for the cost function c. For any \(n\ge 2\) and any family of indices \(\left\{ (i_k,j_k)\right\} _{1\le k \le n}\) with all \(i_k\) pairwise different and all \(j_k\) pairwise different it holds that
Then all dual basic solutions are nondegenerate. In particular, (\(\mathbf{A2} \)) holds and the optimal OT coupling is unique for any pair of measures \(r,s\in \Delta _N\).
We are unaware of an explicit reference for condition (15), which is reminiscent of the well-known cyclic monotonicity property (Rüschendorf, 1996). Further, (15) can be thought of as dual to the condition of Klee & Witzgall (1968) that for any proper subsets \(A,B\subset [N]\), not both empty,
that guarantees every primal basic solution to be nondegenerate. Notably, (15) is satisfied for OT on the real line with cost \(c(x,y)=\vert x-y\vert ^p\) and measures with at least \(N=3\) support points if and only if \(p>0\) and \(p\ne 1\). If the underlying space involves too many symmetries, such as a regular grid with cost defined by the underlying grid structure, it typically fails to hold. An alternative condition that ensures (\(\mathbf{A2} \)) is the strict Monge condition, namely that the cost c satisfies
possibly after relabelling the indices (Dubuc et al., 1999). This translates to easily interpretable statements on the real line.
Lemma 6.5
Let \(\mathcal {X}:=\{x_1<\ldots < x_N\}\) be a set of N distinct ordered points on the real line. Suppose that the cost takes the form \(c(x,y)=f(\vert xy\vert )\) with \(f:\mathbb {R}_+\rightarrow \mathbb {R}_+\) such that \(f(0)=0\) and either
Then assumption (\(\mathbf{A2} \)) holds.
The first statement follows by employing the Monge condition (see also McCann, 1999, Proposition A2 for an alternative approach). The second case is more delicate, and indeed, the description of the unique optimal solution is more complicated (see Appendix 1). In fact, in both cases the unique transport coupling can be computed by the Northwest corner algorithm (Hoffman, 1963). Typical costs covered by Lemma 6.5 are \(c(x,y)=\vert x-y \vert ^p\) for any \(p> 0\) with \(p\ne 1\). Indeed, for \(p=0\) or \(p=1\), uniqueness often fails (see Remark 6.7).
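For completeness, here is a minimal sketch of the Northwest corner rule just mentioned (an illustrative implementation, not code from the paper): for ordered points on the line it produces the monotone coupling by greedily filling the transport matrix from the top-left corner.

```python
import numpy as np

def northwest_corner(r, s):
    """Northwest corner rule (Hoffman, 1963): starting at entry (0, 0),
    assign the largest feasible mass, then advance to the next row or
    column whenever a marginal is exhausted."""
    r = np.asarray(r, dtype=float).copy()
    s = np.asarray(s, dtype=float).copy()
    N, M = len(r), len(s)
    pi = np.zeros((N, M))
    i = j = 0
    while i < N and j < M:
        m = min(r[i], s[j])        # move as much mass as possible
        pi[i, j] = m
        r[i] -= m
        s[j] -= m
        if r[i] <= 1e-15:          # source i exhausted -> next row
            i += 1
        if s[j] <= 1e-15:          # target j exhausted -> next column
            j += 1
    return pi
```

For example, `northwest_corner([0.2, 0.3, 0.5], [0.4, 0.4, 0.2])` yields the staircase-shaped monotone coupling with both prescribed marginals.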
In a general linear program (\(\text {P}_{b}\)), the set of costs c for which (\(\mathbf{A2} \)) fails to hold has Lebesgue measure zero (e.g., Bertsimas & Tsitsiklis, 1997). Here we provide a result in the same flavour for OT.
Proposition 6.6
Let \(\mu \) and \(\nu \) be absolutely continuous on \(\mathbb {R}^D\), with \(D\ge 2\), and let \(c(x,y)=\Vert x-y\Vert ^p_q\), where \(p\in \mathbb {R}\setminus \{0\}\) and \(q\in (0,\infty ]\) are such that if \(p=1\) then \(q\notin \{1,\infty \}\). For probability vectors \(r,s\in \Delta _N\) define the probability measures \(r(\mathbf {X})=\sum _{k=1}^N r_k\delta _{X_k}\) and \(s(\mathbf {Y})=\sum _{k=1}^N s_k\delta _{Y_k}\) with two independent collections of i.i.d. \(\mathbb {R}^D\)-valued random variables \(X_1,\ldots ,X_N\sim \mu \) and \(Y_1,\ldots ,Y_N\sim \nu \). Then (15) holds almost surely for the optimal transport (\(\text {OT}\)). In particular, with probability one for any \(r,s\in \Delta _N\) and pair of marginals \(r(\mathbf {X})\) and \(s(\mathbf {Y})\), the corresponding optimal transport coupling is unique.
See Wang et al. (2013) for a related result for \(p=q=2\) and fixed marginals r, s. Note that Proposition 6.6 includes the Coulomb case \((p=-1)\), which has applications in physics (Cotar et al., 2013). As the proof details, the result is valid for piecewise analytic (nonconstant) cost functions.
Remark 6.7
(Nonuniqueness) Let \(\mu \) be uniform on \([0,1]^D\) and \(\nu \) be uniform on \([1,2]^D+(2,0,0,\dots ,0)\). Then, with probability one, all transport couplings bear the same cost if \(p=0\) or if \(p=1\) and \(q\in \{1,\infty \}\). Thus, for \(N\ge 2\) uniqueness fails.
Notes
A prototypical example is the standard central limit theorem, whereby under suitable assumptions \(r_n=\sqrt{n}\) and G is a Gaussian random vector on \(\mathbb {R}^m\).
One may assume at first reading that \(m_0=m\); the additional generality will prove useful for the one-sample case naturally arising in optimal transport in Sect. 6.
There is no need to assume joint distributional convergence of \((\alpha _n^\mathcal {K},G_n)\) as in Theorem 3.1.
Letting \(A_k=H_k^{m_0}\) and \(B_k=\mathbb {R}^{m_0}\setminus A_k\), it holds \(\partial B_k=\partial A_k\) and thus \(\partial (\bigcap _{k\in \mathcal {K}} A_k\cap \bigcap _{k\in [K]\setminus \mathcal {K}}B_k)\subseteq \bigcup _{k=1}^K \partial A_k\).
To include the empty set define \(d_H(\emptyset ,\emptyset )=0\) and \(d_H(\emptyset ,A)=1\) for all nonempty A.
The feasible set \(\mathcal {P}(b_0)\) contains a positive element \(x\in (0,\infty )^d\).
Recall that the \(\alpha _n^\mathcal {K}\) represent random weights (summing up to one) for each optimal basis \(I_k\), \(k\in \mathcal {K}\) for the case that \(A_n^\mathcal {K}\) occurs, i.e., that several bases yield primal optimal solutions and hence any convex combination is also optimal.
References
Ambrosio, L., Gigli, N., & Savaré, G. (2008). Gradient Flows in Metric Spaces and in the Space of Probability Measures. Springer Science & Business Media.
Appell, P. (1887). Mémoire sur les déblais et les remblais des systemes continus ou discontinus. Mémoires présentes par divers Savants à l’Académie des Sciences de l’Institut de France, 29, 1–208.
Arjovsky, M., Chintala, S., & Bottou, L. (2017). Wasserstein generative adversarial networks. Proceedings of Machine Learning Research, 70, 214–223.
Beale, E. M. (1955). On minimizing a convex function subject to linear inequalities. Journal of the Royal Statistical Society: Series B (Methodological), 17(2), 173–184.
Bereanu, B. (1963). Decision regions and minimum risk solutions in linear programming. In Colloquium on Applications of Mathematics to Economics, Budapest (pp. 37–42).
Bereanu, B. (1976). The continuity of the optimum in parametric programming and applications to stochastic programming. Journal of Optimization Theory and Applications, 18(3), 319–333.
Bertsimas, D., & Tsitsiklis, J. N. (1997). Introduction to Linear Optimization (Vol. 6). Belmont, MA: Athena Scientific.
Billingsley, P. (1999). Convergence of Probability Measures (2nd ed.). New York: Wiley.
Böhm, V. (1975). On the continuity of the optimal policy set for linear programs. SIAM Journal on Applied Mathematics, 28(2), 303–306.
Bonnans, J. F., & Shapiro, A. (2000). Perturbation Analysis of Optimization Problems. Springer Science & Business Media.
Brenier, Y. (1987). Décomposition polaire et réarrangement monotone des champs de vecteurs. CR Acad Sci Paris Sér I Math, 305, 805–808.
Brualdi, R. A. (2006). Combinatorial Matrix Classes (Vol. 13). Cambridge University Press.
Chang, J. T., & Pollard, D. (1997). Conditioning as disintegration. Statistica Neerlandica, 51(3), 287–317.
Chernozhukov, V., Galichon, A., Hallin, M., & Henry, M. (2017). MongeKantorovich depth, quantiles, ranks and signs. The Annals of Statistics, 45(1), 223–256.
Clark, F. E. (1961). Remark on the constraint sets in linear programming. The American Mathematical Monthly, 68(4), 351–352.
Cotar, C., Friesecke, G., & Klüppelberg, C. (2013). Density functional theory and optimal transportation with Coulomb cost. Communications on Pure and Applied Mathematics, 66(4), 548–599.
Cottle, R., Johnson, E., & Wets, R. J. B. (2007). George B. Dantzig (1914–2005). Notices of the AMS, 54(3), 344–362.
Dang, N. V. (2015). Complex powers of analytic functions and meromorphic renormalization in QFT. Preprint arXiv:1503.00995.
Dantzig, G. B. (1963). Linear Programming and Extensions. Princeton University Press.
Dantzig, G. B. (1948). Programming in a linear structure. Bulletin of the American Mathematical Society, 54(11), 1074–1074.
Dantzig, G. B. (1955). Linear programming under uncertainty. Management Science, 1, 197–206.
De Loera, J. A., Rambau, J., & Santos, F. (2010). Triangulations: Structures for Algorithms and Applications. Springer.
del Barrio, E., CuestaAlbertos, J. A., Matrán, C., & MayoÍscar, A. (2019). Robust clustering tools based on optimal transportation. Statistics and Computing, 29, 139–160.
Dubuc, S., Kagabo, I., & Marcotte, P. (1999). A note on the uniqueness of solutions to the transportation problem. INFOR: Information Systems and Operational Research, 37(2), 141–148.
Dudley, R. M. (2002). Real Analysis and Probability (Vol. 74). Cambridge University Press.
Dudley, R. (1966). Convergence of Baire measures. Studia Mathematica, 3(27), 251–268.
Dupačová, J. (1987). Stochastic programming with incomplete information: a survey of results on postoptimization and sensitivity analysis. Optimization, 18(4), 507–532.
Dupačová, J., & Wets, R. J. B. (1988). Asymptotic behavior of statistical estimators and of optimal solutions of stochastic optimization problems. The Annals of Statistics, 16(4), 1517–1549.
Ewbank, J. B., Foote, B. L., & Kumin, H. L. (1974). A method for the solution of the distribution problem of stochastic linear programming. SIAM Journal on Applied Mathematics, 26(2), 225–238.
Ferguson, A. R., & Dantzig, G. B. (1956). The allocation of aircraft to routes: An example of linear programming under uncertain demand. Management Science, 3(1), 45–73.
Frogner, C., Zhang, C., Mobahi, H., Araya, M., & Poggio, T. A. (2015). Learning with a Wasserstein loss. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 28 (pp. 2053–2061). Curran Associates, Inc.
Gal, T., & Greenberg, H. J. (2012). Advances in Sensitivity Analysis and Parametric Programming (Vol. 6). Springer Science & Business Media.
Galichon, A. (2018). Optimal Transport Methods in Economics. Princeton University Press.
Gangbo, W., & McCann, R. J. (1996). The geometry of optimal transportation. Acta Mathematica, 177(2), 113–161.
Goldman, A. J., & Tucker, A. W. (1956). Theory of linear programming. Linear Inequalities and Related Systems, 38, 53–97.
Greenberg, H. J. (1986). An analysis of degeneracy. Naval Research Logistics Quarterly, 33(4), 635–655.
Guddat, J., Hollatz, H., & Bank, B. (1974). Theorie der linearen parametrischen Optimierung. Berlin: AkademicVerlag.
Hadigheh, A. G., & Terlaky, T. (2006). Sensitivity analysis in linear optimization: Invariant support set intervals. European Journal of Operational Research, 169(3), 1158–1175.
Hitchcock, F. L. (1941). The distribution of a product from several sources to numerous localities. Journal of Mathematics and Physics, 20(1–4), 224–230.
Hoffman, A. J. (1963). On simple linear programming problems. Proceedings of Symposia in Pure Mathematics, 7, 317–327.
Jordan, R., Kinderlehrer, D., & Otto, F. (1998). The variational formulation of the FokkerPlanck equation. SIAM Journal on Mathematical Analysis, 29(1), 1–17.
Kallenberg, O. (1997). Foundations of Modern Probability (2nd ed.). SpringerVerlag.
Kantorovich, L. V. (1939). Mathematical methods in the organization and planning of production. Publication House of the Leningrad State University, 6, 336–422.
Kantorovich, L. V. (1942). On the translocation of masses. Doklady Akademii Nauk USSR, 37, 199–201.
Kantorovich, L. V., & Rubinstein, G. S. (1958). On a space of completely additive functions. Vestnik Leningrad Univ, 13(7), 52–59.
Kellerer, H. G. (1984). Duality theorems for marginal problems. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 67(4), 399–432.
King, A. J. (1989). Generalized delta theorems for multivalued mappings and measurable selections. Mathematics of Operations Research, 14(4), 720–736.
King, A. J., & Rockafellar, R. T. (1993). Asymptotic theory for solutions in statistical estimation and stochastic programming. Mathematics of Operations Research, 18(1), 148–162.
Klatt, M., Tameling, C., & Munk, A. (2020). Empirical regularized optimal transport: Statistical theory and applications. SIAM Journal on Mathematics of Data Science, 2(2), 419–443.
Klee, V., & Witzgall, C. (1968). Facets and vertices of transportation polytopes. Mathematics of the Decision Sciences, 1, 257–282.
Koopmans, T. C. (1949). Optimum utilization of the transportation system. Econometrica: Journal of the Econometric Society, 17, 136–146.
Koopmans, T. C. (1951). Efficient allocation of resources. Econometrica: Journal of the Econometric Society, 19(4), 455–465.
Lott, J., & Villani, C. (2009). Ricci curvature for metric-measure spaces via optimal transport. Annals of Mathematics, 169(3), 903–991.
Luenberger, D. G., & Ye, Y. (2008). Linear and Nonlinear Programming. New York: Springer.
McCann, R. J. (1997). A convexity principle for interacting gases. Advances in Mathematics, 128(1), 153–179.
McCann, R. J. (1999). Exact solutions to the transportation problem on the line. Proceedings of the Royal Society of London Series A: Mathematical, Physical and Engineering Sciences, 455(1984), 1341–1380.
Monge, G. (1781). Mémoire sur la théorie des déblais et des remblais. In Histoire de l’Académie Royale des Sciences de Paris (pp. 666–704).
Panaretos, V. M., & Zemel, Y. (2019). Statistical Aspects of Wasserstein Distances. Annual Review of Statistics and its Applications, 6, 405–431.
Peyré, G., & Cuturi, M. (2019). Computational optimal transport. Foundations and Trends in Machine Learning, 11(5–6), 355–607.
Prékopa, A. (1966). On the probability distribution of the optimum of a random linear program. SIAM Journal on Control, 4(1), 211–222.
Rachev, S. T., & Rüschendorf, L. (1998). Mass Transportation Problems: Volume I: Theory, Volume II: Applications. New York: Springer.
Rachev, S. T. (1985). The Monge-Kantorovich mass transference problem and its stochastic applications. Theory of Probability and its Applications, 29(4), 647–676.
Robinson, S. M. (1977). A characterization of stability in linear programming. Operations Research, 25(3), 435–447.
Rubner, Y., Tomasi, C., & Guibas, L. J. (2000). The earth mover’s distance as a metric for image retrieval. International Journal of Computer Vision, 40(2), 99–121.
Rüschendorf, L. (1996). On \(c\)-optimal random variables. Statistics and Probability Letters, 27(3), 267–270.
Santambrogio, F. (2015). Optimal Transport for Applied Mathematicians. Basel: Birkhäuser.
Shapiro, A. (2000). Statistical inference of stochastic optimization problems. In Probabilistic Constrained Optimization (pp. 282–307). Springer.
Shapiro, A. (1991). Asymptotic analysis of stochastic programs. Annals of Operations Research, 30(1), 169–186.
Shapiro, A. (1993). Asymptotic behavior of optimal solutions in stochastic programming. Mathematics of Operations Research, 18(4), 829–845.
Shapiro, A., Dentcheva, D., & Ruszczyński, A. (2021). Lectures on Stochastic Programming: Modeling and Theory. SIAM.
Smith, C. S., & Knott, M. (1987). Note on the optimal transportation of distributions. Journal of Optimization Theory and Applications, 52(2), 323–329.
Solomon, J., De Goes, F., Peyré, G., Cuturi, M., Butscher, A., Nguyen, A., et al. (2015). Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains. ACM Transactions on Graphics (TOG), 34(4), 1–11.
Sommerfeld, M., & Munk, A. (2018). Inference for empirical Wasserstein distances on finite spaces. Journal of the Royal Statistical Society: Series B (Methodological), 80(1), 219–238.
Sturmfels, B., & Thomas, R. R. (1997). Variation of cost functions in integer programming. Mathematical Programming, 77(2), 357–387.
Sudakov, V. N. (1979). Geometric problems in the theory of infinite-dimensional probability distributions (Vol. 141). American Mathematical Society.
Tameling, C., Sommerfeld, M., & Munk, A. (2019). Empirical optimal transport on countable metric spaces: Distributional limits and statistical applications. The Annals of Applied Probability, 29(5), 2744–2781.
Tameling, C., Stoldt, S., Stephan, T., Naas, J., Jakobs, S., & Munk, A. (2021). Colocalization for super-resolution microscopy via optimal transport. Nature Computational Science, 1(3), 199–211.
Terlaky, T., & Zhang, S. (1993). Pivot rules for linear programming: A survey on recent theoretical developments. Annals of Operations Research, 46(1), 203–233.
Tintner, G. (1960). A note on stochastic linear programming. Econometrica: Journal of the Econometric Society, 490–495.
Vershik, A. (2002). L.V. Kantorovich and linear programming. Leonid Vital’evich Kantorovich: A man and a scientist, 1, 130–152.
Villani, C. (2008). Optimal Transport: Old and New. Berlin: Springer.
Walkup, D. W., & Wets, R. J. B. (1969b). A Lipschitzian characterization of convex polyhedra. Proceedings of the American Mathematical Society, 167–173.
Walkup, D. W., & Wets, R. J. B. (1967). Continuity of some convex-cone-valued mappings. Proceedings of the American Mathematical Society, 18(2), 229–235.
Walkup, D. W., & Wets, R. J. B. (1969). Lifting projections of convex polyhedra. Pacific Journal of Mathematics, 28(2), 465–475.
Wang, W., Slepčev, D., Basu, S., Ozolek, J. A., & Rohde, G. K. (2013). A linear optimal transportation framework for quantifying and visualizing variations in sets of images. International Journal of Computer Vision, 101(2), 254–269.
Ward, J. E., & Wendell, R. E. (1990). Approaches to sensitivity analysis in linear programming. Annals of Operations Research, 27(1), 3–38.
Wets, R. J. B. (1980). The distribution problem and its relation to other problems in stochastic programming. Stochastic Programming (pp. 245–262). London: Academic Press.
Zolotarev, V. M. (1976). Metric distances in spaces of random variables and their distributions. Mathematics of the USSR-Sbornik, 30(3), 373.
Acknowledgements
M. Klatt and A. Munk gratefully acknowledge support from the DFG Research Training Group 2088 Discovering structure in complex data: Statistics meets Optimization and Inverse Problems and CRC 1456 Mathematics of Experiment. Y. Zemel was supported in part by Swiss National Science Foundation Grant 178220, and in part by a U.K. Engineering and Physical Sciences Research Council programme grant.
Appendices
A Omitted proofs
Proof
(Lemma 2.5) Dual nondegeneracy obviously implies (\(\mathbf{A2} \)), so we only show the converse (in the presence of (\(\mathbf{A2} \))). Suppose that \(\lambda (I_j)\) is degenerate for some \(1\le j \le K\). Then the index set \(L\) of active constraints in the dual, i.e., the set of indices \(l\) such that \([\lambda ^T(I_j)A]_l=c_l\), strictly contains \(I_j\). Let \(Pos_j \subseteq I_j\) be the set of positive entries of the optimal primal basic solution \(x(I_j,b)\). Then
Since the columns of \(A_{I_j}\) form a basis of \(\mathbb {R}^m\), each other column \(a_z\) writes
Suppose there exists some index \(z\in L\setminus I_j\) and \(s\in I_j\setminus Pos_j\) such that \(y_s^z\ne 0\). Then we can define a new basis \(\widetilde{I}:= (I_j\setminus \{s\})\cup \{z\}\) such that \(\lambda (\widetilde{I})=\lambda (I_j)\), and as \(Pos_j\subseteq \widetilde{I}\) we conclude that \(x(\widetilde{I},b)=x(I_j,b)\). This contradicts (\(\mathbf{A2} \)). Hence \(y_s^z=0\) for all \(s\in I_j\setminus Pos_j\) in (18).
Now suppose that \(y_i^z>0\) for some \(i\in Pos_j\), so that \(i_0\in {{\,\mathrm{arg\,min}\,}}_{i\vert y_i^z > 0} \frac{x_i}{y_i^z}\) is well defined and the minimum is strictly positive. Expressing \(a_{i_0}\) as a linear combination of \(A_{\{z\}\cup Pos_j\setminus \{i_0\}}\), we find that
for some proper choice of \(\widetilde{x}_i\). By definition of \(i_0\) we find that the \(\widetilde{x}_i\) are nonnegative, so that \(\widetilde{I}\) is a primal and dual optimal basis. Moreover, \(\lambda (\widetilde{I})=\lambda (I_j)\), which again contradicts (\(\mathbf{A2} \)). We deduce that \(y_i^z\le 0\) for all \(i\in Pos_j\) in the representation (18). Consider the vector
By definition \(w\ge 0\), \(Aw=0\), and \(c^Tw=0\), so that \(w\ne 0\) is a primal optimal ray, in contradiction with (\(\mathbf{A2} \)). In total, we see that if any basis \(I_j\) for \(1\le j \le K\) yields a degenerate dual basic solution, then we can modify the basis \(I_j\) to some \(I_i\) with \(i\ne j\) and \(1\le i \le K\) such that \(\lambda (I_i)=\lambda (I_j)\).
It is in principle possible that \(\lambda (I_l)\) is dual optimal for some \(K+1\le l \le N\) but \(x(I_l,b)\) is not primal optimal. Let us show that this cannot happen under assumption (\(\mathbf{A2} \)). Consider any optimal primal basic solution \(x(I_j,b)\) for \(1\le j \le K\) and denote by \(Pos_j\) its positivity set. Optimality of \(\lambda (I_l)\) implies that its active set \(L\) contains \(I_l\cup Pos_j\). As \(I_l\) is not a primal optimal basis, it holds that \(Pos_j\nsubseteq I_l\), so that \(\vert L\vert >m\) and \(\lambda (I_l)\) is degenerate. But then we can modify the basis \(I_l\) to some primal and dual optimal basis \(I_i\) for \(1\le i \le K\) such that \(\lambda (I_i)=\lambda (I_l)\) is degenerate, in contradiction with (\(\mathbf{A2} \)). Hence, any optimal dual basic solution is nondegenerate and induced by some primal and dual optimal basis \(I\). \(\square \)
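The display (18) used in the proof above is not reproduced in this excerpt; since the columns of \(A_{I_j}\) form a basis of \(\mathbb {R}^m\), it presumably reads (our reconstruction from the surrounding text, with the coefficients \(y_i^z\) as defined there):

```latex
% Presumed form of representation (18): every non-basic column a_z
% expands in the basis formed by the columns of A_{I_j}.
a_z \;=\; \sum_{i \in I_j} y_i^z \, a_i ,
\qquad z \notin I_j .
```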
Proof
(Lemma 5.5) We need to show that for any \(\varphi \in \mathrm {BL}(\mathcal Z)\) we have that
vanishes as \(n\rightarrow \infty \). To bound the first term notice that for any fixed g it holds that \(\Vert \alpha \mapsto \varphi (\alpha ,g)\Vert _{\mathrm {BL}(\Delta _{\mathcal {K}})}\le \Vert \varphi \Vert _{\mathrm {BL}(\mathcal Z)}\le 1\), so
Hence, we find \(\vert \mathbb {E}[\psi _n(G_n) - \psi (G_n)]\vert \le \sup _{g} \mathrm {BL}(\mu ^\mathcal {K}_{n,g},\mu ^\mathcal {K}_{g})\), which tends to zero by assumption. Notice that the supremum can be an essential supremum, i.e., taken on a set of full measure with respect to the laws of both \(G_n\) and \(G\) instead of the whole of \(\mathbb {R}^{m}\). For the second term observe that \(\Vert \psi \Vert _\infty \le \Vert \varphi \Vert _\infty \) and that
Hence, we conclude that
Dividing \(\psi \) by its bounded Lipschitz norm, we find
This completes the proof of the quantitative statement. Joint convergence still follows if \(g\mapsto \mu _{g}^\mathcal {K}\) is only continuous \(G\)-almost surely (but not Lipschitz). In fact, \(\psi \) is still continuous and bounded \(G\)-almost surely, so that \(\mathbb {E}\psi (G_n)\rightarrow \mathbb {E}\psi (G)\). Therefore, \(\mathbb {E}\varphi (\alpha _n^\mathcal {K},G_n)\rightarrow \mathbb {E}\varphi (\alpha ^\mathcal {K},G)\) for all \(\varphi \in \mathrm {BL}(\mathcal Z)\), which implies that \((\alpha _n^\mathcal {K},G_n)\rightarrow (\alpha ^\mathcal {K},G)\) in distribution. \(\square \)
B Optimal transport
Proof
(Theorem 6.1, two-sample) The only part where absolute continuity of \(G=(G^1(r_\dagger ),G^2(s))\) was required is when showing that the boundaries of the cones defined in (7) have zero probability with respect to G. We shall show that this is still the case, despite the singularity of \(G^2(s)\).
The cones under consideration take the form
(where \(\pi (I_k,v)\) is viewed as an \(N^2\)dimensional vector), and their boundaries satisfy
Let \(w=(v_{[N-1]},-\sum _{j=1}^{N-1}v_j,v_{\{N,\dots ,2N-1\}})\in \mathbb {R}^{2N}\) be the augmented vector corresponding to \(v\). In view of Brualdi (2006, Corollary 8.1.4), there exist \(\mathsf {R}_{i,k}\subset \{1,\ldots ,N\}\) and \(\mathsf {S}_{i,k}\subset \{N+1,\ldots ,2N\}\), not both empty, such that
It suffices to show that for any pair of sets \(\mathsf {R}\subseteq \{1,\ldots ,N-1\}\) and \(\mathsf {S}\subset \{1,\ldots ,N\}\) that are not both empty,
Recall that \(G^1(r_\dagger )\) is independent of \(G^2(s)\) and admits a density on \(\mathbb {R}^{N-1}\). Hence, when \(\mathsf {R}\) is nonempty, the above probability is indeed zero. If \(\mathsf {R}\) is empty, then \(\mathsf {S}\) is a nonempty proper subset of \([N]\). Since \(s\in \text {ri}(\Delta _N)\), the kernel of \(\Sigma (s)\) is the span of the vector of ones. Hence the distribution of \(\sum _{k\in \mathsf {S}} G^2(s)_k\) is absolutely continuous, so it vanishes with probability zero. This completes the proof. \(\square \)
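The kernel claim for \(\Sigma (s)\) can be checked directly. Assuming \(\Sigma (s)\) is the standard multinomial covariance \(\Sigma (s)_{kl}=s_k\mathbb {1}\{k=l\}-s_ks_l\) (the paper's definition of \(\Sigma (s)\) is not reproduced in this excerpt), the all-ones vector is annihilated whenever \(s\) sums to one:

```python
# Multinomial covariance Sigma(s)_{kl} = s_k * 1{k==l} - s_k * s_l
# (assumed form); for s in the relative interior of the simplex its
# kernel is spanned by the all-ones vector, as used in the proof above.
s = [0.2, 0.3, 0.5]
N = len(s)
Sigma = [[(s[k] if k == l else 0.0) - s[k] * s[l] for l in range(N)]
         for k in range(N)]
ones = [1.0] * N
image = [sum(Sigma[k][l] * ones[l] for l in range(N)) for k in range(N)]
# Row k of Sigma(s) @ 1 equals s_k - s_k * sum(s) = 0.
assert all(abs(v) < 1e-12 for v in image)
```

Conversely, a sum of the coordinates of \(G^2(s)\) over a nonempty proper subset of \([N]\) has variance bounded away from zero, which is the absolute-continuity statement used above.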
Proof
(Lemma 6.5) By elementary arguments the cost \(c(x_i,x_j)=f(\vert x_i-x_j\vert )\) satisfies the strict Monge condition (17) when \(f\) is strictly convex. We thus only consider the case where \(f\) is strictly concave. Since \(f\) was assumed nonnegative and finite with \(f(0)=0\), it must be continuous and strictly increasing. Clearly, the optimal cost between \(\mu \) and \(\nu \) is finite. According to Gangbo & McCann (1996, Proposition 2.9), all the common mass must stay in place. Hence, we may assume that \(\mu \) and \(\nu \) are mutually singular. We prove a more general result, from which Lemma 6.5 follows immediately. \(\square \)
Lemma B.1
Let \(\mu \) and \(\nu \) be mutually singular and both supported on a finite union of intervals. Let f be finite, strictly concave, and strictly increasing on the supports of \(\mu \) and \(\nu \). Then the optimal coupling between \(\mu \) and \(\nu \) with respect to the cost function \(c(x,y)=f(\vert x-y\vert )\) is unique.
Remark B.2
If \(\mu \) and \(\nu \) have finite support, the assumption is satisfied. We believe that the statement is true for an arbitrary pair of measures \(\mu \) and \(\nu \), but the above formulation is sufficient since in the present context \(\mu \) and \(\nu \) are anyway finitely supported. For example, the support could contain countably many intervals as long as there is a “clear” starting point \(a_0\) below; but \(M\) could be infinite.
Proof
There is nothing to prove if \(\mu =\nu =0\), so we assume \(\mu \ne \nu \). It follows from the assumptions that there exists a finite sequence of \(M+1\ge 3\) real numbers
such that (interchanging \(\mu \) and \(\nu \) if necessary)
Let \(m_0=\mu ([a_0,a_1])\) and suppose that \(m_0 \le \nu ([a_1,a_2])\). Define the quantile
We now claim that in any optimal coupling \(\pi \) between \(\mu \) and \(\nu \), the \(\mu \)mass of \([a_0,a_1]\) must go to \([a_1,a^\star ]\). Indeed, suppose that a positive \(\mu \)mass from \([a_0,a_1]\) goes strictly beyond \(a^\star \). Then some mass from the support of \(\mu \) but not in \([a_0,a_1]\) has to go to \([a_1,a^\star ]\). Such a coupling gives positive measure to the set
for some \(\epsilon >0\). Strict monotonicity of the cost function makes this suboptimal, since such a coupling entails sending mass from \(x_1\) to \(y_1\) and from \(x_2\) to \(y_2\) with \(x_1<y_2<\min (x_2,y_1)\) (see Gangbo & McCann, 1996, Theorem 2.3 for a rigorous proof). Hence the claim is proved. Let \(\mu _1\) be the restriction of \(\mu \) to \([a_0,a_1]\) and let \(\nu _1\) be the restriction of \(\nu \) to \([a_1,a^\star ]\) with mass \(m_0\), namely \(\nu _1(B)=\nu (B)\) if \(B\subseteq [a_1,a^\star )\), \(\nu _1(\{a^\star \}) = m_0 - \nu ([a_1,a^\star ))\) and \(\nu _1(B)=0\) if \(B\cap [a_1,a^\star ]=\emptyset \). By definition of \(a^\star \), \(\nu _1\) is a measure (i.e., \(\nu _1(\{a^\star \})\ge 0\)), and \(\nu _1\) and \(\mu _1\) have the same total mass \(m_0\). Each of these measures is supported on an interval, and these intervals are (almost) disjoint. Strict concavity of the cost function entails that any optimal coupling between \(\mu _1\) and \(\nu _1\) must be nonincreasing (in a set-valued sense). Since there is only one such nonincreasing coupling, it is unique.
By the preceding paragraph and the above claim, we know that \(\pi \) must be nonincreasing from \([a_0,a_1]\) to \([a_1,a^\star ]\), which determines \(\pi \) uniquely on that part. After this transport is carried out, we are left with the measures \(\nu -\nu _1\) and \(\mu -\mu _1\), where the latter is supported on one less interval, namely the interval \([a_0,a_1]\) disappears.
If instead \(\mu ([a_0,a_1])>\nu ([a_1,a_2])\), we can use the same construction with
and the interval \([a_1,a_2]\) will disappear. We then merge \([a^\star ,a_1]\) with \([a_2,a_3]\), that is
If \(\mu ([a_0,a_1])=\nu ([a_1,a_2])\), then both intervals \([a_0,a_1]\) and \([a_1,a_2]\) disappear when considering \(\mu -\mu _1\) and \(\nu -\nu _1\). In all three cases we can continue inductively and construct \(\pi \) in a unique way. Since there are finitely many intervals, the procedure is guaranteed to terminate. Thus \(\pi \) is unique. \(\square \)
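The nonincreasing coupling invoked at each step of this construction admits a simple explicit description for discrete measures. The following sketch (our own illustration; the function name and data format are not from the paper) pairs the atoms of \(\mu _1\), taken in increasing order of location, with those of \(\nu _1\) in decreasing order, which is the unique antitone coupling of two equal-mass measures:

```python
def antitone_coupling(mu, nu):
    """Unique nonincreasing coupling of two discrete measures with
    equal total mass; mu and nu are lists of (location, mass) pairs.
    Returns a list of (source, target, mass) triples."""
    mu = sorted(mu)                 # locations ascending
    nu = sorted(nu, reverse=True)   # locations descending
    plan, i, j = [], 0, 0
    mi, mj = mu[0][1], nu[0][1]     # remaining mass at current atoms
    while i < len(mu) and j < len(nu):
        m = min(mi, mj)             # ship as much as both atoms allow
        plan.append((mu[i][0], nu[j][0], m))
        mi -= m
        mj -= m
        if mi == 0:
            i += 1
            if i < len(mu):
                mi = mu[i][1]
        if mj == 0:
            j += 1
            if j < len(nu):
                mj = nu[j][1]
    return plan

# Two unit atoms on [a_0, a_1] coupled with two unit atoms to the right:
# the leftmost source atom is matched with the rightmost target atom.
plan = antitone_coupling([(0.0, 1.0), (0.5, 1.0)], [(1.0, 1.0), (2.0, 1.0)])
assert plan == [(0.0, 2.0, 1.0), (0.5, 1.0, 1.0)]
```

For strictly concave increasing costs this antitone pairing is exactly the coupling the proof singles out on each pair of adjacent intervals.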
Proof
(Lemma 6.4) Let \(I_k\) be a dual feasible basis inducing a dual solution \((\alpha ,\beta )\). Every such basis induces a graph \(G(I_k)\): if \((i,j)\in I_k\), then the \(i\)th support point of the measure \(r\) is connected to the \(j\)th support point of the measure \(s\), i.e., \((i,j)\in G(I_k)\). By definition of a dual feasible basis it holds that \(\alpha _i+\beta _j=c_{ij}\) for \((i,j)\in I_k\). In fact, such a basis induces a tree structure on the support points of \(r\) and \(s\) (Peyré & Cuturi, 2019, Section 3.4).
In order to exclude that \(\lambda (I_k)= \lambda (I_l)\) for \(k\ne l\) we proceed as follows. Since \(G(I_k)\ne G(I_l)\), there exists at least one edge \((i,j)\) in \(G(I_l)\setminus G(I_k)\). By definition, if \((\widetilde{\alpha },\widetilde{\beta })\) is the feasible dual solution induced by \(I_l\), then \(\widetilde{\alpha }_i+\widetilde{\beta }_j=c_{ij}\). To conclude that \(\lambda (I_k)\ne \lambda (I_l)\), it suffices to prove that \(\alpha _i+\beta _j\ne c_{ij}\). To see this, notice that adding the edge \((i,j)\) to \(G(I_k)\) creates a cycle. In particular, after proper relabelling there exists a path of the form
such that \((i_l,j_l)\in G(I_k)\) as well as \((i_{l+1},j_{l})\in G(I_k)\) for all \(1\le l \le n-1\). Recall further that, by definition, if the edge \((i_k,j_k)\in G(I_k)\), then \(\alpha _{i_k}+\beta _{j_k}=c_{i_k,j_k}\). By the summability assumption (15) it follows
which gives \(\alpha _i+\beta _j\ne c_{ij}\). \(\square \)
Proof
(Proposition 6.6) According to Lemma 6.4, all dual feasible basic solutions for (\(\text {DOT}\)) are nondegenerate if there exists no family of indices \(\left\{ (i_k,j_k)\right\} \) for some \(n\ge 2\), with all \(i_k\) pairwise different and all \(j_k\) pairwise different, such that
It suffices to prove that (19) holds with probability zero for fixed \(n\). For the sake of notational simplicity, we consider the first \(n\le N\) random locations \((\mathbf {X},\mathbf {Y}):= (X_{1},\ldots ,X_{n},Y_{1},\ldots ,Y_{n})\). We write \((\mathbf {x},\mathbf {y})=(x_1,\ldots ,x_n,y_1,\ldots ,y_n)\in \left( \mathbb {R}^D\right) ^{2n}\) for a generic point and define the set
We need to show that \(\mathbb {P}((\mathbf {X},\mathbf {Y})\in A)=0\). Set \(e_i\in \mathbb {R}^D\) to be the ith unit vector and consider the closed set
Define the function \(f:(\mathbb {R}^D)^{2n}\setminus B \rightarrow \mathbb {R}\) by \(f(\mathbf {x},\mathbf {y})=\sum _{k=1}^n \left( \Vert x_k-y_{k-1}\Vert _q^p-\Vert x_k-y_k\Vert _q^p\right) \), with the cyclic convention \(y_0=y_n\). We can rewrite
The second term on the right-hand side is zero since, by independence and absolute continuity, the high-dimensional vector \((\mathbf {X},\mathbf {Y})\) has a Lebesgue density and the set \(B\) lives in dimension less than \(2Dn\). It remains to discuss \(\mathbb {P}\left( (\mathbf {X},\mathbf {Y}) \in f^{-1}(0)\right) \). The open set \((\mathbb {R}^D)^{2n}\setminus B\) on which \(f\) is defined can be partitioned into finitely many (less than \(6^{nD}\)) open connected components \(U_1,\ldots ,U_L\) according to the signs of \(\langle x_k-y_k,e_i\rangle \) and \(\langle x_k-y_{k-1},e_i\rangle \). On each such component \(f_{\vert U_l}\) is analytic. It follows that \(\mathbb {P}\left( (\mathbf {X},\mathbf {Y})\in f^{-1}_{\vert U_l}(\{0\})\right) =0\) unless \(f_{\vert U_l}\) is identically zero (Dang, 2015, Lemma 1.2). To exclude the latter possibility, consider for any point \((\mathbf {x},\mathbf {y})\in U_l\) and \(\epsilon \in \mathbb {R}\) the function
with derivative at \(\epsilon =0\) given by
where \(x_{i_j}\) denotes the \(j\)th entry of the \(i\)th vector. If this derivative is nonzero, then clearly \(f\) is not identically zero. If the derivative is zero, then we shall show that there exists another point in \(U_l\) for which this derivative is nonzero. Since \(U_l\) is open, we can add \(\delta e_j\) to \(y_n\) for small \(\delta \) and any \(1\le j\le D\). If \(p\ne q\) then taking \(j\ne i\) (which is possible because \(D\ge 2\)) only modifies the term \(\Vert x_1-y_n\Vert \) in (20), and for small \(\delta \) the derivative will not be zero. If \(p=q\ne 1\), then the norms do not appear in (20) and taking \(j=i\) yields a nonzero derivative. Hence, if \(p\) and \(q\) are not both equal to one, \(f\) is not identically zero on any piece \(U_l\), which is what we needed to prove. A similar idea works in the case \(q=\infty \) and \(p\ne 1\).
The argument only depends on the positions of the random support points of the probability measures \(r=\sum _{k=1}^n r_k\delta _{X_k}\) and \(s=\sum _{k=1}^n s_k\delta _{Y_k}\), and hence is uniform in their probability weights. Recall further from Proposition 2.4 that if the dual problem admits a nondegenerate optimal solution, then the primal optimal solution is unique. We conclude that almost surely the optimal transport coupling is unique. \(\square \)
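Conditions (15) and (19) are not reproduced in this excerpt; assuming (19) is the vanishing of the cyclic alternating sum defining \(f\) above, the genericity argument can be illustrated numerically (an illustration only, with hypothetical helper names):

```python
import random

def alt_cycle_sum(xs, ys, p=2.0, q=2.0):
    """Cyclic alternating sum f = sum_k ||x_k - y_{k-1}||_q^p - ||x_k - y_k||_q^p
    with the convention y_0 = y_n, as in the function f of the proof above."""
    n = len(xs)
    norm_q = lambda u, v: sum(abs(a - b) ** q for a, b in zip(u, v)) ** (1.0 / q)
    return sum(norm_q(xs[k], ys[k - 1]) ** p - norm_q(xs[k], ys[k]) ** p
               for k in range(n))

random.seed(1)
D, n = 2, 4  # dimension D >= 2 and cycle length n >= 2, as in the proof
xs = [tuple(random.random() for _ in range(D)) for _ in range(n)]
ys = [tuple(random.random() for _ in range(D)) for _ in range(n)]
# For absolutely continuous locations the sum vanishes with probability
# zero, so a generic draw stays bounded away from zero.
assert abs(alt_cycle_sum(xs, ys)) > 1e-12
```

This is exactly the event whose probability the proof shows to be zero via analyticity of \(f\) on each component \(U_l\).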
Klatt, M., Munk, A. & Zemel, Y. Limit laws for empirical optimal solutions in random linear programs. Ann Oper Res 315, 251–278 (2022). https://doi.org/10.1007/s10479022046980
Keywords
 Limit law
 Linear programming
 Optimal transport
 Sensitivity analysis
Mathematics Subject Classification
 90C05
 90C15
 90C31
 62E20
 49N15