1 Introduction

The linear programming (LP) problem in primal-dual form is to solve

$$\begin{aligned} \begin{aligned} \min \;&c^\top x \quad \\ Ax&=b \\ x&\ge 0, \\ \end{aligned} \quad \quad \quad \begin{aligned} \max \;&y^\top b \\ A^\top y + s&= c \\ s&\ge 0, \\ \end{aligned} \end{aligned}$$
(LP)

where \(A\in \mathbb {R}^{m\times n}\), \(\textrm{rank}(A) = m\), \(b\in \mathbb {R}^m\), \(c\in \mathbb {R}^n\) are given in the input, and \(x,s\in \mathbb {R}^n\), \(y\in \mathbb {R}^m\) are the variables. The program in x will be referred to as the primal problem and the program in (y, s) as the dual problem.

Khachiyan [23] used the ellipsoid method to give the first polynomial time LP algorithm in the bit-complexity model, that is, polynomial in the bit description length of (A, b, c). An outstanding open question is the existence of a strongly polynomial algorithm for LP, listed by Smale as one of the most prominent mathematical challenges for the 21st century [46]. Such an algorithm amounts to solving LP using \(\textrm{poly}(n,m)\) basic arithmetic operations in the real model of computation. Known strongly polynomially solvable classes of LP problems include: feasibility for two-variable-per-inequality systems [33], the minimum-cost circulation problem [50], the maximum generalized flow problem [41, 61], and discounted Markov decision problems [65, 67].

Towards this goal, the principal line of attack has been to develop LP algorithms whose running time is bounded in terms of natural condition measures, which attempt to capture the “intrinsic complexity” of LPs. An important line of work in this area has been to parametrize LPs by the “niceness” of their solutions (e.g. the depth of the most interior point); relevant examples include the Goffin measure [19] for conic systems, Renegar’s distance to ill-posedness for general LPs [43, 44], and bounded ratios between the nonzero entries in basic feasible solutions [6, 24].

Parametrizing by the constraint matrix A second line of research, and the main focus of this work, considers the complexity of the constraint matrix A. The first breakthrough in this area was given by Tardos [51], who showed that if A has integer entries and all square submatrices of A have determinant at most \(\Delta \) in absolute value, then (LP) can be solved in poly\((n,m,\log \Delta )\) arithmetic operations, independent of the encoding length of the vectors b and c. This is achieved by finding the exact solutions to O(nm) rounded LPs derived from the original LP, with the right-hand side vector and cost function being integers of absolute value bounded in terms of n and \(\Delta \). From m such rounded problem instances, one can infer, via proximity results, that \(x_i=0\) must hold in every optimal solution for some index i. The process continues by induction until the optimal primal face is identified.

Path-following methods and the Vavasis–Ye algorithm In a seminal work, Vavasis and Ye [63] introduced a new type of interior-point method that optimally solves (LP) within \(O(n^{3.5} \log (\bar{\chi }_A+n))\) iterations, where the condition number \(\bar{\chi }_A\) controls the size of solutions to certain linear systems related to the kernel of A (see Sect. 2 for the formal definition).

Before detailing the Vavasis–Ye (henceforth VY) algorithm, we recall the basics of path following interior-point methods. If both the primal and dual problems in (LP) are strictly feasible, the central path for (LP) is the curve \(((x(\mu ),y(\mu ),s(\mu )): \mu > 0)\) defined by

$$\begin{aligned} \begin{aligned} x({\mu })_i s({\mu })_i&= \mu , \quad \forall i \in [n]\\ Ax(\mu )&= b, ~x(\mu )> 0, \\ A^\top y(\mu ) + s(\mu )&= c, ~s(\mu ) > 0, \end{aligned} \end{aligned}$$
(CP)

which converges to complementary optimal primal and dual solutions \((x^*,y^*,s^*)\) as \(\mu \rightarrow 0\), recalling that the duality gap at parameter \(\mu \) is exactly \(x(\mu )^\top s(\mu ) = n \mu \). We thus refer to \(\mu \) as the normalized duality gap. Methods that “follow the path” generate iterates that stay in a certain neighborhood around it while trying to achieve rapid multiplicative progress w.r.t. \(\mu \), where given (x, y, s) ‘close’ to the path, we define the normalized duality gap as \(\mu (x,y,s) = \sum _{i=1}^n x_i s_i/n\). Given a target parameter \(\mu '\) and a starting point close to the path at parameter \(\mu \), standard path following methods [20] can compute a point at parameter below \(\mu '\) in at most \(O(\sqrt{n} \log (\mu /\mu '))\) iterations, and hence the quantity \(\log (\mu /\mu ')\) can be usefully interpreted as the length of the corresponding segment of the central path.
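
To make the system (CP) concrete, the following Python sketch (illustrative only; the two-variable instance is made-up, and we use scipy's generic root finder rather than an interior-point iteration) solves (CP) numerically for a few values of \(\mu \) and checks that the normalized duality gap \(x(\mu )^\top s(\mu )/n\) equals \(\mu \).

```python
import numpy as np
from scipy.optimize import fsolve

# Made-up toy instance:  min x1 + 2 x2   s.t.  x1 + x2 = 1,  x >= 0.
A = np.array([[1.0, 1.0]])
b = np.array([1.0])
c = np.array([1.0, 2.0])
m, n = A.shape

def central_path_residual(v, mu):
    # v = (x, y, s); residual of the system (CP) at parameter mu
    x, y, s = v[:n], v[n:n + m], v[n + m:]
    return np.concatenate([x * s - mu,        # x_i(mu) s_i(mu) = mu
                           A @ x - b,         # primal feasibility
                           A.T @ y + s - c])  # dual feasibility

v = np.concatenate([np.full(n, 0.5), np.zeros(m), c])  # strictly feasible start
for mu in [1.0, 0.1, 0.01]:
    v = fsolve(central_path_residual, v, args=(mu,))
    x, s = v[:n], v[n + m:]
    print(mu, x, x @ s / n)   # normalized duality gap equals mu
```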

Crossover events and layered least squares steps At a very high level, Vavasis and Ye show that the central path can be decomposed into at most \(\left( {\begin{array}{c}n\\ 2\end{array}}\right) \) short but curved segments, possibly joined by long (a priori unbounded) but very straight segments. At the end of each curved segment, they show that a new ordering relation \(x_i(\mu ) > x_j(\mu )\)—called a ‘crossover event’—is implicitly learned. This inequality did not hold at the start of the segment, but is guaranteed to hold at every point from the end of the segment onwards. These \(\left( {\begin{array}{c}n\\ 2\end{array}}\right) \) relations give a combinatorial way to measure progress along the central path. In contrast to Tardos’s algorithm, where the main progress is setting variables to zero explicitly, the variables participating in crossover events cannot be identified; the analysis only shows their existence.

At a technical level, the VY-algorithm is a variant of the Mizuno–Todd–Ye [39] predictor–corrector method (MTY P-C). In predictor–corrector methods, corrector steps bring an iterate closer to the path, i.e., improve centrality, and predictor steps “shoot down” the path, i.e., reduce \(\mu \) without losing too much centrality. Vavasis and Ye’s main algorithmic innovation was the introduction of a new predictor step, called the ‘layered least squares’ (LLS) step, which crucially allowed them to cross each aforementioned “straight” segment of the central path in a single step, recalling that these straight segments may be arbitrarily long. To traverse the short and curved segments of the path, the standard predictor step, known as affine scaling (AS), in fact suffices.

To compute the LLS direction, the variables are decomposed into ‘layers’ \(J_1\cup J_2\cup \ldots \cup J_p=[n]\). The goal of such a decomposition is to eventually learn a refinement of the optimal partition of the variables \( B^* \cup N^*=[n]\), where \(B^*:= \{i \in [n]: x^*_i > 0\}\) and \(N^*:= \{i \in [n]: s^*_i > 0\}\) for the limit optimal solution \((x^*,y^*,s^*)\).

The primal affine scaling direction can be equivalently described by solving a weighted least squares problem in \({\text {Ker}}(A)\), with respect to a weighting defined according to the current iterate. The primal LLS direction is obtained by solving a series of weighted least squares problems, starting with focusing only on the final layer \(J_p\). This solution is gradually extended to the higher layers (i.e., layers with lower indices). The dual directions have analogous interpretations, with the solutions on the layers obtained in the opposite direction, starting with \(J_1\). If we use the two-level layering \(J_1=B^*\), \(J_2=N^*\), and are sufficiently close to the limit \((x^*,y^*,s^*)\) of the central path, then the LLS step reaches an exact optimal solution in a single step. We note that standard AS steps generically never find an exact optimal solution, and thus some form of “LLS rounding” in the final iteration is always necessary to achieve finite termination with an exact optimal solution.

Of course, guessing \(B^*\) and \(N^*\) correctly is just as hard as solving (LP). Still, if we work with “good” layerings, these will reveal new information about the “optimal order” of the variables, where \(B^*\) is placed on higher layers than \(N^*\). The crossover events correspond to swapping two wrongly ordered variables into the correct ordering. Namely, variables \(i\in B^*\) and \(j\in N^*\) are currently on the same layer, or j is on a higher layer than i. After the crossover event, i will always be placed on a higher layer than j.

Computing good layerings and the \(\bar{\chi }_A\) condition measure Given the above discussion, the obvious question is how to come up with “good” layerings. The philosophy behind LLS can be stated as saying that if modifying a set of variables \(x_I\) barely affects the variables in \(x_{[n] \setminus I}\) (recalling that movement is constrained to \(\Delta x \in {\text {Ker}}(A)\)), then one should optimize over \(x_I\) without regard to the effect on \(x_{[n] \setminus I}\); hence \(x_I\) should be placed on lower layers.

VY’s strategy for computing such layerings was to directly use the size of the coordinates of the current iterate x (where (xys) is a point near the central path). In particular, assuming \(x_1\ge x_2\ge \ldots \ge x_n\), the layering \(J_1 \cup J_2\cup \ldots \cup J_p = [n]\) corresponds to consecutive intervals constructed in decreasing order of \(x_i\) values. The break between \(J_i\) and \(J_{i+1}\) occurs if the gap \(x_r/x_{r+1} > g\), where r is the rightmost element of \(J_i\) and \(g > 0\) is a threshold parameter. Thus, the expectation is that if \(x_i> g x_j\), then a small multiplicative change to \(x_j\), subject to moving in \({\text {Ker}}(A)\), should induce a small multiplicative change to \(x_i\). By proximity to the central path, the dual ordering is reversed as mentioned above.
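
As an illustration of this gap-based construction, the following Python sketch splits the coordinates into consecutive intervals in decreasing order of \(x_i\). The function name vy_layering and the toy numbers are ours; the parameter g stands in for the \(\textrm{poly}(n)\,\bar{\chi }_A\) threshold discussed below.

```python
import numpy as np

def vy_layering(x, g):
    """Split coordinates into layers J_1, ..., J_p in decreasing order of x_i,
    starting a new layer whenever consecutive values differ by more than a factor g."""
    order = np.argsort(-np.asarray(x)).tolist()   # indices sorted by decreasing x_i
    layers, current = [], [order[0]]
    for prev, cur in zip(order, order[1:]):
        if x[prev] > g * x[cur]:                  # gap exceeds the threshold: close the layer
            layers.append(current)
            current = [cur]
        else:
            current.append(cur)
    layers.append(current)
    return layers

# toy values; g plays the role of the poly(n) * chi_bar threshold
print(vy_layering(np.array([5.0, 1e-6, 4.9, 2e-6]), g=100.0))   # [[0, 2], [3, 1]]
```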

The threshold g for which this was justified in the VY-algorithm is a function of the \(\bar{\chi }_A\) condition measure. We now provide a convenient definition that immediately yields this justification (see Proposition 2.4). Letting \(W = {\text {Ker}}(A)\) and \(\pi _I(W) = \{x_I: x \in W\}\), we define \(\bar{\chi }_A:= \bar{\chi }_W\) as the minimum number \(M \ge 1\) such that for any \(\emptyset \ne I \subseteq [n]\) and \(z \in \pi _I(W)\), there exists \(y \in W\) with \(y_I = z\) and \(\Vert y\Vert \le M \Vert z\Vert \). Thus, a change of norm \(\epsilon \) in the variables in I can be lifted to a change of norm at most \(\bar{\chi }_A\epsilon \) in the variables in \([n]\setminus I\). Crucially, \(\bar{\chi }\) is a “self-dual” quantity. That is, \(\bar{\chi }_W = \bar{\chi }_{W^\perp }\), where \(W^\perp = \textrm{range}(A^\top )\) is the movement subspace for the dual problem, justifying the reversed layering for the dual (see Sect. 2 for more details).

The question of scale invariance and \(\bar{\chi }^*_A\) While the VY layering procedure is powerful, its properties are somewhat mismatched with those of the central path. In particular, variable ordering information has no intrinsic meaning on the central path, as the path itself is scaling invariant. Namely, the central path point \((x(\mu ),y(\mu ),s(\mu ))\) w.r.t. the problem instance (A, b, c) is in bijective correspondence with the central path point \((D^{-1} x(\mu ), y(\mu ), D s(\mu ))\) w.r.t. the problem instance (AD, b, Dc) for any positive diagonal matrix D. The standard path following algorithms are also scaling invariant in this sense.

This led Monteiro and Tsuchiya [36] to ask whether a scaling invariant LLS algorithm exists. They noted that any such algorithm would then depend on the potentially much smaller parameter

$$\begin{aligned} \bar{\chi }^*_A:= \inf _D \bar{\chi }_{AD}\,, \end{aligned}$$
(1)

where the infimum is taken over the set of \(n \times n\) positive diagonal matrices. Thus, Monteiro and Tsuchiya’s question can be rephrased as to whether there exists an exact LP algorithm with running time poly\((n,m,\log \bar{\chi }^*_A)\).

Substantial progress on this question was made in the followup works [28, 37]. The paper [37] showed that the MTY predictor–corrector algorithm [39] can get from \(\mu ^0>0\) to \(\eta >0\) on the central path in

$$\begin{aligned} O\left( n^{3.5}\log \bar{\chi }^*_A+\min \{n^2\log \log (\mu ^0/\eta ), \log (\mu ^0/\eta )\}\right) \end{aligned}$$

iterations. This is attained by showing that the standard AS steps are reasonably close to the LLS steps. This proximity can be used to show that the AS steps can traverse the “curved” parts of the central path in the same iteration complexity bound as the VY algorithm. Moreover, on the “straight” parts of the path, the rate of progress amplifies geometrically, thus attaining a \(\log \log \) convergence on these parts. Subsequently, [28] developed an affine invariant trust region step, which traverses the full path in \(O(n^{3.5} \log (\bar{\chi }_A^*+n))\) iterations. However, the running time of each iteration is weakly polynomial in b and c. The question of developing an LP algorithm with complexity bound poly\((n,m,\log \bar{\chi }_A^*)\) thus remained open.

A related open problem is whether it is possible to compute a near-optimal rescaling D for program (1). This would give an alternate pathway to the desired LP algorithm by simply preprocessing the matrix A. The related question of approximating \(\bar{\chi }_A\) was already studied by Tunçel [54], who showed NP-hardness for approximating \(\bar{\chi }_A\) to within a \(2^{\textrm{poly}(\textrm{rank}(A))}\) factor. Taken at face value, this may seem to suggest that approximating the rescaling D should be hard.

A further open question is whether Vavasis and Ye’s cross-over analysis can be improved. Ye showed in [66] that the iteration complexity can be reduced to \(O(n^{2.5} \log (\bar{\chi }_A+n))\) for feasibility problems and further to \(O(n^{1.5} \log (\bar{\chi }_A+n))\) for homogeneous systems, though the \(O(n^{3.5} \log (\bar{\chi }_A+n))\) bound for optimization has not been improved since [63].

1.1 Our contributions

In this work, we resolve all of the above questions in the affirmative. We detail our contributions below.

1. Finding an approximately optimal rescaling. As our first contribution, we give an \(O(m^2 n^2 + n^3)\) time algorithm that works on the linear matroid of A to compute a diagonal rescaling matrix D which achieves \(\bar{\chi }_{AD} \le n (\bar{\chi }_A^*)^3\), given any \(m \times n\) matrix A. Furthermore, this same algorithm allows us to approximate \(\bar{\chi }_A\) to within a factor \(n(\bar{\chi }_A^*)^2\). The algorithm bypasses Tunçel’s hardness result by allowing the approximation factor to depend on A itself, namely on \(\bar{\chi }_A^*\). This gives a simple first answer to Monteiro and Tsuchiya’s question: by applying the Vavasis–Ye algorithm directly on the preprocessed A matrix, we may solve any LP with constraint matrix A using \(O(n^{3.5}\log ( \bar{\chi }^*_A+n))\) iterations. Note that the approximation factor \(n(\bar{\chi }_A^*)^2\) increases the runtime only by a constant factor.

To achieve this result, we work with the circuits of A, where a circuit \(C\subseteq [n]\) corresponds to an inclusion-wise minimal set of linearly dependent columns. With each circuit, we can associate a vector \(g^C\in {\text {Ker}}(A)\) with \(\textrm{supp}(g^C)=C\) that is unique up to scaling. By the ‘circuit ratio’ \(\kappa _{ij}\) associated with the pair of nodes (i, j), we mean the largest ratio \(|g^C_j/g^C_i|\) taken over all circuits C of A such that \(i,j\in C\). As our first observation, we show that the maximum of all circuit ratios, which we call the ‘circuit imbalance measure’, in fact characterizes \(\bar{\chi }_A\) up to a factor n. This measure was first studied by Vavasis [56], who showed that it lower bounds \(\bar{\chi }_A\), though, as far as we are aware, our upper bound is new. The circuit ratios of each pair (i, j) induce a weighted directed graph we call the ‘circuit ratio digraph’ of A. From here, our main result is that \(\bar{\chi }^*_A\) equals, up to a factor n, the maximum geometric mean of a cycle in the circuit ratio digraph. Our algorithm populates the circuit ratio digraph with approximations of the \(\kappa _{ij}\) ratios for each \(i,j\in [n]\) using standard techniques from matroid theory, and then computes a rescaling by solving the dual of the maximum geometric mean cycle problem on the ‘approximate circuit ratio digraph’.

2. Scaling invariant LLS algorithm. While the above yields an LP algorithm with poly\((n,m,\log \bar{\chi }^*_A)\) running time, it does not satisfactorily address Monteiro and Tsuchiya’s question on a scaling invariant algorithm. As our second contribution, we use the circuit ratio digraph directly to give a natural scaling invariant LLS layering algorithm together with a scaling invariant crossover analysis.

At a conceptual level, we show that the circuit ratios give a scale invariant way to measure whether ‘\(x_i >x_j\)’ and enable a natural layering algorithm. Assume for now that the circuit imbalance value \(\kappa _{ij}\) is known for every pair (i, j). Given the circuit ratio graph induced by the \(\kappa _{ij}\)’s and given a primal point x near the path, our layering algorithm can be described as follows. We first rescale the variables so that x becomes the all ones vector, which rescales \(\kappa _{ij}\) to \(\kappa _{ij} x_i/x_j\). We then restrict the graph to its edges of length \(\kappa _{ij}x_i/x_j\ge 1/\textrm{poly}(n)\)—the long edges of the (rescaled) circuit ratio graph—and let the layering \(J_1 \cup J_2\cup \ldots \cup J_p\) be a topological ordering of its strongly connected components (SCC) with edges going from left to right. Intuitively, variables that “affect each other” should be in the same layer, which motivates the SCC definition.
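
The following Python sketch illustrates this construction (it is not the actual procedure of Sect. 3.5; the circuit ratio estimates kappa, the point x, and the threshold are made-up toy data, and we assume the networkx package is available for computing strongly connected components and their topological order).

```python
import numpy as np
import networkx as nx   # assumed available; used for SCCs and topological order

def scc_layering(kappa, x, n, threshold):
    """Layering from the rescaled circuit ratio graph: keep the 'long' edges with
    kappa[i, j] * x[i] / x[j] >= threshold and order the strongly connected
    components topologically, edges going from earlier to later layers."""
    G = nx.DiGraph()
    G.add_nodes_from(range(n))
    for (i, j), k_ij in kappa.items():
        if k_ij * x[i] / x[j] >= threshold:
            G.add_edge(i, j)
    C = nx.condensation(G)                        # contract each SCC to one node
    return [sorted(C.nodes[q]['members']) for q in nx.topological_sort(C)]

# made-up circuit ratio estimates on 4 variables and a point x near the path
kappa = {(0, 1): 2.0, (1, 0): 1.0, (2, 3): 3.0, (3, 2): 0.5, (1, 2): 1e-8, (2, 1): 1e-8}
x = np.array([1.0, 1.0, 1e-5, 1e-5])
print(scc_layering(kappa, x, n=4, threshold=1e-3))   # [[0, 1], [2, 3]]
```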

We note that our layering algorithm does not have access to the true circuit ratios \(\kappa _{ij}\); these are in fact NP-hard to compute. Getting a good enough initial estimate for our purposes however is easy: we let \(\hat{\kappa }_{ij}\) be the ratio corresponding to an arbitrary circuit containing i and j. This already turns out to be within a factor \((\bar{\chi }^*_A)^2\) from the true value \(\kappa _{ij}\)—recall this is the maximum over all such circuits. Our layering algorithm learns better circuit ratio estimates if the ‘lifting costs’ of our SCC layering, i.e., how much it costs to lift changes from lower layer variables to higher layers (as in the definition of \(\bar{\chi }_A\)), are larger than we expected them to be based on the previous estimates.

We develop a scaling-invariant analogue of crossover events as follows. Before the crossover event, \(\textrm{poly}(n)(\bar{\chi }^*_A)^{n}>\kappa _{ij} x_i/x_j\), and after the crossover event, \(\textrm{poly}(n)(\bar{\chi }^*_A)^{n}<\kappa _{ij} x_i/x_j\) for all further central path points. Our analysis relies on \(\bar{\chi }_A^*\) in only a minimalistic way, and does not require an estimate on the value of \(\bar{\chi }_A^*\). Namely, it is only used to show that if \(i,j\in J_q\), for a layer \(q \in [p]\), then the rescaled circuit ratio \(\kappa _{ij} x_i/x_j\) is in the range \((\textrm{poly}(n) \bar{\chi }_A^*)^{\pm O( |J_q|)}\). The argument to show this crucially utilizes the maximum geometric mean cycle characterization. Furthermore, unlike prior analyses [36, 63], our definition of a “good” layering (i.e., ‘balanced’ layerings, see Sect. 3.5) is completely independent of \(\bar{\chi }^*_A\).

3. Improved potential analysis. As our third contribution, we improve the Vavasis–Ye crossover analysis using a new and simple potential function based approach. When applied to our new LLS algorithm, we derive an \(O(n^{2.5} \log n \log (\bar{\chi }_A^*+n))\) iteration bound for path following, improving the polynomial term by an \(\Omega (n/\log n)\) factor compared to the VY analysis.

Our potential function can be seen as a fine-grained version of the crossover events as described above. In case of such a crossover event, it is guaranteed that in every subsequent iteration, i is in a layer before j. We analyze less radical changes instead: an “event” parametrized by \(\tau \) means that i and j are currently together on a layer of size \(\le \tau \), and after the event, i is on a layer before j, or if they are together on the same layer, then this layer must have size \(\ge 2\tau \). For every LLS step, we can find a parameter \(\tau \) such that an event of this type happens concurrently for at least \(\tau -1\) pairs within the next \(O(\sqrt{n} \tau \log (\bar{\chi }_A^*+n))\) iterations.

Our improved analysis is also applicable to the original VY-algorithm. Let us now comment on the relation between the VY-algorithm and our new algorithm. The VY-algorithm starts a new layer once \(x_{\pi (i)}> g x_{\pi (i+1)}\) between two consecutive variables, where the permutation \(\pi \) is a non-increasing order of the \(x_i\) variables and \(g=\textrm{poly}(n) \bar{\chi }_A\) for a suitable polynomial. Setting the initial ‘estimates’ \(\hat{\kappa }_{ij}=\bar{\chi }_A\), our algorithm runs the same way as the VY algorithm. Using these estimates, the layering procedure becomes much simpler: there is no need to verify ‘balancedness’ as in our algorithm.

However, using estimates \(\hat{\kappa }_{ij}=\bar{\chi }_A\) has drawbacks. Most importantly, it does not give a lower bound on the true circuit ratio \(\kappa _{ij}\)—to the contrary, g will be an upper bound. In effect, this causes VY’s layers to be “much larger” than ours, and for this reason, the connection to \(\bar{\chi }^*_A\) is lost. Nevertheless, our potential function analysis can still be adapted to the VY-algorithm to obtain the same \(\Omega (n/\log n)\) improvement on the iteration complexity bound; see Sect. 4.1 for more details.

1.2 Related work

Since the seminal works of Karmarkar [22] and Renegar [42], there has been a tremendous amount of work on speeding up and improving interior-point methods. In contrast to the present work, the focus of these works has mostly been to improve the complexity of approximately solving LPs. Progress has taken many forms: the development of novel barriers, such as Vaidya’s volumetric barrier [55], the recent entropic barrier of Bubeck and Eldan [5], and the weighted log-barrier of Lee and Sidford [29, 31]; new path following techniques, such as the predictor–corrector framework [34, 39]; as well as advances in fast linear system solving [30, 48]. On this last front, there has been substantial progress in speeding up IPMs by amortizing the cost of the iterative updates and working with approximate computations, see e.g. [42, 55] for classical results. Recently, Cohen, Lee and Song [7] developed a new inverse maintenance scheme to get a randomized \(\tilde{O}(n^{\omega }\log (1/\varepsilon ))\)-time algorithm for \(\varepsilon \)-approximate LP, which was derandomized by van den Brand [57]; here \(\omega \approx 2.37\) is the matrix multiplication exponent. A very recent result by van den Brand et al. [60] obtained a randomized \(\tilde{O}(nm+m^3)\) algorithm. For special classes of LP such as network flow and matching problems, even faster algorithms have been obtained using, among other techniques, fast Laplacian solvers, see e.g. [15, 32, 58, 59]. Given the progress above, we believe it to be an interesting problem to understand to what extent these new numerical techniques can be applied to speed up LLS computations, though we expect that such computations will require very high precision. We note that no attempt has been made in the present work to optimize the complexity of the linear algebra.

Subsequent to the conference version of this paper [8], some of the authors extended Tardos’s framework to the real model of computation [14], showing that poly\((n,m,\log \bar{\chi }_A)\) running time can be achieved using approximate solvers in a black box manner. Combined with [57], one obtains a deterministic \(O(mn^{\omega +1} \log ^{O(1)}(n) \log (\bar{\chi }_A))\) LP algorithm; using the initial rescaling subroutine from this paper, the dependence can be improved to \({\bar{\chi }}^*_A\) resulting in a running time of \(O(mn^{\omega +1} \log ^{O(1)}(n) \log (\bar{\chi }_A^* + n))\). A weaker extension of Tardos’s framework to the real model of computation was previously given by Ho and Tunçel [21].

With regard to LLS algorithms, the original VY-algorithm required explicit knowledge of \(\bar{\chi }_A\) to implement its layering algorithm. The paper [35] showed that this could be avoided by computing all LLS steps associated with n candidate partitions and picking the best one. In particular, they showed that all such LLS steps can be computed in \(O(m^2 n)\) time. In [36], an alternate approach was presented to compute an LLS partition directly from the coefficients of the AS step. We note that these methods crucially rely on the variable ordering, and hence are not scaling invariant. Kitahara and Tsuchiya [27] gave a 2-layer LLS step which achieves a running time depending only on \(\bar{\chi }_A^*\) and the right-hand side b, but with no dependence on the objective, assuming the primal feasible region is bounded.

A series of papers have studied the central path from a differential geometry perspective. Monteiro and Tsuchiya [38] showed that a curvature integral of the central path, first introduced by Sonnevend, Stoer, and Zhao [47], is in fact upper bounded by \(O(n^{3.5} \log (\bar{\chi }^*_A+n))\). This has been extended to SDP and symmetric cone programming [26], and also studied in the context of information geometry [25].

Circuits have appeared in several papers on linear and integer optimization (see [13] and references within). The idea of using circuits within the context of LP algorithms also appears in [12]. They develop a circuit augmentation framework for LP (as well as ILP) and show that simplex-like algorithms that take steps according to the “best circuit” direction achieve linear convergence, though these steps are hard to compute. Recently, [11] used circuit imbalance measures to obtain a circuit augmentation algorithm for LP with poly\((n,\log (\bar{\chi }_A))\) iterations. We refer to [16] for an overview on circuit imbalances and their applications.

Our algorithm makes progress towards strongly polynomial solvability of LP, by improving the dependence poly\((n,m,\log \bar{\chi }_A)\) to poly\((n,m,\log \bar{\chi }^*_A)\). However, in a remarkable recent paper, Allamigeon, Benchimol, Gaubert, and Joswig [1] have shown, using tools from tropical geometry, that path-following methods for the standard logarithmic barrier cannot be strongly polynomial. In particular, they give a parametrized family of instances, where, for sufficiently large parameter values, any sequence of iterations following the central path must be of exponential length—thus, \(\bar{\chi }^*_A\) will be doubly exponential. We note that very recently, Allamigeon, Gaubert, and Vandame [3] strengthened this result, showing that no interior point method using a self-concordant barrier function may be strongly polynomial.

As a further recent development, Allamigeon, Dadush, Loho, Natura, and Végh [2] complement these negative results by giving a weakly polynomial interior point method that always terminates in at most \(O(2^n n^{1.5}\log n)\) iterations—even when \(\log \bar{\chi }^*_A\) is unbounded. Moreover, their interior point method is ‘universal’: it matches the number of iterations of any interior point method that uses a self-concordant barrier function up to a factor \(O(n^{1.5} \log n)\). The ‘subspace LLS’ step used in the paper is a generalization of the LLS step, using restricted movements in general subspaces, not only coordinate subspaces.

1.3 Organization

The rest of the paper is organized as follows. We conclude this section by introducing some notation. Section 2 discusses our results on the circuit imbalance measure. It starts with Sect. 2.1 on the necessary background on the condition measures \(\bar{\chi }_A\) and \(\bar{\chi }^*_A\). Section 2.2 introduces the circuit imbalance measure, and formulates and explains all main results of Sect. 2. The proofs are given in the remaining subsections: basic properties in Sect. 2.3, the min-max characterization in Sect. 2.4, the circuit finding algorithm in Sect. 2.5, and the algorithms for approximating \(\bar{\chi }^*_A\) and \(\bar{\chi }_A\) in Sect. 2.6.

In Sect. 3, we develop our scaling invariant interior-point method. Interior-point preliminaries are given in Sect. 3.1. Section 3.2 introduces the affine scaling and layered-least-squares directions, and proves some basic properties. Section 3.3 provides a detailed overview of the high level ideas and a roadmap to the analysis. Section 3.4 further develops the theory of LLS directions and introduces partition lifting scores. Section 3.5 gives our scaling invariant layering procedure, and our overall algorithm can be found in Sect. 3.6.

In Sect. 4, we give the potential function proof for the improved iteration bound, relying on technical lemmas. The full proof of these lemmas is deferred to Sect. 6; however, Sect. 4 provides the high-level ideas to each proof. Section 4.1 shows that our argument also leads to a factor \(\Omega (n/\log n)\) improvement in the iteration complexity bound of the VY-algorithm.

In Sect. 5, we prove the technical properties of our LLS step, including its proximity to AS and step length estimates. Finally, in Sect. 7, we discuss the initialization of the interior-point method.

Besides reading the paper linearly, we suggest two other possible ways of navigating the paper. Readers mainly interested in the circuit imbalance measure and its approximation may focus only on Sect. 2; this part can be understood without any familiarity with interior point methods. Other readers, who wish to mainly focus on our interior point algorithm may read Sect. 2 only up to Sect. 2.2; this includes all concepts and statements necessary for the algorithm.

1.4 Notation

Our notation will largely follow [36, 37]. We let \(\mathbb {R}_{++}\) denote the set of positive reals, and \(\mathbb {R}_+\) the set of nonnegative reals. For \(n\in \mathbb {N}\), we let \([n]=\{1,2,\ldots ,n\}\). Let \(e^i\in \mathbb {R}^n\) denote the ith unit vector, and \(e\in \mathbb {R}^n\) the all-ones vector. For a vector \(x\in \mathbb {R}^n\), we let \({\text {Diag}}(x)\in \mathbb {R}^{n\times n}\) denote the diagonal matrix with x on the diagonal. We let \(\textbf{D}\) denote the set of all positive \(n\times n\) diagonal matrices and \(\textbf{I}_k\) denote the \(k \times k\) identity matrix. For \(x,y\in \mathbb {R}^n\), we use the notation \(xy\in \mathbb {R}^n\) to denote \(xy={\text {Diag}}(x)y=(x_iy_i)_{i\in [n]}\). The inner product of the two vectors is denoted as \(x^\top y\). For \(p\in \mathbb {Q}\), we also use the notation \(x^{p}\) to denote the vector \((x_i^{p})_{i\in [n]}\). Similarly, for \(x,y\in \mathbb {R}^n\), we let x/y denote the vector \((x_i/y_i)_{i\in [n]}\). We denote the support of a vector \(x \in \mathbb {R}^n\) by \(\textrm{supp}(x) = \{i\in [n]: x_i \ne 0\}\).

For an index subset \(I\subseteq [n]\), we use \(\pi _I: \mathbb {R}^n \rightarrow \mathbb {R}^I\) for the coordinate projection. That is, \(\pi _I(x)=x_I\), and for a subset \(S\subseteq \mathbb {R}^n\), \(\pi _I(S)=\{x_I:\, x\in S\}\). We let \(\mathbb {R}^n_I = \{x \in \mathbb {R}^n: x_{[n]{\setminus } I} = 0\}\).

For a matrix \(B\in \mathbb {R}^{n\times k}\), \(I\subset [n]\) and \(J\subset [k]\) we let \(B_{I,J}\) denote the submatrix of B restricted to the set of rows in I and columns in J. We also use \(B_{I,{\varvec{\cdot }}}=B_{I,[k]}\) and \(B_J=B_{{\varvec{\cdot }},J}=B_{[n],J}\). We let \(B^{\dagger }\in \mathbb {R}^{k\times n}\) denote the pseudo-inverse of B.

We let \({\text {Ker}}(A)\) denote the kernel of the matrix \(A \in \mathbb {R}^{m\times n}\). Throughout, we assume that the matrix A in (LP) has full row rank, and that \(n\ge 3\).

We use the real model of computation, allowing basic arithmetic operations \(+\), −, \(\times \), /, comparisons, and square root computations. We keep (exact) square root computations for simplicity but we note that these could be avoided.

Subspace formulation Throughout the paper, we let \(W={\text {Ker}}(A)\subseteq \mathbb {R}^n\) denote the kernel of the matrix A. Using this notation, (LP) can be written in the form

$$\begin{aligned} \begin{aligned} \min \;&c^\top x \\ x&\in W + d \\ x&\ge 0, \end{aligned} \quad \quad \begin{aligned} \max \;&d^\top (c-s) \\ s&\in W^\perp +c \\ s&\ge 0, \end{aligned} \end{aligned}$$
(2)

where \(d\in \mathbb {R}^n\) satisfies \(Ad = b\). One can, e.g., choose d as the minimum norm solution \(d = {{\,\mathrm{arg\,min}\,}}\{\Vert x\Vert : Ax=b\} = A^\top (AA^\top )^{-1} b\). Note that \(s \in W^\perp +c\) is equivalent to \(\exists y \in \mathbb {R}^m\) such that \(A^\top y + s = c\). Hence, the original variable y is implicit in this formulation.
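
For illustration, the minimum-norm choice of d can be computed directly; the following numpy sketch (with a made-up instance) verifies \(Ad = b\).

```python
import numpy as np

# Made-up instance; d = A^T (A A^T)^{-1} b is the minimum-norm solution of Ad = b.
A = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
b = np.array([1.0, 2.0])
d = A.T @ np.linalg.solve(A @ A.T, b)
assert np.allclose(A @ d, b)
print(d)
```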

Table 1 Recurring symbols that will be defined throughout the paper

2 Finding an approximately optimal rescaling

2.1 The condition number \(\bar{\chi }\)

The condition number \(\bar{\chi }_A\) is defined as

$$\begin{aligned} \bar{\chi }_A&= \sup \left\{ \Vert A^\top \left( A D A^\top \right) ^{-1}AD\Vert \,: D\in {\textbf{D}}\right\} \\&= \sup \left\{ \frac{\left\Vert A^\top y\right\Vert }{\left\Vert p\right\Vert }: y \text { minimizes } \left\Vert D^{1/2}(A^\top y - p)\right\Vert \text { for some }0 \ne p \in \mathbb {R}^n \text { and }D \in \textbf{D}\right\} . \end{aligned}$$
(3)

This condition number was first studied by Dikin [9, 10], Stewart [49], and Todd [52], among others, and plays a key role in the analysis of the Vavasis–Ye interior point method [63]. There is an extensive literature on the properties and applications of \(\bar{\chi }_A\), as well as its relations to other condition numbers. We refer the reader to the papers [21, 36, 63] for further results and references.

It is important to note that \(\bar{\chi }_A\) only depends on the subspace \(W={\text {Ker}}(A)\). Hence, we can also write \(\bar{\chi }_W\) for a subspace \(W\subseteq \mathbb {R}^n\), defined to be equal to \(\bar{\chi }_A\) for some matrix \(A\in \mathbb {R}^{k\times n}\) with \(W={\text {Ker}}(A)\). We will use the notations \(\bar{\chi }_A\) and \(\bar{\chi }_W\) interchangeably.

The next lemma summarizes some important known properties of \(\bar{\chi }_A\).

Proposition 2.1

  Let \(A\in \mathbb {R}^{m\times n}\) with full row rank and \(W={\text {Ker}}(A)\).

  1. (i)

    If the entries of A are all integers, then \(\bar{\chi }_A\) is bounded by \(2^{O(L_A)}\), where \(L_A\) is the input bit length of A.

  2. (ii)

\(\bar{\chi }_A = \max \{ \Vert B^{-1} A\Vert : B \text { is a non-singular } m \times m \text { submatrix of } A\} \).

  3. (iii)

    Let the columns of \(B \in \mathbb {R}^{n \times (n-m)}\) form an orthonormal basis of W. Then

    $$\begin{aligned} \bar{\chi }_W = \max \left\{ \Vert B B_{I,{\varvec{\cdot }}}^\dagger \Vert : \emptyset \ne I \subset [n]\right\} \,. \end{aligned}$$
  4. (iv)

    \(\bar{\chi }_W=\bar{\chi }_{W^\perp }\).

Proof

  Part (i) was proved in [63, Lemma 24]. For part (ii), see [53, Theorem 1] and [63, Lemma 3]. In part (iii), the direction \(\ge \) was proved in [49], and the direction \(\le \) in [40]. The duality statement (iv) was shown in [18]. \(\square \)

In Proposition 3.8, we will also give another proof of (iv). We now define the lifting map, a key operation in this paper, and explain its connection to \(\bar{\chi }_A\).

Definition 2.2

Let us define the lifting map \(L_I^W: \pi _{I}(W) \rightarrow W\) by

$$\begin{aligned} L_I^W(p) = {{\,\mathrm{arg\,min}\,}}\left\{ \Vert z\Vert : z_I = p, z \in W\right\} . \end{aligned}$$

Note that \(L_I^W\) is the unique linear map from \(\pi _{I}(W)\) to W such that \(\left( L_I^W(p)\right) _I = p\) and \(L_I^W(p)\) is orthogonal to \(W \cap \mathbb {R}^n_{[n]\setminus I}\).

Lemma 2.3

Let \(W \subseteq \mathbb {R}^n\) be an \((n-m)\)-dimensional linear subspace. Let the columns of \(B \in \mathbb {R}^{n \times (n-m)}\) denote an orthonormal basis of W. Then, viewing \(L_I^W\) as a matrix in \(\mathbb {R}^{n\times |I|}\),

$$\begin{aligned} L_I^W = B B_{I,{\varvec{\cdot }}}^\dagger \,. \end{aligned}$$

Proof

If \(p \in \pi _I(W)\), then \(p = B_{I,{\varvec{\cdot }}} y\) for some \(y \in \mathbb {R}^{n-m}\). By the well-known property of the pseudo-inverse we get \(B_{I,{\varvec{\cdot }}}^\dagger p = {{\,\mathrm{arg\,min}\,}}_{p = B_{I,{\varvec{\cdot }}} y}\Vert y\Vert \). This solution satisfies \(\pi _I(BB_{I,{\varvec{\cdot }}}^\dagger p) = p\) and \(BB_{I,{\varvec{\cdot }}}^\dagger p \in W\). Since the columns of B form an orthonormal basis of W, we have \(\Vert BB_{I,{\varvec{\cdot }}}^\dagger p\Vert =\Vert B_{I,{\varvec{\cdot }}}^\dagger p\Vert \). Consequently, \(BB_{I,{\varvec{\cdot }}}^\dagger p\) is the minimum-norm point with the above properties. \(\square \)

The above lemma and Proposition 2.1(iii) yield the following characterization. This will be the most suitable characterization of \(\bar{\chi }_W\) for our purposes.

Proposition 2.4

For a linear subspace \(W \subseteq \mathbb {R}^n\),

$$\begin{aligned} \bar{\chi }_W =\max \left\{ \Vert L_I^W\Vert \,: {I\subseteq [n]}, I\ne \emptyset \right\} \,. \end{aligned}$$

The following notation will be convenient for our algorithm. For a subspace \(W\subseteq \mathbb {R}^n\) and an index set \(I\subseteq [n]\), if \(\pi _I(W) \ne \left\{ 0 \right\} \) then we define the lifting score

$$\begin{aligned} \ell ^W(I):=\sqrt{\Vert L^W_{I}\Vert ^2-1}\,. \end{aligned}$$
(4)

Otherwise, we define \(\ell ^W(I) = 0\). This means that for any \(z\in \pi _I(W)\) and \(x = L_I^W(z)\), \(\Vert x_{[n]{\setminus } I}\Vert \le \ell ^W(I)\Vert z\Vert \).
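
For intuition, the following Python sketch (exponential-time and for toy sizes only; the matrix and function names are made-up for illustration) computes the lifting map \(L_I^W = B B_{I,{\varvec{\cdot }}}^\dagger \) of Lemma 2.3, the lifting score \(\ell ^W(I)\), and \(\bar{\chi }_W\) via Proposition 2.4 by enumerating all nonempty index sets I.

```python
import numpy as np
from itertools import combinations
from scipy.linalg import null_space

# Made-up toy matrix; exponential enumeration, for illustration only.
A = np.array([[1.0, 1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0, -1.0]])
B = null_space(A)                                # orthonormal basis of W = Ker(A)
n = A.shape[1]

def lift_matrix(I):
    # the lifting map L_I^W = B B_{I,.}^+ of Lemma 2.3, as an n x |I| matrix
    return B @ np.linalg.pinv(B[list(I), :])

def lifting_score(I):
    # ell^W(I) = sqrt(||L_I^W||^2 - 1), as in (4)
    return np.sqrt(max(np.linalg.norm(lift_matrix(I), 2) ** 2 - 1.0, 0.0))

# chi_bar_W = max over nonempty I of the operator norm ||L_I^W|| (Proposition 2.4)
chi_bar = max(np.linalg.norm(lift_matrix(I), 2)
              for r in range(1, n + 1) for I in combinations(range(n), r))
print(chi_bar, lifting_score((0, 1)))
```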

The condition number \(\bar{\chi }^*_A\) For every \(D\in {\textbf{D}}\), we can consider the condition number \(\bar{\chi }_{DW}=\bar{\chi }_{AD^{-1}}\). We let

$$\begin{aligned} \bar{\chi }^*_W=\bar{\chi }^*_A=\inf \{\bar{\chi }_{DW}\,: D\in {\textbf{D}}\}\, \end{aligned}$$

denote the best possible value of \(\bar{\chi }\) that can be attained by rescaling the coordinates of W. The main result of this section is the following theorem.

Theorem 2.5

(Proof in Sect. 2.6) There is an \(O(n^2m^2 + n^3)\) time algorithm that for any matrix \(A\in \mathbb {R}^{m\times n}\) computes an estimate \(\xi \) of \(\bar{\chi }_W\) such that

$$\begin{aligned} \xi \le \bar{\chi }_W \le n(\bar{\chi }_W^*)^2 \xi \end{aligned}$$

and a \(D\in {\textbf{D}}\) such that

$$\begin{aligned} \bar{\chi }^*_W\le \bar{\chi }_{DW}\le n(\bar{\chi }_W^*)^3\,. \end{aligned}$$

2.2 The circuit imbalance measure

The key tool in proving Theorem 2.5 is to study a more combinatorial condition number, the circuit imbalance measure, which turns out to be a good proxy for \(\bar{\chi }_A\).

Definition 2.6

For a linear subspace \(W \subseteq \mathbb {R}^n\) and a matrix A such that \(W = {\text {Ker}}(A)\), a circuit is an inclusion-wise minimal dependent set of columns of A. Equivalently, a circuit is a set \(C \subseteq [n]\) such that \(W \cap \mathbb {R}^n_C\) is one-dimensional and that no strict subset of C has this property. The set of circuits of W is denoted by \(\mathcal {C}_W\).

Note that circuits defined above are the same as the circuits in the linear matroid associated with A. Every circuit \(C\in \mathcal {C}_W\) can be associated with a vector \(g^C \in W\) such that \(\textrm{supp}(g^C) = C\); this vector is unique up to scalar multiplication.

Definition 2.7

For a circuit \(C \in \mathcal {C}_W\) and \(i,j \in C\), we let

$$\begin{aligned} \kappa ^W_{ij}(C)=\frac{\left| g^C_j\right| }{\left| g^C_i\right| }. \end{aligned}$$
(5)

Note that since \(g^C\) is unique up to scalar multiplication, this is independent of the choice of \(g^C\). For any \(i,j\in [n]\), we define the circuit ratio as the maximum of \(\kappa ^W_{ij}(C)\) over all choices of the circuit C:

$$\begin{aligned} \kappa ^W_{ij}=\max \left\{ \kappa ^W_{ij}(C):\, C\in \mathcal {C}_W, i,j\in C\right\} . \end{aligned}$$
(6)

By convention we set \(\kappa ^W_{ij} = 0\) if there is no circuit containing both i and j. Further, we define the circuit imbalance measure as

$$\begin{aligned} \kappa _W=\max \left\{ \kappa ^W_{ij}:\, i, j\in [n]\right\} \,. \end{aligned}$$

Minimizing over all coordinate rescalings, we define

$$\begin{aligned} \kappa _W^* = \min \left\{ \kappa _{DW}:\, D \in \textbf{D}\right\} \,. \end{aligned}$$

We omit the index W whenever it is clear from context. Further, for a vector \(d\in \mathbb {R}^n_{++}\), we write \(\kappa _{ij}^d = \kappa _{ij}^{{\text {Diag}}(d)W}\) and \(\kappa ^d = \kappa ^d_W=\kappa _{{\text {Diag}}(d)W}\).

We want to remark that a priori it is not clear that \(\kappa _W^*\) is well-defined. Theorem 2.12 will show that the minimum of \(\{\kappa _{DW}:\, D\in \textbf{D}\}\) is indeed attained.
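
As a concrete illustration of Definitions 2.6 and 2.7, the following brute-force Python sketch (exponential-time, toy sizes only; the matrix is made-up) enumerates the circuits of \(W={\text {Ker}}(A)\) and computes the circuit ratios \(\kappa ^W_{ij}\) and the circuit imbalance \(\kappa _W\).

```python
import numpy as np
from itertools import combinations
from scipy.linalg import null_space

# Made-up toy matrix.
A = np.array([[1.0, 0.0, 1.0, 2.0],
              [0.0, 1.0, 1.0, -1.0]])
n = A.shape[1]

def circuits(A):
    """Inclusion-wise minimal dependent column sets C: Ker(A_C) is one-dimensional
    and its generator g^C is nonzero on all of C."""
    found = []
    for r in range(2, n + 1):
        for C in combinations(range(n), r):
            if any(set(D) <= set(C) for D in found):
                continue                                    # an already found circuit is a subset
            N = null_space(A[:, list(C)])
            if N.shape[1] == 1 and np.all(np.abs(N) > 1e-12):
                found.append(C)
    return found

kappa = {}                                                   # circuit ratios kappa_{ij}
for C in circuits(A):
    g = null_space(A[:, list(C)])[:, 0]                      # the circuit solution g^C
    for a, i in enumerate(C):
        for b, j in enumerate(C):
            if i != j:
                kappa[(i, j)] = max(kappa.get((i, j), 0.0), abs(g[b] / g[a]))
print(max(kappa.values()))                                   # circuit imbalance kappa_W
```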

We next formulate the main statements on the circuit imbalance measure; proofs will be given in the subsequent subsections. Crucially, we show that the circuit imbalance \(\kappa _W\) is a good proxy for the condition number \(\bar{\chi }_W\). The lower bound was already proven in [56], and the upper bound is from [14]. A slightly weaker upper bound \(\sqrt{1 + (n\kappa _W)^2}\) was previously given in the conference version of this paper [8].

Theorem 2.8

(Proof in Sect. 2.3) For a linear subspace \(W\subseteq \mathbb {R}^n\),

$$\begin{aligned} \sqrt{1 + (\kappa _W)^2} \le \bar{\chi }_W\le n\kappa _W. \end{aligned}$$

We now overview some basic properties of \(\kappa _W\). Proposition 2.4 asserts that \(\bar{\chi }_W\) is the maximum \(\ell _2\rightarrow \ell _2\) operator norm of the mappings \(L_I^W\) over \(I\subseteq [n]\). In [14], it was shown that \(\kappa _W\) is in contrast the maximum \(\ell _1\rightarrow \ell _\infty \) operator norm of the same mappings; this easily implies the upper bound \(\bar{\chi }_W\le n\kappa _W\).

Proposition 2.9

[14] For a linear subspace \(W \subseteq \mathbb {R}^n\),

$$\begin{aligned} \kappa _W =\max \left\{ \frac{\Vert L_I^W(p)\Vert _\infty }{\Vert p\Vert _1}\,: {I\subseteq [n]}, I\ne \emptyset , p\in \pi _I(W)\setminus \{0\}\right\} \,. \end{aligned}$$

Similarly to \(\bar{\chi }_W\), \(\kappa _W\) is self-dual; this holds for all individual \(\kappa _{ij}^W\) values as well.

Lemma 2.10

(Proof in Sect. 2.3) For any subspace \(W \subseteq \mathbb {R}^n\) and \(i,j \in [n]\), \(\kappa _{ij}^W = \kappa _{ji}^{W^\perp }\).

The next lemma provides a subroutine that efficiently yields upper bounds on \(\ell ^W(I)\) or lower bounds on some circuit imbalance values. Recall the definition of the lifting score \(\ell ^W(I)\) from (4).

Lemma 2.11

(Proof in Sect. 2.3) There exists a subroutine Verify-Lift(\(W,I,\theta \)) that, given a linear subspace \(W\subseteq \mathbb {R}^n\), an index set \(I\subseteq [n]\), and a threshold \(\theta \in \mathbb {R}_{++}\), either returns the answer ‘pass’, verifying \(\ell ^W(I)\le \theta \), or returns the answer ‘fail’, and a pair \(i \in I, j \in [n] \setminus I\) such that \(\theta /n\le \kappa ^W_{ij}\). The running time can be bounded as \(O(n(n-m)^2)\).

The proofs of the above statements are given in Sect. 2.3.

A min-max theorem We next provide a combinatorial min-max characterization of \(\kappa ^*_W\). Consider the circuit ratio digraph \(G=([n],E)\) on the node set [n] where \((i,j)\in E\) if \(\kappa _{ij}>0\), that is, there exists a circuit \(C\in \mathcal {C}_W\) with \(i,j\in C\). We will refer to \(\kappa _{ij}=\kappa _{ij}^W\) as the weight of the edge (i, j). (Note that \((i,j)\in E\) if and only if \((j,i)\in E\), but the weight of these two edges can be different.)

Let H be a cycle in G, that is, a sequence of indices \(i_1,i_2,\dots ,i_k, i_{k+1} = i_1\). We use \(|H|=k\) to denote the length of the cycle. (In our terminology, ‘cycles’ always refer to objects in G, whereas ‘circuits’ refer to the minimum supports in \({\text {Ker}}(A)\).)

We use the notation \(\kappa (H)=\kappa _W(H)=\prod _{j=1}^k \kappa ^W_{i_j i_{j+1}}\). For a vector \(d\in \mathbb {R}^n_{++}\), we denote \(\kappa ^d_W(H)=\kappa _{{\text {Diag}}(d)W}(H)\). A simple but important observation is that such a rescaling does not change the value associated with the cycle, that is,

$$\begin{aligned} \kappa ^d_W(H)=\kappa _W(H)\quad \forall d\in \mathbb {R}^n_{++}\quad \text { for any cycle } H \text { in } G\,. \end{aligned}$$
(7)

Theorem 2.12

(Proof in Sect. 2.4) For a subspace \(W\subset \mathbb {R}^n\), we have

$$\begin{aligned} \kappa _W^* = \min _{d > 0} \kappa _W^d = \max \left\{ \kappa _W(H)^{1/|H|}:\ H \text { is a cycle in } G\right\} \,. \end{aligned}$$

The proof relies on the following formulation:

$$\begin{aligned} \begin{aligned} \kappa ^*_W=\quad \quad \quad \min \;&t \\ \kappa _{ij}d_j/d_i&\le t \quad \forall (i,j) \in E \\ d&> 0. \end{aligned} \end{aligned}$$

Taking logarithms, we can rewrite this problem as

$$\begin{aligned} \begin{aligned} \min \;&s \\ \log \kappa _{ij} + z_j - z_i&\le s \quad \forall (i,j) \in E \\ z&\in \mathbb {R}^n. \end{aligned} \end{aligned}$$

This is the dual of the minimum-mean cycle problem with weights \(\log \kappa _{ij}\), and can be solved in polynomial time (see e.g. [4, Theorem 5.8]).
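
As a small illustration, the following Python sketch solves this logarithmic formulation with scipy's linprog for a made-up set of circuit ratio values (which could, e.g., be obtained as in the enumeration sketch above); when the exact \(\kappa _{ij}\) are supplied, the optimum \(e^{s^*}\) equals \(\kappa ^*_W\) and \(d = e^{z}\) gives an optimal rescaling.

```python
import numpy as np
from scipy.optimize import linprog

# Made-up toy circuit ratio values kappa[(i, j)] on 3 variables.
kappa = {(0, 1): 4.0, (1, 0): 1.0, (1, 2): 2.0, (2, 1): 2.0, (0, 2): 8.0, (2, 0): 0.5}
n = 3
edges = list(kappa)
A_ub = np.zeros((len(edges), n + 1))        # variables (z_1, ..., z_n, s)
b_ub = np.zeros(len(edges))
for row, (i, j) in enumerate(edges):
    A_ub[row, j], A_ub[row, i], A_ub[row, n] = 1.0, -1.0, -1.0   # z_j - z_i - s <= -log kappa_ij
    b_ub[row] = -np.log(kappa[(i, j)])
c = np.zeros(n + 1)
c[n] = 1.0                                  # minimize s
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * (n + 1))
kappa_star, d = np.exp(res.x[n]), np.exp(res.x[:n])
print(kappa_star, d)                        # optimal value and the rescaling d
```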

Whereas this formulation verifies Theorem 2.12, it does not give a polynomial-time algorithm to compute \(\kappa ^*_W\). The caveat is that the values \(\kappa ^W_{ij}\) are typically not available; in fact, approximating them up to a factor \(2^{O(m)}\) is NP-hard, as follows from the work of Tunçel [54].

Nevertheless, the following corollary of Theorem 2.12 shows that an arbitrary circuit containing i and j yields a \((\kappa ^*)^2\)-approximation of \(\kappa _{ij}\).

Corollary 2.13

(Proof in Sect. 2.4) Let us be given a linear subspace \(W\subseteq \mathbb {R}^n\) and \(i,j\in [n]\), \(i\ne j\), and a circuit \(C\in \mathcal {C}_W\) with \(i,j\in C\). Let \(g\in W\) be the corresponding vector with \(\textrm{supp}(g)=C\). Then,

$$\begin{aligned} \frac{\kappa ^W_{ij}}{\left( \kappa _W^*\right) ^2}\le \frac{|g_j|}{|g_i|}\le \kappa ^W_{ij}. \end{aligned}$$

The above statements are shown in Sect. 2.4. In Sect. 2.5, we use techniques from matroid theory and linear algebra to efficiently identify a circuit for any pair of variables that are contained in the same circuit. A matroid is non-separable if the circuit hypergraph is connected; precise definitions and background will be described in Sect. 2.5.

Theorem 2.14

(Proof in Sect. 2.5) Given \(A\in \mathbb {R}^{m\times n}\), there exists an \(O(n^2 m^2)\) time algorithm Find-Circuits(A) that obtains a decomposition of \(\mathcal{M}(A)\) into a direct sum of non-separable linear matroids, and returns a family \(\hat{\mathcal {C}}\) of circuits such that if i and j are in the same non-separable component, then there exists a circuit in \(\hat{\mathcal {C}}\) containing both i and j. Further, for each \(i\ne j\) in the same component, the algorithm returns a value \(\hat{\kappa }_{ij}\) as the maximum of \(|g_j/g_i|\) such that \(g\in W\), \(\textrm{supp}(g)=C\) for some \(C\in \hat{\mathcal {C}}\) containing i and j. For these values, \(\hat{\kappa }_{ij} \le \kappa _{ij} \le (\kappa ^*)^2\hat{\kappa }_{ij}\).

Finally, in Sect. 2.6, we combine the above results to prove Theorem 2.5 on approximating \(\bar{\chi }^{*}_{W}\) and \(\kappa ^*_W\).

Section 2.5 contains an interesting additional statement, namely that the logarithms of the circuit ratios satisfy the triangle inequality. This will also be useful in the analysis of the LLS algorithm. The proof uses similar arguments as the proof of Theorem 2.14. A simpler proof of this statement was subsequently given in [16].

Lemma 2.15

(Proof in Sect. 2.5)

  1. (i)

    For any distinct ijk in the same connected component of \(\mathcal {C}_W\), and any \(g^C\) with \(i,j \in C\), \(C \in \mathcal {C}_W\), there exist circuits \(C_1, C_2 \in \mathcal {C}_W\), \(i,k \in C_1\), \(j,k \in C_2\) such that \(|g^C_j/g^C_i| = |g^{C_2}_j/g^{C_2}_k| \cdot |g^{C_1}_k/g^{C_1}_i|\).

  2. (ii)

    For any distinct ijk in the same connected component of \(\mathcal {C}_W\), \(\kappa _{ij} \le \kappa _{ik}\cdot \kappa _{kj}\).

2.3 Basic properties of \(\kappa _W\)

Theorem 2.8

(Restatement). For a linear subspace \(W\subseteq \mathbb {R}^n\),

$$\begin{aligned} \sqrt{1 + (\kappa _W)^2} \le \bar{\chi }_W\le n\kappa _W. \end{aligned}$$

Proof

For the first inequality, let \(C \in \mathcal {C}_W\) be a circuit and \(i\ne j \in C\) be such that \(|g_j/g_i| = \kappa _W\) for the corresponding solution \(g=g^C\). Let us use the characterization of \(\bar{\chi }_W\) in Proposition 2.4. Let \(I=([n]\setminus C)\cup \{i\}\), and \(p=g_i e^i\), that is, the vector with \(p_i=g_i\) and \(p_k=0\) for \(k\ne i\). Then, the unique vector \(z\in W\) such that \(z_I=p\) is \(z=g\). Therefore,

$$\begin{aligned} \bar{\chi }_W\ge \min _{z\in W, z_I=p}\frac{\Vert z\Vert }{\Vert p\Vert }=\frac{\Vert g\Vert }{|g_i|}\ge \frac{\sqrt{|g_i|^2 + |g_j|^2}}{|g_i|}=\sqrt{1+\kappa _W^2}. \end{aligned}$$

The second inequality is immediate from Propositions 2.4 and 2.9, and the inequalities between \(\ell _1\), \(\ell _2\), and \(\ell _\infty \) norms. The proof of the slightly weaker \(\bar{\chi }_W\le \sqrt{1+(n\kappa _W)^2}\) follows from Lemma 2.11. \(\square \)

The next lemma will be needed to prove Lemma 2.11 and also to analyze the LLS algorithm. Let us say that the vector \(y \in \mathbb {R}^n\) conforms to \(x\in \mathbb {R}^n\) if \(x_iy_i >0\) whenever \(y_i\ne 0\).

Lemma 2.16

For \(i \in I \subset [n]\) with \(e^i_I \in \pi _I(W)\), let \(z = L_I^W(e^i_I)\). Then for any \(j \in \textrm{supp}(z)\) we have \(\kappa _{ij}^W \ge |z_j|\).

Proof

We consider the cone \(F \subset W\) of vectors that conform to z. The faces of F are bounded by inequalities of the form \(z_k y_k \ge 0\) or \(y_k = 0\). The edges (rays) of F are of the form \(\{\alpha g:\, \alpha \ge 0\}\) with \(\textrm{supp}(g) \in \mathcal {C}_W\). It is easy to see from the Minkowski–Weyl theorem that z can be written as

$$\begin{aligned} z=\sum _{k=1}^h g^k, \end{aligned}$$

where \(h\le n\), \(C_1,C_2,\ldots ,C_h\in \mathcal {C}_W\) are circuits, and the vectors \(g^1,g^2,\ldots ,g^h\in W\) conform to z and \(\textrm{supp}(g^k)=C_k\) for all \(k\in [h]\). Note that \(i \in C_k\) for all \(k\in [h]\), as otherwise, \(z'=z-g^k\) would also satisfy \(z'_I=e^i_I\), but \(\Vert z'\Vert <\Vert z\Vert \) due to \(g^k\) being conformal to z, a contradiction to the definition of z.

At least one \(k \in [h]\) contributes at least as much to \(|z_j| = \frac{\sum _{k=1}^h |g^k_j|}{\sum _{k=1}^h g^k_i}\) as the average. Hence we find \(\kappa _{ij}^W \ge |g^k_j/g^k_i| \ge |z_j|\). \(\square \)

Lemma 2.11

(Restatement). There exists a subroutine Verify-Lift(\(W,I,\theta \)) that, given a linear subspace \(W\subseteq \mathbb {R}^n\), an index set \(I\subseteq [n]\), and a threshold \(\theta \in \mathbb {R}_{++}\), either returns the answer ‘pass’, verifying \(\ell ^W(I)\le \theta \), or returns the answer ‘fail’, and a pair \(i \in I, j \in [n] \setminus I\) such that \(\theta /n\le \kappa ^W_{ij}\). The running time can be bounded as \(O(n(n-m)^2)\).

Proof

Take any minimal \(I'\subset I\) such that \(\dim (\pi _{I'}(W)) = \dim (\pi _I(W))\). Then we know that \(\pi _{I'}(W) = \mathbb {R}^{I'}\) and for \(p \in \pi _I(W)\) we can compute \(L_I^W(p) = L_{I'}^W(p_{I'})\). Let \(B \in \mathbb {R}^{([n] {\setminus } I) \times I'}\) be the matrix sending any \(q \in \pi _{I'}(W)\) to the corresponding vector \((L_{I'}^W(q))_{[n]\setminus I}\). The column \(B_i\) can be computed as \((L_{I'}^W(e^i_{I'}))_{[n]\setminus I}\) for \(e^i_{I'} \in \mathbb {R}^{I'}\). We have \(\Vert L_I^W(p)\Vert ^2 = \Vert p\Vert ^2 + \Vert (L_{I'}^W(p_{I'}))_{[n]{\setminus } I}\Vert ^2 \le \Vert p\Vert ^2 + \Vert B\Vert ^2\Vert p_{I'}\Vert ^2\) for any \(p \in \pi _I(W)\), and so \(\ell ^W(I)=\sqrt{\Vert L_I^W\Vert ^2-1} \le \Vert B\Vert \). We upper bound the operator norm by the Frobenius norm as \(\Vert B\Vert \le \Vert B\Vert _F = \sqrt{\sum _{ji} B_{ji}^2} \le n\max _{ji} |B_{ji}|\). By Lemma 2.16 it follows that \(|B_{ji}| = |(L_{I'}^W(e^i))_j| \le \kappa _{ij}^W\). The algorithm returns the answer ‘pass’ if \(n\max _{ji} |B_{ji}|\le \theta \) and ‘fail’ otherwise.

To implement the algorithm, we first need to select a minimal \(I'\subset I\) such that \(\dim (\pi _{I'}(W)) = \dim (\pi _I(W))\). This can be found by computing a matrix \(M\in \mathbb {R}^{n \times (n-m)}\) such that \(\textrm{range} (M)=W\), and selecting a maximal number of linearly independent columns of \(M_{I,{\varvec{\cdot }}}\). Then, we compute the matrix \(B \in \mathbb {R}^{([n] \setminus I) \times I'}\) that implements the transformation \([L_{I'}^W]_{[n]{\setminus } I}:\ \pi _{I'}(W)\rightarrow \pi _{[n]{\setminus } I}(W)\). The algorithm returns the pair (i, j) corresponding to the entry maximizing \(|B_{ji}|\). The running time analysis will be given in the proof of Lemma 3.15, together with an amortized analysis of a sequence of calls to the subroutine. \(\square \)

Remark 2.17

We note that the algorithm Verify-Lift does not need to compute the circuit as in Lemma 2.16. The following observation will be important in the analysis: the algorithm returns the answer ‘fail’ even if \(\ell ^W(I)\le \theta < n|B_{ji}|\).
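
The following Python sketch outlines one possible implementation of Verify-Lift along the lines of the proof above (the greedy choice of \(I'\), the toy matrix, and the function name verify_lift are ours; no attempt is made at the stated running time bound).

```python
import numpy as np
from scipy.linalg import null_space

def verify_lift(B, I, theta):
    """Sketch of Verify-Lift(W, I, theta), following the proof of Lemma 2.11.
    B: matrix whose columns form an orthonormal basis of W.
    Returns ('pass',) or ('fail', i, j) with i in I, j outside I."""
    n = B.shape[0]
    comp = [k for k in range(n) if k not in I]
    Iprime, r = [], 0                               # greedy I' with rank(B_{I'}) = rank(B_I)
    for k in I:
        if np.linalg.matrix_rank(B[Iprime + [k], :]) > r:
            Iprime.append(k)
            r += 1
    L = B @ np.linalg.pinv(B[Iprime, :])            # the lifting map L_{I'}^W as a matrix
    M = np.abs(L[comp, :])                          # |B_{ji}| <= kappa_{ij} by Lemma 2.16
    if M.size == 0 or n * M.max() <= theta:
        return ('pass',)
    j_pos, i_pos = np.unravel_index(np.argmax(M), M.shape)
    return ('fail', Iprime[i_pos], comp[j_pos])

# made-up toy matrix and index set
A = np.array([[1.0, 0.0, 1.0, 100.0],
              [0.0, 1.0, 1.0, -1.0]])
print(verify_lift(null_space(A), I=[0, 1], theta=1.0))   # ('fail', 1, 2)
```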

We now prove the duality property of the circuit imbalances.

Lemma 2.10

(Restatement). For any subspace \(W \subseteq \mathbb {R}^n\) and \(i,j \in [n]\), \(\kappa _{ij}^W = \kappa _{ji}^{W^\perp }\).

Proof

Choose a circuit \(C \in \mathcal {C}_W\) and corresponding circuit solution \(g:= g^C \in W\cap \mathbb {R}^n_C\) such that \(\kappa _{ij} = \kappa _{ij}(C) = |g_j/g_i|\). We will construct a circuit solution in \(W^\perp \) that certifies \(\kappa _{ji}^{W^\perp } \ge \kappa _{ij}^W\).

Define \(h \in \mathbb {R}^C\) by \(h_i = g_j, h_j = -g_i\) and \(h_k = 0\) for all \(k\in C\setminus \{i,j\}\). Then, h is orthogonal to \(g_C\) by construction, and hence \(h \in (\pi _C(W \cap \mathbb {R}^n_C))^\perp = \pi _C(W^\perp )\). Furthermore, we have \(\textrm{supp}(h) \in \mathcal {C}_{\pi _C(W^\perp )}\) since \(h \in \mathbb {R}^C\) is a support minimal vector orthogonal to \(g^C\).

Take any vector \(\bar{h} \in W^\perp \) satisfying \(\bar{h}_C = h\) that is support minimal subject to these constraints. We claim that \(\textrm{supp}(\bar{h}) \in \mathcal {C}_{W^\perp }\). Assume not, then there exists a non-zero \(v \in W^\perp \) with \(\textrm{supp}(v) \subset \textrm{supp}(\bar{h})\). Since \(\textrm{supp}(\pi _C(v)) \subseteq \textrm{supp}(\pi _C(\bar{h})) = \textrm{supp}(h)\), we must have either \(v_C=0\) or \(v_C = s h\) for \(s\ne 0\). If \(v_C=0\), then \(\bar{h}-\alpha v\) is also in \( W^\perp \) satisfying \(\pi _C(\bar{h} - \alpha v) = h\) for all \(\alpha \in \mathbb {R}\), and since \(v\ne 0\) we can choose \(\alpha \) such that \(\bar{h}-\alpha v\) has smaller support than \(\bar{h}\), a contradiction. If \(s\ne 0\) then \(v/s \in W^\perp \) satisfies \(\pi _C(v/s) = h\) and has smaller support than \(\bar{h}\), again a contradiction.

By the above construction, we have

$$\begin{aligned}\kappa _{ji}^{W^\perp }\ge \left| \frac{\bar{h}_i}{\bar{h}_j}\right| =\left| \frac{h_i}{h_j}\right| =\left| \frac{g_j}{g_i}\right| =\kappa _{ij}^W\,. \end{aligned}$$

By swapping the role of W and \(W^\perp \) and i and j, we obtain \(\kappa _{ij}^W\ge \kappa _{ji}^{W^\perp }\). The statement follows. \(\square \)

2.4 A min–max theorem on \(\kappa ^*_W\)

The proof of the characterization of \(\kappa _W^*\) follows.

Theorem 2.12

(Restatement). For a subspace \(W\subset \mathbb {R}^n\), we have

$$\begin{aligned} \kappa _W^* = \min _{d > 0} \kappa _W^d = \max \left\{ \kappa _W(H)^{1/|H|}:\ H \text { is a cycle in }G\right\} \,. \end{aligned}$$

Proof

For the direction \(\kappa _W(H)^{1/|H|}\le \kappa _W^*\) we use (7). Let \(d > 0\) be a scaling and H a cycle. We have \(\kappa ^d_{ij}\le \kappa _W^d\) for every \(i,j\in [n]\), and hence \(\kappa _W(H)=\kappa _W^d(H)\le (\kappa _W^d)^{|H|}\). Since this inequality holds for every \(d > 0\), it follows that \(\kappa _W(H) \le (\kappa _W^*)^{|H|}\).

For the reverse direction, consider the following optimization problem.

$$\begin{aligned} \begin{aligned} \min \;&t \\ \kappa _{ij}d_j/d_i&\le t \quad \forall (i,j) \in E \\ d&> 0. \end{aligned} \end{aligned}$$
(8)

For any feasible solution (d, t) and \(\lambda >0\), we get another feasible solution \((\lambda d, t)\) with the same objective value. As such, we can strengthen the condition \(d > 0\) to \(d \ge 1\) without changing the objective value. This makes it clear that the optimum value is achieved by a feasible solution.

Any rescaling \(d > 0\) provides a feasible solution with objective value \(\kappa ^d\), which means that the optimal value \(t^*\) of (8) is \(t^* = \kappa ^*\). Moreover, with the variable substitution \(z_i=\log d_i\), \(s=\log t\), (8) can be written as a linear program:

$$\begin{aligned} \begin{aligned} \min \;&s \\ \log \kappa _{ij} + z_j - z_i&\le s \quad \forall (i,j) \in E \\ z&\in \mathbb {R}^n. \end{aligned} \end{aligned}$$
(9)

This is the linear programming dual of a maximum-mean cycle problem with respect to the weights \(\log \kappa _{ij}\) (equivalently, a minimum-mean cycle problem for the costs \(-\log \kappa _{ij}\)). Therefore, an optimal solution corresponds to a cycle maximizing \(\sum _{ij\in H}\log \kappa _{ij}/|H|\), or in other words, maximizing \(\kappa (H)^{1/|H|}\). \(\square \)
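For concreteness, (9) is easy to solve directly with an off-the-shelf LP solver. The sketch below is our own illustration (the function name, the dictionary input format, and the use of scipy are assumptions, not part of the paper); it returns both \(\kappa ^*\) and an optimal rescaling \(d=\exp (z)\).

```python
# Solve (9): min s  s.t.  log kappa_ij + z_j - z_i <= s, recovering kappa^* = exp(s)
# and an optimal rescaling d = exp(z). Our own sketch, not the paper's implementation.
import numpy as np
from scipy.optimize import linprog

def kappa_star(kappa):
    """kappa: dict {(i, j): kappa_ij > 0} on node set {0, ..., n-1} (0-based)."""
    n = 1 + max(max(i, j) for i, j in kappa)
    edges = list(kappa)
    c = np.zeros(n + 1)                 # variables (z_0, ..., z_{n-1}, s); minimize s
    c[-1] = 1.0
    A_ub = np.zeros((len(edges), n + 1))
    b_ub = np.zeros(len(edges))
    for r, (i, j) in enumerate(edges):
        A_ub[r, j] += 1.0               # z_j - z_i - s <= -log kappa_ij
        A_ub[r, i] -= 1.0
        A_ub[r, -1] = -1.0
        b_ub[r] = -np.log(kappa[(i, j)])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * (n + 1))
    assert res.success, res.message     # bounded as soon as the digraph contains a cycle
    return np.exp(res.x[-1]), np.exp(res.x[:n])

# Two-node example: kappa_01 = kappa_10 = 7 forms a 2-cycle of geometric mean 7,
# so kappa^* = 7 and the optimal rescaling has d_0 = d_1.
value, d = kappa_star({(0, 1): 7.0, (1, 0): 7.0})
print(value, d)
```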

The following example shows that \(\kappa ^*\), and hence also \(\bar{\chi }^*\ge \kappa ^*\), can be arbitrarily large.

Example 2.18

Take \(W = \textrm{span}((0,1,1,M)^\top ,(1,0,M,1)^\top )\), where \(M > 0\). Then \(\{2,3,4\}\) and \(\{1,3,4\}\) are circuits with \(\kappa ^W_{34}(\{2,3,4\}) = M\) and \(\kappa ^W_{43}(\{1,3,4\}) = M\). Hence, by Theorem 2.12, we see that \(\kappa ^* \ge M\).

Corollary 2.13

(Restatement). Let us be given a linear subspace \(W\subseteq \mathbb {R}^n\) and \(i,j\in [n]\), \(i\ne j\), and a circuit \(C\in \mathcal {C}_W\) with \(i,j\in C\). Let \(g\in W\) be the corresponding vector with \(\textrm{supp}(g)=C\). Then,

$$\begin{aligned} \frac{\kappa ^W_{ij}}{\left( \kappa _W^*\right) ^2}\le \frac{|g_j|}{|g_i|}\le \kappa ^W_{ij}. \end{aligned}$$

Proof

The second inequality follows by definition. For the first inequality, note that the same circuit C yields \(|g_i/g_j|\le \kappa ^W_{ji}(C)\le \kappa ^W_{ji}\). Therefore, \(|g_j/g_i|\ge 1/\kappa ^W_{ji}\).

From Theorem 2.12 we see that \(\kappa ^W_{ij}\kappa ^W_{ji}\le (\kappa ^*_W)^2\), giving \(1/\kappa ^W_{ji}\ge \kappa ^W_{ij}/ (\kappa ^*_W)^2\), completing the proof. \(\square \)

2.5 Finding circuits: a detour in matroid theory

We next prove Theorem 2.14, showing how to efficiently obtain a family \(\hat{\mathcal {C}}\subseteq \mathcal {C}_W\) such that for any \(i,j\in [n]\), \(\hat{\mathcal {C}}\) includes a circuit containing both i and j, provided there exists such a circuit.

We need some simple concepts and results from matroid theory. We refer the reader to [45, Chapter 39] or [17, Chapter 5] for definitions and background. Let \(\mathcal{M}=([n],\mathcal{I})\) be a matroid on ground set [n] with independent sets \(\mathcal{I}\subseteq 2^{[n]}\). The rank \(\textrm{rk}(S)\) of a set \(S\subseteq [n]\) is the maximum size of an independent set contained in S. The maximal independent sets are called bases. All bases have the same cardinality \(\textrm{rk}([n])\).

For the matrix \(A\in \mathbb {R}^{m\times n}\), we will work with the linear matroid \(\mathcal{M}(A)=([n],\mathcal{I}(A))\), where a subset \(I\subseteq [n]\) is independent if the columns \(\{A_i\,: i\in I\}\) are linearly independent. Note that \(\textrm{rk}([n])= m\) under the assumption that A has full row rank.

The circuits of the matroid are the inclusion-wise minimal non-independent sets. Let \(I\in \mathcal{I}\) be an independent set, and \(i\in [n]{\setminus } I\) such that \(I\cup \{i\}\notin \mathcal{I}\). Then, there exists a unique circuit \(C(I,i)\subseteq I\cup \{i\}\) that is called the fundamental circuit of i with respect to I. Note that \(i\in C(I,i)\).

The matroid \(\mathcal M\) is separable if the ground set [n] can be partitioned into two nonempty subsets \([n]=S\cup T\) such that \(I\in \mathcal{I}\) if and only if \(I\cap S,I\cap T\in \mathcal{I}\). In this case, the matroid is the direct sum of its restrictions to S and T. In particular, every circuit is fully contained in S or in T.

For the linear matroid \(\mathcal{M}(A)\), separability means that \({\text {Ker}}(A)={\text {Ker}}(A_S) \times {\text {Ker}}(A_T)\). In this case, solving (LP) can be decomposed into two subproblems, restricted to the columns in \(A_S\) and in \(A_T\), and \(\kappa _A=\max \{\kappa _{A_S},\kappa _{A_T}\}\).

Hence, we can focus on non-separable matroids. The following characterization is well-known, see e.g. [17, Theorems 5.2.5, 5.2.7–5.2.9]. For a hypergraph \(H=([n],\mathcal{E})\), we define the underlying graph \(H_G=([n],E)\) such that \((i,j)\in E\) if there is a hyperedge \(S\in \mathcal{E}\) with \(i,j\in S\). That is, we add a clique corresponding to each hyperedge. The hypergraph is called connected if its underlying graph is connected.

Proposition 2.19

For a matroid \(\mathcal{M}=([n],\mathcal{I})\), the following are equivalent:

  1. (i)

    \(\mathcal{M}\) is non-separable.

  2. (ii)

    The hypergraph of the circuits is connected.

  3. (iii)

    For any base B of \(\mathcal{M}\), the hypergraph formed by the fundamental circuits \(\mathcal {C}^B=\{ C(B,i)\,: i\in [n]{\setminus } B\}\) is connected.

  4. (iv)

    For any \(i,j\in [n]\), there exists a circuit containing i and j.

Proof

The implications (i) \(\Leftrightarrow \) (ii), (iii) \(\Rightarrow \) (ii), and (iv) \(\Rightarrow \) (ii) are immediate from the definitions.

For the implication (ii) \(\Rightarrow \) (iii), assume for a contradiction that the hypergraph of the fundamental circuits with respect to B is not connected. This means that we can partition \([n]=S\cup T\) such that for each \(i\in S\), \(C(B,i)\subseteq S\), and for each \(i\in T\), \(C(B,i)\subseteq T\). Consequently, \(\textrm{rk}(S)=|B\cap S|\), \(\textrm{rk}(T)=|B\cap T|\), and therefore \(\textrm{rk}([n])=\textrm{rk}(S)+\textrm{rk}(T)\). It is easy to see that this property is equivalent to separability into S and T; see e.g. [17, Theorem 5.2.7] for a proof.

Finally, for the implication (ii) \(\Rightarrow \) (iv), consider the undirected graph ([n], E) where \((i,j)\in E\) if there is a circuit containing both i and j. This graph is transitive according to [17, Theorem 5.2.5]: if \((i,j), (j,k)\in E\), then also \((i,k)\in E\). Consequently, whenever ([n], E) is connected, it must be a complete graph. \(\square \)

We give a different proof of (iii) \(\Rightarrow \) (iv) in Lemma 2.21 that will be convenient for our algorithmic purposes. First, we need a simple lemma that is commonly used in matroid optimization, see e.g. [17, Lemma 13.1.11] or [45, Theorem 39.13].

Lemma 2.20

Let I be an independent set of a matroid \(\mathcal{M}=([n],\mathcal{I})\), and \(U=\{u_1,u_2,\ldots , u_\ell \}\subseteq I\), \(V=\{v_1,v_2,\ldots , v_\ell \}\subseteq [n]\setminus I\) such that \(I\cup \{v_i\}\) is dependent for each \(i\in [\ell ]\). Further, assume that for each \(t\in [\ell ]\), \(u_t\in C(I,v_t)\) and \(u_t \notin C(I,v_h)\) for all \(h<t\). Then, \((I{\setminus } U)\cup V \in \mathcal{I}\).

We give a sketch of the proof. First, we note that for each \(t\in [\ell ]\), \(u_t\in C(I,v_t)\) means that exchanging \(u_t\) for \(v_t\), i.e., passing to \((I{\setminus }\{u_t\})\cup \{v_t\}\), maintains independence. The statement follows by induction on \(\ell \): we consider the independent set \(I'=(I{\setminus } \{u_\ell \})\cup \{v_\ell \}\). We can apply induction for \(I'\), \(U'=\{u_1,u_2,\ldots , u_{\ell -1}\}\), and \(V'=\{v_1,v_2,\ldots , v_{\ell -1}\}\), noting that the assumption guarantees that \(C(I',v_t)=C(I,v_t)\) for all \(t\in [\ell -1]\). Based on this lemma, we show the following exchange property.

Lemma 2.21

Let B be a basis of the matroid \(\mathcal{M}=([n],\mathcal{I})\), and let \(U=\{u_1,u_2,\ldots , u_\ell \}\subseteq B\), and \(V=\{v_1,v_2,\ldots , v_\ell ,v_{\ell +1}\}\subseteq [n]{\setminus } B\). Assume \(C(B,v_1)\cap U=\{u_1\}\), \( C(B,v_{\ell +1})\cap U=\{u_\ell \}\), and for each \(2\le t\le \ell \), \( C(B,v_t)\cap U=\{u_{t-1}, u_t\}\). Then \((B{\setminus } U)\cup V\) contains a unique circuit C, and \(V\subseteq C\).

The situation described here corresponds to a minimal path in the hypergraph \(\mathcal {C}^B\) of the fundamental circuits with respect to a basis B. The hyperedges \(C(B,v_i)\) form a path from \(v_1\) to \(v_{\ell +1}\) such that no shortcut is possible (note that this is weaker than requiring a shortest path).

Proof of Lemma 2.21

Note that \(S = (B \setminus U)\cup V \notin \mathcal{I}\) since \(|S|>|B|\) and B is a basis. For any \(i\in [\ell +1]\), we can use Lemma 2.20 to show that \(S{\setminus } \{v_{i}\} = (B {\setminus } U) \cup (V {\setminus } \{v_i\}) \in \mathcal{I}\) (and thus, is a basis). To see this, we apply Lemma 2.20 for the ordered sets \(V'=\{v_1,\ldots ,v_{i-1},v_{\ell +1},v_\ell ,\ldots ,v_{i+1}\}\) and \(U'=\{u_1,\ldots ,u_{i-1},u_\ell ,u_{\ell -1},\ldots ,u_i\}\).

Consequently, every circuit in S must contain the entire set V. The uniqueness of the circuit in S follows by the well-known circuit axiom asserting that if \(C,C'\in \mathcal {C}\), \(C \ne C'\) and \(v\in C\cap C'\), then there exists a circuit \(C''\in \mathcal {C}\) such that \(C''\subseteq (C\cup C')\setminus \{v\}\). Applying this axiom to two distinct circuits in S and an element \(v\in V\) would yield a circuit in S avoiding v, contradicting the fact that every circuit in S contains the entire set V. \(\square \)

We are ready to describe the algorithm that will be used to obtain lower bounds on all \(\kappa _{ij}\) values.

Theorem 2.14

(Restatement). Given \(A\in \mathbb {R}^{m\times n}\), there exists an \(O(n^2 m^2)\) time algorithm Find-Circuits(A) that obtains a decomposition of \(\mathcal{M}(A)\) into a direct sum of non-separable linear matroids, and returns a family \(\hat{\mathcal {C}}\) of circuits such that if i and j are in the same non-separable component, then there exists a circuit in \(\hat{\mathcal {C}}\) containing both i and j. Further, for each \(i\ne j\) in the same component, the algorithm returns a value \(\hat{\kappa }_{ij}\) as the maximum of \(|g_j/g_i|\) such that \(g\in W\), \(\textrm{supp}(g)=C\) for some \(C\in \hat{\mathcal {C}}\) containing i and j. For these values, \(\hat{\kappa }_{ij} \le \kappa _{ij} \le (\kappa ^*)^2\hat{\kappa }_{ij}\).

Proof

Once we have found the set of circuits \(\hat{\mathcal {C}}\), and computed \(\hat{\kappa }_{ij}\) as in the statement, the inequalities \(\hat{\kappa }_{ij} \le \kappa _{ij} \le (\kappa ^*)^2\hat{\kappa }_{ij}\) follow easily. The first inequality is by the definition of \(\kappa _{ij}\), and the second inequality is from Corollary 2.13.

We now turn to the computation of \(\hat{\mathcal {C}}\). We first obtain a basis \(B\subseteq [n]\) of the matroid \(\mathcal{M}(A)\) via Gauss-Jordan elimination in time \(O(nm^2)\). Recall the assumption that A has full row-rank. Let us assume that \(B=[m]\) is the set of first m indices. The elimination transforms it to the form \(A=(\textbf{I}_m|H)\), where \(H\in \mathbb {R}^{m \times (n-m)}\) corresponds to the non-basis elements. In this form, the fundamental circuit \(C(B,i)\) is the support of the ith column of A together with i for every \(m+1\le i\le n\). We let \(\mathcal {C}^B\) denote the set of all these fundamental circuits.

We construct an undirected graph \(G=(B,E)\) as follows. For each \(i\in [n]\setminus B\), we add a clique between the nodes in \(C(B,i)\setminus \{i\}\). This graph can be constructed in \(O(nm^2)\) time.

The connected components of G correspond to the connected components of \(\mathcal {C}^B\) restricted to B. Thus, due to the equivalence shown in Proposition 2.19 we can obtain the decomposition by identifying the connected components of G. For the rest of the proof, we assume that the entire hypergraph is connected; connectivity can be checked in \(O(m^2)\) time.

We initialize \(\hat{\mathcal {C}}\) as \(\mathcal {C}^B\). We will then check all pairs \(i,j\in [n]\), \(i\ne j\). If no circuit \(C\in \hat{\mathcal {C}}\) exists with \(i,j\in C\), then we will add such a circuit to \(\hat{\mathcal {C}}\) as follows.

Assume first \(i,j\in [n]\setminus B\). We can find a shortest path in G between the sets \(C(B,i){\setminus } \{i\}\) and \(C(B,j){\setminus } \{j\}\) in time \(O(m^2)\). This can be represented by the sequences of points \(V=\{v_1,v_2,\ldots ,v_{\ell +1}\}\subseteq [n]\setminus B\), \(v_1=i\), \(v_{\ell +1}=j\), and \(U=\{u_1,u_2,\ldots ,u_\ell \}\subseteq B\) as in Lemma 2.21. According to the lemma, \(S=(B\setminus U)\cup V\) contains a unique circuit C that contains all \(v_t\)’s, including i and j.

We now show how this circuit can be identified in O(m) time, along with the vector \(g^C\). Let \(A_S\) be the submatrix corresponding to the columns in S. Since \(g=g^C\) is unique up to scaling, we can set \(g_{v_1}=1\). Note that for each \(t\in [\ell ]\), the row of \(A_S\) corresponding to \(u_t\) contains only two nonzero entries: \(A_{u_tv_t}\) and \(A_{u_tv_{t+1}}\). Thus, the value \(g_{v_1}=1\) can be propagated to assign unique values to \(g_{v_2},g_{v_3},\ldots ,g_{v_{\ell +1}}\). Once these values are set, there is a unique extension of g to the indices \(t\in B\cap S\) in the basis. Thus, we have identified g as the unique element of \({\text {Ker}}(A_S)\) up to scaling. The circuit C is obtained as \(\textrm{supp}(g)\). Clearly, the above procedure can be implemented in O(m) time.

The argument easily extends to finding circuits for the case \(\{i,j\}\cap B\ne \emptyset \). If \(i\in B\), then for any choice of \(V=\{v_1,v_2,\ldots ,v_{\ell +1}\}\) and \(U=\{u_1,u_2,\ldots ,u_\ell \}\) as in Lemma 2.21 such that \(i\in C(B,v_1)\) and \(i\notin C(B,v_t)\) for \(t>1\), the unique circuit in \((B{\setminus } U)\cup V\) also contains i. This follows from Lemma 2.20 by taking \(V' = \left\{ v_{\ell +1},v_\ell ,\dots ,v_1 \right\} \) and \(U' = \left\{ u_\ell ,\dots ,u_1, i \right\} \), which proves that \(S {\setminus } \left\{ i \right\} = (B{\setminus } U') \cup V' \in \mathcal I\). Similarly, if \(j \in B\) with \(j \in C(B,v_{\ell + 1})\) and \(j\notin C(B,v_t)\) for \(t < \ell + 1\), taking \(V'' = V\) and \(U'' = \left\{ u_1,\dots ,u_\ell , j \right\} \) gives \(S {\setminus } \left\{ j \right\} \in \mathcal I\).

The bottleneck for the running time is finding the shortest paths for the \(n(n-1)\) pairs, in time \(O(m^2)\) each. \(\square \)
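The following Python sketch illustrates the heart of Find-Circuits(A) for a pair of non-basic indices; it is our own illustration, not the paper's implementation. It assumes that the first m columns of A already form the basis B and that i and j lie in the same non-separable component. It finds a shortest path of fundamental circuits as in Lemma 2.21, forms \(S=(B\setminus U)\cup V\), and, for brevity, reads the circuit off the one-dimensional kernel of \(A_S\) via an SVD instead of the O(m) propagation described above.

```python
# A sketch (ours) of finding a circuit through two non-basic indices i, j.
from collections import deque
import numpy as np

def circuit_through(A, i, j, tol=1e-10):
    """Return (C, g) with i, j in C = supp(g) and A @ g = 0."""
    m, n = A.shape
    H = np.linalg.solve(A[:, :m], A)        # brings A to the form (I_m | H); assumes B = {0,...,m-1}
    # Basic part C(B, v) \ {v} of each fundamental circuit, v non-basic.
    circ = {v: set(np.flatnonzero(np.abs(H[:, v]) > tol)) for v in range(m, n)}
    # BFS for a shortest path i = v_1, ..., v_{l+1} = j; two non-basic indices are
    # adjacent when their fundamental circuits share a basic element.
    prev = {i: None}
    queue = deque([i])
    while queue and j not in prev:
        v = queue.popleft()
        for u in range(m, n):
            if u not in prev and circ[v] & circ[u]:
                prev[u] = v
                queue.append(u)
    V = [j]
    while prev[V[-1]] is not None:
        V.append(prev[V[-1]])
    V.reverse()                              # V = [v_1, ..., v_{l+1}]
    U = {min(circ[V[t]] & circ[V[t + 1]]) for t in range(len(V) - 1)}
    S = sorted((set(range(m)) - U) | set(V))   # S = (B \ U) ∪ V, |S| = m + 1
    _, _, vt = np.linalg.svd(A[:, S])
    g = np.zeros(n)
    g[S] = vt[-1]                            # Ker(A_S) is one-dimensional by Lemma 2.21
    C = sorted(np.flatnonzero(np.abs(g) > tol))
    return C, g

A = np.array([[1.0, 0.0, 2.0, 0.0, 1.0],
              [0.0, 1.0, 0.0, 3.0, 1.0]])
print(circuit_through(A, 2, 3))              # yields the circuit {2, 3, 4} through columns 2 and 3
```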

The triangle inequality An interesting additional fact about the circuit ratio graph is that the logarithms of the weights satisfy the triangle inequality. The proof uses similar arguments as the proof of Theorem 2.14 above.

Lemma 2.15

(Restatement).

  1. (i)

    For any distinct \(i,j,k\) in the same connected component of \(\mathcal {C}_W\), and any \(g^C\) with \(i,j \in C\), \(C \in \mathcal {C}_W\), there exist circuits \(C_1, C_2 \in \mathcal {C}_W\), \(i,k \in C_1\), \(j,k \in C_2\) such that \(|g^C_j/g^C_i| = |g^{C_2}_j/g^{C_2}_k| \cdot |g^{C_1}_k/g^{C_1}_i|\).

  2. (ii)

    For any distinct \(i,j,k\) in the same connected component of \(\mathcal {C}_W\), \(\kappa _{ij} \le \kappa _{ik}\cdot \kappa _{kj}\).

Proof

Note that part (ii) immediately follows from part (i) when taking \(C \in \mathcal {C}_W\) such that \(\kappa _{ij}(C) = \kappa _{ij}\). We now prove part (i).

Let \(A \in \mathbb {R}^{m \times n}\) be a full-rank matrix with \(W = {\text {Ker}}(A)\). If \(C = \left\{ i,j \right\} \), then the columns \(A_i, A_j\) are linearly dependent. Writing \(A_i = \lambda A_j\), we have \(\lambda = -g^C_j/g^C_i\). Let h be any circuit solution with \(i,k \in \textrm{supp}(h)\), and hence \(j \notin \textrm{supp}(h)\). By assumption, the vector \(h' = h - h_i e_i + \lambda h_i e_j\) will satisfy \(Ah' = 0\) and have \(i \notin \textrm{supp}(h'), j,k\in \textrm{supp}(h')\). We know that \(h'\) is a circuit solution, because any circuit \(C' \subset \textrm{supp}(h')\) could, by the above process in reverse, be used to produce a kernel solution with strictly smaller support than h, contradicting the assumption that h is a circuit solution. Now we have \(|h'_j/h'_k|\cdot |h_k/h_i| = |h'_j/h_i| = |\lambda |\) by construction. Thus, h and \(h'\) are the circuit solutions we are looking for.

Now assume \(C \ne \left\{ i,j \right\} \). If \(k \in C\), the statement is trivially true with \(C = C_1 = C_2\), so assume \(k \notin C\). Pick \(l \in C\), \(l \notin \{i,j\}\) and set \(B = C{\setminus }\left\{ l \right\} \). Assume without loss of generality that \(B \subseteq [m]\) and apply row operations to A such that \(A_{B,B} = \textbf{I}_{B\times B}\) is an identity submatrix and \(A_{[m]\setminus B,B} = 0\). Then the column \(A_{l}\) has support given by B, for otherwise \(g^C\) could not be in the kernel. The given circuit solution satisfies \(g^C_t = -A_{t,l}g^C_l\) for all \(t \in B\), and in particular \(g^C_j/g^C_i = A_{j,l}/A_{i,l}\).

Take any circuit solution \(h \in {\text {Ker}}(A)\) such that \(l, k \in \textrm{supp}(h)\) and such that \(C \cup \textrm{supp}(h)\) is inclusion-wise minimal. Such a vector exists by Proposition 2.19(iv). Now let \(J = \textrm{supp}(h) \setminus C\). Because \(A_{[m]\setminus B, C} = 0\) and \(Ah = 0\), we must have \(0 \ne h_J \in {\text {Ker}}(A_{[m]\setminus B, J})\). We show that we can uniquely lift any vector \(x \in {\text {Ker}}(A_{B, C\cup \left\{ k \right\} })\) to a vector \(x' \in {\text {Ker}}(A_{C \cup J})\) with \( x'_{C\cup \left\{ k \right\} }= x\). Since this lift will send circuit solutions to circuit solutions by uniqueness, it suffices to find our desired circuits as solutions to the smaller linear system.

We first prove that \(\dim ({\text {Ker}}(A_{[m]\setminus B, J})) = 1\). For suppose that \(\dim ({\text {Ker}}(A_{[m]\setminus B, J})) \ge 2\), then \(|J| \ge 2\) and there would exist some vector \(y \in {\text {Ker}}(A_{[m]{\setminus } B, J})\) linearly independent from \(h_J\) with \(k \in \textrm{supp}(y)\). This vector could be uniquely lifted to a vector \(\bar{y} \in {\text {Ker}}(A)\), and we could then find a linear combination \(h + \alpha \bar{y}\) such that \(\textrm{supp}(h + \alpha \bar{y}) \subsetneq C \cup J\) but \(l,k\in \textrm{supp}(h + \alpha \bar{y})\). The existence of such a vector contradicts the minimality of \(C \cup \textrm{supp}(h)\). As such, we know that \(\dim ({\text {Ker}}(A_{[m]\setminus B, J})) = 1\).

Since \(\dim ({\text {Ker}}(A_{[m]\setminus B, J})) = 1\), any two entries in J are in a fixed linear relation for every vector in \({\text {Ker}}(A_{[m]\setminus B, J})\); this implies that we can apply row operations to A such that \(A_{B, J}\) has non-zero entries only in the column \(A_{B, \left\{ k \right\} }\). Note that these row operations leave \(A_C\) unchanged because \(A_{[m]\setminus B, C} = 0\). From this, we can see that any element in \({\text {Ker}}(A_{B, C \cup \left\{ k \right\} })\) can be uniquely lifted to an element in \({\text {Ker}}(A_{C \cup J})\). Hence we can focus on \({\text {Ker}}(A_{B, C\cup \left\{ k \right\} })\).

If \(A_{i,k} = A_{j,k} = 0\), then any \(x \in {\text {Ker}}(A_{B,C \cup \left\{ k \right\} })\) satisfies \(x_i + A_{i,l}x_l = x_j + A_{j,l}x_l = 0\); in particular, any circuit \(\bar{C} \subseteq C \cup \{k\}\) with \(l,k \in \bar{C}\) contains \(\{i,j\}\) and fulfills \(|g^C_j/g^C_i| = |A_{j,l}/A_{i,l}| = |g_j^{\bar{C}}/g_i^{\bar{C}}| = |g_j^{\bar{C}}/g_k^{\bar{C}}| |g_k^{\bar{C}}/g_i^{\bar{C}}|\). Choosing \(C_1 = C_2 = \bar{C}\) concludes the case.

Otherwise we know that \(A_{i,k} \ne 0\) or \(A_{j,k} \ne 0\), meaning that \({\text {Ker}}(A_{\left\{ i,j \right\} ,\left\{ i,j,l,k \right\} })\) contains at least one circuit solution with k in its support. Observe that any circuit in \({\text {Ker}}(A_{\left\{ i,j \right\} ,\left\{ i,j,l,k \right\} })\) can be lifted uniquely to an element in \({\text {Ker}}(A_{B,C \cup \left\{ k \right\} })\) since \(A_{B,B}\) is an identity matrix and we can set the entries of \(B\setminus \left\{ i,j \right\} \) individually to satisfy the equalities. Note that this lifted vector is a circuit as well, again by uniqueness of the lift. Hence we may restrict our attention to the matrix \(A_{\left\{ i,j \right\} ,\left\{ i,j,l,k \right\} }\). If the columns \(A_{\left\{ i,j \right\} ,k}, A_{\left\{ i,j \right\} ,l}\) are linearly dependent, then any circuit solution to \(A_{\left\{ i,j \right\} ,\left\{ i,j,l \right\} }x = 0, x_l \ne 0\), such as \(g^C_{\left\{ i,j,l \right\} }\), is easily transformed into a circuit solution to \(A_{\left\{ i,j \right\} ,\left\{ i,j,k \right\} }x = 0, x_k \ne 0\) and we are done.

If \(A_{\left\{ i,j \right\} ,k}, A_{\left\{ i,j \right\} ,l}\) are independent, we can write
$$\begin{aligned} A_{\left\{ i,j \right\} ,\left\{ i,j,l,k \right\} } = \begin{pmatrix} 1 &{} 0 &{} a &{} c \\ 0 &{} 1 &{} b &{} d \end{pmatrix}, \end{aligned}$$
where \(g^C_j/g^C_i = b/a\). For \(\alpha = ad-bc\), which is non-zero by the independence assumption, we can check that \((\alpha , 0, -d, b)^\top \) and \((0,\alpha ,c,-a)^\top \) are the circuits we are looking for. \(\square \)

2.6 Approximating \(\bar{\chi }\) and \(\bar{\chi }^*\)

Equipped with Theorems 2.12 and 2.14, we are ready to prove Theorem 2.5. Recall that we defined \(\kappa _{ij}^d:= \kappa _{ij}^{{\text {Diag}}(d)W} = \kappa _{ij} d_j/d_i\) when \(d > 0\). We can similarly define \(\hat{\kappa }_{ij}^d:= \hat{\kappa }_{ij} d_j/d_i\), and \(\hat{\kappa }_{ij}^d\) approximates \(\kappa _{ij}^d\) just as in Theorem 2.14.

Theorem 2.5

(Restatement). There is an \(O(n^2m^2 + n^3)\) time algorithm that for any matrix \(A\in \mathbb {R}^{m\times n}\) computes an estimate \(\xi \) of \(\bar{\chi }_W\) such that

$$\begin{aligned} \xi \le \bar{\chi }_W \le n(\bar{\chi }_W^*)^2 \xi \end{aligned}$$

and a \(D\in {\textbf{D}}\) such that

$$\begin{aligned} \bar{\chi }^*_W\le \bar{\chi }_{DW}\le n(\bar{\chi }_W^*)^3\,. \end{aligned}$$

Proof

Let us run the algorithm Find-Circuits(A) described in Theorem 2.14 to obtain the values \(\hat{\kappa }_{ij}\) such that \(\hat{\kappa }_{ij} \le \kappa _{ij} \le (\kappa ^*_W)^2\hat{\kappa }_{ij}\). We let \(G=([n],E)\) be the circuit ratio digraph, that is, \((i,j)\in E\) if \(\kappa _{ij}>0\).

To show the first statement on approximating \(\bar{\chi }\), we simply set \(\xi =\max _{(i,j)\in E}\hat{\kappa }_{ij}\). Then,

$$\begin{aligned} \xi \le \kappa _W \le \bar{\chi }_W\le n\kappa _W\le n (\kappa ^*_W)^2\xi \le n (\bar{\chi }^*_W)^2\xi \end{aligned}$$

follows by Theorem 2.8.

For the second statement on finding a nearly optimal rescaling for \(\bar{\chi }^*_W\), we consider the following optimization problem, which is an approximate version of (8) from Theorem 2.12.

$$\begin{aligned} \begin{aligned} \min \;&t \\ \hat{\kappa }_{ij}d_j/d_i&\le t \quad \forall (i,j) \in E \\ d&> 0. \end{aligned} \end{aligned}$$
(10)

Let \(\hat{d}\) be an optimal solution to (10) with value \(\hat{t}\). We will prove that \(\kappa ^{\hat{d}} \le (\kappa ^*_W)^3\).

First, observe that \(\kappa _{ij}^{\hat{d}} = \kappa _{ij}\hat{d}_j/\hat{d}_i \le (\kappa ^*_W)^2 \hat{\kappa }_{ij} \hat{d}_j/\hat{d}_i \le (\kappa ^*_W)^2 \hat{t}\) for any \((i,j) \in E\). Now, let \(d^* > 0\) be such that \(\kappa ^{d^*} = \kappa ^*_W\). The vector \(d^*\) is a feasible solution to (10), and so \(\hat{t} \le \max _{i\ne j} \hat{\kappa }_{ij}d^*_j/d^*_i \le \max _{i\ne j} \kappa _{ij}d^*_j/d^*_i = \kappa ^{d^*}\). Hence we find that \(\hat{d}\) gives a rescaling with

$$\begin{aligned} \bar{\chi }_{\widehat{D}W} \le n\kappa ^{\hat{d}} \le n(\kappa ^*_W)^3\le n(\bar{\chi }^*_W)^3\,, \end{aligned}$$

where \(\widehat{D}={\text {Diag}}(\hat{d})\) and we again used Theorem 2.8.

We can obtain the optimal value \(\hat{t}\) of (10) by solving the corresponding maximum-mean cycle problem (see Theorem 2.12). It is easy to develop a multiplicative version of the standard dynamic programming algorithm for the classical minimum-mean cycle problem (see e.g. [4, Theorem 5.8]) that allows finding the optimum to (10) directly, in the same \(O(n^3)\) time.

It remains to find the labels \(d_i>0\), \(i \in [n]\) such that \(\hat{\kappa }_{ij}d_j/d_i \le \hat{t}\) for all \((i,j) \in E\). We define the following weighted directed graph. We associate the weight \(w_{ij}=\log \hat{t} - \log \hat{\kappa }_{ij}\) with every \((i,j)\in E\), and add an extra source vertex r with edges (r, i) of weight \(w_{ri}=0\) for all \(i\in [n]\).

By the choice of \(\hat{t}\), this graph does not contain any negative weight directed cycles. We can compute the shortest paths from r to all nodes in \(O(n^3)\) using the Bellman-Ford algorithm; let \(\sigma _i\) be the shortest path label for i. We then set \(d_i=\exp (\sigma _i)\). One can avoid computing logarithms by using a multiplicative variant of the Bellman-Ford algorithm instead.

The total running time is bounded by \(O(n^2m^2 + n^3)\): it is dominated by the \(O(n^2m^2)\) complexity of Find-Circuits(A) and the \(O(n^3)\) complexity of the maximum-mean cycle and shortest path computations. \(\square \)
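The label computation at the end of the proof is a standard Bellman–Ford run; the sketch below is our own illustration (the function name and the input format are assumptions). Given the estimates \(\hat{\kappa }_{ij}\) and the optimal value \(\hat{t}\) of (10), it computes the shortest-path labels \(\sigma _i\) and returns \(d_i=\exp (\sigma _i)\); initializing all labels to zero plays the role of the zero-weight edges from the source r.

```python
# Bellman-Ford computation of the rescaling labels d_i = exp(sigma_i); our own sketch.
import math

def rescaling_labels(n, kappa_hat, t_hat):
    """kappa_hat: dict {(i, j): value}; returns d with kappa_hat[i, j] * d_j / d_i <= t_hat."""
    sigma = [0.0] * n                       # the source r has zero-weight edges to every node
    for _ in range(n):                      # n rounds of relaxations suffice (no negative cycles)
        for (i, j), k in kappa_hat.items():
            w_ij = math.log(t_hat) - math.log(k)
            if sigma[i] + w_ij < sigma[j]:
                sigma[j] = sigma[i] + w_ij
    return [math.exp(s) for s in sigma]

# Two nodes with kappa_hat_01 = 2 and kappa_hat_10 = 8: the 2-cycle has geometric mean 4,
# so t_hat = 4, and the labels give d = (1/2, 1) with both constraints tight.
d = rescaling_labels(2, {(0, 1): 2.0, (1, 0): 8.0}, 4.0)
print(d, 2.0 * d[1] / d[0], 8.0 * d[0] / d[1])
```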

3 A scaling-invariant layered least squares interior-point algorithm

3.1 Preliminaries on interior-point methods

In this section, we introduce the standard definitions, concepts and results from the interior-point literature that will be required for our algorithm. We consider an LP problem in the form (LP), or equivalently, in the subspace form (2) for \(W={\text {Ker}}(A)\). We let

$$\begin{aligned} \mathcal {P}^{++}&= \{x \in \mathbb {R}^n: Ax = b,\ x> 0\}\,, \\ \mathcal {D}^{++}&= \{(y,s) \in \mathbb {R}^{m+n}: A^\top y + s = c,\ s > 0\}\,. \end{aligned}$$

Recall the central path defined in (CP), with \(w(\mu )=(x(\mu ),y(\mu ),s(\mu ))\) denoting the central path point corresponding to \(\mu >0\). We let \(w^*=(x^*,y^*,s^*)\) denote the primal and dual optimal solutions to (LP) that correspond to the limit of the central path for \(\mu \rightarrow 0\).

For a point \(w = (x, y, s) \in \mathcal {P}^{++} \times \mathcal {D}^{++}\), the normalized duality gap is \(\mu (w)=x^\top s/n\).

The \(\ell _2\)-neighborhood of the central path with opening \(\beta >0\) is the set

$$\begin{aligned} \mathcal {N}(\beta )&= \left\{ w \in \mathcal {P}^{++} \times \mathcal {D}^{++}: \left\Vert \frac{xs}{\mu (w)} - e\right\Vert \le \beta \right\} \, . \end{aligned}$$

Furthermore, we let \(\overline{\mathcal {N}}(\beta ):={\text {cl}}(\mathcal {N}(\beta ))\) denote the closure of \(\mathcal {N}(\beta )\). Throughout the paper, we will assume \(\beta \) is chosen from (0, 1/4]; in Algorithm 2 we use the value \(\beta =1/8\). The following proposition gives a bound on the distance between w and \(w(\mu )\) if \(w\in \mathcal{N}(\beta )\). See e.g., [20, Lemma 5.4], [36, Proposition 2.1].

Proposition 3.1

Let \(w = (x, y, s) \in \mathcal{N}(\beta )\) for \(\beta \in (0,1/4]\) and \(\mu =\mu (w)\), and consider the central path point \(w(\mu )=(x(\mu ),y(\mu ),s(\mu ))\). For each \(i\in [n]\),

$$\begin{aligned} \begin{aligned} \frac{x_i}{1+2\beta }\le \frac{1-2\beta }{1-\beta }\cdot x_i&\le x_i(\mu )\le \frac{x_i}{1-\beta }\,,\quad \text{ and }\\ \frac{s_i}{1+2\beta }\le \frac{1-2\beta }{1-\beta }\cdot s_i&\le s_i(\mu )\le \frac{s_i}{1-\beta }\,. \end{aligned} \end{aligned}$$

We will often use the following proposition, which is immediate from the definition of \(\mathcal {N}(\beta )\).

Proposition 3.2

Let \(w = (x, y, s) \in \mathcal{N}(\beta )\) for \(\beta \in (0,1/4]\), and \(\mu =\mu (w)\). Then for each \(i \in [n]\)

$$\begin{aligned} (1-\beta )\sqrt{\mu }\le \sqrt{s_i x_i}\le (1+\beta )\sqrt{\mu }\,. \end{aligned}$$

Proof

By definition of \(\mathcal N(\beta )\) we have for all \(i \in [n]\) that \(|\frac{x_is_i}{\mu } - 1| \le \Vert \frac{x s}{\mu } - e\Vert \le \beta \) and so \((1-\beta ) \mu \le x_is_i \le (1+\beta ) \mu \). Taking square roots and using \(1-\beta \le \sqrt{1-\beta }\) and \(\sqrt{1+\beta }\le 1+\beta \) gives the result. \(\square \)

A key property of the central path is “near monotonicity”, formulated in the following lemma, see [63, Lemma 16].

Lemma 3.3

Let \(w = (x, y, s)\) be a central path point for \(\mu \) and \(w' = (x', y', s')\) be a central path point for \(\mu ' \le \mu \). Then \(\Vert x'/x + s'/s\Vert _\infty \le n\). Further, for the optimal solution \(w^*=(x^*,y^*,s^*)\) corresponding to the central path limit \(\mu \rightarrow 0\), we have \(\Vert x^*/x\Vert _1 + \Vert s^*/s\Vert _1 = n\).

Proof

We show that \(\Vert x'/x\Vert _1 + \Vert s'/s\Vert _1 \le 2n\) for any feasible primal \(x'\) and dual \((y',s')\) such that \((x')^\top s'\le x^\top s=n\mu \); this implies the first statement with the weaker bound 2n. For the stronger bound \(\Vert x'/x + s'/s\Vert _\infty \le n\), see the proof of [63, Lemma 16]. Since \(x-x'\in W\) and \(s-s'\in W^\perp \), we have \((x-x')^\top (s-s')=0\). This can be rewritten as \(x^\top s'+(x')^\top s=x^\top s+ (x')^\top s'\). By our assumption on \(x'\) and \(s'\), the right hand side is bounded by \(2n\mu \). Dividing by \(\mu \), and noting that \(x_is_i=\mu \) for all \(i\in [n]\), we obtain

$$\begin{aligned} \left\| \frac{x'}{x}\right\| _1+\left\| \frac{s'}{s}\right\| _1=\sum _{i=1}^n \frac{x'_i}{x_i}+\frac{s'_i}{s_i} \le 2n\,. \end{aligned}$$

The second statement follows by applying this to central path points \((x',y',s')\) with parameter \(\mu '\), and taking the limit \(\mu '\rightarrow 0\). \(\square \)

3.2 The affine scaling and layered-least-squares steps

Given \(w = (x,y,s) \in \mathcal {P}^{++} \times \mathcal {D}^{++}\), the search directions commonly used in interior-point methods are obtained as the solution \((\Delta x,\Delta y,\Delta s)\) to the following linear system for some \(\sigma \in [0,1]\).

$$\begin{aligned} A \Delta x&= 0 \end{aligned}$$
(11)
$$\begin{aligned} A^\top \Delta y + \Delta s&= 0 \end{aligned}$$
(12)
$$\begin{aligned} s\Delta x + x \Delta s&=\sigma \mu e -xs \end{aligned}$$
(13)

Predictor–corrector methods, such as the Mizuno–Todd–Ye Predictor–Corrector (MTY P-C) algorithm [39], alternate between two types of steps. In predictor steps, we use \(\sigma =0\). This direction is also called the affine scaling direction, and will be denoted as \(\Delta w^\textrm{a}=(\Delta x^\textrm{a}, \Delta y^\textrm{a}, \Delta s^\textrm{a})\) throughout. In corrector steps, we use \(\sigma =1\). This gives the centrality direction, denoted as \(\Delta w^\textrm{c}=(\Delta x^\textrm{c}, \Delta y^\textrm{c}, \Delta s^\textrm{c})\).
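For reference, both directions can be computed by eliminating \(\Delta s\) and \(\Delta x\) from (11)–(13) and solving the normal equations \(A\,{\text {Diag}}(x/s)A^\top \Delta y = -A\big ((\sigma \mu e - xs)/s\big )\). The sketch below is our own illustration of this textbook computation, not the paper's code.

```python
# Compute the search direction of (11)-(13) for a given sigma (our own sketch).
import numpy as np

def search_direction(A, x, s, sigma):
    """Return (dx, dy, ds) with A dx = 0, A^T dy + ds = 0, s*dx + x*ds = sigma*mu*e - x*s."""
    mu = x @ s / x.size
    rhs = sigma * mu - x * s                    # right-hand side of (13)
    M = A @ ((x / s)[:, None] * A.T)            # normal matrix A Diag(x/s) A^T
    dy = np.linalg.solve(M, -A @ (rhs / s))
    ds = -A.T @ dy
    dx = (rhs - x * ds) / s
    return dx, dy, ds

# Tiny example with A = [1 1 1]; sigma = 0 gives the affine scaling direction,
# sigma = 1 the centrality direction.
A = np.array([[1.0, 1.0, 1.0]])
x = np.array([1.0, 2.0, 3.0])
s = np.array([3.0, 1.0, 2.0])
dx, dy, ds = search_direction(A, x, s, sigma=0.0)
print(np.allclose(A @ dx, 0.0), np.allclose(s * dx + x * ds, -x * s))
```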

In the predictor steps, we make progress along the central path. Given the search direction at the current iterate \(w = (x,y,s) \in \mathcal {N}(\beta )\), the step-length is chosen such that the line segment between the current and the next iterate remains in \(\overline{\mathcal {N}}(2\beta )\), i.e.,

$$\begin{aligned} \alpha ^\textrm{a}\le \sup \{\alpha \in [0,1]\, : \forall \alpha ' \in [0 ,\alpha ]: w+ \alpha ' \Delta w^\textrm{a}\in \mathcal {N}(2\beta )\}. \end{aligned}$$

Thus, we obtain a point \(w^+=w+\alpha ^\textrm{a}\Delta w^\textrm{a}\in \overline{\mathcal{N}}(2\beta )\). The corrector step finds a next iterate \(w^c=w^+ +\Delta w^\textrm{c}\), where \(\Delta w^\textrm{c}\) is the centrality direction computed at \(w^+\). The next proposition summarizes well-known properties, see e.g. [64, Section 4.5.1].

Proposition 3.4

Let \(w = (x,y,s) \in \mathcal {N}(\beta )\) for \(\beta \in (0,1/4]\).

  1. (i)

    For the affine scaling step, we have \(\mu (w^+)=(1-\alpha )\mu (w)\).

  2. (ii)

    The affine scaling step-length can be chosen as

    $$\begin{aligned}\alpha ^\textrm{a}\ge \max \left\{ \frac{\beta }{\sqrt{n}},1-\frac{\Vert \Delta x^\textrm{a}\Delta s^\textrm{a}\Vert }{\beta \mu (w)}\right\} \,. \end{aligned}$$
  3. (iii)

    For \(w^+ \in \overline{\mathcal{N}}(2\beta )\) with \(\mu (w^+) > 0\), let \(\Delta w^\textrm{c}\) be the centrality direction at \(w^+\). Then for \(w^\textrm{c}=w^+ +\Delta w^\textrm{c}\), we have \(\mu (w^\textrm{c})=\mu (w^+)\) and \(w^\textrm{c}\in \mathcal{N}(\beta )\).

  4. (iv)

    After a sequence of \(O(\sqrt{n} t)\) predictor and corrector steps, we obtain an iterate \(w'=(x',y',s')\in \mathcal{N}(\beta )\) such that \(\mu (w')\le \mu (w)/2^t\).

Minimum norm viewpoint and residuals For any point \(w = (x,y,s) \in \mathcal {P}^{++} \times \mathcal {D}^{++}\) we define

$$\begin{aligned} \delta = \delta (w) = s^{1/2}x^{-1/2} \in \mathbb {R}^n. \end{aligned}$$
(14)

With this notation, we can write (13) for \(\sigma = 0\) in the form

$$\begin{aligned} \delta \Delta x+\delta ^{-1} \Delta s = -s^{1/2}x^{1/2}\,. \end{aligned}$$
(15)

Note that for a point \(w(\mu )=(x(\mu ),y(\mu ),s(\mu ))\) on the central path, we have \(\delta _i(w(\mu ))=s_i(\mu )/\sqrt{\mu }=\sqrt{\mu }/x_i(\mu )\) for all \(i\in [n]\). From Proposition 3.1, we see that if \(w\in \mathcal{N}(\beta )\), and \(\mu =\mu (w)\), then for each \(i\in [n]\),

$$\begin{aligned} \sqrt{1-2\beta } \cdot \delta _i(w(\mu )) \le \delta _i(w)\le \frac{1}{\sqrt{1-2\beta }} \cdot \delta _i(w(\mu ))\,. \end{aligned}$$
(16)

The matrix \({\text {Diag}}(\delta (w))\) will be often used for rescaling in the algorithm. That is, for the current iterate \(w=(x,y,s)\) in the interior-point method, we will perform projections in the space \({\text {Diag}}(\delta (w))W\). To simplify notation, for \(\delta =\delta (w)\), we use \(L^\delta _I\) and \(\kappa ^\delta _{ij}\) as shorthands for \(L^{{\text {Diag}}(\delta )W}_I\) and \(\kappa ^{{\text {Diag}}(\delta )W}_{ij}\). The subspace \(W={\text {Ker}}(A)\) will be fixed throughout.

It is easy to see from the optimality conditions that the components of the affine scaling direction \(\Delta w^\textrm{a}=(\Delta x^\textrm{a},\Delta y^\textrm{a},\Delta s^\textrm{a})\) are the optimal solutions of the following minimum-norm problems.

$$\begin{aligned} \begin{aligned} \Delta x^\textrm{a}&= \mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{\Delta x \in \mathbb {R}^n}\{\Vert \delta (x+\Delta x)\Vert ^2: A\Delta x = 0\} \\ (\Delta y^\textrm{a}, \Delta s^\textrm{a})&= \mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{(\Delta y, \Delta s) \in \mathbb {R}^m \times \mathbb {R}^n} \{\Vert \delta ^{-1}(s+\Delta s)\Vert ^2: A^\top \Delta y + \Delta s = 0 \} \end{aligned} \end{aligned}$$
(17)

Following [37], for a search direction \(\Delta w = (\Delta x, \Delta y, \Delta s)\), we define the residuals as

$$\begin{aligned} Rx&:=\frac{\delta (x+\Delta x)}{\sqrt{\mu }},&Rs&:=\frac{\delta ^{-1}(s + \Delta s)}{\sqrt{\mu }}\, . \end{aligned}$$
(18)

We let \( Rx ^\textrm{a}\) and \( Rs ^\textrm{a}\) denote the residuals for the affine scaling direction \(\Delta w^\textrm{a}\). Hence, the primal affine scaling direction \(\Delta x^\textrm{a}\) is the one that minimizes the \(\ell _2\)-norm of the primal residual \( Rx ^\textrm{a}\), and the dual affine scaling direction \((\Delta y^\textrm{a},\Delta s^\textrm{a})\) minimizes the \(\ell _2\)-norm of the dual residual \( Rs ^\textrm{a}\). The next lemma summarizes simple properties of the residuals, see [37].

Lemma 3.5

For \(\beta \in (0,1/4]\) such that \(w = (x,y,s) \in \mathcal {N}(\beta )\) and the affine scaling direction \(\Delta w = (\Delta x^\textrm{a}, \Delta y^\textrm{a}, \Delta s^\textrm{a})\), we have

  1. (i)
    $$\begin{aligned} Rx ^\textrm{a} Rs ^\textrm{a}=\frac{\Delta x^\textrm{a}\Delta s^\textrm{a}}{\mu },\quad Rx ^\textrm{a}+ Rs ^\textrm{a}=\frac{x^{1/2}s^{1/2}}{\sqrt{\mu }}\, , \end{aligned}$$
    (19)
  2. (ii)
    $$\begin{aligned} \Vert Rx ^\textrm{a}\Vert ^2+\Vert Rs ^\textrm{a}\Vert ^2= n \,, \end{aligned}$$
  3. (iii)

    We have \(\Vert Rx ^\textrm{a}\Vert ,\Vert Rs ^\textrm{a}\Vert \le \sqrt{n}\), and for each \(i\in [n]\), \(\max \{ Rx _i^\textrm{a}, Rs _i^\textrm{a}\} \ge \frac{1}{2}(1-\beta )\).

  4. (iv)
    $$\begin{aligned} Rx ^\textrm{a}= -\frac{1}{\sqrt{\mu }}\delta ^{-1}\Delta s^\textrm{a}, \quad Rs ^\textrm{a}= - \frac{1}{\sqrt{\mu }}\delta \Delta x^\textrm{a}\,. \end{aligned}$$

Proof

Parts (i) and (iv) are immediate from the definitions and from (11)-(13) and (15). In part (ii), we use part (i) and \(({ Rx ^\textrm{a}})^\top Rs ^\textrm{a}=0\). In part (iii), the first statement follows by part (ii), and the second statement follows from (i) and Proposition 3.2. \(\square \)

For a subset \(I \subset [n]\), we define

$$\begin{aligned} \epsilon ^\textrm{a}_I(w) :=\max _{i \in I} \min \{| Rx ^\textrm{a}_i|, | Rs ^\textrm{a}_i|\}\,,\quad \text{ and }\quad \epsilon ^\textrm{a}(w) :=\epsilon _{[n]}^\textrm{a}(w)\,. \end{aligned}$$
(20)

The next claim shows that for the affine scaling direction, a small \(\epsilon ^\textrm{a}(w)\) yields a long step; see [37, Lemma 2.5].

Lemma 3.6

Let \(w = (x,y,s) \in \mathcal {N}(\beta )\) for \(\beta \in (0,1/4]\). Then the affine scaling step can be chosen such that

$$\begin{aligned} \frac{\mu (w+\alpha ^\textrm{a}\Delta w^\textrm{a}) }{\mu (w)}\le \min \left\{ 1-\frac{\beta }{\sqrt{n}},\frac{2\sqrt{n}\epsilon ^\textrm{a}(w)}{\beta }\right\} \,. \end{aligned}$$

Proof

Let \(\epsilon :=\epsilon ^\textrm{a}(w)\). From Lemma 3.5(i), we get \(\Vert \Delta x^\textrm{a}\Delta s^\textrm{a}\Vert /\mu =\Vert Rx ^\textrm{a} Rs ^\textrm{a}\Vert \). We can bound \(\Vert Rx ^\textrm{a} Rs ^\textrm{a}\Vert \le \epsilon (\Vert Rx ^\textrm{a}\Vert +\Vert Rs ^\textrm{a}\Vert )\le 2\epsilon \sqrt{n}\), where the latter inequality follows by Lemma 3.5(iii). From Proposition 3.4(ii), we get \(\alpha ^\textrm{a}\ge \max \{\beta /\sqrt{n},1-2\sqrt{n}\epsilon /\beta \}\). The claim follows by part (i) of the same proposition. \(\square \)

3.2.1 The layered-least-squares direction

Let \(\mathcal{J}=(J_1,J_2,\ldots , J_p)\) be an ordered partition of [n].Footnote 2 For \(k\in [p]\), we use the notations \(J_{<k}:=J_1\cup \ldots \cup J_{k-1}\), \(J_{>k}:=J_{k+1}\cup \ldots \cup J_p\), and similarly \(J_{\le k}\) and \(J_{\ge k}\). We will also refer to the sets \(J_k\) as layers, and \(\mathcal{J}\) as a layering. Layers with lower indices will be referred to as ‘higher’ layers.

Given \(w = (x,y,s) \in \mathcal {P}^{++} \times \mathcal {D}^{++}\), and the layering \(\mathcal{J}\), the layered-least-squares (LLS) direction is defined as follows. For the primal direction, we proceed backwards, with \(k=p,p-1,\ldots ,1\). Assume the components on the lower layers \(\Delta x_{J_{>k}}^\textrm{ll}\) have already been determined. We define the components in \(J_k\) as the coordinate projection \(\Delta x_{J_k}^\textrm{ll}= \pi _{J_k}(X_k)\), where the affine subspace \(X_k\) is defined as the set of minimizers

$$\begin{aligned} X_k&:=\mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{\Delta x \in \mathbb {R}^n}\left\{ \left\| \delta _{J_k}(x_{J_k} + \Delta x_{J_k})\right\| ^2 :\, A\Delta x=0, \Delta x_{J_{>k}} = \Delta x_{J_{>k}}^\textrm{ll}\right\} \, . \end{aligned}$$
(21)

The dual direction \(\Delta s^\textrm{ll}\) is determined in the forward order of the layers \(k=1,2,\ldots , p\). Assume we already fixed the components \(\Delta s_{J_{<k}}^\textrm{ll}\) on the higher layers. Then, \(\Delta s_{J_k}^\textrm{ll}= \pi _{J_k}(S_k)\) for

$$\begin{aligned} S_k&= \mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{\Delta s \in \mathbb {R}^{n}}\left\{ \left\| \delta _{J_k}^{-1}(s_{J_k} + \Delta s_{J_k})\right\| ^2 :\,\exists y \in \mathbb {R}^m, A^\top \Delta y+\Delta s=0, \Delta s_{J_{<k}} = \Delta s_{J_{<k}}^\textrm{ll}\right\} \, . \end{aligned}$$
(22)

The component \(\Delta y^\textrm{ll}\) is obtained as the optimal \(\Delta y\) for the final layer \(k=p\). We use the notations \( Rx ^\textrm{ll}\), \( Rs ^\textrm{ll}\), and \(\epsilon ^\textrm{ll}(w)\) analogously to the affine scaling direction. This search direction was first introduced in [63].

The affine scaling direction is the special case obtained for the trivial layering with a single layer, \(\mathcal{J}=([n])\). In this case, the definitions (21) and (22) coincide with those in (17).
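To make (21)–(22) concrete, the primal LLS direction can be computed layer by layer as a sequence of small constrained least-squares problems in \(W={\text {Ker}}(A)\). The following Python sketch is our own illustration (the parametrization through a kernel basis and the helper name are assumptions); the dual direction (22) is computed analogously from \(W^\perp \) and \(\delta ^{-1}\), with the layers processed in the opposite order.

```python
# Primal LLS direction (21), processed backwards over the layers; our own sketch.
import numpy as np
from scipy.linalg import lstsq, null_space

def primal_lls_direction(A, x, s, layers):
    """layers: list [J_1, ..., J_p] of index lists partitioning range(n)."""
    n = x.size
    delta = np.sqrt(s / x)
    N = null_space(A)                          # columns span W = Ker(A); dx = N @ z
    fixed_idx = []                             # indices on lower layers, already determined
    dx = np.zeros(n)
    for Jk in reversed(layers):                # k = p, p-1, ..., 1
        if fixed_idx:
            z0, *_ = lstsq(N[fixed_idx, :], dx[fixed_idx])   # enforce dx on lower layers
            K = null_space(N[fixed_idx, :])                  # remaining freedom
        else:
            z0, K = np.zeros(N.shape[1]), np.eye(N.shape[1])
        if K.shape[1] == 0:
            z = z0
        else:
            # minimize || delta_Jk * (x_Jk + (N (z0 + K w))_Jk) || over w
            target = -delta[Jk] * (x[Jk] + N[Jk, :] @ z0)
            w, *_ = lstsq(delta[Jk][:, None] * (N[Jk, :] @ K), target)
            z = z0 + K @ w
        dx[Jk] = N[Jk, :] @ z
        fixed_idx = fixed_idx + list(Jk)
    return dx           # with a single layer, layers=[list(range(n))], this is the affine scaling dx

A = np.array([[1.0, 1.0, 1.0, 1.0]])
x = np.array([1.0, 2.0, 1.0, 4.0])
s = np.array([2.0, 1.0, 3.0, 1.0])
print(primal_lls_direction(A, x, s, [[0, 1], [2, 3]]))
```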

3.3 Overview of ideas and techniques

A key technique in the analysis of layered least-squares algorithms [28, 36, 63] is to argue about variables that have ‘converged’. According to Proposition 3.1 and Lemma 3.3, for any iterate \(w=(x,y,s)\in \mathcal{N}(\beta )\) and the limit optimal solution \(w^*=(x^*,y^*,s^*)\), the bounds \(x^*_i\le O(n) x_i\) and \(s^*_i\le O(n) s_i\) hold. We informally say that \(x_i\) (or \(s_i\)) has converged if \(x_i\le O(n)x_i^*\) (respectively, \(s_i\le O(n) s_i^*\)) holds for the current iterate. Thus, the value of \(x_i\) (or \(s_i\)) remains within a multiplicative factor \(O(n^2)\) for the rest of the algorithm. Note that if \(\mu >\mu '\) and \(x_i\) has converged at \(\mu \), then \(\frac{s_i(\mu ')/s_i(\mu )}{\mu '/\mu }\in \left[ \frac{1}{O(n^2)},O(n^2)\right] \); thus, \(s_i\) keeps “shooting down” with the central path parameter.

Converged variables in the affine scaling algorithm Let us start by showing that at any point of the algorithm, at least one primal or dual variable has converged.

Suppose for simplicity that our current iterate is exactly on the central path, i.e., that \(xs = \mu e\). This assumption will be maintained throughout this overview. In this case, the residuals can be simply written as \( Rx ^\textrm{a}=(x+\Delta x^\textrm{a})/x\), \( Rs ^\textrm{a}=(s+\Delta s^\textrm{a})/s\). Recall from (17) that the affine scaling direction corresponds to minimizing the residuals \( Rx ^\textrm{a}\) and \( Rs ^\textrm{a}\). From this choice, we see that

$$\begin{aligned} \left\| \frac{x^*}{x} \right\| \ge \left\| \frac{x + \Delta x^\textrm{a}}{x}\right\| \,,\quad \left\| \frac{s^*}{s} \right\| \ge \left\| \frac{s + \Delta s^\textrm{a}}{s}\right\| \,. \end{aligned}$$
(23)

We have \(\Vert Rx ^\textrm{a}\Vert ^2 + \Vert Rs ^\textrm{a}\Vert ^2 = n\) by Lemma 3.5(ii). Let us assume \(\Vert Rx ^\textrm{a}\Vert ^2\ge n/2\); thus, there exists an index \(i \in [n]\) such that \(x^*_i \ge x_i/\sqrt{2}\). In other words, just by looking at the residuals, we get the guarantee that a primal or a dual variable has already converged. Based on the value of the residuals, we can guarantee this to be a primal or a dual variable, but cannot identify which particular \(x_i\) or \(s_i\) this might be.

For \(\Vert Rx ^\textrm{a}\Vert ^2\ge n/2\), a primal variable has already converged before performing the predictor and corrector steps. We now show that even if \(\Vert Rx ^\textrm{a}\Vert \) is small, a primal variable will have converged after a single iteration. From (23), we see that there is an index i with \(x^*_i/x_i \ge \Vert Rx ^\textrm{a}\Vert /\sqrt{n}\).

Furthermore, Proposition 3.4(ii) and Lemma 3.5 imply that \(1-\alpha \le {\Vert Rx ^\textrm{a}\Vert \cdot \Vert Rs ^\textrm{a}\Vert }/{\beta }\le {\sqrt{n} \Vert Rx ^\textrm{a}\Vert }/{\beta }\), since \(\Vert Rs ^\textrm{a}\Vert \le \sqrt{n}\). The predictor step moves to \(x^+ :=x + \alpha \Delta x^\textrm{a}= (1-\alpha ) x + \alpha (x + \Delta x^\textrm{a})\). Hence, \(x^+\le \left( \frac{\sqrt{n} \Vert Rx ^\textrm{a}\Vert }{\beta } + \Vert Rx ^\textrm{a}\Vert \right) x\). Putting the two inequalities together, we learn that \(x^+_i\le O(n)x^*_i\) for some \(i \in [n]\). Since \(w^+=(x^+,y^+,s^+)\in \overline{\mathcal{N}}(2\beta )\), Proposition 3.1 implies that \(x_i\) will have converged after this iteration. An analogous argument proves that some \(s_j\) will also have converged after the iteration. We again emphasize that the argument only shows the existence of converged variables, but we cannot identify them in general.

Measuring combinatorial progress Tying the above together, we find that after a single affine scaling step, at least one primal variable \(x_i\) and at least one dual variable \(s_j\) have converged. This means that for any \(\mu '<\mu \), \(\frac{x_i(\mu ')/x_j(\mu ')}{x_i(\mu )/x_j(\mu )}\in \left[ \frac{\mu }{O(n^4)\mu '},\frac{O(n^4)\mu }{\mu '}\right] \); thus, the ratio of these variables keeps asymptotically increasing. The \(x_i/x_j\) ratios serve as the main progress measure in the Vavasis–Ye algorithm. If \(x_i/x_j\) is between \(1/(\textrm{poly}(n)\bar{\chi })\) and \(\textrm{poly}(n)\bar{\chi }\) before the affine scaling step for the pair of converged variables \(x_i\) and \(s_j\), then after \(\textrm{poly}(n)\log \bar{\chi }\) iterations, the \(x_i/x_j\) ratio must leave this interval and never return. Thus, we obtain a ‘crossover-event’ that cannot again occur for the same pair of variables. In the affine scaling algorithm, there is no guarantee that \(x_i/x_j\) falls in such a bounded interval for the converging variables \(x_i\) and \(s_j\); in particular, we may obtain the same pairs of converged variables after each step.

The main purpose of layered-least-squares methods is to proactively ensure that, every so many iterations, some ‘bounded’ \(x_i/x_j\) ratios become ‘large’ and remain so for the rest of the algorithm.

In our approach, the first main insight is to focus on the scaling invariant quantities \(\kappa ^W_{ij} x_i/x_j\) instead. For simplicity’s sake, we first present the algorithm with the assumption that all values \(\kappa ^W_{ij}\) are known. We will then explain how this assumption can be removed by using gradually improving estimates on the values.

The combinatorial progress will be observed in the ‘long edge graph’. For a primal-dual feasible point \(w = (x,y,s)\) and \(\sigma =1/O(n^6)\), this is defined as \(G_{w,\sigma }=([n], E_{w,\sigma })\) with edges (i, j) such that \( \kappa ^W_{ij} x_i/x_j \ge \sigma \). Observe that for any \(i,j\in [n]\), at least one of (i, j) and (j, i) is a long edge: for any circuit C with \(i,j\in C\), we get the lower bounds \(\kappa ^W_{ij}\ge |g^C_j/g^C_i|\) and \(\kappa ^W_{ji}\ge |g^C_i/g^C_j|\); hence \((\kappa ^W_{ij}x_i/x_j)\cdot (\kappa ^W_{ji}x_j/x_i)=\kappa ^W_{ij}\kappa ^W_{ji}\ge 1\), and so at least one of the two factors is at least \(1\ge \sigma \).

Intuitively, our algorithm will enforce the following two types of events. The analysis in Sect. 4 is based on a potential function analysis capturing roughly the same progress.

  • For an iterate w and a value \(\mu > 0\), we have \(i,j\in [n]\) in a strongly connected component in \(G_{w,\sigma }\) of size \(\le \tau \), and for any iterate \(w'\) with \(\mu (w') > \mu \), if ij are in a strongly connected component of \(G_{w',\sigma }\) then this component has size \(\ge 2\tau \).

  • For an iterate w and a value \(\mu > 0\), we have \((i,j) \notin E_{w,\sigma }\), and for any iterate \(w'\) with \(\mu (w') > \mu \) we have \((i,j) \in E_{w',\sigma }\).

At most \(O(n^2 \log n)\) such events can happen overall, so if we can prove that on average an event will happen every \(O(\sqrt{n} \log (\bar{\chi }^*_A + n))\) iterations or the algorithm terminates, then we have the desired convergence bound of \(O(n^{2.5}\log (n) \log (\bar{\chi }^*_A + n))\) iterations.

Fig. 1 Top-down we have a chart of primal/dual variables and the estimated subgraph of the circuit ratio digraph (Definition 3.11) for three different iterations: 1) All variables except \(x_i\) are far away from their optimal values. 2) On \(J_1\) there is a primal variable (i) and a dual variable (j) that have converged, i.e. \(x_i\) is close to \(x_i^*\) and \(s_j\) is close to \(s_j^*\). 3) j moves to layer \(J_2\) due to a change in the underlying subgraph of the circuit ratio digraph

Converged variables cause combinatorial progress We now show that combinatorial progress as above must happen in the affine scaling step in the case when the graph \(G_{w,\sigma }\) is strongly connected. As noted above, for the pair of converged variables \(x_i\) and \(s_j\) after the affine scaling step, \(x_i/x_j\), and thus \(\kappa ^W_{ij} x_i/x_j\), will asymptotically increase by a factor 2 in every \(O(\sqrt{n})\) iterations.

By the strong connectivity assumption, there is a directed path in the long edge graph from i to j of length at most \(n-1\). Each edge has length at least \(\sigma \), and by the cycle characterization (Theorem 2.12) we know that \((\kappa ^W_{ji} x_j/x_i) \cdot \sigma ^{n-1} \le (\kappa _W^*)^n\). As such, \(\kappa ^W_{ji} x_j / x_i \le (\kappa _W^*)^n/\sigma ^{n-1}\). Since \(\kappa ^W_{ij} \kappa ^W_{ji}\ge 1\), we obtain the lower bound \(\kappa ^W_{ij} x_i / x_j \ge \sigma ^{n-1}/(\kappa _W^*)^{n}\).

This means that after \(O(\sqrt{n} \log ((\kappa _W^*/\sigma )^n)) = O(n^{1.5}\log (\kappa _W^* + n))\) affine scaling steps, the weight of the edge (i, j) will be more than \((\kappa _W^*/\sigma )^{4n}\). There can never again be a path from j to i with at most n edges in the long edge graph, for otherwise the resulting cycle would violate Theorem 2.12. Moreover, by the triangle inequality (Lemma 2.15), any other \(k \ne i,j\) will have either (i, k) or (k, j) of length at least \((\kappa _W^*/\sigma )^{2n}\), similarly causing a pair of variables to never again be in the same strongly connected component. As such, we took \(O(n^{1.5}\log (\kappa _W^* + n))\) affine scaling steps and in that time at least \(n-1\) combinatorial progress events have occurred.

The layered least squares step Similarly to the Vavasis–Ye algorithm [63] and subsequent literature, our algorithm is a predictor–corrector method using layered least squares (LLS) steps as in Sect. 3.2.1 for certain predictor iterations. Our algorithm (Algorithm 2) uses LLS steps only sometimes, and most steps are the simpler affine scaling steps; but for simplicity of this overview, we can assume every predictor iteration uses an LLS step.

We define the ordered partition \(\mathcal{J}=(J_1,J_2,\ldots , J_p)\) corresponding to the strongly connected components in topological ordering. Recalling that either (ij) or (ji) is a long edge for every pair \(i,j\in [n]\), this order is unique and such that there is a complete directed graph of long edges from every \(J_k\) to \(J_{k'}\) for \(1\le k<k'\le p\).
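Concretely, the ordered partition can be computed by building the long edge digraph from the current (estimated) values \(\kappa ^W_{ij} x_i/x_j\) and listing its strongly connected components in topological order. The sketch below is our own illustration; the input format and the plain Kosaraju-style SCC computation are our choices, not the paper's code.

```python
# Layering J = (J_1, ..., J_p): strongly connected components of the long edge
# digraph in topological order (sources first); our own sketch.
def layering_from_long_edges(n, kappa, x, sigma):
    adj = [[] for _ in range(n)]
    radj = [[] for _ in range(n)]
    for (i, j), k in kappa.items():
        if k * x[i] / x[j] >= sigma:          # (i, j) is a long edge
            adj[i].append(j)
            radj[j].append(i)
    order, seen = [], [False] * n             # first pass: record finish order
    def dfs1(v):
        seen[v] = True
        for u in adj[v]:
            if not seen[u]:
                dfs1(u)
        order.append(v)
    for v in range(n):
        if not seen[v]:
            dfs1(v)
    comp = [None] * n                         # second pass on the reversed graph
    def dfs2(v, c):
        comp[v] = c
        for u in radj[v]:
            if comp[u] is None:
                dfs2(u, c)
    p = 0
    for v in reversed(order):
        if comp[v] is None:
            dfs2(v, p)
            p += 1
    layers = [[] for _ in range(p)]
    for v in range(n):
        layers[comp[v]].append(v)
    return layers

# Three nodes: 0 and 1 form a 2-cycle of long edges, while the edge (2, 1) is short,
# so the layering is J_1 = {0, 1}, J_2 = {2}.
x = [1.0, 10.0, 0.001]
kappa = {(0, 1): 1.0, (1, 0): 1.0, (1, 2): 1.0, (2, 1): 1.0}
print(layering_from_long_edges(3, kappa, x, sigma=1e-3))
```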

The first important property of the LLS step is that it is very close to the affine scaling step. In Sect. 3.4.1, we introduce the partition lifting cost \(\ell ^W(\mathcal{J})=\max _{2 \le k \le p}\ell ^W(J_{\ge k})\) as the cost of lifting from lower to higher layers; we let \(\ell ^{1/x}(\mathcal{J})\) be a shorthand for \(\ell ^{{\text {Diag}}(1/x)W}(\mathcal{J})\). Note that this same rescaling is used for the affine scaling step in (17), since \(\delta =\sqrt{\mu }/x\) if w is on the central path. In Lemma 3.10(ii), we show that for a small partition lifting cost, the LLS residuals will remain near the affine scaling residuals. Namely,

$$\begin{aligned} \Vert Rx ^\textrm{ll}- Rx ^\textrm{a}\Vert , \Vert Rs ^\textrm{ll}- Rs ^\textrm{a}\Vert \le 6n^{3/2}\ell ^{1/x}(\mathcal{J})\,. \end{aligned}$$

Recall that the LLS residuals can be written as \( Rx ^\textrm{ll}= ({x + \Delta x^\textrm{ll}})/{x}\), \( Rs ^\textrm{ll}= (s + \Delta s^\textrm{ll})/{s}\) for a point on the central path. For \(\mathcal{J}\) defined as above, Lemma 2.11 yields \(\ell ^{1/x}(\mathcal{J}) \le n \max _{i \in J_{> k}, j \in J_{\le k}, k \in [p]} \kappa ^W_{ij}{x_i}/{x_j}\). This will be sufficiently small as this maximum is taken over ‘short’ edges (not in \(E_{w,\sigma }\)).

A second, crucial property of the LLS step is that it “splits” our LP into p separate LPs that have “negligible” interaction. Namely, the direction \((\Delta x_{J_k}^\textrm{ll},\Delta s_{J_k}^\textrm{ll})\) will be very close to the affine scaling step obtained in the problem restricted to the subspace \(W_{\mathcal{J},k} = \{x_{J_k}: x \in W, x_{J_{>k}} = 0\}\) (Lemma 3.10(i)).

Since each component \(J_k\) is strongly connected in the long edge graph \(G_{w,\sigma }\), if there is at least one primal \(x_i\) and dual \(s_j\) in \(J_k\) that have converged after the LLS step, we can use the above argument to show combinatorial progress regarding the \(\kappa ^W_{ij}x_i/x_j\) value (Lemma 4.3).

Exploiting the proximity between the LLS and affine scaling steps, Lemma 3.10(iv) gives a lower bound on the step size \(\alpha \ge 1-\frac{3\sqrt{n}}{\beta }\max _{i\in [n]}\min \{| Rx _i^\textrm{ll}|,| Rs _i^\textrm{ll}|\}\). Let \(J_k\) be the component where \(\min \{\Vert Rx _{J_k}^\textrm{ll}\Vert ,\Vert Rs _{J_k}^\textrm{ll}\Vert \}\) is the largest. Hence, the step size \(\alpha \) can be lower bounded in terms of \(\min \{\Vert Rx _{J_k}^\textrm{ll}\Vert ,\Vert Rs _{J_k}^\textrm{ll}\Vert \}\).

The analysis now distinguishes two cases. Let \(w^+=w+\alpha \Delta w^\textrm{ll}\) be the point obtained by the predictor LLS step. If the corresponding partition lifting cost \(\ell ^{1/x^+}(\mathcal{J})\) is still small, then an argument similar to the one showing the convergence of primal and dual variables for the affine scaling step implies that after the LLS step, at least one \(x_i\) and one \(s_j\) will have converged for \(i,j\in J_k\). Thus, in this case we obtain the combinatorial progress (Lemma 4.4).

The remaining case is when \(\ell ^{1/x^+}(\mathcal{J})\) becomes large. In Lemma 4.5, we show that in this case a new edge will enter the long edge graph, corresponding to the second combinatorial event listed previously. Intuitively, in this case one layer “crashes” into another.

Refined estimates on circuit imbalances In the above overview, we assumed the circuit imbalance values \(\kappa ^W_{ij}\) are given, and thus the graph \(G_{w,\sigma }\) is available. Whereas these quantities are difficult to compute, we can naturally work with lower estimates. For each \(i,j\in [n]\) that are contained in a circuit together, we start with the lower bound \(\hat{\kappa }^W_{ij}=|g^C_j/g^C_i|\) obtained for an arbitrary circuit C with \(i,j\in C\). We use the graph \(\hat{G}_{w,\sigma }=([n],\hat{E}_{w,\sigma })\) corresponding to these estimates. Clearly, \(\hat{E}_{w,\sigma }\subseteq E_{w,\sigma }\), but some long edges may be missing. We determine the partition \(\mathcal J\) of the strongly connected components of \(\hat{G}_{w,\sigma }\) and estimate the partition lifting cost \(\ell ^{1/x}(\mathcal{J})\). If this is below the desired bound, the argument works correctly. Otherwise, we can identify a pair i, j responsible for this failure. Namely, we find a circuit C with \(i,j\in C\) such that \(\hat{\kappa }^W_{ij}<|g^C_j/g^C_i|\). In this case, we update our estimate, and recompute the partition; this is described in Algorithm 1. At each LLS step, the number of updates is bounded by n, since every update leads to a decrease in the number of partition classes. This finishes the overview of the algorithm.

3.4 A linear system viewpoint of layered least squares

We now continue with the detailed exposition of our algorithm. We present an equivalent definition of the LLS step introduced in Sect. 3.2.1, generalizing the linear system (11)–(13). We use the subspace notation. With this notation, (11)–(13) for the affine scaling direction can be written as

$$\begin{aligned} s\Delta x^\textrm{a}+x\Delta s^\textrm{a}=-xs\,, \quad \Delta x^\textrm{a}\in W\,, \quad \text{ and }\quad \Delta s^\textrm{a}\in W^\perp \,,\ \end{aligned}$$
(24)

which is further equivalent to \(\delta \Delta x^\textrm{a}+\delta ^{-1}\Delta s^\textrm{a}=-x^{1/2}s^{1/2}\).

Given the layering \(\mathcal{J}\) and \(w=(x,y,s)\), for each \(k\in [p]\) we define the subspaces

$$\begin{aligned} W_{\mathcal{J},k} :=\{x_{J_k}: x \in W, x_{J_{>k}} = 0\}\,\quad \text{ and }\quad W_{\mathcal{J},k}^\perp :=\{x_{J_k}: x \in W^\perp , x_{J_{< k}} = 0\}\,.\end{aligned}$$

We emphasize that \(W_{\mathcal{J},k}\) and \(W_{\mathcal{J},k}^\perp \) live on the variables in layer k. That is, \(W_{\mathcal{J},k}, W_{\mathcal{J},k}^\perp \subseteq \mathbb {R}^{J_k}\). It is easy to see that these two subspaces are orthogonal complements. Our next goal is to show that, analogously to (24), the primal LLS step \(\Delta x^{\textrm{ll}}\) is obtained as the unique solution to the linear system

$$\begin{aligned} \delta \Delta x^\textrm{ll}+ \delta ^{-1} \Delta s = -x^{1/2} s^{1/2}\,, \quad \Delta x^\textrm{ll}\in W\,,\quad \text{ and }\quad \Delta s \in W_{\mathcal{J},1}^\perp \times \cdots \times W_{\mathcal{J},p}^\perp \,, \end{aligned}$$
(25)

and the dual LLS step \(\Delta s^{\textrm{ll}}\) is the unique solution to

$$\begin{aligned} \delta \Delta x + \delta ^{-1} \Delta s^\textrm{ll}= -x^{1/2} s^{1/2}\,, \quad \Delta x\in W_{\mathcal{J},1} \times \cdots \times W_{\mathcal{J},p}\,,\quad \text{ and } \quad \Delta s^\textrm{ll}\in W^\perp \,. \end{aligned}$$
(26)

It is important to note that \(\Delta s\) in (25) may be different from \(\Delta s^\textrm{ll}\), and \(\Delta x\) in (26) may be different from \(\Delta x^\textrm{ll}\). In fact, \(\Delta s^\textrm{ll}=\Delta s\) and \(\Delta x^\textrm{ll}=\Delta x\) can only be the case for the affine scaling step.

The following lemma proves that the above linear systems are indeed uniquely solved by the LLS step.

Lemma 3.7

For \(t \in \mathbb {R}^n\), \(W \subseteq \mathbb {R}^n\), \(\delta \in \mathbb {R}^n_{++}\), and \(\mathcal J = (J_1,J_2,\dots ,J_p)\), let \(w = \textrm{LLS}^{W,\delta }_{\mathcal J}(t)\) be defined by

$$\begin{aligned} \delta w + \delta ^{-1} v = \delta t,\qquad w \in W, \qquad v \in W_{\mathcal{J},1}^\perp \times \cdots \times W_{\mathcal{J},p}^\perp . \end{aligned}$$

Then \(\textrm{LLS}^{W,\delta }_{\mathcal J}(t)\) is well-defined and

$$\begin{aligned} \left\Vert \delta _{J_k}(t_{J_k} - w_{J_k})\right\Vert = \min \left\{ \left\Vert \delta _{J_k}(t_{J_k} - z_{J_k})\right\Vert : z \in W, z_{J_{>k}} = w_{J_{>k}} \right\} \end{aligned}$$

for every \(k\in [p]\).

In the notation of the above lemma we have, for ordered partitions \(\mathcal J = (J_1,J_2,\dots ,J_p)\), \(\bar{\mathcal{J}} = (J_p,J_{p-1},\dots ,J_1)\), and \((x,y,s) \in \mathcal P^{++} \times \mathcal D^{++}\) with \(\delta = s^{1/2}x^{-1/2}\), that \(\Delta x^\textrm{ll}= \textrm{LLS}^{W,\delta }_{\mathcal J}(-x)\) and \(\Delta s^\textrm{ll}= \textrm{LLS}^{W^\perp ,\delta ^{-1}}_\mathcal{{\bar{J}}}(-s)\).

Proof of Lemma 3.7

We first prove the equality \(W \cap (W^\perp _{\mathcal J,1} \times \dots \times W^\perp _{\mathcal J,p}) = \left\{ 0 \right\} \), and by a similar argument we have \(W^\perp \cap (W_{\mathcal J,1} \times \dots \times W_{\mathcal J,p}) = \left\{ 0 \right\} \). By duality, this last equality tells us that

$$\begin{aligned}(W^\perp \cap (W_{\mathcal J,1} \times \dots \times W_{\mathcal J,p}))^\perp = W + (W^\perp _{\mathcal J,1} \times \dots \times W^\perp _{\mathcal J,p}) = \mathbb {R}^n.\end{aligned}$$

Thus, the linear decomposition defining \(\textrm{LLS}^{W,\delta }_{\mathcal J}(t)\) has a solution and its solution is unique.

Suppose \(y \in W \cap (W^\perp _{\mathcal J,1} \times \dots \times W^\perp _{\mathcal J,p})\). We prove \(y_{J_k} = 0\) by induction on k, starting at \(k=p\). The induction hypothesis is that \(y_{J_{>k}} = 0\), which is an empty requirement when \(k = p\). The hypothesis \(y_{J_{>k}} = 0\) together with the assumption \(y \in W\) is equivalent to \(y \in W \cap \mathbb {R}^n_{J_{\le k}}\), and implies \(y_{J_k} \in \pi _{J_k}(W \cap \mathbb {R}^n_{J_{\le k}}) :=W_{\mathcal{J},k}\). Since we also have \(y_{J_k} \in W_{\mathcal{J},k}^\perp \) by assumption, which is the orthogonal complement of \(W_{\mathcal{J},k}\), we must have \(y_{J_k} = 0\). Hence, by induction \(y = 0\). This finishes the proof that \(\textrm{LLS}^{W,\delta }_{\mathcal J}(t)\) is well-defined.

Next we prove that w is a minimizer of \(\min \left\{ \left\Vert \delta _{J_k}(t_{J_k} - z_{J_k})\right\Vert : z \in W, z_{J_{>k}} = w_{J_{>k}} \right\} \). The optimality condition is that \(\delta _{J_k}(t_{J_k} - w_{J_k})\) be orthogonal to \(\delta _{J_k}u\) for every \(u \in W_{\mathcal{J},k}\). By the defining equation of \(\textrm{LLS}^{W,\delta }_{\mathcal J}(t)\), we have \(\delta _{J_k}(t_{J_k} - w_{J_k}) = \delta _{J_k}^{-1} v_{J_k}\), where \(v_{J_k} \in W^\perp _{\mathcal J, k}\). Noting that \(\langle \delta _{J_k} u, \delta _{J_k}^{-1} v_{J_k}\rangle = \langle u, v_{J_k} \rangle = 0\) for every \(u \in W_{\mathcal{J},k}\), the optimality condition follows immediately. \(\square \)
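Lemma 3.7 also suggests a direct, layer-by-layer way of computing \(\textrm{LLS}^{W,\delta }_{\mathcal J}(t)\): process the layers from \(J_p\) down to \(J_1\), each time minimizing the weighted residual on the current layer over the solutions that are still consistent with the choices made on later layers. The sketch below implements this greedy characterization; it is illustrative only and is not the efficient routine referred to in Remark 3.17.

```python
import numpy as np
from scipy.linalg import null_space

def lls_step(M, delta, t, layers):
    """Sketch of LLS^{W,delta}_J(t) for W = range(M): process the layers J_p,...,J_1,
    on each layer minimizing ||delta_Jk (t_Jk - w_Jk)|| over those w in W that are
    still consistent with the choices already made on later layers (Lemma 3.7)."""
    n, d = M.shape
    c0 = np.zeros(d)           # current particular solution, w = M @ c0
    N = np.eye(d)              # remaining degrees of freedom: coefficients c = c0 + N @ z
    for Jk in reversed(layers):
        A = delta[Jk][:, None] * (M[Jk, :] @ N)
        b = delta[Jk] * (t[Jk] - M[Jk, :] @ c0)
        z, *_ = np.linalg.lstsq(A, b, rcond=None)
        c0 = c0 + N @ z
        N = N @ null_space(A)  # the minimizers on this layer form c0 + range(N)
        if N.shape[1] == 0:    # no freedom left; earlier layers are already determined
            break
    return M @ c0
```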

With these tools, we can prove that lifting scores are self-dual. This explains the reverse layer order in the dual LLS step compared to the primal one, and justifies focusing on the lifting score in a self-dual algorithm. The next proposition generalizes the result of [18].

Proposition 3.8

(Proof in Sect. 5) For a linear subspace \(W \subseteq \mathbb {R}^n\) and index set \(I \subseteq [n]\) with \(J = [n]{\setminus } I\),

$$\begin{aligned} \Vert L_I^W\Vert \le \max \{1, \Vert L_J^{W^\perp }\Vert \}. \end{aligned}$$

In particular, \(\ell ^W(I) = \ell ^{W^\perp }(J)\).

We defer the proof to Sect. 5. Note that this proposition also implies Proposition 2.1(iv).
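As a numerical illustration of Proposition 3.8, both lifting scores can be computed from orthonormal bases of W and \(W^\perp \) via pseudoinverses; here we use that, as in Lemma 5.2, \(\ell ^W(I)\) is the operator norm of the map sending \(x\in \pi _I(W)\) to the part of its minimum-norm lift \(L_I^W(x)\) outside I. The snippet below is a sketch for checking the identity \(\ell ^W(I)=\ell ^{W^\perp }(J)\) on a random instance; it is not used by the algorithm.

```python
import numpy as np
from scipy.linalg import null_space, orth

def lifting_score(Q, I):
    """ell^W(I) for W = range(Q), with Q an orthonormal basis (n x d).
    pinv(Q[I]) gives the minimum-norm coefficient vector, hence the
    minimum-norm lift; we measure its part outside I."""
    n = Q.shape[0]
    J = np.setdiff1d(np.arange(n), I)
    T = Q[J] @ np.linalg.pinv(Q[I])
    return np.linalg.norm(T, 2) if T.size else 0.0

# numerical check of Proposition 3.8: ell^W(I) == ell^{W_perp}(J)
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 7))
Q = null_space(A)            # orthonormal basis of W = ker(A)
Q_perp = orth(A.T)           # orthonormal basis of W_perp = range(A^T)
I = np.array([0, 2, 5])
J = np.setdiff1d(np.arange(7), I)
print(lifting_score(Q, I), lifting_score(Q_perp, J))   # agree up to rounding error
```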

3.4.1 Partition lifting scores

A key insight is that if the layering \(\mathcal J\) is “well-separated”, then we indeed have \(x \Delta s^\textrm{ll}+ s \Delta x^\textrm{ll}\approx -xs\), that is, the LLS direction is close to the affine scaling direction. This will be shown in Lemma 3.10. The notion of “well-separatedness” can be formalized as follows. Recall the definition of the lifting score (4). The lifting score of the layering \(\mathcal{J}=(J_1, J_2,\ldots , J_p)\) of [n] with respect to W is defined as

$$\begin{aligned} \ell ^W(\mathcal{J}):=\max _{2 \le k \le p}\ell ^W(J_{\ge k})\,. \end{aligned}$$

For \(\delta \in \mathbb {R}^n_{++}\), we use \(\ell ^{W,\delta }(I) :=\ell ^{{\text {Diag}}(\delta )W}(I)\) and \(\ell ^{W,\delta }(\mathcal{J}) :=\ell ^{{\text {Diag}}(\delta )W}(\mathcal{J})\). When the context is clear, we omit W and write \(\ell ^{\delta }(I) :=\ell ^{W,\delta }(I)\) and \(\ell ^\delta (\mathcal{J}) :=\ell ^{W,\delta }(\mathcal{J})\).

The following important duality claim asserts that the lifting score of a layering equals the lifting score of the reverse layering in the orthogonal complement subspace. It is an immediate consequence of Proposition 3.8.

Lemma 3.9

Let \(W \subseteq \mathbb {R}^n\) be a linear subspace, \(\delta \in \mathbb {R}^n_{++}\). For an ordered partition \(\mathcal{J}=(J_1,J_2,\ldots , J_p)\), let \(\mathcal { \bar{J}}=(J_p,J_{p-1},\ldots ,J_1)\) denote the reverse ordered partition. Then, we have

$$\begin{aligned} \ell ^{W,\delta }(\mathcal{J})=\ell ^{W^\perp ,\delta ^{-1}}(\mathcal { \bar{J}}). \end{aligned}$$

Proof

Let \(U = {\text {Diag}}(\delta )W\). Note that \(U^\perp = {\text {Diag}}(\delta ^{-1}) W^\perp \). Then by Proposition 3.8, for \(2 \le k \le p\), we have that

$$\begin{aligned} \ell ^{W,\delta }(J_{\ge k}) = \ell ^{U}(J_{\ge k}) = \ell ^{U^\perp }(J_{\le k-1}) = \ell ^{U^\perp }(\bar{J}_{\ge p-k+2}) = \ell ^{W^\perp ,\delta ^{-1}}(\bar{J}_{\ge p-k+2}). \end{aligned}$$

In particular, \(\ell ^{W,\delta }(\mathcal{J}) = \ell ^{W^\perp ,\delta ^{-1}}(\mathcal { \bar{J}})\), as needed. \(\square \)

The next lemma summarizes key properties of the LLS steps, assuming the partition has a small lifting score. We show that if \(\ell ^\delta (\mathcal{J})\) is sufficiently small, then on the one hand, the LLS step will be very close to the affine scaling step, and on the other hand, on each layer \(k\in [p]\), it will be very close to the affine scaling step restricted to this layer for the subspace \(W_{\mathcal{J},k}\). The proof is deferred to Sect. 5.

Lemma 3.10

(Proof on p. 46) Let \(w=(x,y,s)\in \mathcal{N}(\beta )\) for \(\beta \in (0,1/4]\), let \(\mu =\mu (w)\) and \(\delta =\delta (w)\). Let \(\mathcal{J}=(J_1,\ldots ,J_p)\) be a layering with \(\ell ^\delta (\mathcal{J})\le \beta /(32 n^2)\), and let \(\Delta w^\textrm{ll}= (\Delta x^\textrm{ll}, \Delta y^\textrm{ll}, \Delta s^\textrm{ll})\) denote the LLS direction for the layering \(\mathcal{J}\). Let furthermore \(\epsilon ^\textrm{ll}(w)=\max _{i\in [n]}\min \{| Rx _i^\textrm{ll}|,| Rs _i^\textrm{ll}|\}\), and define the maximal step length as

$$\begin{aligned} \alpha ^*&:=\sup \{\alpha ' \in [0,1] : \forall \bar{\alpha }\in [0,\alpha ']: w + \bar{\alpha }\Delta w^\textrm{ll}\in \mathcal {N}(2\beta )\}\, . \end{aligned}$$

Then the following properties hold.

  1. (i)

    We have

    $$\begin{aligned} \Vert \delta _{J_k} \Delta x^\textrm{ll}_{J_k} + \delta ^{-1}_{J_k} \Delta s^\textrm{ll}_{J_k} +x^{1/2}_{J_k} s^{1/2}_{J_k}\Vert&\le 6n\ell ^\delta (\mathcal{J})\sqrt{\mu }\, , \quad \forall k\in [p], \text{ and } \end{aligned}$$
    (27)
    $$\begin{aligned} \Vert \delta \Delta x^\textrm{ll}+ \delta ^{-1} \Delta s^\textrm{ll}+x^{1/2} s^{1/2}\Vert&\le 6n^{3/2}\ell ^\delta (\mathcal{J})\sqrt{\mu }\, . \end{aligned}$$
    (28)
  2. (ii)

    For the affine scaling direction \(\Delta w^\textrm{a}=(\Delta x^\textrm{a},\Delta y^\textrm{a},\Delta s^\textrm{a})\),

    $$\begin{aligned} \Vert Rx ^\textrm{ll}- Rx ^\textrm{a}\Vert , \Vert Rs ^\textrm{ll}- Rs ^\textrm{a}\Vert \le 6n^{3/2}\ell ^\delta (\mathcal{J})\,. \end{aligned}$$
  3. (iii)

    For the residuals of the LLS steps we have \(\Vert Rx ^\textrm{ll}\Vert ,\Vert Rs ^\textrm{ll}\Vert \le \sqrt{2n}\). For each \(i \in [n]\), \(\max \{| Rx ^\textrm{ll}_i|,| Rs ^\textrm{ll}_i|\}\ge \frac{1}{2}-\frac{3}{4} \beta \).

  4. (iv)

    We have

    $$\begin{aligned} \alpha ^*\ge 1-\frac{3\sqrt{n}\epsilon ^\textrm{ll}(w)}{\beta }\,, \end{aligned}$$
    (29)

    and for any \(\alpha \in [0,1]\)

    $$\begin{aligned} \mu (w + \alpha \Delta w^\textrm{ll}) = (1-\alpha )\mu , \end{aligned}$$
  5. (v)

    We have \(\epsilon ^\textrm{ll}(w)=0\) if and only if \(\alpha ^*=1\). These are further equivalent to \(w+ \Delta w^\textrm{ll}=(x+\Delta x^\textrm{ll}, y+\Delta y^\textrm{ll},s+ \Delta s^\textrm{ll})\) being an optimal solution to (LP).

3.5 The layering procedure

Our algorithm performs LLS steps on a layering with a low lifting score. A further requirement is that within each layer, the circuit imbalances \(\kappa ^\delta _{ij}\) defined in (6) are suitably bounded. The rescaling here is with respect to \(\delta =\delta (w)\) for the current iterate \(w=(x,y,s)\). To define the precise requirement on the layering, we first introduce an auxiliary graph. Throughout we use the parameter

$$\begin{aligned} \gamma :=\frac{\beta }{2^{10} n^{5}}\,. \end{aligned}$$
(30)

The auxiliary graph For a vector \(\delta \in \mathbb {R}^n_{++}\) and \(\sigma >0\), we define the directed graph \(G_{\delta ,\sigma }=([n],E_{\delta ,\sigma })\) such that \((i,j)\in E_{\delta ,\sigma }\) if \(\kappa ^\delta _{ij}\ge \sigma \). This is a subgraph of the circuit ratio digraph studied in Sect. 2, including only the edges where the circuit ratio is at least the threshold \(\sigma \). Note that we do not have direct access to this graph, as we cannot efficiently compute the values \(\kappa ^\delta _{ij}\).

At the beginning of the entire algorithm, we run the subroutine Find-Circuits(A) as in Theorem 2.14, where \(W={\text {Ker}}(A)\); this yields, for each \(i\ne j\), \(i,j\in [n]\), an estimate \(\hat{\kappa }_{ij}\le \kappa _{ij}\). (We assume the matroid \(\mathcal{M}(A)\) is non-separable; for a separable matroid, the LP decomposes and we can solve the subproblems on the components separately.) These estimates will be gradually improved throughout the algorithm.

Note that \(\kappa ^\delta _{ij}=\kappa _{ij}\delta _j/\delta _i\) and \(\hat{\kappa }^\delta _{ij}=\hat{\kappa }_{ij}\delta _j/\delta _i\). If \(\hat{\kappa }^\delta _{ij}\ge \sigma \), then we are guaranteed \((i,j)\in E_{\delta ,\sigma }\).

Definition 3.11

Define \(\hat{G}_{\delta ,\sigma }=([n],\hat{E}_{\delta ,\sigma })\) to be the directed graph with edges (ij) such that \(\hat{\kappa }^\delta _{ij}\ge \sigma \); clearly, \(\hat{G}_{\delta ,\sigma }\) is a subgraph of \(G_{\delta ,\sigma }\).

Lemma 3.12

Let \(\delta \in \mathbb {R}^n_{++}\). For every \(i\ne j\), \(i,j\in [n]\), \(\hat{\kappa }_{ij}^\delta \cdot \hat{\kappa }_{ji}^\delta \ge 1\). Consequently, for any \(0<\sigma \le 1\), at least one of \((i,j)\in \hat{E}_{\delta ,\sigma }\) or \((j,i)\in \hat{E}_{\delta ,\sigma }\).

Proof

We show that this property holds at the initialization. Since the estimates can only increase, it remains true throughout the algorithm. Recall the definition of \(\hat{\kappa }_{ij}\) from Theorem 2.14. This is defined as the maximum of \(|g_j/g_i|\) such that \(g\in W\), \(\textrm{supp}(g)=C\) for some \(C\in \hat{\mathcal {C}}\) containing i and j. For the same vector g, we get \(\hat{\kappa }_{ji}\ge |g_i/g_j|\). Consequently, \(\hat{\kappa }_{ij}\cdot \hat{\kappa }_{ji}\ge 1\), and also \(\hat{\kappa }^\delta _{ij}\cdot \hat{\kappa }_{ji}^\delta \ge 1\). The second claim follows by the assumption \(\sigma \le 1\). \(\square \)

Balanced layerings We are ready to define the requirements on the layering used in the algorithm, where \(\delta =\delta (w)\) corresponds to the scaling of the current iterate \(w=(x,y,s)\).

Definition 3.13

Let \(\delta \in \mathbb {R}^n_{++}\). The layering \(\mathcal{J}=(J_1, J_2,\ldots , J_p)\) of [n] is \(\delta \)-balanced if

  1. (i)

    \(\ell ^\delta (\mathcal{J})\le \gamma \), and

  2. (ii)

    \(J_k\) is strongly connected in \(G_{\delta ,\gamma /n}\) for all \(k\in [p]\).

The following lemma shows that within each layer, the \(\kappa _{ij}^\delta \) values are within a bounded range. This will play an important role in our potential analysis.

Lemma 3.14

Let \(0<\sigma < 1\) and \(t>0\), and \(i,j\in [n]\), \(i\ne j\).

  1. (i)

    If the graph \(G_{\delta ,\sigma }\) contains a directed path of at most \(t-1\) edges from j to i, then

    $$\begin{aligned} \kappa _{ij}^\delta <\left( \frac{\kappa ^*}{\sigma }\right) ^{t}\,. \end{aligned}$$
  2. (ii)

    If \(G_{\delta ,\sigma }\) contains a directed path of at most \(t-1\) edges from i to j, then

    $$\begin{aligned} \kappa _{ij}^\delta > \left( \frac{\sigma }{\kappa ^*}\right) ^{t}\,. \end{aligned}$$

Proof

For part (i), let \(j=i_1,i_2,\ldots ,i_h=i\) be a path in \(G_{\delta ,\sigma }\) from j to i with \(h\le t\). That is, \(\kappa ^\delta _{i_\ell i_{\ell +1}}\ge \sigma \) for each \(\ell \in [h-1]\). Theorem 2.12 yields

$$\begin{aligned} (\kappa ^*)^{t}\ge \kappa _{ij}^\delta \cdot \sigma ^{h-1}> \kappa _{ij}^\delta \cdot \sigma ^{t}\,, \end{aligned}$$

since \(h\le t\) and \(\sigma < 1\). Part (ii) follows using part (i) for j and i, and that \(\kappa _{ij}^\delta \cdot \kappa _{ji}^\delta \ge 1\) according to Lemma 3.12. \(\square \)

Description of the layering subroutine Consider an iterate \(w=(x,y,s)\in \mathcal{N}(\beta )\) of the algorithm with \(\delta =\delta (w)\). The subroutine Layering\((\delta ,\hat{\kappa })\), described in Algorithm 1, constructs a \(\delta \)-balanced layering. We recall that the approximate auxiliary graph \(\hat{G}_{\delta ,\gamma /n}\) with respect to \(\hat{\kappa }\) is as in Definition 3.11.

[Algorithm 1: Layering\((\delta ,\hat{\kappa })\)]

We now give an overview of the subroutine Layering\((\delta ,\hat{\kappa })\). We start by computing the strongly connected components (SCCs) of the directed graph \(\hat{G}_{\delta ,\gamma /n}\). The edges of this graph are obtained using the current estimates \(\hat{\kappa }_{ij}^\delta \). According to Lemma 3.12, we have \((i,j) \in \hat{E}_{\delta ,\gamma /n}\) or \((j,i)\in \hat{E}_{\delta ,\gamma /n}\) for every \(i,j\in [n]\), \(i\ne j\). Hence, there is a linear ordering of the components \(C_1,C_2,\ldots ,C_\ell \) such that \((u,v)\in \hat{E}_{\delta ,\gamma /n}\) whenever \(u\in C_i\), \(v\in C_j\), and \(i<j\). We call this the ordering imposed by \(\hat{G}_{\delta , \gamma /n}\).

Next, for each \(k= 2,\ldots ,\ell \), we use the subroutine Verify-Lift\(({\text {Diag}}(\delta )W, C_{\ge k},\gamma )\) described in Lemma 2.11. If the subroutine returns ‘pass’, then we conclude \(\ell ^\delta (C_{\ge k})\le \gamma \), and proceed to the next layer. If the answer is ‘fail’, then the subroutine returns as certificates \(i\in C_{\ge k}\), \(j\in C_{<k}\), and t such that \(\gamma /n \le t\le \kappa _{ij}^\delta \). In this case, we update our estimate \(\hat{\kappa }_{ij}^\delta \) to the higher value t. We add (i, j) to an edge set \(\bar{E}\); this edge set was initialized to contain \(\hat{E}_{\delta ,\gamma /n}\). After adding (i, j), all components between those containing i and j will be merged into a single strongly connected component. To see this, recall that if \(i'\in C_{r}\) and \(j'\in C_{r'}\) for \(r<r'\), then \((i',j')\in \hat{E}_{\delta ,\gamma /n}\) by the ordering guaranteed by Lemma 3.12.

Finally, we compute the strongly connected components of \(([n],\bar{E})\), let \(J_1,J_2,\ldots ,J_p\) denote them in their unique acyclic order, and return these layers.
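The control flow of Layering\((\delta ,\hat{\kappa })\) can be summarized by the following sketch (illustration only). Here verify_lift stands in for the subroutine Verify-Lift of Lemma 2.11 and is assumed to return either ('pass', None) or ('fail', (i, j, t)) with \(\gamma /n\le t\le \kappa ^\delta _{ij}\); DW represents the rescaled subspace \({\text {Diag}}(\delta )W\). This is not the \(O(nm^2+n^2)\)-time implementation analyzed in Lemma 3.15.

```python
import numpy as np
import networkx as nx

def layering(delta, kappa_hat, gamma, verify_lift, DW):
    """Sketch of Layering(delta, kappa_hat): build the estimated graph, take its
    strongly connected components in topological order, call verify_lift on each
    suffix C_{>=k}; on 'fail', improve an estimate and add the edge; finally
    recompute the components of ([n], E_bar)."""
    n = len(delta)
    sigma = gamma / n
    G = nx.DiGraph()
    G.add_nodes_from(range(n))
    G.add_edges_from((i, j) for i in range(n) for j in range(n)
                     if i != j and kappa_hat[i, j] * delta[j] / delta[i] >= sigma)

    def ordered_sccs(H):
        cond = nx.condensation(H)                 # DAG of strongly connected components
        return [sorted(cond.nodes[c]['members']) for c in nx.topological_sort(cond)]

    C = ordered_sccs(G)                           # C_1, ..., C_l: ordering imposed by G_hat
    for k in range(1, len(C)):                    # suffixes C_{>=k} for k = 2, ..., l
        C_suffix = [i for comp in C[k:] for i in comp]
        status, cert = verify_lift(DW, C_suffix, gamma)
        if status == 'fail':
            i, j, t = cert
            kappa_hat[i, j] = max(kappa_hat[i, j], t * delta[i] / delta[j])  # improve estimate
            G.add_edge(i, j)                      # edge added to E_bar
    return ordered_sccs(G)                        # layers J_1, ..., J_p in acyclic order
```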

Lemma 3.15

The subroutine Layering\((\delta ,\hat{\kappa })\) returns a \(\delta \)-balanced layering in \(O(nm^2 + n^2)\) time.

The difficult part of the proof is showing the running time bound. We note that the weaker bound \(O(n^2 m^2)\) can be obtained by a simpler argument.

Proof

We first verify that the output layering is indeed \(\delta \)-balanced. For property (i) of Definition 3.13, note that each \(J_q\) is the union of some of the \(C_k\)’s. In particular, for every \(q\in [p]\), we have \(J_{\ge q}=C_{\ge k}\) for some \(k\in [\ell ]\). Assume now \(\ell ^\delta (C_{\ge k})>\gamma \). Then, at step k of the main cycle, the subroutine Verify-Lift returned the answer ‘fail’, and a new edge (i, j) with \(i\in C_{\ge k}\), \(j\in C_{<k}\) was added to \(\bar E\). Note that we already had \((j,i)\in \hat{E}_{\delta ,\gamma /n}\subseteq \bar E\), since \(j\in C_r\) for some \(r<k\), and \(i\in C_{r'}\) for some \(r'\ge k\). Hence i and j belong to the same strongly connected component of \(([n],\bar E)\), and thus to the same layer. This contradicts the fact that \(J_{\ge q}=C_{\ge k}\) contains i but not j.

Property (ii) follows since every edge \((i,j)\in \bar E\) satisfies \(\kappa ^\delta _{ij}\ge \gamma /n\): this holds for the initial edges in \(\hat{E}_{\delta ,\gamma /n}\) because \(\hat{\kappa }^\delta _{ij}\le \kappa ^\delta _{ij}\), and for the added edges by the choice of t. Therefore, \(([n],\bar E)\) is a subgraph of \(G_{\delta ,\gamma /n}\), and each layer \(J_k\) is strongly connected in \(([n],\bar E)\), hence also in \(G_{\delta ,\gamma /n}\).

Let us now turn to the computational cost. The initial strongly-connected components can be obtained in time \(O(n^2)\), and the same bound holds for the computation of the final components. (The latter can be also done in linear time, exploiting the special structure that the components \(C_i\) have a complete linear ordering.)

The second computational bottleneck is the subroutine Verify-Lift. We assume a matrix \(M\in \mathbb {R}^{n \times (n-m)}\) is computed at the very beginning such that \(\textrm{range}(M)=W\). We first explain how to implement one call to Verify-Lift in \(O(n (n-m)^2)\) time. We then sketch how to amortize the work across the different calls to Verify-Lift, using the nested structure of the layering, to implement the whole procedure in \(O(n (n-m)^2)\) time. To turn this into \(O(n m^2)\), we recall that the layering procedure is the same for W and \(W^\perp \) due to duality (Proposition 3.8). Since \(\dim (W^\perp )=m\), applying this subroutine on \(W^\perp \) instead of W achieves the same result but in time \(O(nm^2)\).

We now explain the implementation of Verify-Lift, where we are given as input \(C \subseteq [n]\) and the basis matrix \(M \in \mathbb {R}^{n \times (n-m)}\) as above with \(\textrm{range}(M) = W\). Clearly, the running time is dominated by the computation of the set \(I \subseteq C\) and the matrix \(B \in \mathbb {R}^{([n] {\setminus } C) \times |I|}\) satisfying \(L_C^W(x)_{[n] {\setminus } C} = B x_{I}\), for \(x \in \pi _C(W)\). We explain how to compute I and B from M using column operations (note that these preserve the range). The valid choices for \(I \subseteq C\) are in correspondence with maximal sets of linearly independent rows of \(M_{C,{\varvec{\cdot }}}\), noting then that \(|I| = r\) where \(r :=\textrm{rk}(M_{C,{\varvec{\cdot }}})\). Let \(D_1 = [n-m-r]\) and \(D_2 = [n-m] {\setminus } [n-m-r]\). By applying column operations to M, we can compute \(I \subseteq C\) such that \(M_{I,D_2} = \textbf{I}_{r}\) (\(r \times r\) identity) and \(M_{C,D_1} = 0\). This requires \(O(n(n-m)|C|)\) time using Gaussian elimination. At this point, note that \(\pi _C(W) = \textrm{range}(M_{C,D_2})\), \(\pi _{I}(W) = \mathbb {R}^{I}\) and \(\textrm{range}(M_{{\varvec{\cdot }}, D_1}) = W \cap \mathbb {R}^n_{[n] {\setminus } C}\). To compute B, we must transform the columns of \(M_{{\varvec{\cdot }},D_2}\) into minimum norm lifts of \(e_i \in \pi _{I}(W)\) into W, for all \(i \in I\). For this purpose, it suffices to make the columns of \(M_{[n] \setminus C,D_2}\) orthogonal to the range of \(M_{[n] \setminus C,D_1}\). Applying Gram-Schmidt orthogonalization, this requires \(O((n-|C|)(n-m)(n-m-r))\) time. From here, the desired matrix \(B = M_{[n] \setminus C, D_2}\). Thus, the total running time of Verify-Lift is \(O(n(n-m)|C| + (n-|C|)(n-m)(n-m-r)) = O(n(n-m)^2)\).

We now sketch how to amortize the work of all the calls of Verify-Lift during the layering algorithm, to achieve a total \(O(n(n-m)^2)\) running time. Let \(C_1,\dots ,C_\ell \) denote the candidate SCC layering. Our task is to compute the matrices \(B_k\), \(2 \le k \le \ell \), needed in the calls to Verify-Lift on \(W, C_{\ge k}\), \(2 \le k\le \ell \), in total \(O(n(n-m)^2)\) time. We achieve this in three steps working with the basis matrix M as above. Firstly, by applying column operations to M, we compute sets \(I_k \subseteq C_k\) and \(D_k = [|I_{\le k}|] {\setminus } [|I_{< k}|]\), \(k \in [\ell ]\), such that \(M_{I_k,D_k} = \textbf{I}_{r_k}\), where \(r_k = |I_k|\), and \(M_{C_{\ge k},D_{<k}} = 0\), \(2 \le k \le \ell \). Note that this enforces \(\sum _{k=1}^\ell r_k = (n-m)\). This computation requires \(O(n(n-m)^2)\) time using Gaussian elimination. This computation achieves \(\textrm{range}(M_{C_k,D_k}) = \pi _{C_k}(W \cap \mathbb {R}^n_{C_{\le k}})\), \(\textrm{range}(M_{C_{\ge k},D_{\ge k}}) = \pi _{C_{\ge k}}(W)\) and \(\textrm{range}(M_{{\varvec{\cdot }},D_{\le k}}) = W \cap \mathbb {R}^n_{C_{\le k}}\), for all \(k \in [\ell ]\).

From here, we block orthogonalize M, such that the columns of \(M_{{\varvec{\cdot }}, D_k}\) are orthogonal to the range of \(M_{{\varvec{\cdot }},D_{<k}}\), \(2 \le k \le \ell \). Applying an appropriately adapted Gram-Schmidt orthogonalization, this requires \(O(n(n-m)^2)\) time. Note that this operation maintains \(M_{I_k,D_k} = \textbf{I}_{r_k}\), \(k \in [\ell ]\), since \(M_{C_{\ge k},D_{<k}} = 0\). At this point, for \(k \in [\ell ]\) the columns of \(M_{{\varvec{\cdot }},D_k}\) are in correspondence with minimum norm lifts of \(e_i \in \pi _{I_{\ge k}}(W)\) into W, for all \(i \in I_k\). Note that to compute the matrix \(B_k\) we need the lifts of \(e_i \in \pi _{I_{\ge k}}(W)\), for all \(i \in I_{\ge k}\) instead of just \(i \in I_k\).

We now compute the matrices \(B_\ell ,\dots ,B_2\) in this order via the following iterative procedure. Let k denote the iteration counter, which decrements from \(\ell \) to 2. For \(k=\ell \) (first iteration), we let \(B_\ell = M_{C_{<\ell },D_\ell }\) and decrement k. For \(k < \ell \), we eliminate the entries of \(M_{I_k,D_{>k}}\) by using the columns of \(M_{{\varvec{\cdot }},D_k}\). We then let \(B_k = M_{C_{<k},D_{\ge k}}\) and decrement k. To justify correctness, one only has to notice that at the end of iteration k, we maintain the orthogonality of \(M_{{\varvec{\cdot }},D_{\ge k}}\) to the range of \(M_{{\varvec{\cdot }},D_{< k}}\) and that \(M_{I_{\ge k},D_{\ge k}} = \textbf{I}_{|I_{\ge k}|}\) is the appropriate identity. The cost of this procedure is the same as a full run of Gaussian elimination and thus is bounded by \(O(n(n-m)^2)\). The calls to Verify-Lift during the layering procedure can thus be executed in \(O(n(n-m)^2)\) amortized time as claimed. \(\square \)

3.6 The overall algorithm

[Algorithm 2: LP-Solve\((A,b,c,w^0)\)]

Algorithm 2 presents the overall algorithm LP-Solve\((A,b,c,w^0)\). We assume that an initial feasible solution \(w^0=(x^0,y^0,s^0)\in \mathcal{N}(\beta )\) is given. We address this in Sect. 7, by adapting the extended system used in [63]. We note that this subroutine requires an upper bound on \(\bar{\chi }^*\). Since computing \(\bar{\chi }^*\) is hard, we can implement it by a doubling search on \(\log \bar{\chi }^*\), as explained in Sect. 7. Other than for initialization, the algorithm does not require an estimate on \(\bar{\chi }^*\).

The algorithm starts with the subroutine Find-Circuits(A) as in Theorem 2.14. The iterations are similar to the MTY Predictor–Corrector algorithm [39]. The main difference is that certain affine scaling steps are replaced by LLS steps. In every predictor step, we compute the affine scaling direction, and consider the quantity \(\epsilon ^\textrm{a}(w)=\max _{i\in [n]}\min \{| Rx ^\textrm{a}_i|,| Rs ^\textrm{a}_i|\}\). If this is above the threshold \(10n^{3/2}\gamma \), then we perform the affine scaling step. However, in case \(\epsilon ^\textrm{a}(w)<10n^{3/2}\gamma \), we use the LLS direction instead. In each such iteration, we call the subroutine Layering(\(\delta ,\hat{\kappa }\)) (Algorithm 1) to compute the layers, and we compute the LLS step for this layering.

Another important difference is that the algorithm does not require a final rounding step. It terminates with the exact optimal solution \(w^*\) once a predictor step is able to perform a full step with \(\alpha =1\).
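Schematically, an iteration of LP-Solve proceeds as follows. The sketch below is only meant to show the control flow; every helper (affine_scaling, layering, lls, max_step, corrector, move, delta, kappa_hat) is a placeholder for the corresponding subroutine of the paper rather than an actual implementation.

```python
import numpy as np

def lp_solve_loop(w, n, gamma, beta, H):
    """Control-flow sketch of Algorithm 2 (LP-Solve). H bundles placeholder
    subroutines; the real algorithm additionally maintains the kappa-hat
    estimates, which are updated inside Layering."""
    while True:
        dw_aff, Rx, Rs = H.affine_scaling(w)             # predictor direction and residuals
        eps_a = np.max(np.minimum(np.abs(Rx), np.abs(Rs)))
        if eps_a >= 10 * n ** 1.5 * gamma:
            dw = dw_aff                                  # ordinary affine scaling step
        else:
            layers = H.layering(H.delta(w), H.kappa_hat) # Algorithm 1
            dw = H.lls(w, layers)                        # layered least squares direction
        alpha = H.max_step(w, dw, 2 * beta)              # longest step staying in N(2*beta)
        if alpha >= 1.0:
            return H.move(w, dw, 1.0)                    # exact optimal solution reached
        w = H.corrector(H.move(w, dw, alpha))            # centering step back into N(beta)
```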

Theorem 3.16

For given \(A\in \mathbb {R}^{m\times n}\), \(b\in \mathbb {R}^m\), \(c\in \mathbb {R}^n\), and an initial feasible solution \(w^0=(x^0,y^0,s^0)\in \mathcal{N}(1/8)\), Algorithm 2 finds an optimal solution to (LP) in \(O(n^{2.5}\log n \log ( \bar{\chi }^*_A+n))\) iterations.

Remark 3.17

Whereas using LLS steps enables us to give a strong bound on the total number of iterations, finding LLS directions has a significant computational overhead as compared to finding affine scaling directions. The layering \(\mathcal J\) can be computed in time \(O(nm^2)\) (Lemma 3.15), and the LLS steps also require \(O(nm^2)\) time, see [35, 63]. This is in contrast to the computational cost \(O(n^\omega )\) of an affine scaling direction. Here \(\omega <2.373\) is the matrix multiplication constant [62].

We now sketch a possible approach to amortize the computational cost of the LLS steps over the sequence of affine scaling steps. It was shown in [37] that for the MTY P-C algorithm, the “bad” scenario between two crossover events amounts to a series of affine scaling steps where the progress in \(\mu \) increases exponentially from every iteration to the next. This corresponds to the term \(O(\min \{n^2 \log \log (\mu _0/\eta ), \log (\mu _0/\eta )\})\) in their running time analysis. Roughly speaking, such a sequence of affine scaling steps indicates that an LLS step is necessary.

Hence, we could observe these accelerating sequences of affine scaling steps, and perform an LLS step after we see a sequence of length \(O(\log n)\). The progress made by these affine scaling steps offsets the cost of computing the LLS direction.

4 The potential function and the overall analysis

Let \(\mu >0\) and \(\delta (\mu )=s(\mu )^{1/2}x(\mu )^{-1/2}=\sqrt{\mu }/x(\mu )=s(\mu )/\sqrt{\mu }\) correspond to the point on the central path and recall the definition of \(\gamma \) in (30). For \(i,j\in [n]\), \(i\ne j\), we define

$$\begin{aligned} \rho ^\mu (i,j):=\frac{\log \kappa _{ij}^{\delta (\mu )}}{ \log \left( 4n\kappa ^*_W/\gamma \right) }\,, \end{aligned}$$

and the main potentials in the algorithm as

$$\begin{aligned} \Psi ^\mu (i,j):=\max \left\{ 1,\min \left\{ 2n,\inf _{0<\mu '<\mu }\rho ^{\mu '}(i,j)\right\} \right\} \quad \text{ and } \\ \quad \Psi (\mu ):=\sum _{i,j\in [n], i\ne j}\log \Psi ^\mu (i,j)\,. \end{aligned}$$

The motivation for \(\rho ^\mu (i,j)\) and \(\Psi ^\mu (i,j)\) comes from Lemma 3.14, using \(\sigma =\gamma /(4n)\). Thus, \(\log \kappa _{ij}^{\delta (\mu )}/ \log \left( 4n\kappa ^*_W/\gamma \right) \) can be seen as a lower bound on the length of the shortest ji path. Recall that the layers are defined as strongly connected components of \(\hat{G}_{\delta ,\gamma /n}\), which is a subgraph of \(G_{\delta (\mu ),\gamma /(4n)}\) (using the bound (16)). Consequently, whenever \(\rho ^\mu (i,j)\ge n\), the nodes i and j cannot be in the same strongly connected component for the normalized duality gap \(\mu \). Thus, our potentials \(\Psi ^\mu (i,j)\) can be seen as fine-grained analogues of the crossover events analyzed in [36, 37, 63]: the definition of \(\Psi ^\mu (i,j)\) contains a minimization over \(0<\mu '<\mu \); therefore, \(\Psi ^\mu (i,j)> n\) implies that i and j may never appear on the same layer for any \(\mu '\le \mu \). On the other hand, these potentials are more fine-grained: even for \(t < n\), if \(\Psi ^\mu (i,j)\ge t\) then whenever a layer contains both i and j for \(\mu '\le \mu \), this layer must have size \(\ge t\).

By definition, for all pairs \((i,j) \in [n] \times [n]\) we have \(\Psi ^{\mu '}(i,j)\ge \Psi ^{\mu }(i,j)\) for \(0<\mu '\le \mu \); and we enforce \(\Psi ^{\mu }(i,j)\in [1,2n]\). The upper bound can be imposed since values \(\Psi ^{\mu '}(i,j)\ge n\) do not yield any new information on the layering. Hence, the overall potential \(\Psi (\mu )\) is between 0 and \(O(n^2\log n)\). The overall analysis in the proof of Theorem 3.16 divides the iterations into phases. In each phase, we can identify a set \(J\subseteq [n]\), \(|J|>1\) arising as a layer or as the union of two layers in the LLS step at the beginning of the phase. We show that \(\Psi ^{\mu }(i,j)\) doubles for at least \(|J|-1\) pairs \((i,j) \in J \times J\) during the subsequent \(O(\sqrt{n}|J|\log (\bar{\chi }^*+n))\) iterations; consequently, \(\Psi (\mu )\) increases by at least \(|J|-1\) during these iterations. This leads to the overall iteration bound \(O(n^{2.5}\log (n)\log (\bar{\chi }^*+n))\). In comparison, the crossover analysis would correspond to showing that within \(O(n^{1.5}\log (\bar{\chi }^*+n))\) iterations, one of the \(\Psi ^{\mu }(i,j)\) values previously \(<n\) becomes larger than n. The following statement formalizes the above mentioned properties of \(\Psi ^{\mu }(i,j)\).

Lemma 4.1

Let \(w=(x,y,s)\in \mathcal{N}(\beta )\) for \(\beta \in (0,1/4]\). Let \(i,j\in [n]\), \(i\ne j\), and let \(\mu =\mu (w)\).

  1. 1.

    If \(\hat{G}_{\delta ,\gamma /n}\) contains a path from j to i of at most \(t-1\) edges, then \(\rho ^\mu (i,j)<t\).

  2. 2.

    If \(\hat{G}_{\delta ,\gamma /n}\) contains a path from i to j of at most \(t-1\) edges, then \(\rho ^\mu (i,j) > -t\).

  3. 3.

    If \(\Psi ^\mu (i,j)\ge t\), then in any \(\delta (w')\)-balanced layering, where \(w'=(x',y',s')\in \mathcal{N}(\beta )\) with \(\mu (w') \le \mu \),

    • i and j cannot be together on a layer of size at most t, and

    • j cannot be on a layer preceding the layer containing i.

Proof

From (16), we see that for any i, j,

$$\begin{aligned}\hat{\kappa }^\delta _{ij}\le \kappa ^\delta _{ij}\le (1-2\beta )^{-1}\kappa ^{\delta (\mu )}_{ij}\le 4\kappa ^{\delta (\mu )}_{ij}\,. \end{aligned}$$

Consequently, \(\hat{G}_{\delta ,\gamma /n}\) is a subgraph of \(G_{\delta (\mu ),\gamma /(4n)}\). The statement now follows from Lemma 3.14 with \(\sigma =\gamma /(4n)\). \(\square \)

In what follows, we formulate four important lemmas crucial for the proof of Theorem 3.16. For the lemmas, we only highlight some key ideas here, and defer the full proofs to Sect. 6.

For a triple \(w\in \mathcal{N}(\beta )\), \(\Delta w^\textrm{ll}\) refers to the LLS direction found in the algorithm, and \( Rx ^\textrm{ll}\) and \( Rs ^\textrm{ll}\) denote the residuals as in (18). For a subset \(I \subset [n]\) recall the definition

$$\begin{aligned} \epsilon _I^\textrm{ll}(w) := \max _{i \in I} \min \{| Rx _i^\textrm{ll}|, | Rs _i^\textrm{ll}|\}\, . \end{aligned}$$

We introduce another important quantity \(\xi \) for the analysis:

$$\begin{aligned} \xi ^\textrm{ll}_I(w):= \min \{\Vert Rx ^\textrm{ll}_I\Vert , \Vert Rs ^\textrm{ll}_I\Vert \}\, \end{aligned}$$
(31)

for a subset \(I \subset [n]\). For a layering \(\mathcal{J}=(J_1,J_2,\ldots ,J_p)\), we let

$$\begin{aligned} \xi ^\textrm{ll}_\mathcal{J}(w)=\max _{k \in [p]} \xi ^\textrm{ll}_{J_k}(w)\,. \end{aligned}$$
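For concreteness, given the residual vectors \( Rx ^\textrm{ll}\) and \( Rs ^\textrm{ll}\), these quantities are computed as in the following short sketch.

```python
import numpy as np

def eps_ll(Rx, Rs, I):
    """epsilon^ll_I(w) = max over i in I of min(|Rx_i|, |Rs_i|)."""
    return np.max(np.minimum(np.abs(Rx[I]), np.abs(Rs[I])))

def xi_ll(Rx, Rs, layers):
    """xi^ll_J(w) = max over layers J_k of min(||Rx_{J_k}||, ||Rs_{J_k}||)."""
    return max(min(np.linalg.norm(Rx[Jk]), np.linalg.norm(Rs[Jk])) for Jk in layers)
```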

The key idea of the analysis is to extract information about the optimal solution \(w^*=(x^*,y^*,s^*)\) from the LLS direction. The first main lemma shows that if \(\Vert Rx ^\textrm{ll}_{J_q}\Vert \) is large on some layer \(J_q\), then for at least one index \(i\in J_q\), \(x^*_i/x_i\ge 1/\textrm{poly}(n)\), i.e., the variable \(x_i\) has “converged”. The analogous statement holds on the dual side for \(\Vert Rs ^\textrm{ll}_{J_q}\Vert \) and an index \(j \in J_q\).

Lemma 4.2

(Proof in Sect. 6) Let \(w = (x,y,s) \in \mathcal N(\beta )\) for \(\beta \in (0,1/8]\) and let \(w^* = (x^*, y^*, s^*)\) be the optimal solution corresponding to \(\mu ^* = 0\) on the central path. Let further \(\mathcal{J}=(J_1, \ldots , J_p)\) be a \(\delta (w)\)-balanced layering (Definition 3.13), and let \(\Delta w^\textrm{ll}=(\Delta x^\textrm{ll}, \Delta y^\textrm{ll}, \Delta s^\textrm{ll})\) be the corresponding LLS direction. Then the following statement holds for every \(q \in [p]\):

  1. (i)

    There exists \(i \in J_q\) such that

    $$\begin{aligned} x_i^* \ge \frac{2x_i}{3\sqrt{n}}\cdot (\Vert Rx _{J_q}^\textrm{ll}\Vert - 2\gamma n)\, . \end{aligned}$$
    (32)
  2. (ii)

    There exists \(j \in J_q\) such that

    $$\begin{aligned} {s_j^*}\ge \frac{2s_j}{3\sqrt{n}} \cdot (\Vert Rs _{J_q}^\textrm{ll}\Vert - 2\gamma n)\, . \end{aligned}$$
    (33)

We outline the main idea of the proof of part (i); part (ii) follows analogously using the duality of the lifting scores (Lemma 3.9). On layer q, the LLS step minimizes \(\Vert \delta _{J_q}(x_{J_q}+\Delta x_{J_q})\Vert \), subject to \(\Delta x_{J_{>q}}=\Delta x_{J_{>q}}^\textrm{ll}\) and subject to existence of \(\Delta x_{J_{<q}}\) such that \(\Delta x \in W\). By making use of \(\ell ^{\delta (w)}(J_{>q})\le \gamma \) due to \(\delta (w)\)-balancedness, we can show the existence of a point \(z\in W+x^*\) such that \(\Vert \delta _{J_q}(z_{J_q}-x^*_{J_q})\Vert \) is small, and \(z_{J_{>q}}=x_{J_{>q}}+\Delta x^\textrm{ll}_{J_{>q}}\). By the choice of \(\Delta x^\textrm{ll}_{J_q}\), we have \(\Vert \delta _{J_q} z_{J_q}\Vert \ge \Vert \delta _{J_q}(x_{J_q}+\Delta x_{J_q}^\textrm{ll})\Vert =\sqrt{\mu }\Vert Rx ^\textrm{ll}_{J_q}\Vert \). Therefore, \(\Vert \delta _{J_q}x^*_{J_q}/\sqrt{\mu }\Vert \) cannot be much smaller than \(\Vert Rx ^\textrm{ll}_{J_q}\Vert \). Noting that \(\delta _{J_q}x^*_{J_q}/\sqrt{\mu } \approx x^*_{J_q}/x_{J_q}\), we obtain a lower bound on \(x_i^*/x_i\) for some \(i\in J_q\).

We emphasize that the lemma only shows the existence of such indices i and j, but does not provide an efficient algorithm to identify them. It is also useful to note that for any \(i \in [n]\), \(\max \{| Rx ^\textrm{ll}_i|,| Rs ^\textrm{ll}_i|\}\ge \frac{1}{2}-\frac{3}{4}\beta \) according to Lemma 3.10(iii). Thus, for each \(q\in [p]\), we obtain a strong and positive lower bound either, in case (i), on \(x_i^*/x_i\) or, in case (ii), on \(s_i^*/s_i\) for some \(i \in J_q\).

The next lemma allows us to argue that the potential function \(\Psi ^{\cdot }(\cdot ,\cdot )\) increases for multiple pairs of variables, if we have strong lower bounds on both \(x_i^*\) and \(s_j^*\) for some \(i,j\in [n]\), along with a lower and upper bound on \(\rho ^\mu (i,j)\).

Lemma 4.3

(Proof in Sect. 6) Let \(w=(x,y,s)\in \mathcal{N}(2\beta )\) for \(\beta \in (0,1/8]\), let \(\mu =\mu (w)\) and \(\delta =\delta (w)\). Let \(i,j\in [n]\) and \(2 \le \tau \le n\) such that for the optimal solution \(w^*=(x^*,y^*,s^*)\), we have \(x_i^*\ge \beta x_i/(2^{10}n^{5.5})\) and \(s_j^*\ge \beta s_j/(2^{10}n^{5.5})\), and assume \(\rho ^\mu (i,j)\ge -\tau \). After \(O(\beta ^{-1}\sqrt{n}\tau \log (\bar{\chi }^*+n))\) further iterations the duality gap \(\mu '\) fulfills \(\Psi ^{\mu '}(i,j)\ge 2\tau \), and for every \(\ell \in [n]\setminus \{i,j\}\), either \(\Psi ^{\mu '}(i,\ell )\ge 2\tau \), or \(\Psi ^{\mu '}(\ell ,j)\ge 2\tau \).

We note that i and j as in the lemma are necessarily different, since \(i=j\) would imply \(0=x_i^* s^*_i\ge \beta ^2 \mu /(2^{20} n^{11}) > 0\).

Let us illustrate the idea of the proof of \(\Psi ^{\mu '}(i,j)\ge 2\tau \). For i and j as in the lemma, and for a central path element \(w'=w(\mu ')\) for \(\mu '<\mu \), we have \(x'_i\ge x_i^*/n\ge \beta x_i/(2^{10}n^{6.5})\) and \(s'_j\ge s_j^*/n\ge \beta s_j/(2^{10}n^{6.5})\) by the near-monotonicity of the central path (Lemma 3.3). Note that

$$\begin{aligned} \kappa _{ij}^{\delta '}=\kappa _{ij}\cdot \frac{\delta '_j}{\delta '_i}=\kappa _{ij}\cdot \frac{x'_is'_j}{\mu '}\ge \kappa _{ij}\cdot \frac{\beta ^2 x_is_j}{2^{20}n^{13}\mu '}\ge \frac{\beta ^2(1-\beta )^2}{ 2^{20} n^{13}}\cdot \kappa _{ij}^\delta \cdot \frac{\mu }{\mu '}\,, \end{aligned}$$

where the last inequality uses Proposition 3.2. Consequently, as \(\mu '\) decreases sufficiently, \(\kappa _{ij}^{\delta '}\) will become much larger than \(\kappa _{ij}^\delta \). The claim on \(\ell \in [n]{\setminus }\{i,j\}\) can be shown by using the triangle inequality \(\kappa _{i\ell }\cdot \kappa _{\ell j}\ge \kappa _{ij}\) shown in Lemma 2.15.

Assume now \(\xi ^\textrm{ll}_{J_q}(w)\ge 4\gamma n\) for some \(q\in [p]\) in the LLS step. Then, Lemma 4.2 guarantees the existence of \(i,j\in J_q\) such that \(x_i^*/x_i, s_j^*/s_j\ge \frac{4}{3\sqrt{n}}\gamma n >\beta /(2^{10}n^{5.5})\). Further, Lemma 4.1 gives \(\rho ^\mu (i,j)\ge -|J_q|\). Hence, Lemma 4.3 is applicable for i and j with \(\tau =|J_q|\).

The overall potential argument in the proof of Theorem 3.16 uses Lemma 4.3 in three cases: \(\xi ^\textrm{ll}_{\mathcal{J}}(w)\ge 4\gamma n\) (Lemma 4.2 is applicable as above); \(\xi ^\textrm{ll}_{\mathcal{J}}(w)< 4\gamma n\) and \(\ell ^{\delta ^+}(\mathcal{J})\le 4\gamma n\) (Lemma 4.4); and \(\xi ^\textrm{ll}_{\mathcal{J}}(w)< 4\gamma n\) and \(\ell ^{\delta ^+}(\mathcal{J})> 4\gamma n\) (Lemma 4.5). Here, \(\delta ^+\) refers to the value of \(\delta \) after the LLS step. Note that \(\delta ^+ > 0\) is well-defined, unless the algorithm terminated with an optimal solution.

To prove these lemmas, we need to study how the layers “move” during the LLS step. We let \({\varvec{B}} = \{t \in [n]: | Rs _t^\textrm{ll}| < 4\gamma n\}\) and \({\varvec{N}}=\{t \in [n]: | Rx _t^\textrm{ll}| < 4\gamma n\}\). The assumption \(\xi _{\mathcal{J}}^\textrm{ll}(w) < 4\gamma n\) means that for each layer \(J_k\), either \(J_k\subseteq {\varvec{B}}\) or \(J_k\subseteq {\varvec{N}}\); we accordingly refer to \({\varvec{B}}\)-layers and \({\varvec{N}}\)-layers.

Lemma 4.4

(Proof in Sect. 6) Let \(w = (x,y,s) \in \mathcal N(\beta )\) for \(\beta \in (0,1/8]\), and let \(\mathcal{J}=(J_1, \ldots , J_p)\) be a \(\delta (w)\)-balanced partition. Assume that \(\xi _{\mathcal{J}}^\textrm{ll}(w) < 4\gamma n\), and let \(w^+ = (x^+, y^+, s^+)\in \overline{\mathcal{N}}(2\beta )\) be the next iterate obtained by the LLS step with \(\mu ^+=\mu (w^+)\) and assume \(\mu ^+ > 0\). Let \(q\in [p]\) such that \(\xi _{\mathcal{J}}^\textrm{ll}(w)=\xi _{J_q}^\textrm{ll}(w)\). If \(\ell ^{\delta ^+}(\mathcal J) \le 4\gamma n\), then there exist \(i,j\in J_q\) such that \(x_i^*\ge \beta x_i^+/(16n^{3/2})\) and \(s_j^*\ge \beta s_j^+/(16n^{3/2})\). Further, for any \(\ell ,\ell '\in J_q\), we have \(\rho ^{\mu ^+}(\ell ,\ell ')\ge -|J_q|\).

For the proof sketch, without loss of generality, let \(\xi _\mathcal{J}^\textrm{ll}=\xi _{J_q}^\textrm{ll}=\Vert Rx _{J_q}^\textrm{ll}\Vert \), that is, \(J_q\) is an \({\varvec{N}}\)-layer. The case \(\xi _{J_q}^\textrm{ll}=\Vert Rs _{J_q}^\textrm{ll}\Vert \) can be treated analogously. Since the residuals \(\Vert Rx _{J_q}^\textrm{ll}\Vert \) and \(\Vert Rs _{J_q}^\textrm{ll}\Vert \) cannot be both small, Lemma 4.2 readily provides a \(j\in J_q\) such that \(s_j^*/s_j\ge 1/(6\sqrt{n})\). Using Lemma 3.3 and Proposition 3.1, \(s_j^*/s_j^+ = s_j^*/s_j \cdot s_j/s_j^+> (1-\beta )/(6(1+4\beta )n^{3/2})>\beta /(16n^{3/2})\).

The key ideas of showing the existence of an \(i\in J_q\) such that \(x_i^*\ge x_i^+/(16n^{3/2})\) are the following. With \(\approx \), \(\lessapprox \) and \(\gtrapprox \), we write equalities and inequalities that hold up to small polynomial factors. First, we show that (i) \(\Vert \delta _{J_q}x^+_{J_q}\Vert \lessapprox \mu ^+/ \sqrt{\mu }\), and then, that (ii) \(\Vert \delta _{J_q} x^*_{J_q}\Vert \gtrapprox \mu ^+/\sqrt{\mu }\,.\)

If we can show (i) and (ii) as above, we obtain that \(\Vert \delta _{J_q}x^*_{J_q}\Vert \gtrapprox \Vert \delta _{J_q}x^+_{J_q}\Vert \), and thus, \(x_i^*\gtrapprox x_i^+\) for some \(i\in J_q\).

Let us now sketch the first step. By the assumption \(J_q \subset {\varvec{N}}\), one can show \(x_{J_q}^+/x_{J_q} \approx \mu ^+/\mu \), and therefore

$$\begin{aligned}\Vert \delta _{J_q}x^+_{J_q}\Vert \approx \frac{\mu ^+}{\mu } \Vert \delta _{J_q}x_{J_q}\Vert \approx \frac{\mu ^+}{\mu } \sqrt{\mu } = \frac{\mu ^+}{\sqrt{\mu }}\,. \end{aligned}$$

The second part of the proof, namely, lower bounding \(\Vert \delta _{J_q}x^*_{J_q}\Vert \), is more difficult. Here, we only sketch it for the special case when \(J_q=[n]\). That is, we have a single layer only; in particular, the LLS step is the same as the affine scaling step \(\Delta x^\textrm{ll}=\Delta x^\textrm{a}\). The general case of multiple layers follows by making use of Lemma 3.10, i.e. exploiting that for a sufficiently small \(\ell ^\delta (\mathcal{J})\), the LLS step is close to the affine scaling step.

Hence, assume that \(\Delta x^\textrm{ll}=\Delta x^\textrm{a}\). Using the equivalent definition of the affine scaling step (17) as a minimum-norm point, we have \(\Vert \delta x^*\Vert \ge \Vert \delta (x+\Delta x^\textrm{ll})\Vert =\sqrt{\mu }\Vert Rx ^\textrm{ll}\Vert =\sqrt{\mu }\xi _\mathcal{J}^\textrm{ll}\). From Lemma 3.6, \(\mu ^+/\mu \le 2\sqrt{n}\epsilon ^\textrm{a}(w)/\beta \le 2\sqrt{n}\xi _\mathcal{J}^\textrm{ll}/\beta \). Thus, we see that \(\Vert \delta x^*\Vert \ge \beta \mu ^+/(2\sqrt{n\mu })\).

The final statement on lower bounding \(\rho ^{\mu ^+}(\ell ,\ell ')\ge -|J_q|\) for any \(\ell ,\ell '\in J_q\) follows by showing that \(\delta ^+_\ell /\delta ^+_{\ell '}\) remains close to \(\delta _\ell /\delta _{\ell '}\), and hence the values of \(\kappa ^{\delta ^+}_{\ell \ell '}\) and \(\kappa ^{\delta }_{\ell \ell '}\) are sufficiently close for indices on the same layer (Lemma 6.1).

Lemma 4.5

(Proof in Sect. 6) Let \(w = (x,y,s) \in \mathcal N(\beta )\) for \(\beta \in (0,1/8]\), and let \(\mathcal{J}=(J_1, \ldots , J_p)\) be a \(\delta (w)\)-balanced partition. Assume that \(\xi _{\mathcal{J}}^\textrm{ll}(w) < 4\gamma n\), and let \(w^+ = (x^+, y^+, s^+)\in \overline{\mathcal{N}}(2\beta )\) be the next iterate obtained by the LLS step with \(\mu ^+=\mu (w^+)\) and assume \(\mu ^+ > 0\). If \(\ell ^{\delta ^+}(\mathcal J) > 4\gamma n\), then there exist two layers \(J_q\) and \(J_r\) and \(i\in J_q\) and \(j\in J_r\) such that \(x_i^*\ge x^+_i/(8n^{3/2})\), and \(s_j^*\ge s^+_j/(8n^{3/2})\). Further, \(\rho ^{\mu ^+}(i,j)\ge -|J_q\cup J_r|\), and for all \(\ell ,\ell '\in J_q\cup J_r\), \(\ell \ne \ell '\) we have \(\Psi ^\mu (\ell ,\ell ')\le |J_q\cup J_r|\).

Let us sketch the main ideas of the proof. Consider any \(\ell \in J_k\subseteq {\varvec{B}}\). Then, since \( Rx _\ell ^\textrm{ll}\) is multiplicatively close to 1, \(x_\ell ^+\approx x_\ell \); on the other hand \(s_\ell ^+\) will “shoot down” close to the small value \( Rs _\ell ^\textrm{ll}\cdot s_\ell \). Conversely, for \(\ell \in J_k\subseteq {\varvec{N}}\), \(s_\ell ^+\approx s_\ell \), and \(x_\ell ^+\) will “shoot down” to a small value.

The key step of the analysis is showing that the increase in \(\ell ^{\delta ^+}(\mathcal J)\) can be attributed to an \({\varvec{N}}\)-layer \(J_r\) “crashing into” a \({\varvec{B}}\)-layer \(J_q\). That is, we show the existence of an edge \((i',j')\in E_{\delta ^+,\gamma /(4n)}\) for \(i'\in J_q\) and \(j'\in J_r\), where \(r<q\) and \(J_q\subseteq {\varvec{B}}\), \(J_r\subseteq {\varvec{N}}\). This can be achieved by analyzing the matrix B used in the subroutine Verify-Lift.

For the layers \(J_q\) and \(J_r\), we can use Lemma 4.2 to show that there exists an \(i\in J_q\) where \(x_i^*/x_i\) is lower bounded, and there exists a \(j\in J_r\) where \(s_j^*/s_j\) is lower bounded. The lower bound on \(\rho ^{\mu ^+}(i,j)\) and the upper bounds on the \(\Psi ^\mu (\ell ,\ell ')\) values can be shown by tracking the changes between the \(\kappa ^\delta (\ell ,\ell ')\) and \(\kappa ^{\delta ^+}(\ell ,\ell ')\) values, and applying Lemma 4.1 both at w and at \(w^+\).

Proof of Theorem 3.16

We analyze the overall potential function \(\Psi (\mu )\). By the iteration at \(\mu \) we mean the iteration where the normalized duality gap of the current iterate is \(\mu \).

By Proposition 3.4(ii) and Lemma 3.10(ii), the predictor step gives \(w'\in \overline{\mathcal{N}}(1/4)\) in every iteration, and thus by Proposition 3.4(iii), if \(\mu (w') > 0\), the iterate \(w^{\textrm{c}}\) after a corrector step fulfills \(w^{\textrm{c}} \in \mathcal{N}(1/8)\). If \(\mu ^+ = 0\) at the end of an iteration, the algorithm terminates with an optimal solution. Recall from Lemma 3.10(v) that this happens if and only if \(\epsilon ^\textrm{ll}(w)=0\) at a certain iteration.

From now on, assume that \(\mu ^+ > 0\). We distinguish three cases at each iteration: (Case I) \(\xi ^\textrm{ll}_{\mathcal{J}}(w)\ge 4\gamma n\); (Case II) \(\xi ^\textrm{ll}_{\mathcal{J}}(w) < 4\gamma n\) and \(\ell ^{\delta ^+}(\mathcal{J})\le 4\gamma n\); and (Case III) \(\xi ^\textrm{ll}_{\mathcal{J}}(w) < 4\gamma n\) and \(\ell ^{\delta ^+}(\mathcal{J})> 4\gamma n\). These cases are well-defined even at iterations where affine scaling steps are used; at such iterations, \(\xi ^\textrm{ll}_{\mathcal{J}}(w)\) still refers to the LLS residuals, even if these have not been computed by the algorithm.

Recall that the algorithm uses an LLS direction instead of the affine scaling direction whenever \(\epsilon ^\textrm{a}(w)<10n^{3/2}\gamma \). Consider now the case when an affine scaling direction is used, that is, \(\epsilon ^\textrm{a}(w)\ge 10n^{3/2}\gamma \). According to Lemma 3.10(ii), \(\Vert Rx ^\textrm{ll}- Rx ^\textrm{a}\Vert , \Vert Rs ^\textrm{ll}- Rs ^\textrm{a}\Vert \le 6n^{3/2}\gamma \). This implies that \(\xi ^\textrm{ll}_{\mathcal{J}}(w)\ge 4n^{3/2}\gamma \ge 4n\gamma \). Therefore, in cases II and III, an LLS step will be performed.

Starting with any given iteration, in each case we will identify a set \(J\subseteq [n]\) of indices with \(|J|>1\), and start a phase of \(O(\sqrt{n}|J|\log (\bar{\chi }^*+n))\) iterations (that can be either affine scaling or LLS steps). In each phase, we will guarantee that \(\Psi \) increases by at least \(|J|-1\). By definition, \(0\le \Psi (\mu )\le n(n-1)(\log _2n+1)\), and if \(\mu '<\mu \) then \(\Psi (\mu ')\ge \Psi (\mu )\). As we can partition the union of all iterations into disjoint phases, this yields the bound \(O(n^{2.5}\log n\log (\bar{\chi }^*+n))\) on the total number of iterations.

We now consider each of the cases. We always let \(\mu \) denote the normalized duality gap at the current iteration, and we let \(q\in [p]\) be the layer such that \(\xi ^\textrm{ll}_{\mathcal{J}}(w)= \xi ^\textrm{ll}_{J_q}(w)\).

Case I: \(\xi ^\textrm{ll}_{\mathcal{J}}(w)\ge 4\gamma n\). Lemma 4.2 guarantees the existence of \(i,j\in J_q\) such that \(x_i^*/x_i, s_j^*/s_j\ge 4\gamma n/(3\sqrt{n})>\beta /(2^{10}n^{5.5})\). Further, according to Lemma 4.1, \(\rho ^{\mu }(i,j)\ge -|J_q|\). Thus, Lemma 4.3 is applicable for \(J=J_q\). The phase starting at \(\mu \) comprises \(O(\sqrt{n}|J_q|\log (\bar{\chi }^*+n))\) iterations, after which we get a normalized duality gap \(\mu '\) such that \(\Psi ^{\mu '}(i,j)\ge 2|J_q|\), and for each \(\ell \in [n]{\setminus } \{i,j\}\), either \(\Psi ^{\mu '}(i,\ell )\ge 2|J_q|\), or \(\Psi ^{\mu '}(\ell ,j)\ge 2|J_q|\).

We can take advantage of these bounds for indices \(\ell \in J_q\). Again by Lemma 4.1, for any \(\ell ,\ell '\in J_q\), we have \(\Psi ^\mu (\ell ,\ell ')\le \rho ^\mu (\ell ,\ell ')\le |J_q|\). Thus, there are at least \(|J_q|-1\) pairs of indices \((\ell ,\ell ')\) for which \(\Psi ^\mu (\ell ,\ell ')\) increases by at least a factor 2 between iterations at \(\mu \) and \(\mu '\). The increase in the contribution of these terms to \(\Psi (\mu )\) is at least \(|J_q|-1\) during these iterations.

We note that this analysis works regardless whether an LLS step or an affine scaling step was performed in the iteration at \(\mu \).

Case II: \(\xi ^\textrm{ll}_{\mathcal{J}}(w) < 4\gamma n\) and \(\ell ^{\delta ^+}(\mathcal{J})\le 4\gamma n\). As explained above, in this case we perform an LLS step in the iteration at \(\mu \), and we let \(w^+\) denote the iterate obtained by the LLS step. For \(J=J_q\), Lemma 4.4 guarantees the existence of \(i,j\in J_q\) such that \(x_i^*/x_i^+,s_j^*/s_j^+>\beta /(16n^{3/2})\), and further, \(\rho ^{\mu ^+}(i,j)>-|J_q|\). We can therefore apply Lemma 4.3. The phase starting at \(\mu \) includes the LLS step leading to \(\mu ^+\) (and the subsequent centering step), and the additional \(O(\sqrt{n}|J_q|\log (\bar{\chi }^*+n))\) iterations (\(\beta \) is a fixed constant in Algorithm 2) as in Lemma 4.3. As in Case I, we get the desired potential increase compared to the potentials at \(\mu \) in layer \(J_q\).

Case III: \(\xi ^\textrm{ll}_{\mathcal{J}}(w) < 4\gamma n\) and \(\ell ^{\delta ^+}(\mathcal{J})>4\gamma n\). Again, the iteration at \(\mu \) will use an LLS step. We apply Lemma 4.5, and set \(J=J_q\cup J_r\) as in the lemma. The argument is the same as in Case II, using that Lemma 4.5 explicitly states that \(\Psi ^\mu (\ell ,\ell ')\le |J|\) for any \(\ell ,\ell '\in J\), \(\ell \ne \ell '\). \(\square \)

4.1 The iteration complexity bound for the Vavasis–Ye algorithm

We now show that the potential analysis described above also gives an improved bound \(O(n^{2.5}\log n \log (\bar{\chi }_A+n))\) for the original VY algorithm [63].

We recall the VY layering step. Order the variables via \( \pi \) such that \(\delta _{\pi (1)}\le \delta _{\pi (2)}\le \ldots \le \delta _{\pi (n)}\). The layers will be consecutive sets in the ordering; a new layer starts with \(\pi (i+1)\) each time \(\delta _{\pi (i+1)}>g\delta _{\pi (i)}\), for a parameter \(g=\textrm{poly}(n)\bar{\chi }_A\).

As outlined in the Introduction, the VY algorithm can be seen as a special implementation of our algorithm by setting \(\hat{\kappa }_{ij}=g\gamma /n\). With these edge weights, we have that \(\hat{\kappa }^\delta _{ij}\ge \gamma /n\) precisely if \(g\delta _j\ge \delta _i\).Footnote 3

With these edge weights, it is easy to see that our Layering(\(\delta ,\hat{\kappa }\)) subroutine finds the exact same components as VY. Moreover, the layers will be the initial strongly connected components \(C_i\) of \(\hat{G}_{\delta ,\gamma /n}\): due to the choice of g, this partition is automatically \(\delta \)-balanced. There is no need to call Verify-Lift.
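For comparison, the VY layering rule just described can be written in a few lines. The sketch below (Python, illustration only) sorts the variables by \(\delta \) and starts a new layer whenever consecutive values differ by more than the factor g.

```python
import numpy as np

def vy_layering(delta, g):
    """VY layering: sort variables by delta and start a new layer whenever the
    next value exceeds the previous one by more than a factor g."""
    order = np.argsort(delta)
    layers, current = [], [order[0]]
    for prev, nxt in zip(order[:-1], order[1:]):
        if delta[nxt] > g * delta[prev]:
            layers.append(current)
            current = []
        current.append(nxt)
    layers.append(current)
    return layers
```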

The essential difference compared to our algorithm is that the values \(\hat{\kappa }_{ij}=g\gamma /n\) are not lower bounds on \(\kappa _{ij}\) as we require, but upper bounds instead. This is convenient to simplify the construction of the layering. On the negative side, the strongly connected components of \(\hat{G}_{\delta ,\gamma /n}\) may not anymore be strongly connected in \(G_{\delta ,\gamma /n}\). Hence, we cannot use Lemma 4.1, and consequently, Lemma 4.3 does not hold.

Still, the \(\hat{\kappa }_{ij}\) bounds overestimate \(\kappa _{ij}\) by at most a factor poly\((n)\bar{\chi }_A\). Therefore, the strongly connected components of \(\hat{G}_{\delta ,\gamma /n}\) are strongly connected in \(G_{\delta ,\sigma }\) for some \(\sigma =1/(\textrm{poly}(n)\bar{\chi }_A)\).

Hence, the entire argument described in this section is applicable to the VY algorithm, with a different potential function defined with \(\bar{\chi }_A\) instead of \(\bar{\chi }^*_A\). This is the reason why the iteration bound in Lemma 4.3, and therefore in Theorem 3.16, also changes to a dependence on \(\bar{\chi }_A\).

It is worth noting that due to the overestimation of the \(\kappa _{ij}\) values, the VY algorithm uses a coarser layering than our algorithm. Our algorithm splits up the VY layers into smaller parts so that \(\ell ^\delta (\mathcal{J})\) remains small, but within each part, the gaps between the variables are bounded as a function of \(\bar{\chi }^*_A\) instead of \(\bar{\chi }_A\).

5 Properties of the layered least square step

This section is dedicated to the proofs of Proposition 3.8 on the duality of lifting scores and Lemma 3.10 on properties of LLS steps.

Proposition 3.8

(Restatement). For a linear subspace \(W \subseteq \mathbb {R}^n\) and index set \(I \subseteq [n]\) with \(J = [n]{\setminus } I\),

$$\begin{aligned} \Vert L_I^W\Vert \le \max \{1, \Vert L_J^{W^\perp }\Vert \}. \end{aligned}$$

In particular, \(\ell ^W(I) = \ell ^{W^\perp }(J)\).

Proof

We first treat the case where \(\pi _I(W) = \{0\}\) or \(\pi _J(W^\perp ) = \{0\}\). If \(\pi _I(W) = \left\{ 0 \right\} \) then \(\Vert L_I^W\Vert = \ell ^W(I) = 0\). Furthermore, in this case \(\mathbb {R}^I = \pi _I(W)^\perp = \pi _I(W^\perp \cap \mathbb {R}^n_I)\), and thus \(\{(0, w_J): w \in W^\perp \} \subseteq W^\perp \). In particular, \(\Vert L_J^{W^\perp }\Vert \le 1\) and \(\ell ^{W^\perp }(J) = 0\). Symmetrically, if \(\pi _J(W^\perp ) = \{0\}\) then \(\Vert L_J^{W^\perp }\Vert = \ell ^{W^\perp }(J) = 0\), \(\Vert L_I^W\Vert \le 1\) and \(\ell ^{W}(I) = 0\).

We now restrict our attention to the case where both \(\pi _I(W),\pi _J(W^\perp ) \ne \{0\}\). Under this assumption, we show that \(\Vert L_I^W\Vert = \Vert L_J^{W^\perp }\Vert \) and thus that \(\ell ^W(I) = \ell ^{W^\perp }(J)\). Note that since both of these subspaces are nontrivial, we clearly have \(\Vert L_I^W\Vert ,\Vert L_J^{W^\perp }\Vert \ge 1\).

We formulate a more general claim. Let \(\{0\} \ne U, V \subset \mathbb {R}^n\) be linear subspaces such that \(U + V = \mathbb {R}^n\) and \(U \cap V = \{0\}\). Note that for the orthogonal complements in \(\mathbb {R}^n\), we also have \(\{0\} \ne U^\perp ,V^\perp \), \(U^\perp + V^\perp = \mathbb {R}^n\) and \(U^\perp \cap V^\perp = \{0\}\).

Claim 5.1

Let \(\{0\} \ne U, V \subset \mathbb {R}^n\) be linear subspaces such that \(U + V = \mathbb {R}^n\) and \(U \cap V = \{0\}\). Thus, for \(z \in \mathbb {R}^n\), there are unique decompositions \(z = u + v\) with \(u\in U\), \(v \in V\) and \(z=u'+v'\) with \(u' \in U^\perp \) and \(v' \in V^\perp \). Let \(T: \mathbb {R}^n \rightarrow V\) be the map sending \(Tz = v\). Let \(T': \mathbb {R}^n \rightarrow V^\perp \) be the map sending \(T'z = v'\). Then, \(\Vert T\Vert = \Vert T'\Vert \).

Proof

To prove the statement, we claim that it suffices to show that if \(\Vert T\Vert > 1\) then \(\Vert T'\Vert \ge \Vert T\Vert \). To prove sufficiency, note that by symmetry, we also get that if \(\Vert T'\Vert > 1\) then \(\Vert T\Vert \ge \Vert T'\Vert \). Note that \(V,V^\perp \ne \{0\}\) by assumption, and \(Tz=z\) for \(z\in V\), \(T'z=z\) for \(z\in V^\perp \). Thus, we always have \(\Vert T\Vert , \Vert T'\Vert \ge 1\), and therefore the equality \(\Vert T\Vert = \Vert T'\Vert \) must hold in all cases. We now assume \(\Vert T\Vert > 1\) and show \(\Vert T'\Vert \ge \Vert T\Vert \).

Representing T as an \(n \times n\) matrix, we write \(T = \sum _{i=1}^k \sigma _i v_i u_i^\top \) using a singular value decomposition with \(\sigma _1 \ge \dots \ge \sigma _k > 0\). As such, \(v_1,\dots ,v_k\) is an orthonormal basis of V, since the \(\textrm{range}(T) = V\), and \(u_1,\dots ,u_k\) is an orthonormal basis of \(U^\perp \), since \({\text {Ker}}(T) = U\), noting that we have restricted to the singular vectors associated with positive singular values. By assumption, we have that \(\Vert T\Vert = \Vert Tu_1\Vert = \sigma _1 > 1\).

The proof is complete by showing that

$$\begin{aligned} \left\| T'(v_1 - u_1/\sigma _1)\right\| \ge \sigma _1\Vert v_1 - u_1/\sigma _1\Vert , \end{aligned}$$
(34)

and that \(\Vert v_1-u_1/\sigma _1\Vert > 0\), since then the vector \(v_1 - u_1/\sigma _1\) will certify that \(\Vert T'\Vert \ge \sigma _1\).

The map T is a linear projection with \(T^2 = T\). Hence \(\langle u_i, v_i \rangle = \sigma _i^{-1}\) and \(\langle u_i, v_j \rangle = 0\) for all \(i \ne j\).

We show that \(v_1 - \sigma _1^{-1}u_1\) can be decomposed as \(v_1 - \sigma _1 u_1 + (\sigma _1-\sigma _1^{-1}) u_1\) such that \(v_1 - \sigma _1 u_1\in V^\perp \) and \((\sigma _1-\sigma _1^{-1}) u_1\in U^\perp \). Therefore, \(T'(v_1 - \sigma _1^{-1}u_1)=v_1 - \sigma _1 u_1\).

The containment \((\sigma _1-\sigma _1^{-1})u_1\in U^\perp \) is immediate. To show \(v_1 - \sigma _1 u_1\in V^\perp \), we need that \(\langle v_1 - \sigma _1 u_1, v_i \rangle =0\) for all \(i\in [k]\). For \(i\ge 2\), this is true since \(\langle u_1, v_i \rangle = 0\) and \(\langle v_1, v_i \rangle = 0\). For \(i=1\), we have \(\langle v_1-\sigma _1 u_1, v_1 \rangle =0\) since \(\Vert v_1\Vert =1\) and \(\langle u_1, v_1 \rangle =\sigma _1^{-1}\). Consequently, \(T'(v_1 - \sigma _1^{-1}u_1)=v_1 - \sigma _1 u_1\).

We compute \(\left\| v_1 - \sigma _1^{-1} u_1\right\| = \sqrt{1 - \sigma _1^{-2}} > 0\), since \(\sigma _1 > 1\), and \(\Vert v_1 - \sigma _1 u_1\Vert = \sqrt{\sigma _1^2 - 1}\). This verifies (34), and thus \(\Vert T'\Vert \ge \sigma _1 = \Vert T\Vert \). \(\square \)

To prove the lemma, we define \(\mathcal J = (J, I)\), \(U = W_{\mathcal J, 1}^\perp \times W_{\mathcal J, 2}^\perp \) and \(V = W\) and let \(T: \mathbb {R}^n \rightarrow V\) and \(T': \mathbb {R}^n \rightarrow V^\perp \) be as in Claim 5.1. By assumption, \(\{0\} \ne \pi _I(W) \Rightarrow \{0\} \ne V\) and \(\{0\} \ne \pi _J(W^\perp ) = W_{\mathcal J, 1}^\perp \Rightarrow \{0\} \ne U\). Applying Lemma 3.7, UV satisfy the conditions of Claim 5.1 and \(T = \textrm{LLS}^{W,1}_\mathcal{J}\). In particular, \(\Vert T'\Vert =\Vert T\Vert \). Using the fact that \(U^\perp = W_{\mathcal J,1} \times W_{\mathcal J,2}\) and \(V^\perp = W^\perp \), we similarly get that \(T' = \textrm{LLS}^{W^\perp ,1}_\mathcal{{ \bar{J}}}\), where \(\mathcal{{\bar{J}}} = (I,J)\). By (21) we have, for any \(t \in \pi _{\mathbb {R}^n_I}(W)\), that \(Tt = \textrm{LLS}^{W,1}_{\mathcal J}(t) = L_I^W(t_I)\). Thus, \(\Vert T\Vert \ge \Vert L_I^W\Vert \ge 1\).

To finish the proof of the lemma from the claim, we show that \(\Vert T\Vert \le \Vert L^W_I\Vert \). By a symmetric argument we get \(\Vert T'\Vert = \Vert L^{W^\perp }_J\Vert \).

If \(x \in \mathbb {R}^n_J\), then \(Tx \in W\cap \mathbb {R}^n_J\) because any \(s \in W_{\mathcal J, 2}^\perp , t \in \pi _I(W)\) with \(s + t = 0\) must have \(s = t= 0\) since \(W_{\mathcal J, 2}^\perp \) is orthogonal to \(\pi _I(W)\). But \(W \cap \mathbb {R}^n_J\) and \(W_{\mathcal J, 1}^\perp \) are orthogonal, so \(\Vert Tx\Vert \le \Vert x\Vert \) because \(x = Tx + (x - Tx)\) is an orthogonal decomposition.

If \(y \in \mathbb {R}^n_I\), then \(y_J = 0\) and hence \((Ty)_J = (Ty-y)_J\). Since \((Ty-y)_J \in W_{\mathcal J,1}^\perp = \pi _J(W \cap \mathbb {R}^n_J)^\perp \), we see that \(Ty \in (W \cap \mathbb {R}^n_J)^\perp \). As such, for any \(x \in \mathbb {R}^n_J, y \in \mathbb {R}^n_I\), we see that \(x \perp y\) and \(Tx \perp Ty\). For \(x,y \ne 0\), we thus have that

$$\begin{aligned} \frac{\Vert T(x+y)\Vert ^2}{\Vert x+y\Vert ^2} = \frac{\Vert T(x)\Vert ^2+\Vert T(y)\Vert ^2}{\Vert x\Vert ^2+\Vert y\Vert ^2} \le \max \left\{ \frac{\Vert T(x)\Vert ^2}{\Vert x\Vert ^2},\frac{\Vert T(y)\Vert ^2}{\Vert y\Vert ^2}\right\} \\ \le \max \left\{ 1,\frac{\Vert T(y)\Vert ^2}{\Vert y\Vert ^2}\right\} . \end{aligned}$$

Since \(\Vert L_I^W\Vert \ge 1\), we must have that \(\Vert Tt\Vert /\Vert t\Vert \) is maximized by some \(t \in \mathbb {R}^n_I\). From \({\text {Ker}}(T) = U\) it is clear that \(\Vert Tt\Vert /\Vert t\Vert \) is maximized by some \(t \in U^\perp \). Now, \(U^\perp \cap \mathbb {R}^n_I = \pi _{\mathbb {R}^n_I}(W)\), so any t maximizing \(\Vert Tt\Vert /\Vert t\Vert \) satisfies \(Tt = L_I^W(t_I)\). Therefore, \(\Vert L_I^W\Vert \ge \Vert T\Vert \). \(\square \)
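The resulting identity \(\Vert L_I^W\Vert =\Vert L_J^{W^\perp }\Vert \) can also be checked directly on random data. In the sketch below (again only an illustration), the minimum-norm lift is realized through a pseudo-inverse of the restricted orthonormal basis; `lift_norm` is our own helper and not part of the paper's pseudocode.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 8, 3
Qfull, _ = np.linalg.qr(rng.standard_normal((n, d)), mode="complete")
Q, Qp = Qfull[:, :d], Qfull[:, d:]            # orthonormal bases of W and of W^perp

def lift_norm(B, idx):
    # operator norm of the minimum-norm lift  p in pi_idx(span B) -> w in span(B), w_idx = p
    return np.linalg.norm(B @ np.linalg.pinv(B[idx, :]), 2)

I, J = np.arange(0, 4), np.arange(4, n)       # partition [n] = I u J
print(lift_norm(Q, I), lift_norm(Qp, J))      # the two values coincide
```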

Our next goal is to show Lemma 3.10: for a layering with small enough \(\ell ^\delta (\mathcal{J})\), the LLS step approximately satisfies (13), that is, \(\delta \Delta x^\textrm{ll}+ \delta ^{-1} \Delta s^\textrm{ll}\approx -x^{1/2} s^{1/2}\). This also enables us to derive bounds on the norm of the residuals and on the step-length. We start by proving a few auxiliary technical claims. The next simple lemma allows us to take advantage of low lifting scores in the layering.

Lemma 5.2

Let \(u,v\in \mathbb {R}^n\) be two vectors such that \(u-v\in W\). Let \(I\subseteq [n]\), and \(\delta \in \mathbb {R}^n_{++}\). Then there exists a vector \(u' \in W + u\) satisfying \(u'_I=v_I\) and

$$\begin{aligned} \Vert \delta _{[n]\setminus I} (u'_{[n]\setminus I}-u_{[n]\setminus I})\Vert \le \ell ^\delta (I)\Vert \delta _I(u_I-v_I)\Vert \,. \end{aligned}$$

Proof

We let

$$\begin{aligned} u':=u+\delta ^{-1}L^\delta _I(\delta _I(v_I-u_I))\,. \end{aligned}$$

The claim follows by the definition of the lifting score \(\ell ^\delta (I)\). \(\square \)
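To make the construction concrete, here is a small numerical sketch (our own illustration; the minimum-norm lift \(L^\delta _I\) is again realized via a pseudo-inverse) that builds \(u'\) as in the proof and checks both conclusions of Lemma 5.2.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 8, 3
Q, _ = np.linalg.qr(rng.standard_normal((n, d)))        # orthonormal basis of W
delta = np.exp(rng.standard_normal(n))                  # positive scaling vector delta
I, notI = np.arange(0, 4), np.arange(4, n)

DQ = delta[:, None] * Q                                 # basis of Diag(delta) W
lift = DQ @ np.linalg.pinv(DQ[I, :])                    # matrix realizing L^delta_I on R^I
ell = np.linalg.norm(lift[notI, :], 2)                  # lifting score ell^delta(I)

u = rng.standard_normal(n)
v = u - Q @ rng.standard_normal(d)                      # guarantees u - v in W

# u' := u + delta^{-1} L^delta_I(delta_I (v_I - u_I)), as in the proof
u_prime = u + (1.0 / delta) * (lift @ (delta[I] * (v[I] - u[I])))

print(np.allclose(u_prime[I], v[I]))                    # True: u' agrees with v on I
lhs = np.linalg.norm(delta[notI] * (u_prime[notI] - u[notI]))
rhs = ell * np.linalg.norm(delta[I] * (u[I] - v[I]))
print(lhs <= rhs + 1e-9)                                # True: the bound of Lemma 5.2
```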

The next lemma will be the key tool to prove Lemma 3.10. It is helpful to recall the characterization of the LLS step in Sect. 3.4.

Lemma 5.3

Let \(w=(x,y,s)\in \mathcal{N}(\beta )\) for \(\beta \in (0,1/4]\), let \(\mu =\mu (w)\) and \(\delta =\delta (w)\). Let \(\mathcal{J}=(J_1,\ldots ,J_p)\) be a \(\delta (w)\)-balanced layering, and let \(\Delta w^\textrm{ll}= (\Delta x^\textrm{ll}, \Delta y^\textrm{ll}, \Delta s^\textrm{ll})\) denote the corresponding LLS direction. Let \(\Delta x\) and \(\Delta s\) be as in (25) and (26), that is

$$\begin{aligned} \delta \Delta x^\textrm{ll}+ \delta ^{-1} \Delta s +x^{1/2} s^{1/2}=0 \, ,\end{aligned}$$
(35)
$$\begin{aligned} \delta \Delta x + \delta ^{-1} \Delta s^\textrm{ll}+x^{1/2} s^{1/2}=0. \end{aligned}$$
(36)

Then, there exist vectors \(\Delta \bar{x}\in W_{\mathcal{J},1}\times \cdots \times W_{\mathcal{J},p}\) and \(\Delta \bar{s}\in W_{\mathcal{J},1}^\perp \times \cdots \times W_{\mathcal{J},p}^\perp \) such that

$$\begin{aligned} \Vert \delta _{J_k}(\Delta \bar{x}_{J_k} - \Delta x_{J_k}^\textrm{ll})\Vert&\le 2n\ell ^\delta (\mathcal{J})\sqrt{\mu }\quad \forall k\in [p]\, \quad \text{ and }\end{aligned}$$
(37)
$$\begin{aligned} \Vert \delta ^{-1}_{J_k}(\Delta \bar{s}_{J_k} - \Delta s_{J_k}^\textrm{ll})\Vert&\le 2n\ell ^\delta (\mathcal{J}) \sqrt{\mu }\quad \forall k\in [p]\, . \end{aligned}$$
(38)

Proof

Throughout, we use the shorthand notation \(\lambda =\ell ^\delta (\mathcal{J})\). We construct \(\Delta \bar{x}\); the vector \(\Delta \bar{s}\) can be obtained analogously, using that, by Lemma 3.9, the reverse layering has lifting score \(\lambda \) with respect to \({\text {Diag}}(\delta ^{-1})W^\perp \).

We proceed by induction, constructing \(\Delta \bar{x}_{J_k}\in W_{\mathcal{J},k}\) for \(k=p,p-1,\ldots ,1\). This will be given as \(\Delta \bar{x}_{J_k}=\Delta x^{(k)}_{J_k}\) for a vector \(\Delta x^{(k)}\in W\) such that \(\Delta x^{(k)}_{J_{>k}}=0\). We prove the inductive hypothesis

$$\begin{aligned} \left\| \delta _{J_{\le k}}\left( \Delta x^{(k)}_{J_{\le k}}-\Delta x^\textrm{ll}_{J_{\le k}}\right) \right\| \le 2\lambda \sqrt{\mu } \sum _{q=k+1}^p \sqrt{|J_q|}\,. \end{aligned}$$
(39)

Note that (37) follows by restricting the norm on the LHS to \(J_k\) and since the sum on the RHS is \(\le n\).

For \(k=p\), the RHS is 0. We simply set \(\Delta x^{(p)}=\Delta x^\textrm{ll}\), that is, \(\Delta \bar{x}_{J_p}=\Delta x^\textrm{ll}_{J_p}\), trivially satisfying the hypothesis. Consider now \(k<p\), and assume that we have a \(\Delta \bar{x}_{J_{k+1}}=\Delta x^{(k+1)}_{J_{k+1}}\) satisfying (39) for \(k+1\). From (35) and the induction hypothesis, we get that

$$\begin{aligned} \begin{aligned}&\Vert \delta _{J_{k+1}} \Delta \bar{x}_{J_{k+1}} + \delta ^{-1}_{J_{k+1}} \Delta s_{J_{k+1}}\Vert \le \Vert x^{1/2}_{J_{k+1}} s^{1/2}_{J_{k+1}}\Vert +\Vert \delta _{J_{k+1}}(\Delta \bar{x}_{J_{k+1}}- \Delta x_{J_{k+1}}^\textrm{ll})\Vert \\&\le \Vert x^{1/2}_{J_{k+1}} s^{1/2}_{J_{k+1}}\Vert +2\lambda \sqrt{\mu }\sum _{q=k+2}^p \sqrt{|J_q|} \le \sqrt{1+\beta }\sqrt{\mu |J_{k+1}|}+2n\lambda \sqrt{\mu }<2 \sqrt{\mu |J_{k+1}|}\,, \end{aligned} \end{aligned}$$

using also that \(w\in \mathcal{N}(\beta )\), Proposition 3.2, and the assumptions \(\beta \le 1/4\), \(\lambda \le \beta /(32n^2)\). Note that \(\Delta \bar{x}_{J_{k+1}}\in W_{\mathcal{J},k+1}\) and \(\Delta s_{J_{k+1}}\in W^\perp _{\mathcal{J},k+1}\) are orthogonal vectors. The above inequality therefore implies

$$\begin{aligned} \Vert \delta _{J_{k+1}} \Delta \bar{x}_{J_{k+1}} \Vert \le 2\sqrt{\mu |J_{k+1}|}\,. \end{aligned}$$

Let us now use Lemma 5.2 to obtain \(\Delta x^{(k)}\) for \(u= \Delta x^{(k+1)}\), \(v=0\), and \(I=J_{>k}\). That is, we get \(\Delta x^{(k)}_{J_{>k}}=0\), \(\Delta x^{(k)}\in W\), and

$$\begin{aligned} \begin{aligned} \Vert \delta _{J_{\le k}}( \Delta x^{(k)}_{J_{\le k}}-\Delta x^{(k+1)}_{J_{\le k}})\Vert&\le \lambda \Vert \delta _{J_{> k}} \Delta x^{(k+1)}_{J_{> k}}\Vert \\&=\lambda \Vert \delta _{J_{k+1}} \Delta \bar{x}_{J_{k+1}}\Vert \le 2\lambda \sqrt{\mu |J_{k+1}|}\,. \end{aligned} \end{aligned}$$

By the triangle inequality and the induction hypothesis (39) for \(k+1\),

$$\begin{aligned} \Vert \delta _{J_{\le k}}(\Delta x^{(k)}_{J_{\le k}}- \Delta x^{\textrm{ll}}_{J_{\le k}})\Vert&\le \Vert \delta _{J_{\le k}}(\Delta x^{(k)}_{J_{\le k}} - \Delta x^{(k+1)}_{J_{\le k}})\Vert + \Vert \delta _{J_{\le k}}(\Delta x^{(k+1)}_{J_{\le k}} - \Delta x^{\textrm{ll}}_{J_{\le k}})\Vert \\&\le 2\lambda \sqrt{\mu |J_{k+1}|} + 2 \lambda \sum _{q=k+2}^p \sqrt{\mu |J_q|} , \end{aligned}$$

yielding the induction hypothesis for k. \(\square \)

Lemma 3.10

(Restatement). Let \(w=(x,y,s)\in \mathcal{N}(\beta )\) for \(\beta \in (0,1/4]\), let \(\mu =\mu (w)\) and \(\delta =\delta (w)\). Let \(\mathcal{J}=(J_1,\ldots ,J_p)\) be a layering with \(\ell ^\delta (\mathcal{J})\le \beta /(32 n^2)\), and let \(\Delta w^\textrm{ll}= (\Delta x^\textrm{ll}, \Delta y^\textrm{ll}, \Delta s^\textrm{ll})\) denote the LLS direction for the layering \(\mathcal{J}\). Let furthermore \(\epsilon ^\textrm{ll}(w)=\max _{i\in [n]}\min \{| Rx _i^\textrm{ll}|,| Rs _i^\textrm{ll}|\}\), and define the maximal step length as

$$\begin{aligned} \alpha ^*&:=\sup \{\alpha ' \in [0,1] : \forall \bar{\alpha }\in [0,\alpha ']: w + \bar{\alpha }\Delta w^\textrm{ll}\in \mathcal {N}(2\beta )\}\, . \end{aligned}$$

Then the following properties hold.

  1. (i)

    We have

    $$\begin{aligned} \Vert \delta _{J_k} \Delta x^\textrm{ll}_{J_k} + \delta ^{-1}_{J_k} \Delta s^\textrm{ll}_{J_k} +x^{1/2}_{J_k} s^{1/2}_{J_k}\Vert&\le 6n\ell ^\delta (\mathcal{J})\sqrt{\mu }\, , \quad \forall k\in [p], \text{ and } \end{aligned}$$
    (27)
    $$\begin{aligned} \Vert \delta \Delta x^\textrm{ll}+ \delta ^{-1} \Delta s^\textrm{ll}+x^{1/2} s^{1/2}\Vert&\le 6n^{3/2}\ell ^\delta (\mathcal{J})\sqrt{\mu }\, . \end{aligned}$$
    (28)
  2. (ii)

    For the affine scaling direction \(\Delta w^\textrm{a}=(\Delta x^\textrm{a},\Delta y^\textrm{a},\Delta s^\textrm{a})\),

    $$\begin{aligned} \Vert Rx ^\textrm{ll}- Rx ^\textrm{a}\Vert , \Vert Rs ^\textrm{ll}- Rs ^\textrm{a}\Vert \le 6n^{3/2}\ell ^\delta (\mathcal{J})\,. \end{aligned}$$
  3. (iii)

    For the residuals of the LLS steps we have \(\Vert Rx ^\textrm{ll}\Vert ,\Vert Rs ^\textrm{ll}\Vert \le \sqrt{2n}\). For each \(i \in [n]\), \(\max \{| Rx ^\textrm{ll}_i|,| Rs ^\textrm{ll}_i|\}\ge \frac{1}{2}-\frac{3}{4} \beta \).

  4. (iv)

    We have

    $$\begin{aligned} \alpha ^*\ge 1-\frac{3\sqrt{n}\epsilon ^\textrm{ll}(w)}{\beta }\,, \end{aligned}$$
    (29)

    and for any \(\alpha \in [0,1]\)

    $$\begin{aligned} \mu (w + \alpha \Delta w^\textrm{ll}) = (1-\alpha )\mu , \end{aligned}$$
  5. (v)

    We have \(\epsilon ^\textrm{ll}(w)=0\) if and only if \(\alpha ^*=1\). These are further equivalent to \(w+ \Delta w^\textrm{ll}=(x+\Delta x^\textrm{ll}, y+\Delta y^\textrm{ll},s+ \Delta s^\textrm{ll})\) being an optimal solution to (LP).

Proof

Again, we use \(\lambda =\ell ^\delta (\mathcal{J})\).

Part (i). Clearly, (27) implies (28). To show (27), we use Lemma 5.3 to obtain \(\Delta \bar{x}\) and \(\Delta \bar{s}\) as in (37) and (38). We will also use \(\Delta x\) and \(\Delta s\) as in (35) and (36).

Select any layer \(k\in [p]\). From (35), we get that

$$\begin{aligned} \Vert \delta _{J_k} \Delta \bar{x}_{J_k} + \delta ^{-1}_{J_k} \Delta s_{J_k} +x^{1/2}_{J_k} s^{1/2}_{J_k}\Vert =\Vert \delta _{J_{k}}(\Delta \bar{x}_{J_k}- \Delta x_{J_k}^\textrm{ll})\Vert \le 2n\lambda \sqrt{\mu }\,. \end{aligned}$$
(40)

Similarly, from (36), we see that

$$\begin{aligned} \Vert \delta ^{-1}_{J_k} \Delta \bar{s}_{J_k} + \delta _{J_k} \Delta x_{J_k} +x^{1/2}_{J_k} s^{1/2}_{J_k}\Vert =\Vert \delta ^{-1}_{J_{k}}(\Delta \bar{s}_{J_k}- \Delta s_{J_k}^\textrm{ll})\Vert \le 2n\lambda \sqrt{\mu }\,. \end{aligned}$$

From the above inequalities, we see that

$$\begin{aligned} \Vert \delta _{J_k} (\Delta \bar{x}_{J_k} -\Delta x_{J_k})+ \delta ^{-1}_{J_k} (\Delta s_{J_k}-\Delta \bar{s}_{J_k})\Vert \le 4 n\lambda \sqrt{\mu }\,. \end{aligned}$$

Since \(\delta _{J_k} (\Delta \bar{x}_{J_k} -\Delta x_{J_k})\) and \(\delta ^{-1}_{J_k} (\Delta s_{J_k}-\Delta \bar{s}_{J_k})\) are orthogonal vectors, we have

$$\begin{aligned} \Vert \delta _{J_k} (\Delta \bar{x}_{J_k} -\Delta x_{J_k})\Vert ,\, \Vert \delta ^{-1}_{J_k} (\Delta s_{J_k}-\Delta \bar{s}_{J_k})\Vert \le 4n\lambda \sqrt{\mu }\,. \end{aligned}$$

Together with (37), this yields \(\Vert \delta _{J_k} (\Delta x^\textrm{ll}_{J_k} -\Delta x_{J_k})\Vert \le 6n\lambda \sqrt{\mu }\). Combined with (26), we get

$$\begin{aligned} \Vert \delta _{J_k} \Delta x^\textrm{ll}_{J_k} + \delta ^{-1}_{J_k} \Delta s^\textrm{ll}_{J_k} +x^{1/2}_{J_k} s^{1/2}_{J_k}\Vert = \Vert \delta _{J_k} (\Delta x^\textrm{ll}_{J_k} -\Delta x_{J_k})\Vert \le 6n \lambda \sqrt{\mu }\,, \end{aligned}$$

thus, (27) follows.

Part (ii). Recall from Lemma 3.5(i) that \(\sqrt{\mu } Rx ^\textrm{a}+\sqrt{\mu } Rs ^\textrm{a}={x^{1/2}s^{1/2}}\). From part (i), we can similarly see that

$$\begin{aligned} \Vert \sqrt{\mu } Rx ^\textrm{ll}+\sqrt{\mu } Rs ^\textrm{ll}-{x^{1/2}s^{1/2}}\Vert \le 6n^{3/2}\lambda \sqrt{\mu }\,. \end{aligned}$$

From these, we get

$$\begin{aligned} \Vert ( Rx ^\textrm{ll}- Rx ^\textrm{a})+ ( Rs ^\textrm{ll}- Rs ^\textrm{a})\Vert \le 6n^{3/2}\lambda \,. \end{aligned}$$

The claim follows since \( Rx ^\textrm{ll}- Rx ^\textrm{a}\in {\text {Diag}}(\delta ) W\) and \( Rs ^\textrm{ll}- Rs ^\textrm{a}\in {\text {Diag}}(\delta ^{-1}) W^\perp \) are orthogonal vectors.

Part (iii). Both bounds follow from the previous part and Lemma 3.5(iii), using the assumption \(\ell ^\delta (\mathcal{J})\le \beta /(32n^2)\).

Part (iv). Let \(w^+=w+\alpha \Delta w^\textrm{ll}\). We need to find the largest value \(\alpha >0\) such that \(w^+\in \mathcal{N}(2\beta )\). We first show that the normalized duality gap satisfies \(\mu (w^+) = (1-\alpha )\mu \) for any \(\alpha \in \mathbb {R}\). For this purpose, we use the decomposition:

$$\begin{aligned} (x + \alpha \Delta x^{\textrm{ll}})(s + \alpha \Delta s^{\textrm{ll}}) = (1-\alpha ) xs + \alpha (x + \Delta x^{\textrm{ll}})(s+ \Delta s^{\textrm{ll}}) - \alpha (1-\alpha ) \Delta x^{\textrm{ll}} \Delta s^{\textrm{ll}}. \nonumber \\ \end{aligned}$$
(41)
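Identity (41) is elementary: expanding the right-hand side and collecting terms gives

$$\begin{aligned} (1-\alpha ) xs + \alpha (x + \Delta x^{\textrm{ll}})(s+ \Delta s^{\textrm{ll}}) - \alpha (1-\alpha ) \Delta x^{\textrm{ll}} \Delta s^{\textrm{ll}} &= xs + \alpha \big (x\Delta s^{\textrm{ll}}+s\Delta x^{\textrm{ll}}\big ) + \alpha ^2 \Delta x^{\textrm{ll}}\Delta s^{\textrm{ll}} \\ &= (x + \alpha \Delta x^{\textrm{ll}})(s + \alpha \Delta s^{\textrm{ll}}), \end{aligned}$$

where all products are taken coordinate-wise.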

Recall from Part (i) that there exist \(\Delta s\) and \(\Delta x\) as in (35) and (36) such that \(\delta \Delta x^{\textrm{ll}} + \delta ^{-1} \Delta s = - \delta x\) and \(\delta \Delta x + \delta ^{-1} \Delta s^{\textrm{ll}} = -\delta ^{-1} s\). In particular, \(x + \Delta x^{\textrm{ll}} = -\delta ^{-2} \Delta s\) and \(s + \Delta s^{\textrm{ll}} = -\delta ^2 \Delta x\). Noting that \(\Delta x^{\textrm{ll}} \perp \Delta s^\textrm{ll}\) and \(\Delta x \perp \Delta s\), taking the average of the coordinates on both sides of (41), we get that

$$\begin{aligned} \mu (w + \alpha \Delta w^\textrm{ll})&= (1-\alpha ) \mu (w) + \alpha \langle x + \Delta x^{\textrm{ll}}, s + \Delta s^\textrm{ll}\rangle /n - \alpha (1-\alpha ) \langle \Delta x^{\textrm{ll}}, \Delta s^\textrm{ll}\rangle /n \nonumber \\&= (1-\alpha ) \mu (w) + \alpha \langle \delta ^{-2} \Delta s, \delta ^2 \Delta x \rangle /n \nonumber \\&= (1-\alpha ) \mu (w), \end{aligned}$$
(42)

as needed.

Let \(\epsilon := \epsilon ^{\textrm{ll}}(w)\). To obtain the desired lower bound on the step-length, given (42) it suffices to show that, for all \(0 \le \alpha < 1-\frac{3 \sqrt{n} \epsilon }{\beta }\),

$$\begin{aligned} \left\| \frac{(x+\alpha \Delta x^\textrm{ll})(s+\alpha \Delta s^\textrm{ll})}{(1-\alpha )\mu }-e\right\| \le 2\beta \,. \end{aligned}$$
(43)

We will need a bound on the product of the LLS residuals:

$$\begin{aligned} \begin{aligned} \left\| Rx ^\textrm{ll} Rs ^\textrm{ll}-\frac{1}{\mu }\Delta x^\textrm{ll}\Delta s^\textrm{ll}\right\|&=\left\| \frac{x^{1/2}s^{1/2}}{\sqrt{\mu }}\cdot \frac{\delta \Delta x^\textrm{ll}+\delta ^{-1}\Delta s^\textrm{ll}+ x^{1/2}s^{1/2}}{\sqrt{\mu }}\right\| \\&\le 6(1+2\beta )n^{3/2}\lambda \le \frac{\beta }{4}\,, \end{aligned} \end{aligned}$$
(44)

using Proposition 3.1, part (i), and the assumptions \(\lambda \le \beta /(32n^2)\), \(\beta \le 1/4\). Another useful bound will be

$$\begin{aligned} \begin{aligned} \Vert Rx ^\textrm{ll} Rs ^\textrm{ll}\Vert ^2&= \sum _{i \in [n]} \left| Rx ^\textrm{ll}_i\right| ^2\left| Rs ^\textrm{ll}_i\right| ^2 \le \epsilon ^2 \sum _{i \in [n]} \max \Big \{\left| Rx ^\textrm{ll}_i\right| ^2,\left| Rs ^\textrm{ll}_i\right| ^2\Big \} \\&\le \epsilon ^2(\Vert Rx ^\textrm{ll}\Vert ^2 + \Vert Rs ^\textrm{ll}\Vert ^2) \le 2n \epsilon ^2\,. \end{aligned} \end{aligned}$$
(45)

The last inequality uses part (iii). With (41) we are ready to get the bound in (43), as

$$\begin{aligned} \Big \Vert \frac{(x + \alpha \Delta x^\textrm{ll})(s + \alpha \Delta s^\textrm{ll})}{(1-\alpha )\mu } - e\Big \Vert&\le \beta + \Big \Vert \frac{\alpha }{(1-\alpha )\mu }(x+\Delta x^\textrm{ll})(s + \Delta s^\textrm{ll}) - \frac{\alpha }{\mu } \Delta x^\textrm{ll}\Delta s^\textrm{ll}\Big \Vert \\&= \beta + \Big \Vert \Big (\frac{\alpha }{1 - \alpha } - \alpha \Big ) Rx ^\textrm{ll} Rs ^\textrm{ll}+ \alpha \Big ( Rx ^\textrm{ll} Rs ^\textrm{ll}- \frac{1}{\mu }\Delta x^\textrm{ll}\Delta s^\textrm{ll}\Big )\Big \Vert \, \\&\le \beta + \frac{\alpha ^2}{1 - \alpha }\Vert Rx ^\textrm{ll} Rs ^\textrm{ll}\Vert + \alpha \Big \Vert Rx ^\textrm{ll} Rs ^\textrm{ll}- \frac{1}{\mu }\Delta x^\textrm{ll}\Delta s^\textrm{ll}\Big \Vert \\&\le \beta + \frac{\sqrt{2n}\epsilon }{1 - \alpha } + \frac{\beta }{4} \le \frac{5}{4}\beta + \frac{\sqrt{2n}\epsilon }{1 - \alpha }\, . \end{aligned}$$

This value is \(\le 2\beta \) whenever \({2\sqrt{n}\epsilon }/({1 - \alpha })\le (3/4) \beta \Leftarrow \alpha < 1 - \frac{3 \sqrt{n} \epsilon }{\beta }\), as needed.

Part (v). From part (iv), it is immediate that \(\epsilon ^\textrm{ll}(w)=0\) implies \(\alpha ^*=1\). If \(\alpha ^*=1\), we have that \(w+\Delta w^\textrm{ll}\) is the limit of (strictly) feasible solutions to (LP) and thus is also a feasible solution. Optimality of \(w+ \Delta w^\textrm{ll}\) now follows from part (iv), since \(\alpha ^*=1\) implies \(\mu (w+\Delta w^\textrm{ll})=0\). The remaining implication is that if \(w+\Delta w^\textrm{ll}\) is optimal, then \(\epsilon ^\textrm{ll}(w)=0\). Recall that \( Rx _i^\textrm{ll}=\delta _i(x_i+\Delta x_i^\textrm{ll})/\sqrt{\mu }\) and \( Rs _i^\textrm{ll}=\delta ^{-1}_i(s_i+\Delta s_i^\textrm{ll})/\sqrt{\mu }\). The optimality of \(w+\Delta w^\textrm{ll}\) means that for each \(i\in [n]\), either \(x_i+\Delta x_i^\textrm{ll}=0\) or \(s_i+\Delta s_i^\textrm{ll}=0\). Therefore, \(\epsilon ^\textrm{ll}(w)=0\). \(\square \)

6 Proofs of the main lemmas for the potential analysis

Lemma 4.2

Let \(w = (x,y,s) \in \mathcal N(\beta )\) for \(\beta \in (0,1/8]\) and let \(w^* = (x^*, y^*, s^*)\) be the optimal solution corresponding to \(\mu ^* = 0\) on the central path. Let further \(\mathcal{J}=(J_1, \ldots , J_p)\) be a \(\delta (w)\)-balanced layering (Definition 3.13), and let \(\Delta w^\textrm{ll}=(\Delta x^\textrm{ll}, \Delta y^\textrm{ll}, \Delta s^\textrm{ll})\) be the corresponding LLS direction. Then the following statement holds for every \(q \in [p]\):

  1. (i)

    There exists \(i \in J_q\) such that

    $$\begin{aligned} x_i^* \ge \frac{2x_i}{3\sqrt{n}}\cdot (\Vert Rx _{J_q}^\textrm{ll}\Vert - 2\gamma n)\, . \end{aligned}$$
    (32)
  2. (ii)

    There exists \(j \in J_q\) such that

    $$\begin{aligned} {s_j^*}\ge \frac{2s_j}{3\sqrt{n}} \cdot (\Vert Rs _{J_q}^\textrm{ll}\Vert - 2\gamma n)\, . \end{aligned}$$
    (33)

Proof of Lemma 4.2

We prove part (i); part (ii) follows analogously using Lemma 3.9. Let z be a vector fulfilling the statement of Lemma 5.2 for \(u=x^*\), \(v=x+\Delta x^\textrm{ll}\), and \(I=J_{>q}\). Then \(z \in W + x\), \(z_{J_{>q}}=x_{J_{>q}}+\Delta x_{J_{>q}}^\textrm{ll}\) and by \(\ell ^\delta (\mathcal J) \le \gamma \)

$$\begin{aligned} \left\| \delta _{J_{\le q}} (x^*_{J_{\le q}}-z_{J_{\le q}})\right\| \le \gamma \left\| {\delta _{J_{>q}} \big (x^*_{J_{>q}}-(x_{J_{>q}}+\Delta x^\textrm{ll}_{J_{>q}})\big )}\right\| . \end{aligned}$$

Restricting to the components in \(J_q\), and dividing by \(\sqrt{\mu }\), we get

$$\begin{aligned}{} & {} \left\| \frac{\delta _{J_q}(x^*_{J_q}-z_{J_q})}{\sqrt{\mu }}\right\| \le \gamma \left\| \frac{\delta _{J_{>q}}\big (x^*_{J_{>q}}-(x_{J_{>q}}+\Delta x^\textrm{ll}_{J_{>q}})\big )}{\sqrt{\mu }}\right\| \nonumber \\{} & {} \le \gamma \left\| \frac{\delta _{J_{>q}}x^*_{J_{>q}}}{\sqrt{\mu }}\right\| +\gamma \Vert Rx ^\textrm{ll}_{J_{>q}}\Vert \,. \end{aligned}$$
(46)

Since \(w\in \mathcal{N}(\beta )\), from Proposition 3.1 and (16) we see that for \(i \in [n]\)

$$\begin{aligned} \frac{\delta _i}{\sqrt{\mu }}\le \frac{1}{\sqrt{1-2\beta }}\cdot \frac{\delta _i(w(\mu ))}{\sqrt{\mu }}=\frac{1}{\sqrt{1-2\beta }}\cdot \frac{1}{x_i(\mu )}\,, \end{aligned}$$

and therefore

$$\begin{aligned} \left\| \frac{\delta _{J_{>q}}x^*_{J_{>q}}}{\sqrt{\mu }}\right\| \le \frac{1}{\sqrt{1-2\beta }} \left\| {x(\mu )^{-1}_{J_{>q}}x^*_{J_{>q}}}\right\| \, \le \frac{1}{\sqrt{1-2\beta }}\cdot \left\| {x(\mu )^{-1}_{J_{>q}}x^*_{J_{>q}}}\right\| _1\le \frac{n}{\sqrt{1-2\beta }}, \end{aligned}$$

where the last inequality follows by Lemma 3.3.

Using the above bounds with (46), along with \(\Vert Rx ^\textrm{ll}_{J_{\ge q}}\Vert \le \Vert Rx ^\textrm{ll}\Vert \le \sqrt{2n}\) from Lemma 3.10(iii), we get

$$\begin{aligned} \left\| \frac{\delta _{J_q} z_{J_q}}{\sqrt{\mu }}\right\| \le \left\| \frac{\delta _{J_q} x^*_{J_q}}{\sqrt{\mu }}\right\| + \frac{\gamma n}{\sqrt{1-2\beta }}+\gamma \sqrt{2n}\le \left\| \frac{\delta _{J_q} x^*_{J_q}}{\sqrt{\mu }}\right\| +2\gamma n \,, \end{aligned}$$

using that \(\beta \le 1/8\) and \(n\ge 3\). Note that z is a feasible solution to the least-squares problem which is optimally solved by \(x_{J_q}^\textrm{ll}\) for layer \(J_q\) and so

$$\begin{aligned} \Vert R x_{J_q}^\textrm{ll}\Vert \le \left\| \frac{\delta _{J_q} z_{J_q}}{\sqrt{\mu }}\right\| \,. \end{aligned}$$

It follows that

$$\begin{aligned} \left\| \frac{\delta _{J_q} x^*_{J_q}}{\sqrt{\mu }}\right\| \ge \Vert R x_{J_q}^\textrm{ll}\Vert -2\gamma n\,. \end{aligned}$$

Let us pick \(i={{\,\mathrm{arg\,max}\,}}_{t\in J_q}|\delta _t x^*_t|\). Using Proposition 3.2,

$$\begin{aligned} \frac{x^*_i}{x_i}\ge \frac{1}{1+\beta } \cdot \frac{\delta _i x^*_i}{\sqrt{\mu }}\ge \frac{\Vert R x_{J_q}^\textrm{ll}\Vert -2\gamma n}{(1+\beta )\sqrt{n}}\ge \frac{2}{3\sqrt{n}}\cdot (\Vert Rx _{J_q}^\textrm{ll}\Vert - 2\gamma n)\,, \end{aligned}$$

completing the proof. \(\square \)

Lemma 4.3

(Restatement). Let \(w=(x,y,s)\in \mathcal{N}(2\beta )\) for \(\beta \in (0,1/8]\), let \(\mu =\mu (w)\) and \(\delta =\delta (w)\). Let \(i,j\in [n]\) and \(2 \le \tau \le n\) such that for the optimal solution \(w^*=(x^*,y^*,s^*)\), we have \(x_i^*\ge \beta x_i/(2^{10}n^{5.5})\) and \(s_j^*\ge \beta s_j/(2^{10}n^{5.5})\), and assume \(\rho ^\mu (i,j)\ge -\tau \). After \(O(\beta ^{-1}\sqrt{n}\tau \log (\bar{\chi }^*+n))\) further iterations the duality gap \(\mu '\) fulfills \(\Psi ^{\mu '}(i,j)\ge 2\tau \), and for every \(\ell \in [n]\setminus \{i,j\}\), either \(\Psi ^{\mu '}(i,\ell )\ge 2\tau \), or \(\Psi ^{\mu '}(\ell ,j)\ge 2\tau \).

Proof of Lemma 4.3

Let us select a value \(\mu '\) such that

$$\begin{aligned} \log \mu - \log \mu '\ge 5\tau \log \left( \frac{4n\kappa ^*}{\gamma }\right) +31\log n+44-4\log \beta \,. \end{aligned}$$

The normalized duality gap decreases to such a value within \(O(\beta ^{-1}\sqrt{n}\tau \cdot \log (\bar{\chi }^* + n))\) iterations, recalling that \(\log (\bar{\chi }^* + n) = \Theta (\log (\kappa ^* + n))\). The step-lengths for the affine scaling and LLS steps are stated in Proposition 3.4 and Lemma 3.10(iv). Whenever the algorithm chooses an LLS step, \(\epsilon ^\textrm{a}(w) < 10n^{3/2}\gamma \). Thus, the progress in \(\mu \) is at least as good as (and in fact much better than) the \(1-\beta /\sqrt{n}\) guarantee for the affine scaling step in Proposition 3.4.

Let \(w'=(x',y',s')\) be the central path element corresponding to \(\mu '\), and let \(\delta '=\delta (w')\). From now on we use the shorthand notation

$$\begin{aligned} \Gamma := \log \left( \frac{4n\kappa ^*}{\gamma }\right) \,. \end{aligned}$$

We first show that

$$\begin{aligned} \Gamma \rho ^{\mu '}(i,j)\ge 4\Gamma \tau +18\log n+ 22 \log 2 - 2 \log \beta \end{aligned}$$
(47)

for \(\mu '\), and therefore, \(\Gamma \Psi ^{\mu '}(i,j)\ge \min (2\Gamma n, 4\Gamma \tau +18\log n+ 22 \log 2 - 2 \log \beta ) \ge 2\Gamma \tau \) as \(\tau \le n\). Recalling the definition \(\kappa _{ij}^\delta =\kappa _{ij}\delta _j/\delta _i\), we see that according to Proposition 3.2,

$$\begin{aligned} \kappa _{ij}^\delta \le \frac{\kappa _{ij}}{(1-\beta )^2}\cdot \frac{x_is_j}{\mu }, \quad \text{ and }\quad \kappa _{ij}^{\delta '}={\kappa _{ij}}\cdot \frac{x'_is'_j}{\mu '}\,. \end{aligned}$$

Thus,

$$\begin{aligned} \Gamma \rho ^{\mu '}(i,j)&\ge \Gamma \rho ^\mu (i,j)+ {\log \mu -\log \mu ' +2\log (1-\beta ) + \log x_i'-\log x_i+\log s'_j-\log s_j}\\&\ge \Gamma \rho ^\mu (i,j)+ 5\Gamma \tau + 31\log n + 44 -4\log \beta +2\log (1-\beta ) + \log x_i'-\log x_i\\&\quad +\log s'_j-\log s_j. \end{aligned}$$

Using the near-monotonicity of the central path (Lemma 3.3), we have \(x_i'\ge x^*_i/n\) and \(s_j'\ge s^*_j/n\). Together with our assumptions \(x_i^*\ge \beta x_i/(2^{10}n^{5.5})\) and \(s_j^*\ge \beta s_j/(2^{10}n^{5.5})\), we see that

$$\begin{aligned} \log x_i'-\log x_i+\log s'_j-\log s_j\ge -13\log n-20\log 2+2\log \beta \,. \end{aligned}$$

Using the assumption \(\rho ^\mu (i,j)\ge -\tau \) of the lemma, we can establish (47), using also that \(\beta \le 1/8\).

Next, consider any \(\ell \in [n]\setminus \{i,j\}\). From the triangle inequality Lemma 2.15(ii) it follows that \( \kappa _{ij}^{\delta '} \le \kappa _{i\ell }^{\delta '} \cdot \kappa _{\ell j}^{\delta '}\,, \) which gives \(\rho ^{\mu '}(i,\ell ) + \rho ^{\mu '}(\ell ,j) \ge \rho ^{\mu '}(i,j).\) We therefore get

$$\begin{aligned}\max \{\Gamma \rho ^{\mu '}(i,\ell ),\Gamma \rho ^{\mu '}(\ell ,j) \} \ge \frac{1}{2} \Gamma \rho ^{\mu '}(i,j) {\mathop {\ge }\limits ^{(47)}} 2\Gamma \tau +9\log n+11\log 2-\log \beta .\end{aligned}$$

We next show that if \(\Gamma \rho ^{\mu '}(i,\ell )\ge 2\Gamma \tau +9\log n+11\log 2-\log \beta \), then \(\Psi ^{\mu '}(i,\ell )\ge 2\tau \). The case \(\Gamma \rho ^{\mu '}(\ell ,j)\ge 2\Gamma \tau +9\log n+11\log 2-\log \beta \) follows analogously.

Consider any \(0<\bar{\mu }<\mu '\) with the corresponding central path point \(\bar{w}=(\bar{x},\bar{y},\bar{s})\). The proof is complete by showing \(\Gamma \rho ^{\bar{\mu }}(i,\ell )\ge \Gamma \rho ^{\mu '}(i,\ell )-9\log n-11\log 2+\log \beta \). Recall that for central path elements, we have \(\kappa ^{\delta '}_{ij}=\kappa _{ij}x'_i/x'_j\), and \(\kappa ^{\bar{\delta }}_{ij}=\kappa _{ij}\bar{x}_i/\bar{x}_j\). Therefore

$$\begin{aligned} \Gamma \rho ^{\bar{\mu }}(i,\ell )=\Gamma \rho ^{\mu '}(i,\ell )+ {\log \bar{x}_i-\log x_i'-\log \bar{x}_\ell +\log x_\ell '}\,. \end{aligned}$$

Using Proposition 3.1, Lemma 3.3 and the assumption \(x^*_i\ge \beta x_i/(2^{10}n^{5.5})\), we have \(\bar{x}_\ell \le nx_\ell '\) and

$$\begin{aligned}\bar{x}_i\ge \frac{x_i^*}{n}\ge \frac{\beta x_i}{2^{10}n^{6.5}}\ge \frac{\beta (1-\beta ) x'_i}{2^{10}n^{7.5}} \ge \frac{\beta x'_i}{2^{11}n^{7.5}}\,. \end{aligned}$$

Using these bounds, we get

$$\begin{aligned} \Gamma \rho ^{\bar{\mu }}(i,\ell )&\ge \Gamma \rho ^{\mu '}(i,\ell ) - {9\log n - 11\log 2+\log \beta }, \end{aligned}$$

completing the proof. \(\square \)

It remains to prove Lemma 4.4 and Lemma 4.5, addressing the more difficult case \(\xi _\mathcal{J}^\textrm{ll}< 4\gamma n\). It is useful to decompose the variables into two sets. We let

$$\begin{aligned} {\varvec{B}}:= \{t \in [n]: | Rs _t^\textrm{ll}|< 4\gamma n\},\quad \text{ and }\quad {\varvec{N}}:=\{t \in [n]: | Rx _t^\textrm{ll}| < 4\gamma n\}\,. \end{aligned}$$
(48)

The assumption \(\xi _\mathcal{J}^\textrm{ll}< 4\gamma n\) implies that for every layer \(J_k\), either \(J_k\subseteq {\varvec{B}}\) or \(J_k\subseteq {\varvec{N}}\). The next two lemmas describe the relations between \(\delta \) and \(\delta ^+\).

Lemma 6.1

Let \(w\in \mathcal{N}(\beta )\) for \(\beta \in (0,1/8]\), and assume \(\ell ^\delta (\mathcal{J})\le \gamma \) and \(\epsilon ^\textrm{ll}(w) < 4\gamma n\). For the next iterate \(w^+ = (x^+, y^+, s^+) \in \overline{\mathcal {N}}(2\beta )\), we have

  1. (i)

    For \(i \in {\varvec{B}}\),

    $$\begin{aligned} \frac{1}{2} \cdot \sqrt{\frac{\mu ^+}{\mu }} \le \frac{\delta ^+_i}{\delta _i}\le 2 \cdot \sqrt{\frac{\mu ^+}{\mu }}\,\quad \text{ and }\quad \delta _i^{-1}s_i^+\le \frac{3\mu ^+}{\sqrt{\mu }}\,. \end{aligned}$$
  2. (ii)

    For \(i \in {\varvec{N}}\),

    $$\begin{aligned} \frac{1}{2}\cdot \sqrt{\frac{\mu }{\mu ^+}} \le \frac{\delta ^+_i}{\delta _i}\le 2 \cdot \sqrt{\frac{\mu }{\mu ^+}}\, \quad \text{ and }\quad \delta _ix_i^+\le \frac{3\mu ^+}{\sqrt{\mu }}\,. \end{aligned}$$
  3. (iii)

    If \(i,j \in {\varvec{B}}\) or \(i,j\in {\varvec{N}}\), then

    $$\begin{aligned} \frac{1}{4} \le \frac{\kappa _{ij}^{\delta }}{\kappa _{ij}^{\delta ^+}}=\frac{\delta ^+_i \delta _j}{\delta _i \delta ^+_j} \le 4\, . \end{aligned}$$
  4. (iv)

    If \(i\in {\varvec{N}}\) and \(j\in {\varvec{B}}\), then

    $$\begin{aligned} \frac{\kappa _{ij}^{\delta }}{\kappa _{ij}^{\delta ^+}} \ge 4n^{3.5}\,. \end{aligned}$$

Proof

Part (i). By Lemma 3.10(i), we see that

$$\begin{aligned} \Vert \delta _B \Delta x^{\textrm{ll}}_B\Vert _\infty&\le \Vert \delta _B \Delta x^{\textrm{ll}}_B + \delta _B^{-1} \Delta s^{\textrm{ll}}_B + x^{1/2}_B s^{1/2}_B\Vert _\infty + \Vert \delta ^{-1}_B(\Delta s^\textrm{ll}_B + s_B)\Vert _\infty \\&=\Vert \delta _B \Delta x^{\textrm{ll}}_B + \delta _B^{-1} \Delta s^{\textrm{ll}}_B + x^{1/2}_B s^{1/2}_B\Vert _\infty + \sqrt{\mu }\Vert Rs ^\textrm{ll}_B\Vert _\infty \\&\le \sqrt{\mu }\left( 6 n \ell ^\delta (\mathcal{J}) +4n\gamma \right) \le 10n\gamma \sqrt{\mu }\le \sqrt{\mu }/64\, , \end{aligned}$$

by the assumption on \(\ell ^\delta (\mathcal{J}) \) and the definition of \({\varvec{B}}\).

By construction of the LLS step, \(|x_i^+-x_i|=\alpha ^+|\Delta x_i^\textrm{ll}|\le |\Delta x_i^\textrm{ll}|\), recalling that \(0 \le \alpha ^+ \le 1\). Using the bound derived above, for \(i\in {\varvec{B}}\) we get

$$\begin{aligned} \left| \frac{x_i^+}{x_i}-1\right| \le \left| \frac{\Delta x_i^\textrm{ll}}{x_i}\right| = \frac{|\delta _i \Delta x_i^\textrm{ll}|}{\delta _i x_i}\le \frac{\sqrt{\mu }}{64\delta _ix_i}\le \frac{1}{32}\,, \end{aligned}$$

where the last inequality follows from Proposition 3.2. As

$$\begin{aligned} \frac{\delta ^+_i}{\delta _i}=\sqrt{\frac{x^+_is^+_i}{x_is_i}}\cdot \frac{x_i}{x^+_i} \quad \text{ and } \quad \frac{1 - 2\beta }{1 + \beta } \frac{\sqrt{\mu ^+}}{\sqrt{\mu }} \le \sqrt{\frac{x^+_is^+_i}{x_is_i}} \le \frac{1 + 2\beta }{1 - \beta } \frac{\sqrt{\mu ^+}}{\sqrt{\mu }} \end{aligned}$$

by Proposition 3.2 the claimed bounds follow with \(\beta \le 1/8\).

To get the upper bound on \(\delta ^{-1}_is_i^+\), again with Proposition 3.2

$$\begin{aligned} \delta ^{-1}_is_i^+=\frac{\delta ^+_i}{\delta _i\delta ^+_i} s_i^+=\frac{\delta ^+_i}{\delta _i} \cdot \sqrt{x_i^+s_i^+} \le 2 \sqrt{\frac{\mu ^+}{\mu }} \cdot (1+2\beta ) \sqrt{\mu ^+} \le \frac{3\mu ^+}{\sqrt{\mu }} \,. \end{aligned}$$

Part (ii). Analogously to (i).

Part (iii). Immediate from parts (i) and (ii).

Part (iv). Follows by parts (i) and (ii), and by the lower bound on \(\sqrt{\mu /\mu ^+}\) obtained from Lemma 3.10(iv) as follows

$$\begin{aligned}\frac{\kappa _{ij}^{\delta }}{\kappa _{ij}^{\delta ^+}} = \frac{\delta ^+_i \delta _j}{\delta _i \delta ^+_j} \ge \frac{\mu }{4\mu ^+} = \frac{1}{4(1-\alpha ^+)} \ge \frac{\beta }{12\sqrt{n}\epsilon ^\textrm{ll}(w)}\ge 4n^{3.5}.\end{aligned}$$

\(\square \)

Lemma 4.4

(Restatement). Let \(w = (x,y,s) \in \mathcal N(\beta )\) for \(\beta \in (0,1/8]\), and let \(\mathcal{J}=(J_1, \ldots , J_p)\) be a \(\delta (w)\)-balanced partition. Assume that \(\xi _{\mathcal{J}}^\textrm{ll}(w) < 4\gamma n\), and let \(w^+ = (x^+, y^+, s^+)\in \overline{\mathcal{N}}(2\beta )\) be the next iterate obtained by the LLS step with \(\mu ^+=\mu (w^+)\) and assume \(\mu ^+ > 0\). Let \(q\in [p]\) such that \(\xi _{\mathcal{J}}^\textrm{ll}(w)=\xi _{J_q}^\textrm{ll}(w)\). If \(\ell ^{\delta ^+}(\mathcal J) \le 4\gamma n\), then there exist \(i,j\in J_q\) such that \(x_i^*\ge \beta x_i^+/(16n^{3/2})\) and \(s_j^*\ge \beta s_j^+/(16n^{3/2})\). Further, for any \(\ell ,\ell '\in J_q\), we have \(\rho ^{\mu ^+}(\ell ,\ell ')\ge -|J_q|\).

Proof of Lemma 4.4

Without loss of generality, let \(\xi _\mathcal{J}^\textrm{ll}=\xi _{J_q}^\textrm{ll}=\Vert Rx _{J_q}^\textrm{ll}\Vert \) for a layer q with \(J_q\subseteq {\varvec{N}}\). The case \(\xi _{J_q}^\textrm{ll}=\Vert Rs _{J_q}^\textrm{ll}\Vert \) and \(J_q\subseteq {\varvec{B}}\) can be treated analogously.

By Lemma 3.10(iii), \(\Vert Rs _{J_q}^\textrm{ll}\Vert \ge \frac{1}{2}-\frac{3}{4}\beta >\frac{1}{4}+2n\gamma \), and therefore Lemma 4.2 provides a \(j\in J_q\) such that \(s_j^*/s_j\ge 1/(6\sqrt{n})\). Using Lemma 3.3 and Proposition 3.1 we find that \(s_j^+/s_j \le 2n\) and so \(s_j^*/s_j^+ = s_j^*/s_j \cdot s_j/s_j^+ \ge 1/(12 n^{3/2}) > 1/(16 n^{3/2})\).

The final statement \(\rho ^{\mu ^+}(\ell ,\ell ')\ge -|J_q|\) for any \(\ell ,\ell '\in J_q\) is also straightforward. From Lemma 6.1(iii) and the strong connectivity of \(J_q\) in \(G_{\delta ,\gamma /n}\), we obtain that \(J_q\) is strongly connected in \(G_{\delta ^+,\gamma /(4n)}\). Hence, \(\rho ^{\mu ^+}(\ell ,\ell ')\ge -|J_q|\) follows by Lemma 4.1.

The rest of the proof is dedicated to showing the existence of an \(i\in J_q\) such that \(x_i^* \ge \beta x_i^+/(16 n^{3/2})\). For this purpose, we will prove the following claim.

Claim 1

\(\Vert \delta _{J_q} x^*_{J_q}\Vert \ge \frac{\beta \mu ^+}{8\sqrt{n\mu }}\).

In order to prove Claim 1, we define

$$\begin{aligned} z:= (\delta ^+)^{-1} L^{\delta ^+}_{J_{>q}}\left( \delta ^+_{J_{>q}}(x^*_{J_{>q}}-x^+_{J_{>q}})\right) \quad \text { and } w:= x^*-x^+-z\,, \end{aligned}$$

as in Lemma 5.2. By construction, \(w \in W\) and \(w_{J_{>q}} = 0\). Thus, \(w_{J_q} \in W_{\mathcal{J},q}\) as defined in Sect. 3.4.

Using the triangle inequality, we get

$$\begin{aligned} \Vert \delta _{J_q}x^*_{J_q}\Vert \ge \Vert \delta _{J_q}(x^+_{J_q}+w_{J_q})\Vert - \Vert \delta _{J_q}z_{J_q}\Vert \,. \end{aligned}$$
(49)

We bound the two terms separately, starting with an upper bound on \(\Vert \delta _{J_q}z_{J_q}\Vert \). Since \(\ell ^{\delta ^+}(\mathcal{J}) \le 4 \gamma n\), we have with Lemma 5.2 that

$$\begin{aligned} \begin{aligned} \left\| \delta ^+_{J_q} z_{J_q}\right\|&\le \ell ^{\delta ^+}(\mathcal{J}) \left\| \delta ^+_{J_{>q}}\left( x^*_{J_{>q}}-x^+_{J_{>q}}\right) \right\| \\&\le 4n \gamma \left\| \delta ^+_{J_{>q}}\left( x^*_{J_{>q}}-x^+_{J_{>q}}\right) \right\| \\&= 4n \gamma \left\| \delta _{J_{>q}}^+ x_{J_{>q}}^+\left( \frac{x_{J_{>q}}^*}{x_{J_{>q}}^+} - e\right) \right\| \\&\le 4n\gamma \left( \Vert \delta ^+ x^+\Vert _\infty \cdot \left\| \frac{x^*}{x^+}\right\| _1 + \sqrt{n\mu ^+}\right) \\&\le 4n\gamma \left( \frac{3}{2} \sqrt{\mu ^+} \cdot \frac{4}{3} n + \sqrt{n\mu ^+}\right) \\&\le 16n^2\sqrt{\mu ^+} \gamma , \end{aligned} \end{aligned}$$
(50)

where the penultimate inequality follows by Proposition 3.2 and Lemma 3.3. We can use this and Lemma 6.1(ii) to obtain

$$\begin{aligned} \Vert \delta _{J_q} z_{J_q}\Vert \le \Vert \delta _{J_q}/\delta ^+_{J_q}\Vert _\infty \cdot \Vert \delta ^+_{J_q} z_{J_q}\Vert \le \frac{32n^2\gamma \mu ^+}{\sqrt{\mu }}\le \frac{\beta \mu ^+}{32n^3\sqrt{\mu }}\,, \end{aligned}$$
(51)

using the definition of \(\gamma \).

The first term on the right-hand side of (49) is bounded as follows.

Claim 2

\(\Vert \delta _{J_q}(x^+_{J_q}+w_{J_q})\Vert \ge \frac{1}{2} \sqrt{\mu }\xi ^\textrm{ll}_{\mathcal{J}}\).

Proof of Claim 2

We recall the characterization (25) of the LLS step \(\Delta x^\textrm{ll}\in W\). Namely, there exists \(\Delta s \in W_{\mathcal{J},1}^\perp \times \cdots \times W_{\mathcal{J},p}^\perp \) that is the unique solution to \(\delta ^{-1} \Delta s + \delta \Delta x^\textrm{ll}= -\delta x\). From the above, note that

$$\begin{aligned} \Vert \delta ^{-1}_{J_q} \Delta s_{J_q}\Vert = \Vert \delta _{J_q} (x_{J_q}+\Delta x_{J_q}^\textrm{ll})\Vert =\sqrt{\mu } \Vert Rx _{J_q}^\textrm{ll}\Vert = \sqrt{\mu }\xi ^\textrm{ll}_{\mathcal{J}}\,. \end{aligned}$$

From the Cauchy-Schwarz inequality,

$$\begin{aligned} \begin{aligned} \Vert \delta ^{-1}_{J_q} \Delta s_{J_q}\Vert \cdot \Vert \delta _{J_q}(x^+_{J_q}+w_{J_q})\Vert&\ge \left| \left\langle \delta _{J_q}^{-1} \Delta s_{J_q}, \delta _{J_q}(x^+_{J_q}+w_{J_q})\right\rangle \right| \\&=\left| \left\langle \delta _{J_q}^{-1} \Delta s_{J_q}, \delta _{J_q}x^+_{J_q}\right\rangle \right| \,. \end{aligned} \end{aligned}$$
(52)

Here, we used that \(\Delta s_{J_q}\in W^\perp _{\mathcal{J},q}\) and \(w_{J_q}\in W_{\mathcal{J},q}\). Note that

$$\begin{aligned} x^+=x+\alpha \Delta x^\textrm{ll}=x+\Delta x^\textrm{ll}- (1-\alpha )\Delta x^\textrm{ll}=-\delta ^{-2}\Delta s- (1-\alpha )\Delta x^\textrm{ll}\,. \end{aligned}$$

Therefore,

$$\begin{aligned} \begin{aligned} \left| \left\langle \delta _{J_q}^{-1} \Delta s_{J_q}, \delta _{J_q}x^+_{J_q}\right\rangle \right|&=\left| \left\langle \delta _{J_q}^{-1} \Delta s_{J_q}, -\delta _{J_q}^{-1} \Delta s_{J_q} - (1-\alpha ) \delta _{J_q} \Delta x^\textrm{ll}_{J_q}\right\rangle \right| \\&\ge \Vert \delta _{J_q}^{-1} \Delta s_{J_q}\Vert ^2 - (1-\alpha ) \left| \left\langle \delta _{J_q}^{-1} \Delta s_{J_q},\delta _{J_q} \Delta x^\textrm{ll}_{J_q}\right\rangle \right| \,. \end{aligned} \end{aligned}$$

By Lemma 5.3, there exists \(\Delta \bar{x} \in W_{\mathcal{J},1} \times \cdots \times W_{\mathcal{J},p}\) such that \(\Vert \delta _{J_q}(\Delta x_{J_q}^\textrm{ll}-\Delta \bar{x}_{J_q})\Vert \le 2n\ell ^\delta (\mathcal{J})\sqrt{\mu } \). Therefore, using the orthogonality of \(\Delta s_{J_q}\) and \(\Delta \bar{x}_{J_q}\), we get that

$$\begin{aligned} \left| \left\langle \delta _{J_q}^{-1} \Delta s_{J_q},\delta _{J_q} \Delta x^\textrm{ll}_{J_q}\right\rangle \right| = \left| \left\langle \delta _{J_q}^{-1} \Delta s_{J_q},\delta _{J_q} (\Delta x^\textrm{ll}_{J_q}-\Delta \bar{x}_{J_q})\right\rangle \right| \le 2n\ell ^\delta (\mathcal{J})\sqrt{\mu } \cdot \Vert \delta _{J_q}^{-1} \Delta s_{J_q}\Vert \,. \end{aligned}$$

From the above inequalities, we see that

$$\begin{aligned} \Vert \delta _{J_q}(x^+_{J_q}+w_{J_q})\Vert \ge \Vert \delta ^{-1}_{J_q} \Delta s_{J_q}\Vert -2(1-\alpha ) n\ell ^\delta (\mathcal{J})\sqrt{\mu } =\sqrt{\mu }\xi ^\textrm{ll}_{\mathcal{J}} - 2(1-\alpha ) n\ell ^\delta (\mathcal{J})\sqrt{\mu }\,. \end{aligned}$$

It remains to show \((1-\alpha ) n\ell ^\delta (\mathcal{J})\le \xi ^\textrm{ll}_{\mathcal{J}}/4\). From Lemma 3.10(iv), we obtain

$$\begin{aligned} (1-\alpha ) n\ell ^\delta (\mathcal{J})\le 3n^{3/2}\ell ^\delta ( \mathcal{J}) \xi ^\textrm{ll}_{\mathcal{J}}\beta ^{-1},\, \end{aligned}$$

using \(\xi ^{\textrm{ll}}_\mathcal{J} \ge \varepsilon ^{\textrm{ll}}\). The claim now follows by the assumption \(\ell ^\delta ( \mathcal{J})\le \gamma \), and the choice of \(\gamma \). \(\square \)

Proof of Claim 1

Using Lemma 3.10(iv),

$$\begin{aligned} \mu ^+\le \frac{3\sqrt{n}\xi ^\textrm{ll}_{\mathcal{J}}\mu }{\beta }, \end{aligned}$$

implying \(\Vert \delta _{J_q}(x^+_{J_q}+w_{J_q})\Vert \ge \beta \mu ^+/(6\sqrt{n\mu })\) by Claim 2. Now the claim follows using (49) and (51). \(\square \)

By Lemma 6.1(ii), we see that

$$\begin{aligned} \Vert \delta _{J_q} x^+_{J_q}\Vert \le \sqrt{n} \Vert \delta _{J_q} x^+_{J_q}\Vert _\infty \le \frac{ 3 \sqrt{n} \mu ^+}{\sqrt{\mu }}\,. \end{aligned}$$

Thus, the lemma follows immediately from Claim 1: for at least one \(i\in J_q\), we must have

$$\begin{aligned} \frac{x_i^*}{x_i^+} \ge \frac{\Vert \delta _{J_q} x^*_{J_q}\Vert }{\Vert \delta _{J_q} x^+_{J_q}\Vert } \ge \frac{\beta }{24n} \ge \frac{\beta }{16n^{3/2}}. \end{aligned}$$

\(\square \)

Lemma 4.5

(Restatement). Let \(w = (x,y,s) \in \mathcal N(\beta )\) for \(\beta \in (0,1/8]\), and let \(\mathcal{J}=(J_1, \ldots , J_p)\) be a \(\delta (w)\)-balanced partition. Assume that \(\xi _{\mathcal{J}}^\textrm{ll}(w) < 4\gamma n\), and let \(w^+ = (x^+, y^+, s^+)\in \overline{\mathcal{N}}(2\beta )\) be the next iterate obtained by the LLS step with \(\mu ^+=\mu (w^+)\) and assume \(\mu ^+ > 0\). If \(\ell ^{\delta ^+}(\mathcal J) > 4\gamma n\), then there exist two layers \(J_q\) and \(J_r\) and \(i\in J_q\) and \(j\in J_r\) such that \(x_i^*\ge x^+_i/(8n^{3/2})\), and \(s_j^*\ge s^+_j/(8n^{3/2})\). Further, \(\rho ^{\mu ^+}(i,j)\ge -|J_q\cup J_r|\), and for all \(\ell ,\ell '\in J_q\cup J_r\), \(\ell \ne \ell '\) we have \(\Psi ^\mu (\ell ,\ell ')\le |J_q\cup J_r|\).

Proof of Lemma 4.5

Recall the sets \({\varvec{B}}\) and \({\varvec{N}}\) defined in (48). The key is to show the existence of an edge

$$\begin{aligned} (i',j')\in E_{\delta ^+,\gamma /(4n)}\quad \text{ such } \text{ that } \quad i'\in J_q\subseteq {\varvec{B}}, \quad j'\in J_r\subseteq {\varvec{N}},\quad r<q\,. \end{aligned}$$
(53)

Before proving the existence of such \(i'\) and \(j'\), we show how the rest of the statements follow. Note that \(x^+ \le (1-\beta )^{-1}(1+2 \cdot 2\beta ) nx \le \frac{7}{4} nx\) by Lemma 3.3 and Proposition 3.1. Further, we have \(\Vert Rx ^\textrm{ll}_{J_q}\Vert - 2\gamma n \ge \frac{1}{2} - \frac{3}{4} \beta - 2\gamma n \ge \frac{2}{5}\) by Lemma 3.10 (iii). The existence of \(i \in J_q\) such that \(x_i^*\ge x^+_i/(8n^{3/2})\) now follows immediately from Lemma 4.2, as there is an \(i \in J_q\) such that

$$\begin{aligned} x_i^* \ge \frac{2x_i}{3\sqrt{n}}\cdot (\Vert Rx _{J_q}^\textrm{ll}\Vert - 2\gamma n) \ge \frac{2}{3\sqrt{n}} \frac{4x_i^+}{7n}\frac{2}{5} \ge \frac{x_i^+}{8n^{3/2}}. \end{aligned}$$
(54)

An analogous argument shows that there exists \(j \in J_r\) such that \(s_j^*\ge s^+_j/(8n^{3/2})\). The other statements are that \(\rho ^{\mu ^+}(i,j)\ge -|J_q\cup J_r|\), and for each \(\ell ,\ell '\in J_q\cup J_r\), \(\ell \ne \ell '\), \(\Psi ^\mu (\ell ,\ell ')\le |J_q\cup J_r|\). According to Lemma 4.1, the latter is true (even with the stronger bound \(\max \{|J_q|,|J_r|\}\)) whenever \(\ell ,\ell '\in J_q\), or \(\ell ,\ell '\in J_r\), or if \(\ell \in J_q\) and \(\ell '\in J_r\). It remains to show the lower bound on \(\rho ^{\mu ^+}(i,j)\) and \(\Psi ^\mu (\ell ,\ell ')\le |J_q\cup J_r|\) for \(\ell '\in J_q\) and \(\ell \in J_r\).

From Lemma 6.1(iii), we have that if \(\ell ,\ell '\in J_q\subseteq {\varvec{B}}\) or \(\ell ,\ell '\in J_r\subseteq {\varvec{N}}\), then \( \kappa _{\ell \ell '}^{\delta }/4\le {\kappa _{\ell \ell '}^{\delta ^+}}\). Hence, the strong connectivity of \(J_r\) and \(J_q\) in \(G_{\delta ,\gamma }\) implies the strong connectivity of these sets in \(G_{\delta ^+, \gamma /(4n)}\). Together with the edge \((i',j')\), we see that every \(\ell '\in J_q\) can reach every \(\ell \in J_r\) on a directed path of length \(\le |J_q\cup J_r|-1\) in \(G_{\delta ^+,\gamma /(4n)}\). Applying Lemma 4.1 for this setting, we obtain \(\Psi ^\mu (\ell ,\ell ')\le \rho ^{\mu ^+}(\ell ,\ell ')\le |J_q\cup J_r|\) for all such pairs, and also \(\rho ^{\mu ^+}(i,j)\ge -|J_q\cup J_r|\).

The rest of the proof is dedicated to showing the existence of \(i'\) and \(j'\) as in (53). We let \(k \in [p]\) such that \(\ell ^{\delta ^+}(J_{\ge k}) = \ell ^{\delta ^+}(\mathcal J) > 4n\gamma \). To simplify the notation, we let \(I=J_{\ge k}\).

When constructing \(\mathcal J\) in Layering(\(\delta ,\hat{\kappa }\)), the subroutine Verify-Lift(\({\text {Diag}}(\delta )W,I,\gamma \)) was called for the set \(I=J_{\ge k}\), with the answer ‘pass’. Besides \(\ell ^\delta (I)\le \gamma \), this guaranteed the stronger property that \(\max _{ji}|B_{ji}|\le \gamma \) for the matrix B implementing the lift (see Remark 2.17).

Let us recall how this matrix B was obtained. The subroutine starts by finding a minimal \(I'\subset I\) such that \(\dim (\pi _{I'}(W)) = \dim (\pi _I(W))\). Recall that \(\pi _{I'}(W) = \mathbb {R}^{I'}\) and \(L_I^\delta (p) = L_{I'}^\delta (p_{I'})\) for every \(p \in \pi _I({\text {Diag}}(\delta )W)\).

Consider the optimal lifting \(L_I^\delta :\pi _I({\text {Diag}}(\delta )W)\rightarrow {\text {Diag}}(\delta )W\). We defined \(B \in \mathbb {R}^{([n] \setminus I) \times I'}\) as the matrix sending any \(q \in \pi _{I'}({\text {Diag}}(\delta )W)\) to the corresponding vector \([L_{I'}^\delta (q)]_{[n]\setminus I}\). The column \(B_i\) can be computed as \([L_{I'}^\delta (e^i)]_{[n]{\setminus } I}\) for \(e^i \in \mathbb {R}^{I'}\).

We consider the transformation

$$\begin{aligned} {\bar{B}}:={\text {Diag}}(\delta ^+\delta ^{-1})B{\text {Diag}}\big ((\delta ^+_{I'})^{-1}\delta _{I'}\big ). \end{aligned}$$

This maps \(\pi _{I'}({\text {Diag}}(\delta ^+)W)\rightarrow \pi _{[n]{\setminus } I} ({\text {Diag}}(\delta ^+)W)\).

Let \(z \in \pi _I({\text {Diag}}(\delta ^+) W)\) be the singular vector corresponding to the maximum singular value of \(L_I^{\delta ^+}\), namely, \(\Vert [L_{I}^{\delta ^+}(z)]_{[n]{\setminus } I}\Vert > 4n\gamma \Vert z\Vert \). Let us normalize z such that \(\Vert z_{I'}\Vert =1\). Thus,

$$\begin{aligned} \left\Vert [L_{I'}^{\delta ^+}(z_{I'})]_{[n]\setminus I}\right\Vert > 4n\gamma \,. \end{aligned}$$

Let us now apply \({\bar{B}}\) to \(z_{I'}\in \pi _{I'}({\text {Diag}}(\delta ^+)W)\). Since \(L_I^{\delta ^+}\) is the minimum-norm lift operator, we see that

$$\begin{aligned} \left\Vert {\bar{B}} z_{I'}\right\Vert \ge \left\Vert [L_{I'}^{\delta ^+}(z_{I'})]_{[n]\setminus I}\right\Vert > 4n\gamma \,. \end{aligned}$$

We can upper bound the operator norm by the Frobenius norm \(\Vert {\bar{B}}\Vert \le \Vert {\bar{B}}\Vert _F = \sqrt{\sum _{ji} {{\bar{B}}_{ji}}^2} \le n\max _{ji} |{\bar{B}}_{ji}|\), and therefore

$$\begin{aligned} \max _{ji} |{\bar{B}}_{ji}|> 4\gamma \,. \end{aligned}$$

Let us fix \(i'\in I'\) and \(j'\in [n]{\setminus } I\) as indices attaining the maximum of \(|{\bar{B}}_{ji}|\). Note that \({\bar{B}}_{j'i'}=B_{j'i'}\delta ^+_{j'}\delta _{i'}/(\delta ^+_{i'}\delta _{j'})\).

Let us now use Lemma 2.16 for the pair \(i',j'\), the matrix B and the subspace \({\text {Diag}}(\delta )W\). Noting that \(B_{j'i'}=[L_{I'}^\delta (e^{i'})]_{j'}\), we obtain \(\kappa _{i'j'}^\delta \ge |B_{j'i'}|\). Now,

$$\begin{aligned} \kappa _{i'j'}^{\delta ^+} = \kappa _{i'j'}^\delta \cdot \frac{\delta ^+_{j'}\delta _{i'}}{\delta ^+_{i'}\delta _{j'}}\ge |B_{j'i'}|\cdot \frac{\delta ^+_{j'}\delta _{i'}}{\delta ^+_{i'}\delta _{j'}}= |{\bar{B}}_{j'i'}| > 4\gamma \, . \end{aligned}$$
(55)

The next claim finishes the proof. \(\square \)

Claim 6.2

For \(i'\) and \(j'\) selected as above, (53) holds.

Proof

\((i',j') \in E_{\delta ^+, \gamma /(4n)}\) holds by (55). From the above, we have

$$\begin{aligned} |B_{j'i'}| > 4\gamma \cdot \frac{\delta ^+_{i'} \delta _{j'}}{\delta _{i'} \delta ^+_{j'}}\,. \end{aligned}$$

According to Remark 2.17, \(|B_{j'i'}|\le \gamma \) follows since Verify-Lift(\({\text {Diag}}(\delta )W,I,\gamma \)) returned with ‘pass’. We thus have

$$\begin{aligned} \frac{\delta ^+_{i'} \delta _{j'}}{\delta _{i'} \delta ^+_{j'}} < \frac{1}{4}. \end{aligned}$$

Lemma 6.1 excludes the scenarios \(i',j'\in {\varvec{N}}\), \(i',j'\in {\varvec{B}}\), and \(i'\in {\varvec{N}}\), \(j'\in {\varvec{B}}\), leaving \(i'\in {\varvec{B}}\) and \(j'\in {\varvec{N}}\) as the only possibility. Therefore, \(i'\in J_q\subseteq {\varvec{B}}\) and \(j'\in J_r\subseteq {\varvec{N}}\). We have \(r<q\) since \(i'\in I=J_{\ge k}\) and \(j'\in [n]{\setminus } I=J_{<k}\). \(\square \)

7 Initialization

Our main algorithm (Algorithm 2 in Sect. 3.6) requires an initial solution \(w^0=(x^0,y^0,s^0)\in \mathcal{N}(\beta )\). In this section, we remove this assumption by adapting the initialization method of [63] to our setting.

We use the “big-M method”, a standard initialization approach for path-following interior point methods that introduces an auxiliary system whose optimal solutions map back to the optimal solutions of the original system. The primal-dual system we consider is

$$\begin{aligned} \begin{aligned} \min \; c^\top x +&Me^\top \underline{x}&\max \; y^\top b + 2&Me^\top z \\ Ax - A\underline{x}&= b&A^\top y + z + s&= c\\ x + \bar{x}&= 2Me&z + \bar{s}&= 0 \\ x,\bar{x},\underline{x}&\ge 0&-A^\top y + \underline{s}&= Me \\{} & {} s,\bar{s}, \underline{s}&\ge 0. \end{aligned} \end{aligned}$$
(Init-LP)

The constraint matrix used in this system is

$$\begin{aligned} \hat{A} = \left( \begin{array}{ccc} A &{} -A &{} 0 \\ I &{} 0 &{} I \\ \end{array}\right) \end{aligned}$$

The next lemma asserts that the \(\bar{\chi }\) condition number of \(\hat{A}\) is not much bigger than that of A of the original system (LP).

Lemma 7.1

[63, Lemma 23] \(\bar{\chi }_{\hat{A}} \le 3\sqrt{2}(\bar{\chi }_A+ 1).\)

We extend this bound for \(\bar{\chi }^*\).

Lemma 7.2

\({\bar{\chi }}^*_{\hat{A}} \le 3\sqrt{2}(\bar{\chi }_A^*+1)\).

Proof

Let \(D \in \textbf{D}_n\) and let \(\hat{D} \in \textbf{D}_{3n}\) be the matrix consisting of three copies of D, i.e.

$$\begin{aligned} \hat{D}&= \left( \begin{array}{ccc} D &{} 0 &{} 0 \\ 0 &{} D &{} 0 \\ 0 &{} 0 &{} D \end{array}\, \right) . \end{aligned}$$

Then

$$\begin{aligned} \hat{A} \hat{D}&= \left( \begin{array}{ccc} AD &{}\quad - AD &{}\quad 0 \\ D &{}\quad 0 &{}\quad D \end{array}\right) . \end{aligned}$$

Row-scaling does not change \(\bar{\chi }\), as the kernel of the matrix remains unchanged. Thus, we can rescale the last n rows of \(\hat{A} \hat{D}\) so that the bottom block-row becomes \((I \;\; 0 \;\; I)\), i.e., multiply by \((I, D^{-1})\) as a block-diagonal matrix from the left. We observe that

$$\begin{aligned} \bar{\chi }_{\hat{A}\hat{D}} = \bar{\chi } \left( \left( \begin{array}{ccc} AD &{} -AD &{} 0 \\ I &{} 0 &{} I \end{array} \right) \right) \le 3\sqrt{2} (\bar{\chi }_{AD} + 1) \end{aligned}$$

where the inequality follows from Lemma 7.1. The lemma now readily follows as

$$\begin{aligned} {\bar{\chi }}^*_{\hat{A}}&= \inf \{\bar{\chi }_{\hat{A} D'} : D' \in \textbf{D}_{3n} \} \le \inf \{\bar{\chi }_{\hat{A} \hat{D}} : D \in \textbf{D}_{n} \} \le \inf \{3\sqrt{2}(\bar{\chi }_{AD} + 1): D \in \textbf{D}_n\} = 3\sqrt{2}(\bar{\chi }_A^*+1). \end{aligned}$$

\(\square \)
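For tiny instances, the bound of Lemma 7.1 can be sanity-checked numerically. The sketch below is our own illustration; it assumes the classical characterization \(\bar{\chi }_A = \max _B \Vert A_B^{-1}A\Vert \) over nonsingular bases (the "\(\le \)" direction is what Proposition 2.1(ii) provides), and enumerating bases is of course only feasible for very small matrices.

```python
import numpy as np
from itertools import combinations

def chi_bar(A):
    # chi_bar via the basis characterization: max over nonsingular bases B of ||A_B^{-1} A||_2
    m, n = A.shape
    best = 0.0
    for cols in combinations(range(n), m):
        AB = A[:, cols]
        if abs(np.linalg.det(AB)) < 1e-10:
            continue
        best = max(best, np.linalg.norm(np.linalg.solve(AB, A), 2))
    return best

rng = np.random.default_rng(3)
m, n = 2, 4
A = rng.standard_normal((m, n))

# Extended constraint matrix of (Init-LP):  [A  -A  0 ; I  0  I]
Ahat = np.block([[A, -A, np.zeros((m, n))],
                 [np.eye(n), np.zeros((n, n)), np.eye(n)]])

print(chi_bar(Ahat) <= 3 * np.sqrt(2) * (chi_bar(A) + 1))   # True, as in Lemma 7.1
```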

We show next that the optimal solutions of the original system are preserved for sufficiently large M. We let d be the min-norm solution to \(Ax=b\), i.e., \(d = A^\top (AA^\top )^{-1}b\).

Proposition 7.3

Assume both primal and dual of (LP) are feasible, and \(M > \max \{(\bar{\chi }_{A}+1)\Vert c\Vert , \bar{\chi }_{A}\Vert d\Vert \}\). Every optimal solution \((x, y, s)\) to (LP) can be extended to an optimal solution \((x,\underline{x},\bar{x}, y,z,s,\underline{s}, \bar{s})\) to (Init-LP); and conversely, from every optimal solution \((x,\underline{x},\bar{x}, y,z,s,\underline{s}, \bar{s})\) to (Init-LP), we obtain an optimal solution \((x, y, s)\) to (LP) by deleting the auxiliary variables.

Proof

If system (LP) is feasible, it admits a basic optimal solution \((x^*,y^*,s^*)\) with basis B such that \(A_Bx_B^* = b\), \(x^* \ge 0\), \(A_B^\top y^* = c_B\) and \(A^\top y^* \le c.\) Using Proposition 2.1(ii) we see that

$$\begin{aligned} \Vert x_B^*\Vert&= \Vert A_B^{-1}b\Vert = \Vert A_B^{-1}Ad\Vert \le \bar{\chi }_{A}\Vert d\Vert < M \, , \end{aligned}$$
(56)

and using that \(\Vert A\Vert = \Vert A^\top \Vert \) we observe

$$\begin{aligned} \Vert A^\top y^*\Vert&= \Vert A^\top A_B^{-\top }c_B\Vert \le \Vert A^\top A_B^{-\top }\Vert \Vert c_B\Vert = \Vert A_B^{-1}A\Vert \Vert c_B\Vert \le \bar{\chi }_A\Vert c\Vert < M. \end{aligned}$$
(57)

We can extend this solution to a solution of system (Init-LP) by setting \(\bar{x}^* = 2Me - x^*\), \(\underline{x}^* =0\), \(z^* = \bar{s}^* = 0\) and \(\underline{s}^* = Me + A^\top y^*\). Observe that \(\bar{x}^* > 0\) and \(\underline{s}^* > 0\) by (56) and (57). Furthermore, by complementary slackness this extended solution is optimal for (Init-LP). Since \(\underline{s}^* > 0\), complementary slackness also implies that \(\underline{x}\) vanishes in every optimal solution of (Init-LP); thus the optimal solutions of (LP) are exactly the optimal solutions of (Init-LP) with the auxiliary variables removed. \(\square \)

The next lemma is from [36, Lemma 4.4]. Recall that \(w=(x,y,s)\in \mathcal{N}(\beta )\) if \(\Vert xs/\mu (w)-e\Vert \le \beta \).

Lemma 7.4

Let \(w=(x,y,s)\in \mathcal{P}^{++}\times \mathcal{D}^{++}\), and let \(\nu >0\). Assume that \(\Vert xs/\nu -e\Vert \le \tau \). Then \((1-\tau /\sqrt{n})\nu \le \mu (w)\le (1+\tau /\sqrt{n})\nu \) and \(w\in \mathcal{N}(\tau /(1-\tau ))\).

The new system has the advantage that it can easily be initialized with a feasible solution in close proximity to the central path:

Proposition 7.5

We can initialize system (Init-LP) close to the central path with initial solution \(w^0 = (x^0, y^0, s^0) \in \mathcal {N}(1/8)\) and parameter \(\mu (w^0) \approx M^2\) if \(M > 15\max \{(\bar{\chi }_A + 1)\Vert c\Vert , \bar{\chi }_A \Vert d\Vert \}\).

Proof

The initialization follows along the lines of [63, Section 10]. We let d be as above, and set

$$\begin{aligned} \bar{x}^0&= Me, x^0 = Me, \underline{x}^0 = Me -d \\ y^0&= 0, z^0 = -Me \\ \bar{s}^0&= Me, s^0 = Me + c, \underline{s}^0 = Me. \end{aligned}$$

This is a feasible primal-dual solution to system (Init-LP) with parameter

$$\begin{aligned} \mu ^0&= (3n)^{-1}(\langle x^0, s^0 \rangle + \langle \underline{x}^0, \underline{s}^0 \rangle + \langle \bar{x}^0, \bar{s}^0 \rangle ) = (3n)^{-1} (3nM^2 + Mc^\top e - Md^\top e)\approx M^2\, . \end{aligned}$$

We see that

$$\begin{aligned} \left\| \frac{1}{M^2}\begin{pmatrix}\bar{x}^0 \bar{s}^0 \\ x^0s^0 \\ \underline{x}^0 \underline{s}^0 \end{pmatrix} - e\right\| ^2&= M^{-2}\Vert c\Vert ^2 + M^{-2}\Vert d\Vert ^2 \le \frac{1}{9^2\bar{\chi }_A^2} \le \frac{1}{9^2}. \end{aligned}$$

With Lemma 7.4 we conclude that \(w^0 = (x^0, y^0, s^0) \in \mathcal {N}\left( \frac{1/9}{1-1/9}\right) = \mathcal {N}(1/8)\). \(\square \)
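The computations in the proof are easy to reproduce. The sketch below is our own illustration (M is simply taken "large" relative to \(\Vert c\Vert \) and \(\Vert d\Vert \) rather than via the \(\bar{\chi }_A\)-dependent bound of Proposition 7.5); it builds the initial point and verifies feasibility for (Init-LP) together with the proximity bound.

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 3, 6
A = rng.standard_normal((m, n))
b = A @ rng.standard_normal(n)
c = rng.standard_normal(n)

d = A.T @ np.linalg.solve(A @ A.T, b)                  # min-norm solution of Ax = b
M = 100.0 * max(np.linalg.norm(c), np.linalg.norm(d))  # stand-in for the bound of Prop. 7.5
e = np.ones(n)

# initial point of Proposition 7.5
x0, xbar0, xund0 = M * e, M * e, M * e - d
y0, z0 = np.zeros(m), -M * e
s0, sbar0, sund0 = M * e + c, M * e, M * e

# primal and dual feasibility of (Init-LP)
print(np.allclose(A @ x0 - A @ xund0, b), np.allclose(x0 + xbar0, 2 * M * e))
print(np.allclose(A.T @ y0 + z0 + s0, c), np.allclose(z0 + sbar0, 0),
      np.allclose(-A.T @ y0 + sund0, M * e))

# proximity to the central path, equal to sqrt(||c||^2 + ||d||^2) / M
prod = np.concatenate([xbar0 * sbar0, x0 * s0, xund0 * sund0])
print(np.linalg.norm(prod / M**2 - np.ones(3 * n)))    # small, in line with the proof
```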

Detecting infeasibility To use the extended system (Init-LP), we still need to assume that both the primal and dual programs in (LP) are feasible. For arbitrary instances, we first need to check if this is the case, or conclude that the primal or the dual (or both) are infeasible.

This can be done by employing a two-phase method. The first phase decides feasibility by running (Init-LP) with data \((A, b, 0)\) and \(M > \bar{\chi }_A \Vert d\Vert _1\). The objective value of the optimal primal-dual pair is 0 if and only if (LP) has a feasible solution. If the optimal primal/dual solution \((x^*, \underline{x}^*, \bar{x}^*, y^*, z^*, s^*, \underline{s}^*, \bar{s}^*)\) has positive objective value, we can extract an infeasibility certificate in the following way.

We can w.l.o.g. assume that \(x^*\) is supported on some basis B of A. Note that the objective function of the primal is equivalent to \(\Vert \underline{x}\Vert _1\). Therefore, clearly \(\Vert \underline{x}^*\Vert _1 \le -\sum _{i: d_i < 0} d_i \le \Vert d\Vert _1\) and so \(\Vert \underline{x}^*\Vert \le \Vert d\Vert _1\). Due to the constraint \(Ax^* - A\underline{x}^* = b = Ad\) we get that

$$\begin{aligned} \Vert x^*\Vert = \Vert A_B^{-1}A(d + \underline{x}^*)\Vert \le \Vert A_B^{-1}A\Vert (\Vert d\Vert + \Vert \underline{x}^*\Vert ) \le 2\bar{\chi }_A \Vert d\Vert _1. \end{aligned}$$
(58)

Therefore, if \(M > \bar{\chi }_A \Vert d\Vert _1\), then \(\bar{x}^* = 2Me - x^* > 0\), so by complementary slackness, \(\bar{s}^* = 0\). From the dual, we conclude that \(z^* = 0\), and therefore \(A^\top y^* \le A^\top y^* + s^* + z^* = c = 0\). On the other hand, by assumption the objective value of the dual is positive, and so \({(y^*)}^\top b \ge {(y^*)}^\top b + 2\,M e^\top z^* > 0\). Hence, \(y^*\) is the desired certificate.

Feasibility of the dual of (LP) can be decided by running (Init-LP) on data \((A, 0, c)\) and \(M > (\bar{\chi }_A + 1)\Vert c\Vert \), using the same argument. Either the objective value of the dual is 0, and therefore the dual optimal solution \((y^*,z^*, \underline{s}^*, s^*, \bar{s}^*)\) corresponds to a feasible dual solution of (LP); or the objective value is negative, and we extract a dual infeasibility certificate as follows. For the corresponding optimal primal solution \((x^*, \underline{x}^*, \bar{x}^*)\) we have, by assumption, \(c^\top x^* \le c^\top x^* + Me^\top \underline{x}^* < 0\). Furthermore, w.l.o.g. the support of \(s^*\) is contained in a basis, which allows us to conclude that \(\underline{s}^* > 0\) and therefore \(\underline{x}^* = 0\). So we have \(Ax^* = A\underline{x}^* = 0\), which together with \(c^\top x^* < 0\) yields the certificate of dual infeasibility.

Finding the right value of M While Algorithm 2 does not require any estimate on \(\bar{\chi }^*\) or \(\bar{\chi }\), the initialization needs to set \(M \ge \max \{(\bar{\chi }_{A}+1)\Vert c\Vert , \bar{\chi }_{A}\Vert d\Vert \}\) as in Proposition 7.3.

A straightforward guessing approach (attributed to Renegar in [63]) starts with a constant guess, say \(\bar{\chi }_A=100\), constructs the extended system, and runs the algorithm. In case the optimal solution to the extended system does not map to an optimal solution of (LP), we restart with \(\bar{\chi }_A=100^2\) and try again; we continue squaring the guess until an optimal solution is found.

This would still require a series of \(\log \log \bar{\chi }_A\) guesses, and would thus result in a dependence on \(\bar{\chi }_A\) in the running time. However, if we initially rescale our system using the near-optimal rescaling of Theorem 2.5, then the dependence on \(\bar{\chi }_A\) can be replaced by a dependence on \(\bar{\chi }^*_A\). The overall iteration complexity remains \(O(n^{2.5}\log n\log (\bar{\chi }^*_A+n))\), since the running time for the final guess on \(\bar{\chi }^*_A\) dominates the total running time of all previous computations due to the repeated squaring.

An alternative approach, which does not rescale the system, is to use Theorem 2.5 to approximate \(\bar{\chi }_A\). In this case we repeatedly square a guess of \(\bar{\chi }_A^*\) instead of \(\bar{\chi }_A\), which takes \(O(\log \log \bar{\chi }_A^*)\) squarings until our guess yields a valid upper bound for \(\bar{\chi }_A\).

Note that either guessing technique handles bad guesses gracefully. In the first phase, if neither a feasible solution to (LP) is returned nor a Farkas certificate can be extracted, then by the discussion above the guess was too low. Similarly, in phase two, once feasibility has been established for both the primal and the dual, an optimal solution to (Init-LP) that does not correspond to an optimal solution of (LP) serves as a certificate that another squaring of the guess is necessary.
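To illustrate the guessing scheme, here is a minimal sketch of the outer loop. It is only an illustration: a generic LP solver (SciPy's `linprog`) stands in for Algorithm 2, the variable \(\bar{x}\) is eliminated via \(x+\bar{x}=2Me\), the initial guess of 100 and the factor 15 follow the discussion above and Proposition 7.5, and we assume (LP) is primal and dual feasible so that the termination test makes sense.

```python
import numpy as np
from scipy.optimize import linprog

def solve_init_lp(A, b, c, M):
    # (Init-LP) with xbar eliminated:  min c^T x + M e^T xund
    # s.t.  Ax - A xund = b,  0 <= x <= 2Me,  xund >= 0.
    # A generic solver stands in for Algorithm 2 here.
    n = A.shape[1]
    A_eq = np.hstack([A, -A])
    c_ext = np.concatenate([c, M * np.ones(n)])
    bounds = [(0, 2 * M)] * n + [(0, None)] * n
    return linprog(c_ext, A_eq=A_eq, b_eq=b, bounds=bounds, method="highs")

def solve_by_guessing(A, b, c, tol=1e-8):
    n = A.shape[1]
    d = A.T @ np.linalg.solve(A @ A.T, b)          # min-norm solution of Ax = b
    guess = 100.0                                  # initial guess for chi_bar_A
    while True:
        M = 15 * max((guess + 1) * np.linalg.norm(c), guess * np.linalg.norm(d), 1.0)
        res = solve_init_lp(A, b, c, M)
        x, xund = res.x[:n], res.x[n:]
        # the solution maps back to (LP) iff the auxiliary variables vanish
        # and no artificial upper bound x <= 2Me is active
        if np.all(xund <= tol) and np.all(x <= 2 * M - tol):
            return x
        guess = guess ** 2                         # guess too small: square it and retry
```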