Gaining traction: on the convergence of an inner approximation scheme for probability maximization

We analyze an inner approximation scheme for probability maximization. The approach was proposed in Fábián et al. (Acta Polytech Hung 15:105–125, 2018), as an analogue of a classic dual approach in the handling of probabilistic constraints. Even a basic implementation of the maximization scheme proved usable and endured noise in gradient computations without any special effort. Moreover, the speed of convergence was not affected by approximate computation of test points. This robustness was then explained in an idealized setting. Here we work out convergence proofs and efficiency arguments for a nondegenerate normal distribution. The main message of the present paper is that the procedure gains traction as an optimal solution is approached.


Introduction
Let F(z) denote an n-dimensional nondegenerate standard normal distribution function. Due to logconcavity of the normal distribution, the probabilistic function φ(z) = − log F(z) is convex. We consider a probability maximization problem in the form where vectors are x ∈ IR m , b ∈ IR r , and the matrices T and A are of sizes n × m and r × m, respectively. We are going to examine a simple and straightforward solution method, the one proposed in Fábián et al. (2018). It builds an inner approximation of the epigraph of φ(z), based on function evaluations in certain test points. A master problem is formulated using this approximation. Further test points are selected in the course of the procedure, with a view of gradually improving the optimum of the master problem.
In the computational study of Fábián et al. (2018), the speed of the convergence was not affected by approximate computation of test points. This robustness was then explained in an idealized setting, considering a globally well-conditioned objective function. In this paper we work out convergence proofs for the probabilistic function φ(z) specified above.

The broader context: inner approximation in probabilistic problems
Our scheme is analogous to the well-known dual approach in the handling of probabilistic constraints of the form where p 0 is a prescribed probability. The classic dual approach applies an inner approximation of the level set L p . The approach has been initiated by Prékopa (1990), and the inner approximation was first applied to a probabilistic constraint by Prékopa et al. (1998). In Dentcheva et al. (2000), a cone generation scheme was developed for the continual improvement of the approximation. In this scheme, new test points are found by minimizing a tractable function over the level set L p . Since this minimization entails a substantial computational effort, the master part of the decomposition framework should succeed with as few test points as possible. Efficient solution methods were developed by Dentcheva et al. (2004) and Dentcheva and Martinez (2013), approximating the original distribution by a discrete one and applying regularization to the master problem. More recently, van Ackooij et al. (2017) employed a special bundle-type method for the solution of the master problem, based on the on-demand accuracy approach of de Oliveira and Sagastizábal (2014). Each of these methods are quite complex, and, looking at them in a chronological order, an increasing level of complexity is noticeable. An effective solver based on this approach demands a sophisticated implementation.
In the classic scheme for probabilistic constraints, finding a new test point amounts to minimization over the level set L p . In the scheme we are going to examine, a new approximation point is found by unconstrained minimization, with considerably less computational effort. The unconstrained problems can be approximately solved by a simple gradient descent method. Of course the probability maximization problem (1) is easier than the handling of a probabilistic constraint (2). A Newton-type scheme for the handling of the latter was proposed in Fábián et al. (2019). It requires the approximate solution of a short sequence of problems of the former type. (Initial problems in this sequence are solved with a large stopping tolerance, and the accuracy is gradually increased.) Typically, gradient computations of probabilistic functions require a far greater effort than function evaluations. Computing a single non-zero component of a gradient vector will involve an effort comparable to that of computing a function value. (An alternative means of alleviating the difficulty of gradient computation in case of multivariate normal distribution has recently been proposed by Hantoute et al. 2018.) In comparison with the outer approximation approach widely used in probabilistic programming, we mention that it requires a sophisticated implementation to deal with noise in gradient computation. Even a fairly accurate gradient may result in a cut cutting into the level set. The analogous difficulty does not occur in case of inner approximation. We can easily keep the inner approximating model from bulging out of the epigraph. A procedure that employs an inner approximation of the epigraph will endure noise in gradient computation without any special effort, provided function values are evaluated with appropriate accuracy. Inherent stability of the model enables the application of randomized methods of simple structure. Such methods have been worked out in Fábián et al. (2019).

Overview of the paper
Section 2 contains problem and model formulation, under mild assumptions. We formulate the probability maximization problem and its Lagrangian dual. Moreover, we construct respective polyhedral models of this primal-dual pair of convex problems.
Section 3 is a brief overview of the solution method proposed in Fábián et al. (2018). This is a simple procedure, formulated here as a column generation scheme with a linear programming master problem. The column generating subproblems are unconstrained convex problems, conveniently solved by the steepest descent method. In the computational study reported in Fábián et al. (2018), we applied approximate solutions of the subproblems, performing just a few line searches in each subproblem. This heuristic procedure never resulted in any substantial increase in the number of the iterations needed to solve a test problem. We actually found that a single line search performs well in practice.
This paper presents a convergence analysis of the simple procedure of Sect. 3, and an explanation of the experimental efficiency of a single line search, in case the objective function is derived from a nondegenerate normal distribution. The objects defined in the remaining sections are not employed in the algorithm itself. Their sole role is to enable convergence proofs and to support efficiency arguments. (The only exception is the box Z in Sect. 8.2, which has some minor role in a modified algorithm.) Spadework for the analysis is done in Sect. 4, where we show that, with proper initialization of the test points, the solutions of the successive model problems will remain within well-defined bounds. These bounds, however, may be huge and have no practical relevance.
The bounds are applied in Sect. 5, where we present a theoretical estimate concerning the efficiency of a single line search in the column generating subproblem of Sect. 3. Though this estimate is not suitable for practical considerations.
In Sect. 6, we look at the whole column generation scheme from a dual viewpoint, and interpret it as an inexact cutting-plane scheme. The theoretical efficiency estimate of Sect. 5 is interpreted here as a rule that controls cut tolerance. The advantage of this viewpoint is that we can apply classic results concerning the uniform convergence of convex functions and the convergence of exact cutting-plane schemes. We prove that the resulting inexact cutting-plane scheme is convergent. Though this proof does not directly yield practical efficiency results.
In Sect. 7, we return to the primal viewpoint of the column generation scheme. We assume that the probabilistic function φ(z) is well-conditioned in the optimal solution of the original probability maximization problem. The convergence results of Sect. 6 imply that, starting with a certain iteration in the column generation scheme, we can focus on a certain neighborhood of the optimal solution. We show that, starting with that certain iteration, a practicable version of the theoretical efficiency estimate of Sect. 5 will hold.
In Sect. 8, we consider the master problem in a linear programming form, to be solved by the simplex method. In the simplex context, we look on the estimate of Sect. 7 as a guarantee of the workability of the column-selection rule of our simplex variant. Our arguments will then be based on the practical efficiency of the simplex method.
In Sect. 9, we illustrate the locally well-conditioned character of the probabilistic objective function, and a closing discussion is included in Sect. 10.

Problem and model formulation
We assume that the feasible domain of the probability maximization problem (1) is not empty, and is contained in the box Exploiting the monotonicity of the objective function, problem (1) can be written as (3) Assumption 1 A significantly high probability can be achieved in the probability maximization problem. Specifically, a feasible pointž is known such that F(ž) ≥ 0.5.
The purpose of this assumption is not just to provide a starting solution. It will also help keeping the solution process in a safe area where the slope of the log(.) function is moderate. A further speciality of the normal distribution function is the existence of a bounded box Z outside which the probability weight can be ignored. For the sake simplicity, we assume that this box has the form Z = { z ∈ IR n | − 1 ≤ z j ≤ 1 ( j = 1, . . . , n) }. Including the constraint z ∈ Z in (3) results in the approximating problem As observed in Fábián et al. (2019), the difference between the respective optima of problems (3) and (4) is insignificant. (In the proof, we need an upper bound on the gradients of the objective function φ(z) = − log F(z) at the optimal solutions. Such a bound can be derived from Assumption 1.) The bound z ∈ Z in (4) allows regularization of the objective function, in the form of with some ρ > 0. Substituting this regularized objective in (4) makes no significant variation in the objective value of z ∈ Z, provided ρ is small enough. On the other hand, the regularizing term improves the condition of the objective and ensures that the conjugate function φ (.) is finite valued. In this paper, we are going to work with this regularized objective, with an appropriately small ρ. The aim is just to improve numerical characteristics of φ(z). We do not wish to omit Z in the problem formulation, because it will have a role in the algorithm. Splitting the variables z, we transform (4) into the equivalent form Problem (6) has an optimal solution because the feasible domain is nonempty and bounded. We relax the constraints z − z = 0; z − T x ≤ 0 and Ax − b ≤ 0 by introducing respective multiplier vectors u ∈ IR n ; −v ∈ IR n , −v ≥ 0 and − y ∈ IR r , − y ≥ 0. The Lagrangian is (In a strict sense, we actually applied the multiplier −u to the first constraint.) The relaxed problem falls apart into three separate minimization problems: The first minimum is by definition the negative of the conjugate function value φ (u). Due to the special form of the box Z, the second minimum is − u − v 1 . The third minimum can be computed in a similar manner, and the optimum of the relaxed problem is Introducing the function the Lagrangian dual of (6) can be written as According to the theory of convex duality, this problem has an optimal solution. Even strong duality relationship holds between (6) and (11), because we assumed linear constraints in the primal problem. We could even allow nonlinear convex constraints, assumed that Slater's condition is satisfied. For a recent treatise on Lagrangian duality, see, e.g., Chapter 4 in the book Ruszczyński (2006). An optimal solution of the primal problem (6) has the form of (z , z , x ). With some laxity, I will call a vector z ∈ IR n an optimal solution of (6), if it can be extended into a strict-sense optimal solution (z , z , x ). This z is unique due to the strict convexity of the regularized objective function φ(z). The objective function is also smooth, hence its conjugate is strictly convex (see Theorem 26.1 in Rockafellar (1970)). Hence the dual problem (11) has a unique optimal solution u .
Observation 2 Let z and u denote the respective optimal solutions of the primal problem (6) and the dual problem (11). We have Proof Let us extend z to a strict-sense optimal solution (z , z , x ) of the primal problem (6). Given u , let v , y denote an optimal solution of the supremum problem in (10). Then (z , z , x ), (−u , −v , − y ) is a saddle point of the Lagrangian (7). Hence the Karush-Kuhn-Tucker conditions for the optimality of (z , z , x ) in (6) hold with Lagrange multipliers (−u , −v , − y ), and (12) is part of the Karush-Kuhn-Tucker conditions. Proofs of the cited statements can be found, e.g. in Ruszczyński (2006), Theorems 4.9, 4.7 and 3.34.

Polyhedral models
Suppose we have evaluated the function φ(z) at points z i (i = 0, 1, . . . , k); we introduce the notation φ i = φ(z i ) for respective objective values. An inner approximation of φ(.) is If z / ∈ Conv(z 0 , . . . , z k ), then let φ k (z) := +∞. A polyhedral model of problem (6) is We assume that (14) is feasible, i.e., its optimum is finite. This can be ensured by proper selection of the initial z 0 , . . . , z k points. With some laxity, I will call a vector z ∈ IR n an optimal solution of (14), if it can be extended into a strict-sense optimal solution (z, z , x).
As φ k (.) is a cutting-plane model of φ (.), the following problem is a polyhedral model of problem (11): At the same time, (16) is the linear programming dual of (14). Let z denote an optimal solution of the primal model problem (14), and let u denote an optimal solution of the dual model problem (16)-existing due to our assumption concerning the feasibility of (14). The following observation was proved in Fábián et al. (2018).

A simple solution method
The convex problem (6) can be solved through approximation. The approximating problem (14) can be formulated as a linear programming problem. The idea is to progressively improve the approximation by adding further approximation points, a traditional and widely used approach.
In Fábián et al. (2018), we applied a straightforward and unsophisticated method for the construction of approximation points. That is an iterative method. The initialization will be presented in Sect. 3.1. An iteration consist in building and solving the model problem as described in Sect. 3.2, and finding an improving approximation point as described in Sect. 3.3.

Initialization of the model
We initialized the model function (13) with n + 1 points (n being the dimension of the distribution). The 2-dimensional case is illustrated in Fig. 1. The shaded area represents the bounded box Z outside which the probability weight can be ignored.  The 'most positive' vertex of Z was selected as z 0 , and z 1 , . . . , z n were respective points from the edges adjoining z 0 . In this paper we follow this way of initialization, further adding the vector of Assumption 1, as z n+1 =ž. This of course always results in a feasible primal model problem (14).

A linear programming formulation of the model problem
Taking into account the model function definition (13), the primal model problem (14) can obviously be formulated as follows.
The variable z of (14) is represented here as the convex combination λ i z i . Hence the splitting constraint z − z = 0 of (14) takes the form λ i z i − z = 0. The columns corresponding to the approximation points z 0 , . . . , z k are involved only in the convexity constraint λ i = 1 and in the splitting constraint. The dual variable (shadow price) belonging to the convexity constraint will be denoted by ϑ, and the dual vector belonging to the splitting constraint will be denoted by u. (These are shown above, attached to the respective constraints with ⊥ signs.) The linear programming problem (17) is easy to solve (and an optimal solution always exists, due to the initialization described above.) Let ( λ 0 , . . . , λ k ) denote a part of an optimal solution of (17). Then an optimal solution z of (14) can be constructed in the form

Column generation
We aim to find a new approximation point z k+1 whose addition to the model function (13) will result in an improvement in the optimal objective value of (17). Let ϑ denote the optimal dual variable belonging to the convexity constraint in (17). Similarly, let u denote the optimal dual vector belonging to the splitting constraint.
Given a vector z ∈ IR n , we can add the corresponding column (1, z, 0) to the linear programming problem (17), with objective component φ(z). This is an improving column if its reduced cost is positive; formally, if (z) > 0 holds for A vector yielding a good reduced cost can be found by approximately solving the column generating problem In the context of the simplex method, the optimal solution of this problem corresponds to the Markowitz column selection rule. This is a classic column selection rule that is still considered fairly efficient and is widely used.

Our experience with the iterative method
As we have seen, an improving column for (17) can be found by maximizing the reduced cost function (z). In the computational study reported in Fábián et al. (2018), we applied a steepest descent method to − (z), a natural starting point being z. In the first round of test runs, we solved the column generating problems to near-optimality. However, we found the computational effort excessive. (Most of the computational effort was spent in the computation of probabilistic function gradients.) Hence we decided to perform just a few line searches in each column generating problem (20). This heuristic procedure never resulted in any substantial increase in the number of the iterations needed to solve a test problem. We actually found that a single line search performs well in practice.

Convergence arguments for an idealized setting
We thought that the experimental efficiency of the above mentioned simplified scheme needs explanation. In Fábián et al. (2018) we explained it in an idealized setting, considering a globally well-conditioned objective function. Let f : IR n → IR such that the following assumption holds.
Assumption 4 (Bounded Hessians) The function f (z) is twice continuously differentiable, and real numbers α, ω (0 < α ≤ ω) exist such that Here ∇ 2 f (z) is the Hessian matrix, I is the identity matrix, and the relation U V between matrices means that V − U is positive semidefinite.
Let me note that lower-bounded Hessians imply strong convexity -this is an easy consequence of Taylor's theorem. The following well-known theorem can be found e.g., in Chapter 8.6 of Luenberger and Ye (2008). (Ruszczyński 2006 in Chapter 5.3.5, Theorem 5.7 presents a slightly different form.) Theorem 5 Let Assumption 4 hold. We minimize f (z) over IR n using a steepest descent method, starting from a point z 0 . Let z 1 , . . . , z j , . . . denote the iterates obtained by applying exact line search at each step. Then we have where F = min z f (z).
If our objective function φ(z) should have bounded Hessians, then the function − (z) of (19) would inherit this property. (The column generating subproblem can be formulated as minimization of − (z).) In Fábián et al. (2018), we applied Theorem 5 to is a straight consequence of the theorem. In view of the Markowitz rule mentioned above, we find a fairly good improving vector in the column generation scheme in a very few iterations, provided the condition number α/ω is not extremely bad. The efficiency of the whole iterative procedure can then be explained by the practical efficiency of the simplex method.

Bounds on optimal solutions
Our first aim is to construct a compact set that contains every optimal solution z of any model problem (14). An obvious choice would be the box Z. But, for reasons that will become apparent shortly, we need a set well inside Z.
The constraints z = z ≤ T x and x ∈ X appear in the model problem (14). Hence the set { T x | x ∈ X } + N , where N denotes the negative (closed) orthant in IR n , contains every optimal solution z of (14).
As we includež of Assumption 1 among the initial test points, every optimal solution satisfies also φ(z) ≤ φ(ž). We have set the regularization term in the objective (5) in such a way that the variation of ρ 2 z 2 on Z is small. This results a reasonably large lower bound for the probability belonging to the optimal solution, like F(z) ≥ 0.49. It follows that the set contains every optimal solution z of any model problem (14). In a similar manner, it can be shown that z ∈ O z with the optimal solution z of the primal problem (6). The set O z is compact. Indeed, let us consider the two intersecting sets on the righthand side of (23): the former has the negative orthant as its recession cone, while the recession cone of the latter is the positive orthant. Now we are going to construct an upper bound for the subgradients of the model function on O z . The idea is to have a wide border around O z , on which border the model function is finite valued.
Initialization of the approximation points was described in Sect. 3.1, and illustrated in Fig. 1. (The 'most positive' vertex of Z is selected as z 0 , and z 1 , . . . , z n are respective points from the edges adjoining z 0 .) Let S = Conv(z 0 , . . . , z n ) denote the convex hull of these n + 1 test points. As the box Z can be extended arbitrarily, we may assume not only that O z ⊂ S holds, but even that where S η denotes the simplex obtained by shrinking S towards its barycenter, the ratio of decrease being 1 to η. In what follows, we work with a fixed value η satisfying (24). The proof is straightforward but somewhat technical, hence I include it in the "Appendix, section A.1".

On the efficiency of a single line search
Let z denote an optimal solution of the model problem, constructed in the form (18). Let moreover u be the optimal dual vector belonging to the splitting constraint in the linear programming form (17) of the model problem. According to the column generating procedure of Sect. 3.3, the next test point z k+1 = z is obtained by approximate maximization of u T z − φ(z).
The column generating subproblem is conveniently reformulated as approximate minimization of the function f (z) = φ(z) − u T z, and a steepest descent method can be applied from the convenient starting point z. In accordance with our experience described in Sect. 3.4, we are actually going to work out proofs for a single line search performed in the opposite direction of the gradient g = ∇ f (z) = ∇φ(z) − u.
We have z ∈ O z and g ≤ 2 u as shown in Sect. 4. Hence any line search is performed on a ray of the form Due to the regularizing term in the objective (5), all eigenvalues of the Hessians are increased by ρ, hence ∇ 2 φ(z) ρ I (z ∈ IR n ) holds. Though there exists no finite upper bound on the eigenvalues, it turns out that a local version of Assumption 4 is sufficient.
Given z ∈ IR n , let ω(z) denote the largest eigenvalue of ∇ 2 f (z). Continuity of this function is a straight consequence of a theorem of Ostrowski that I cite as Theorem 17 in "Appendix" B. Since the set O z is compact, it follows that This again is a compact set, hence is again finite. As O ⊇ O z , it follows that ≥ .
Proposition 7 With z found by performing a single line search, we have where F = min z f (z).
The proof will be an adaptation of that of Theorem 5, as related in Chapter 8.6 of Luenberger and Ye (2008). We restrict our investigation to O.

Proof of Proposition 7
Due to the above construction, we have z − t g ∈ O for t such that 0 ≤ t ≤ 1/ . Hence ∇ 2 φ(z) I holds for z = z − t g with these t values. It follows that holds (a consequence of Taylor's theorem). We consider the respective minima in t ∈ IR separately of the two sides. The right-hand side is a quadratic expression, yielding minimum at t = 1/ . (Note that 1/ ≤ 1/ .) It follows that From ∇ 2 φ(z) ρ I (z ∈ IR n ), it follows in a similar manner that The right-hand side expression is a quadratic function of z, which yields its minimum at z = z − 1 ρ g. Hence The proof is concluded by simple transformations of (28) and (30). The improving column z is found by solving the minimization problem in the left-hand side of (28). Subtracting F from both sides of (28), and applying (30) to underestimate g 2 , we get (27).

Dual viewpoint: an approximate cutting-plane method
The method to be described in this section is not new. It is just the simple solution method of Sect. 3, described from a new viewpoint that will enable us to apply some classic tools in the convergence proof.
The column generation procedure, as described in Sects. 3.2 and 3.3, yields an inner approximation of the epigraph of the objective function φ(z). Looking at the same procedure from a dual viewpoint we can see a cutting-plane method. This relationship between the primal and dual approaches is well known, see, e.g., Frangioni (2002Frangioni ( , 2018, but it is immediately apparent in the present case: (16) is clearly a cuttingplane model of (11). From the dual point of view, the procedure yields an outer approximation of the epigraph of the conjugate function φ (u).
As the linear programming problem (17) is just a re-formulation of the primal model problem (14), it follows that the dual of (17) is equivalent to the dual model problem (16). Specifically, let u denote an optimal dual vector belonging to the splitting constraint in (17). Then, at the same time, u is an optimal solution of the dual model problem (16).
If we wish to apply a cutting-plane method for the minimization of a convex function ϕ : IR n → IR, we must be able to construct an approximate support function at any given point u ∈ IR n . This is a linear function (u) such that (u) ≤ ϕ(u) (u ∈ IR n ) holds, and the difference ϕ( u) − ( u) is under control. If the difference is 0 then (u) is an exact support function to ϕ(u) at u.

Problem specification and dual interpretation of former results
Introducing the function d(u) := φ (u) + ν(u), the dual problem (11) can be solved in the form The kth model function is (15). The dual model problem (16) can be solved in the form Let u denote an optimal solution of the model problem (32). Moreover let with the bound u of Observation 6. This is a compact ball that contains the optimal solution u of the model problem (32), as well as the optimal solution u of the convex problem (31). An approximate support function to d(u) at u is constructed in the form where the right-hand-side functions are separate support functions to φ (u) and ν(u), respectively. As for ν(u), we can construct an exact support function (u) by solving the problem (10 : u = u ). This can be formulated as a linear programming problem whose optimal dual variables will give a subgradient at u.
The approximate support function to φ (u) is constructed in the form with an appropriate vector z . (Note that (.) ≤ φ (.) holds by the definition of the conjugate function.) An exact support function at u could be obtained by setting z to be the exact maximizer of u T z−φ(z). This is just the column generating problem (20). An approximate solution of the column generating problem yields an approximate support function. Approximate solution of the column generation problem was described in Sect. 3. Let z denote a primal optimal solution, constructed in the form (18). Let moreover z be found by a single line search starting from z. We analyzed this procedure in Sect. 5. Translating Proposition 7 to the present setting, we get

Corollary 8 The support function (35) satisfies
with the constant θ = ρ/ . Here ρ is the regularization constant of (5), and defined in (26). Of course we have 0 < θ < 1, and it is independent of the iteration count k.

Proof of Corollary 8 We start with formulating (27) in terms of φ(z) and φ (u).
By the definition f (z) = φ(z) − u T z, we have F = −φ (u). Substituting these, we get A simple transformation results in By Observation 3, we have u T z = φ k (u) + φ k (z). Substituting this in the bracketed expression, and taking into account The support function (34) obviously inherits the property of Corollary 8. That is, holds with the same constant θ = ρ/ .

Convergence
A more detailed notation will be needed, including indexing of the iterates and support functions in the cutting-plane scheme. Let u 1 , . . . , u k denote the known iterates. Let i (u) (i = 1, . . . , k) denote approximate support functions at the respective iterates. These take the form i (u) = i (u) + i (u), where i (u) is an approximate support function to φ (u), and i (u) is an exact support function to ν(u), at u i . The model function d k (u) is the upper cover of the support functions i (u) (i = 1, . . . , k), and the next iterate u k+1 is a minimizer of d k (u). At u k+1 , again, we construct a support function to d(u). By (39), holds with the constant θ = ρ/ .

Theorem 9
We have seen that the compact set O u of (33) contains all the iterates u k . Moreover, all the approximate supporting functions satisfy (40) with the constant θ independent of k. The approximate cutting plane method generates a sequence of models and points satisfying where D k and D are the minima of d k (u) and d(u), respectively.
A nice elementary proof of the convergence of the exact cutting plane method can be found in Ruszczyński (2006), Theorem 7.7. The following proof uses some of the ideas presented there. To handle inexactness, I'll need a well-known though strong theorem from convex analysis. Using that, the proof becomes surprisingly simple.

Proof of Theorem 9
exists and is finite. d ∞ (u) is a convex function and the sequence of the model functions converges uniformly on the compact set O u -see, e.g., Theorem 10.8 in Rockafellar's book Rockafellar (1970). We have D k = min u∈IR n d k (u) = d k (u k+1 ). Moreover, let D ∞ = min u∈IR n d ∞ (u). These are finite and D k ≤ D ∞ ≤ D. With any index k, we have Let us now take into account the uniform convergence of the sequence of the model functions. Given > 0, there exist a finite number K such that |d k (u) − d ∞ (u)| ≤ holds for k ≥ K , u ∈ O u . Specifically, this holds with u = u k+1 , showing that the difference between the left-hand side and the right-hand side of (42) is small for large enough k. It follows that Now let (u k i +1 ) be a convergent subsequence of (u k+1 ), and let u := lim i→∞ u k i +1 . We have due to the continuity of d(u).
With any index k, we have d ∞ (u k+1 ) ≥ k+1 (u k+1 ). Taking into account (40), we get Here we have d ∞ (u k+1 ) → D ∞ and D k → D ∞ as k → ∞, according to (43). As for the remaining expression d(u k+1 ), we have (44). Taking limits in (45), we get The proof can be completed in an indirect fashion. Assume that (41) does not hold, i.e., there exists some > 0 and a subsequence (u k j ) such that |d(u k j ) − D| > for every j. But due to the compactness of the domain O u , there exists a convergent subsequence of (u k j ), which is a contradiction with the argument above.
Using the notation of Sect. 2, the z-part of the optimal solution of the true problem (6) will be denoted by z , and the optimal solution of the true dual problem (11) will be denoted by u .

Corollary 10
We have u k → u and z k → z .
Proof Strict convexity of φ (u) results in strict convexity of d(u). The latter, combined with the second statement of Theorem 9, yields u k → u .
Using the notation of Theorem 9, −D k is the maximum of the dual model problem (16) which in turn equals the minimum of the primal model problem (14), the latter minimum being φ k (z k+1 ). Similarly, −D is the maximum of the true dual problem (11) which in turn equals the minimum of the true primal problem (6), the latter minimum being φ(z ). As we mentioned in Sect. 2, a strong duality relationship holds between the true primal and the true dual problem.

Line search revisited
In this section we review the efficiency of applying a single line search in the solution of the column generating problem. We assume that the probabilistic objective function is locally well-conditioned. (I am going to make a case for this assumption in Sect. 9.) We are going to consider the convergence results of Sect. 6 in the primal setting. They will enable us to focus on a locality where the objective function is well-conditioned. Our aim is to develop a practicable version of Proposition 7.
Assumption 11 (Locally well-conditioned objective) The function φ(z) is wellconditioned in the optimal solution z of the true problem (6). Formally, α /ω 0 holds, where ω and α denotes the smallest and the largest eigenvalue, respectively, of the Hessian matrix ∇ 2 φ(z ).
Let α(z) and ω(z) denote the smallest and the largest eigenvalue, respectively, of the Hessian matrix ∇ 2 φ(z) as a function of z ∈ IR n . Continuity of both functions follows from a theorem of Ostrowski that I cite in "Appendix B".
Let the real numbers α − , ω + be such that 0 < α − < α and ω < ω + . (The idea is to have α − near α , and ω + near ω .) From the continuity of the functions α(z) and ω(z), it follows that there exists a ball B r around z with radius r > 0, such that α(z) ≥ α − and ω(z) ≤ ω + holds for every z ∈ B r , and hence Let us consider the convergence results of Sect. 6 in the primal setting. I start by reviewing the notation. z k denotes an optimal solution of the (k − 1)th primal model problem, and u k denotes an optimal solution of the (k − 1)th dual model problem, for k = 1, 2, . . .. In the column generating subproblem, an improving column is found by approximate minimization of the function f k (z) = φ(z) − u T k z. Specifically, z k denotes the point found by performing a single line search from z k , in the opposite direction of g k = ∇ f k (z k ).

Observation 12
We have g k → 0.

Proof We have
the latter equality being a consequence of Observation 2. In the right-hand side, we have u k → u by Corollary 10, and ∇φ(z k ) → ∇φ(z ) by Corollary 10 and φ(z) being continuously differentiable.
For technical reasons, we'll need some further notation. Let z • k be the exact minimizer of the function f k (z). (This is never computed in the course of the column generation scheme.)

Observation 13
We have z • k → z . The proof is simple but rather technical, hence I include it in the "Appendix, section A.2".

Focusing on a locality
Given the radius r > 0 of the ball B r of (47), there exists a natural number K r such that The above statement holds because we have z k → z , g k → 0, z • k → z by Corollary 10 and Observations 12 and 13.
In the remainder of this section, we consider a fixed k ≥ K r . For the sake of simplicity, we are going to omit the lower index k. Let me sum up this simplified notation. An improving column is found by approximate minimization of the function f (z) = φ(z) − u T z. Specifically, z denotes the point found by performing a single line search from z, in the opposite direction of g = ∇ f (z). The exact minimizer of f (z) is denoted by z • .
where F = min z f (z).
The proof will be an adaptation of the proof of Proposition 7. It turns out that we only need ∇ 2 φ(z)

Proof of Theorem 14
Due to the first and the second assumption in (48), we have z − t g ∈ B r for t ∈ [0, 1/ω + ]. Let T = 1/ω + . According to (47) holds (a consequence of Taylor's theorem). We consider the respective minima in t ∈ IR separately of the two sides. The right-hand side is a quadratic expression, yielding minimum at t = 1/ω + ∈ [0, T ]. It follows that Coming to lower bounds, we have [z, z • ] ⊂ B r according to the first and the third assumption in (48). By (47), we have ∇ 2 φ(z) α − I (z ∈ [z, z • ]). From this, it follows by Taylor's theorem that The left-hand side is F by definition. A lower estimate of the right-hand side is obtained by taking its minimum in z • . The minimum is easily computed as the right-hand side expression is a quadratic function of z • . We get The proof is concluded by simple transformations of (50) and (52). The improving column z is found by solving the minimization problem in the left-hand side of (50). Subtracting F from both sides of (50), and applying (52) to underestimate g 2 , we get (49).

Arguments based on the efficiency of the simplex method
We are going to interpret the column generation scheme of Sect. 3 as the solution process, by the simplex method, of a single linear programming problem. Specifically, the procedure for finding improving columns, described in Sect. 3.3, will be interpreted as a column selection procedure in the context of the simplex method. We assume a locally well-conditioned φ(z), and are going to the apply the results of Sect. 7. As the column generation scheme progresses, we are bound to arrive at a neighborhood on which the objective function is well-conditioned. For a more precise exposition, let me include a summary of the first half of Sect. 7. Let the real numbers α − , ω + be such that 0 < α − < α and ω < ω + . There exists a ball B r around z with radius r > 0, such that (47) holds. For the radius r , in turn, there exists a natural number K r such that (48) holds.
From that point on (i.e., in any iteration k > K r ), our column selection procedure will conform to a practical simplex rule, as I am going to show in the remainder of this section. The practical efficiency of the simplex method will then explain the behavior of our method.
In accordance with former notation, let z denote the vector constructed in the form of (18), using optimal weights from the linear programming problem (17). Let moreover ϑ and u denote the optimal dual variable and the optimal dual vector belonging to the convexity constraint and the splitting constraint, respectively, in (17).
In our column generation scheme, an improving column is found by approximate minimization of the function f (z) = φ(z) − u T z. This is of course the same problem as the approximate minimization of − (z), the negative of the reduced cost function whose definition I copy here for convenience: Let z denote the point found by performing a single line search from z, in the opposite direction of g = ∇φ(z) − u. Theorem 14 is applicable to f (z) = − (z), and (49) transforms to where R = max z (z).
In Fábián et al. (2018), we have shown that (z) ≥ 0 holds (as an obvious consequence of the convexity of φ(z).) Hence the right-hand side of (53) is majorated by In the simplex context, (54) means that the vector z has a substantial reduced cost in the linear programming master problem, hence we may expect a substantial improvement from including it as a new column. Equation (54) is a realistic counterpart of (22) that was obtained in an idealized setting. More importantly, (54) is an approximate version of the Markowitz column selection rule, well-known in the context of the simplex method.

Arguments based on experimental results
A long-standing rule of thumb states that, in solving a linear programming problem having M rows and N columns (with N M), the number of pivot steps needed is proportional with M. As confirmation, let me cite from the computational study of Cutler and Wolfe (1963): 'It would appear that the rule of "2M iterations" from folklore is fairly good ...' After a more detailed examination, they observe: '... an estimate of between M and 3M iterations will almost always be correct.' In our case, the linear programming model problem (17) was built using a finite set of test points z 0 , z 1 , . . . , z k . Let us include every vector z ∈ IR n among the initial test points; the resulting infinite problem will then be equivalent to the convex problem (6). Though the number of the columns is infinite, the construction (18) of the optimal solution z is still usable, because the number of the rows and hence the cardinality of the bases remains finite.
Our column generation scheme is equivalent to a simplex method applied to this infinite linear programming problem. As we have seen in the previous section, the column selection rule is such that the incoming column x satisfies (54) from iteration K r . From this point on, we can expect the simplex method to be fairly efficient.

Arguments based on theoretical results
Most of the theoretical results concerning the behavior of the simplex method are incapable of explaining its practical efficiency. Specifically, results based on worstcase analysis are generally very far from experimental results. The results of Karl Heinz Borgwardt are fundamental to bridging this gap. He studies average behavior. A further important contribution is Spielman and Teng (2004). The results of Dempster and Merkovsky (1995) are related, from a dual viewpoint. They presented a geometrically convergent version of the cutting-plane method, for a continuously differentiable, strictly convex case.
The review paper Borgwardt (2009) deals with rudimentary problem types of maximizing a linear function subject to Ax ≤ b or Ax ≤ b, x ≥ 0 or Ax = b, x ≥ 0. The matrix has M rows and N columns. The matrix and the objective vectors are populated with random variables of special joint distributions. Borgwardt focuses on the rotation symmetry model: b = 1 and the rows of A and the objective vectors are distributed on IR n \ {0} independently, identically and symmetrically under rotations. Taking into account several rotation-symmetric distributions and several special simplex-type algorithms, he examines the expected number of the pivot steps required. In Borgwardt (2009), he summarizes several related bounds for problems with M ≥ N , and also asymptotic results for N fixed and M → ∞. These are typically of the form with a moderate ν and a constant C. Let us play, for a moment, with the idea that bounds of the type (55) grasp the average behavior of the simplex method in a general sense.
Let us make a minor extension to the column generation scheme described in Sect. 3. There exists a bounded box Z and a natural number K such that Z contains every possible improving vector z that we may obtain, in iterations after K , by approximately solving the column generating problem. In the following arguments, we need to know Z explicitly, but need not know K . The construction of Z is easy but somewhat technical, hence I include it in the "Appendix, section A.3".
We could include every z ∈ Z among the initial test points of the linear programming model problem (17), like we did in the previous section. But here we wish to avoid an infinite problem. Though we can settle for a closely approximating problem that has a finite (huge) number of columns. This is easily constructed. Given a prescribed precision ε > 0, let us divide the box Z into small boxes whose edges are not longer then ε. Let us include the center of each cell among the initial test points. The number of the initial test points, considered as a function of the prescribed accuracy, is on the order of (1/ε) n .
We can now make a slight modification in the column generation process. In a certain iteration, let z denote the vector obtained by performing a line search. If this falls into the box Z , then we add no new test point, because there is a cell center close enough among the initial test points. In case z / ∈ Z , we add z as a new test point, like in the original scheme. We know that there exists a natural number K such that no new columns are added after iteration K . I.e., we have an unchanging master problem from that point.
For iterations k > K , this modified column generation scheme can be modelled as the solution, by the simplex method, of the unchanging master problem. Let z denote the vector obtained by a line search in the column generation scheme. In the simplex method, we can then select an existing column (i.e., cell center) closest to z . For iterations k > max{K , K r }, this is a practicable selection rule in the simplex context. Let us consider the number of the iterations needed to finish from here, as a function of the prescribed accuracy ε.
Let us assume that a bound of the form (55) applies in the present setting. In the dual of our huge linear programming problem, the number M of the rows is on the order of (1/ε) n . As for the number N of columns, we have N > n (and N is of a moderate magnitude.) Hence we may expect the iteration count to be on the order of 1/ε, that is by no means discouraging.

Illustration of the locally well-conditioned character of the probabilistic objective function
In Sect. 7, we assumed that the objective function is well-conditioned in the optimal solution, i.e., α /ω 0 holds with the smallest and largest eigenvalue, respectively, of the Hessian matrix ∇ 2 φ(z ) (Assumption 11). Let me make a case for this assumption by a simple illustration. Figure 2 shows contour lines of a two-dimensional standard normal distribution function F(z), where the covariance between the marginals is 0.5. Fig. 2 Contour lines of a two-dimensional normal distribution function F(z), and optimal solutions z , z . From top right, the contour lines belong to the probabilities 0.99, 0.95, and 0.90, respectively. The contour lines tend to be straight as we move away from the diagonal z 1 = z 2 Let us consider a probability maximization problem of the form (3), with the unregularized objective function φ(z) = − log F(z). The feasible region for the z vector has the negative orthant for regression cone.
Let z in Fig. 2 denote an optimal solution, with F(z ) = 0.9. Let z denote another point on the same contour line, i.e., we have F(z ) = 0.9 as well. As the section of the contour line between z and z is practically straight and horizontal, we have z ≤ z , implying that z is a feasible solution. Hence z is also an optimal solution. This shows that an optimal solution can be located 'near' the curved part of a contour line, i.e., 'near' the diagonal z 1 = z 2 .
Contour lines of the eigenvalue functions α(z) and ω(z) for z ∈ [−6, +6] 2 are shown in Fig. 3, taken from Fábián et al. (2018). α(z) and ω(z) are the smaller and the larger eigenvalue, respectively, of the Hessian matrix ∇ 2 φ(z) = ∇ 2 [− log F(z)]. Fig. 3 Contour lines of the smaller and the larger eigenvalue, respectively, of ∇ 2 [− log F(z)] as a function of z, for a two-dimensional normal distribution function F(z). In the left-hand figure (smaller eigenvalue), contour lines from top right are 1e − 5, 1e − 4, 1e − 3, 1e − 2. In the area not shaded gray, the smaller eigenvalue is above 1e − 5. In the right-hand figure (larger eigenvalue), contour lines from top right are 1, 1.2, 1.4, 1.6. In the area not shaded gray, the larger eigenvalue is below 1.6 directly yield a practicable efficiency estimate. However, they enable us to focus on a neighborhood of the optimal solution (where we may depend on the locally wellconditioned character of the objective function.) In Sect. 7, we argued that, starting with iteration K r , the process remains in a well-defined neighborhood. The threshold K r may, in theory, be large. But in our experiments, a rough approximate solution was always found relatively quickly, and the main computational effort was devoted to refine the approximation. The diagrams included in Fábián et al. (2019) show similar behavior. In that paper, we developed a randomized version of the method, and performed an experimental study. We always found a rough approximate solution with a relatively small effort.
In (5), we regularized the probabilistic objective function by adding the term ρ 2 z 2 with an appropriate ρ > 0. In the present paper, we worked with a fixed ρ. However, from a practical point of view, it makes sense to start with a large ρ, and decrease it gradually as the optimal solution is approached. The resulting procedure, when observed from a dual viewpoint, shows an interesting relationship with bundle methods. In the context of Sect. 6, the regularization (5) can be interpreted as a smoothing operation applied to the conjugate function φ (u) in (31). This smoothing operation improves the conditioning of φ (u). (On the inverese relationship between Hessian matrices of φ(z) and φ (u), see, e.g., Gorni 1991.) material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

A.1 Bounding the subgradients of the model functions
Before proving Observation 6, we need some preliminary observations.
It is easily seen that the shrunk simplex S η of (24) can be obtained in the form Due to η < 1, the common lower bound of the weights is positive. Let R := { ∈ IR n | = 1 } denote the unit sphere.

Lemma 15
There exists a positive constant such that max 0≤i≤n T (z i − z) ≥ holds for any z ∈ S η and ∈ R.
Proof Let z = n i=0 λ i z i be the representation according to (56). We have because λ i (z i − z) = 0 holds by the definition of the weights λ i . As the simplex S is non-degenerate, we cannot have T (z i − z) = 0 for all i. Moreover, as the weights λ i are all positive, a nonzero T (z i − z) results in a nonzero addend in (57). Hence there exists i such that T (z i − z) > 0. It means that the function is positive-valued on the compact set R × S η . Since the function is also continuous, it obtains its infimum by Weierstrass' extreme value theorem. Let > 0 denote the minimum.

Proof of Observation 6
Given z ∈ O z , we have z ∈ S η according to (24). Let φ denote the current model function value at z. Moreover, let u be a subgradient of the current model function at z. We put it in the form u = t with some t ≥ 0 and ∈ R, and will find a bound on t = u . By the definition of the subgradient, we get For a practicable construction, let us consider k ≥ K r of (48). Then we can focus on the locality B r of (47), where ∇ 2 φ(z) α − I (z ∈ B r ) holds. In accordance with Remark 16, we can substitute α − for ρ in (64).

B On the convergence of eigenvalues
The following theorem is due to Ostrowski. (I learned it from the textbook Szidarovszky 1974 that cited it in Chapter 7.4.) Theorem 17 (from Appendix K in Ostrowski 1960) Given matrices A, B ∈ IR n×n having components a i j and b i j (1 ≤ i, j ≤ n), respectively, let For any eigenvalue λ of A, there exists an eigenvalue μ of B such that |λ − μ| ≤ (n + 2)Mδ 1/n .