1 Introduction

We deal with approximate methods for the solution of smooth convex programming problems. First, we consider minimization over a polyhedron:

$$\begin{aligned} \min \; \phi ( T {\varvec{x}} ) \quad \hbox {subject to}\quad A {\varvec{x}} \le {\varvec{b}}, \end{aligned}$$
(1)

where \( \phi : \mathrm{I}\mathrm{R}^n \rightarrow \mathrm{I}\mathrm{R}\) is a convex function whose gradient computation is demanding. The vectors are \( {\varvec{x}} \in \mathrm{I}\mathrm{R}^m,\; {\varvec{b}} \in \mathrm{I}\mathrm{R}^r \), and the matrices T and A are of sizes \( n \times m \) and \( r \times m \), respectively. For the sake of simplicity we assume that the feasible domain is not empty and is bounded. We then consider the minimization of a linear cost function subject to a difficult convex constraint:

$$\begin{aligned} \min \; {\varvec{c}}^T {\varvec{x}} \quad \hbox {subject to}\quad \breve{A} {\varvec{x}} \le \breve{{\varvec{b}}},\quad \phi ( T {\varvec{x}} ) \le \pi , \end{aligned}$$
(2)

where the vectors \( {\varvec{c}}, \breve{{\varvec{b}}} \) and the matrix \(\breve{A} \) have compatible sizes, and \( \pi \) is a given number. The approach we discuss extends easily from linear cost functions to convex ones.

The above forms are motivated by the classic probability maximization and probabilistic constrained problems, where \(\phi ({\varvec{z}} ) = -\log F( {\varvec{z}} ) \) with a logconcave distribution function \( F( {\varvec{z}} ) \). We briefly overview a couple of closely related probabilistic programming approaches. For a broader survey, see Fábián et al. (2018). Given a distribution and a number \( p\; ( 0< p < 1 ) \), a probabilistic constraint confines search to the level set \( {{{\mathcal {L}}}}( F, p ) = \{\, {\varvec{z}} \,|\, F( {\varvec{z}} ) \ge p \,\} \) of the distribution function \( F( {\varvec{z}} ) \). Prékopa (1990) initiated a novel solution approach by introducing the concept of p-efficient points. The point \( {\varvec{z}} \) is p-efficient if \( F( {\varvec{z}} ) \ge p \) and there exists no \( {\varvec{z}}^{\prime } \) such that \( {\varvec{z}}^{\prime } \le {\varvec{z}},\, {\varvec{z}}^{\prime } \not = {\varvec{z}},\, F( {\varvec{z}}^{\prime } ) \ge p \). Prékopa et al. (1998) considered problems with random parameters having a discrete finite distribution. They began by enumerating p-efficient points and, based on them, built a convex relaxation of the problem.

Dentcheva et al. (2000) formulated the probabilistic constraint in a split form: \( T{\varvec{x}} = {\varvec{z}} \) with \( {\varvec{z}} \in {{{\mathcal {L}}}}( F, p ) \); and constructed a Lagrangian dual by relaxing the constraint \( T{\varvec{x}} = {\varvec{z}} \). The resulting dual functional is the sum of the respective optimal objective values of two simpler problems. The first auxiliary problem is a linear programming problem, and the second one is the minimization of a linear function over the level set \({{{\mathcal {L}}}}( F, p ) \). Based on this decomposition, the authors developed a method, called cone generation, that finds new p-efficient points in the course of the optimization process.

As minimization over the level set \( {{\mathcal {L}}}( F, p ) \) entails a substantial computational effort, the master part of the decomposition framework should succeed with as few p-efficient points as possible. Efficient solution methods were developed by Dentcheva et al. (2004) and Dentcheva and Martinez (2013); the latter applies regularization to the master problem. Approximate minimization over the level set \( {{\mathcal {L}}}( F, p ) \) is another enhancement. Dentcheva et al. (2004) constructed approximate p-efficient points through approximating the original distribution by a discrete one. More recently, van Ackooij et al. (2017) employed a special bundle-type method for the solution of the master problem, based on the on-demand accuracy approach of de Oliveira and Sagastizábal (2014). This means working with inexact data and regulating accuracy in the course of the optimization. Approximate p-efficient points with on-demand accuracy were generated employing the integer programming approach of Luedtke et al. (2010).

Our former paper Fábián et al. (2018) focussed on probability maximization, and proposed a polyhedral approximation of the epigraph of the probabilistic function. This approach is analogous to the use of p-efficient points (and was in fact motivated by that concept). The dual function is constructed and decomposed in the manner of Dentcheva et al. (2000), but the nonlinear subproblem is easier. In Dentcheva et al. (2000), finding a new p-efficient point amounts to minimization over the level set \({{\mathcal {L}}}(F, p)\). In contrast, a new approximation point in Fábián et al. (2018) is found by unconstrained minimization, with considerably less computational effort. Moreover, a practical approximation scheme was developed in the latter paper: instead of solving each unconstrained subproblem exactly, a single line search is sufficient. The approach is easy to implement and tolerates noise in gradient computation.

In the present paper, we extend the inner approximation approach of Fábián et al. (2018) to a randomized method handling gradient estimates. The motivation is our experience reported in that former paper: when solving probability maximization problems, most of the computational effort was spent on computing gradients. (Computing a single component of the gradient vector required an effort comparable to that of computing a distribution function value.) We conclude that easily computable estimates for the gradients are well worth using, even if the iteration count increases due to estimation errors.

The paper is organized as follows. In Sect. 2 we work in an idealized setting, under the following assumptions:

Assumption 1

The function \( \phi ( {\varvec{z}} ) \) is twice continuously differentiable, and real numbers \( \alpha , \omega \; ( 0 < \alpha \le \omega ) \) are known such that

$$\begin{aligned} \alpha I \;\preceq \; \nabla ^2 \phi ( {\varvec{z}} ) \;\preceq \; \omega I \quad ( {\varvec{z}} \in \mathrm{I}\mathrm{R}^n ). \end{aligned}$$

Here \( \nabla ^2 \phi ( {\varvec{z}} ) \) is the Hessian matrix, I is the identity matrix, and the relation \( U \preceq V \) between matrices means that \( V - U \) is positive semidefinite.

Assumption 2

Given \( {\varvec{z}} \in \mathrm{I}\mathrm{R}^n \), the function value \( \phi ( {\varvec{z}} ) \) and the gradient vector \( \nabla \phi ( {\varvec{z}} ) \) can be computed exactly.

We present a brief overview of the models and of the column generation approach proposed in Fábián et al. (2018) for the unconstrained problem (1). The epigraph of the convex function \( \phi ( {\varvec{z}} ) \) is approximated by the convex hull of finitely many points (obtained by evaluating the function at the known iterates). New points (columns in a model problem) are generated by unconstrained minimization of a probabilistic function. The column generation problem is solved with a gradient descent method. Due to Assumption 1, an approximate solution is sufficient, taking a limited number of descent steps.

In Sect. 3 we extend the method to gradient estimates, replacing Assumption 2 with

Assumption 3

Given \( {\varvec{z}}, {\varvec{u}} \in \mathrm{I}\mathrm{R}^n \), the function value \( \phi ( {\varvec{z}} ) \) can be computed exactly, and the norm \( \Vert \nabla \phi ( {\varvec{z}} ) - {\varvec{u}} \Vert \) can be estimated with a pre-defined relative accuracy. Moreover, realizations of an unbiased stochastic estimate \( {\varvec{G}} \) of the gradient vector \( \nabla \phi ( {\varvec{z}} ) \) can be constructed such that \( \hbox {E}( \Vert {\varvec{G}} - \nabla \phi ( {\varvec{z}} ) \Vert ^2 ) \) remains below a pre-defined tolerance. (Higher accuracy in case of norm estimation, and tighter tolerance on variance entail larger computational effort.)

We develop a randomized version of the column generation method, and present reliability considerations based on Assumption 1.

In Sect. 4 we deal with the convex constrained problem (2), still in the idealized setting of Assumption 1. We consider a parametric version of an unconstrained problem of the form (1). We present an approximation scheme for the constrained problem that requires the approximate solution of a short sequence of unconstrained problems. Initial problems in this sequence are solved with a large stopping tolerance, and the accuracy is gradually increased. This approximation scheme is first developed in a deterministic form (based on the deterministic method of Sect. 2), and then extended to admit the randomized method of Sect. 3. Reliability considerations are presented for the randomized scheme.

The approach is adapted to probabilistic programming problems in Sect. 5. Here we consider \( \phi ( {\varvec{z}} ) = -\log F( {\varvec{z}} ) \) with a nondegenerate n-dimensional standard normal distribution function \( F( {\varvec{z}} ) \). Assumption 1 obviously does not hold for every \( {\varvec{z}} \in \mathrm{I}\mathrm{R}^n \) with a probabilistic \( \phi ( {\varvec{z}} ) \). However, as illustrated in Fábián et al. (2018), Assumption 1 holds for the points of a bounded ball around the origin. (The ratio \( \frac{\alpha }{\omega } \) decreases as the radius of the ball increases.) Owing to special properties of the probabilistic function, the column generation process can be guaranteed to remain in a ball of sufficiently large radius. Such a procedure was sketched in Fábián et al. (2018). That construction provides a theoretical justification for limiting our investigations to a bounded ball, but it does not yield usable estimates for the values \( \alpha \) and \( \omega \). While the efficiency considerations of the previous sections carry over to probabilistic problems, reliability considerations cannot be based on Assumption 1. The quality of a model is measured by different means, based on special features of the probabilistic function.

Section 6 contains an overview of algorithms for the estimation of multivariate normal probability distribution function values and gradients. We discuss the numerical integration algorithm of Genz (1992), and the variance reduction Monte Carlo simulation algorithms of Deák (1980, 1986), Szántai (1976, 1985, 1988) and Ambartzumian et al. (1998), mentioning related works. These variance reduction Monte Carlo simulation algorithms were originally developed for use in primal-type methods for probabilistic constrained problems. An abundant stream of research in this direction was initiated by the models, methods and applications pioneered by Prékopa and his school. Based on these algorithms, a gradient estimate satisfying Assumption 3 can be constructed by a two-stage sampling procedure, as mentioned in Sect. 6.5.

Section 7 describes a computational experiment. The aim is to demonstrate the workability of the randomized column generation scheme of Sect. 3, in case of probabilistic problems.

2 Column generation in an idealized setting

In this section we work in the idealized setting of Assumptions 1 and 2. We formulate the dual problem and construct polyhedral models of the primal and dual problems. We follow the construction in Fábián et al. (2018), though monotonicity of the probabilistic objective was exploited there, and variable splitting was based on \( {\varvec{z}} \le T{\varvec{x}} \). In the present idealized setting, we apply the traditional form of variable splitting: problem (1) is written as

$$\begin{aligned} \min \; \phi ( {\varvec{z}} ) \quad \hbox {subject to}\quad A {\varvec{x}} - {\varvec{b}} \le {\varvec{0}},\quad {\varvec{z}} - T{\varvec{x}} = {\varvec{0}}. \end{aligned}$$
(3)

This problem has an optimal solution because the feasible domain of (1) is nonempty and bounded, by assumption. Introducing the multiplier vector \( -{\varvec{y}} \in \mathrm{I}\mathrm{R}^r, -{\varvec{y}} \ge {\varvec{0}} \) to the constraint \( A {\varvec{x}} - {\varvec{b}} \le {\varvec{0}} \), and \( -{\varvec{u}} \in \mathrm{I}\mathrm{R}^n \) to the constraint \( {\varvec{z}} - T{\varvec{x}} = {\varvec{0}} \), the Lagrangian dual of (3) can be written as

$$\begin{aligned} \max \; \left\{ {\varvec{y}}^T{\varvec{b}} - \phi ^{\star }( {\varvec{u}} ) \right\} \quad \hbox {subject to}\quad ( {\varvec{y}}, {\varvec{u}} ) \in {{\mathcal {D}}}, \end{aligned}$$
(4)

where

$$\begin{aligned} {{\mathcal {D}}} := \left\{ ( {\varvec{y}}, {\varvec{u}} ) \in \mathrm{I}\mathrm{R}^{r+n} \; \left| \right. {\varvec{y}} \le {\varvec{0}},\;\; T^T {\varvec{u}} = A^T {\varvec{y}}\right\} . \end{aligned}$$
(5)

According to the theory of convex duality, this problem has an optimal solution. For a recent treatise on Lagrangian duality, see, e.g., Chapter 4 of Ruszczyński (2006).
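For the reader's convenience, the form (4)–(5) of the dual can be traced in a few lines. The following is only a sketch; the symbol \( L \) for the Lagrangian is introduced here for illustration and is not used elsewhere. With the multipliers \( -{\varvec{y}} \ge {\varvec{0}} \) and \( -{\varvec{u}} \), we have

$$\begin{aligned} L( {\varvec{x}}, {\varvec{z}};\, {\varvec{y}}, {\varvec{u}} ) \;&=\; \phi ( {\varvec{z}} ) - {\varvec{y}}^T ( A {\varvec{x}} - {\varvec{b}} ) - {\varvec{u}}^T ( {\varvec{z}} - T {\varvec{x}} ) \\ &=\; {\varvec{y}}^T {\varvec{b}} + \left[ \phi ( {\varvec{z}} ) - {\varvec{u}}^T {\varvec{z}} \right] + \left( T^T {\varvec{u}} - A^T {\varvec{y}} \right) ^T {\varvec{x}}. \end{aligned}$$

Minimizing in \( {\varvec{z}} \) turns the bracketed term into \( -\phi ^{\star }( {\varvec{u}} ) \), by the definition of the conjugate function. Minimizing in \( {\varvec{x}} \) yields \( -\infty \) unless \( T^T {\varvec{u}} = A^T {\varvec{y}} \), in which case the last term vanishes. What remains is the objective \( {\varvec{y}}^T{\varvec{b}} - \phi ^{\star }( {\varvec{u}} ) \) over the set \( {{\mathcal {D}}} \).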

2.1 Polyhedral models

Suppose we have evaluated the function \( \phi ( {\varvec{z}} ) \) at points \( {\varvec{z}}_i\; ( i = 0, 1, \ldots , k )\); we introduce the notation \(\phi _i = \phi ( {\varvec{z}}_i ) \) for respective objective values. An inner approximation of \( \phi (\cdot )\) is

$$\begin{aligned} \phi _k( {\varvec{z}} ) = \min \; \sum \limits _{i=0}^k \lambda _i \phi _i \quad \hbox {such that}\quad \lambda _i \ge 0\; ( i = 0, \ldots , k ),\quad \sum \limits _{i=0}^k \lambda _i =1, \quad \sum \limits _{i=0}^k \lambda _i {\varvec{z}}_i = {\varvec{z}}. \end{aligned}$$
(6)

If \( {\varvec{z}} \not \in \hbox {Conv}( {\varvec{z}}_0, \ldots , {\varvec{z}}_k ) \), then let \( \phi _k( {\varvec{z}} ) := +\infty \). A polyhedral model of problem (3) is

$$\begin{aligned} \min \; \phi _k( {\varvec{z}} ) \quad \hbox {subject to}\quad A {\varvec{x}} - {\varvec{b}} \le {\varvec{0}},\quad {\varvec{z}} - T{\varvec{x}} = {\varvec{0}}. \end{aligned}$$
(7)

We assume that (7) is feasible, i.e., its optimum is finite. This can be ensured by proper selection of the initial \({\varvec{z}}_0, \ldots , {\varvec{z}}_k \) points. The convex conjugate of \(\phi _k( {\varvec{z}} ) \) is

$$\begin{aligned} \phi _k^{\star }( {\varvec{u}} ) = \max \limits _{0 \le i \le k}\; \left\{ {\varvec{u}}^T{\varvec{z}}_i - \phi _i \right\} . \end{aligned}$$
(8)

As \( \phi _k^{\star }(\cdot )\) is a cutting-plane model of \(\phi ^{\star }(\cdot )\), the following problem is a polyhedral model of problem (4):

$$\begin{aligned} \max \; \left\{ {\varvec{y}}^T{\varvec{b}} - \phi _k^{\star }( {\varvec{u}} ) \right\} \quad \hbox {subject to}\quad ( {\varvec{y}}, {\varvec{u}} ) \in {{\mathcal {D}}}. \end{aligned}$$
(9)
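To make the cutting-plane model (8) concrete, the following minimal Python sketch evaluates \( \phi _k^{\star }( {\varvec{u}} ) \) from stored points and function values. The test function \( \phi ( {\varvec{z}} ) = \frac{1}{2} \Vert {\varvec{z}} \Vert ^2 \) (whose conjugate is \( \frac{1}{2} \Vert {\varvec{u}} \Vert ^2 \)), the three sample points, and all function names are assumptions chosen for illustration; they do not come from the paper.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Euclidean inner product of two vectors of equal length."""
    return sum(ai * bi for ai, bi in zip(a, b))


def conjugate_model(u: Vector, points: Sequence[Vector], values: Sequence[float]) -> float:
    """Cutting-plane model (8): max over i of u^T z_i - phi_i."""
    return max(dot(u, z) - phi_i for z, phi_i in zip(points, values))


# Illustrative test function (an assumption, not the probabilistic function):
# phi(z) = 0.5 * ||z||^2, with conjugate phi*(u) = 0.5 * ||u||^2.
def phi(z: Vector) -> float:
    return 0.5 * dot(z, z)


def phi_conjugate(u: Vector) -> float:
    return 0.5 * dot(u, u)


# Hypothetical points z_0, z_1, z_2 and the corresponding values phi_i.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
values = [phi(z) for z in points]
```

For any \( {\varvec{u}} \) the model value cannot exceed the true conjugate, e.g. `conjugate_model((2.0, 2.0), points, values) <= phi_conjugate((2.0, 2.0))`. For this particular test function the two coincide whenever \( {\varvec{u}} \) equals one of the stored points.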

2.2 Linear programming formulations

The primal model problem (6)–(7) will be formulated as

$$\begin{aligned} \begin{array}{ll} \min &{}\quad \sum \limits _{i=0}^k\, \phi _i \lambda _i \\ \text{ such } \text{ that }&{}\quad \lambda _i \ge 0\quad ( i = 0, \ldots , k ), \\ &{}\quad \sum \limits _{i=0}^k\,\lambda _i= 1, \\ &{}\quad \sum \limits _{i=0}^k\,\lambda _i {\varvec{z}}_i -T {\varvec{x}} = {\varvec{0}}, \\ &{}\quad A {\varvec{x}} \le {\varvec{b}}. \end{array} \end{aligned}$$
(10)

The dual model problem (8)–(9), formulated as a linear programming problem, is just the LP dual of (10):

$$\begin{aligned} \begin{array}{ll} \max &{}\quad \vartheta + {\varvec{b}}^T {\varvec{y}} \\ \text{ such } \text{ that }&{}\quad {\varvec{y}} \le {\varvec{0}}, \\ &{}\quad \vartheta + {\varvec{z}}_i^T {\varvec{u}} \le \phi _i \quad ( i = 0, \ldots , k ), \\ &{}\quad - T^T {\varvec{u}} + A^T {\varvec{y}} = {\varvec{0}}. \end{array} \end{aligned}$$
(11)

Let \( (\, {\overline{\lambda }}_0, \ldots , {\overline{\lambda }}_k,\, \overline{{\varvec{x}}} \,) \) and \( (\, {\overline{\vartheta }},\, \overline{{\varvec{u}}},\, \overline{{\varvec{y}}} \,) \) denote respective optimal solutions of the problems (10) and (11)—both existing due to our assumption concerning the feasibility of (7) and hence (10). Let moreover

$$\begin{aligned} \overline{{\varvec{z}}} = \sum \limits _{i=0}^k\, {\overline{\lambda }}_i {\varvec{z}}_i. \end{aligned}$$
(12)

Observation 4

We have

(a) \(\phi _k( \overline{{\varvec{z}}} ) \; =\; \sum _{i=0}^k\,\phi _i {\overline{\lambda }}_i \; =\; {\overline{\vartheta }} + \overline{{\varvec{u}}}^T \overline{{\varvec{z}}}\),

(b) \({\overline{\vartheta }} = - \phi _k^{\star } ( \overline{{\varvec{u}}} )\),

(c) \(\phi _k( \overline{{\varvec{z}}} ) + \phi _k^{\star } ( \overline{{\varvec{u}}} ) = \overline{{\varvec{u}}}^T \overline{{\varvec{z}}} \)    and hence   \( \overline{{\varvec{u}}} \in \partial \phi _k( \overline{{\varvec{z}}} ) \).

Sketch of proof

(a) The first equality follows from the equivalence of (10) on the one hand, and (6)–(7) on the other hand. The second equality is a direct consequence of complementarity.

(b) This follows from the equivalence between (11) on the one hand and (8)–(9) on the other hand.

(c) The equality is a consequence of (a) and (b). This is Fenchel’s equality between \( \overline{{\varvec{u}}} \) and \( \overline{{\varvec{z}}} \), with respect to the model function \( \phi _k(\cdot )\). On \( \overline{{\varvec{u}}} \) being a subgradient, see, e.g., Section 23 in Rockafellar (1970).

2.3 A column generation procedure

We give a brief overview of the approximation scheme of Fábián et al. (2018). An optimal dual solution (i.e., shadow price vector) of the current model problem is \( (\, {\overline{\vartheta }},\, \overline{{\varvec{u}}},\, \overline{{\varvec{y}}} \,) \). Given a vector \( {\varvec{z}} \in \mathrm{I}\mathrm{R}^n \), we can add a new column in (10), corresponding to \( {\varvec{z}}_{k+1} = {\varvec{z}} \). This is an improving column if its reduced cost

$$\begin{aligned} {\overline{\rho }}( {\varvec{z}} ) \; :=\; {\overline{\vartheta }} \; +\; \overline{{\varvec{u}}}^T {\varvec{z}} - \phi ( {\varvec{z}} ) \end{aligned}$$
(13)

is positive. It is easily seen that the reduced cost of \(\overline{{\varvec{z}}} \) is non-negative. Indeed,

$$\begin{aligned} {\overline{\rho }}( \overline{{\varvec{z}}} ) \; \ge \; {\overline{\vartheta }} \; +\; \overline{{\varvec{u}}}^T \overline{{\varvec{z}}} - \phi _k( \overline{{\varvec{z}}} ) \; = 0 \end{aligned}$$
(14)

follows from \(\; \phi _k(\cdot ) \ge \phi (\cdot )\) and Observation 4(a).

In the context of the simplex method, the Markowitz column-selection rule is widely used. The Markowitz rule selects the vector with the largest reduced cost. Coming back to the present problem (10), let

$$\begin{aligned} \overline{{\mathcal {R}}} := \max _{{\varvec{z}}} {\overline{\rho }}( {\varvec{z}}). \end{aligned}$$
(15)

The column with the largest reduced cost can, in theory, be found by a steepest descent method applied to the function \( - {\overline{\rho }}( {\varvec{z}} ) \). In a practical approach, only a limited number of line search steps are performed, starting from \( \overline{{\varvec{z}}} \). The efficiency of this practical approach can be estimated on the basis of the following well-known theorem:

Theorem 5

Let Assumption 1 hold for the function \( f : \mathrm{I}\mathrm{R}^n \rightarrow \mathrm{I}\mathrm{R}\). We minimize \( f( {\varvec{z}} ) \) over \( \mathrm{I}\mathrm{R}^n \) using a steepest descent method, starting from a point \( {\varvec{z}}^0 \). Let \( {\varvec{z}}^1, \ldots , {\varvec{z}}^j, \ldots \) denote the iterates obtained by applying exact line search at each step. Then we have

$$\begin{aligned} f\left( {\varvec{z}}^j \right) - {{\mathcal {F}}} \; \le \; \left( 1 - \frac{\alpha }{\omega } \right) ^j \left[ \; f\left( {\varvec{z}}^0 \right) - {{\mathcal {F}}} \;\right] , \end{aligned}$$
(16)

where \( {{\mathcal {F}}} = \min _{{\varvec{z}}} f( {\varvec{z}} ) \).
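The bound (16) can be checked numerically on a small example. The sketch below is an illustration under our own assumptions, not part of the cited proofs: it uses the diagonal quadratic \( f( {\varvec{z}} ) = \frac{1}{2} \sum _i d_i z_i^2 \), for which \( \alpha = \min _i d_i \), \( \omega = \max _i d_i \), \( {{\mathcal {F}}} = 0 \), and the exact line search step has a closed form. The function name and the test data are hypothetical.

```python
def steepest_descent_quadratic(d, z0, steps):
    """Exact-line-search steepest descent on f(z) = 0.5 * sum(d_i * z_i**2).

    Returns the objective values f(z^0), ..., f(z^steps).
    """
    def f(v):
        return 0.5 * sum(di * vi * vi for di, vi in zip(d, v))

    z = list(z0)
    history = [f(z)]
    for _ in range(steps):
        g = [di * zi for di, zi in zip(d, z)]            # gradient of f at z
        gg = sum(gi * gi for gi in g)
        if gg > 0.0:
            gDg = sum(di * gi * gi for di, gi in zip(d, g))
            t = gg / gDg                                  # exact minimizer of f(z - t g) in t
            z = [zi - t * gi for zi, gi in zip(z, g)]
        history.append(f(z))
    return history


d = [1.0, 4.0]                       # Hessian eigenvalues, so alpha = 1 and omega = 4
alpha, omega = min(d), max(d)
history = steepest_descent_quadratic(d, z0=[4.0, 1.0], steps=10)

# Right-hand side of (16) with F = 0, for j = 0, ..., 10.
bounds = [(1.0 - alpha / omega) ** j * history[0] for j in range(len(history))]
```

On this instance each entry of `history` stays below the corresponding entry of `bounds`; the observed decrease is in fact faster than the guaranteed factor \( 1 - \frac{\alpha }{\omega } = 0.75 \) per step.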

This theorem can be found, e.g., in Chapter 8.6 of Luenberger and Ye (2008). Ruszczyński (2006) presents a slightly different form in Chapter 5.3.5, Theorem 5.7. The following corollary was obtained in Fábián et al. (2018):

Corollary 6

Let \( \beta \; ( 0 < \beta \ll 1 ) \) be given. In \(\; O\left( - \log \beta \right) \) steps with the steepest descent method, we find a vector \( {\widehat{{\varvec{z}}}} \) such that

$$\begin{aligned} {\overline{\rho }}\left( {\widehat{{\varvec{z}}}} \right) \; \ge \; ( 1 - \beta )\; \overline{{\mathcal {R}}}. \end{aligned}$$
(17)

This can be shown by substituting \( f( {\varvec{z}} ) = - {\overline{\rho }}( {\varvec{z}} ),\; {\varvec{z}}^0 = \overline{{\varvec{z}}} \) in (16), and applying (14). The objective function \( - {\overline{\rho }}( {\varvec{z}} ) \) inherits Assumption 1 from \( \phi ( {\varvec{z}} ) \). Performing j steps with j such that \( ( 1 - \frac{\alpha }{\omega } )^j \le \beta \) yields an appropriate \( {\widehat{{\varvec{z}}}} = {\varvec{z}}^j \).
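The step count behind the \( O( -\log \beta ) \) estimate can be computed directly from the condition \( ( 1 - \frac{\alpha }{\omega } )^j \le \beta \). The following helper is a minimal sketch; its name and the sample values of \( \alpha , \omega , \beta \) are assumptions for illustration.

```python
import math


def descent_steps(alpha: float, omega: float, beta: float) -> int:
    """Smallest j >= 1 with (1 - alpha/omega)**j <= beta."""
    if not (0.0 < alpha <= omega):
        raise ValueError("need 0 < alpha <= omega")
    if not (0.0 < beta < 1.0):
        raise ValueError("need 0 < beta < 1")
    rate = 1.0 - alpha / omega
    if rate == 0.0:
        # alpha == omega: the bound (16) is already zero after one line search.
        return 1
    # rate**j <= beta  <=>  j >= log(beta) / log(rate), both logarithms being negative.
    return math.ceil(math.log(beta) / math.log(rate))
```

For instance, with the hypothetical values \( \alpha = 1 \), \( \omega = 4 \), the call `descent_steps(1.0, 4.0, 0.5)` returns 3, since \( 0.75^3 \approx 0.42 \le 0.5 < 0.75^2 \). The count grows only logarithmically as \( \beta \) decreases, but it grows as the ratio \( \frac{\alpha }{\omega } \) shrinks.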

In view of the Markowitz rule mentioned above, the vector \( {\widehat{{\varvec{z}}}} \) in Corollary 6 is a fairly good improving vector in the column generation scheme.

To check near-optimality of the current solution, we use the usual LP stopping rule: the reduced cost of any candidate vector should be below a fixed optimality tolerance. Of course we do not know \( \overline{{\mathcal {R}}} \), but let

$$\begin{aligned} \overline{{\mathcal {B}}} := \frac{1}{1 - \beta } {\overline{\rho }}\left( {\widehat{{\varvec{z}}}} \right) , \end{aligned}$$
(18)

with the \( \beta \) and \( {\widehat{{\varvec{z}}}} \) of Corollary 6. We stop the column generation procedure when \(\overline{{\mathcal {B}}} \) falls below the optimality tolerance. When applying this bound, we work with a fixed \( \beta \) throughout the process, e.g., let \( \beta = 0.5 \). For the present special linear programming problem (10), this is not just a heuristic rule:

Observation 7

\( \overline{{\mathcal {R}}} \) (and hence \( \overline{{\mathcal {B}}} \)) is an upper bound on the gap between the respective optima of the model problem (10) and the original convex problem (3).

Proof

We have

$$\begin{aligned} \overline{{\mathcal {R}}} \; =\; \max _{{\varvec{z}}}\, {\overline{\rho }}( {\varvec{z}} ) \; =\; \phi ^{\star }( \overline{{\varvec{u}}} ) - \phi _k^{\star }( \overline{{\varvec{u}}} ). \end{aligned}$$
(19)

(The second equality follows from the definition of the conjugate function and from Observation 4(b).)

Since \( ( \overline{{\varvec{u}}}, \overline{{\varvec{y}}} ) \) is a feasible solution of the dual problem (4), it follows that (19) is an upper bound on the gap between the respective optima of the dual model problem (9) and the dual problem (4). The observation follows from convex duality. \(\square \)
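The last step of the proof can be spelled out as a chain of inequalities. In this sketch, \( \nu _k \) and \( \nu \) denote the respective optima of the model problem (10) and of the convex problem (3); this notation is introduced here for illustration only. We have \( \nu _k \ge \nu \) because \( \phi _k(\cdot ) \ge \phi (\cdot )\). By linear programming duality and Observation 4(b), \( \nu _k = {\overline{\vartheta }} + {\varvec{b}}^T \overline{{\varvec{y}}} = \overline{{\varvec{y}}}^T {\varvec{b}} - \phi _k^{\star }( \overline{{\varvec{u}}} ) \). By convex duality, \( \nu \) equals the optimum of (4), which is not less than the objective value of the feasible solution \( ( \overline{{\varvec{y}}}, \overline{{\varvec{u}}} ) \). Hence

$$\begin{aligned} 0 \;\le \; \nu _k - \nu \;\le \; \left[ \overline{{\varvec{y}}}^T {\varvec{b}} - \phi _k^{\star }( \overline{{\varvec{u}}} ) \right] - \left[ \overline{{\varvec{y}}}^T {\varvec{b}} - \phi ^{\star }( \overline{{\varvec{u}}} ) \right] \;=\; \phi ^{\star }( \overline{{\varvec{u}}} ) - \phi _k^{\star }( \overline{{\varvec{u}}} ) \;=\; \overline{{\mathcal {R}}} \;\le \; \overline{{\mathcal {B}}}, \end{aligned}$$

where the final inequality is (17) combined with the definition (18).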

Remark 8

Prescribing a loose optimality tolerance on \(\overline{{\mathcal {B}}} \) results in an early termination of the column generation process. Common experience with LP problems is that computational effort is substantially reduced by loosening the stopping tolerance.

Remark 9

Looking at the column-generation approach from a dual viewpoint we can see a cutting-plane method. This relationship between the primal and dual approaches is well known, see, e.g., Frangioni (2002, 2018). Details for the present case were worked out in the research report Fábián and Szántai (2017), a former version of the present paper.

The dual viewpoint admits a visual justification of the convergence of the sequence of the optimal dual vectors \( \overline{{\varvec{u}}} \). (Moreover, the cutting-plane method can be regularized, but we do not consider regularization in this paper.)

3 Working with gradient estimates

First we extend Theorem 5. Let \( f : \mathrm{I}\mathrm{R}^n \rightarrow \mathrm{I}\mathrm{R}\) be such that Assumptions 1 and 3 hold. We wish to minimize \(f( {\varvec{z}} ) \) over \( \mathrm{I}\mathrm{R}^n \) using a stochastic descent method. Let \( {\varvec{z}}^{\circ } \in \mathrm{I}\mathrm{R}^n \) denote an iterate, and \({\varvec{g}}^{\circ } = \nabla f( {\varvec{z}}^{\circ } ) \) the corresponding gradient.

Let \( {\sigma ^2} > 0 \) be given. According to Assumption 3, realizations of a random vector \( {\varvec{G}}^{\circ } \) can be constructed, satisfying

$$\begin{aligned} \hbox {E}\left( {\varvec{G}}^{\circ } \right) = {\varvec{g}}^{\circ } \quad \hbox {and} \quad \hbox {E}\left( \left\| {\varvec{G}}^{\circ } - {\varvec{g}}^{\circ } \right\| ^2 \right) \;\le \; {\sigma ^2} \left\| {\varvec{g}}^{\circ } \right\| ^2. \end{aligned}$$
(20)

From (20) follows

$$\begin{aligned} \hbox {E}\left( \left\| {\varvec{G}}^{\circ } \right\| ^2 \right) \; =\; \hbox {E}\left( \left\| {\varvec{G}}^{\circ } - {\varvec{g}}^{\circ } \right\| ^2 \right) + \left\| {\varvec{g}}^{\circ } \right\| ^2 \; \le \; ( {\sigma ^2} + 1 )\, \left\| {\varvec{g}}^{\circ } \right\| ^2. \end{aligned}$$
(21)
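The equality in (21) is the usual bias-variance decomposition. As a brief sketch of the omitted step, we expand the squared norm around \( {\varvec{g}}^{\circ } \):

$$\begin{aligned} \left\| {\varvec{G}}^{\circ } \right\| ^2 \;=\; \left\| {\varvec{G}}^{\circ } - {\varvec{g}}^{\circ } \right\| ^2 \;+\; 2\, {\varvec{g}}^{\circ \; T} \left( {\varvec{G}}^{\circ } - {\varvec{g}}^{\circ } \right) \;+\; \left\| {\varvec{g}}^{\circ } \right\| ^2. \end{aligned}$$

Taking expectations, the middle term vanishes because \( \hbox {E}\left( {\varvec{G}}^{\circ } \right) = {\varvec{g}}^{\circ } \), which leaves the equality in (21). The inequality in (21) is then the variance bound in (20).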

We consider the following randomized form of Theorem 5:

Theorem 10

Under the above assumptions, we perform a steepest descent method using gradient estimates: at the current iterate \( {\varvec{z}}^{\circ } \), a gradient estimate \( {\varvec{G}}^{\circ } \) is generated and a line search is performed in that direction. We assume that gradient estimates at the respective iterates are generated independently, and (20)–(21) hold for each of them.

Having started from the point \( {\varvec{z}}^0 \), and having performed j line searches, let \( {\varvec{z}}^1, \ldots , {\varvec{z}}^{j} \) denote the respective iterates. Then we have

$$\begin{aligned} \hbox {E}\left[ f\left( {\varvec{z}}^{j} \right) \right] - {{\mathcal {F}}}\; \le \; \left( 1 - \frac{\alpha }{\omega ( {\sigma ^2} + 1 )} \right) ^j \left( f\left( {\varvec{z}}^{0} \right) \; - {{\mathcal {F}}} \right) , \end{aligned}$$
(22)

where \( {{\mathcal {F}}} = \min _{{\varvec{z}}} f( {\varvec{z}} ) \).

Proof

Let \( {\varvec{G}}^0, \ldots , {\varvec{G}}^{j-1} \) denote the respective gradient estimates for the iterates \( {\varvec{z}}^0, \ldots , {\varvec{z}}^{j-1} \).

To begin with, we focus on the first line search whose starting point is \( {\varvec{z}}^{\circ } = {\varvec{z}}^0 \). Here \( {\varvec{z}}^{\circ } \) is a given (not random) vector. We adapt the proof of Theorem 5, presented in Chapter 8.6 of Luenberger and Ye (2008), to employ the gradient estimate \({\varvec{G}}^{\circ } \) instead of the gradient \( {\varvec{g}}^{\circ } \). From \(\; \nabla ^2 f ( {\varvec{z}} ) \preceq \omega I, \;\) it follows that

$$\begin{aligned} f\left( {\varvec{z}}^{\circ } - t {\varvec{G}}^{\circ } \right) \le f\left( {\varvec{z}}^{\circ } \right) - t\, {\varvec{g}}^{\circ \; T} {\varvec{G}}^{\circ } + \frac{\omega }{2} t^2\, {\varvec{G}}^{\circ \; T} {\varvec{G}}^{\circ } \end{aligned}$$

holds for any \( t \in \mathrm{I}\mathrm{R}\) (a consequence of Taylor’s theorem). Considering expectations on both sides, we get

$$\begin{aligned} \begin{array}{ll} \hbox {E}\left[ f\left( {\varvec{z}}^{\circ } - t {\varvec{G}}^{\circ } \right) \right] &{} \le \; f\left( {\varvec{z}}^{\circ } \right) - t\, \left\| {\varvec{g}}^{\circ } \right\| ^2 +\frac{\omega }{2} t^2\, \hbox {E}\left( \left\| {\varvec{G}}^{\circ } \right\| ^2 \right) \\ \\ &{} \le f\left( {\varvec{z}}^{\circ } \right) - t\, \left\| {\varvec{g}}^{\circ } \right\| ^2+\frac{\omega }{2} t^2\, ( {\sigma ^2} + 1 )\, \left\| {\varvec{g}}^{\circ } \right\| ^2 \\ \end{array} \end{aligned}$$

according to (21). We minimize the two sides in t separately. The right-hand side is a quadratic expression in t, attaining its minimum at \( t = \frac{1}{\omega ( {\sigma ^2} + 1 )} \). The inequality carries over to the minima, hence

$$\begin{aligned} \min _{t}\, \hbox {E}\left[ f\left( {\varvec{z}}^{\circ } - t {\varvec{G}}^{\circ } \right) \right] \; \le \; f\left( {\varvec{z}}^{\circ } \right) \; -\; \frac{1}{2 \omega ( {\sigma ^2} + 1 )}\, \left\| {\varvec{g}}^{\circ } \right\| ^2. \end{aligned}$$
(23)

For the left-hand side, we obviously have

$$\begin{aligned} \hbox {E}\left[ \; \min _{t}\, f\left( {\varvec{z}}^{\circ } - t {\varvec{G}}^{\circ } \right) \right] \; \le \; \min _{t}\, \hbox {E}\left[ f\left( {\varvec{z}}^{\circ } - t {\varvec{G}}^{\circ } \right) \right] . \end{aligned}$$
(24)

(This is analogous to the basic inequality comparing the wait-and-see and the here-and-now approaches for classic two-stage stochastic programming problems; see, e.g., Chapter 4.3 of Birge and Louveaux 1997.)

Let \( {\varvec{z}}^{\prime } \) denote the minimizer of the line search on the left-hand side of (24), i.e., \( f\left( {\varvec{z}}^{\prime } \right) = \min _{t}\, f\left( {\varvec{z}}^{\circ } - t {\varvec{G}}^{\circ } \right) \). (Of course \( {\varvec{z}}^{\prime } \) is a random vector since it depends on \( {\varvec{G}}^{\circ } \).) Substituting this in (24) and comparing with (23), we get

$$\begin{aligned} \hbox {E}\left[ f\left( {\varvec{z}}^{\prime } \right) \right] \; \le \; f\left( {\varvec{z}}^{\circ } \right) \; -\; \frac{1}{2 \omega ( {\sigma ^2} + 1 )}\, \left\| {\varvec{g}}^{\circ } \right\| ^2. \end{aligned}$$

Subtracting \( {{\mathcal {F}}} \) from both sides results in

$$\begin{aligned} \hbox {E}\left[ f\left( {\varvec{z}}^{\prime } \right) \right] - {{\mathcal {F}}}\; \le \; f\left( {\varvec{z}}^{\circ } \right) - {{\mathcal {F}}}\; -\; \frac{1}{2 \omega ( {\sigma ^2} + 1 )}\, \left\| {\varvec{g}}^{\circ } \right\| ^2. \end{aligned}$$
(25)

Coming to the lower bound, a well-known consequence of \(\; \alpha I \preceq \nabla ^2 f ( {\varvec{z}} ) \;\) is

$$\begin{aligned} \left\| {\varvec{g}}^{\circ } \right\| ^2\; \ge \; 2 \alpha \, \left( f\left( {\varvec{z}}^{\circ } \right) \; - {{\mathcal {F}}} \right) \end{aligned}$$

(see Chapter 8.6 of Luenberger and Ye 2008). Combining this with (25), we get

$$\begin{aligned} \hbox {E}\left[ f\left( {\varvec{z}}^{\prime } \right) \right] - {{\mathcal {F}}}\le & {} f\left( {\varvec{z}}^{\circ } \right) - {{\mathcal {F}}} - \frac{\alpha }{\omega ( {\sigma ^2} + 1 )}\, \left( f\left( {\varvec{z}}^{\circ } \right) \; - {{\mathcal {F}}} \right) \nonumber \\= & {} \left( 1 - \frac{\alpha }{\omega ( {\sigma ^2} + 1 )} \right) \left( f\left( {\varvec{z}}^{\circ } \right) \; - {{\mathcal {F}}} \right) . \end{aligned}$$
(26)

As we have assumed that \( {\varvec{z}}^{\circ } \) is a given (not random) vector, the right-hand side of (26) is deterministic, and the expectation on the left-hand side is considered according to the distribution of \( {\varvec{G}}^{\circ } \).

Now, let us examine the \( ( l + 1 )\)th line search (for \( 1 \le l \le j-1 \)) where the starting point is \( {\varvec{z}}^{\circ } = {\varvec{z}}^l \) and the minimizer is \( {\varvec{z}}^{\prime } = {\varvec{z}}^{l+1} \). Of course (26) holds with these objects also, but now both sides are random variables, depending on the vectors \( {\varvec{G}}^{0}, \ldots , {\varvec{G}}^{l-1} \). (The expectation on the left-hand side is a conditional expectation.) We consider the respective expectations of the two sides, according to the joint distribution of \( {\varvec{G}}^{0}, \ldots , {\varvec{G}}^{l-1} \). As the random gradient vectors were generated independently, we get

$$\begin{aligned} \hbox {E}\left[ f\left( {\varvec{z}}^{l+1} \right) \right] - {{\mathcal {F}}}\; \le \; \left( 1 - \frac{\alpha }{\omega ( {\sigma ^2} + 1 )} \right) \left( \hbox {E}\left[ f\left( {\varvec{z}}^{l} \right) \right] \; - {{\mathcal {F}}} \right) , \end{aligned}$$
(27)

where the left-hand expectation is now taken according to the joint distribution of \( {\varvec{G}}^{0}, \ldots , {\varvec{G}}^{l} \). This technique of proof is well known in the context of stochastic gradient schemes, see, e.g., Nesterov and Vial (2008).

Finally, (22) follows from the iterative application of (27). \(\square \)
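The statement of Theorem 10 can be illustrated by a small simulation. The sketch below rests on our own assumptions and is not the sampling procedure of the paper: the objective is the diagonal quadratic \( f( {\varvec{z}} ) = \frac{1}{2} \sum _i d_i z_i^2 \) (so \( {{\mathcal {F}}} = 0 \)), and the gradient estimate is the true gradient plus independent Gaussian noise scaled so that (20) holds with equality. All names and parameter values are hypothetical.

```python
import math
import random


def noisy_gradient(g, sigma2, rng):
    """Unbiased estimate G of g with E||G - g||^2 = sigma2 * ||g||^2, cf. (20)."""
    n = len(g)
    scale = math.sqrt(sigma2 * sum(gi * gi for gi in g) / n)
    return [gi + scale * rng.gauss(0.0, 1.0) for gi in g]


def stochastic_descent(d, z0, steps, sigma2, rng):
    """One run on f(z) = 0.5 * sum(d_i * z_i**2); returns f at the final iterate."""
    z = list(z0)
    for _ in range(steps):
        g = [di * zi for di, zi in zip(d, z)]             # true gradient (used only to
        G = noisy_gradient(g, sigma2, rng)                # build the estimate and the step)
        GDG = sum(di * Gi * Gi for di, Gi in zip(d, G))
        if GDG == 0.0:
            continue
        # Exact line search along -G: minimizer of f(z - t G) over all real t.
        t = sum(gi * Gi for gi, Gi in zip(g, G)) / GDG
        z = [zi - t * Gi for zi, Gi in zip(z, G)]
    return 0.5 * sum(di * zi * zi for di, zi in zip(d, z))


d, z0, steps, sigma2 = [1.0, 4.0], [4.0, 1.0], 10, 0.25
alpha, omega = min(d), max(d)
rng = random.Random(0)                                     # fixed seed for reproducibility
runs = 2000
mean_final = sum(stochastic_descent(d, z0, steps, sigma2, rng) for _ in range(runs)) / runs

f0 = 0.5 * sum(di * zi * zi for di, zi in zip(d, z0))
bound = (1.0 - alpha / (omega * (sigma2 + 1.0))) ** steps * f0   # right-hand side of (22)
```

The sample mean `mean_final` estimates the left-hand side of (22) and should lie below `bound`. For a quadratic objective the line search can be computed from the true gradient in closed form; in the general setting of Assumption 3 the line search relies on function values instead.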

Coming back to problem (1), let Assumptions 1 and 3 hold for the objective function \( \phi ( {\varvec{z}} ) \). We show that the column generation scheme of Sect. 2.3 can be implemented as a randomized method using gradient estimates. Specifically, we need to approximately solve the column generation subproblem (15).

Corollary 11

Let a tolerance \( \beta \; ( 0 < \beta \ll 1 ) \) and a probability \( p\; ( 0 < p \ll 1 ) \) be given. In \(\; O( - \log ( \beta \, p ) ) \) steps with the stochastic descent method, we find a vector \( {\widehat{{\varvec{z}}}} \) such that

$$\begin{aligned} \hbox {P}\Big (\; {\overline{\rho }}\left( {\widehat{{\varvec{z}}}} \;\right) \; \ge \; ( 1 - \beta ) \overline{{\mathcal {R}}} \Big )\; \ge \; 1 - p. \end{aligned}$$

Proof

We apply Theorem 10 to \( f( {\varvec{z}} ) = - {\overline{\rho }}( {\varvec{z}} ) \). This function inherits Assumptions 1 and 3 from \( \phi ( {\varvec{z}} ) \). Let \( \varrho = 1 - \frac{\alpha }{\omega ( {\sigma ^2} + 1 )} \) with some \( \sigma > 0 \). We assume that gradient estimates at the respective iterates are generated independently, and (20)–(21) hold for each of them.

Substituting \( {\varvec{z}}^0 = \overline{{\varvec{z}}} \) in (22) and taking into account (14), we get

$$\begin{aligned} \hbox {E}\left[ {\overline{\rho }}( {\varvec{z}}^j ) \right] \; \ge \; \left( 1 -\varrho ^j \right) \, \overline{{\mathcal {R}}}. \end{aligned}$$

The gap \( \overline{{\mathcal {R}}} \) is obviously non-negative. In case \( \overline{{\mathcal {R}}} = 0 \), the starting iterate \( {\varvec{z}}^0 = \overline{{\varvec{z}}} \) of the steepest descent method was already optimal, due to (14). In what follows we assume \( \overline{{\mathcal {R}}} > 0 \). A trivial transformation results in

$$\begin{aligned} \hbox {E}\left[ 1 - \frac{{\overline{\rho }}( {\varvec{z}}^j )}{\overline{{\mathcal {R}}}} \right] \; \le \; \varrho ^j. \end{aligned}$$

By Markov’s inequality, we get

$$\begin{aligned} \hbox {P}\left( \; 1 - \frac{{\overline{\rho }}( {\varvec{z}}^j )}{\overline{{\mathcal {R}}}} \ge \beta \right) \; \le \; \frac{\varrho ^j}{\beta }, \end{aligned}$$

and a trivial transformation yields

$$\begin{aligned} \hbox {P}\Big (\; {\overline{\rho }}( {\varvec{z}}^j \;)\; \le \; ( 1 - \beta ) \overline{{\mathcal {R}}} \Big )\; \le \; \frac{1}{\beta }\, \varrho ^j. \end{aligned}$$

Hence

$$\begin{aligned} \hbox {P}\Big (\; {\overline{\rho }}( {\varvec{z}}^j \;)\; >\; ( 1 - \beta ) \overline{{\mathcal {R}}} \Big )\; \ge \; 1- \frac{1}{\beta }\, \varrho ^j. \end{aligned}$$

Performing j steps with j such that \( \varrho ^j \le \beta p \) yields an appropriate \( {\widehat{{\varvec{z}}}} = {\varvec{z}}^j \). \(\square \)
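The step count in the last line of the proof can be made concrete. The following minimal Python sketch computes the smallest j with \( \varrho ^j \le \beta p \). The function name `descent_steps` and the sample parameter values are illustrative assumptions, not part of the scheme itself.

```python
import math

def descent_steps(alpha, omega, sigma, beta, p):
    """Smallest j with rho**j <= beta * p, where
    rho = 1 - alpha / (omega * (sigma**2 + 1)) as in the proof of Corollary 11."""
    rho = 1.0 - alpha / (omega * (sigma ** 2 + 1.0))
    if not (0.0 < rho < 1.0 and 0.0 < beta < 1.0 and 0.0 < p < 1.0):
        raise ValueError("need 0 < rho < 1, 0 < beta < 1 and 0 < p < 1")
    target = beta * p
    # Closed-form guess, then adjust in both directions for rounding errors.
    j = max(1, math.ceil(math.log(target) / math.log(rho)))
    while rho ** j > target:
        j += 1
    while j > 1 and rho ** (j - 1) <= target:
        j -= 1
    return j

# Illustrative values (assumed): alpha = omega = 1 and sigma = 1 give rho = 0.5.
steps = descent_steps(alpha=1.0, omega=1.0, sigma=1.0, beta=0.5, p=0.1)
```

Since the count is proportional to \( - \log ( \beta \, p ) \), halving p costs only a constant number of additional steps.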

Remark 12

Gradients of the function \( - {\overline{\rho }}( {\varvec{z}} ) \) have the form \( \nabla \phi ( {\varvec{z}} ) - \overline{{\varvec{u}}} \). The further the column generation procedure progresses, the smaller the norm \( \Vert \nabla \phi ( \overline{{\varvec{z}}} ) - \overline{{\varvec{u}}} \Vert \) gets (see Observation 4(c)).

To satisfy the requirement (20) on variance, better and better estimates are needed. We control accuracy according to Assumption 3.

3.1 Bounding the optimality gap and reliability considerations for the randomized column generation scheme

By analogy with the deterministic scheme, let

$$\begin{aligned} \overline{{\mathcal {B}}} := \frac{1}{1 - \beta } {\overline{\rho }}\left( {\widehat{{\varvec{z}}}} \right) , \end{aligned}$$
(28)

with the \( \beta \) and \( {\widehat{{\varvec{z}}}} \) of Corollary 11. Concerning the gap between the respective optima of the model problem (10) and the original convex problem (3), the reliability

$$\begin{aligned} \hbox {P}\left( \, \overline{{\mathcal {B}}}\, \ge \, \hbox {`gap'} \,\right) \end{aligned}$$
(29)

is at least \( 1 - p \) with the p of Corollary 11.

Assume that our initial model included the columns \( {\varvec{z}}_0, \ldots , {\varvec{z}}_{\iota } \). In the course of the column generation scheme, we select further columns according to Corollary 11, with gradient estimates generated independently. Let the parameters \( \sigma \) and \( \beta \) be fixed for the whole scheme, e.g., set \( \beta = 0.5 \). On the other hand, we keep increasing the reliability of the individual steps during the process, i.e., let \( p = p_{\kappa }\;\; ( \kappa = \iota +1, \iota +2, \ldots ) \) decrease with \(\kappa \).

Example 13

Given the number \(\iota \) of the initial columns, let \( p_{\kappa } = ( \kappa -\iota + 9 )^{-2}\quad ( \kappa = \iota +1, \iota +2, \ldots ) \). Then we have \( \prod _{\kappa =\iota +1}^{\infty }\, ( 1 - p_{\kappa } )\, = 0.9 \). (This is easily proven. We learned it from Szász 1951, Volume II., Chapter X., Section 642).
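The value of the product can be checked by a short telescoping argument. In the sketch below, the index \( k = \kappa - \iota + 9 \) and the truncation level N are introduced for illustration only:

$$\begin{aligned} \prod _{k=10}^{N} \left( 1 - \frac{1}{k^2} \right) \; =\; \prod _{k=10}^{N} \frac{( k-1 )( k+1 )}{k^2} \; =\; \frac{9}{N} \cdot \frac{N+1}{10} \; =\; \frac{9}{10} \cdot \frac{N+1}{N} \;\longrightarrow \; \frac{9}{10} \quad ( N \rightarrow \infty ). \end{aligned}$$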

To achieve reliability \( 1 - p_{\kappa } \) set in Example 13, we need to make \( O( \log \kappa ) \) steps with the stochastic descent method when selecting the column \({\varvec{z}}_{\kappa } \).

We terminate the column generation process when \(\overline{{\mathcal {B}}} \) of (28) gets below the prescribed accuracy. With the setting of Example 13, the terminal bound is correct with a probability at least 0.9, regardless of the number of new columns generated over the course of the procedure.

3.2 On stochastic gradient methods

The aim of this section is to place Theorem 10 and the column generation scheme into the broader context of stochastic gradient methods. The idea of stochastic approximation goes back to Robbins and Monro (1951). Important contributions include Ermoliev (1969), Gaivoronski (1978), Nemirovski and Yudin (1978, 1983), Nesterov (1983, 2009), Ermoliev (1983), Ruszczyński and Syski (1986), Uryasev (1988), Pflug (1988, 1996), Polyak (1990), Polyak and Juditsky (1992), Benveniste et al. (1993), Nemirovski et al. (2009), Lan (2012). The approach is attractive from a theoretical point of view, but early forms might perform poorly in practice. Recent forms combine theoretical depth with practical effectiveness.

We consider the problem

$$\begin{aligned} \min \; f( {\varvec{x}} ) \quad \hbox {subject to}\quad {\varvec{x}} \in X, \end{aligned}$$
(30)

where \( X \subset \mathrm{I}\mathrm{R}^n \) is a convex compact set, and \( f : \mathrm{I}\mathrm{R}^n \rightarrow \mathrm{I}\mathrm{R}\) is a convex differentiable function. Our original problem (1) is easily transformed to this form.

Stochastic gradient methods are iterative, and a starting point \( {\varvec{x}}_1 \in X \) is needed. Let \( {\varvec{x}}_k \in X \) denote the kth iterate, and let \( G_k \) be a random estimate of the corresponding gradient \( {\varvec{g}}_k = \nabla f( {\varvec{x}}_k ) \). Gradient estimates for different iterates are assumed to be based on independent, identically distributed samples. The next iterate is computed as

$$\begin{aligned} {\varvec{x}}_{k+1} = \varPi _X\left( {\varvec{x}}_k - h_k G_k \right) , \end{aligned}$$
(31)

where \( h_k > 0 \) is an appropriate step length, and \( \varPi _X \) denotes projection onto the feasible domain, i.e., \( \varPi _X( {\varvec{x}} ) = \arg \min _{{\varvec{x}}^{\prime } \in X } \Vert {\varvec{x}} - {\varvec{x}}^{\prime } \Vert \).

Methods differ in the construction of gradient estimates and in the determination of step lengths. Establishing an appropriate stopping rule is also a critical issue. Many of the methods apply averaging (like the example method sketched below), and some employ the dual space also.

As a recent example of the stochastic gradient approach, we sketch the robust stochastic approximation method of Nemirovski et al. (2009). It is assumed that, given \( {\varvec{x}} \in X \), realizations of a random vector \( {\varvec{G}} \) can be constructed such that \( \hbox {E}( {\varvec{G}} ) = \nabla f( {\varvec{x}} ) \), and \( \hbox {E}( \Vert {\varvec{G}} \Vert ^2 ) \le M^2 \) holds with a constant M independent of \( {\varvec{x}} \).

Nemirovski et al. (2009) prove different convergence results; from our present point of view, the most relevant one is the following. Suppose that we wish to perform N steps with the above procedure, and set step length to be constant:

$$\begin{aligned} h_k = \frac{\hbox {diag}(X)}{M \sqrt{N}}, \end{aligned}$$
(32)

where \(\hbox {diag}(X) \) is the longest (Euclidean) distance occurring in X. Then we have

$$\begin{aligned} \hbox {E}\Big ( f( \overline{{\varvec{x}}}_N ) \Big ) - {{\mathcal {F}}} \le \frac{M \cdot \hbox {diag}(X)}{ \sqrt{N} }, \end{aligned}$$
(33)

where \( {{\mathcal {F}}} \) denotes the minimum of (30), and

$$\begin{aligned} \overline{{\varvec{x}}}_N = \sum _{j=1}^N \lambda _j^N {\varvec{x}}_j \quad \hbox {with}\quad \lambda _j^N = \frac{h_j}{\sum _{i=1}^N h_i}. \end{aligned}$$
(34)

Our present Assumption 1 is much stronger than mere differentiability, hence the convergence estimate of Theorem 10 is naturally stronger than (33).
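The projected iteration (31) with the constant step length (32) and the averaging (34) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the feasible set X is taken to be a box, the test function is a simple quadratic, and the names `project_box`, `robust_sa` and `noisy_grad` are hypothetical.

```python
import math
import random

def project_box(x, lo, hi):
    """Euclidean projection onto the box [lo, hi]^n (a simple stand-in for X)."""
    return [min(max(xi, lo), hi) for xi in x]

def robust_sa(grad_estimate, x1, lo, hi, M, N, seed=0):
    """N projected steps (31) with the constant step (32); returns the average (34)."""
    rng = random.Random(seed)
    n = len(x1)
    diam = math.sqrt(n) * (hi - lo)      # longest Euclidean distance in the box
    h = diam / (M * math.sqrt(N))        # constant step length, cf. (32)
    x = project_box(x1, lo, hi)
    avg = [0.0] * n
    for _ in range(N):
        # With a constant step length, all weights in (34) equal 1/N.
        avg = [a + xi / N for a, xi in zip(avg, x)]
        g = grad_estimate(x, rng)
        x = project_box([xi - h * gi for xi, gi in zip(x, g)], lo, hi)
    return avg

# Toy problem (assumption): f(x) = 0.5 * ||x - c||^2 over the box [-1, 1]^2,
# whose minimum value is 0 because c lies inside the box.
C = (0.5, -0.25)

def f(x):
    return 0.5 * sum((xi - ci) ** 2 for xi, ci in zip(x, C))

def noisy_grad(x, rng):
    """Unbiased estimate: exact gradient plus zero-mean Gaussian noise."""
    return [xi - ci + rng.gauss(0.0, 0.1) for xi, ci in zip(x, C)]

N_STEPS = 2000
M_BOUND = 2.0  # E||G||^2 <= 1.5^2 + 1.25^2 + 2 * 0.1^2 < 4 on this box
x_avg = robust_sa(noisy_grad, [-1.0, 1.0], -1.0, 1.0, M_BOUND, N_STEPS)
bound = M_BOUND * (2.0 * math.sqrt(2.0)) / math.sqrt(N_STEPS)  # right-hand side of (33)
```

The bound (33) concerns the expectation, so a single run only illustrates it. In this toy setting the realized gap `f(x_avg)` is expected to fall well below `bound`.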

We proved Theorem 10 for unconstrained minimization (over \( \mathrm{I}\mathrm{R}^n \)). In our approach, the constraint \( A {\varvec{x}} \le {\varvec{b}} \) in the convex problem (1) was taken into account through a column generation scheme. Comparing the column generation scheme with the above stochastic gradient approach, a solution of the linear programming model problem (10) is analogous to the iterate averaging (34) and the projection in (31). The analogy is not complete. Having solved the linear programming model problem, we perform an approximate line search instead of a simple translation by \( - h_k G_k \). Larger effort of an individual step in the column generation scheme pays off when gradient estimation is taxing as compared to function value computation.

Having pondered a reviewer comment concerning further combinations of column generation and stochastic gradient schemes, we see a high potential in this approach. Different combinations of the column generation and stochastic gradient schemes may be efficient for functions with different characteristics.

4 Handling a difficult constraint

We work out an approximation scheme for the solution of the convex constrained problem (2). This scheme consists of the solution of a sequence of problems of the form (1), with a tightening stopping tolerance.

We consider the linear constraint set \( A {\varvec{x}} \le {\varvec{b}} \) of problem (1). The last constraint of this set is \( {\varvec{a}}^r {\varvec{x}} \le b_r \), where \( {\varvec{a}}^r \) denotes the rth row of A, and \( b_r \) denotes the rth component of \( {\varvec{b}} \). Assume that this last constraint is a cost constraint, and let \( {\varvec{c}}^T = {\varvec{a}}^r \) denote the cost vector. We consider a parametric form of the cost constraint, namely, \( {\varvec{c}}^T {\varvec{x}} \le d \), where \( d \in \mathrm{I}\mathrm{R}\) is a parameter.

Let \( \breve{A} \) denote the matrix obtained by omitting the rth row in A, and let \( \breve{{\varvec{b}}} \) denote the vector obtained by omitting the rth component in \( {\varvec{b}} \). Using these objects, we consider the problem

$$\begin{aligned} \min \; \phi ( T {\varvec{x}} ) \quad \hbox {subject to}\quad \breve{A} {\varvec{x}} \le \breve{{\varvec{b}}},\quad {\varvec{c}}^T {\varvec{x}} \le d, \end{aligned}$$
(35)

with the parameter \( d \in \mathrm{I}\mathrm{R}\). This parametric form of the unconstrained problem will be denoted by (1\(:\, b_r = d \)).

Let \( \chi ( d ) \) denote the optimal objective value of problem (35), as a function of the parameter d. This is obviously a monotone decreasing convex function. Let \( {{\mathcal {I}}} \subset \mathrm{I}\mathrm{R}\) denote the domain over which the function is finite. We have either \( {{\mathcal {I}}} = \mathrm{I}\mathrm{R}\) or \( {{\mathcal {I}}} = [\, {\underline{d}}, +\infty ) \) with some \( {\underline{d}} \in \mathrm{I}\mathrm{R}\). Using the notation of the unconstrained problem, we say that \( \chi ( d ) \) is the optimum of (1\(:\, b_r = d \)) for \( d \in {{\mathcal {I}}} \).

Coming to the constrained problem (2), we may assume \( \pi \in \chi ( {{\mathcal {I}}} ) \). Let \( d^{\star } \in {{\mathcal {I}}} \) be a solution of the equation \( \chi ( d ) = \pi \), and let \( l^{\star }( d ) \) denote a linear support function to \( \chi ( d ) \) at \( d^{\star } \). In this section we work under the following assumption.

Assumption 14

The support function \( l^{\star }( d ) \) has a significant negative slope, i.e., \( {l^{\star }}^{\prime } \ll 0 \).

From \( {l^{\star }}^{\prime } < 0 \), it follows that the optimal objective value of (2) is \( d^{\star } \). (This slope will be used in estimating the number of Newton steps required to reach a prescribed accuracy; see Corollary 19, below. That is why we need it to be significantly negative.)

Remark 15

Assumption 14 is reasonable if the right-hand-side value \( \pi \) has been set by an expert, on the basis of preliminary experimental information. (A near-zero slope \( {l^{\star }}^{\prime } \) means that a slight relaxation of the probabilistic constraint allows a significant cost reduction.)

We find a near-optimal \( {\widehat{d}} \in {{\mathcal {I}}} \) using an approximate version of Newton’s method. The idea of regulating tolerances in such a procedure occurs in the discussion of the Constrained Newton Method in Lemaréchal et al. (1995). Based on the convergence proof of the Constrained Newton Method, a simple convergence proof of Newton’s method was reconstructed in Fábián et al. (2015). We adapt the latter to the present case.

First, we describe a deterministic approximation scheme. A randomized version is worked out in Sect. 4.2.

4.1 A deterministic approximation scheme

Let Assumptions 1 and 2 hold. A sequence of unconstrained problems (1\(:\, b_r = d_{\ell } \))\(\;\; ( \ell = 1,2,\ldots ) \) is solved with increasing accuracy. Over the course of this procedure, we build a single model \( \phi _k( {\varvec{z}} ) \) of the nonlinear objective \( \phi ( {\varvec{z}} ) \), i.e., k is ever increasing. Columns added during the solution of (1\(:\, b_r = d_{\ell } \)) are retained in the model and reused in the course of the solution of (1\(:\, b_r = d_{\ell +1} \)).

Given the \( \ell \)th iterate \( d_{\ell } \in {{\mathcal {I}}} \), we need to estimate \( \chi ( d_{\ell } ) \) with a prescribed accuracy. This is done by performing a column generation scheme with the master problem (10\(:\, b_r = d_{\ell } \)). Let \( \overline{{\mathcal {B}}}_{\ell } \) denote an upper bound on the gap between the respective optima of the model problem (10\(:\, b_r = d_{\ell } \)) and the convex problem (1\(:\, b_r = d_{\ell } \)). Such a bound is constructed according to the expression (18).

Let moreover \( {\overline{\chi }}_{\ell } \) denote the optimum of the model problem. With these objects we have

$$\begin{aligned} {\overline{\chi }}_{\ell }\; \ge \; \chi ( d_{\ell } )\; \ge \; {\overline{\chi }}_{\ell } - \overline{{\mathcal {B}}}_{\ell }. \end{aligned}$$
(36)

The column generation process with the master problem (10\(:\, b_r = d_{\ell } \)) is terminated if \( {\overline{\chi }}_{\ell } \) and \( \overline{{\mathcal {B}}}_{\ell } \) satisfy a stopping condition, to be discussed below.

Let \( d_0, d_1 \in {{\mathcal {I}}},\; d_0< d_1 < d^{\star } \) be the starting iterates. The sequence of the iterates will be strictly monotone increasing, and converging to \( d^{\star } \) from below.

4.1.1 Near-optimality condition for the constrained problem

Given a tolerance \( \epsilon \;\; ( \pi \gg \epsilon > 0 ) \), let \( {\widehat{d}} \in {{\mathcal {I}}} \) be such that

$$\begin{aligned} {\widehat{d}} \le d^{\star } \quad \hbox {and}\quad \chi ( {\widehat{d}} \;) \le \pi + \epsilon . \end{aligned}$$
(37)

Let \( \widehat{{\varvec{x}}} \) be an optimal solution of (35\(:\, d = {\widehat{d}} \;\)). Then \( \widehat{{\varvec{x}}} \) is an \( \epsilon \)-feasible solution of (2) with objective value \( {\widehat{d}} \). Exact feasible solutions of (2) have objective values not less than \( d^{\star } \ge {\widehat{d}} \).

4.1.2 Stopping condition for the unconstrained subproblem

Let \( \delta \;\; ( 0 < \delta \ll \frac{1}{2} ) \) denote a fixed tolerance. (We can set e.g., \( \delta = 0.25 \) for the whole process).

Given iterate \( d_{\ell } \in {{\mathcal {I}}},\; d_{\ell } \le d^{\star } \), we perform a column generation scheme with the master problem (10\(:\, b_r = d_{\ell } \)). The process is terminated if either

$$\begin{aligned} \begin{array}{ll} (i)\quad &{} {\overline{\chi }}_{\ell } - \pi \; \le \epsilon ,\quad \hbox {or} \\ \\ (ii) &{} \overline{{\mathcal {B}}}_{\ell }\; \le \; \delta \, ( {\overline{\chi }}_{\ell } - \pi ) \end{array} \end{aligned}$$
(38)

holds. Taking into account (36), we conclude:

If (i) occurs then \( {\widehat{d}} := d_{\ell } \) satisfies the near-optimality condition (37), and the Newton-like procedure stops.

If (ii) occurs then \( {\overline{\chi }}_{\ell } \) satisfies

$$\begin{aligned} {\overline{\chi }}_{\ell }\; \ge \; \chi ( d_{\ell } )\; \ge \; {\overline{\chi }}_{\ell } - \delta \, ( {\overline{\chi }}_{\ell } - \pi ). \end{aligned}$$
(39)

A new iterate will be constructed in the latter case.

4.1.3 Finding successive iterates

Given \( \ell \ge 1 \), assume that we have bounded \( \chi ( d_{\ell -1} ) \) and \( \chi ( d_{\ell } ) \), as in (39). The graph of the function \( \chi ( d ) \) is shown in Fig. 1. Thick segments of the vertical lines \( d = d_{\ell -1} \) and \( d = d_{\ell } \) indicate confidence intervals for the function values \( \chi ( d_{\ell -1} ) \) and \( \chi ( d_{\ell } ) \), respectively. Let \( l_{\ell } : \mathrm{I}\mathrm{R}\rightarrow \mathrm{I}\mathrm{R}\) be the linear function determined by the upper endpoint of the former interval, and the lower endpoint of the latter one. Formally,

$$\begin{aligned} l_{\ell }( d_{\ell -1} ) := {\overline{\chi }}_{\ell -1}\; \ge \chi ( d_{\ell -1} ) \quad \hbox {and}\quad l_{\ell }( d_{\ell } ) := {\overline{\chi }}_{\ell } - \delta \, ( {\overline{\chi }}_{\ell } - \pi )\; \le \chi ( d_{\ell } ), \end{aligned}$$
(40)

where the inequalities follow from (39).

Due to the convexity of \( \chi ( d ) \) and to Assumption 14, the linear function \( l_{\ell }( d ) \) obviously has a negative slope \( l_{\ell }^{\prime } \le {l^{\star }}^{\prime } \ll 0 \). Moreover \( l_{\ell }( d ) \le \chi ( d ) \) holds for \( d_{\ell } \le d \).

The next iterate \( d_{\ell +1} \) will be the point satisfying \( l_{\ell }( d_{\ell +1} ) = \pi \). Of course \( d_{\ell } < d_{\ell +1} \le d^{\star } \) follows from the observations above.
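The construction of the next iterate, together with stopping condition (i) of (38), can be sketched as follows. The sketch rests on assumptions made for illustration only: the toy function \( \chi ( d ) = e^{-d} \), an oracle `chi_bar` that returns the model optimum \( {\overline{\chi }}_{\ell } \) exactly, and the names `next_iterate` and `newton_like`. In the actual scheme the value \( {\overline{\chi }}_{\ell } \) comes from the column generation process.

```python
import math

def next_iterate(d_prev, chi_bar_prev, d_cur, chi_bar_cur, pi, delta):
    """Root of the line l through (d_prev, chi_bar_prev) and
    (d_cur, chi_bar_cur - delta * (chi_bar_cur - pi)), cf. (40)."""
    lower_cur = chi_bar_cur - delta * (chi_bar_cur - pi)
    slope = (lower_cur - chi_bar_prev) / (d_cur - d_prev)
    if slope >= 0.0:
        raise ValueError("the line l must have a negative slope")
    # Solve lower_cur + slope * (d - d_cur) = pi for d.
    return d_cur + (pi - lower_cur) / slope

def newton_like(chi_bar, d0, d1, pi, eps, delta=0.25, max_steps=100):
    """Iterate until stopping condition (i) holds; returns (d_hat, number of steps)."""
    d_prev, d_cur = d0, d1
    for steps in range(max_steps):
        if chi_bar(d_cur) - pi <= eps:          # condition (i) in (38)
            return d_cur, steps
        d_next = next_iterate(d_prev, chi_bar(d_prev), d_cur, chi_bar(d_cur), pi, delta)
        d_prev, d_cur = d_cur, d_next
    raise RuntimeError("no convergence within max_steps")

# Toy data (assumption): chi(d) = exp(-d), pi = 0.1, hence d_star = log(10).
toy_chi = lambda d: math.exp(-d)
d_hat, n_steps = newton_like(toy_chi, d0=0.0, d1=0.5, pi=0.1, eps=1e-3)
```

In this toy run the iterates increase monotonically and stay below \( d^{\star } = \log 10 \), as the observations above predict.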

Fig. 1. The graph of the function \(\chi ( d )\), and the construction of the next iterate

Remark 16

In a Newton-like scheme, the selection of the starting iterates strongly affects efficiency. Let us first consider the selection of \( d_1 \). An expert familiar with the model may easily set a budget slightly overtight. In the absence of such an expert, we can resort to heuristics, evaluating \( \chi ( d ) \) in a set of test points.

Once \( d_1 \) has been set, we can consider \( d_0 \). A good choice for \( d_0 \) is one that results in a large \( d_2 \). We have to take into account the accuracy of our evaluation of \( \chi ( d_1 ) \) on the one hand, and the slope of \( \chi ( d ) \) on the other hand. A possible way of organizing the selection process is the following. First, we evaluate \( \chi ( d_1 ) \) by solving the problem (35\(: d = d_1 \)). In the course of the solution, we build a model of the objective function. This model can then be used to estimate the slope of \( \chi ( d ) \).

4.1.4 Convergence

Let the iterates \( d_0, d_1, \ldots , d_{s} \) and the linear functions \( l_1( d ), \ldots , l_{s}( d ) \) be as defined above. We assume that \( s > 1 \), and the procedure did not stop before step \( ( s + 1 ) \). Then we have

$$\begin{aligned} {\overline{\chi }}_{\ell } - \pi > \epsilon \quad ( \ell = 0, 1, \ldots , s ). \end{aligned}$$
(41)

To simplify the notation, we introduce the linear functions \( L_{\ell }( d ) := l_{\ell }( d ) - \pi \;\; ( \ell = 1, \ldots , s )\). With these, (40) transforms into

$$\begin{aligned} L_{\ell }( d_{\ell -1} ) = {\overline{\chi }}_{\ell -1} - \pi \quad \hbox {and}\quad L_{\ell }( d_{\ell } ) = ( 1 - \delta ) ( {\overline{\chi }}_{\ell } - \pi ) \quad ( \ell = 1, \ldots , s ). \end{aligned}$$
(42)

Positivity of the above function values follows from (41). Moreover, the derivatives satisfy

$$\begin{aligned} L_{\ell }^{\prime } = l_{\ell }^{\prime }\le {l^{\star }}^{\prime }\; \ll 0 \quad ( \ell = 1, \ldots , s ) \end{aligned}$$
(43)

due to the observations in the previous section.

Theorem 17

We have

$$\begin{aligned} \gamma ^{s-1} \cdot \frac{| L_1^{\prime } |}{| {l^{\star }}^{\prime } |} \cdot L_1( d_1 ) \ge \; L_s( d_s ) \quad \hbox {with}\quad \gamma := \left( \frac{1}{2 ( 1 - \delta )} \right) ^ 2. \end{aligned}$$
(44)

Proof

The following statements hold for \( \ell = 1, \ldots , s-1 \). From (42), we get

$$\begin{aligned} \frac{L_{\ell +1}( d_{\ell } )}{L_{\ell }( d_{\ell } )}\; =\; \frac{{\overline{\chi }}_{\ell } - \pi }{( 1 - \delta ) ( {\overline{\chi }}_{\ell } - \pi )}\; =\; \frac{1}{1 - \delta }. \end{aligned}$$
(45)

By definition, we have

$$\begin{aligned} L_{\ell }( d_{\ell } ) + ( d_{\ell +1} - d_{\ell } )\, L_{\ell }^{\prime }\; =\; L_{\ell }( d_{\ell +1} )\; = 0. \end{aligned}$$

It follows that \(\; d_{\ell +1} - d_{\ell } = \frac{L_{\ell }( d_{\ell } )}{| L_{\ell }^{\prime } |} \). Using this, we get

$$\begin{aligned} L_{\ell +1}( d_{\ell } ) \; =\; L_{\ell +1}( d_{\ell +1} ) + ( d_{\ell } - d_{\ell +1} )\, L_{\ell +1}^{\prime } \; =\; L_{\ell +1}( d_{\ell +1} ) + \frac{L_{\ell }( d_{\ell } )}{| L_{\ell }^{\prime } |}\, | L_{\ell +1}^{\prime } |. \end{aligned}$$

Hence

$$\begin{aligned} \frac{L_{\ell +1}( d_{\ell } )}{L_{\ell }( d_{\ell } )}\; =\; \frac{L_{\ell +1}( d_{\ell +1} )}{L_{\ell }( d_{\ell } )} + \frac{| L_{\ell +1}^{\prime } |}{| L_{\ell }^{\prime } |}. \end{aligned}$$
(46)

Combining (45) and (46), we have

$$\begin{aligned} \frac{1}{1 - \delta }\; =\; \frac{L_{\ell +1}( d_{\ell +1} )}{L_{\ell }( d_{\ell } )} + \frac{| L_{\ell +1}^{\prime } |}{| L_{\ell }^{\prime } |} \; \ge \; 2\; \sqrt{ \frac{L_{\ell +1}( d_{\ell +1} )\, | L_{\ell +1}^{\prime } |}{L_{\ell }( d_{\ell } )\, | L_{\ell }^{\prime } |} }. \end{aligned}$$

(This is the well-known inequality between means.) It follows that

$$\begin{aligned} \left( \frac{1}{2 ( 1 - \delta )} \right) ^2\; L_{\ell }( d_{\ell } )\, | L_{\ell }^{\prime } | \; \ge \; L_{\ell +1}( d_{\ell +1} )\, | L_{\ell +1}^{\prime } |. \end{aligned}$$
(47)

By induction, we get

$$\begin{aligned} \left( \frac{1}{2 ( 1 - \delta )} \right) ^{2(s-1)}\; L_1( d_1 )\, | L_1^{\prime } | \; \ge \; L_{s}( d_{s} )\, | L_{s}^{\prime } |. \end{aligned}$$
(48)

Applying \( | L_{s}^{\prime } | \ge | {l^{\star }}^{\prime } | \) we obtain (44). \(\square \)

Example 18

Let \( \delta = 0.25 \), then \( \gamma = ( \frac{1}{2 ( 1 - \delta )} )^2\; < 0.5 \).

Corollary 19

With the setting of Example 18, the number of Newton-like steps needed to reach the stopping tolerance \( \epsilon \) does not exceed

$$\begin{aligned} N( \epsilon ) =\; \log \left( \frac{| L_1^{\prime } |}{| {l^{\star }}^{\prime } |} \cdot \frac{L_1( d_1 )}{\epsilon } \right) . \end{aligned}$$
(49)

Note that \( | {l^{\star }}^{\prime } | \gg 0 \) due to Assumption 14.

Given a problem, let us consider the effort of its approximate solution, measured in Newton-like steps, as a function of the prescribed accuracy \(\epsilon \). According to (49), this effort is on the order of \(\; \log \frac{1}{\epsilon } \).

4.2 A randomized version of the approximation scheme

Let Assumptions 1 and 3 hold.

Concerning the function \( \chi (d) \), let Assumption 14 hold. Our aim, in principle, is the same as it has been in the deterministic case: find \( {\widehat{d}} \in {{\mathcal {I}}} \) such that \( \pi + \epsilon \ge \chi ( {\widehat{d}} ) \ge \pi \) holds with a pre-set tolerance \( \epsilon \). In the present uncertain environment, however, we may have to content ourselves with \( {\widehat{d}} \) such that \( \pi + \epsilon \ge \chi ( {\widehat{d}} ) > \pi - \epsilon \) holds. This problem statement is justifiable if the function \( \chi (d) \) is not constant for \( d > d^{\star } \). Let Assumption 20, below, hold.

Assumption 20

For our stopping tolerance \( \epsilon \), there exists (an unknown) \(d^{\star }_{\epsilon } \in {{\mathcal {I}}} \) such that \(\chi (d^{\star }_{\epsilon } ) = \pi - \epsilon \).

Let \( q\; ( 0.5 \ll q < 1 ) \) denote a pre-set reliability. Using the randomized column generation scheme, a sequence of unconstrained problems (1\(:\, b_r = d_{\ell } \))\(\;\; ( \ell = 1,2,\ldots ) \) is solved, each with reliability q, and with an accuracy determined by the Newton-like approximation scheme. As in the deterministic case, we build a single model \( \phi _k( {\varvec{z}} ) \) of the nonlinear objective \( \phi ( {\varvec{z}} ) \), i.e., k is ever increasing. Let \( k_{\ell -1} \) denote the number of columns at the outset of the solution of problem (1\(:\, b_r = d_{\ell } \)).

Given the \( \ell \)th iterate \( d_{\ell } \in {{\mathcal {I}}} \), we estimate \( \chi ( d_{\ell } ) \) by performing a column generation scheme with the master problem (10\(:\, b_r = d_{\ell } \)). Applying the procedure of Sect. 3.1, we obtain an estimate \( \overline{{\mathcal {B}}}_{\ell } \) for the gap between the respective optima of the model problem (10\(:\, b_r = d_{\ell } \)) and the convex problem (1\(:\, b_r = d_{\ell } \)). Keeping to the setting of Example 13, we set the reliability parameter to \( q = 0.9 \), obtaining \( \hbox {P}(\, \overline{{\mathcal {B}}}_{\ell }\, \ge \, \hbox {`gap'} \,)\, \ge 0.9 \). (Note that the columns with indices up to \( k_{\ell -1} \) belong to the initial model, hence in terms of Sect. 3.1, we have \( \iota = k_{\ell -1} \).)

Let moreover \( {\overline{\chi }}_{\ell } \) denote the optimum of the model problem. With these objects we have

$$\begin{aligned} {\overline{\chi }}_{\ell }\; \ge \; \chi ( d_{\ell } ) \quad \hbox {and}\quad \hbox {P}\left( \, \chi ( d_{\ell } )\, \ge \, {\overline{\chi }}_{\ell } - \overline{{\mathcal {B}}}_{\ell } \,\right) \; \ge 0.9. \end{aligned}$$
(50)

We proceed in accordance with the deterministic scheme. The present stochastic scheme actually coincides with the deterministic one, provided the gap is estimated correctly in the unconstrained problem. In the stochastic scheme, however, we may underestimate the gap, meaning that \( \overline{{\mathcal {B}}}_{\ell } \) is not an upper bound. Consequently the inequality \( \chi ( d_{\ell } )\, \ge \, {\overline{\chi }}_{\ell } - \overline{{\mathcal {B}}}_{\ell } \) may not hold in (50). In such a case, \( d_{\ell +1} > d^{\star } \) and hence \( {\overline{\chi }}_{\ell +1} < \pi \) may occur. If the latter is observed, then we step back to the previous iterate, i.e., set \( d_{\ell +2} = d_{\ell } \). We then carry on with the Newton-like procedure, first re-solving the model problem (10\(:\, b_r = d_{\ell +2} \)) with reliability \( q = 0.9 \).

4.2.1 Stopping condition for the unconstrained subproblem

In accordance with the above discussion, we now formulate the stopping condition of the column generation process at the Newton-like step \( \ell \). Solution with the master problem (10\(:\, b_r = d_{\ell } \)) is terminated if \( {\overline{\chi }}_{\ell } \) and \( \overline{{\mathcal {B}}}_{\ell } \) satisfy one of the following conditions:

$$\begin{aligned} \begin{array}{ll} (\alpha )\quad &{} {\overline{\chi }}_{\ell }< \pi ,\\ \\ (\beta ) &{} \pi \le {\overline{\chi }}_{\ell } < \pi + \epsilon \quad \hbox {and}\quad \overline{{\mathcal {B}}}_{\ell } \le \epsilon , \\ \\ (\gamma ) &{} \pi + \epsilon \le {\overline{\chi }}_{\ell } \quad \hbox {and}\quad \overline{{\mathcal {B}}}_{\ell }\; \le \; \delta \, ( {\overline{\chi }}_{\ell } - \pi ). \end{array} \end{aligned}$$
(51)

If condition (\(\alpha \)) occurs, then we step back to the previous iterate \( d_{\ell -1} \).

If condition (\(\beta \)) occurs, then we stop the Newton-like process.

If condition (\(\gamma \)) occurs, then we carry on to a new iterate \(d_{\ell +1} > d_{\ell } \), like we did in the deterministic scheme.
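The three-way test (51) and the resulting actions can be written as a small decision function. This is a minimal sketch; the function name `classify_subproblem` and the string labels are illustrative assumptions.

```python
def classify_subproblem(chi_bar, gap_bound, pi, eps, delta=0.25):
    """Return the case of (51) that holds for the current model optimum chi_bar
    and gap estimate gap_bound, or None if column generation should continue."""
    if chi_bar < pi:
        return "alpha"      # step back to the previous iterate
    if chi_bar < pi + eps:
        # Case (beta) additionally requires a gap estimate of at most eps.
        return "beta" if gap_bound <= eps else None
    if gap_bound <= delta * (chi_bar - pi):
        return "gamma"      # carry on to a new iterate
    return None
```

A return value of `None` means that none of the conditions holds yet, so further columns are generated before the test is repeated.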

4.2.2 Convergence and reliability

Let the unconstrained subproblems each be solved with a reliability of \( q = 0.9 \), and let \( \delta , \gamma \) be set according to Example 18. Moreover, let us assume that the randomized Newton-like scheme did not stop in L steps. The aim of this section is to show that, provided L is large enough, an \(\epsilon \)-optimal solution of the constrained problem has been reached with a high probability.

According to our assumption, case \((\beta )\) did not occur in the stopping condition of the previous section. Let us define ‘correct’ and ‘incorrect’ steps, depending on the starting point \(d_{\ell }\):

  • In case \( d_{\ell } \le d^{\star } \): We call step \( \ell \) correct if \(\; d_{\ell +1} \le d^{\star } \;\) and \(\; 0.5 \cdot L_{\ell }( d_{\ell } )\, | L_{\ell }^{\prime } | \ge L_{\ell +1}( d_{\ell +1} )\, | L_{\ell +1}^{\prime } | \;\) also holds, otherwise we call step \( \ell \) incorrect.

  • In case \( d_{\ell } > d^{\star } \): We call step \( \ell \) correct if a backstep occurs (i.e., if \( d_{\ell +1} = d_{\ell -1} \)), otherwise we call it incorrect.

A step is correct with a probability at least \( q = 0.9 \); this follows from the proof of Theorem 17, namely the expression (47).

If the difference between the number of the correct steps and the number of the incorrect steps exceeds \( N( \epsilon ) \), then an \( \epsilon \)-optimal solution of the constrained problem has been reached, according to Corollary 19.

Let \( Z_{\ell } \) be the random variable

$$\begin{aligned} Z_{\ell } = \left\{ \begin{array}{ll} 0 \quad &{} \hbox {if step }\ell \hbox { is correct}, \\ \\ 1 &{} \hbox {if step } \ell \hbox { is incorrect} \end{array} \right. \quad \quad \quad ( \ell = 1, \ldots , L ). \end{aligned}$$

As a step is correct with a probability at least \( q = 0.9 \), we have \( \hbox {E}( Z_{\ell } ) \le 0.1 \), and hence \( \hbox {E}( \sum _{\ell =1}^L Z_{\ell } ) \le 0.1 L \).

The difference between the number of the correct steps and the number of the incorrect steps is \(\; L - 2 \sum _{\ell =1}^L Z_{\ell } \). In order to show that the difference likely exceeds \( N( \epsilon ) \), we need an upper bound on the probability that \( \sum _{\ell =1}^L Z_{\ell } \) is significantly larger than \( \hbox {E}( \sum _{\ell =1}^L Z_{\ell } ) \).

Though all the gradient estimates were generated independently, there may be some interdependence among the random variables \( Z_1, \ldots , Z_L \), because of the time structure of the process. But this interdependence is weak in the following sense. Suppose that we are at the beginning of the process. Given \( 0 < k \le L \), we know that step k will be correct with a probability at least 0.9, no matter what happens in steps \( 1, \ldots , k-1 \). In particular,

$$\begin{aligned} \hbox {P}\big [\; Z_{k} = 1 \; \big |\; Z_{\ell } = 1\; ( \ell \in {{\mathcal {I}}}_k ) \; \big ]\, \le 0.1 \quad \hbox {holds for every}\;\; {{\mathcal {I}}}_k \subseteq \{ 1, \ldots , k-1 \}. \end{aligned}$$
(52)

The condition in the above probability represents the event that \( Z_{\ell } = 1 \) occurs for every \( \ell \in {{\mathcal {I}}}_k \). In case \( k = 1 \), the condition is empty, and (52) reduces to \( \hbox {P}( Z_{1} = 1 ) \le 0.1 \).

Generalized Chernoff–Hoeffding bounds were proposed by Panconesi and Srinivasan (1997). Intuitive proofs of such bounds, based on a simple combinatorial argument, were given by Impagliazzo and Kabanets (2010). (In this latter paper, concentration bounds are also explained in terms of successes of random experiments, just as in our present situation.) We are going to use a Chernoff-type bound, Theorem 1.1 in Impagliazzo and Kabanets (2010):

Theorem 21

Let \( Z_1, \ldots , Z_n \) be Boolean random variables such that, for some \( p \in [ 0, 1 ] \),

$$\begin{aligned} \hbox {P}\left[ \, Z_{\ell } = 1\; ( \ell \in A ) \,\right] \; \le \; p^{|A|} \quad \hbox {holds for every}\;\; A \subseteq \{ 1, \ldots , n \}, \end{aligned}$$
(53)

where |A| denotes the cardinality of A.

Then, for any \( \kappa \in [ p, 1 ] \), we have

$$\begin{aligned} \hbox {P}\left[ \, \sum _{{\ell } = 1}^n Z_{\ell } \,\ge \kappa n \,\right] \; \le \; e^{-n D( \kappa || p )}, \end{aligned}$$
(54)

where \( D(\cdot ||\cdot ) \) is the relative entropy function, satisfying \(D( \kappa || p ) \ge 2( \kappa - p )^2 \).

It is easy to see that our objects satisfy the precondition (53) with \( p = 0.1 \). Indeed, it follows from the repeated application of (52). A formal proof may apply induction on n. For \( n = 1 \), we have \( \hbox {P}( Z_{1} = 1 ) \le 0.1 \). Now let us assume that (53) holds for \( 1 \le n < k \). The statement for \( n = k \) follows from (52), by setting \( {{\mathcal {I}}}_k = A \cap \{ 1, \ldots , k-1 \} \).

As the precondition of Theorem 21 holds, we have (54) with \( n = L,\, p = 0.1 \) and \( \kappa = 1/3 \). Simple computation shows that, for \( L \ge 22 \),

$$\begin{aligned} \hbox {P}\left[ \, \sum _{{\ell } = 1}^L Z_{\ell }\; < \frac{1}{3}\, L \,\right] \; \ge 0.9. \end{aligned}$$
(55)
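The threshold \( L \ge 22 \) can be checked numerically. The following is a minimal sketch, assuming the bound (54) is applied through the weaker estimate \( D( \kappa || p ) \ge 2( \kappa - p )^2 \); the function names are illustrative and not part of the original procedure.

```python
import math


def relative_entropy(kappa: float, p: float) -> float:
    """Binary relative entropy D(kappa || p), for 0 < p <= kappa < 1."""
    return (kappa * math.log(kappa / p)
            + (1.0 - kappa) * math.log((1.0 - kappa) / (1.0 - p)))


def min_steps(failure: float, exponent: float) -> int:
    """Smallest integer L such that exp(-L * exponent) <= failure."""
    return math.ceil(math.log(1.0 / failure) / exponent)


p, kappa, failure = 0.1, 1.0 / 3.0, 0.1

# Exponent obtained from the quadratic lower bound on the relative entropy.
quadratic_exponent = 2.0 * (kappa - p) ** 2
L_quadratic = min_steps(failure, quadratic_exponent)

# Exponent obtained from the relative entropy itself (a sharper variant).
L_entropy = min_steps(failure, relative_entropy(kappa, p))

print(L_quadratic, L_entropy)
```

With the quadratic estimate the computation reproduces the value 22 used in the text; evaluating the relative entropy directly would permit a smaller L, but the more conservative value is retained here.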

As we have seen, the difference between the number of the correct steps and the number of the incorrect steps is \(\; L - 2 \sum _{\ell =1}^L Z_{\ell } \), which exceeds L / 3 if the event \( \sum _{\ell =1}^L Z_{\ell } < \frac{1}{3} L \) in (55) occurs. We sum up the discussion in

Proposition 22

Let the unconstrained problems each be solved with a reliability of \( q = 0.9 \); let \( \delta , \gamma \) be set according to Example 18; and let \( L = \max \{\, 22,\, 3 N( \epsilon )\, \} \) with \( N( \epsilon ) \) defined in Corollary 19.

Assume that the randomized Newton-like scheme did not stop in L steps. Then an \( \epsilon \)-optimal solution of the constrained problem has been reached with a probability at least 0.9.

Remark 23

If case \( (\beta ) \) occurred in the stopping condition of the previous section, then further checks are needed to ensure reliability.

Remark 24

The stopping tolerance prescribed for the unconstrained subproblems is tightened continually as the Newton-like approximation scheme progresses. However, the prescribed tolerance is never tighter than \( \delta \cdot \epsilon = 0.25 \epsilon \).

5 Adapting the approach to probabilistic problems

In this section we consider \( \phi ( {\varvec{z}} ) = -\log F( {\varvec{z}} ) \) with a nondegenerate n-dimensional standard normal distribution function \(F({\varvec{z}})\). Assumption 1 does not hold with such a function. However, as illustrated in Fábián et al. (2018), Assumption 1 holds over any bounded ball around the origin. (The ratio \(\frac{\alpha }{\omega } \) decreases as the radius of the ball increases.) Moreover, a construction was sketched in Fábián et al. (2018) that limits the column generation process to a ball of sufficiently large radius. That construction does not yield usable estimates for the values \( \alpha \) and \( \omega \).

When applied to probabilistic problems, we look on Corollaries 6 and 11 merely as a means of justification of the efficiency of the procedure. The gap between the respective optima of the model problem and the original probabilistic problem is measured by different means, to be described presently. In this setting we may perform just a single line search in each column generation problem.

In order to apply the procedures described in the previous sections, we need Assumption 3 to hold, with the relaxation that function values \( \phi ( {\varvec{z}} ) \) are computed with a high accuracy (instead of exactly). In the present case of a probabilistic \( \phi ( {\varvec{z}} ) \), high-precision computation of \( \log F( {\varvec{z}} ) \) is impractical at points \({\varvec{z}} \) with a low \( F( {\varvec{z}} ) \). Hence we need a technical assumption that helps keep the process in a region where high-precision computation of \( \phi ( {\varvec{z}} ) \) is possible.

Assumption 25

A significantly high probability can be achieved. Specifically, a feasible point \( \check{{\varvec{z}}} \) is known such that \(F(\check{{\varvec{z}}} ) \ge 0.5 \).

By including \( \check{{\varvec{z}}} \) of Assumption 25 among the initial columns of the master problem, we always have \(F(\overline{{\varvec{z}}} ) \ge 0.5 \) with the current solution \(\overline{{\varvec{z}}} \) defined in (12). Hence \(\phi (\overline{{\varvec{z}}} ) \) can be computed with a high accuracy.

We perform a single line search in each column generation subproblem, starting always from the current \( \overline{{\varvec{z}}} \). It means that a high-quality estimate can be generated for the gradient, which designates the direction of the line search. Once the direction of the search is determined, we only work with function values (there is no need for any further gradient information in the current column generation subproblem). The line search is performed with a high accuracy over the region \({{\mathcal {L}}}( F,\, 0.5 ) = \{\, {\varvec{z}} \,|\, F( {\varvec{z}} ) \ge 0.5 \,\} \) which includes the optimal solution of the probability maximization problem (3).

We can carry on with the line search even if we have left the safe region \( {{\mathcal {L}}}( F,\, 0.5 ) \). Given a point \( \hat{{\varvec{z}}} \) along the search ray, let \( {\hat{p}} > 0 \) be such that \( {\hat{p}} \le F( \hat{{\varvec{z}}} ) \) holds almost surely. (Simulation procedures generally provide a confidence interval together with an estimate.) If the vector \( \hat{{\varvec{z}}} \) is to be included in the master problem (10) as a new column, then we set the corresponding cost coefficient as \( \phi = - \log {\hat{p}} \). Under such an arrangement, our model remains consistent, i.e., the model function \( \phi _k( {\varvec{z}} ) \) is almost surely an inner approximation of the probabilistic function \( \phi ( {\varvec{z}} ) \).

5.1 A bounded formulation

Exploiting monotonicity of the function \( \phi ( {\varvec{z}} ) = -\log F( {\varvec{z}} ) \), the unconstrained problem with variable splitting is formulated with inequality between \( {\varvec{z}} \) and \( T {\varvec{x}} \):

$$\begin{aligned} \min \; \phi ( {\varvec{z}} ) \quad \hbox {subject to}\quad A {\varvec{x}} - {\varvec{b}} \le {\varvec{0}},\;\; {\varvec{z}} - T{\varvec{x}} \le {\varvec{0}}. \end{aligned}$$
(56)

A further speciality of the normal distribution function is the existence of a bounded box \( {{\mathcal {Z}}} \) outside which the probability weight can be ignored. Including the constraint \( {\varvec{z}} \in {{\mathcal {Z}}} \) in (56) results in a closely approximating problem:

$$\begin{aligned} \min \; \phi ( {\varvec{z}} ) \quad \hbox {subject to}\quad A {\varvec{x}} - {\varvec{b}} \le {\varvec{0}},\;\; {\varvec{z}} - T{\varvec{x}} \le {\varvec{0}},\;\; {\varvec{z}} \in {{\mathcal {Z}}}. \end{aligned}$$
(57)

Observation 26

The difference between the respective optima of problems (56) and (57) is insignificant.

Proof

Let \( {\varvec{z}} \) be a part of a feasible solution of (56), and let us consider the box \( ( {\varvec{z}} + {{\mathcal {N}}} ) \cap {{\mathcal {Z}}} \), where \( {{\mathcal {N}}} \) denotes the negative orthant. In case this box is empty, we have \( F( {\varvec{z}} ) \approx 0 \) due to the specification of \( {{\mathcal {Z}}} \). Taking into account Assumption 25, such \( {\varvec{z}} \) cannot be a part of an optimal solution of (56).

In case the box \( ( {\varvec{z}} + {{\mathcal {N}}} ) \cap {{\mathcal {Z}}} \) is not empty, let \( \varPi _{{\varvec{z}}} \) denote its ’most positive’ vertex. We have \( \varPi _{{\varvec{z}}} \in {{\mathcal {Z}}},\; \varPi _{{\varvec{z}}} \le {\varvec{z}} \), and \( F( \varPi _{{\varvec{z}}} ) \approx F( {\varvec{z}} ) \). If \( F( {\varvec{z}} ) < 0.5 \), then, due to Assumption 25 again, \( {\varvec{z}} \) cannot be a partial optimal solution of (56).

In the remaining case of \( F( \varPi _{{\varvec{z}}} ) \approx F( {\varvec{z}} ) \ge 0.5 \), we have \( \phi ( \varPi _{{\varvec{z}}} ) \approx \phi ( {\varvec{z}} ) \). Moreover \( \varPi _{{\varvec{z}}} \) is a partial feasible solution of (57), due to \( \varPi _{{\varvec{z}}} \in {{\mathcal {Z}}},\; \varPi _{{\varvec{z}}} \le {\varvec{z}} \). \(\square \)
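The vertex \( \varPi _{{\varvec{z}}} \) used in the proof is simple to compute when \( {{\mathcal {Z}}} \) is given by componentwise bounds. The sketch below assumes \( {{\mathcal {Z}}} = \{ {\varvec{y}} \,|\, \text{lower} \le {\varvec{y}} \le \text{upper} \} \); the function and parameter names are hypothetical.

```python
from typing import List, Optional, Sequence


def most_positive_vertex(z: Sequence[float],
                         lower: Sequence[float],
                         upper: Sequence[float]) -> Optional[List[float]]:
    """Largest vertex of (z + negative orthant) intersected with the box.

    The intersection is {y : lower <= y <= min(z, upper)}.  It is empty
    exactly when some component of z lies below the lower bound, in which
    case None is returned.
    """
    if any(zi < lo for zi, lo in zip(z, lower)):
        return None
    return [min(zi, up) for zi, up in zip(z, upper)]


# Example: the second component is clipped to the upper bound of the box.
vertex = most_positive_vertex([1.0, 9.0], [-5.0, -5.0], [5.0, 5.0])
print(vertex)
```

By construction the returned point lies in the box and is componentwise not larger than \( {\varvec{z}} \), which are the two properties used in the proof.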

Remark 27

We could build duals and models of the above forms in the manner of Sect. 2. Observation 26 allows us to restrict \( {\varvec{z}} \) to \( {{\mathcal {Z}}} \) in the dual formulation (4). Formally, this would mean working with the restricted functions

$$\begin{aligned} \phi _{{\mathcal {Z}}}( {\varvec{z}} ) = \left\{ \begin{array}{ll} \phi ( {\varvec{z}} )\quad &{} \hbox {if}\quad {\varvec{z}} \in {{\mathcal {Z}}}, \\ \\ +\infty &{} \hbox {otherwise} \end{array} \right. \quad \hbox {and}\quad \phi _{{\mathcal {Z}}}^{\star }( {\varvec{u}} ) = \max _{{\varvec{z}} \in {{\mathcal {Z}}}} \{ {\varvec{u}}^T {\varvec{z}} - \phi ( {\varvec{z}} ) \} \end{aligned}$$
(58)

instead of \( \phi ( {\varvec{z}} ) \) and \( \phi ^{\star }( {\varvec{u}} ) \), respectively.

In a pure form of this bounded scheme, new columns would always be selected from \( {{\mathcal {Z}}} \). An obvious drawback of such a scheme is that Theorem 5 does not apply to the resulting bounded optimization problem. In Sect. 5.2 we develop a hybrid scheme, including a restriction to \({{\mathcal {Z}}}\) in the master problem, but selecting new columns by unconstrained maximization.

5.2 A hybrid form of the column generation scheme

Introducing new variables \( {\varvec{z}}^{\prime } \in \mathrm{I}\mathrm{R}^n \), we transform (57) to

$$\begin{aligned} \min \; \phi ( {\varvec{z}} ) \quad \hbox {subject to}\quad A {\varvec{x}} - {\varvec{b}} \le {\varvec{0}},\;\; {\varvec{z}}^{\prime } - T{\varvec{x}} \le {\varvec{0}},\;\; {\varvec{z}}^{\prime } \in {{\mathcal {Z}}},\;\; {\varvec{z}} = {\varvec{z}}^{\prime }. \end{aligned}$$
(59)

The above problem has the general pattern of (3), hence the dual problem can be formulated in the manner of Sect. 2, relaxing the equality constraint \( {\varvec{z}} = {\varvec{z}}^{\prime } \). Model problems are then formulated according to Sects. 2.1 and 2.2. Hence the columns \( {\varvec{z}}_i\; ( i = 0, \ldots , k ) \) may, in theory, fall outside \( {{\mathcal {Z}}} \), but their convex combination is restricted to \( {{\mathcal {Z}}} \).

We implemented this procedure. In our experiments reported in Sect. 7, the restriction \( {\varvec{z}}^{\prime } \in {{\mathcal {Z}}} \) was never active in any optimal solution of the master problem.

Let \( \overline{{\varvec{z}}} \in {{\mathcal {Z}}} \) denote the current primal iterate, obtained in the form (12) using an optimal solution of the model problem. Let \( \overline{{\varvec{g}}} = \nabla \phi ( \overline{{\varvec{z}}} ) \) be the corresponding gradient. Let moreover \( (\, {\overline{\vartheta }},\, \overline{{\varvec{u}}} \,) \) be part of an optimal dual solution of the current model problem. Finally, let \( \overline{{\mathcal {R}}} \) denote the gap between the respective optima of the model problem and the original probabilistic problem.

Observation 28

With the above objects, we have:

$$\begin{aligned} \overline{{\mathcal {R}}} \le \; \Big ( \phi _k( \overline{{\varvec{z}}} ) - \phi ( \overline{{\varvec{z}}} ) \Big ) + \max _{{\varvec{z}} \in {{\mathcal {Z}}}}\, ( \overline{{\varvec{u}}} - \overline{{\varvec{g}}} )^T ( {\varvec{z}} - \overline{{\varvec{z}}} ). \end{aligned}$$
(60)

Proof

An adaptation of Observation 7 to the present bounded setting is

$$\begin{aligned} \overline{{\mathcal {R}}} = \max _{{\varvec{z}} \in {{\mathcal {Z}}}}\, \left\{ \, {\overline{\vartheta }} + \overline{{\varvec{u}}}^T {\varvec{z}} - \phi ( {\varvec{z}} ) \right\} . \end{aligned}$$

Applying the convexity inequality \( \phi ( {\varvec{z}} ) \ge \phi ( \overline{{\varvec{z}}} ) + \overline{{\varvec{g}}}^T ( {\varvec{z}} - \overline{{\varvec{z}}} ) \) we get

$$\begin{aligned} \overline{{\mathcal {R}}} \le {\overline{\vartheta }} - \phi ( \overline{{\varvec{z}}} ) + \overline{{\varvec{g}}}^T \overline{{\varvec{z}}}\; + \max _{{\varvec{z}} \in {{\mathcal {Z}}}}\, ( \overline{{\varvec{u}}} - \overline{{\varvec{g}}} )^T {\varvec{z}}= & {} \phi _k( \overline{{\varvec{z}}} ) - \left( {\overline{\vartheta }} + \overline{{\varvec{u}}}^T \overline{{\varvec{z}}} \right) \; + {\overline{\vartheta }} - \phi ( \overline{{\varvec{z}}} ) + \overline{{\varvec{g}}}^T \overline{{\varvec{z}}}\\&+\max _{{\varvec{z}} \in {{\mathcal {Z}}}}\, ( \overline{{\varvec{u}}} - \overline{{\varvec{g}}} )^T {\varvec{z}}. \end{aligned}$$

The above equality is a consequence of Observation 4(a). \(\square \)

The exact gradient \( \overline{{\varvec{g}}} \) is of course not known, but we can construct a gradient estimate together with a confidence interval, as will be described in Sect. 6.5. Given an error tolerance \( \varDelta > 0 \), a probability \( p\; ( 0 < p \ll 1 ) \) and the iterate \( \overline{{\varvec{z}}} \in {{\mathcal {Z}} } \), we obtain a gradient estimate \( \overline{{\varvec{G}}} \) together with a confidence interval \( \overline{{\mathcal {I}}} \). (The former is a random vector, the latter is an n-dimensional interval having random dimensions. The interval has the vector as a center.) The random objects satisfy the following rules:

$$\begin{aligned} \hbox {E}( \overline{{\varvec{G}}} ) = \overline{{\varvec{g}}},\quad \hbox {P}\left( \, \overline{{\varvec{g}}} \in \overline{{\mathcal {I}}} \,\right) \; \ge 1 - p \quad \hbox {and}\quad \mathrm {diag}\left( \overline{{\mathcal {I}}} \right) \le \varDelta , \end{aligned}$$
(61)

where diag denotes the largest distance in the interval.

The following observation shows that we can use \( \overline{{\varvec{G}}} \) to estimate the maximum on the right-hand side of (60).

Observation 29

The objects of (61) admit the following estimate:

$$\begin{aligned}&\max _{{\varvec{z}} \in {{\mathcal {Z}}}}\, ( \overline{{\varvec{u}}} - \overline{{\varvec{g}}} )^T ( {\varvec{z}} - \overline{{\varvec{z}}} )\;\; \le \;\; \max _{{\varvec{z}} \in {{\mathcal {Z}}}}\; ( \overline{{\varvec{u}}} - \overline{{\varvec{G}}} )^T ( {\varvec{z}} - \overline{{\varvec{z}}} )\; +\; \varDelta \cdot \mathrm {diag}( {{\mathcal {Z}}} )\nonumber \\&\quad \hbox {holds with a probability at least}\;\; 1 - p. \end{aligned}$$
(62)

Proof

Based on the confidence interval, a pessimistic estimate of the left-hand side of (62) could be obtained by solving the (nonconvex) quadratic programming problem

$$\begin{aligned} \max \; ( \overline{{\varvec{u}}} - {\varvec{g}} )^T ( {\varvec{z}} - \overline{{\varvec{z}}} ) \quad \text{ such } \text{ that }\quad {\varvec{z}} \in {{\mathcal {Z}}},\;\; {\varvec{g}} \in \overline{{\mathcal {I}}}. \end{aligned}$$
(63)

Instead of the quadratic programming problem, we just solve the linear programming problem

$$\begin{aligned} \max \; ( \overline{{\varvec{u}}} - \overline{{\varvec{G}}} )^T ( {\varvec{z}} - \overline{{\varvec{z}}} ) \quad \text{ such } \text{ that }\quad {\varvec{z}} \in {{\mathcal {Z}}}. \end{aligned}$$
(64)

Let \( \left( \grave{{\varvec{z}}}, \grave{{\varvec{g}}} \right) \) denote an optimal solution of (63), and let \( {\widehat{{\varvec{z}}}} \) denote an optimal solution of (64). The difference between the respective optima is

$$\begin{aligned} ( \overline{{\varvec{u}}} - \grave{{\varvec{g}}} )^T ( \grave{{\varvec{z}}} - \overline{{\varvec{z}}} ) - ( \overline{{\varvec{u}}} - \overline{{\varvec{G}}} )^T ( {\widehat{{\varvec{z}}}} - \overline{{\varvec{z}}} )\le & {} ( \overline{{\varvec{u}}} - \grave{{\varvec{g}}} )^T ( \grave{{\varvec{z}}} - \overline{{\varvec{z}}} ) - ( \overline{{\varvec{u}}} - \overline{{\varvec{G}}} )^T ( \grave{{\varvec{z}}} - \overline{{\varvec{z}}} )\nonumber \\= & {} ( \overline{{\varvec{G}}} - \grave{{\varvec{g}}} )^T ( \grave{{\varvec{z}}} - \overline{{\varvec{z}}} ), \end{aligned}$$
(65)

the inequality being a consequence of the selection of \( {\widehat{{\varvec{z}}}} \). The estimate (62) follows by the Cauchy–Bunyakovsky–Schwarz inequality. \(\square \)

To sum up the above discussion: based on a gradient estimate \( \overline{{\varvec{G}}} \) satisfying (61), it is easy to compute

$$\begin{aligned} \overline{{\mathcal {B}}} := \Big ( \phi _k( \overline{{\varvec{z}}} ) - \phi ( \overline{{\varvec{z}}} ) \Big ) +\; \max _{{\varvec{z}} \in {{\mathcal {Z}}}}\; ( \overline{{\varvec{u}}} - \overline{{\varvec{G}}} )^T ( {\varvec{z}} - \overline{{\varvec{z}}} )\; +\; \varDelta \cdot \mathrm {diag}( {{\mathcal {Z}}} ). \end{aligned}$$
(66)

According to Observations 28 and 29, we have \(\, \hbox {P}(\, \overline{{\mathcal {B}}}\, \ge \, \hbox {`gap'} \,)\, \ge 1 - p \).
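Since \( {{\mathcal {Z}}} \) is a box, the linear programming problem (64) is separable and the bound (66) can be evaluated in closed form. The following is a minimal sketch under the assumption that \( {{\mathcal {Z}}} \) is given by componentwise bounds and that diag is the Euclidean length of the box diagonal; all function and parameter names are hypothetical.

```python
import math
from typing import Sequence


def box_linear_max(c: Sequence[float], z_bar: Sequence[float],
                   lower: Sequence[float], upper: Sequence[float]) -> float:
    """Maximum of c^T (z - z_bar) over the box lower <= z <= upper.

    Each coordinate is maximized independently: the upper bound is taken
    where c_i >= 0 and the lower bound where c_i < 0.
    """
    return sum(ci * ((up if ci >= 0.0 else lo) - zb)
               for ci, zb, lo, up in zip(c, z_bar, lower, upper))


def box_diagonal(lower: Sequence[float], upper: Sequence[float]) -> float:
    """Euclidean length of the main diagonal of the box."""
    return math.sqrt(sum((up - lo) ** 2 for lo, up in zip(lower, upper)))


def gap_bound(model_value: float, phi_value: float,
              u_bar: Sequence[float], G_bar: Sequence[float],
              z_bar: Sequence[float], lower: Sequence[float],
              upper: Sequence[float], delta: float) -> float:
    """Evaluate the bound (66) from a gradient estimate G_bar."""
    c = [u - g for u, g in zip(u_bar, G_bar)]
    return ((model_value - phi_value)
            + box_linear_max(c, z_bar, lower, upper)
            + delta * box_diagonal(lower, upper))


# Illustrative data only; not taken from the experiments.
bound = gap_bound(model_value=1.3, phi_value=1.0,
                  u_bar=[1.0, -1.0], G_bar=[0.5, 0.5],
                  z_bar=[0.0, 0.0], lower=[-1.0, -1.0], upper=[1.0, 1.0],
                  delta=0.1)
print(bound)
```

No LP solver is needed in this setting, so the cost of evaluating the bound is dominated by the gradient estimation.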

5.3 Regulating accuracy and reliability when solving an unconstrained probabilistic problem

Given iterate \( \overline{{\varvec{z}}} \), we wish to construct an estimate \( \overline{{\varvec{G}}} \) for the corresponding gradient. We have two objectives.

  • We need Corollary 11 to ensure efficiency of a descent step in the course of column selection. Hence (20) should hold with an appropriate \( \sigma \) between the vectors \( {\varvec{g}}^{\circ } = \overline{{\varvec{g}}} - \overline{{\varvec{u}}} \) and \( {\varvec{G}}^{\circ } = \overline{{\varvec{G}}} - \overline{{\varvec{u}}} \). Specifically,

    $$\begin{aligned} \hbox {E}\left( \left\| \overline{{\varvec{G}}} - \overline{{\varvec{g}}} \right\| ^2 \right) \; \le \; {\sigma ^2} \left\| \overline{{\varvec{g}}} - \overline{{\varvec{u}}} \right\| ^2 \end{aligned}$$
    (67)

    should hold.

  • We need (61) to hold with appropriate parameters \( \varDelta \) and p to ensure that the bound \( \overline{{\mathcal {B}}} \) is tight and reliable. We slightly re-formulate the definition of \( \overline{{\mathcal {B}}} \) in (66) as follows:

    $$\begin{aligned} \Big ( \phi _k( \overline{{\varvec{z}}} ) - \phi ( \overline{{\varvec{z}}} ) \Big ) +\; \max _{{\varvec{z}} \in {{\mathcal {Z}}}}\; \Big ( ( \overline{{\varvec{u}}} - \overline{{\varvec{g}}} ) + ( \overline{{\varvec{g}}} - \overline{{\varvec{G}}} ) \Big )^T ( {\varvec{z}} - \overline{{\varvec{z}}} )\; +\; \varDelta \cdot \mathrm {diag}( {{\mathcal {Z}}} ). \end{aligned}$$
    (68)

Concerning p, we increase reliability with each master iteration, as we did in the general case of Sect. 3.1. Having added \( \kappa \) columns, we prescribe the reliability \( 1 - p_{\kappa } \), with \( p_{\kappa } \) set according to Example 13.

In setting the parameters \( \sigma \) and \( \varDelta \), we aim to find a balance between the error of the polyhedral model function on the one hand, and the error of the gradient estimation on the other hand. According to Observation 4(c), \(\; \overline{{\varvec{u}}} \in \partial \phi _k( \overline{{\varvec{z}}} ) \) holds. Taking into account \(\; \overline{{\varvec{g}}} = \nabla \phi ( \overline{{\varvec{z}}} ) \), the vector \( \overline{{\varvec{u}}} - \overline{{\varvec{g}}} \) in (67) and (68) represents the gradient error of the polyhedral model function \( \phi _k( {\varvec{z}} ) \). Similarly, \( \phi _k( \overline{{\varvec{z}}} ) - \phi ( \overline{{\varvec{z}}} ) \) in (68) represents the error in function value. On the other hand, the vector \( \overline{{\varvec{G}}} - \overline{{\varvec{g}}} \) in (67) and (68) represents the error of the gradient estimate \( \overline{{\varvec{G}}} \).

A balance between those two types of error is found by a two-stage procedure. We begin with estimating the order of the magnitude of \( \Vert \overline{{\varvec{u}}} - \overline{{\varvec{g}}} \Vert \), and then refine the estimation as needed.

6 Estimation of the multivariate normal probability distribution function values and gradients

If a multivariate probability distribution function is differentiable everywhere then its partial derivatives have the general formula

$$\begin{aligned} \frac{\partial F(z_{1},\ldots ,z_{n})}{\partial z_{i}} = F(z_{1},\ldots ,z_{i-1},z_{i+1},\ldots ,z_{n}|\;z_{i})f_{i}(z_{i}) \end{aligned}$$
(69)

where \(F(z_{1},\ldots ,z_{n})\) is the probability distribution function of the random variables \(\xi _{1},\ldots ,\xi _{n}\), and \(f_{i}(z)\) is the probability density function of the random variable \(\xi _{i}\). \( F(z_{1},\ldots ,z_{i-1},z_{i+1},\ldots ,z_{n}|\;z_{i})\) is the conditional probability distribution function of the random variables \(\xi _{1},\ldots ,\xi _{i-1},\, \xi _{i+1},\ldots ,\xi _{n}\), given that \(\xi _{i}=z_{i}\). See Formula (6.6.22) on page 203 of the book by Prékopa (1995).

It is known that any conditional probability distribution of the multivariate normal probability distribution is also normal. Therefore from Formula (69) it follows that we can calculate the multivariate normal probability distribution function values and their partial derivatives by the same procedure. This is the reason why in this section we give a list of possible procedures for the estimation of multivariate probability distribution function values only.
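As an illustration of Formula (69), consider the bivariate case with standard normal marginals and correlation \( \rho \; ( | \rho | < 1 ) \). (The symbols \( \rho \), \( \varPhi \) and \( \varphi \) are introduced here only for this example; \( \varPhi \) and \( \varphi \) denote the one-dimensional standard normal distribution and density functions.) The conditional distribution of \( \xi _2 \), given \( \xi _1 = z_1 \), is normal with mean \( \rho z_1 \) and variance \( 1 - \rho ^2 \), hence

$$\begin{aligned} \frac{\partial F(z_{1},z_{2})}{\partial z_{1}} = \varPhi \left( \frac{z_{2} - \rho z_{1}}{\sqrt{1 - \rho ^2}} \right) \varphi (z_{1}). \end{aligned}$$

The gradient of \( \phi ( {\varvec{z}} ) = -\log F( {\varvec{z}} ) \) then follows from the chain rule, \( \nabla \phi ( {\varvec{z}} ) = - \nabla F( {\varvec{z}} ) / F( {\varvec{z}} ) \), so each component requires one distribution function value of dimension \( n - 1 \) in addition to \( F( {\varvec{z}} ) \) itself.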

6.1 Genz’s method

This method was published in Genz (1992). In this paper Genz was dealing with the estimation of the multivariate normal probability content of a rectangle, which is a more general problem than the calculation of multivariate probability distribution function values.

The main idea is to transform the integration region to the unit cube \([0,1]^n\) by a sequence of elementary transformations. This comes at the expense of a slightly more complicated integrand.

The sequence begins with the Cholesky transformation, which transforms the components of the multivariate normally distributed random vector into independent random variables; however, the integration limits become more complicated. Then the integration variables are transformed further by the inverse function of the one-dimensional standard normal probability distribution function. The effect of this transformation is that all integrands become equal to one, but the integration limits become even more complicated. Finally, by a simple linear transformation, the integration region changes to the unit cube \([0,1]^n\) and the integrand functions become the differences of the earlier complicated integration limits.

We remark that the i-th integrand function is always independent of the i-th integration variable and can be pulled out of one integral, which allows explicit integration of the innermost integral. This way the numerical integration may be carried out on the unit cube \([0,1]^{n-1}\).

This sequence of transformations has also forced a priority ordering on the components of \({\mathbf {x}}\) which makes the problem amenable to the application of subregion adaptive algorithms. The method works best if the components are presorted so that the innermost integration has the most “weight”.

Genz describes three different methods for solving this transformed integral. The first method is based on a polynomial approximation of the integrand. For better performance, the unit cube is split into subregions which are subsequently partitioned further whenever the approximation is not accurate enough. The second method uses quasi-random integration points. Finally, the third method uses pseudo-random integration points, which results in error estimates that are statistical in nature.
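The transformation sequence can be sketched for the pseudo-random variant as follows. This is a simplified illustration, assuming a zero mean vector and lower integration limits at minus infinity (that is, distribution function values only), without presorting of the components or error estimation; the function names are hypothetical and the code is not the implementation used in the experiments.

```python
import math
import random
from statistics import NormalDist
from typing import List, Sequence

_STD = NormalDist()  # one-dimensional standard normal distribution


def cholesky(sigma: Sequence[Sequence[float]]) -> List[List[float]]:
    """Lower-triangular factor C with C C^T = sigma (sigma positive definite)."""
    n = len(sigma)
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sigma[i][j] - sum(c[i][k] * c[j][k] for k in range(j))
            c[i][j] = math.sqrt(s) if i == j else s / c[j][j]
    return c


def genz_cdf(z: Sequence[float], sigma: Sequence[Sequence[float]],
             samples: int, rng: random.Random) -> float:
    """Monte Carlo estimate of P(xi <= z) for xi ~ N(0, sigma)."""
    n = len(z)
    c = cholesky(sigma)
    total = 0.0
    for _ in range(samples):
        y: List[float] = []
        f = 1.0
        for i in range(n):
            shift = sum(c[i][j] * y[j] for j in range(i))
            e = _STD.cdf((z[i] - shift) / c[i][i])  # transformed upper limit
            f *= e
            if i < n - 1:
                # Sample y_i from the standard normal truncated to (-inf, limit];
                # the argument is clamped to keep inv_cdf inside (0, 1).
                t = min(max(rng.random() * e, 1e-15), 1.0 - 1e-15)
                y.append(_STD.inv_cdf(t))
        total += f
    return total / samples


estimate = genz_cdf([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]], 20000,
                    random.Random(0))
print(estimate)
```

Only \( n - 1 \) uniform random numbers are drawn per sample point, in line with the remark that the integration is carried out on \([0,1]^{n-1}\). For the bivariate example above, the exact value is \( 1/4 + \arcsin (0.5) / (2 \pi ) = 1/3 \).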

6.2 Deák’s method

This method was first published in Deák (1980) and later in Deák (1986). Its main thrust is to decompose the normal random vector into two parts, a direction and a distance from the origin. This decomposition can be used both in the generation of sample points and in the calculation of the probability content of a rectangle. It is well known that the direction is uniformly distributed on the n-dimensional unit sphere, the distance from the origin has a chi-distribution with n degrees of freedom and they are independent of each other.

A simple Monte Carlo method is to generate N sample points uniformly distributed on the n-dimensional unit sphere, determine the probability content of the intersection of the rectangle in question with each of the generated directions, and finally average these values. The determination of the probability content of the intersection can be done simply by applying a code to calculate the probability distribution function of the chi-distribution. The advantage of this method is that it counts the probability content of the rectangle not in a ’point to point’ way, rather in a ’line section to line section’ way.

In addition it is easy to apply some type of antithetic random variables technique to reduce the variance further. Deák devised an improvement over this scheme that is intended to distribute a large number of directions as uniformly as possible on the unit sphere.

A set of n directions is chosen first and converted into an orthonormal system, that is, n unit vectors which are mutually orthogonal. From each orthonormal system \(\{s_1,\ldots ,s_n\}\) one obtains \(2^k {n \atopwithdelims ()k}\) directions by computing the sum

$$\begin{aligned} d(v,l_1,\ldots ,l_k) = {1\over \sqrt{k}} \sum _{j=1}^k v_j s_{l_j}, \end{aligned}$$

where \(v=(v_1,\ldots ,v_k)\) is a sign vector (each component is either \(+1\) or \(-1\)), and \(1\le l_1< \cdots < l_k \le n\).

The estimator can then be calculated jointly for the set of \(2^k {n \atopwithdelims ()k}\) directions resulting in faster calculation and further variance reduction. The parameter k can in principle be chosen arbitrarily from the set \(\{1,2,\ldots ,n\}\), but the computational complexity increases very fast. Best results are obtained for \(k=2\) or \(k=3\).
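The construction of the direction set can be sketched as follows. The function name is hypothetical, and the orthonormal system is passed as a list of vectors; the example uses the standard basis purely for illustration.

```python
import itertools
import math
from typing import List, Sequence


def deak_directions(system: Sequence[Sequence[float]], k: int) -> List[List[float]]:
    """All directions (1/sqrt(k)) * sum_j v_j s_{l_j} from an orthonormal system.

    The index sets l_1 < ... < l_k run over all k-subsets of the system and
    the sign vectors v over {+1, -1}^k, giving 2^k * C(n, k) directions.
    """
    n = len(system)
    dim = len(system[0])
    scale = 1.0 / math.sqrt(k)
    directions = []
    for indices in itertools.combinations(range(n), k):
        for signs in itertools.product((1.0, -1.0), repeat=k):
            directions.append([
                scale * sum(v * system[l][c] for v, l in zip(signs, indices))
                for c in range(dim)
            ])
    return directions


# Standard basis of dimension 4 as an (illustrative) orthonormal system.
basis = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
dirs = deak_directions(basis, 2)
print(len(dirs))
```

Because the vectors of the system are orthonormal, the factor \( 1/\sqrt{k} \) makes every generated direction a unit vector.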

It is easy to see that the variance of even the simplest Deák estimator is less than the variance of the crude Monte Carlo method, for a given sample size N.
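The simplest variant described above (random directions, chi-distribution along each direction, antithetic pairs) can be sketched for the bivariate case, where the chi-distribution with 2 degrees of freedom has the closed-form distribution function \( 1 - e^{-r^2/2} \). The sketch assumes standard normal marginals with correlation rho and computes distribution function values only; the names are hypothetical, and the orthonormalized direction systems of the improved scheme are not used.

```python
import math
import random
from typing import Sequence


def chi2dof_cdf(r: float) -> float:
    """Distribution function of the chi-distribution with 2 degrees of freedom."""
    if r <= 0.0:
        return 0.0
    if math.isinf(r):
        return 1.0
    return 1.0 - math.exp(-0.5 * r * r)


def ray_probability(a: Sequence[float], z: Sequence[float]) -> float:
    """P(r * a <= z componentwise), r chi-distributed with 2 degrees of freedom."""
    r_lo, r_hi = 0.0, math.inf
    for ai, zi in zip(a, z):
        if ai > 0.0:
            r_hi = min(r_hi, zi / ai)
        elif ai < 0.0:
            r_lo = max(r_lo, zi / ai)
        elif zi < 0.0:
            return 0.0
    if r_hi <= r_lo:
        return 0.0
    return chi2dof_cdf(r_hi) - chi2dof_cdf(r_lo)


def deak_cdf_2d(z: Sequence[float], rho: float, samples: int,
                rng: random.Random) -> float:
    """Estimate F(z1, z2) for a standard bivariate normal with correlation rho."""
    s = math.sqrt(1.0 - rho * rho)
    total = 0.0
    for _ in range(samples):
        angle = rng.uniform(0.0, 2.0 * math.pi)
        d = (math.cos(angle), math.sin(angle))  # uniform direction on the circle
        for sign in (1.0, -1.0):                # antithetic pair d, -d
            # Image of the direction under the Cholesky factor of the covariance.
            a = (sign * d[0], sign * (rho * d[0] + s * d[1]))
            total += ray_probability(a, z)
    return total / (2 * samples)


estimate = deak_cdf_2d((0.0, 0.0), 0.5, 20000, random.Random(1))
print(estimate)
```

Each direction contributes the exact probability content of a whole line section, which is the ’line section to line section’ counting mentioned above. In higher dimensions the same scheme would require the distribution function of the chi-distribution with n degrees of freedom and a uniform sampler on the n-dimensional unit sphere.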

We remark here that the recent paper Teng et al. (2015) on spherical Monte Carlo simulations for multivariate normal probabilities provides various related simulation schemes.

6.3 Szántai’s method

The procedure was first published in Hungarian, see Szántai (1976) and Szántai (1985). In English it was first published in Szántai (1988) and it is quoted in Sects. 6.5 and 6.6 of the book Prékopa (1995).

This procedure can be applied to any multivariate probability distribution function. The only condition is that we have to be able to calculate the one- and the two-dimensional marginal probability distribution function values. Accuracy can easily be controlled by changing the sample size. This way we can construct gradient estimates satisfying Assumption 3.

As we have

$$\begin{aligned} F(z_{1},\ldots ,z_{n})=\hbox {P}\left( \xi _{1}<z_{1},\ldots ,\xi _{n}<z_{n}\right) =1-\hbox {P}({\overline{A}}_1\cup \cdots \cup {\overline{A}}_n), \end{aligned}$$

where

$$\begin{aligned} {\overline{A}}_{i}=\left\{ \xi _{i}\ge z_{i}\right\} \quad (i=1,\ldots ,n), \end{aligned}$$

we can apply bounding and simulation results for the probability of union of events.

If \(\mu \) denotes the number of those events which occur out of the events \({\overline{A}}_{1},{\overline{A}}_{2},\ldots ,{\overline{A}}_{n}\), then the random variable

$$\begin{aligned} \nu _{0}=\left\{ \begin{array}{ll} 0, &{}\quad \hbox {if}\quad \mu =0\\ 1, &{}\quad \hbox {if}\quad \mu \ge 1 \end{array} \right. \end{aligned}$$

obviously has expected value \({\overline{P}}=\hbox {P}({\overline{A}}_{1}\cup {\overline{A}}_{2}\cup \cdots \cup {\overline{A}}_{n})\).

Two further random variables having expected value \({\overline{P}}\) can be defined by taking the differences between the true probability value and its second order lower and upper Boole–Bonferroni bounds. The definitions of these bounds can be found in the book Prékopa (1995).

We can estimate the expected value of these three random variables in the same Monte Carlo simulation procedure and so we get three different estimates for the probability value \({\overline{P}}\). If we estimate the pairwise covariances of these estimates it will be easy to get a final, minimal variance estimate, too. This technique is well known as regression in the simulation literature.
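The final combination step can be sketched as follows, assuming the standard construction in which unbiased estimates with covariance matrix C are combined with weights proportional to \( C^{-1} {\varvec{1}} \), normalized to sum to one. The helper names are hypothetical, and the covariance matrix would in practice be replaced by its estimate from the simulation.

```python
from typing import List, Sequence


def solve(a: Sequence[Sequence[float]], b: Sequence[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def min_variance_weights(cov: Sequence[Sequence[float]]) -> List[float]:
    """Weights summing to one that minimize the variance w^T cov w."""
    x = solve(cov, [1.0] * len(cov))
    total = sum(x)
    return [xi / total for xi in x]


def combine(estimates: Sequence[float], cov: Sequence[Sequence[float]]) -> float:
    """Minimal variance unbiased combination of the given estimates."""
    weights = min_variance_weights(cov)
    return sum(w * e for w, e in zip(weights, estimates))


# Illustrative covariance matrix of three uncorrelated estimates.
weights = min_variance_weights([[1.0, 0.0, 0.0],
                                [0.0, 2.0, 0.0],
                                [0.0, 0.0, 4.0]])
print(weights)
```

For uncorrelated estimates the weights reduce to the familiar inverse-variance weights; with correlated estimates some weights may become negative, which is where the variance reduction of the regression technique comes from.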

Gassmann (1988) combined Szántai’s general algorithm and Deák’s algorithm into a hybrid algorithm. The efficiency of this algorithm was explored in Deák et al. (2002).

One can use higher than second order Boole–Bonferroni bounds, too. It will further reduce the variance of the final estimation. However, the necessary CPU time increases, which may reduce the overall efficiency of the resulting estimation. Many new bounds for the probability of the union of events have been developed in the last two decades. These bounds use not only the aggregated information of the first few binomial moments but they also use the individual product event probabilities which sum up the binomial moments. The most important results of this type can be found in the papers by Hunter (1976), Worsley (1982), Tomescu (1986), Prékopa et al. (1995), Bukszár and Prékopa (2000), Bukszár and Szántai (1999), Boros and Veneziani (2002) and Mádi-Nagy and Prékopa (2004). Szántai showed in his paper Szántai (2000), that the efficiency of his variance reduction technique can be improved significantly if one uses some of the above listed bounds.

6.4 The method of Ambartzumian, Der Kiureghian, Ohanian and Sukiasian

Ambartzumian et al. (1998) proposed to use the Sequential Conditioned Importance Sampling (SCIS) algorithm for the estimation of the cumulative distribution function values of a multivariate normal distribution. This is a variance reduction algorithm which is especially effective in the case of estimating extremely small probability values. This algorithm is based on the Sequential Conditioned Sampling (SCS) technique which is the following.

If the random vector \({\varvec{\xi }}=(\xi _{1},\ldots ,\xi _{n})^{T}\) is normally distributed with mean vector \({\varvec{\mu }}=(\mu _{1},\ldots ,\mu _{n})^{T}\) and positive definite covariance matrix \({\mathbf{C}}\) with elements \(c_{ij},i,j=1,\ldots ,n,\) then its probability density function is given by

$$\begin{aligned} f(x_{1},\ldots ,x_{n})= & {} \frac{1}{(2\text { }\pi )^{\frac{n}{2}}\mid \mathbf{C} \mid ^{\frac{1}{2}}}\exp \left[ -\frac{1}{2}\left( \mathbf{x}-{\varvec{\mu }}\right) ^{T}{} \mathbf{C}^{-1}\left( \mathbf{x}-{\varvec{\mu }}\right) \right] \\= & {} \frac{1}{(2\text { }\pi )^{\frac{n}{2}}\mid \mathbf{D}\mid ^{-\frac{1}{2}}} \exp \left[ -\frac{1}{2}\sum \limits _{i=1}^{n}\sum \limits _{j=1}^{n} d_{ij}\left( x_{i}-\mu _{i}\right) \left( x_{j}-\mu _{j}\right) \right] , \end{aligned}$$

where \(d_{ij}\) are the elements of the inverse matrix \(\mathbf{D} =\mathbf{C}^{-1}.\) The SCS technique consists of first generating a random number according to the one dimensional marginal probability distribution of the random variable \(\xi _{1}\) and then sequentially generating random numbers according to the one dimensional probability density functions \(\varphi _{k}\left( x_{k}\mid x_{1},\ldots ,x_{k-1}\right) \) which are the one dimensional conditional probability density functions of \(\xi _{k}\) for given \(\xi _{1}=x_{1},\ldots ,\xi _{k-1}=x_{k-1}\). It is known that these are one dimensional normal distributions with mean

$$\begin{aligned} \mu _{k}(x_{1},\ldots ,x_{k-1})=\mu _{k}-\sum \limits _{i=1}^{k-1}d_{ki}\frac{x_{i}-\mu _{i}}{d_{kk}}, \end{aligned}$$

and variance

$$\begin{aligned} v_{k} = \frac{1}{d_{kk}}, \end{aligned}$$

for \(k=2,\ldots ,n\).

It is known that the crude Monte Carlo method for estimating very small multivariate normal probability distribution function values is less effective. However, Ambartzumian et al. (1998) proved that in such cases the SCS technique can be easily extended into SCIS by using an importance sampling density function (practically a truncated univariate normal density function) at each step.

6.5 Application of the numerical integration and the variance reduction Monte Carlo simulation algorithms in our procedures for probability maximization

In the course of the procedures proposed in this paper, we often need to obtain a fixed size confidence interval for our probability distribution function value estimations. This requirement is stated in Assumption 3; we need it for determining gradient estimates fulfilling the inequality given in (20), and also when constructing the fixed size multidimensional confidence interval described in (61). All this can be done by applying the results of Stein (1945). This is a two-stage sampling procedure. In the first stage we take a sample of size \(n_{1}\) where \(n_{1}\) is a positive integer not smaller than 2, otherwise it is arbitrary. Then in a second stage we take a sample of \(n_{2}\) elements where \(n_{2}\) is computed on the basis of the result of the first stage sampling. This way the total sample of size \(n_{1}+n_{2}\) results in a fixed size interval of the required confidence level. For a summary, see Section 7.10 of the book Prékopa (1995).
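The computation of \(n_{2}\) can be sketched as follows. This is one common statement of the two-stage rule, assumed here for illustration: the total sample size is the smallest integer not below \( s^2 t^2 / h^2 \), where \( s^2 \) is the first-stage sample variance, t is the two-sided Student quantile with \( n_{1} - 1 \) degrees of freedom, and h is the prescribed half-width. The quantile is passed in by the caller, and the names are hypothetical.

```python
import math
from statistics import variance
from typing import Sequence


def stein_second_stage_size(first_stage: Sequence[float],
                            half_width: float,
                            t_quantile: float) -> int:
    """Number n2 of additional observations required by the two-stage rule.

    first_stage : the n1 >= 2 observations of the first stage
    half_width  : prescribed half-width h of the confidence interval
    t_quantile  : Student t quantile with n1 - 1 degrees of freedom
                  (supplied by the caller, e.g. from a table)
    """
    n1 = len(first_stage)
    if n1 < 2:
        raise ValueError("the first stage needs at least two observations")
    s2 = variance(first_stage)  # unbiased sample variance
    n_total = max(n1, math.ceil(s2 * t_quantile ** 2 / half_width ** 2))
    return n_total - n1


# Illustrative first stage of size 10; 2.262 is the tabulated 0.975
# quantile of the t-distribution with 9 degrees of freedom.
sample = [0.0, 2.0] * 5
n2 = stein_second_stage_size(sample, half_width=0.5, t_quantile=2.262)
print(n2)
```

The width of the resulting interval is fixed in advance, while the sample size is random, which is the property needed for (61).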

We believe that the above described two-stage sampling technique can be realized on the variance reduction Monte Carlo simulation algorithms of Sects. 6.2, 6.3 and 6.4 more easily than on the numerical integration algorithm of Sect. 6.1. Careful numerical testing is necessary to choose the most appropriate procedure, which may be different in the different phases of our optimization procedure.

7 A computational experiment

The aim of this experiment is to demonstrate the workability of the randomized column generation scheme of Sect. 3 in the case of probabilistic problems. Namely, we have \( \phi ( {\varvec{z}} ) = -\log F( {\varvec{z}} ) \) with a nondegenerate n-dimensional standard normal distribution function \( F( {\varvec{z}} ) \).

7.1 Cash matching problem

As in the previous paper (Fábián et al. 2018), we tested our implementation on a cash matching problem with a fifteen-dimensional normal distribution. In this problem we are interested in investing a certain amount of cash on behalf of a pension fund that needs to make certain payments over the coming 15 years. This problem originates from Dentcheva et al. (2004) and Henrion (2004). The cash matching test problem had originally been formulated as cost minimization under a probabilistic constraint. We transformed the problem to probability maximization under a cost constraint.

7.2 Implementation

We used MATLAB with the IBM ILOG CPLEX (Version 12.6.3) optimization toolbox. The numerical computation of multivariate normal distribution values was performed with the QSIMVNV MATLAB function implemented by Genz (1992).

Our solver is based on the implementation used in our former paper Fábián et al. (2018). In the present version we used the randomized procedure of Sect. 3. We implemented the bounding method of Sect. 5, with the hybrid form of Sect. 5.2.

The initial solution was set by the procedure described in Fábián et al. (2018). The time needed for setting the initial solution was negligible as compared to the time needed for a single iteration with the column generation scheme.

In the course of the randomized column generation scheme, we perform just a single line search in each column generation subproblem. This line search starts from the current \( \overline{{\varvec{z}}} \) vector. Gradients of the form \( \nabla \phi ( \overline{{\varvec{z}}} ) - \overline{{\varvec{u}}} \) need to be estimated, as mentioned in Remark 12. This goes back to the estimation of the gradient \( \nabla F( \overline{{\varvec{z}}} ) \) of the distribution function. A component of \( \nabla F( \overline{{\varvec{z}}} ) \) is, in turn, obtained according to (69).
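For reference, the link between the two gradients is the chain rule applied to \( \phi ( {\varvec{z}} ) = -\log F( {\varvec{z}} ) \). This standard identity is given here only as a reading aid and is not a restatement of (69):

$$\begin{aligned} \nabla \phi ( \overline{{\varvec{z}}} ) = -\frac{1}{F( \overline{{\varvec{z}}} )}\, \nabla F( \overline{{\varvec{z}}} ). \end{aligned}$$

Hence an estimate of \( \nabla F( \overline{{\varvec{z}}} ) \), together with an estimate of the positive value \( F( \overline{{\varvec{z}}} ) \), yields an estimate of \( \nabla \phi ( \overline{{\varvec{z}}} ) \).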

Accuracy in Genz’s subroutine is controlled by setting the sample size. In the present simple implementation of the iterative scheme, we control accuracy in such a way that the norm of the error of the current gradient \( \nabla \phi ( \overline{{\varvec{z}}} ) - \overline{{\varvec{u}}} \) be less than one tenth of the norm of the previous gradient \( \Vert \nabla \phi ( \overline{{\varvec{z}}}_- ) - \overline{{\varvec{u}}}_- \Vert \).
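The accuracy rule above can be sketched as a simple loop. The text only states the one-tenth criterion; the doubling of the sample size, the cap `n_max`, and the callable `estimator` (returning a gradient estimate and an estimate of the norm of its error for a given sample size) are assumptions made for illustration and do not describe the actual MATLAB implementation.

```python
import math


def estimate_with_relative_accuracy(estimator, prev_grad_norm,
                                    n_start=1000, n_max=10 ** 7, ratio=0.1):
    """Increase the sample size until the error norm is small enough.

    estimator(n) is a hypothetical callable returning (gradient, error_norm)
    for sample size n. The loop stops once error_norm <= ratio * prev_grad_norm
    or once the sample size cap n_max is reached.
    """
    target = ratio * prev_grad_norm
    n = n_start
    while True:
        gradient, error_norm = estimator(n)
        if error_norm <= target or n >= n_max:
            return gradient, error_norm, n
        n = min(2 * n, n_max)  # doubling is an assumption of this sketch


def toy_estimator(n):
    """Stand-in estimator whose error norm decays like 1 / sqrt(n)."""
    return [0.0, 0.0], 1.0 / math.sqrt(n)


if __name__ == "__main__":
    print(estimate_with_relative_accuracy(toy_estimator, prev_grad_norm=0.1))
```

With the toy estimator, the target error norm is 0.01, so the loop stops at the first doubled sample size of at least 10,000.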

7.3 Results and observations

We performed 10 runs of the randomized procedure, each with 50 iterations. The sequences of the probability levels obtained, i.e., of the values \( F( \overline{{\varvec{z}}} ) \), are shown in Fig. 2. At each iteration, the gradient \(\nabla \phi ( \overline{{\varvec{z}}} ) - \overline{{\varvec{u}}} \) is estimated by \( \overline{{\varvec{G}}} - \overline{{\varvec{u}}} \). The norm of this estimate decreases as the procedure progresses. For a single typical run, this decrease is shown in Fig. 3.

Fig. 2 Probability levels obtained, as a function of iteration counts. Different runs are represented by different threads

We applied no stopping condition besides iteration count. After 50 iterations, the optimal probability levels obtained in the different runs were already very near to each other (the difference between the highest and the lowest being less than 0.0003). On the other hand, the value of the bound \( \overline{{\mathcal {B}}} \) of (66) was between 0.025 and 0.03 at the end of our runs. We conclude that, though the bounding procedure is workable, it needs further technical improvements to keep pace with the stochastic approximation scheme.

Fig. 3 Decrease of the gradient norm as a function of iteration counts, in a single run

In accordance with the hybrid bounding form of Sect. 5.2, we did not restrict new columns \( {\varvec{z}}_i \) to the box \( {{\mathcal {Z}}} \). Still, the probability level was high in all iterates, \( F( {\varvec{z}}_i ) \ge 0.9 \) holding with the columns added in the course of the column generation process. This allowed high-accuracy computation of all probabilistic function values. As mentioned in Sect. 5.2, the restriction \({\varvec{z}}^{\prime } \in {{\mathcal {Z}}} \) of (59) was never active in any optimal solution \( {\varvec{z}}^{\prime } = \overline{{\varvec{z}}} \) of the master problem.

Density function values occurring in the computation of partial derivatives (69) have always been significant. Of the 15 density function values occurring in a single gradient computation, two were always of magnitude around \(10^{-2}\), another one around \(5\times 10^{-3}\), and the rest around \(10^{-3}\). In other problems, near-zero density function values may occur in (69) for many partial derivatives. For such components, the corresponding conditional distribution function need not be computed.

Our present, very simple implementation took about 2 min to perform 50 iterations on the cash matching problem. Though this may seem long, we expect that technical improvements will substantially shorten solution times. (According to our experience, technical improvements may result in a speedup of one or two orders of magnitude.)

8 Conclusion and discussion

In this paper, we proposed a stochastic approximation procedure to minimize a function whose gradient estimation is taxing. In the course of the process, we build an inner approximating model of the objective function. To handle a difficult constraint function, we proposed a Newton-like scheme, employing a parametric form of the stochastic minimization procedure. The scheme enables the regulation of accuracy and reliability in a coordinated manner.

We adapted this approach to probabilistic problems. In comparison with the outer approximation approach widely used in probabilistic programming, we mention that the latter is difficult to implement due to noise in gradient computation. The outer approximation approach applies a direct cutting-plane method. Even a fairly accurate gradient may result in a cut cutting into the epigraph (especially in regions farther away from the current iterate). One either needs sophisticated tolerance handling to avoid cutting into the epigraph (see, e.g., Szántai (1988), Mayer (1998), Arnold et al. (2014)), or else one needs a sophisticated convex optimization method that can handle cuts cutting into the epigraph (see, e.g., de Oliveira et al. (2011), van Ackooij and Sagastizábal (2014)). Yet another alternative is perpetual adjustment of existing cuts to information revealed in the course of the process; see Higle and Sen (1996).

Inner approximation of the level set \( {{\mathcal {L}}}( F, p ) = \{\, {\varvec{z}} \,|\, F( {\varvec{z}} ) \ge p \,\} \), an approach initiated by Prékopa (1990), results in a model that is easy to validate. The level set is approximated by means of p-efficient points. In the cone generation approach initiated by Dentcheva et al. (2000), new approximation points are found by minimization over \( {{\mathcal {L}}}( F, p ) \). As this entails a substantial computational effort, the master part of the decomposition framework should succeed with as few p-efficient points as possible. This calls for specialized solution methods like those of Dentcheva et al. (2004), Dentcheva and Martinez (2013), van Ackooij et al. (2017). An increasing level of complexity is noticeable.

In this paper we apply inner approximation of the epigraph of the probabilistic function \( \phi ( {\varvec{z}} ) = - \log F( {\varvec{z}} ) \). This approach endures noise in gradient computation without any special effort. Noisy gradient estimates may yield iterates that do not improve much on our current model. But we retain a true inner approximation of the function, provided function values are evaluated with appropriate accuracy. This inherent stability of the model enables the application of randomized methods of simple structure.

For probability maximization, we propose a stochastic approximation procedure with relatively easy generation of new test points. A probabilistic constraint function is handled in a Newton-like scheme, approximately solving a short sequence of probability maximization problems, with increasing accuracy. As this scheme is built from randomized components, we provide a statistical analysis of its validity.

The proposed stochastic approximation procedure can be implemented using standard components. The master problem is conveniently solved by an off-the-shelf solver. New approximation points are found through simple line search whose direction can be determined by standard implementations of classic Monte Carlo simulation procedures. The Newton-like scheme can be implemented through minor variations on a standard Newton method.

In the case of a probabilistic function derived from a multivariate standard normal distribution, computing a single non-zero component of a gradient vector involves an effort comparable to that of computing a function value. The variance reduction Monte Carlo simulation procedures described in Sect. 6 were successfully applied in outer approximation approaches to the solution of jointly probabilistic constrained stochastic programming problems; see Szántai (1988). We trust that they will perform as well in the inner approximation approach discussed in the present paper. An elaborate implementation and a systematic computational study will be needed to verify this. We mention that a means of alleviating the difficulty of gradient computation in the case of a multivariate normal distribution has recently been proposed by Hantoute et al. (2018).

Emerging applications of probabilistic programming afford room for different solution approaches; e.g., new models of electricity markets or traffic control, brought about by novel infocommunication technologies.