Variable sample size method for equality constrained optimization problems
Abstract
An equality constrained optimization problem with a deterministic objective function and constraints in the form of mathematical expectation is considered. The constraints are transformed into the Sample Average Approximation form, resulting in a deterministic problem. A method which combines a variable sample size procedure with line search is applied to a penalty reformulation. The method generates a sequence that converges towards first-order critical points. The final stage of the optimization procedure employs the full sample, and the SAA problem is eventually solved at a significantly smaller cost. Preliminary numerical results show that the proposed method can produce significant savings compared to the SAA method and some heuristic sample update counterparts while generating a solution of the same quality.
Keywords
Stochastic optimization Equality constraints Variable sample size Penalty method Line search1 Introduction
In general it is difficult to compute the mathematical expectation, and the common approach is to generate a sample of random vectors and replace the expectation function by the sample average function. In variable sample methods [9], different samples are used along the optimization process. Another approach is to fix a sample (rather large in general) at the beginning of the optimization procedure, so the stochastic problem is converted into a deterministic one. This approach is known as the Sample Average Approximation (SAA) or the sample path method; details can be found for example in [16, 17]. The obtained SAA problem can be solved by standard optimization techniques. Since the sample often needs to be large to ensure a good approximation of the mathematical expectation, the sample average function (and possibly its gradient) is expensive to evaluate and thus solving the SAA problem is expensive. One possible way to eliminate this drawback is to vary the sample size throughout the optimization process. Namely, when the current iteration point is far from the solution, a smaller sample size can give an iteration point which is good enough and therefore reduces the number of function evaluations. Some methods for controlling the sample size are presented in [3, 9]. Roughly speaking, the optimization method starts with a small subsample and increases it throughout the iterations. An alternative approach, which can be classified as adaptive, relies on the progress achieved in each iteration and thus allows the sample size to oscillate until eventually working with the full sample [1, 2, 11, 12].
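To make the SAA idea concrete, the following Python sketch (an illustration only; the paper's implementation is in Matlab, and the constraint function here is a toy) replaces an expectation by the average over a fixed sample:

```python
import numpy as np

def sample_average(H, x, sample):
    """Approximate E[H(x, xi)] by the average over a fixed sample."""
    return np.mean([H(x, xi) for xi in sample], axis=0)

rng = np.random.default_rng(0)
sample = rng.normal(1.0, 1.0, size=2000)   # xi ~ N(1, 1)

# Toy constraint H(x, xi) = xi * x - 1, so E[H(x, xi)] = x - 1.
H = lambda x, xi: xi * x - 1.0
h_hat = sample_average(H, 2.0, sample)     # close to the true value 1.0
```

The larger the sample, the closer `h_hat` is to the expectation, which is exactly why the full-sample SAA problem is expensive to evaluate.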
The approach we consider in this paper is based on penalty methods [5, 7, 10, 13]. Penalty methods have been successfully applied in a stochastic environment. In [15, 19] an exact penalty method is used to solve a stochastic optimization problem with an expected value objective function and deterministic constraints. Polak and Royset [15] considered the problem with inequality constraints and proposed an algorithm for solving the SAA reformulation for a sufficiently large penalty parameter. Constraints in [19] are defined in the form of equalities and the rule for varying the penalty parameter is defined through minimization of a subproblem.
We propose an algorithm for solving the SAA reformulation of the problem (1). As an optimization procedure we apply the quadratic penalty method combined with variable sample size scheduling and a line search technique. We show that if the Linear Independence Constraint Qualification (LICQ) holds, then the proposed algorithm generates a sequence that converges to a Karush–Kuhn–Tucker (KKT) point of the SAA problem. The algorithm is implemented on a set of test problems from [8] with added noise, and the numerical results show that the proposed algorithm requires a significantly smaller number of function evaluations than the full sample SAA method as well as a heuristic procedure.
The rest of the paper is structured as follows. In the next section details of the observed problem and the algorithm are presented. The convergence results are stated in Sect. 3. Section 4 contains numerical results. Conclusions and some details that complete the paper are given in the last two sections.
Throughout this paper \(\Vert \cdot \Vert \) denotes the Euclidean norm.
2 The algorithm
The rule for changing the sample size is adopted from [12] with a specified weighting parameter, i.e. the aim is to find a sample size \(N_{k+1}\) such that \(dm_{k} \approx N_k/N_{k+1}\; \epsilon _{\delta }^{N_{k+1}}(x_{k})\) and \(N_{k}^{min}\le N_{k+1}\le N_{max}\), where \(\{N_{k}^{min}\}\) is a lower bound sequence. This sequence is updated as in [12], but instead of the objective function we observe the measure of infeasibility \(\hat{\theta }_{N_{k+1}}\). These algorithms are stated in the Appendix for completeness, while here we give only a brief discussion.
The main idea behind the sample size updating is as follows. A relatively small value of \(dm_{k}\) suggests the proximity of a solution of the current approximate problem [i.e. a stationary point of \( \phi (x,N_k,\mu _k) \)] and thus the sample size is increased to get a better approximation of (2). On the other hand, if \(dm_{k}\) is relatively large, the sample size is decreased (but not below \(N_{k}^{min}\)) in order to save computational effort, since we are probably still far away from the solution. That way, the algorithm copes with different kinds of approximations simultaneously.
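The precise update is given as Algorithm 2 in the Appendix; purely as an illustration of the balancing idea, the following Python sketch picks \(N_{k+1}\) by scanning for the sample size that best matches \(dm_{k} \approx (N_k/N)\,\epsilon(N)\). The scan and the error estimate `eps` are our simplifications, not the authors' procedure:

```python
def next_sample_size(dm_k, eps, N_k, N_min_k, N_max):
    """Choose N_{k+1} so that dm_k is roughly (N_k / N) * eps(N),
    restricted to [N_min_k, N_max]; eps(N) estimates the SAA error
    at sample size N and decreases in N."""
    best, best_gap = N_k, float("inf")
    for N in range(N_min_k, N_max + 1):
        gap = abs(dm_k - (N_k / N) * eps(N))
        if gap < best_gap:
            best, best_gap = N, gap
    return best

eps = lambda N: N ** -0.5          # a 1/sqrt(N)-type error estimate (assumed)
N_up = next_sample_size(0.01, eps, 100, 3, 2000)   # small dm -> larger N
N_down = next_sample_size(2.0, eps, 100, 3, 2000)  # large dm -> smaller N
```

A small decrease measure drives the sample size up towards \(N_{max}\), while a large one lets it drop towards the current lower bound, which is exactly the oscillating behaviour described above.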
Although the lower bound \( N_{k}^{min}\) update does not interfere with the main sample size update in most of the tested applications, it plays an important role in the convergence theory. Its main role is to prevent permanent oscillations of the sample size and to push it to the full sample, eventually. This is done by tracking different levels of precision determined by the sample size, or more precisely, by the function \(\hat{\theta }_{N}\). If the sample size is increased to some precision, say \(N_k\), and there is not enough decrease in the measure of infeasibility \(\hat{\theta }_{N_k}\) since the last time this same level of precision was used, the lower bound is increased. Consequently, it pushes up the overall precision controlled by the sample size. The main algorithm is stated below. Note that a sample \(\xi _{1},\xi _{2},\ldots ,\xi _{N_{max}}\) is generated at the beginning of the optimization procedure. Throughout all iterations with the sample size \(N_k<N_{max}\), the first \(N_k\) elements of the whole sample are used. This way we are dealing with the so-called cumulative samples, see [9]. The line search is performed in Step 3 as the classical backtracking with the step \( \beta \), but other options, such as interpolation, are possible as well.
Algorithm 1
 Step 0

Input parameters: \(N_{min} \in \mathbb {N}\), \(x_{0} \in \mathbb {R}^{n}\), \(\beta , \eta , \nu _{1} \in (0,1)\), \(\mu _0>0\), \(\gamma >1\).
 Step 1

Set \(k=0\), \(N_{k}=N_{min}\), \(x_{k}=x_{0}\), \(\mu _{k}=\mu _{0}\), \(l=1\), \(N_0^{min}=N_{min}\).
 Step 2

Calculate a descent search direction \(d_{k}\).
 Step 3
Find the smallest nonnegative integer j such that \(\alpha _{k}=\beta ^{j}\) satisfies
$$\begin{aligned} \phi (x_{k}+\alpha _{k}d_{k};N_k;\mu _k) \le \phi (x_{k};N_k;\mu _k)-\eta \, dm_{k}(\alpha _{k}). \end{aligned}$$
(5)
Set \(x_{k+1}=x_{k}+\alpha _{k} d_{k}\) and \(dm_{k}=dm_{k}(\alpha _{k})\).
 Step 4

If \(\displaystyle dm_k\le \alpha _k/\mu _k^2\), set \(z_{t}=x_{k}\) and \(t=t+1\).
 Step 5

Determine the sample size \(N_{k+1}\) using Algorithm 2.
 Step 6

Determine the lower bound of the sample size \(N_{k+1}^{min}\) using Algorithm 3.
 Step 7

Determine the penalty parameter \(\mu _{k+1}\):
If \(N_k=N_{k+1}<N_{max}\) or \(\displaystyle dm_k>\alpha _k/\mu _k^2\), then \(\mu _{k+1}=\mu _k\), else \(\mu _{k+1}=\gamma \mu _{k}\).
 Step 8

Set \(k = k+1\) and go to Step 2.
One additional comment regarding the algorithm above is in order here. Notice that in Step 2 we take an arbitrary descent direction \( d_k. \) Thus the algorithm represents a general framework, and some of its properties, such as the convergence rate, will be determined by the particular search direction used in the actual implementation. The details of Algorithms 2 and 3 are available in the Appendix.
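As an illustration of the framework with the sample size and penalty parameter held fixed, the following Python sketch runs the Step 3 backtracking line search on a toy quadratic penalty. Here \(d_k\) is the negative gradient and \(dm_{k}(\alpha )=-\alpha \, g_k^T d_k\) is a common choice of decrease measure; the paper's exact definitions of (3) and \(dm_k\) appear earlier and may differ:

```python
import numpy as np

def backtracking_step(phi, grad, x, d, beta=0.5, eta=1e-4, max_j=50):
    """Step 3: smallest j with phi(x + beta^j d) <= phi(x) - eta * dm(beta^j),
    using dm(alpha) = -alpha * grad(x)^T d as the decrease measure."""
    g_d = grad(x) @ d                  # negative for a descent direction
    for j in range(max_j):
        alpha = beta ** j
        if phi(x + alpha * d) <= phi(x) + eta * alpha * g_d:
            return alpha
    return beta ** max_j

# Toy instance: minimize f(x) = ||x||^2 subject to x_1 + x_2 = 1,
# via the quadratic penalty phi(x) = f(x) + (mu/2) h(x)^2 with fixed mu.
mu = 10.0
h = lambda x: x[0] + x[1] - 1.0
phi = lambda x: x @ x + 0.5 * mu * h(x) ** 2
grad = lambda x: 2.0 * x + mu * h(x) * np.ones(2)

x = np.zeros(2)
for _ in range(200):                   # fixed N and mu: plain line search
    d = -grad(x)                       # negative gradient as d_k
    x = x + backtracking_step(phi, grad, x, d) * d
# x approaches (5/11, 5/11), the minimizer of the penalized problem.
```

For the fixed penalty parameter the iterates converge to the minimizer of \(\phi\), which is only approximately feasible; driving \(\mu _k\rightarrow \infty \) as in Step 7 is what pushes the limit points towards feasibility.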
3 Convergence analysis
To show the convergence of Algorithm 1 we need the following standard assumption.
Assumption 1
Function f is bounded from below on the feasible set given in (2). Moreover, \(f, H(\cdot ,\xi _i)\in C^1(\mathbb {R}^n)\) for every \(i=1,2,\ldots ,N_{max}\) and the sequence \(\{x_k\}_{k\in \mathbb {N}_0}\) generated by Algorithm 1 has at least one accumulation point.
Assumption 1 provides continuity and differentiability of the function \(\hat{h}_{N}\) for every \(N\in \mathbb {N}\) and therefore ensures that the penalty function (3) is continuously differentiable. Moreover, the measure of infeasibility \(\hat{\theta }_{N}\) is nonnegative, so the penalty function (3) is also bounded from below whenever f is. Furthermore, notice that fixing the penalty parameter and the sample size to \(\bar{\mu }\) and \(\bar{N}\), respectively, yields a standard backtracking line search method applied to \(\phi (x;\bar{N};\bar{\mu })\). Therefore, using the standard technique (see [11] for instance), we can prove the following lemma.
Lemma 1
Suppose that the Assumption 1 holds and there exists \(\bar{n}\in \mathbb {N}\) such that \(\mu _{k}=\bar{\mu }\) and \(N_k=\bar{N}\) for all \(k\ge \bar{n}\). Then \(\lim _{k\rightarrow \infty }dm_k=0.\)
Proof
The proof of the following lemma leans on the proof of Lemma 4.1 in [11]. Nevertheless, we provide the proof for completeness.
Assumption 2
There are \(\kappa >0\) and \(n_1\in \mathbb {N}\) such that \(\epsilon _{\delta }^{N_{k}}(x_{k})\ge \kappa \) for every \(k\ge n_1\).
Lemma 2
Suppose that the Assumptions 1–2 hold. Then there exists \(q\in \mathbb {N}\) such that \(N_k=N_{max}\) for every \(k\ge q\).
Proof
In order to prove the main result, we need an additional assumption. Notice that the following implication is obviously satisfied for the negative gradient.
Assumption 3
The search directions \(d_k\) are descent, bounded and the implication \(\lim _{k\in K}g_k^Td_k=0 \; \Longrightarrow \; \lim _{k\in K}g_k=0\) holds for any subsequence \(K\subseteq \mathbb {N}\).
Theorem 1
Suppose that the Assumptions 1–3 hold. Then \(\lim _{k\rightarrow \infty }\mu _k=\infty .\)
Proof
Notice that Theorem 1 implies the existence of an infinite sequence \(\{z_t\}\) defined by Step 4 of Algorithm 1. Finally we prove the global convergence result.
Theorem 2
Suppose that the Assumptions 1–3 hold. Then every accumulation point \(x^*\) of \(\{z_t\}_{t\in \mathbb {N}}\) is stationary for \(\hat{\theta }_{N_{max}}\). Moreover, if LICQ holds then \(x^*\) is a KKT point of the problem (2).
Proof
The statement of Theorem 2 is a general convergence result, as we are dealing with an arbitrary search direction \( d_k, \) assuming only that \( d_k \) is a descent direction for \( \phi (x_k,N_k,\mu _k). \) Any particular search direction, like the negative gradient direction or some second-order direction, will further imply additional properties, such as a convergence rate.
4 Numerical results
The test collection consists of 14 standard optimization problems with a unique solution and the objective function bounded from below on \(\mathbb {R}^n\). The problems (6, 27, 28, 42, 46–52, 61, 77 and 79) are taken from Hock and Schittkowski [8] and transformed into SAA form by \(H(x,\xi )=c(\xi x)\), where \(\xi \) follows the Normal distribution \(\mathscr {N}(1,1)\) and c(x) is the function defining the constraints in [8]. For each of the problems, 10 different samples of size \(N_{max}=2000\) are generated, so in total 140 different problems have been tested. The tests are carried out by implementing the proposed algorithm in Matlab 8.0. The samples are generated by the built-in generator randn. As already mentioned, the sample \( \{\xi _1,\ldots ,\xi _{N_{max}}\} \) is generated at the beginning of the iterative procedure. Whenever the sample size is smaller than \( N_{max} \), i.e., in all iterations where \(N_k<N_{max}\), we take the first \(N_k\) elements of the full sample.
The proposed procedure (VSS) is compared with two other sample scheduling schemes: the SAA, where \(N_{k}=N_{max}\) for each k, and the heuristic (HEUR), where \(N_{k+1}=\lceil \min \{1.1N_k,N_{max}\}\rceil \) as in [6, 12, 14]. To make the comparison fair, all the remaining parameters are the same for all tests. The BFGS search direction with a safeguard that ensures the descent property of \(d_{k}\) and the gradient difference \(y_k=\nabla \phi (x_{k+1};N_{k+1};\mu _k)-\nabla \phi (x_k;N_k;\mu _k)\) is used. Line search is performed with \(\beta =0.5\) and \(\eta =10^{-4}\). The initial penalty parameter is set to \(\mu _0=1\) and the increase factor is \(\gamma =1.5\). The initial points are as in [8] and the initial (and minimal) sample size is \(N_{min}=3\). Furthermore, Algorithm 1 is applied with \(\nu _1=1/\sqrt{N_{max}}\) and \(\epsilon _{\delta }^{N_k}\) defined by (4).
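The exact safeguard used in the implementation is not spelled out here; one standard option, sketched below in Python, is to skip the BFGS inverse-Hessian update whenever the curvature condition \(s_k^T y_k>0\) fails, which keeps \(H_k\) positive definite and hence \(d_k=-H_k\nabla \phi \) a descent direction:

```python
import numpy as np

def bfgs_update(H, s, y, eps=1e-10):
    """BFGS update of the inverse Hessian approximation H; the update is
    skipped when s^T y is not sufficiently positive (descent safeguard)."""
    sy = s @ y
    if sy <= eps * np.linalg.norm(s) * np.linalg.norm(y):
        return H                       # skip: curvature condition fails
    rho = 1.0 / sy
    I = np.eye(len(s))
    V = I - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)
```

With \(H_k\) positive definite, \(g_k^T d_k = -g_k^T H_k g_k < 0\) whenever \(g_k \ne 0\), so the line search in Step 3 is well defined.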
The results of the comparison of the algorithms are presented in Fig. 1. As we can see, the winning probabilities (results for \(\tau =1\)) of VSS, HEUR and SAA are 0.52, 0.29 and 0.12, respectively. The ordering is preserved when the performance ratio is allowed to be twice the best one: the probabilities for \( \tau =2\) are 0.91, 0.67 and 0.47 for VSS, HEUR and SAA, respectively. Therefore, the differences in the number of function evaluations among the compared methods are significant. We can conclude that strategies with a varying sample size outperform the SAA method significantly, at least on the considered test collection, and that it is worthwhile to use a smaller sample size at the beginning of the search, while we are still far away from the solution. Furthermore, the plot shows that Algorithm 1 outperforms the heuristic scheme significantly, which fully justifies the idea of decreasing the sample size whenever we are far away from the solution.
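The comparison in Fig. 1 is a performance profile in the sense of Dolan and Moré [4]. A minimal Python sketch of the profile value \(\rho _s(\tau )\), the fraction of problems on which solver s is within a factor \(\tau \) of the best solver, over hypothetical evaluation counts (not the paper's data):

```python
import numpy as np

def performance_profile(costs, tau):
    """costs: (n_problems, n_solvers) array of function-evaluation counts.
    Returns, for each solver, the fraction of problems on which its cost
    is within a factor tau of the best solver on that problem."""
    costs = np.asarray(costs, dtype=float)
    ratios = costs / costs.min(axis=1, keepdims=True)
    return (ratios <= tau).mean(axis=0)

# Hypothetical counts for 3 problems x 2 solvers.
costs = [[100, 150], [300, 120], [80, 80]]
```

Evaluating at \(\tau =1\) gives the winning probabilities quoted above, and increasing \(\tau \) traces out the curves shown in the figure.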
We conclude this section by addressing the robustness of the methods. In the vast majority of cases, the algorithms approached a KKT point. The exceptions are problems 46 and 77. In the case of problem 46, HEUR and VSS each failed on one run, while SAA was fully successful. On the other hand, in problem 77 all the tested methods failed, except for VSS which managed to solve one run.
5 Conclusions
Difficulties in solving stochastic problems of the form considered in this paper are due to the high cost of computing the mathematical expectation. This difficulty can be resolved by transforming the problem (1) into an SAA problem with a sufficiently large sample. However, solving the SAA problem with a large sample, which ensures a good approximation of the original problem, leads to a computationally costly procedure with a very high number of function evaluations. The algorithm proposed in this paper varies the sample size during the optimization process. Under a set of standard conditions, the convergence of the sequence generated by the proposed algorithm to a KKT point of the SAA problem is shown. The presented numerical results demonstrate that the algorithm requires a significantly smaller number of function evaluations than the SAA method and the heuristic procedure. The proposed method is also fairly robust.
Acknowledgements
We are grateful to the Associate Editor and reviewers whose comments helped us to improve the paper. N. Krejić and N. Krklec Jerinkić are supported by Serbian Ministry of Education, Science and Technological Development, Grant No. 174030.
References
1. Bastin, F.: Trust-Region Algorithms for Nonlinear Stochastic Programming and Mixed Logit Models. Ph.D. thesis, University of Namur, Belgium (2004)
2. Bastin, F., Cirillo, C., Toint, P.L.: An adaptive Monte Carlo algorithm for computing mixed logit estimators. Comput. Manag. Sci. 3(1), 55–79 (2006)
3. Deng, G., Ferris, M.C.: Variable-number sample path optimization. Math. Program. 117(1–2), 81–109 (2009)
4. Dolan, E.D., Moré, J.J.: Benchmarking optimization software with performance profiles. Math. Program. 91(2), 201–213 (2002)
5. Dolgopolik, M.V.: Smooth exact penalty functions: a general approach. Optim. Lett. 10, 635–648 (2016)
6. Friedlander, M.P., Schmidt, M.: Hybrid deterministic-stochastic methods for data fitting. SIAM J. Sci. Comput. 34(3), A1380–A1405 (2012)
7. Gill, P.E., Murray, W., Wright, M.H.: Practical Optimization. Academic Press, London (1997)
8. Hock, W., Schittkowski, K.: Test Examples for Nonlinear Programming Codes. Lecture Notes in Economics and Mathematical Systems, vol. 187. Springer, Berlin (1981)
9. Homem-de-Mello, T.: Variable-sample methods for stochastic optimization. ACM Trans. Model. Comput. Simul. 13(2), 108–133 (2003)
10. Huyer, W., Neumaier, A.: A new exact penalty function. SIAM J. Optim. 13(4), 1141–1158 (2003)
11. Krejić, N., Krklec, N.: Line search methods with variable sample size for unconstrained optimization. J. Comput. Appl. Math. 245, 213–231 (2013)
12. Krejić, N., Krklec Jerinkić, N.: Nonmonotone line search methods with variable sample size. Numer. Algorithms 68(4), 711–739 (2015)
13. Nocedal, J., Wright, S.J.: Numerical Optimization. Springer Series in Operations Research. Springer, Berlin (1999)
14. Pasupathy, R.: On choosing parameters in retrospective-approximation algorithms for stochastic root finding and simulation optimization. Oper. Res. 58(4), 889–901 (2010)
15. Polak, E., Royset, J.O.: Efficient sample sizes in stochastic nonlinear programming. J. Comput. Appl. Math. 217(2), 301–310 (2008)
16. Shapiro, A.: Monte Carlo sampling methods. In: Stochastic Programming, Handbooks in Operations Research and Management Science, vol. 10, pp. 353–425. Elsevier, Amsterdam (2003)
17. Shapiro, A., Dentcheva, D., Ruszczynski, A.: Lectures on Stochastic Programming: Modeling and Theory. MPS-SIAM Series on Optimization (2009)
18. Spall, J.C.: Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control. Wiley-Interscience Series in Discrete Mathematics, New Jersey (2003)
19. Wang, X., Ma, S., Yuan, Y.: Penalty methods with stochastic approximation for stochastic nonlinear programming. Technical report, arXiv:1312.2690 [math.OC] (2015)