
Gradient-only approaches to avoid spurious local minima in unconstrained optimization


Abstract

We reflect on some theoretical aspects of gradient-only optimization for the unconstrained optimization of objective functions containing non-physical step or jump discontinuities. This kind of discontinuity arises when the optimization problem is based on the solutions of systems of partial differential equations, in combination with variable discretization techniques (e.g. remeshing in spatial domains, and/or variable time stepping in temporal domains). These discontinuities, which may cause local minima, are artifacts of the numerical strategies used and should not influence the solution to the optimization problem. Although the discontinuities imply that the gradient field is not defined everywhere, the gradient field associated with the computational scheme can nevertheless be computed everywhere; this field is denoted the associated gradient field.

We demonstrate that it is possible to overcome attraction to the local minima if only associated gradient information is used. Various gradient-only algorithmic options are discussed. A salient feature of our approach is that variable discretization strategies, so important in the numerical solution of partial differential equations, can be combined with efficient local optimization algorithms.
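To make the setting concrete, here is a minimal sketch (our own illustrative construction, not an example from the paper) of a univariate objective with a non-physical step discontinuity, together with its everywhere-defined associated derivative, i.e. the derivative of the smooth piece active at the evaluation point:

```python
def objective(x):
    """Smooth parabola (x - 2)**2 with a non-physical downward jump of 0.4
    for x < 1, mimicking e.g. a remeshing event. Function values rise
    across the jump, so a method that compares function values sees a
    spurious local minimum just left of x = 1 and may stall there."""
    f = (x - 2.0) ** 2
    return f - 0.4 if x < 1.0 else f

def associated_derivative(x):
    """Associated derivative: the derivative of the piece active at x.
    It ignores the jump, is defined everywhere, and stays negative all
    the way to the true minimizer x = 2."""
    return 2.0 * (x - 2.0)
```

Because the associated derivative remains negative across the jump, a method that consults only derivative information steps straight over the discontinuity toward x = 2.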


References

  • Allaire G, Jouve F, Toader A-M (2004) Structural optimization using sensitivity analysis and a level-set method. J Comput Phys 194(1):363–393

  • Bazaraa MS, Sherali HD, Shetty CM (1993) Nonlinear programming—theory and algorithms, 2nd edn. Wiley, New York

  • Berberian SK (1994) A first course in real analysis. Springer, New York

  • Brandstatter BR, Ring W, Magele C, Richter KR (1998) Shape design with great geometrical deformations using continuously moving finite element nodes. IEEE Trans Magn 34(5):2877–2880

  • Conn AR, Mongeau M (1998) Discontinuous piecewise linear optimization. Math Program 80:315–380

  • Cook RD, Malkus DS, Plesha ME, Witt RJ (2002) Concepts and applications of finite element analysis, 4th edn. Wiley, New York

  • Garcia MJ, Gonzalez CA (2004) Shape optimisation of continuum structures via evolution strategies and fixed grid finite element analysis. Struct Multidiscip Optim 26(1):92–98

  • Groenwold AA, Etman LFP, Snyman JA, Rooda JE (2007) Incomplete series expansion for function approximation. Struct Multidiscip Optim 34:21–40

  • Kodiyalam S, Thanedar PB (1993) Some practical aspects of shape optimization and its influence on intermediate mesh refinement. Finite Elem Anal Des 15(2):125–133

  • Kroshko D (2006) OpenOpt—free GNU GPL2 MATLAB/Octave optimization toolbox, version 0.36. http://openopt.org

  • Nocedal J, Wright SJ (2006) Numerical optimization, 2nd edn. Springer series in operations research and financial engineering. Springer, Berlin

  • Olhoff N, Rasmussen J, Lund E (1993) A method of exact numerical differentiation for error elimination in finite element based semi-analytical shape sensitivity analysis. Mech Struct Mach 21:1–66

  • Peressini A, Sullivan F, Uhl J (1988) The mathematics of nonlinear programming. Springer, New York

  • Rardin RL (1998) Optimization in operations research. Prentice Hall, Upper Saddle River

  • Schleupen A, Maute K, Ramm E (2000) Adaptive FE-procedures in shape optimization. Struct Multidiscip Optim 4:282–302

  • Shor NZ, Kiwiel KC, Ruszczyński A (1985) Minimization methods for non-differentiable functions. Springer, New York

  • Snyman JA (2005a) A gradient-only line search method for the conjugate gradient method applied to constrained optimization problems with severe noise in the objective function. Int J Numer Methods Eng 62(1):72–82

  • Snyman JA (2005b) Practical mathematical optimization: an introduction to basic optimization theory and classical and new gradient-based algorithms, 2nd edn. Applied optimization, vol 97. Springer, New York

  • Snyman JA, Hay AM (2001) The spherical quadratic steepest descent (SQSD) method for unconstrained minimization with no explicit line searches. Comput Math Appl 42:169–178

  • Snyman JA, Hay AM (2002) The Dynamic-Q optimization method: an alternative to SQP? Comput Math Appl 44:1589–1598

  • Svanberg K (2002) A class of globally convergent optimization methods based on conservative convex separable approximations. SIAM J Optim 12:555–573

  • Van Miegroet L, Moës N, Fleury C, Duysinx P (2005) Generalized shape optimization based on the level set method. In: 6th Congr Struct Multidiscip Optim, pp 1–10, paper no 711

  • Wilke DN, Kok S, Groenwold AA (2006) A quadratically convergent unstructured remeshing strategy for shape optimization. Int J Numer Methods Eng 65(1):1–17

  • Wilke DN, Kok S, Groenwold AA (2010) The application of gradient-only optimization methods for problems discretized using non-constant methods. Struct Multidiscip Optim 40:433–451

  • Zang I (1981) Discontinuous optimization by smoothing. Math Oper Res 6(1):140–152


Acknowledgement

The first author gratefully acknowledges financial assistance from the National Research Foundation (NRF) of South Africa.

Author information

Correspondence to Daniel Nicolas Wilke.

Appendix A: Proofs of convergence for derivative descent sequences

Before presenting proofs of convergence for (conservative) associated derivative descent sequences, we give gradient-only counterparts of two well-known concepts from classical mathematical programming, which simplify the proofs. First, we present a definition of coercive functions based solely on the associated gradient of a function (Peressini et al. 1988). Although this definition is not strictly analogous to the conventional definition of coercivity, it suffices for our purposes.

Definition A.1

Let \(\boldsymbol{x}^1,\boldsymbol{x}^2\in\mathbb{R}^n\). A real valued function \(f:X\subset\mathbb{R}^n\rightarrow\mathbb{R}\), with associated gradient field \(\nabla_A f(\boldsymbol{x})\) uniquely defined for every \(\boldsymbol{x}\in X\), is associated derivative coercive if there exists a positive number \(R_M\) such that \(\nabla_{A}^{\mathrm{T}}f(\boldsymbol{x}^{2})(\boldsymbol{x}^{2}-\boldsymbol{x}^{1}) > \epsilon\), with \(\epsilon>0\in\mathbb{R}\), for non-perpendicular \(\nabla_A f(\boldsymbol{x}^2)\) and \((\boldsymbol{x}^2-\boldsymbol{x}^1)\), whenever \(\|\boldsymbol{x}^2\|\geq R_M\) and \(\|\boldsymbol{x}^1\| < R_M\).

Secondly, we present definitions for univariate and multivariate associated gradient unimodality based solely on the associated gradient field of a real valued function (Bazaraa et al. 1993).

Definition A.2

A univariate function \(f:X\subset\mathbb{R}\rightarrow\mathbb{R}\), with associated derivative \(f^{\prime_{A}}(\lambda)\) uniquely defined for every \(\lambda\in X\), is (resp., strictly) associated derivative unimodal over X if there exists an \(x^{*}_{g}\in X\) such that

$$f^{\prime_{A}}(\lambda)\,\bigl(\lambda - x^{*}_{g}\bigr) \;\geq\; 0 \ (\text{resp.}\ >0) \quad \text{for all } \lambda\in X,\ \lambda\neq x^{*}_{g}. \qquad(\text{A.1})$$

We now consider (resp., strictly) associated derivative unimodality for multivariate functions (Rardin 1998).

Definition A.3

A multivariate function \(f:X\subset\mathbb{R}^n\rightarrow\mathbb{R}\) is (resp., strictly) associated derivative unimodal over X if, for all \(\boldsymbol{x}^1,\boldsymbol{x}^2\in X\) with \(\boldsymbol{x}^1\neq\boldsymbol{x}^2\), every corresponding univariate function

$$F(\lambda) = f(\boldsymbol{x}^1 +\lambda(\boldsymbol{x}^2-\boldsymbol{x}^1)),\quad \lambda\in[0,1]\subset\mathbb{R}$$

is (resp., strictly) associated derivative unimodal according to Definition A.2.
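Definition A.3 reduces multivariate unimodality to a sign condition on directional associated derivatives along segments, which suggests a simple numerical test. The sketch below is an illustration under the assumption that the associated gradient is available as a callable (the function name and sampling scheme are ours): it samples \(F^{\prime_{A}}(\lambda)=\nabla_{A}^{\mathrm{T}}f(\boldsymbol{x}^1+\lambda(\boldsymbol{x}^2-\boldsymbol{x}^1))(\boldsymbol{x}^2-\boldsymbol{x}^1)\) on [0,1] and checks that its sign never switches from positive back to negative. A finite sample can of course only refute, not certify, unimodality.

```python
import numpy as np

def directionally_unimodal(grad_A, x1, x2, n_samples=1001):
    """Sampled check of Definition A.3 along the segment x1 -> x2:
    the directional associated derivative must be nonpositive up to
    some point and nonnegative thereafter (the sign pattern of (A.1))."""
    x1, x2 = np.asarray(x1, float), np.asarray(x2, float)
    d = x2 - x1
    lams = np.linspace(0.0, 1.0, n_samples)
    slopes = np.array([grad_A(x1 + lam * d) @ d for lam in lams])
    signs = np.sign(slopes)
    signs = signs[signs != 0]        # exact zeros are allowed anywhere
    return bool(np.all(np.diff(signs) >= 0))
```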

A.1 Univariate functions

Now that we have an associated derivative based definition of unimodality for univariate functions, we present a proof of convergence for strictly associated derivative unimodal univariate functions when associated derivative descent sequences are considered.

Theorem A.4

Let \(f:\varLambda\subseteq\mathbb{R}\rightarrow\,]-\infty,\infty]\) be a univariate function that is strictly associated derivative unimodal as defined in Definition A.2, with first associated derivative \(f^{\prime_{A}}:\varLambda\rightarrow\,]-\infty,\infty[\) uniquely defined everywhere on \(\varLambda\). If \(\lambda^{\{0\}}\in\varLambda\) and \(\{\lambda^{\{k\}}\}\) is an associated derivative descent sequence, as defined in Definition 3.8, for f with initial point \(\lambda^{\{0\}}\), then every subsequence of \(\{\lambda^{\{k\}}\}\) converges. The limit of any convergent subsequence of \(\{\lambda^{\{k\}}\}\) is a strict non-negative associated gradient projection point (S-NN-GPP), as defined in Definition 3.4, of f.

Proof

Our assertion that f is strictly associated derivative unimodal as defined in Definition A.2 implies that f has only one S-NN-GPS \(S_{S\mbox{-}NN}\subset\varLambda\), as defined in Definition 3.7, at \(\lambda^{*}\in\varLambda\). Let \(\lambda^{r}\in S_{S\mbox{-}NN}\) be such that \(|\lambda^{\{k\}}-\lambda^{r}|\) is a maximum. Consider a sequence of 1-balls \(\{B(b_k,\epsilon_k)\}\) defined around \(b_k=\frac{1}{2}(\lambda^{\{k\}}+\lambda^{r})\) with radius \(\epsilon_k=\frac{1}{2}|\lambda^{\{k\}}-\lambda^{r}|\). Then every \(\lambda^{\{k+1\}}\in B(b_k,\epsilon_k)\), since \(\{\lambda^{\{k\}}\}\) is an associated derivative descent sequence as defined in Definition 3.8 and f is strictly associated derivative unimodal as defined in Definition A.2. Therefore, \(k\rightarrow\infty\) implies \(|\lambda^{\{k\}}-\lambda^{r}|\rightarrow 0\). It follows from the Cauchy criterion for sequences that \(\{\lambda^{\{k\}}\}\) is convergent, which completes the proof of our first assertion.

Now let \(\{\lambda^{\{k\}_{m}}\}\) be a convergent subsequence of \(\{\lambda^{\{k\}}\}\) and let \(\lambda^{m*}\) be its limit. Suppose, contrary to the second assertion of the theorem, that \(\lambda^{m*}\) is not an S-NN-GPP, as defined in Definition 3.4, of f. Since we assume that \(\lambda^{m*}\) is not an S-NN-GPP, and by Definition 3.8, there exists a \(\lambda^{m*}+\delta\), with \(\delta\neq 0\in\mathbb{R}\), such that \(f^{\prime_{A}}(\lambda^{m*}+\delta)<0\), which contradicts our assumption that \(\lambda^{m*}\) is the limit of the subsequence \(\{\lambda^{\{k\}_{m}}\}\). Therefore, for \(\lambda^{m*}\) to be the limit of an associated derivative descent subsequence \(\{\lambda^{\{k\}_{m}}\}\), we require \(\lambda^{m*}\in S_{S\mbox{-}NN}\), which completes the proof. □
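Theorem A.4 is easy to observe numerically. Bisection on the sign of the associated derivative generates iterates of the descent type considered here (a simplified stand-in for the sequences of Definition 3.8, which is not restated in this appendix); applied to the jump example sketched after the abstract, it converges to the S-NN-GPP at x = 2, stepping over the discontinuity at x = 1 without noticing it:

```python
def derivative_sign_bisection(df, a, b, tol=1e-10):
    """Locate the sign change of the associated derivative df on [a, b],
    assuming df(a) < 0 < df(b) (associated derivative unimodality).
    Function values are never consulted, so a step discontinuity in f
    cannot create a spurious stopping point."""
    while b - a > tol:
        m = 0.5 * (a + b)
        if df(m) < 0:
            a = m      # minimizer lies to the right of m
        else:
            b = m      # minimizer lies to the left of (or at) m
    return 0.5 * (a + b)

# With the earlier sketch: derivative_sign_bisection(associated_derivative,
# 0.0, 5.0) returns ~2.0, the S-NN-GPP of the discontinuous objective.
```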

We now proceed with a proof of convergence for univariate functions that are associated derivative unimodal in the generalized sense, when associated derivative descent sequences are considered.

Theorem A.5

Let \(f:\varLambda\subseteq\mathbb{R}\rightarrow\,]-\infty,\infty]\) be a univariate function that is associated derivative unimodal, as defined in Definition A.2, with first associated derivative \(f^{\prime_{A}}:\varLambda\rightarrow\,]-\infty,\infty[\) uniquely defined everywhere on \(\varLambda\). If \(\lambda^{\{0\}}\in\varLambda\) and \(\{\lambda^{\{k\}}\}\) is an associated derivative descent sequence, as defined in Definition 3.8, for f with initial point \(\lambda^{\{0\}}\), then every subsequence of \(\{\lambda^{\{k\}}\}\) converges. The limit of any convergent subsequence of \(\{\lambda^{\{k\}}\}\) is a generalized non-negative associated gradient projection point (G-NN-GPP), as defined in Definition 3.2, of f.

Proof

Our assertion that f is associated derivative unimodal as defined in Definition A.2 implies that f has at least one G-NN-GPS \(S_{G\mbox{-}NN}\subset\varLambda\) as defined in Definition 3.7. Let \(S\subseteq\varLambda\) be the union of the G-NN-GPSs \(S_{G\mbox{-}NN}\). Consider the jth sequence of 1-balls \(\{B(b_k,\epsilon_k)\}_j\) defined around \(b_k=\frac{1}{2}(\lambda^{\{k\}}+(\lambda_j^{*}\in S))\) and with radius \(\epsilon_k=\frac{1}{2}|\lambda^{\{k\}}-(\lambda_j^{*}\in S)|\). Then \(\lambda^{\{k+1\}}\in\{B(b_k,\epsilon_k)\}_j\) for every sequence j, since \(\{\lambda^{\{k\}}\}\) is an associated derivative descent sequence as defined in Definition 3.8 and f is associated derivative unimodal as defined in Definition A.2. Therefore \(k\rightarrow\infty\) implies \(|\lambda^{\{k\}}-(\lambda_j^{*}\in S)|\rightarrow a_j\), with \(a_j\) a constant. Since \(|\lambda^{\{k\}}-(\lambda_j^{*}\in S)|-a_j\rightarrow 0\) for every j, it follows from the Cauchy criterion for sequences that \(\{\lambda^{\{k\}}\}\) is convergent, which completes the proof of our first assertion.

Now let \(\{\lambda^{\{k\}_{m}}\}\) be a convergent subsequence of \(\{\lambda^{\{k\}}\}\) and let \(\lambda^{m*}\) be its limit. Suppose, contrary to the second assertion of the theorem, that \(\lambda^{m*}\) is not a G-NN-GPP, as defined in Definition 3.2, of f. Since we assume that \(\lambda^{m*}\) is not a G-NN-GPP, and by Definition 3.8, there exists a \(\lambda^{m*}+\delta\), with \(\delta\neq 0\in\mathbb{R}\), such that \(f^{\prime_{A}}(\lambda^{m*}+\delta)<0\), which contradicts our assumption that \(\lambda^{m*}\) is the limit of the subsequence \(\{\lambda^{\{k\}_{m}}\}\). Therefore, for \(\lambda^{m*}\) to be the limit of an associated derivative descent subsequence (see Definition 3.8) \(\{\lambda^{\{k\}_{m}}\}\), we require \(\lambda^{m*}\in S\), which completes the proof. □

Having concluded our proofs for (strictly) associated derivative unimodal univariate functions, we present a proof of convergence for univariate associated derivative coercive functions that have at least one S-NN-GPS.

Theorem A.6

Let \(f:\varLambda\subseteq\mathbb{R}\rightarrow\,]-\infty,\infty]\) be a univariate associated derivative coercive function, as defined in Definition A.1, with first associated derivative \(f^{\prime_{A}}:\varLambda\rightarrow\,]-\infty,\infty[\) uniquely defined everywhere on \(\varLambda\). If \(\lambda^{\{0\}}\in\varLambda\) and \(\{\lambda^{\{k\}}\}\) is an associated derivative descent sequence, as defined in Definition 3.8, for f with initial point \(\lambda^{\{0\}}\), then there exists at least one convergent subsequence of \(\{\lambda^{\{k\}}\}\). The limit of any convergent subsequence of \(\{\lambda^{\{k\}}\}\) is an S-NN-GPP of f.

Proof

Since we only consider associated derivative descent sequences \(\{\lambda^{\{k\}}\}\), our assertion that f is associated derivative coercive implies that the sequence is confined to a closed interval \([a,b]\subset\varLambda\); in particular, \(\{\lambda^{\{k\}}\}\) is bounded. It then follows from the Bolzano–Weierstrass theorem that every sequence in a closed interval [a,b] has a subsequence that converges to a point in the interval (Berberian 1994).

Now let \(\{\lambda^{\{k\}_{m}}\}\) be a convergent subsequence of \(\{\lambda^{\{k\}}\}\) and let \(\lambda^{m*}\in\varLambda\) be its limit. Suppose, contrary to the second assertion of the theorem, that \(\lambda^{m*}\) is not an S-NN-GPP of f. Since we assume that \(\lambda^{m*}\) is not an S-NN-GPP, and by Definition 3.8, there exists a \(\lambda^{m*}+\delta\), with \(\delta\neq 0\in\mathbb{R}\), such that \(f^{\prime_{A}}(\lambda^{m*}+\delta)<0\), which contradicts our assumption that \(\lambda^{m*}\) is the limit of the subsequence \(\{\lambda^{\{k\}_{m}}\}\). Therefore, for \(\lambda^{m*}\) to be the limit of an associated derivative descent sequence (see Definition 3.8) \(\{\lambda^{\{k\}_{m}}\}\), we require \(\lambda^{m*}\in S_{S\mbox{-}NN}\) with \(S_{S\mbox{-}NN}\subset\varLambda\), which completes the proof. □

A.2 Multivariate functions

We begin our proofs of convergence of associated derivative descent sequences for multivariate functions with \(C^1\) continuous convex functions (Peressini et al. 1988), after which we present proofs of convergence for broader classes of functions.

Theorem A.7

Suppose \(f:X\subseteq\mathbb{R}^n\rightarrow\mathbb{R}\) is a \(C^1\) continuous convex function with \(\boldsymbol{x}\in X\). If \(\boldsymbol{x}^{\{0\}}\in X\) and \(\{\boldsymbol{x}^{\{k\}}\}\) is an associated derivative descent sequence, as defined in Definition 3.8, for f with initial point \(\boldsymbol{x}^{\{0\}}\), then every subsequence of \(\{\boldsymbol{x}^{\{k\}}\}\) converges. The limit of any convergent subsequence of \(\{\boldsymbol{x}^{\{k\}}\}\) is an S-NN-GPP, as defined in Definition 3.4, of f.

Proof

Our assertion that f is convex and \(C^1\) continuous ensures that f has a single global gradient projection point \(\boldsymbol{x}^{*}_{g}\in X\). Also, by Definition 3.8 and the continuity of the first partial derivatives, we see that \(\{f(\boldsymbol{x}^{\{k\}})\}\) is a decreasing sequence that is bounded below by \(f(\boldsymbol{x}^{*}_{g})\). It follows that \(\{\boldsymbol{x}^{\{k\}}\}\) is a bounded sequence, since f is convex. The Bolzano–Weierstrass theorem implies that \(\{\boldsymbol{x}^{\{k\}}\}\) has at least one convergent subsequence, which completes the proof of our first assertion (Peressini et al. 1988).

Now let \(\{\boldsymbol{x}^{\{k\}_{m}}\}\) be a convergent subsequence of \(\{\boldsymbol{x}^{\{k\}}\}\) and let \(\boldsymbol{x}^{m*}\in X\) be its limit. Suppose, contrary to the second assertion of the theorem, that \(\boldsymbol{x}^{m*}\) is not an S-NN-GPP, as defined in Definition 3.4, of f, which by our continuity assumption implies \(\nabla_A f(\boldsymbol{x}^{m*})\neq\boldsymbol{0}\); this in turn implies that there exists a descent direction \(\boldsymbol{u}^{m*}\) at \(\boldsymbol{x}^{m*}\) such that \(\boldsymbol{u}^{m*}\neq\boldsymbol{0}\).

Since \(\{\boldsymbol{x}^{\{k\}_{m}}\}\) is an associated derivative descent sequence as defined in Definition 3.8, the limit \(\boldsymbol{x}^{m*}\) of which is by assumption not an S-NN-GPP, we have

$$-\nabla_A^{\textrm{T}} f(\boldsymbol{x}^{m*})\nabla_Af(\boldsymbol{x}^{m*}) < 0.$$

It follows from the continuity assumptions that there exists a small \(\lambda>0\in\mathbb{R}\) such that \(-\nabla_{A}^{\mathrm{T}}f(\boldsymbol{x}^{m*}+\lambda\boldsymbol{u}^{m*})\nabla_{A}f(\boldsymbol{x}^{m*})<0\), which contradicts our assumption that \(\boldsymbol{x}^{m*}\) is the limit of the sequence \(\{\boldsymbol{x}^{\{k\}_{m}}\}\). Therefore, for \(\boldsymbol{x}^{m*}\) to be the limit of an associated derivative descent sequence \(\{\boldsymbol{x}^{\{k\}_{m}}\}\), we require \(\nabla_A f(\boldsymbol{x}^{m*})=\boldsymbol{0}\), which in turn implies \(\boldsymbol{u}^{m*}=\boldsymbol{0}\). The limit \(\boldsymbol{x}^{m*}\) of an associated derivative descent sequence as defined in Definition 3.8 is therefore an S-NN-GPP as defined in Definition 3.4, which completes the proof. □
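For intuition, an associated gradient descent sequence of the kind Theorem A.7 addresses can be generated by a plain gradient-only steepest descent loop. The following is a minimal sketch under our own fixed-step and stopping choices, not the constructive procedure of Definition 3.8:

```python
import numpy as np

def gradient_only_descent(grad_A, x0, step=0.1, tol=1e-8, max_iter=100_000):
    """Iterate x <- x - step * grad_A(x), consulting only the associated
    gradient. For a C^1 convex f with a sufficiently small step, the
    iterates approach the unique point where grad_A vanishes, i.e. the
    S-NN-GPP of Theorem A.7."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_A(x)
        if np.linalg.norm(g) < tol:   # candidate S-NN-GPP reached
            break
        x = x - step * g
    return x

# Example: f(x) = ||x||^2 has grad_A(x) = 2 x, and
# gradient_only_descent(lambda x: 2.0 * x, [3.0, -4.0]) -> ~[0.0, 0.0]
```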

Before we proceed to present a proof of convergence for \(C^1\) continuous associated derivative coercive functions, we show that if a function is associated derivative coercive and \(C^1\) continuous, it has at least one global gradient projection point.

Proposition A.8

Suppose \(f:X\subseteq\mathbb{R}^n\rightarrow\mathbb{R}\) is a \(C^1\) continuous associated derivative coercive function as defined in Definition A.1, with \(\boldsymbol{x}\in X\); then f has at least one S-NN-GPP as defined in Definition 3.4.

Proof

Let \(\boldsymbol{x}^1,\boldsymbol{x}^2,\boldsymbol{x}^3\in\mathbb{R}^n\). Since f is associated derivative coercive as defined in Definition A.1, there exists by definition a number \(R_M\) such that for every \(\{\boldsymbol{x}^2:\|\boldsymbol{x}^2\|\geq R_M\}\) and every \(\{\boldsymbol{x}^1:\|\boldsymbol{x}^1\|<R_M\}\) the following holds: \(\nabla_{A}^{\mathrm{T}}f(\boldsymbol{x}^{2})(\boldsymbol{x}^{2}-\boldsymbol{x}^{1})>0\), for non-perpendicular \(\nabla_A f(\boldsymbol{x}^2)\) and \((\boldsymbol{x}^2-\boldsymbol{x}^1)\). In addition, there exists \(\{\boldsymbol{x}^3:\|\boldsymbol{x}^3\|<R_M\}\) such that \(\nabla_{A}^{\mathrm{T}}f(\boldsymbol{x}^{3})(\boldsymbol{x}^{3}-\boldsymbol{x}^{1})>0\). The set \(\{\boldsymbol{x}:\|\boldsymbol{x}\|\leq R_M\}\) is closed and bounded, which by the continuity assumption implies that \(f(\boldsymbol{x})\) assumes a minimum value on \(\{\boldsymbol{x}:\|\boldsymbol{x}\|\leq R_M\}\) at a point \(\boldsymbol{x}^{*}_{g}\in X\). From the continuity assumption on the first partial associated derivatives it follows that \(\nabla_A f(\boldsymbol{x}^{*}_{g})=\boldsymbol{0}\) (Peressini et al. 1988). It therefore follows from the continuity assumptions that Definition 3.4 holds at \(\boldsymbol{x}^{*}_{g}\). □

Theorem A.9

Suppose \(f:X\subseteq\mathbb{R}^n\rightarrow\mathbb{R}\) is a \(C^1\) continuous associated derivative coercive function, as defined in Definition A.1, with \(\boldsymbol{x}\in X\). If \(\boldsymbol{x}^{\{0\}}\in X\) and \(\{\boldsymbol{x}^{\{k\}}\}\) is a conservative associated derivative descent sequence, as defined in Definition 3.9, for f with initial point \(\boldsymbol{x}^{\{0\}}\), then some subsequence of \(\{\boldsymbol{x}^{\{k\}}\}\) converges. The limit of any convergent subsequence of \(\{\boldsymbol{x}^{\{k\}}\}\) is a G-NN-GPP, as defined in Definition 3.2, of f.

Proof

Our assertion that f is continuous and associated derivative coercive ensures that f has a global minimizer \(\boldsymbol{x}^{*}_{g}\in X\). Also, by the definition of a conservative associated derivative descent sequence and the continuity of the first partial associated derivatives, we see that \(\{f(\boldsymbol{x}^{\{k\}})\}\) is a decreasing sequence that is bounded below by \(f(\boldsymbol{x}^{*}_{g})\). Note that we require conservative associated derivative descent sequences, since an associated derivative descent sequence alone is not sufficient to guarantee convergence: it may result in oscillatory behavior for \(n>1\). The remainder of the proof is similar to the proof of Theorem A.7. □

We now proceed to functions that are either \(C^0\) continuous or discontinuous, but for which the function values and associated gradient field are uniquely defined everywhere. We restrict ourselves to classes of \(C^0\) continuous or discontinuous functions for which convergence is guaranteed, since associated derivative descent sequences may not converge to an NN-GPP when all \(C^0\) continuous or discontinuous functions are considered, as is evident from the following example.

Consider the linear programming problem of finding the intersection between two intersecting planes. Since the associated gradient on each plane is constant, a steepest descent sequence that terminates at the intersection of the two planes is an example of a sequence that converges to a point that is not an NN-GPP, as the sketch below illustrates.
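This failure mode is easy to reproduce. In the sketch below (our own construction: two planes meeting along a valley line that itself tilts downward), steepest descent with an exact line search terminates on the intersection line \(x_1=0\), where the associated gradient is nonzero and f still decreases along the valley direction \((0,-1)\), so the limit is not an NN-GPP:

```python
import numpy as np

def f(x):
    # Two planes meeting along the valley line x1 = 0; the valley tilts
    # downward in x2, so no point on it is an NN-GPP.
    return abs(x[0]) + 0.5 * x[1]

def grad_A(x):
    # Constant associated gradient on each plane (the x1 >= 0 plane is
    # taken to be active on the intersection itself).
    return np.array([1.0 if x[0] >= 0.0 else -1.0, 0.5])

def steepest_descent_exact(x):
    """Steepest descent with exact line search on this piecewise linear f:
    along d = -grad_A(x), f decreases until the ray crosses x1 = 0 and
    increases afterwards, so the exact step lands on the valley line."""
    x = np.asarray(x, dtype=float)
    while True:
        d = -grad_A(x)
        if x[0] * d[0] >= 0.0:   # ray does not cross x1 = 0: zero step
            return x             # terminates with grad_A(x) != 0
        x = x + (-x[0] / d[0]) * d

# steepest_descent_exact([2.0, 0.0]) returns [0.0, -1.0]: the sequence
# stops on the valley line although f decreases along (0, -1) forever.
```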

Hence, we now present classes of well-posed discontinuous functions for which convergence is guaranteed.

Definition A.10

We consider the (resp. generalized/strict) gradient-only optimization problem to be well-posed associated derivative (resp. convex/unimodal) when

  1. the associated gradient field is everywhere uniquely defined,

  2. the problem is associated derivative coercive as defined in Definition A.1,

  3. there exists one and only one (resp. G/S)-NN-GPS (resp. \(S_{G\mbox{-}NN}/S_{S\mbox{-}NN}\)) as defined in Definition 3.7, and

  4. every associated derivative descent sequence as defined in Definition 3.8 has at least one subsequence converging to a point in (resp. \(S_{G\mbox{-}NN}/S_{S\mbox{-}NN}\)).

We now present a class of well-posed associated derivative coercive functions; this includes multimodal functions.

Definition A.11

We consider the gradient-only optimization problem to be (resp. proper/generalized) well-posed associated derivative coercive when

  1. the associated gradient field is everywhere uniquely defined,

  2. the problem is associated derivative coercive as defined in Definition A.1,

  3. there exists at least one (resp. G/S)-NN-GPS (resp. \(S_{G\mbox{-}NN}/S_{S\mbox{-}NN}\)) as defined in Definition 3.7, and

  4. every conservative associated derivative descent sequence as defined in Definition 3.9 has at least one subsequence converging to a point in (resp. \(S_{G\mbox{-}NN}/S_{S\mbox{-}NN}\)).

We note that the classes of functions defined in Definitions A.10–A.11 still exclude many problems of practical significance, e.g. linear programming problems. Many of these practically significant problems may be accommodated by altering Definitions A.10–A.11 to hold only for specific associated derivative descent sequences.

Cite this article

Wilke, D.N., Kok, S., Snyman, J.A. et al. Gradient-only approaches to avoid spurious local minima in unconstrained optimization. Optim Eng 14, 275–304 (2013). https://doi.org/10.1007/s11081-011-9178-7
