1 Introduction

This paper is about the convergence rate of iterative stochastic optimization techniques [37], which are designed, in general, for finding global extrema of difficult problem functions. Such functions may be multimodal and complex (especially in high dimensions, [38]) and at the same time their derivatives may be unavailable (or may not exist). From the theoretical perspective, examining the convergence rate under general assumptions is very difficult, and thus the corresponding mathematical results are available for special classes of functions, including specific transformations of linear and spherical functions, and more generally for various classes of convex functions, see for instance [1, 3, 4, 14, 15, 17,18,19, 25, 28, 29]. The theory often uses Markov chains [10] and Lyapunov functions [32]. Many results describe how the expected time of hitting the \(\varepsilon \)-neighbourhood of the global solution changes as \(\varepsilon \) goes to zero; besides the aforementioned references, see also [2, 11, 12, 20, 34,35,36], which focus on the hitting times.

We assume that we are given a problem function \(f:A\rightarrow \mathbb {R}\) on the solution space \(A\subset \mathbb {R}^d\) and some stochastic process \(\mathbb {X}=X_0,X_1,X_2,\dots \) with values in A. The variables \(X_0,X_1,\dots \) represent the successive states of an iterative optimization algorithm. The literature usually studies the convergence rate of the trajectories \(f(X_t)\rightarrow f^{\star }\), where \(f^\star \) is the global optimum of f, the convergence rate of the expectations \(e_t:=E[|f(X_t)-f^{\star }|]\), where \(E[\cdot ]\) denotes the expected value, or the expected hitting times of an \(\varepsilon \)-neighbourhood of the optimum. This paper works on the asymptotic convergence rate \(ACR=ACR(\mathbb {X},f)\) for the expectations \(e_t\), which is the infimum of constants \(C>0\) for which \(E[|f(X_t)-f^\star |] \le C^t\cdot E[|f(X_0)-f^{\star }|]\) for all large \(t \in \mathbb {N}\). The condition \(ACR(\mathbb {X},f)\in (0,1)\) guarantees the exponential decrease of \(e_t\) (most often called linear convergence, [3, 8, 18], and in some contexts geometric convergence, [10, 31]). Section 2 connects the ACR to other convergence characteristics. The value of ACR equals the upper limit of the geometric average of the one-step decrease ratios in the sense that

$$\begin{aligned} ACR(\mathbb {X},f)=\limsup \limits _{t\rightarrow \infty }\root t \of {\prod \limits _{i=0}^{t-1} \frac{E[|f(X_{i+1})-f^\star |]}{E[|f(X_{i})-f^{\star }|]}} \end{aligned}$$

and thus is determined by the Average Convergence Rate [8, 13]. Furthermore, Sect. 2 shows that under general assumptions the condition \(A:=ACR(\mathbb {X},f)<1\) implies linear convergence of the trajectories \(f(X_t)\) in the sense that

$$\begin{aligned} \limsup \limits _{t\rightarrow \infty }\frac{1}{t}\log \left( \frac{|f(X_t)-f^{\star }|}{|f(X_0)-f^{\star }|}\right) \le \log (A)\ a.\ s. \end{aligned}$$

The constant \(\log (A)\) is an upper bound for the convergence rate of the trajectories. Additionally, \(ACR(\mathbb {X},f)<1\) implies the linear behaviour (with \(\varepsilon \rightarrow 0^+\)) of the expected hitting time \(\tau _\varepsilon \) given by

$$\begin{aligned} \tau _{\varepsilon }=\inf \{t\in \mathbb {N}:|f(X_t)-f^\star |<\varepsilon \}, \end{aligned}$$

in the sense that for some constant \(M>0\) we have \(E[\tau _\varepsilon ]\le M\cdot \log (\frac{1}{\varepsilon })\) for all small \(\varepsilon \). Linear convergence of trajectories and linear behaviour of expected hitting times are both the subject of extensive studies on convergence rates, see for instance [1, 4, 14, 15, 17,18,19]. Finally, we note that \( \ln (ACR(\mathbb {X},f))=\limsup \limits _{t\rightarrow \infty }\frac{\ln (E[|f(X_t)-f^\star |])}{t}\) and thus the problem of minimizing the ACR is related to financial mathematics and control theory in the context of portfolio optimization and the entropic utility, see [22].

For Markov chains on finite state spaces the asymptotic behaviour is determined explicitly by the eigenvalues of the corresponding probability transition matrix, [13, 16]. In the continuous case, without specific assumptions, a general characterisation of the asymptotic behaviour is not obvious. This paper attempts to provide some form of general characterization of the asymptotic convergence rate of \(e_t=E[|f(X_t)-f^{\star }|]\) in continuous optimization under general assumptions. In particular, given the process \(\mathbb {X}=\{X_t\}_{t\in \mathbb {N}}\), Theorem 10 from Sect. 4 provides a lower bound for \(ACR(\mathbb {X},f)\) which does not depend on f. Furthermore, Sect. 4 brings to attention that for the linear convergence rate of the trajectories \(f(X_t)\) only the asymptotic behaviour of the sequence \(f(X_t)\) matters, but at the same time the asymptotic convergence rate of the expectations \(e_t=E[|f(X_t)-f^\star |]\) may be strongly influenced by the behaviour of \(X_t\) outside some vicinity of the global solutions of the problem (for instance, in the context of optimization the value of ACR may be determined by the ability to escape from local minima). The initial motivation for this paper was to answer the following two questions: given a stochastic process \(\mathbb {X}=\{X_t\}_{t\in \mathbb {N}}\), how does the change of the function f influence the value of \(ACR(\mathbb {X},f)\), and what is the relation between the asymptotic convergence rate in the objective space, \(ACR(\mathbb {X},f)\), and the asymptotic convergence rate in the search space, \(ACR(\mathbb {X},d)\), where d is a metric on A (the detailed definitions and assumptions will be introduced in the next section)? Besides theoretical explanations of the above problems, the paper presents examples and numerical simulations.

Section 2 is introductory: it presents notation, general properties and connections between the ACR and other convergence rate characteristics. Among other results, Sect. 3 shows that for bounded functions f the value of \(ACR(\mathbb {X},f)\) has a lower bound which does not depend on f (it depends on the process \(\mathbb {X}\) only). Furthermore, Sect. 3 shows that some algorithms cannot converge linearly fast (exponentially fast) for any nontrivial continuous problem function. Section 4 analyses further how the change of the function f may influence \(ACR(\mathbb {X},f)\). In Sect. 5 we analyze the relation between \(ACR(\mathbb {X},f)\) and \(ACR(\mathbb {X},d)\) (the latter measures the convergence rate in the search space). Section 6 presents numerical simulations with the use of Himmelblau’s test function, the (1+1) evolution strategy and other algorithms.

2 Notation, definitions, properties

In the whole paper we assume that \(A\subset \mathbb {R}^d\) is a closed set (with the standard Euclidean topology) and \(f:A\rightarrow \mathbb {R}\) is a continuous function. The set A represents the space of admissible solutions of the global minimization problem given by the problem function f, which is to find a point \(x^\star \in A\) with

$$\begin{aligned} f(x^\star )=\min \limits _{x\in A}f(x). \end{aligned}$$

It is assumed that the above global minimum exists. Let \(\emptyset \ne A^\star \subsetneq A\) denote the set of global solutions and let \(f^\star \) denote the global minimum so we have \(f^\star =\min \limits _{x\in A}f(x)\) and

$$\begin{aligned} A^\star =\{x\in A:f(x)=f^\star \}. \end{aligned}$$

In the whole paper \((\Omega ,\Sigma ,\mathbb {P})\) is a probability space and \(\mathbb {X}=\{X_t\}_{t\in \mathbb {N}}\) is a stochastic process \(X_t:\Omega \rightarrow A\) which represents the successive states \(X_0,X_1,X_2,\dots \) of an iterative optimization scheme. In general, the optimization process \(X_t\) may take values in an extended space \(X_t=(X_t^1,\dots ,X_t^m,\sigma ^1_t,\dots ,\sigma ^k_t) \in A^m\times B^k\), where B is a space of parameters - in such a case we may analyse the convergence rate of the approximations of \(A^{\star }\) with respect to one of the following functions

$$\begin{aligned} \min \limits _{i=1,\dots ,m} f(X^i_t),\ \ \max \limits _{i=1,\dots ,m} f(X^i_t),\ \ \sum \limits _{i=1}^mf(X^i_t). \end{aligned}$$

We assume that the space A is equipped with the Euclidean metric \(d:A\times A\rightarrow \mathbb {R}^+\). We will say that \(X_t\) converges in probability to \(A^\star \), denoted by \(X_t{\mathop {\rightarrow }\limits ^{\mathbb {P}}}A^\star \), if:

$$\begin{aligned} \forall \varepsilon >0\ \mathbb {P}[d(X_t,A^{\star })<\varepsilon ]\rightarrow 1 \text{ as } t\rightarrow \infty , \end{aligned}$$
(1)

where \(d(x,A^{\star })=\inf \limits _{a\in A^\star }d(x,a)\) is the distance between x and \(A^\star \). Usually \(A^\star \) is a compact set (most often a finite set) and then simply \(d(x,A^\star )=\min \limits _{a\in A^{\star }}d(x,a).\) Note also that \(\mathbb {P}[d(X_t,A^{\star })<\varepsilon ]=\mathbb {P}[X_t\in B(A^{\star },\varepsilon )]\), where

$$\begin{aligned} B(A^{\star },\varepsilon )=\{x\in A:d(x,A^{\star })<\varepsilon \} \end{aligned}$$

is the \(\varepsilon -\)neighbourhood of \(A^{\star }\). If \(d(X_t,A^{\star })\rightarrow 0\) almost surely then we shortly write \(X_t\rightarrow A^{\star }\ a. \ s.\) By \(\mathbb {P}_{X_t}\) we will denote the probability distribution of \(X_t\) defined by \(\mathbb {P}_{X_t}(C)=\mathbb {P}[X_t\in C],\ C\in \mathcal {B}(A),\) where \(\mathcal {B}(A)\) denotes the family of Borel sets on A; see [9] for the general theory of probability. In further chapters we will most often assume that f is a bounded function in the sense that \(\sup \limits _{x\in A}f(x)<\infty \) and that f has a compact underlevel set \(\{x\in A:f(x)\le f^\star +\delta \}\) for some \(\delta >0\), which together guarantee that convergence (1) implies convergence (2):

$$\begin{aligned} E[f(X_t)]\rightarrow f^\star \text{ as } t\rightarrow \infty , \end{aligned}$$
(2)

where \(E[f(X_t)]=\int \limits _\Omega f(X_t)d\mathbb {P}\) is the expected value of \(f(X_t)\).

Finally, we define the asymptotic convergence rate below.

Definition 1

For process \(\mathbb {X}=X_0,X_1,\dots \) and function \(f:A\rightarrow \mathbb {R}\) as above let

$$\begin{aligned} ACR(\mathbb {X},f)=\inf \{C\in \mathbb {R}^+:\exists T_C\in \mathbb {N}\ \forall t>T_C\ E[f(X_t)-f^\star ]\le C^t \cdot E[f(X_0)-f^\star ]\}. \end{aligned}$$

In words, \(ACR(\mathbb {X},f)\) is the infimum of constants \(C\ge 0\) for which we have \(E[f(X_t)-f^\star ]\le C^t \cdot E[f(X_0)-f^\star ]\) for all t large enough. If the sequence \(E[f(X_t)]\) is bounded then always \(ACR(\mathbb {X},f)\le 1\), as the condition \(ACR[\mathbb {X},f]>1\) would imply that \(E[f(X_t)]\) is unbounded. If \(ACR(\mathbb {X},f)<1\), then \(E[f(X_t)]\) converges to \(f^\star \) exponentially fast, which is a very strong convergence mode - it says that the algorithm is able to keep linear convergence while it approximates the global solutions. The smaller the value of \(ACR[\mathbb {X},f]\), the faster the convergence. On the other hand, the condition \(ACR[\mathbb {X},f]=1\) excludes exponential convergence but does not exclude convergence (2). If \(E[f(X_t)]\rightarrow f^\star \) and \(ACR(\mathbb {X},f)=1\) then we may say that the convergence is sublinear.

Given the function \(f:A\rightarrow \mathbb {R}\) define the auxiliary function \(e_f\) by

$$\begin{aligned} e_f(x):=f(x)-f^\star . \end{aligned}$$

Naturally, \(e_f\) has the global minimum \(e_f^\star =0\) and satisfies \(ACR(\mathbb {X},f)=ACR(\mathbb {X},e_f).\)

The proof of the following simple observation may be found in the Appendix.

Observation 1

We have

$$\begin{aligned} ACR(\mathbb {X},f)=\limsup \limits _{t\rightarrow \infty }\root t \of {E[e_f(X_t)]}. \end{aligned}$$
(3)

Equation (3) states that \(ACR(\mathbb {X},f)\) is the upper limit of \(\root t \of {E[e_f(X_t)]}\) and hence the asymptotic convergence rate constant is defined through the geometric average of the one-step decrease ratios [8]. Assuming that \(E[e_f(X_t)]>0\) for all \(t\in \mathbb {N}\), we have

$$\begin{aligned} \root t \of {E[e_f(X_t)]}=\root t \of {\prod \limits _{i=0}^{t-1} \frac{E[e_f(X_{i+1})]}{E[e_f(X_{i})]}}\cdot \root t \of {E[e_f(X_0)]} \end{aligned}$$
(4)

and hence, as \(\root t \of {E[e_f(X_0)]}\rightarrow 1\),

$$\begin{aligned} \limsup \limits _{t\rightarrow \infty }\root t \of {E[e_f(X_t)]}=\limsup \limits _{t\rightarrow \infty }\root t \of {\prod \limits _{i=0}^{t-1} \frac{E[e_f(X_{i+1})]}{E[e_f(X_{i})]}}. \end{aligned}$$
(5)

Observation 1 and Eq. (5) immediately imply that

$$\begin{aligned} \liminf \limits _{t\rightarrow \infty }\frac{E[e_f(X_{t+1})]}{E[e_f(X_t)]}\le ACR(\mathbb {X},f)\le \limsup \limits _{t\rightarrow \infty }\frac{E[e_f(X_{t+1})]}{E[e_f(X_t)]} \end{aligned}$$
(6)

From the above, if the following limit exists

$$\begin{aligned} C=\lim \limits _{t\rightarrow \infty }\frac{E[e_f(X_{t+1})]}{E[e_f(X_t)]}, \end{aligned}$$
(7)

then \(ACR(\mathbb {X},f)=C\). The quantity \(ACR(\mathbb {X},f)\) measures the convergence rate in the objective space. An analogous definition may be formulated for the convergence in the solution space A.
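As a simple illustration of (3) and (7), the following Monte Carlo sketch uses a toy process of our own choosing (an assumption for illustration only): \(X_{t+1}=U_t\cdot X_t\) with \(U_t\sim U(0,1)\), \(f(x)=x\) and \(f^\star =0\), for which \(E[X_t]=2^{-t}\cdot E[X_0]\), so both the t-th roots and the one-step ratios should approach \(1/2\).

```python
# Minimal sketch (our own toy example): estimate ACR via the t-th roots from
# Observation 1 and via the one-step ratios (7) for X_{t+1} = U_t * X_t.
import numpy as np

rng = np.random.default_rng(0)
n_runs, horizon = 200_000, 20

x = np.ones(n_runs)                       # X_0 = 1 for every trajectory
e = np.empty(horizon + 1)                 # e[t] estimates E[e_f(X_t)] = E[X_t]
e[0] = x.mean()
for t in range(1, horizon + 1):
    x *= rng.uniform(0.0, 1.0, size=n_runs)
    e[t] = x.mean()

t = np.arange(1, horizon + 1)
root_estimates = e[1:] ** (1.0 / t)       # t-th roots, Eq. (3)
ratio_estimates = e[1:] / e[:-1]          # one-step ratios, Eq. (7)
print("t-th root at t=20 :", root_estimates[-1])   # roughly 0.5
print("ratio at t=20     :", ratio_estimates[-1])  # roughly 0.5
```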

Definition 2

For any metric \(\rho \) on A compatible with the Euclidean topology,

$$\begin{aligned} ACR[\mathbb {X},\rho ]:= ACR[\mathbb {X},{\hat{\rho }}], \end{aligned}$$

where \({\hat{\rho }}(x):=\rho (x,A^\star )=\inf \limits _{a\in A^{\star }}\rho (x,a)\) measures the distance from \(A^\star \). More explicitly, as \(\min \limits _{x\in A}{\hat{\rho }}(x)=0\), by Observation 1,

$$\begin{aligned} ACR(\mathbb {X},\rho )=\limsup \limits _{t\rightarrow \infty }\root t \of {E[\rho (X_{t},A^\star )]}. \end{aligned}$$

The above measures how fast the expected distance between \(X_t\) and the set of global solutions \(A^\star \) decreases. The convergence in the search space is sometimes referred to as strong convergence, see [14]. In this paper, for simplicity, we focus on the Euclidean metric d, but what we will really need for further considerations is that closed and bounded sets are compact, which is the case for the Euclidean metric.

We start with the following simple theorem.

Theorem 2

If \(ACR[\mathbb {X},f]<1\) then \(f(X_t)\rightarrow f^\star \) almost surely (a.s.).

The above states that the exponential convergence of expectations implies almost sure convergence - the proof is a standard application of so-called complete convergence and may be found in the Appendix. For comparison, see also Theorems 3.2 and 6.8 in [26], which work under the supermartingale inequality:

$$\begin{aligned} E[f(X_{t+1})|X_t,X_{t-1},\dots ,X_0]\le f(X_t). \end{aligned}$$

Theorem 2 holds true without the supermartingale property of the process and it provides a necessary condition for \(ACR[\mathbb {X},d]<1\), see below:

Observation 3

If \(X_t{\mathop {\rightarrow }\limits ^{\mathbb {P}}}A^\star \) but not almost surely, then \(X_t\) does not converge exponentially fast to \(A^\star \), i.e. \(ACR[\mathbb {X},d]\ge 1\).

Proof

The above follows from Theorem 2 applied to function \({\hat{d}}(x)=d(x,A^\star )\). \(\square \)

Observation 4 is another consequence of Theorem 2.

Observation 4

Assume that a continuous function f has a compact underlevel set

$$\begin{aligned} A_\delta :=\{x:f(x)\le f^\star +\delta \} \end{aligned}$$

for some \(\delta >0\). If \(ACR[\mathbb {X},f]<1\), then \(X_t\rightarrow A^{\star }\) almost surely.

Proof

If the set \(A_\delta \) is compact then for any sequence \(x_t\in A\), \(f(x_t)\rightarrow f^\star \) implies \(d(x_t,A^{\star })\rightarrow 0\) (the proof may be found for instance in [30]). Thus, as from Theorem 2 we have \(f(X_t)\rightarrow f^\star \ a.s.\), we also have \(d(X_t,A^{\star })\rightarrow 0\ a.s.\) \(\square \)

Example 1

Consider for a moment the Simulated Annealing algorithm (SA). There are various assumptions under which SA converges to \(A^{\star }\) in probability but does not converge almost surely (see [5, 21]). Roughly speaking, if the so-called cooling schedule of the algorithm generates a sequence of temperatures which decreases to zero slowly enough, then the algorithm wanders through the whole solution space without being trapped by local solutions, but it converges in probability and not almost surely. Observation 3 immediately implies that in such a case SA cannot converge exponentially fast.

An important characteristic of convergence which gets much attention in the literature is the convergence rate of the trajectories of the process.

Definition 3

We will say that \(f(X_t)\rightarrow f^\star \) linearly a.s. if

$$\begin{aligned} \mathbb {P}[\limsup \limits _{t\rightarrow \infty }\root t \of {e_f(X_t)}<1]=1. \end{aligned}$$
(8)

We will say that \(f(X_t)\rightarrow f^\star \) uniformly linearly if for some \(A\in (0,1)\) we have

$$\begin{aligned} \mathbb {P}[\limsup \limits _{t\rightarrow \infty }\root t \of {e_f(X_t)}\le A]=1. \end{aligned}$$
(9)

Note that under assumption \(\mathbb {P}[X_t\notin A^{\star }]=1\) condition (8) is equivalent to

$$\begin{aligned} \limsup \limits _{t\rightarrow \infty }\frac{1}{t}\log \left( \frac{|f(X_t)-f^\star |}{|f(X_0)-f^\star |}\right) <0\ a. s. \end{aligned}$$

The above condition is weak because it does not provide uniform control over the convergence rate of the trajectories. The stronger condition (9) is equivalent to the commonly used condition:

$$\begin{aligned} \limsup \limits _{t\rightarrow \infty }\frac{1}{t}\log \left( \frac{|f(X_t)-f^\star |}{|f(X_0)-f^\star |}\right) \le \log (A)\ a. s. \end{aligned}$$

The above imposes an upper bound on the convergence rate. Furthermore, (9) is equivalent to:

$$\begin{aligned} \forall \varepsilon >0\ \ \mathbb {P}\left[ \sup \limits _{t\in \mathbb {N}}{\frac{e_f(X_t)}{(A+\varepsilon )^t}}<+\infty \right] =1. \end{aligned}$$

Theorem 5 shows that \(ACR[\mathbb {X},f]<1\) is stronger than uniform linear convergence of trajectories and provides an upper bound for the convergence rate of trajectories.

Theorem 5

If \(ACR(\mathbb {X},f)=A<1\) then \(\mathbb {P}[\limsup \limits _{t\rightarrow \infty }\root t \of {e_f(X_t)}\le A]=1.\)

The proof of Theorem 5 uses the Borel-Cantelli Lemma and may be found in the Appendix. Below we present simple illustrations of Theorem 5.

Example 2

Let \(X:\Omega \rightarrow [0,1)\) be a random variable with positive density \(f:[0,1)\rightarrow (0,\infty )\) and let \(X_{t}:=(X)^{t}\) so we have \(X_{t+1}=X\cdot X_t\). It is easy to see that

$$\begin{aligned} \limsup \limits _{t\rightarrow \infty }\root t \of {X_t}=X<1 \ a.\ s. \end{aligned}$$

and thus the trajectories of \(X_t\) converge linearly in the sense of (8). At the same time they do not satisfy (9) because for any constant \(A<1\) we have \(\mathbb {P}[X>A]>0\). By Theorem 5, this immediately implies that \(ACR=1\). For a simple illustration: if the initial variable X is sampled from the uniform distribution then we may calculate directly that \(E[X_{t}]=\int _0^1x^{t}dx=\frac{1}{t+1}\) and thus \(\limsup \limits _{t\rightarrow \infty }\root t \of {E[X_t]}=1.\)
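A short numerical check of this example (a sketch using one sampled realisation of X): the t-th root of the expectation \(E[X_t]=\frac{1}{t+1}\) tends to 1, while the t-th root along a trajectory equals X itself.

```python
# Sketch: compare (E[X_t])^(1/t) with the t-th root along one trajectory.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0)                              # one realisation of X
for t in [10, 100, 1000, 10_000]:
    expectation_root = (1.0 / (t + 1)) ** (1.0 / t)    # (E[X_t])^(1/t) -> 1
    trajectory_root = np.exp(t * np.log(x) / t)        # (X^t)^(1/t) = X, computed in log space
    print(f"t={t:6d}  root of E[X_t]={expectation_root:.4f}  root of X_t={trajectory_root:.4f}")
```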

Example 3

Let \(Z_t:\Omega \rightarrow [0,1]\) be a sequence of random variables with \(M:=\sup \limits _{t\in \mathbb {N}}E[Z_t]<+\infty \). Define \(X_t:=\exp (-t)\cdot Z_t\). We have

$$\begin{aligned} ACR=\limsup \limits _{t\rightarrow \infty }\root t \of {E[X_t]}\le \limsup \limits _{t\rightarrow \infty }\root t \of {\exp (-t)\cdot M}=\exp (-1) \end{aligned}$$

which, by Theorem 5, provides the upper bound for the convergence rate of trajectories: \(\mathbb {P}[\limsup \limits _{t\rightarrow \infty }\root t \of {X_t}\le \exp (-1)]=1\).

Theorem 5 shows that the linear convergence of trajectories in the sense of Definition 3 is a necessary condition for \(ACR(\mathbb {X},f)<1\). However, it is not a sufficient condition - under general assumptions the uniform linear convergence of trajectories does not imply that \(ACR<1\), see for instance Example 7 in Sect. 4.

Another important characteristic of convergence is the behaviour of the stopping time

$$\begin{aligned} \tau _{\varepsilon }=\inf \{t\in \mathbb {N}:e_f(X_t)<\varepsilon \} \end{aligned}$$

with \(\varepsilon \rightarrow 0^+\). As mentioned before, if \(\limsup \limits _{t\rightarrow \infty }\root t \of {e_f(X_t)}\le A\ a.\ s.\) then for any \(C>A\) we have

$$\begin{aligned} M_C:=\sup \limits _{t\in \mathbb {N}}\frac{e_f(X_t)}{C^t}<\infty \ a. s. \end{aligned}$$
(10)

and, by definition, the above random variable \(M_C\) satisfies

$$\begin{aligned} e_f(X_t)\le M_C\cdot C^t, \ t\in \mathbb {N}. \end{aligned}$$

This implies that for any \(\varepsilon >0\) and \(t\in \mathbb {N}\), if \(M_C\cdot C^t<\varepsilon \) then \(\tau _{\varepsilon }\le t\). For \(C\in (A,1)\) we have:

$$\begin{aligned} M_C\cdot C^t<\varepsilon \Leftrightarrow t>\frac{-1}{\log (C)}\cdot \log (\frac{M_C}{\varepsilon }),\ t\in \mathbb {N}. \end{aligned}$$

On the one hand, the above implies that for \(\varepsilon >M_C\) the value \(t=0\) satisfies the above inequality and thus

$$\begin{aligned} \varepsilon >M_C \Rightarrow \tau _{\varepsilon }=0. \end{aligned}$$
(11)

On the other hand, as \(\tau _{\varepsilon }\in \mathbb {N}\), on the set \(\{\varepsilon \le M_C\}\) we have

$$\begin{aligned} \tau _\varepsilon \le \frac{-1}{\log (C)}\cdot \log \left( \frac{M_C}{\varepsilon }\right) +1=\frac{-1}{\log (C)}\log \left( \frac{1}{\varepsilon }\right) -\frac{\log \left( M_C\right) }{\log (C)}+1. \end{aligned}$$
(12)

The above implies that \(\mathbb {P}\)-almost all trajectories of \(X_t\) find the set \(\{x\in A:e_f(x)\le \varepsilon \}\) in \(O(\log (\frac{1}{\varepsilon }))\) iterations (with \(\varepsilon \rightarrow 0^+\)), where \(O(h(\varepsilon ))\) is the standard big-O notation.

The lemma below strengthens (10): \(M_C\) is not only finite a.s. but also satisfies \(E[\log (M_C)\cdot 1_{\{1\le M_C\}}]<+\infty \). The proof may be found in the Appendix.

Lemma 6

If \(A=ACR[\mathbb {X},f]<1\) then for any \(C>A\) and \(M_C=\sup \limits _{t\in \mathbb {N}}\frac{e_f(X_t)}{C^t}\) we have \(E[\log (M_C)\cdot 1_{\{1\le M_C\}}]<+\infty \).

Lemma 6 and Eq. (12) lead to Theorem 7: the condition \(ACR(\mathbb {X},f)<1\) implies that the expected time of hitting \(\{x\in A:e_f(x)\le \varepsilon \}\) is \(O(\log (\frac{1}{\varepsilon }))\) with \(\varepsilon \rightarrow 0^+\). This property is often used as a definition of linear convergence.

Theorem 7

If \(A=ACR(\mathbb {X},f)<1\) then \(E[\tau _\varepsilon ]=O(\log (\frac{1}{\varepsilon }))\) with \(\varepsilon \rightarrow 0^+\). More specifically, for any \(C\in (A,1)\) and \(\varepsilon \in (0,1)\) we have

$$\begin{aligned} E[\tau _{\varepsilon }]\le D_C\cdot \log (\frac{1}{\varepsilon })+E_C, \end{aligned}$$

where \(D_C=\frac{-1}{\log (C)}\) and \(E_C=-\frac{1}{\log (C)}\cdot E[\log (M_C)\cdot 1_{\{1\le M_C\}}]+1<\infty \).

Proof

From Theorem 5 we have (9), i.e. \(\limsup \limits _{t\rightarrow \infty }\root t \of {e_f(X_t)}\le A\ a.\ s.\)

As we have shown, from (9) it follows that for any \(C\in (A,1)\) the Eqs. (11) and (12) are satisfied and thus for the random variable

$$\begin{aligned} h_C(\varepsilon ):=\frac{-1}{\log (C)}\log \left( \frac{1}{\varepsilon }\right) -\frac{\log (M_C)}{\log (C)}+1, \ \varepsilon >0, \end{aligned}$$

the stopping moment \(\tau _{\varepsilon }\) satisfies

$$\begin{aligned} \tau _{\varepsilon }\le h_C(\varepsilon )\cdot 1_{\{\varepsilon \le M_C\}}. \end{aligned}$$

From the above,

$$\begin{aligned} E\left[ \tau _{\varepsilon }\right] \le E[ h_C(\varepsilon )\cdot 1_{\{\varepsilon \le M_C\}}]\le \frac{-1}{\log (C)}\log \left( \frac{1}{\varepsilon }\right) -\frac{1}{\log (C)}\cdot E[\log (M_C)\cdot 1_{\{\varepsilon \le M_C\}}]+1. \end{aligned}$$

The above and Lemma 6 finish the proof as for all \(\varepsilon \in (0,1)\) we have

$$\begin{aligned} E[\log (M_C)\cdot 1_{\{\varepsilon \le M_C\}}]\le E[\log (M_C)\cdot 1_{\{1\le M_C\}}]<\infty . \end{aligned}$$

\(\square \)
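The following Monte Carlo sketch illustrates Theorem 7 on a toy process of our own choosing (an assumption for illustration): for \(X_{t+1}=U_t\cdot X_t\) with \(U_t\sim U(0,1)\) and \(e_f(x)=x\) we have \(ACR=1/2\), so the estimated \(E[\tau _\varepsilon ]\) should grow proportionally to \(\log (\frac{1}{\varepsilon })\).

```python
# Sketch: estimate E[tau_eps] for X_{t+1} = U_t * X_t and check O(log(1/eps)) growth.
import numpy as np

rng = np.random.default_rng(2)
n_runs = 20_000

def expected_hitting_time(eps: float) -> float:
    """Monte Carlo estimate of E[tau_eps] for X_0 = 1."""
    x = np.ones(n_runs)
    tau = np.zeros(n_runs)
    alive = x >= eps                       # trajectories that have not reached {e_f < eps}
    t = 0
    while alive.any():
        t += 1
        x[alive] *= rng.uniform(0.0, 1.0, size=alive.sum())
        newly_hit = alive & (x < eps)
        tau[newly_hit] = t
        alive &= x >= eps
    return float(tau.mean())

for eps in [1e-2, 1e-4, 1e-6, 1e-8]:
    est = expected_hitting_time(eps)
    print(f"eps={eps:.0e}  E[tau]~{est:6.2f}  E[tau]/log(1/eps)~{est / np.log(1.0 / eps):.3f}")
```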

3 Lower bound for convergence rate. Lazy convergence.

The general results of this section and the next section will be presented for two cases: general stochastic processes and Markov chains. Recall that \(X_t\) is a Markov chain on A with probability kernel \(P:A\times \mathcal {B}(A)\rightarrow [0,1]\) if

$$\begin{aligned} \mathbb {P}[X_{t+1}\in C|X_t,\dots ,X_0]=P(X_t,C),\ \ C\in \mathcal {B}(A), \end{aligned}$$

so the future of the process \(X_t\) depends only on the current state (given \(X_t=x_t\) it does not depend on the previous history \(X_0,\dots ,X_{t-1}\)) and the joint probability distribution of \(\mathbb {X}\) is determined by the initial distribution \(\mathbb {P}_{X_0}\) and P. In particular,

$$\begin{aligned} \mathbb {P}_{X_{t+1}}(C)=\int _AP(x,C)\mathbb {P}_{X_t}(dx). \end{aligned}$$

Finally, for a Borel set \(B\subset \mathbb {R}\) let

$$\begin{aligned} \mathcal {C}(A,B,A^{\star }) \end{aligned}$$

denote the set of continuous functions \(h:A\rightarrow B\) with

$$\begin{aligned} h(x)=h_{\min }\Leftrightarrow x\in A^{\star }. \end{aligned}$$

If \(A^{\star }=\{x^\star \}\) then we simply write

$$\begin{aligned} \mathcal {C}(A,B,x^{\star }):=\mathcal {C}(A,B,\{x^\star \}). \end{aligned}$$

We will start the investigation with the following example which shows that in some cases the \(ACR(\mathbb {X},h)\) can take an arbitrary value depending on \(h\in \mathcal {C}(A,\mathbb {R},A^{\star })\).

Example 4

For simplicity, let (A, d) be the interval [0, 1] with the Euclidean distance and let \(A^{\star }=\{0\}\). Let \(a_t\in [0,1]\) be a decreasing sequence with \(a_t\searrow 0\). Let \(\mathbb {X}\) be a Markov chain which starts from \(X_0\in (a_1,a_0]\) and, given \(X_t=x_t\in (a_{t+1},a_t]\), jumps to \((a_{t+2},a_{t+1}]\) according to the uniform distribution \(U(a_{t+2},a_{t+1})\). It is easy to see that the distributions \(\mathbb {P}_{X_t}\) are the uniform distributions \(U(a_{t+1},a_{t})\), and that the probability kernel P satisfies

$$\begin{aligned} P(x,\cdot )=U(a_{t+1},a_{t}) \text{ for } x\in (a_{t},a_{t-1}]. \end{aligned}$$

We have \(E[X_t]=c_t\), where \(c_{t}=\frac{a_{t}+a_{t+1}}{2}\), and

$$\begin{aligned} \frac{E[X_{t+1}]}{E[X_t]}=\frac{c_{t+1}}{c_{t}}. \end{aligned}$$

The above implies that if the limit \(C=\lim \limits _{t\rightarrow \infty }\frac{c_{t+1}}{c_t}\) exists, then \(ACR(\mathbb {X},d)=C\). Now, fix an arbitrary number \(D\in (0,1]\) and define the function \(h\in \mathcal {C}(A,\mathbb {R}^+,0)\) by \(h(0):=0\), \(h(a_t):=D^t\) and, for \(x\in (a_{t+1},a_t)\),

$$\begin{aligned} h(x):=\frac{x-a_{t+1}}{a_t-a_{t+1}}\cdot D^{t}+ \frac{a_{t}-x}{a_t-a_{t+1}}\cdot D^{t+1}. \end{aligned}$$

For such h we have \(ACR(\mathbb {X},h)=D\) as

$$\begin{aligned} \lim \limits _{t\rightarrow \infty }\frac{E[h(X_{t+1})]}{E[h(X_t)]} =\lim \limits _{t\rightarrow \infty }\frac{\frac{1}{2}(D^{t+2}+D^{t+1})}{\frac{1}{2}(D^{t+1}+D^{t})}=D. \end{aligned}$$

We see that in the simple case \(\mathbb {P}_{X_t}=U([a_{t+1},a_t])\) the value of \(ACR(\mathbb {X},h)\) may be arbitrary regardless of the value of \(ACR(\mathbb {X},d)\).
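A simulation sketch of this example, with the specific choices \(a_t=2^{-t}\) and \(D=0.9\) made here for illustration: the estimated one-step ratios of \(E[d(X_t,A^{\star })]\) approach 1/2 while those of \(E[h(X_t)]\) approach D.

```python
# Sketch of Example 4 with a_t = 2^{-t} and D = 0.9 (our own illustrative choices).
import numpy as np

rng = np.random.default_rng(3)
D, horizon, n_runs = 0.9, 25, 100_000

def a(t):
    return 2.0 ** (-np.asarray(t, dtype=float))        # a_t = 2^{-t}

def h(x):
    """Piecewise linear h with h(a_t) = D**t, interpolated linearly on (a_{t+1}, a_t)."""
    t = np.floor(-np.log2(x))
    lo, hi = a(t + 1), a(t)
    return (x - lo) / (hi - lo) * D ** t + (hi - x) / (hi - lo) * D ** (t + 1)

mean_d, mean_h = [], []
x = rng.uniform(a(1), a(0), size=n_runs)               # X_0 ~ U(a_1, a_0)
for t in range(horizon):
    mean_d.append(x.mean())                            # estimates E[d(X_t, A*)]
    mean_h.append(h(x).mean())                         # estimates E[h(X_t)]
    x = rng.uniform(a(t + 2), a(t + 1), size=n_runs)   # jump to the next interval

mean_d, mean_h = np.array(mean_d), np.array(mean_h)
print("last ratio of E[d]:", mean_d[-1] / mean_d[-2])  # approx 1/2 = ACR(X, d)
print("last ratio of E[h]:", mean_h[-1] / mean_h[-2])  # approx D = ACR(X, h)
```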

In spite of the above example, it surprisingly turns out that there are natural limitations on how the function h may influence \(ACR(\mathbb {X},h)\). The following example is very instructive.

Example 5

Pure Random Search (PRS). Let \(A=[0,1]^d\) be the cube with the Euclidean distance and let \(x^\star \) be the unique global minimum of the problem function \(f:A\rightarrow \mathbb {R}\). Let \(\nu \) be a probability measure on \([0,1]^d\) with \(\nu (\{x^\star \})=0\).

Algorithm 1: Pure Random Search
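A minimal sketch of Pure Random Search (assuming, for concreteness, that the sampling measure \(\nu \) is the uniform distribution on \([0,1]^d\)):

```python
# Sketch of PRS: keep the best of independently sampled points; nu is taken uniform here.
import numpy as np

def pure_random_search(f, dim, budget, rng=None):
    """Keep the best of `budget` independent uniform samples on [0,1]^dim."""
    rng = rng or np.random.default_rng()
    best_x = rng.uniform(0.0, 1.0, size=dim)            # X_1 ~ nu
    best_f = f(best_x)
    history = [best_f]                                   # f(X_1), f(X_2), ...
    for _ in range(budget - 1):
        q = rng.uniform(0.0, 1.0, size=dim)              # candidate Q_t ~ nu
        fq = f(q)
        if fq < best_f:                                  # elitist update of the record
            best_x, best_f = q, fq
        history.append(best_f)
    return best_x, best_f, history

# usage: minimise a sphere function centred at x* = (0.5, ..., 0.5)
sphere = lambda x: float(np.sum((x - 0.5) ** 2))
x_best, f_best, hist = pure_random_search(sphere, dim=2, budget=1000,
                                          rng=np.random.default_rng(4))
print(f_best)
```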

Let \(X_1,Q_1,Q_2,\dots \) represent the successive sample points of the above algorithm, so the random variables \(X_1,Q_1,Q_2,\dots \) are independent, each sampled from \(\nu \) (\(\mathbb {P}_{Q_t}=\nu \)). We have

$$\begin{aligned} f(X_t)=\min \{f(X_1),f(Q_1),\dots ,f(Q_{t-1})\} \end{aligned}$$

and we will show that, regardless of \(h\in \mathcal {C}(A,\mathbb {R},x^\star )\), \(ACR(\mathbb {X},h)=1\). To simplify notation we will assume that \(h_{\min }=0\), which does not affect the reasoning. Fix such an h and define

$$\begin{aligned} A_\delta :=\{x\in A:h(x)\le \delta \},\ A_\delta '=A\setminus A_\delta \end{aligned}$$

and

$$\begin{aligned} H(\delta ):=\inf \limits _{x\notin A_\delta }h(x), \ \delta >0. \end{aligned}$$

Note that, as \(h\in \mathcal {C}([0,1]^d,\mathbb {R}^+,x^\star )\), for any \(\delta >0\) we have \(H(\delta )>0\) and for any \(x\in A_\delta '\) we have \(h(x)\ge H(\delta )\). Hence, for any \(\delta >0\),

$$\begin{aligned} Eh(X_t)\ge E[h(X_t)\cdot 1_{A_\delta '}(X_t)]\ge E[H(\delta )\cdot 1_{A_\delta '}(X_t)]=H(\delta ) \cdot \mathbb {P}[X_t\in A_\delta ']. \end{aligned}$$

Now note that by the definition of \(X_t\),

$$\begin{aligned} X_t\notin A_\delta \Leftrightarrow (X_1\notin A_\delta ,Q_1\notin A_\delta ,\dots ,Q_{t-1}\notin A_\delta ) \end{aligned}$$

and hence, by the independence of \(X_1,Q_1,\dots ,Q_{t-1}\),

$$\begin{aligned} \mathbb {P}[\{X_t\notin A_\delta \}]=\left( \nu (A'_\delta )\right) ^{t}, \end{aligned}$$

and thus we obtain

$$\begin{aligned} Eh(X_t)\ge H(\delta )\cdot \left( \nu (A'_\delta )\right) ^t. \end{aligned}$$
(13)

The above implies that for any \(\delta >0\)

$$\begin{aligned} ACR(\mathbb {X},h)\ge \limsup \limits _{t\rightarrow \infty }\root t \of {H(\delta )\cdot \left( \nu (A'_\delta )\right) ^t}=\nu (A'_\delta ). \end{aligned}$$
(14)

Now recall that \(\nu (\{x^\star \})=0\) and \(\bigcap \limits _{\delta >0}A_\delta =\{x^\star \}\). Thus, with \(\delta \searrow 0^+\) we have

$$\begin{aligned} \nu (A'_\delta )\nearrow \nu (A\setminus \ \bigcap \limits _{\delta >0}A_\delta )= \nu (A\setminus \{x^\star \})=\nu (A)-\nu (\{x^\star \})=1-0= 1 \end{aligned}$$

and hence, by (14), \(ACR(\mathbb {X},h)=1\).

Remark 1

The above implies, in particular, that \(ACR(\mathbb {X},f)=1\), where f is the problem function for the PRS algorithm. This algorithm cannot converge exponentially fast for any nontrivial continuous problem, assuming that the target \(A^{\star }\) is a null set (with respect to the sampling measure).

Now we return to the general case of \(A\subset \mathbb {R}^d\) and we want to examine how \(ACR(\mathbb {X},h)\) may depend on h. Naturally, we need some assumptions on h, and we choose the following convenient assumptions on the set of problem functions.

Definition 4

Let \(\mathcal {C}_B(A,\mathbb {R},A^{\star })\subset \mathcal {C}(A,\mathbb {R},A^{\star })\) be the set of those functions f for which:

\(\mathbf{A1)}\):

f is bounded in sense \(\sup \limits _{x\in A} f(x)<\infty \),

\(\mathbf{A2)}\):

for some \(\delta >0\), f has a compact underlevel set of the form

$$\begin{aligned} A_\delta =\{x\in A:f(x)\le f_{\min }+\delta \}. \end{aligned}$$

Note that if \(A\subset \mathbb {R}^d\) is a compact set then \(\mathcal {C}_B(A,\mathbb {R},A^{\star })= \mathcal {C}(A,\mathbb {R},A^{\star }).\) Furthermore, for any process \(X_t\in A\),

$$\begin{aligned} X_t{\mathop {\rightarrow }\limits ^{\mathbb {P}}} A^{\star }\Longleftrightarrow \big ( E[f(X_t)]\rightarrow f_{\min } \text{ for } \text{ all } f\in \mathcal {C}_B(A,\mathbb {R},A^{\star }) \big ). \end{aligned}$$

Additionally, for any \(f\in \mathcal {C}_B(A,\mathbb {R},A^{\star })\) we have that for any sequence \(x_t\in A\),

$$\begin{aligned} f(x_t)\rightarrow f_{\min }\Longleftrightarrow x_t\rightarrow A^{\star }, \end{aligned}$$

where \(x_t\rightarrow A^{\star }\) stands for \(d(x_t,A^{\star })\rightarrow 0.\) Additionally, \({\textbf {A2)}}\) implies that \(A^{\star }\) is compact.

Example 5 has shown that if \(\mathbb {X}\) is a stochastic process resulting from applying PRS to the function f, then \(ACR(\mathbb {X},h)=1\) for any \(h\in \mathcal {C}([0,1]^d,\mathbb {R},x^{\star })\) and thus, in particular, \(ACR(\mathbb {X},f)=1\) and \(ACR(\mathbb {X},d)=1\). Now we will generalize this observation.

Theorem 8

For any \(h\in \mathcal {C}_{B}(A,\mathbb {R},A^{\star })\),

$$\begin{aligned} ACR[\mathbb {X},h]\ge \lim \limits _{\varepsilon \rightarrow 0^+}\limsup \limits _{t\rightarrow \infty }\root t \of {\mathbb {P}[X_t\notin B(A^{\star },\varepsilon )]}. \end{aligned}$$

Proof

Fix \(h\in \mathcal {C}_B(A,\mathbb {R},A^{\star })\) and, to simplify notation, assume that \(h_{\min }=0\). For any \(\varepsilon >0\) we write \(B(A^{\star },\varepsilon )'=A\setminus B(A^{\star },\varepsilon )\). We have

$$\begin{aligned} H(\varepsilon ):=\inf \limits _{x\notin B(A^{\star },\varepsilon )} h(x)>0, \end{aligned}$$

and

$$\begin{aligned} E[h(X_t)]\ge E[h(X_t)1_{B(A^{\star },\varepsilon )'}(X_t)]\ge E[H(\varepsilon )1_{B(A^{\star },\varepsilon )'}(X_t)]= H(\varepsilon )\mathbb {P}_{X_t}(B(A^{\star },\varepsilon )'). \end{aligned}$$

The above implies that for any \(\varepsilon >0\), \(ACR(\mathbb {X},h)=\limsup \limits _{t\rightarrow \infty }\root t \of {E[h(X_t)]}\ge \)

$$\begin{aligned} \limsup \limits _{t\rightarrow \infty }\root t \of {H(\varepsilon )\cdot \mathbb {P}_{X_t}(B(A^{\star },\varepsilon )')} =\limsup \limits _{t\rightarrow \infty }\root t \of {\mathbb {P}_{X_t}(B(A^{\star },\varepsilon )')} \end{aligned}$$

which finishes the proof as \(\varepsilon \) may be arbitrarily small. \(\square \)

Now we will focus on Markov chains. We will start with the following definition, which transfers the definition from [33] to Markov chains with probability kernel P.

Definition 5

Lazy convergence. Assuming \(X_t{\mathop {\rightarrow }\limits ^{\mathbb {P}}}A^{\star }\), we will say that \(X_t\) converges lazily to \(A^{\star }\) iff:

L):

\(\lim \limits _{\varepsilon \rightarrow 0^+}\inf \limits _{x\notin B(A^{\star },\varepsilon )}P(x,A{\setminus } B(A^{\star },\varepsilon ))=1\)

Often we will simply say that \(X_t\) is lazy. The simplest example of a lazy Markov chain is PRS, which samples candidate points from the constant kernel \(\nu (\cdot )\). In this case condition L) reduces to

$$\begin{aligned} \lim \limits _{\varepsilon \rightarrow 0^+}\nu (A\setminus B(A^{\star },\varepsilon ))=1 \end{aligned}$$

which is always satisfied when \(\nu (A^{\star })=0\). An equivalent condition for lazy convergence states that for any sequence \(x_n\in A\) with \(r_n:=d(x_n,A^{\star })\rightarrow 0\) we have

$$\begin{aligned} P(x_n,B(A^{\star },r_n))\rightarrow 0 \text{ as } n\rightarrow \infty . \end{aligned}$$

The intuition behind the above is: the closer to the global solution, the harder it is to generate a better candidate than the current one, because the optimization method uses poor information about the function f. Some algorithms are lazy for any continuous nontrivial problem (examples may be found in [33], including PRS and various versions of Simulated Annealing). Lazy methods do not converge exponentially fast, see below.
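For PRS with the uniform sampling measure on \([0,1]^d\) (an assumption made here for concreteness) the kernel does not depend on x, and the equivalent condition above holds because \(P(x,B(x^{\star },r))=\nu (B(x^{\star },r))\) shrinks like \(r^d\) as \(r\rightarrow 0\); the sketch below estimates these probabilities by Monte Carlo.

```python
# Sketch: for the constant uniform kernel nu on [0,1]^2, nu(B(x*, r)) vanishes as r -> 0.
import numpy as np

rng = np.random.default_rng(5)
dim, n_samples = 2, 1_000_000
x_star = np.full(dim, 0.5)
samples = rng.uniform(0.0, 1.0, size=(n_samples, dim))     # draws from the constant kernel nu
dist = np.linalg.norm(samples - x_star, axis=1)

for r in [0.5, 0.1, 0.02, 0.004]:
    prob = (dist < r).mean()                               # Monte Carlo estimate of nu(B(x*, r))
    print(f"r={r:<6} nu(B(x*, r)) ~ {prob:.6f}   (exact pi*r^2 = {np.pi * r ** 2:.6f})")
```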

Theorem 9

Let \(\mathbb {X}\) be a Markov chain with probability kernel P which starts from \(X_0\notin A^{\star }\). If

$$\begin{aligned} \lim \limits _{\varepsilon \rightarrow 0^+}\inf \limits _{x\notin B(A^{\star },\varepsilon )}P(x,A\setminus B(A^{\star },\varepsilon ))\ge p, \end{aligned}$$
(15)

then, for any \(h\in \mathcal {C}_B(A,\mathbb {R},A^{\star })\),

$$\begin{aligned} ACR[\mathbb {X},h]\ge p. \end{aligned}$$

In particular, if \(\mathbb {X}\) is lazy and \(X_0\notin A^{\star }\), then \(ACR[\mathbb {X},h]\ge 1\) for any \(h\in \mathcal {C}_B(A,\mathbb {R},A^{\star })\).

Proof

Assume that \(p>0\) (in the case \(p=0\) the theorem is trivial). From Theorem 8, it is enough to show

$$\begin{aligned} \lim \limits _{\varepsilon \rightarrow 0^+}\limsup \limits _{t\rightarrow \infty }\root t \of {\mathbb {P}[X_t\notin B(A^{\star },\varepsilon )]}\ge p. \end{aligned}$$

We have

$$\begin{aligned} \mathbb {P}_{X_{t+1}}(B(A^{\star },\varepsilon )')\ge \int _{B(A^{\star },\varepsilon )'}P(x,B(A^{\star },\varepsilon )')\mathbb {P}_{X_t}(dx)\ge \end{aligned}$$
$$\begin{aligned} \ge \inf _{x\notin B(A^{\star },\varepsilon )}P(x,B(A^{\star },\varepsilon )')\cdot \mathbb {P}_{X_t}(B(A^{\star },\varepsilon )'). \end{aligned}$$
(16)

From \(X_0\notin A^{\star }\) and (15), there is \(\varepsilon >0\) small enough to have

$$\begin{aligned} \mathbb {P}_{X_0}(B(A^{\star },\varepsilon )')>0 \text{ and } \inf _{x\notin B(A^{\star },\varepsilon )}P(x,B(A^{\star },\varepsilon )')>0 \end{aligned}$$

which further implies, by (16) and simple induction, that for small \(\varepsilon >0\),

$$\begin{aligned} \mathbb {P}_{X_t}( B(A^{\star },\varepsilon )')>0,\ \ t\in \mathbb {N}. \end{aligned}$$

Thus, for small \(\varepsilon >0\) we may use the inequality:

$$\begin{aligned} \limsup \limits _{t\rightarrow \infty }\root t \of {\mathbb {P}_{X_t}(B(A^{\star },\varepsilon )')}\ge \liminf \limits _{t\rightarrow \infty }\frac{\mathbb {P}_{X_{t+1}}(B(A^{\star },\varepsilon )')}{\mathbb {P}_{X_{t}}(B(A^{\star },\varepsilon )')}. \end{aligned}$$

and hence to prove the theorem it is enough to show that

$$\begin{aligned} \lim \limits _{\varepsilon \rightarrow 0^+}\liminf \limits _{t\rightarrow \infty } \frac{\mathbb {P}_{X_{t+1}}(B(A^{\star },\varepsilon )')}{\mathbb {P}_{X_{t}}(B(A^{\star },\varepsilon )')}\ge p. \end{aligned}$$

The above is easy to see since, by (16), for any small \(\varepsilon \),

$$\begin{aligned} \liminf \limits _{t\rightarrow \infty }\frac{\mathbb {P}_{X_{t+1}}(B(A^{\star },\varepsilon )')}{\mathbb {P}_{X_{t}}(B(A^{\star },\varepsilon )')} \ge \inf _{x\notin B(A^{\star },\varepsilon )}P(x,B(A^{\star },\varepsilon )') \end{aligned}$$

and from (15), \(\lim \limits _{\varepsilon \rightarrow 0^+}\inf \limits _{x\notin B(A^{\star },\varepsilon )}P(x,B(A^{\star },\varepsilon )')\ge p.\ \ \ \) \(\square \)

Warning: In practice the optimization process \(\mathbb {X}\) is defined by the function f. If f and g are two different problem functions then the application of the same optimization algorithm to those functions may result in two different optimization processes \(X_f\) and \(X_g\). The lower bounds from Theorems 8 and 9 may differ between \(X_f\) and \(X_g\). On the other hand, if the change of the problem function from f to g does not change the optimization process, i.e. if

$$\begin{aligned} g\in Inv(f)=\{h\in C_B(A,\mathbb {R},A^{\star }) :X_f=X_h\}, \end{aligned}$$

then Theorems 8 and 9 provide the same lower bound. For instance, many comparison-based algorithms are invariant under monotonic transformations in the sense that for any problem function f and any strictly increasing function \(h:\mathbb {R}\rightarrow \mathbb {R}\) we obtain \(X_f=X_{h\circ f}\). Additionally, if an optimization algorithm is lazy for some class of functions, then for any function from this class the above-mentioned lower bound equals 1.
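A minimal sketch (our own toy example) of this invariance: an elitist, comparison-based step produces the same next state for f and for \(h\circ f\), because only the comparison between the candidate and the current point is used.

```python
# Sketch: an elitist comparison-based step makes identical decisions for f and h o f.
import numpy as np

def elitist_step(obj, x, candidate):
    """Accept the candidate iff it compares favourably; only comparisons of obj are used."""
    return candidate if obj(candidate) < obj(x) else x

f = lambda x: float(np.sum(x ** 2))               # problem function
h = lambda y: np.exp(y) + 3.0 * y                 # strictly increasing transformation
hf = lambda x: h(f(x))

rng = np.random.default_rng(6)
x = rng.normal(size=3)
for _ in range(5):
    candidate = x + 0.3 * rng.normal(size=3)      # the same candidate for both objectives
    x_f = elitist_step(f, x, candidate)
    x_hf = elitist_step(hf, x, candidate)
    assert np.array_equal(x_f, x_hf)              # identical decisions, hence X_f = X_{h o f}
    x = x_f
print("comparison-based steps coincide for f and h o f")
```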

4 General characterization of ACR.

We will start with Theorem 10, which states that, given the process \(\mathbb {X}=X_0,X_1,\dots \), the ACR is the maximum of two components: a component B which does not depend on the problem function f (and Theorem 8 is a consequence) and a component \(D_f\) which depends on f, but only through arbitrarily small neighbourhoods of the global solutions \(A^{\star }\).

Theorem 10

For any \(\mathbb {X}=X_0,X_1,\dots \) and any \(f\in \mathcal {C}_B(A,\mathbb {R},A^{\star })\) we have

$$\begin{aligned} ACR(\mathbb {X},f)=\max \{B,D_f\}, \end{aligned}$$

where

  • \(B= \lim \limits _{\varepsilon \rightarrow 0^+}\limsup \limits _{t\rightarrow \infty }\root t \of {\mathbb {P}[X_t\notin B(A^{\star },\varepsilon )]},\)

  • \(D_f=\lim \limits _{\varepsilon \rightarrow 0^+}\limsup \limits _{t\rightarrow \infty }\root t \of {E[e_f(X_t)\cdot 1_{B(A^{\star },\varepsilon )}(X_t)]}\).

Theorem 10 separates two factors: the value of \(D_f\) depends only on the behaviour of \(f(X_t)\) asymptotically close to \(A^{\star }\), while the value of B may be determined by factors which are not related to the asymptotics of the trajectories of \(f(X_t)\) - in practice B may represent, for instance, the ability to escape local minima. The interpretation of B and \(D_f\) is case-dependent. For instance, an algorithm may have good local convergence properties which result in a small value of \(D_f\), and at the same time, if the algorithm has poor abilities of escaping local minima, then for a multimodal function one may expect a large value of B which will determine the convergence rate. To our knowledge, the existing theoretical results on convergence rates depend on the limiting behaviour of the algorithm and do not take into account how the ability to escape local minima may influence the asymptotic convergence rate of the expectations \(e_t=E[f(X_t)-f^\star ]\). Example 6 presents the general idea; a more detailed study is left for future research.

Example 6

Assume that for some set \(L\in \mathcal {B}(A)\) which is separated from \(A^{\star }\) we have \(p_T=\mathbb {P}[X_T\in L]>0\) for some \(T\in \mathbb {N}\), and that L traps the algorithm with per-step probability bounded from below:

$$\begin{aligned} \mathbb {P}[X_{t+1}\in L| X_t\in L]\ge 1-\delta \text{ for } \ t\ge T,\ \text{ where } \delta \in (0,1). \end{aligned}$$

The set L may represent a difficult region in the domain A, for example the neighbourhood of some local minima. Regardless of the asymptotic behaviour of the trajectories of \(X_t\), by Theorem 10 we have

$$\begin{aligned} ACR(\mathbb {X},f)=\max {\{B,D_f\}}\ge B \end{aligned}$$

and, by the assumptions,

$$\begin{aligned} B\ge \limsup \limits _{h\rightarrow \infty } \root T+h \of {\mathbb {P}[X_{T+h}\in L]} \ge \limsup \limits _{h\rightarrow \infty } \root T+h \of {p_T\cdot (1-\delta )^h} =1-\delta . \end{aligned}$$
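A simulation sketch of this situation (the chain below is our own construction, with \(p=0.3\), \(\delta =0.05\) and contraction factor \(C=0.5\) chosen for illustration): although every trajectory eventually contracts at rate C, the t-th root of the expectation remains far above C and close to \(1-\delta \).

```python
# Sketch: a trap region visited with probability p and left with probability delta per step
# keeps the expectation decaying only at rate about 1 - delta, even though trajectories
# contract at rate C once they escape.
import numpy as np

rng = np.random.default_rng(7)
p, delta, C = 0.3, 0.05, 0.5
n_runs, horizon = 500_000, 150

x = np.full(n_runs, 0.5)                          # non-trapped runs start at e_f = 0.5
trapped = rng.uniform(size=n_runs) < p            # a fraction p falls into the trap (e_f = 1)
x[trapped] = 1.0

for t in range(horizon):
    escape = trapped & (rng.uniform(size=n_runs) < delta)
    trapped &= ~escape                            # escaped runs start contracting
    x[~trapped] *= C                              # geometric contraction outside the trap

print("t-th root of E[e_f(X_t)]:", x.mean() ** (1.0 / horizon))   # close to 1 - delta = 0.95
```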

Below we present another simple example which illustrates that uniform linear convergence of the trajectories \(f(X_t)\rightarrow 0\) in the sense of (9) does not imply that \(ACR<1\).

Example 7

Let \(f(x)=x\) with \(A=[0,1]\) and let \(C\in (0,1)\). Assume that \(X_t\in A\) starts from \(X_0=1\) and that \(p_t:=\mathbb {P}[X_t=1]\searrow 0^+\). Assume also that once \(X_t\) leaves the point \(x=1\), it decreases to 0 deterministically according to \(X_{t+1}=C\cdot X_t\). We work under the notation of Theorem 10:

  1.

    Condition \(p_t\searrow 0\) implies that we have almost surely

    $$\begin{aligned} \limsup \limits _{t\rightarrow \infty }\root t \of {X_t}\le C \end{aligned}$$

    and the above convergence rate constant for the trajectories does not depend on the convergence rate of probabilities \(p_t\).

  2.

    By definition, constant B from Theorem 10 satisfies

    $$\begin{aligned} B= \lim _{\varepsilon \rightarrow 0^+}\limsup \limits _{t\rightarrow \infty }\root t \of {\mathbb {P}[X_t\ge \varepsilon ]}\ge \limsup \limits _{t\rightarrow \infty }\root t \of {p_t}. \end{aligned}$$

    Constant \(D_f\) from Theorem 10 satisfies

    $$\begin{aligned} D_f\ge C \end{aligned}$$

    as for small \(\varepsilon >0\) we have \(X_{t+1}\cdot 1_{\{X_{t+1}<\varepsilon \}}\ge C\cdot X_{t}\cdot 1_{\{X_{t}<\varepsilon \}}\).

In particular, regardless of \(C\in (0,1)\) which determines the convergence rate of the trajectories, we have

$$\begin{aligned} ACR(\mathbb {X},f)\ge \limsup \limits _{t\rightarrow \infty }\root t \of {p_t} \end{aligned}$$

which may equal one. Furthermore, as in Theorem 5, we have almost surely

$$\begin{aligned} \limsup \limits _{t\rightarrow \infty }\root t \of {X_t}\le ACR(\mathbb {X},f)=\limsup \limits _{t\rightarrow \infty }\root t \of {E[X_t]}. \end{aligned}$$
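A simulation sketch of Example 7 (with the specific choice \(p_t=\frac{1}{t+1}\), made here for illustration): trajectories converge uniformly linearly at rate C, yet \(E[X_t]\ge p_t\) decays only polynomially, so the t-th root of the expectation stays close to 1.

```python
# Sketch of Example 7: leave x = 1 with probability 1/(t+2), which yields P[X_t = 1] = 1/(t+1),
# then decay deterministically by the factor C.
import numpy as np

rng = np.random.default_rng(8)
C, n_runs, horizon = 0.5, 200_000, 200

x = np.ones(n_runs)
at_one = np.ones(n_runs, dtype=bool)              # trajectories still stuck at x = 1
for t in range(horizon):
    leave = at_one & (rng.uniform(size=n_runs) < 1.0 / (t + 2))
    at_one &= ~leave
    x[~at_one] *= C                               # deterministic decay after leaving x = 1

print("t-th root of E[X_t]     :", x.mean() ** (1.0 / horizon))          # close to 1
print("median trajectory root  :", np.median(x ** (1.0 / horizon)))      # close to C = 0.5
```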

Now we will prove Theorem 10. Let us introduce the simplifying notation:

$$\begin{aligned} ACR(c_t,\ {t\in \mathbb {N}}):=\limsup _{t\rightarrow \infty }\root t \of {c_t}, \end{aligned}$$

where \(c_t\in \mathbb {R}^+\) is an arbitrary sequence. We will often write shortly \(ACR(c_t):=ACR(c_t,\ {t\in \mathbb {N}})\). We have:

$$\begin{aligned} ACR(a_t+b_t)=\max \{ACR(a_t),ACR(b_t)\}, \end{aligned}$$
(17)
$$\begin{aligned} ACR(a_t\cdot b_t)\le \max \{ACR(a_t),ACR(b_t)\} \text{ if } a_t\le 1, b_t\le 1, \end{aligned}$$
(18)

where \(a_t\in \mathbb {R}^+\), \(b_t\in \mathbb {R}^+\). Furthermore, if \(a_t\le b_t\), then

$$\begin{aligned} ACR(a_t)\le ACR(b_t). \end{aligned}$$
(19)

Proof of Theorem 10

Fix \(f\in \mathcal {C}_B(A,\mathbb {R},A^{\star })\) and, as \(ACR(\mathbb {X},f)=ACR(\mathbb {X},e_f)\), assume for simplicity \(f^\star =0\) so we have \(f=e_f\). Also let \(D=D_f\) and for \(\varepsilon >0\) define:

$$\begin{aligned} D^t_\varepsilon :=E[f(X_t)1_{B(A^{\star },\varepsilon )}(X_t)], \ \ B^t_\varepsilon :=E[f(X_t)1_{B(A^{\star },\varepsilon )'}(X_t)], \end{aligned}$$

so we have

$$\begin{aligned} E[f(X_t)]=D^t_\varepsilon +B^t_\varepsilon . \end{aligned}$$
(20)

To further simplify notation let

$$\begin{aligned} D_\varepsilon =ACR(D^t_\varepsilon , t\in \mathbb {N}), \ \ B_\varepsilon =ACR(B^t_\varepsilon , t\in \mathbb {N}) \end{aligned}$$

so we have, by (17) and (20), for any \(\varepsilon >0\)

$$\begin{aligned} ACR(\mathbb {X},f)=\max \{B_\varepsilon ,D_\varepsilon \}. \end{aligned}$$

Now it remains to show that, as \(\varepsilon \searrow 0^+\), from (19),

$$\begin{aligned} B_\varepsilon \nearrow B \text{ and } D_\varepsilon \searrow D\in \mathbb {R}^+, \end{aligned}$$
(21)

so we will have

$$\begin{aligned} ACR(\mathbb {X},f)=\max \{B_\varepsilon ,D_\varepsilon \}\rightarrow \max (B,D). \end{aligned}$$
(22)

The second part of the theorem is thus proved, as by definition

$$\begin{aligned} D=\lim \limits _{\varepsilon \rightarrow 0^+}\limsup _{t\rightarrow \infty }\root t \of {D^t_\varepsilon } =\lim \limits _{\varepsilon \rightarrow 0^+}ACR({D^t_\varepsilon },t\in \mathbb {N})=\lim \limits _{\varepsilon \rightarrow 0^+}D_\varepsilon . \end{aligned}$$

The above depends on f only asymptotically close to \(A^{\star }\).

It remains to show \(B=\lim \limits _{\varepsilon \rightarrow 0^+}B_\varepsilon \). Define \(H(\varepsilon ):=\inf \limits _{x\notin B(A^{\star },\varepsilon )}f(x)\) and \(f_{\max }:=\sup \limits _{x\in A}f(x)\). For any \(x\notin B(A^{\star },\varepsilon )\), \(H(\varepsilon )\le f(x)\le f_{\max }\), and hence

$$\begin{aligned} H(\varepsilon )\cdot \mathbb {P}[X_t\notin B(A^{\star },\varepsilon )] \le B^t_\varepsilon \le f_{\max }\cdot \mathbb {P}[X_t\notin B(A^{\star },\varepsilon )]. \end{aligned}$$

Both \(H(\varepsilon )\) and \(f_{\max }\) are positive finite constants and thus \(ACR(B^t_\varepsilon ,t\in \mathbb {N})=ACR(\mathbb {P}[X_t\notin B(A^{\star },\varepsilon )],t\in \mathbb {N})\) which finishes the proof as by definition

\(B=\lim \limits _{\varepsilon \rightarrow 0^+}ACR(\mathbb {P}[X_t\notin B(A^{\star },\varepsilon )], t\in \mathbb {N}).\) \(\square \)

Conclusion 11

If \(f,g\in C_B(A,\mathbb {R},A^{\star })\) satisfy

$$\begin{aligned} \lim \limits _{x\rightarrow A^{\star }}\frac{f(x)-f_{\min }}{g(x)-g_{\min }}=1 \end{aligned}$$

then \(ACR(\mathbb {X},f)=ACR(\mathbb {X},g)\).

Proof

From Theorem 10, \(ACR(\mathbb {X},f)=\max (B,D_f)\) and \(ACR(\mathbb {X},g)=\max (B,D_g)\). Hence, it is enough to show \(D_f=D_{g}\). For simplicity of notation let \(f_{\min }=g_{\min }=0\). From the assumptions it follows that the function defined by \(H(x)=\frac{f(x)}{g(x)}\) for \(x\notin A^{\star }\) and \(H(x)=1\) for \(x\in A^{\star }\) tends to 1 as \(d(x,A^{\star })\rightarrow 0\). Hence, for some \(M\in (0,1)\) and some \(\varepsilon >0\), for all \(x\in B(A^{\star },\varepsilon )\),

$$\begin{aligned} (1-M)\cdot g(x) \le f(x)\le (1+M)\cdot g(x) \end{aligned}$$
(23)

and thus for any \(\varepsilon >0\) small enough

$$\begin{aligned} ACR(E[f(X_t)1_{B(A^{\star },\varepsilon )}(X_t)], t\in \mathbb {N})=ACR(E[g(X_t)1_{B(A^{\star },\varepsilon )}(X_t)],t\in \mathbb {N}), \end{aligned}$$

which implies \(D_f=D_g\). \(\square \)

Markov chains. Now we will reformulate the above results in terms of Markov chains. We assume that \(X_t\) is a Markov chain with probability kernel P. We will work under the assumption:

A):

\(\mathbb {P}_{X_t}(A^{\star })=0,\ t\in \mathbb {N},\)

so we may safely work with the ratio

$$\begin{aligned} \frac{e_f(X_{t+1})}{e_f(X_t)}=\frac{f(X_{t+1})-f^\star }{f(X_t)-f^\star }. \end{aligned}$$

Furthermore, we will assume the inequality:

B):

\(f(X_{t+1})\le f(X_t),\ t\in \mathbb {N},\)

which is satisfied for monotonic (elitist) algorithms. Finally, given \(X_t=x\) define the expected error ratio:

$$\begin{aligned} C_f(x):=\frac{E[e_f(X_{t+1})|X_t=x]}{e_f(x)},\ x\notin A^{\star }. \end{aligned}$$
(24)

The above is determined by the function f and the probability kernel P:

$$\begin{aligned} C_f(x)=\frac{\int \limits _A e_f(y)P(x,dy)}{e_f(x)}, \ \ \ x\notin A^{\star }. \end{aligned}$$

It is easy to see that under \({\textbf {A)}}\), if \(C_f(x)\le C\) for all \(x\in A{\setminus }A^{\star }\), then

$$\begin{aligned} E[e_f(X_{t+1})]\le C\cdot E[e_f(X_t)] \end{aligned}$$

and thus \(ACR(\mathbb {X},f)\le C\). Bounding the convergence rates by

$$\begin{aligned} ACR(\mathbb {X},f)\le \sup \limits _{x\in A\setminus A^{\star }}C_f(x) \end{aligned}$$

is the classical approach to the problem, see [24]. Below we provide lower and upper bounds on the convergence rate based on the limiting behaviour of \(C_f\).
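As a simple illustration of the quantity \(C_f(x)\) (our own sketch, not an algorithm analysed in the paper), the code below estimates the expected error ratio of one elitist step of a (1+1) strategy with fixed Gaussian mutation on the sphere function; it shows, in particular, how the estimate of \(C_f(x)\) depends on the mutation strength.

```python
# Sketch: Monte Carlo estimate of C_f(x) in (24) for one elitist step on f(x) = ||x||^2, A* = {0}.
import numpy as np

rng = np.random.default_rng(9)
f = lambda x: float(np.sum(x ** 2))               # sphere function, f* = 0, so e_f = f

def estimate_C_f(x, sigma, n_samples=100_000):
    """Estimate C_f(x) = E[e_f(X_{t+1}) | X_t = x] / e_f(x) for one elitist step."""
    candidates = x + sigma * rng.normal(size=(n_samples, x.size))
    candidate_values = np.sum(candidates ** 2, axis=1)
    next_values = np.minimum(candidate_values, f(x))      # elitist acceptance
    return next_values.mean() / f(x)

x = np.array([1.0, 1.0])
for sigma in [0.01, 0.3, 3.0]:
    print(f"sigma={sigma:<5} C_f(x) ~ {estimate_C_f(x, sigma):.4f}")
```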

Theorem 12

Assume that \(\mathbb {X}\) and \(f \in \mathcal {C}_B(A,\mathbb {R},A^{\star })\) satisfy A) and B). We have

$$\begin{aligned} {\underline{C}}\le \max \{B,{\underline{C}}\} \le ACR(\mathbb {X},f)\le \max \{B,{\overline{C}}\}, \end{aligned}$$

where:

  1.

    B is given by Theorem 10 and does not depend on f,

  2.
    $$\begin{aligned} {\overline{C}}=\lim \limits _{\varepsilon \rightarrow 0^+}\sup _{x\in B(A^{\star },\varepsilon )\setminus A^{\star }}C_f(x),\\ {\underline{C}}=\lim \limits _{\varepsilon \rightarrow 0^+}\inf _{x\in B(A^{\star },\varepsilon )\setminus A^{\star }}C_f(x). \end{aligned}$$

Recall that from Theorem 10, \(ACR(\mathbb {X},f)=\max \{B,D_f\}\), and thus the above theorem follows from the following lemma.

Lemma 13

Under the assumptions of Theorem 12:

$$\begin{aligned} {\underline{C}}\le D_f\le \max \{B,{\overline{C}}\} \end{aligned}$$

Proof of Lemma 13

We will work under the assumptions and notation of Theorems 10 and 12. In the proof we may assume that \(\mathbb {X}\) is a Markov chain given in the general recursive form

$$\begin{aligned} X_{t+1}=T(X_t,Y_t), \end{aligned}$$
(25)

where \(Y_t:\Omega \rightarrow \mathbb {R}^k\), \(t\in \mathbb {N}\), is a sequence of independent, identically distributed random variables (independent of \(X_0\)) which represents the randomness of the optimization scheme, and \(T:A\times \mathbb {R}^k\rightarrow A\) is a Borel measurable mapping which stands for the deterministic procedures of the algorithm. All Markov chains have such a representation, see [7] and also [23, 32, 33]. The Markov chain (25) has the probability kernel P given by

$$\begin{aligned} P(x,C)=\mathbb {P}(T(x,Y_t)\in C) \end{aligned}$$

and, for any \(h\in \mathcal {C}_B(A,\mathbb {R},A^{\star })\),

$$\begin{aligned} E[h(X_{t+1})|X_t=x]=E[h(T(x,Y_t))]. \end{aligned}$$

Now, we will also assume for simplicity of notation that \(f_{\min }=0\) so we have:

$$\begin{aligned} e_f(X_{t+1})=f(X_{t+1})=\frac{f(X_{t+1})}{f(X_t)}\cdot f(X_t). \end{aligned}$$

Recall that \(X_{t+1}=T(X_t,Y_t)\) and note that \(X_t\) and \(Y_t\) are independent. Hence, in the proof we may use Fubini’s theorem to integrate as below

$$\begin{aligned} E[f(X_{t+1})]&=E\left[ \frac{f(T(X_{t},Y_t))}{f(X_t)}\cdot f(X_t)\right] =\int \limits _{A\setminus {A^{\star }}} E\left[ \frac{f(T(x,Y_t))}{f(x)}\right] \cdot f(x)\, \mathbb {P}_{X_t}(dx) \nonumber \\ &=\int \limits _{A\setminus {A^{\star }}} C_f(x)\cdot f(x)\, \mathbb {P}_{X_t}(dx). \end{aligned}$$
(26)

First inequality. We need to prove that \({\underline{C}}\le D_f\). Let

$$\begin{aligned} D:=D_f=\lim \limits _{\varepsilon \rightarrow 0^+}ACR(E[f(X_t)1_{B(A^{\star },\varepsilon )}(X_t)], t\in \mathbb {N}). \end{aligned}$$

Let \(A_\delta =\{x\in A:f(x)\le \delta \}\). As \(f\in \mathcal {C}_B(A,\mathbb {R}^+,A^{\star })\), for any \(\varepsilon _1>0\), \(\delta _1>0\) there are \(\varepsilon _2>0\), \(\delta _2>0\) with

$$\begin{aligned} B(A^{\star },\varepsilon _1)\supset A_{\varepsilon _2} \text{ and } A_{\delta _1}\supset B(A^{\star },\delta _2), \end{aligned}$$

and hence

$$\begin{aligned} D=\lim \limits _{\delta \rightarrow 0^+}ACR(E[f(X_t)1_{A_\delta }(X_t)], t\in \mathbb {N}). \end{aligned}$$

Define

$$\begin{aligned} D_\delta ^t=E[f(X_t)1_{A_\delta }(X_t)]=\int _{A_\delta } f(x)\mathbb {P}_{X_t}(dx) \end{aligned}$$

and note that

$$\begin{aligned} D_\delta ^{t+1}=E[\frac{f(T(X_{t},Y_t))}{f(X_t)}\cdot f(X_t)1_{A_{\delta }}(X_{t+1})] \end{aligned}$$

As \(f(X_{t+1})\le f(X_t)\), we have \(1_{A_\delta }(X_{t+1})\ge 1_{A_\delta }(X_{t})\) and hence, integrating by Fubini's theorem,

$$\begin{aligned} D_\delta ^{t+1}\ge E[\frac{f(T(X_{t},Y_t))}{f(X_t)}\cdot f(X_t)1_{A_\delta }(X_{t})]= \\ =\int _{A_\delta } C_f(x)\cdot f(x) \mathbb {P}_{X_t}(dx). \end{aligned}$$

Next,

$$\begin{aligned} \frac{D_\delta ^{t+1}}{D_\delta ^{t}}\ge \frac{\int _{A_\delta } C_f(x)\cdot f(x) \mathbb {P}_{X_t}(dx)}{\int _{A_\delta } f(x)\mathbb {P}_{X_t}(dx)}\ge \inf _{x\in A_\delta }C_f(x). \end{aligned}$$

Thus, as

$$\begin{aligned} {\underline{C}}=\lim \limits _{\varepsilon \rightarrow 0^+}\inf \limits _{x\in B(A^{\star },\varepsilon )}C_f(x)=\lim \limits _{\delta \rightarrow 0^+}\inf \limits _{x\in A_\delta }C_f(x), \end{aligned}$$

and for any \(\delta >0\),

$$\begin{aligned} ACR(D_\delta ^t, t\in \mathbb {N})\ge \liminf \limits _{t\rightarrow \infty }\frac{D_\delta ^{t+1}}{D_\delta ^{t}}\ge \inf _{x\in A_\delta }C_f(x), \end{aligned}$$

we have

$$\begin{aligned} D=\lim \limits _{\delta \rightarrow 0^+}ACR(D_\delta ^t)\ge \lim \limits _{\delta \rightarrow 0^+}\inf \limits _{x\in A_\delta }C_f(x)={\underline{C}}. \end{aligned}$$

Second inequality. It remains to prove \(D\le \max \{B,{\overline{C}}\}\). We have

$$\begin{aligned} C_f(x)\le {\overline{C}}_\delta :=\sup _{z\in A_\delta }C_f(z) \text{ for } \text{ any } x\in A_\delta , \end{aligned}$$
(27)

and

$$\begin{aligned} D^t_\delta =E[f(X_{t})1_{A_\delta }(X_t)]= \\ =E[f(X_{t})1_{A_\delta }(X_t)1_{A_\delta }(X_{t-1})]+E[f(X_{t})1_{A_\delta }(X_t)1_{A'_\delta }(X_{t-1})]. \end{aligned}$$

We start with the first component. By assumption \({\textbf {B)}}\) and then by Fubini’s theorem:

$$\begin{aligned} E[f(X_{t})1_{A_\delta }(X_t)1_{A_\delta }(X_{t-1})]=E[f(X_{t})1_{A_\delta }(X_{t-1})]=\\ =\int \limits _{A_\delta } E[f(T(x,Y_t))]\mathbb {P}_{X_{t-1}}(dx)=\int \limits _{A_\delta } C_f(x)\cdot f(x) \mathbb {P}_{X_{t-1}}(dx)\le \\ \le {\overline{C}}_\delta \int _{A_\delta } f(x) \mathbb {P}_{X_{t-1}}(dx)={\overline{C}}_\delta E[f(X_{t-1})1_{A_\delta }(X_{t-1})]. \end{aligned}$$

The second component, as \(f(X_t)\le \delta \) for \(X_t\in A_\delta \), satisfies

$$\begin{aligned} E[f(X_{t})1_{A_\delta }(X_t)1_{A'_\delta }(X_{t-1})]\le \delta E[1_{A_\delta }(X_t)1_{A'_\delta }(X_{t-1})]. \end{aligned}$$

Hence

$$\begin{aligned} D^t_\delta \le {\overline{C}}_\delta E[f(X_{t-1})1_{A_\delta }(X_{t-1})]+\delta E[1_{A_\delta }(X_t)1_{A'_\delta }(X_{t-1})]. \end{aligned}$$

Now define the stopping moment

$$\begin{aligned} \tau _\delta :=\inf \{t\in \mathbb {N}:X_t\in A_\delta \} \end{aligned}$$

so we have for any \(t\ge 1\),

$$\begin{aligned} \mathbb {P}(\tau _\delta =t)=\mathbb {P}(X_t\in A_\delta ,X_{t-1}\notin A_\delta )=E[1_{A_\delta }(X_t)1_{A'_\delta }(X_{t-1})], \end{aligned}$$

and hence

$$\begin{aligned} D^t_\delta \le {\overline{C}}_\delta E[f(X_{t-1})1_{A_\delta }(X_{t-1})]+\delta \mathbb {P}(\tau _\delta =t). \end{aligned}$$

Now we bound \(E[f(X_{t-1})1_{A_\delta }(X_{t-1})]\) in the same fashion as above so we obtain

$$\begin{aligned} D^t_\delta \le ({\overline{C}}_\delta )^2 E[f(X_{t-2})1_{A_\delta }(X_{t-2})]+{\overline{C}}_\delta \mathbb {P}(\tau _\delta =t-1) + \\ +\delta \mathbb {P}(\tau _\delta =t) \end{aligned}$$

and we proceed as above for the next \(t-2\) steps so we finally obtain:

$$\begin{aligned} D^t_\delta \le \dots \le \sum _{k=0}^t({\overline{C}}_\delta )^k\delta \cdot \mathbb {P}[\tau _\delta =t-k]=\delta \cdot c^\delta _t, \end{aligned}$$

where

$$\begin{aligned} c^\delta _t:=\sum _{k=0}^t({\overline{C}}_\delta )^k\cdot \mathbb {P}[\tau _\delta =t-k]= \\ =\ \mathbb {P}[\tau _\delta =t]+{\overline{C}}_\delta \cdot \mathbb {P}[\tau _{\delta }=t-1]+ ({\overline{C}}_\delta )^2\cdot \mathbb {P}[\tau _{\delta }=t-2]+ \\ +\dots +({\overline{C}}_\delta )^t\cdot \mathbb {P}[\tau _{\delta }=0]. \end{aligned}$$

This implies that, as \(D_\delta ^t=E[f(X_{t})1_{A_\delta }(X_t)]\), we have

$$\begin{aligned} ACR(D_\delta ^t,t\in \mathbb {N})=ACR(E[f(X_{t})1_{A_\delta }(X_t)])\le ACR(c^\delta _t). \end{aligned}$$

Fix \(\delta >0\) and to simplify notation let \(C={\overline{C}}_\delta \) and

\(P_t=\mathbb {P}[\tau _{\delta }=t]\). With this notation

$$\begin{aligned} c^\delta _t=P_t+C\cdot P_{t-1}+\dots +C^{t-1}P_1+C^tP_0. \end{aligned}$$
(28)

Note that \(P_t\le \mathbb {P}[X_{t-1}\notin A_\delta ]\) and hence \(ACR(P_t)\le B_\delta \), where \(B_\delta :=ACR(\mathbb {P}[X_{t-1}\notin A_\delta ],t\in \mathbb {N})\) satisfies \(\lim \limits _{\delta \rightarrow 0}B_\delta =B\). We have \(ACR(P_t)< B_\delta +\delta \) and thus, by Eq. (35), there is a constant M such that

$$\begin{aligned} P_t\le (B_\delta +\delta )^t\cdot M,\ t\in \mathbb {N}.\end{aligned}$$
(29)

Let \(G:=\max (B_{\delta }+\delta ,C)\) so we have, by (29),

$$\begin{aligned} C^{t-i}P_i\le G^{t-i}\cdot M\cdot G^i=M\cdot G^t \end{aligned}$$

and hence, by (28),

$$\begin{aligned} c^\delta _t\le M\cdot G^t+M \cdot G^t+\dots +M\cdot G^t=(t+1)\cdot M\cdot G^t. \end{aligned}$$

Thus we have:

$$\begin{aligned} ACR(c^\delta _t)\le \limsup \limits _{t\rightarrow \infty }\root t \of {(t+1)\cdot M\cdot G^t}=G=\max (B_{\delta }+\delta ,C). \end{aligned}$$

This finishes the proof as we have

$$\begin{aligned} ACR(D^t_\delta ,t\in \mathbb {N})\le ACR(c_t^\delta ) \le \max (B_{\delta }+\delta ,{\overline{C}}_\delta )\rightarrow \max (B,{\overline{C}}) \text{ as } \delta \rightarrow 0. \end{aligned}$$

\(\square \)

Observation 14 shows a simple situation in which examining the limiting value \({\overline{C}}\) from Theorem 12 is enough to verify the exponential decrease of the expectations \(e_t=E[f(X_t)-f_\star ]\). We work under the notation of Theorem 12.

Observation 14

Assume that A is compact and function

$$\begin{aligned} A\setminus A^{\star }\ni x\rightarrow C_f(x) \in [0,1] \end{aligned}$$

is continuous. Additionally assume \({\textbf {A)}}\), i.e. \(\mathbb {P}[X_t\in A^{\star }]=0\). If \(C_f(x)<1\) for any \(x\notin A^{\star }\) and \({\overline{C}}<1\), then \(ACR(\mathbb {X},f)<1\).

Proof

We have \(ACR\le \sup \limits _{x\in A{\setminus }A^{\star }}C_f(x)\) and thus it is enough to show \(\sup \limits _{x\in A\setminus A^{\star }}C_f(x)<1\). For some small \(\varepsilon >0\) we have \(\sup \limits _{x\in B(A^{\star },\varepsilon ){\setminus }A^{\star }}C_f(x)<1\) by the condition \({\overline{C}}<1\). At the same time we have \(\sup \limits _{x\notin B(A^{\star },\varepsilon )}C_f(x)=\max \limits _{x\notin B(A^{\star },\varepsilon )}C_f(x)<1\) because under our assumptions the set \(A\setminus B(A^{\star },\varepsilon )\) is compact and thus the continuous function \(C_f\) attains its maximum on \(A\setminus B(A^{\star },\varepsilon )\). \(\square \)

5 Objective space and solutions space.

Various numerical experiments and theoretical papers analyse the convergence rate in the objective space (fitness space). In practice, during the optimization process the successive values of \(f(x_t)\) are observed. However, in many scenarios we are more interested in how good our approximations are in the search space than in the successive values of \(f(x_t)\). The convergence in the search space is sometimes called strong convergence, [14]. This section concerns the relation between \(ACR(\mathbb {X},d)\) and \(ACR(\mathbb {X},f)\). Such a relation is not always a straightforward conclusion from a comparison of the convergence rates of the corresponding trajectories \(x_t\rightarrow A^{\star }\) and \(f(x_t)\rightarrow f^\star \).

Recall that metric d is the Euclidean metric on A and fix \(f\in \mathcal {C}_B(A,\mathbb {R},A^{\star })\). For simplicity of notation let

$$\begin{aligned} d(x):=d(x,A^{\star }),\ x\in A. \end{aligned}$$

We assume A), so we may use the Cauchy-Schwarz inequality as follows:

$$\begin{aligned} E[e_f(X_t)]=E\left[ \frac{e_f(X_t)}{\sqrt{d(X_t)}} \cdot \sqrt{d(X_t)}\right] \le \sqrt{E\left[ \frac{(e_f(X_t))^2}{d(X_t)}\right] }\cdot \sqrt{E[d(X_t)]}. \end{aligned}$$

Let \(ACR_t(\mathbb {X},g):=\root t \of {E[g(X_t)]}\) for a function \(g:A\rightarrow [0,\infty )\). The above implies

$$\begin{aligned} ACR_t(\mathbb {X},e_f)\le \sqrt{ACR_t(\mathbb {X},\frac{e_f^2}{d}) } \cdot \sqrt{ACR_t(\mathbb {X},d)}. \end{aligned}$$

where \(\frac{e_f^2}{d}(x)=\frac{\left( e_f(x)\right) ^2}{d(x)}\). Hence, we deduce

$$\begin{aligned} ACR(\mathbb {X},f)=ACR(\mathbb {X},e_f)\le \sqrt{ACR(\mathbb {X},\frac{e_f^2}{d}) } \cdot \sqrt{ACR(\mathbb {X},d)}. \end{aligned}$$

If we assume that \(ACR(\mathbb {X},f)>0\), the above inequality implies that \(ACR(\mathbb {X},\frac{e_f^2}{d})>0\), and

$$\begin{aligned} K^f_1\cdot ACR(\mathbb {X},f) \le ACR(\mathbb {X},d), \end{aligned}$$
(30)

where

$$\begin{aligned} K^f_1=\frac{ACR(\mathbb {X},f)}{ACR(\mathbb {X},\frac{e_f^2}{d})}. \end{aligned}$$

Above, if \({ACR(\mathbb {X},\frac{e_f^2}{d})}=\infty \) then \(K^f_1=0\).
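Indeed, (30) is obtained by squaring the inequality above, which gives

$$\begin{aligned} \left( ACR(\mathbb {X},f)\right) ^2\le ACR(\mathbb {X},\frac{e_f^2}{d})\cdot ACR(\mathbb {X},d), \end{aligned}$$

and dividing both sides by \(ACR(\mathbb {X},\frac{e_f^2}{d})\) whenever this quantity is finite (it is positive by the remark above).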

Now, we can repeat the above reasoning starting with \(d(X_t)=\frac{d(X_t)}{\sqrt{e_f(X_t)}} \cdot \sqrt{e_f(X_t)}\) to obtain that

$$\begin{aligned} ACR_t(\mathbb {X},d)\le \sqrt{ACR_t(\mathbb {X},\frac{d^2}{e_f}) } \cdot \sqrt{ACR_t(\mathbb {X},f)},\\ ACR(\mathbb {X},d)\le \sqrt{ACR(\mathbb {X},\frac{d^2}{e_f}) } \cdot \sqrt{ACR(\mathbb {X},f)}. \end{aligned}$$

If we assume that \(ACR(\mathbb {X},d)>0\), then

$$\begin{aligned} ACR(\mathbb {X},d) \le K^f_2\cdot ACR(\mathbb {X},f), \end{aligned}$$
(31)

where

$$\begin{aligned} K^f_2=\frac{ACR(\mathbb {X},\frac{d^2}{e_f})}{ACR(\mathbb {X},d)}. \end{aligned}$$

Theorem 15 summarises the above. If f(x) approaches \(f^\star \) no slower than the argument x approaches \(A^{\star }\), then \(ACR(\mathbb {X},f)\le ACR(\mathbb {X},d)\) (in the opposite case, the opposite inequality holds). For instance, if \(A^{\star }\) is a finite set contained in the interior of A and f is differentiable, then \(K^f_1\ge 1\) and further \(ACR(\mathbb {X},f)\le ACR(\mathbb {X},d)\), a natural situation in which the convergence rate in the objective space may be better than the convergence rate in the search space.

Theorem 15

Assume condition A) and let \(f\in \mathcal {C}_B(A,\mathbb {R},A^{\star })\). If \(ACR(\mathbb {X},f)>0\) and \(ACR(\mathbb {X},d)>0\), then

$$\begin{aligned} K^f_1\cdot ACR(\mathbb {X},f)\le ACR(\mathbb {X},d)\le K^f_2\cdot ACR(\mathbb {X},f), \end{aligned}$$

where constants \(K^f_1\) and \(K^f_2\) are from Eqs. (30) and (31). Furthermore, we have:

(a) If \(\limsup \limits _{x\rightarrow A^{\star }}\frac{e_f(x)}{d(x)}<\infty \), then \(1\le K^f_1\le K^f_2\).

(b) If \(\sup \limits _{x\in A\setminus A^{\star }}\frac{d(x)}{e_f(x)}<\infty \), then \(K^f_1\le K^f_2\le 1\).

Proof

As (30) and (31) are already proved, it remains to prove (a) and (b). Let us start with (a), which assumes that

$$\begin{aligned} M:=\limsup \limits _{x\rightarrow A^{\star }}\frac{e_f(x)}{d(x)}<\infty . \end{aligned}$$

To show \(K^f_1\ge 1\) we need to show \(ACR(\mathbb {X},f)\ge ACR(\mathbb {X},\frac{e_f^2}{d})\). Our assumptions imply that the function \(h:A\rightarrow \mathbb {R}^+\) given by

$$\begin{aligned} h(x):=0 \text{ for } x\in A^{\star } \text{ and } h(x):= \frac{e_f(x)}{d(x)} \text{ for } x\notin A^{\star }\end{aligned}$$

is bounded in the sense that \(\sup \limits _{x\in A}h(x)<\infty \). Indeed, we have \(M<\infty \) and, for any \(\varepsilon >0\) and any \(x\notin B(A^{\star },\varepsilon )\), \(h(x)\le \frac{1}{\varepsilon }\cdot \sup \limits _{x\in A}e_f(x)<\infty \) as \(f\in \mathcal {C}_B(A,\mathbb {R},A^{\star })\). Hence,

$$\begin{aligned} \sup \limits _{x\in A}h(x)<\infty . \end{aligned}$$

Now, it remains to note that

$$\begin{aligned} ACR(\mathbb {X},\frac{e_f^2}{d})=ACR(\mathbb {X},e_f\cdot h)\le \\ \le ACR(\mathbb {X},e_f\cdot \sup \limits _{x\in A}h(x))=ACR(\mathbb {X},e_f)=ACR(\mathbb {X},f) \end{aligned}$$

which proves

$$\begin{aligned} K_1^f\ge 1. \end{aligned}$$

The proof of (b) is similar to the above: the assumption \(M_2:=\sup \limits _{x\in A\setminus A^{\star }}\frac{d(x)}{e_f(x)}<\infty \) implies \(K^f_2\le 1\) because under this assumption we have

$$\begin{aligned} ACR(\mathbb {X},\frac{d^2}{e_f})=ACR(\mathbb {X},\frac{d}{e_f}\cdot d)\le ACR(\mathbb {X}, M_2\cdot d)=ACR(\mathbb {X}, d). \end{aligned}$$

\(\square \)

The theorem below states that if f approaches the global minimum no faster than some polynomial of degree n, then \(ACR(\mathbb {X},d)\le \root n \of {ACR(\mathbb {X},f)}\). This implies that in such a case linear convergence in the objective space (\(ACR(\mathbb {X},f)<1\)) implies linear convergence in the search space (\(ACR(\mathbb {X},d)<1\)).

Theorem 16

Let \(f\in \mathcal {C}(A,\mathbb {R},A^{\star })\). If A is compact and there are \(M>0\) and \(n\ge 1\) such that

$$\begin{aligned} e_f(x)\ge M\cdot \left( d(x,A^{\star })\right) ^n \end{aligned}$$

for all x from \(B(A^{\star },\varepsilon )\) for some \(\varepsilon >0\), then

$$\begin{aligned} ACR(\mathbb {X},d)\le \root n \of {ACR(\mathbb {X},f)}. \end{aligned}$$

Proof

By Theorem 10, \(ACR(\mathbb {X},d)=\max \{B,D_d\}\) and \(ACR(\mathbb {X},f)=\max \{B,D_f\}\), where \(B\le 1\), \(D_d\le 1\), \(D_f\le 1\). Hence, it is enough to prove that \(D_d\le \root n \of {D_f}\). For \(M>0\) and \(\varepsilon >0\) as in the assumption we have

$$\begin{aligned} e_f(X_t)\cdot 1_{B(A^{\star },\varepsilon )}(X_t)\ge M\cdot 1_{B(A^{\star },\varepsilon )}(X_t)\cdot (d(X_t))^n \end{aligned}$$

and thus, by Jensen’s inequality,

$$\begin{aligned} E[e_f(X_t)\cdot 1_{B(A^{\star },\varepsilon )}(X_t)]\ge M\cdot E[1_{B(A^{\star },\varepsilon )}(X_t)\cdot (d(X_t))^n]\ge \\ \ge M\cdot \left( E[1_{B(A^{\star },\varepsilon )}(X_t)\cdot d(X_t)]\right) ^n. \end{aligned}$$

The above implies that for all \(\varepsilon >0\) small enough

$$\begin{aligned} \root t \of {E[e_f(X_t)\cdot 1_{B(A^{\star },\varepsilon )}(X_t)]}\ge \root t \of {M}\left( \root t \of {E[1_{B(A^{\star },\varepsilon )}(X_t)\cdot d(X_t)]}\right) ^n \end{aligned}$$

which further implies that \({D_f}\ge (D_d)^n\) and thus \(\root n \of {D_f}\ge D_d. \) \(\square \)

The above results imply that for functions like polynomials on bounded search spaces, linear convergence in the search space is equivalent to linear convergence in the fitness space. The theorem below is a conclusion of Theorems 15 and 16.

Theorem 17

Let \(f\in \mathcal {C}_B(A,\mathbb {R},A^{\star })\). If for some constant \(M>0\) and some polynomial \(w:\mathbb {R}\rightarrow \mathbb {R}\) of degree \(n\in \mathbb {N}\setminus \{0\}\) we have

$$\begin{aligned} M\cdot d(x,A^{\star }) \ge e_f(x)\ge w(d(x,A^{\star })), \end{aligned}$$

then

$$\begin{aligned} ACR(\mathbb {X},f)\le ACR(\mathbb {X},d)\le \root n \of {ACR(\mathbb {X},f)}. \end{aligned}$$
(32)

Without additional assumptions the above inequality cannot be improved. Assume that

$$\begin{aligned} f(x)=x^n,\ x\in [0,1]. \end{aligned}$$

To see that sometimes \(ACR(\mathbb {X},d)=\left( ACR(\mathbb {X},f)\right) ^{\frac{1}{n}}\), it is enough to consider a deterministic sequence \(x_{t+1}=C\cdot x_t\), where \(C\in (0,1)\). We naturally have \(ACR(\mathbb {X},d)=C\) and \(ACR(\mathbb {X},f)=C^n\). However, as indicated before, in the stochastic case the limiting behaviour of the trajectories \(f(x_t)\) does not necessarily determine the value of ACR, and in particular we may have \(ACR(\mathbb {X},f)=ACR(\mathbb {X},d)\). To see this, it is enough to consider a simple example: let \(X_t\) satisfy \(X_t=1\) with probability \(p_t\) and \(X_t=0\) with probability \(1-p_t\). It is easy to see that

$$\begin{aligned} ACR(\mathbb {X},f)=ACR(\mathbb {X},d)=\limsup \limits _{t\rightarrow \infty }\root t \of {p_t}. \end{aligned}$$
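Indeed, here \(A^{\star }=\{0\}\), so on the set \(\{0,1\}\) we have \(e_f(X_t)=X_t^n=X_t\) and \(d(X_t)=X_t\); therefore

$$\begin{aligned} E[e_f(X_t)]=p_t\cdot 1+(1-p_t)\cdot 0=p_t=E[d(X_t)], \end{aligned}$$

so the two sequences of expectations coincide and both asymptotic convergence rates equal \(\limsup \limits _{t\rightarrow \infty }\root t \of {p_t}\).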

Naturally, if f decreases to \(f^\star \) faster than any polynomial, then the conditions \(ACR(\mathbb {X},d)<1\) and \(ACR(\mathbb {X},f)<1\) are not equivalent, which follows immediately from the deterministic case. For instance, for \(x_t=\frac{1}{t}\in [0,1]\) we have \(ACR(x_t,d)=1\). On the other hand, if \(f(x)=\exp (\frac{-1}{x})\) for \(x>0\) and \(f(0)=0\), then \(ACR(x_t,f)=e^{-1}<1\).

6 Simulations.

ACR. Equation (3) allows us to approximate the ACR numerically. The procedure is simple: given the problem function, run the algorithm independently N times (the larger N, the better). Let \(x^i_0,x^i_1,\dots ,x^i_T\) denote the successive states of the algorithm during the i-th run (the whole history of the algorithm up to the T-th iteration). Define

$$\begin{aligned} e_t:=E[e_f(X_t)] \text{ and } ACR_t:=\root t \of {e_t}\cong \root t \of {\frac{1}{N}\sum _{i=1}^{N}(e_f(x_t^i))}. \end{aligned}$$

If the global minimum \(f^\star \) is unknown, we are not able to calculate the error function \(e_f(x)=f(x)-f^{\star }\) and the value of \(ACR_t\) directly. Still, one may use alternative definitions for simulations in such a case; for instance, one can use the following approximation proposed in [13]

$$\begin{aligned} ACR^\tau _t=\root \tau \of {\frac{e_t-e_{t+\tau }}{e_{t-\tau }-e_t}}, \end{aligned}$$

where \(\tau \) is a sufficiently large time interval (\(ACR^{\tau }_t\) is defined for \(t\ge \tau \) and uses \(2\tau \) algorithm iterations for the calculation). Naturally, instead of fixing a large \(\tau \), we may also increase \(\tau \) as t increases; our proposition is \(\tau =t\):

$$\begin{aligned} ACR^{t}_t=\root t \of {\frac{e_t-e_{2t}}{e_0-e_t}} \end{aligned}$$

Note that \(ACR^{t}_t=\root t \of {e_t}\cdot \root t \of {\frac{1-\frac{e_{2t}}{e_t}}{e_0-e_t}}\) and thus if \(\root t \of {1-\frac{e_{2t}}{e_t}}\rightarrow 1\) then \(ACR_t\) and \(ACR_t^t\) have the same limit points.
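For illustration, the three estimators can be computed directly from Monte Carlo data. The sketch below is an illustrative Python implementation (the helper name `acr_estimators`, the array layout of the recorded fitness values and the optional known optimum `f_star` are assumptions of the example, not part of the algorithms above); note that \(ACR^\tau _t\) and \(ACR^t_t\) can be computed from the raw mean fitness values because the unknown \(f^\star \) cancels in the differences.

```python
import numpy as np

def acr_estimators(fitness, f_star=None, tau=100):
    """Estimate ACR_t, ACR^tau_t and ACR^t_t from Monte Carlo data.

    fitness : array of shape (N, T+1), fitness[i, t] = f(x_t^i) for run i.
    f_star  : known global minimum (needed for ACR_t only).
    tau     : averaging window for ACR^tau_t.
    """
    m = fitness.mean(axis=0)               # m[t] ~ E[f(X_t)]
    T = fitness.shape[1] - 1

    # ACR_t = (e_t)^(1/t), available only when f_star is known.
    acr_t = None
    if f_star is not None:
        e = m - f_star                      # e[t] ~ E[f(X_t) - f*]
        acr_t = e[1:] ** (1.0 / np.arange(1, T + 1))

    # ACR^tau_t = ((e_t - e_{t+tau}) / (e_{t-tau} - e_t))^(1/tau);
    # f* cancels in the differences, so the mean fitness m is used directly.
    # (Vanishing denominators are not handled in this sketch.)
    acr_tau = [((m[t] - m[t + tau]) / (m[t - tau] - m[t])) ** (1.0 / tau)
               for t in range(tau, T - tau + 1)]

    # ACR^t_t = ((e_t - e_{2t}) / (e_0 - e_t))^(1/t), defined while 2t <= T.
    acr_tt = [((m[t] - m[2 * t]) / (m[0] - m[t])) ** (1.0 / t)
              for t in range(1, T // 2 + 1)]

    return acr_t, np.array(acr_tau), np.array(acr_tt)
```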

Now, we will illustrate \(ACR_t\), \(ACR^\tau _t\) and \(ACR^t_t\) numerically using Himmelblau’s test function

$$\begin{aligned} f(x,y)=\frac{1}{100}(x^2+y-11)^2+\frac{1}{100}(x+y^2-7)^2. \end{aligned}$$
(33)

Himmelblau’s function has four global solutions and satisfies \(f_{\min }=0\). For the simulations we focus on the square \(A=[-4,4]^2\), which contains all four global solutions, and we use the step-size adaptive random search (SARS), an example of a (1+1) evolution strategy, see [3, 6]. For the simulation we use the parameters \(\alpha =1.1\) and \(\beta =\frac{1}{\sqrt{\alpha }}\). See [3] for more details on the parameters (given an appropriate problem and parameters, SARS is able to approach \(A^{\star }\) exponentially fast). In the simulation the algorithm runs independently \(N=10^3\) times and each run is stopped after \(T=10^3\) iterations.

Algorithm 2: SARS (pseudocode presented as a figure in the original)
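For reproducibility, a minimal Python sketch of a step-size adaptive (1+1) strategy in the spirit of Algorithm 2 is given below. The exact update rule of SARS follows [3, 6]; the version here (Gaussian mutation, step size multiplied by \(\alpha \) after a success and by \(\beta \) after a failure, projection back onto A) is a simplifying assumption of the sketch rather than a verbatim transcription.

```python
import numpy as np

def himmelblau(p):
    """Himmelblau's test function (33), scaled by 1/100."""
    x, y = p
    return ((x**2 + y - 11)**2 + (x + y**2 - 7)**2) / 100.0

def sars(f, x0, sigma0=1.0, alpha=1.1, beta=1.0 / np.sqrt(1.1),
         T=1000, low=-4.0, high=4.0, rng=None):
    """Illustrative step-size adaptive (1+1) strategy of the SARS type."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    sigma = sigma0
    fitness = [f(x)]
    for _ in range(T):
        # Gaussian offspring, projected onto A = [low, high]^d (an assumption).
        y = np.clip(x + sigma * rng.standard_normal(x.shape), low, high)
        if f(y) <= f(x):
            x, sigma = y, sigma * alpha     # success: accept and expand the step
        else:
            sigma = sigma * beta            # failure: reject and contract the step
        fitness.append(f(x))
    return x, np.array(fitness)
```

Running this procedure independently \(N=10^3\) times from uniformly drawn starting points and stacking the recorded fitness values row by row produces the input expected by the `acr_estimators` sketch above.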

The blue dots in Fig. 1 represent the successive values of \(ACR_t\). This sequence stabilizes quickly. The red dots represent the successive values of \(ACR^t_t\); it does not use the information on \(f_{\min }\) and converges more slowly than \(ACR_t\). Note that the first 1000 iterations of the algorithm provide information about the sequence \(ACR^1_1,\dots ,ACR^{500}_{500}\) presented in the figure (the value of \(ACR^t_t\) corresponds to the 2t-th iteration of the algorithm). The green points represent the successive values of \(ACR^{\tau }_t\) with \(\tau =100\) (the values of \(ACR^{\tau }_t\) are available after the first \(2\tau \) iterations). The averaging period is finite (\(\tau =100\)) and the convergence is fast. The estimator \(ACR_t^t\) converges more slowly but avoids the problem of choosing a proper value of \(\tau \). More discussion on simulations may be found in [8].

Fig. 1: \(ACR_t\), \(ACR^\tau _t\) and \(ACR^t_t\) for Himmelblau’s test function

Lazy convergence. We will illustrate Theorem 9 numerically using the PRS algorithm and the following (1+k) evolutionary algorithm with constant mutation parameter \(\sigma \).

Algorithm 3: (1+k) EA (pseudocode presented as a figure in the original)
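Again, a minimal Python sketch of PRS and of a (1+k) EA with a constant mutation parameter \(\sigma \) is given below (Gaussian mutation, elitist selection and projection onto A are assumptions of the example, not a transcription of Algorithm 3).

```python
import numpy as np

def prs(f, T=2000, low=-4.0, high=4.0, rng=None):
    """Pure random search: keep the best of successive uniform samples from A."""
    rng = np.random.default_rng() if rng is None else rng
    x = rng.uniform(low, high, size=2)
    fitness = [f(x)]
    for _ in range(T):
        y = rng.uniform(low, high, size=2)
        if f(y) <= f(x):
            x = y
        fitness.append(f(x))
    return x, np.array(fitness)

def one_plus_k_ea(f, x0, sigma=0.1, k=1, T=2000, low=-4.0, high=4.0, rng=None):
    """Illustrative (1+k) EA with a constant ("lazy") mutation parameter sigma."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    fitness = [f(x)]
    for _ in range(T):
        # k Gaussian offspring around the parent; sigma is never adapted.
        offspring = np.clip(x + sigma * rng.standard_normal((k, x.size)),
                            low, high)
        best = min(offspring, key=f)        # best of the k offspring
        if f(best) <= f(x):                 # elitist (1+k) selection
            x = best
        fitness.append(f(x))
    return x, np.array(fitness)
```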

Algorithm 3 is lazy because we assumed that it keeps a constant mutation parameter \(\sigma \) (evolutionary algorithms avoid lazy convergence thanks to proper self-adaptation mechanisms, but may naturally suffer from premature convergence in some cases, [27]). The theoretical details (and other examples) may be found in [33]. The illustration of Theorem 9 again uses Himmelblau’s function (33). In the simulation PRS samples points from the uniform distribution on A, and in Algorithm 3 ((1+1) ES) the parameters are fixed: \(k=1\) (one offspring per generation) and \(\sigma =0.1\). Figure 2 shows \(ACR_t\) for both algorithms for the first 2000 iterations. Before reaching iteration \(t=100\), the EA starts to outperform PRS. However, as the EA approaches the target \(A^{\star }\) with precision \(\ll \sigma =0.1\), it loses its good convergence properties too and presents poor asymptotic behaviour. See also Example 6.1 in [8, 26] for a similar example. Our Theorem 9 states that lazy algorithms (including PRS and Algorithm 3) will present the same poor asymptotic behaviour for any continuous nontrivial problem.

Fig. 2: \(ACR_t\) for PRS and (1+1) ES for Himmelblau’s function

Search space and objective space. We again use Algorithm 2 with the parameters \(\alpha =1.1\) and \(\beta =\frac{1}{\sqrt{\alpha }}\) and Himmelblau’s test function; near each of its four minimizers the error function behaves like a polynomial of degree \(n=2\) in the distance to \(A^{\star }\). Hence, for some constant \(M>0\) the error function \(e_f\) satisfies

$$\begin{aligned} e_f(x)\ge M \cdot \left( d(x,A^{\star })\right) ^2,\ x \in A. \end{aligned}$$

Hence, as in the proof of Theorem 16, by Jensen’s inequality,

$$\begin{aligned} \root 2 \of {ACR_t(\mathbb {X},f)} \ge \root 2\cdot t \of {M} \cdot ACR_t(\mathbb {X},d). \end{aligned}$$
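In the simulation, \(ACR_t(\mathbb {X},d)\) is estimated analogously to \(ACR_t(\mathbb {X},f)\), using the distance to the four minimizers of (33). The sketch below assumes the standard approximate coordinates of these minimizers and that the positions \(x^i_t\) were recorded during the runs (both are assumptions of the example).

```python
import numpy as np

# Approximate coordinates of the four global minimizers of (33)
# (standard numerical values, quoted here as an assumption of the example).
A_STAR = np.array([[ 3.0,       2.0      ],
                   [-2.805118,  3.131312 ],
                   [-3.779310, -3.283186 ],
                   [ 3.584428, -1.848126 ]])

def dist_to_a_star(p):
    """Euclidean distance d(p, A*) to the nearest global minimizer."""
    return float(np.min(np.linalg.norm(A_STAR - np.asarray(p), axis=1)))

def acr_d(states):
    """ACR_t in the search space from Monte Carlo data.

    states : array of shape (N, T+1, 2) with the recorded positions x_t^i.
    Returns the sequence (mean of d(X_t, A*))^(1/t) for t = 1, ..., T.
    """
    dists = np.array([[dist_to_a_star(p) for p in run] for run in states])
    mean_d = dists.mean(axis=0)             # mean_d[t] ~ E[d(X_t, A*)]
    return mean_d[1:] ** (1.0 / np.arange(1, len(mean_d)))
```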

Figure 3 shows \(ACR_t(\mathbb {X},d)\) and \(ACR_t(\mathbb {X},f)\) for the first 500 iterations. Both sequences stabilize quickly and we see that for large t the values of \(ACR_t(\mathbb {X},d)\) (red points) behave like \(\root 2 \of {ACR_t(\mathbb {X},f)}\) (green points). Theorem 17 states that in this case (with \(n=2\)) we always have

$$\begin{aligned} ACR(\mathbb {X},f)\le ACR(\mathbb {X},d)\le \root 2 \of {ACR(\mathbb {X},f)}. \end{aligned}$$
(34)

Figure 3 suggests that \(ACR_t(\mathbb {X},d)\approx \root 2 \of {ACR_t(\mathbb {X},f)}\) for large t, which indicates that in this case \(ACR(\mathbb {X},d)\) reaches the upper bound of inequality (34).

Fig. 3: \(ACR_t\) in the search space and in the objective space

7 Conclusion and further research

This paper analyses the properties of the asymptotic convergence rate (ACR) of the sequence \(e_t=E[f(X_t)]-f_\star \), where \(X_t\) is a stochastic process for which \(f(X_t)\) converges to the global minimum \(f_\star \) of the function f. The presented studies are motivated by the questions of how to provide a general characterisation of the ACR in continuous optimization under general assumptions, how a change of f influences the value of the asymptotic convergence rate of \(e_t\), and what the relation is between the convergence rate in the objective space and the convergence rate in the search space. Additionally, this paper brings to attention the difference between the analysis of the convergence rate for the expectations \(e_t=E[f(X_t)]-f_\star \) and the analysis of the convergence rate for the trajectories \(f(X_t)\). Good asymptotic behaviour of the trajectories does not necessarily determine the exponential decrease of \(e_t\), and this paper shows, among other results, that for a bounded function f the ACR is the maximum of two factors: the first factor depends on the function f only on arbitrarily small vicinities of the global minimizers, and the second factor may depend on the behaviour of the sequence \(X_t\) outside of such vicinities. To the author’s knowledge, the existing theoretical results on convergence rates focus on unimodal functions; the presented methodology shows how to approach the theoretical analysis of the convergence rates of the expectations \(e_t\) in the case of multimodal functions, which is the aim of future studies.