1 Introduction

Algorithm portfolios [6, 19] (APs) have gained increasing popularity as general-purpose multi-algorithmic frameworks for global optimization. The rationale behind their development lies in the fact that, given a previously unknown optimization problem, it is preferable to apply an algorithmic scheme that combines multiple solvers (or their variants), instead of a single algorithm arbitrarily selected from the relevant literature. The constituent algorithms can run interchangeably or concurrently, sharing the available computational resources. Consequently, the user is relieved from the burden of carefully selecting an appropriate algorithm, which commonly comes at the risk of a poor choice due to the lack of a priori information for the specific problem. In addition, the necessity for proper parameter setting of the selected algorithm can be alleviated by including different variants (parametrizations) in the portfolio. So far, APs have been successfully used in interdisciplinary applications including inventory routing [15], berth allocation [23], lot sizing [16], facility location [3], combinatorics [17], graph pattern matching [2], and cryptography [18]. Also, recent technologies such as deep neural networks have been used for the construction of APs [7].

The inner structure of an AP depends on the type of the computation environment. In the case of a serial computation environment, the constituent algorithms run on a single processing unit. In such cases, the computational resources (a.k.a. computational budget) refer to the total number of available function evaluations, which are gradually allocated in batches to the constituent algorithms [19]. Thus, at each batch, each solver receives a fraction of the available function evaluations proportional to its performance, and consumes it before control passes to the next solver in a round-robin manner.

In the case of parallel multi-processor environments, there are two alternative types of APs. The first type assumes that each constituent algorithm occupies one processing unit, while sharing the available function evaluations with the rest of the algorithms at each batch. In other words, each solver is assigned a fraction of the available function evaluations of the current batch according to its performance, and runs until this fraction is consumed. Then it becomes idle until the next batch assignment. The second type of AP allocates the same fixed number of function evaluations to each processing unit at every batch. In this case, the processing units become the computational resources, and the constituent algorithms compete to occupy a larger fraction of them according to their performance.

The resource allocation process plays a primary role in efficiency. Offline allocation prior to execution may result in sub-optimal performance. On the other hand, online allocation can actively track the relative performance of each constituent algorithm and proportionally allocate the available resources. This way, the most efficient algorithms are awarded additional resources, while the inferior ones are not neglected, since they may be useful in later stages of the optimization procedure. For instance, exploration-oriented algorithms can provide an advantage in early iterations by roughly detecting the most promising regions of the search space (global search), while exploitation-oriented algorithms become more beneficial at later stages of the optimization procedure, where fine-grained search (local search) is desirable. Relevant works have verified these properties using various resource allocation schemes [19].

The present work proposes a resource allocation process based on adaptive decision-making procedures. More specifically, a parallel AP consisting of three state-of-the-art numerical optimization algorithms is considered. The constituent algorithms are the BFGS method with backtracking line search [10], the Nelder–Mead nonlinear simplex method [9], and the particle swarm optimization metaheuristic [12]. The selected algorithms stand among the most popular representatives of the three essential solver classes, namely gradient-based, direct search, and swarm intelligence, respectively. To the best of our knowledge, this is the first AP in the relevant literature that combines algorithms of all three types. We considered a parallel computation environment where each algorithm occupies one processing unit (worker) in a master–worker computation model. The computational budget (function evaluations) is allocated to the three algorithms in batches, according to the general description above.

The initial resource assignment is fair, i.e., each solver is assigned an equal fraction of the function evaluations of the first batch. Then, at the end of each batch, the performance of each solver is assessed in terms of its solution quality so far. An adaptive pursuit strategy is employed to determine the fraction of computational resources that will be assigned to each solver in the next batch, proportionally to its performance. The use of the specific adaptive approach was motivated by its tolerance to non-stationary environments, as well as its good performance in relevant resource allocation tasks [22]. Thus, each solver is assigned a reward at the end of each batch, and its forthcoming computational budget is determined according to the reward as well as an estimate of its future quality, similarly to a reinforcement learning approach.

The proposed scheme was benchmarked against its constituent algorithms on the challenging and computationally demanding problem of atomic configuration through the minimization of Lennard–Jones potentials [8]. This problem type has long served as a testbed for benchmarking global optimization algorithms [11], and exhibits a number of local minima that scales exponentially with the number of atoms. Our experiments employed instances ranging from 20 to 80 atoms in order to assess the ability of the proposed AP to outperform all its constituent algorithms on the specific problems. For this purpose, different configurations of the AP were studied and analyzed. The results are supported by statistical analysis, providing insight on the proposed algorithmic schemes.

The rest of the paper is organized as follows: Sect. 2 offers the necessary background information. Section 3 describes the proposed approach and Sect. 4 is devoted to the experimental evaluation. Finally, the paper concludes in Sect. 5.

2 Background information

In the following paragraphs, we offer the necessary background information. Our presentation includes APs, the adaptive pursuit approach, the three solvers that are later used to build an illustrative example of the proposed AP, as well as the Lennard–Jones potential functions, i.e., our testbed for benchmarking. In our descriptions below, we assume that the problem under consideration is the general bound-constrained minimization problem defined as:

$$\begin{aligned} \min \limits _{x \in X} f(x), \end{aligned}$$
(1)

where \(X \subset {\mathbb {R}}^n\) is an n-dimensional hypercube.

2.1 Algorithm portfolios

APs constitute multi-algorithmic optimization schemes. They typically consist of a number of different algorithms (solvers) applied on a specific problem, running either interchangeably (serial computation) or concurrently (parallel computation). Different variants or parametrizations of an algorithm can be considered as different algorithms participating in the AP. Typical characteristics of APs include resource sharing, gradual allocation of the computational budget, and retaining all algorithms active during the run [19].

The rationale behind such schemes lies in the assumption that, having no prior information on the optimal algorithm for the problem at hand (if one exists), it is better to apply a number of algorithms that share the available resources. This way, the user is prevented from adopting the wrong solver, a decision that may cost crucial time and/or money. Even in the case where an algorithm is appropriate for the problem at hand, its performance may vary at different stages of the optimization procedure. For example, in the first iterations of the algorithm, wide exploration of the search space is beneficial, while local fine-tuning is preferred at the final stage (convergence phase). Finding an algorithm that satisfies all these requirements is rather hard, if not impossible [1, 20].

Taking into consideration the above requirements, we can infer that making the most of such multi-algorithmic schemes depends on the following properties:

  (a) Complementarity: The constituent algorithms shall be complementary in terms of their performance profiles or their types (in the sense explained above). Thus, weaknesses of one algorithm can be alleviated by the others.

  (b) Adaptability: The resource allocation mechanism shall adapt to the current performance of the algorithms in order to promote the best-performing ones.

  (c) Interaction: The algorithms shall be able to share their findings. This property allows cooperation among them, expediting the detection of acceptable solutions.

Complementarity among algorithms is part of the more general algorithm selection problem, which has been studied for many decades [14]. In APs, the user is not strictly expected to select the best algorithms. Such a demand would cancel the main gain of using APs, namely relieving the user from the burden of optimal algorithm selection. However, we expect that diverse characteristics of different algorithms can be roughly distinguished by the user. Such characteristics include algorithm type (e.g., deterministic or stochastic, gradient-based or gradient-free, etc.) or variants produced by exploitation-oriented or exploration-oriented operators (e.g., the gbest and lbest models of particle swarm optimization).

Interaction between the constituent algorithms is also an essential feature in AP design. Interaction allows all algorithms to share the best findings of the most successful ones, and possibly improve them even further. For example, a good candidate solution detected at the first steps by an exploration-oriented algorithm can be communicated to an exploitation-oriented one and be further improved, e.g., to a local minimizer. Depriving the algorithms of interaction renders the expected solution quality at most equal to the best expected solution quality achievable by the constituent algorithms individually for the specific computational budget [13].

Resource allocation is probably the most critical process of an AP. The underlying procedure shall monitor the current performance of the constituent algorithms, and proportionally allocate the available computational resources. The allocation takes place in batches and shall adhere to all restrictions described so far. Naturally, previous experience on using the constituent algorithms can be very helpful at this stage.

Algorithm 1 The general algorithm portfolio workflow

Putting it formally, let:

$$\begin{aligned} {{\mathcal {A}}}{{\mathcal {P}}} = \left\{ {\mathcal {A}}_1, {\mathcal {A}}_2,\ldots , {\mathcal {A}}_M \right\} , \end{aligned}$$

denote an AP comprising M algorithms, \({\mathcal {A}}_1, {\mathcal {A}}_2,\ldots , {\mathcal {A}}_M\), for solving the general problem defined in Eq. (1). Let \(t_{\max }\) denote the total computational budget available to the portfolio in terms of function evaluations, and let \(b_{\max }\) denote the number of batches, i.e., the cycles of resource allocation. It is typically assumed that each batch comprises the same fraction of the total computational budget:

$$\begin{aligned} t_b = \frac{t_{\max }}{b_{\max }}. \end{aligned}$$

The resource allocation procedure is responsible for defining the fraction of \(t_b\) that is allocated to each constituent algorithm. Thus, in the i-th batch, algorithm \({\mathcal {A}}_j\) receives \(\tau _j^{(i)}\) function evaluations to consume, such that:

$$\begin{aligned} \sum \limits _{j=1}^{M} \tau _j^{(i)} = t_b, \end{aligned}$$
(2)

and,

$$\begin{aligned} \sum \limits _{i=1}^{b_{\max }} \,\, \sum \limits _{j=1}^{M} \tau _j^{(i)} = t_{\max }. \end{aligned}$$
(3)

After consuming its current budget, each algorithm returns the best candidate solution it has detected so far. The overall best solution (among all constituent algorithms) is updated at this point and communicated to all algorithms. At the next batch, each algorithm may either restart or continue its run from the point it stopped in the previous one.

Algorithm 1 describes the general AP workflow. In serial computation environments, the algorithms run one after the other at each batch (Steps 8–10). In parallel computing environments, the most representative model is the standard master–worker model where each worker node runs an algorithm, while the master node is responsible for the bookkeeping and coordination among the workers. In that case, Steps 8–10 are performed in parallel while the rest of the procedure is run on the master node.
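
To make this workflow concrete, the following is a minimal serial sketch of Algorithm 1 in Python. The solver objects, their run(budget, incumbent) interface, and the allocate callback are hypothetical placeholders rather than part of the original formulation; in the parallel setting, the inner loop over solvers would be distributed to the worker nodes, and allocate would implement a rule such as the adaptive pursuit scheme of Sect. 3.

```python
# Minimal serial sketch of the AP workflow of Algorithm 1 (hypothetical interface).
# Each solver is assumed to expose run(budget, incumbent) -> (best_x, best_f),
# resuming its internal state and consuming at most `budget` function evaluations.

def run_portfolio(solvers, allocate, t_max, b_max):
    t_b = t_max // b_max                          # evaluations per batch
    tau = [t_b // len(solvers)] * len(solvers)    # fair split in the first batch
    x_best, f_best = None, float("inf")
    for b in range(b_max):
        results = []
        for solver, budget in zip(solvers, tau):  # Steps 8-10: run each solver
            x, fx = solver.run(budget, x_best)    # (executed in parallel on workers)
            results.append((x, fx))
        for x, fx in results:                     # update the overall best solution
            if fx < f_best:
                x_best, f_best = x, fx
        tau = allocate(results, t_b)              # resource allocation for the next batch
    return x_best, f_best
```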

2.2 Adaptive pursuit

Adaptive pursuit is a variant of probability matching, a decision-making strategy where predictions of class membership are proportional to the class prior probabilities [5, 21]. In the original probability matching framework, a set of algorithms \(\{{\mathcal {A}}_1,{\mathcal {A}}_2,\ldots ,{\mathcal {A}}_M\}\) is considered. The problem resembles the multi-armed bandit problem, i.e., an algorithm is selected from a group to be applied at the next iteration \((t+1)\), given that the algorithms’ selection probabilities at iteration t are:

$$\begin{aligned} P^{(t)} = \left\{ P_1^{(t)}, P_2^{(t)}, \ldots , P_M^{(t)} \right\} , \end{aligned}$$

with:

$$\begin{aligned} \sum \limits _{i=1}^{M} P_i^{(t)} = 1. \end{aligned}$$

An algorithm \({\mathcal {A}}_m\) is selected according to these probabilities, using a proportional selection mechanism. Applying \({\mathcal {A}}_m\) returns a reward denoted as \(R_m^{(t)}\), which is stored in a reward vector for all algorithms:

$$\begin{aligned} R^{(t)} = \left\{ R_1^{(t)}, R_2^{(t)}, \ldots , R_M^{(t)} \right\} . \end{aligned}$$

The rewards are related to the solution quality of the algorithm, and they are used to produce running estimates of the algorithms’ performance:

$$\begin{aligned} Q^{(t)} = \left\{ Q_1^{(t)}, Q_2^{(t)}, \ldots , Q_M^{(t)} \right\} . \end{aligned}$$

The estimates can be defined in various ways, with the exponential recency-weighted average being among the most popular ones. According to it, the reward estimate of the selected algorithm \({\mathcal {A}}_m\) is updated as follows:

$$\begin{aligned} Q_m^{(t+1)} = (1-\gamma ) \, Q_m^{(t)} + \gamma \, R_m^{(t)}, \end{aligned}$$
(4)

where \(\gamma \in (0,1]\) is the adaptation rate.

The main goal is the iterative selection of appropriate algorithms such that the expected cumulative reward defined as:

$$\begin{aligned} {\mathbb {E}}[R] = \sum \limits _{t=1}^{t_{\max }} R_m^{(t)}, \end{aligned}$$

is maximized. Therefore, the selection probabilities are proportionally updated according to the new quality estimates. In order to avoid vanishing probabilities, a common lower bound \(p_{\min }\) is considered:

$$\begin{aligned} P_i^{(t)} \geqslant p_{\min } > 0, \qquad \forall \, i, t. \end{aligned}$$

Letting \(S_Q^{(t+1)}\) be the sum of current estimates:

$$\begin{aligned} S_Q^{(t+1)} = \sum \limits _{i=1}^{M} Q_i^{(t+1)}, \end{aligned}$$

the selection probabilities of all algorithms are updated as follows:

$$\begin{aligned} P_i^{(t+1)} = p_{\min } + (1-M \, p_{\min }) \frac{Q_i^{(t+1)}}{S_Q^{(t+1)}}, \qquad i=1,2,\ldots ,M. \end{aligned}$$

However, this proportional update rule does not retain the reward-maximization goal in non-stationary environments [22].

For this reason, the adaptive pursuit alternative has been proposed. In this case, an algorithm is probabilistically selected according to the selection probabilities. The selected algorithm is applied and a reward is received, similarly to probability matching. The reward is used to update the estimate of the selected algorithm according to Eq. (4). However, the probabilities are now updated following a different approach. Firstly, the algorithm with the best estimate is determined:

$$\begin{aligned} m^* = \mathop {{{\,\mathrm{arg max}\,}}}\limits _{j \in \{1,\ldots ,M\}} Q_j^{(t+1)}. \end{aligned}$$
(5)

Then, the selection probabilities of all algorithms are updated as follows:

$$\begin{aligned} P_i^{(t+1)} = \left\{ \begin{array}{ll} P_{m^*}^{(t)} + \beta \, \left( p_{\max } - P_{m^*}^{(t)} \right) , &{} \quad \text {if } \,\, i=m^*, \\ P_{i}^{(t)} + \beta \, \left( p_{\min } - P_{i}^{(t)} \right) , &{} \quad \text {if } \,\, i \ne m^*, \end{array} \right. \end{aligned}$$
(6)

where \(p_{\min }\) and \(p_{\max }\) define lower and upper bounds of the probabilities, and \(0 < \beta \leqslant 1\) is the learning rate. The bounds are necessary for alleviating vanishing algorithms. The two values are related as follows:

$$\begin{aligned} p_{\max } = 1-(M-1) \, p_{\min }. \end{aligned}$$
(7)

Showing that the probabilities above sum to 1 is trivial.

The adaptive-pursuit modification was shown to be suitable for non-stationary environments [22]. Such environments are expected during the minimization of the targeted objective function of Eq. (1), in terms of the rewards awarded to the algorithms in consecutive batches of the AP.
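
The following sketch illustrates one adaptive pursuit update in Python, assuming NumPy arrays for the probabilities, estimates, and rewards. Note that, unlike the original bandit setting where only the selected algorithm's estimate changes, all estimates are updated here, as is done later in the portfolio of Sect. 3.

```python
import numpy as np

def adaptive_pursuit_step(P, Q, R, beta=0.5, gamma=0.5, p_min=0.1):
    """One adaptive pursuit update: estimates via Eq. (4), probabilities via Eqs. (5)-(7).

    P, Q, R are NumPy arrays of length M (selection probabilities, estimates, rewards).
    """
    M = len(P)
    p_max = 1.0 - (M - 1) * p_min              # Eq. (7)
    Q = (1.0 - gamma) * Q + gamma * R          # Eq. (4)
    m_star = int(np.argmax(Q))                 # Eq. (5): best current estimate
    P_new = P + beta * (p_min - P)             # Eq. (6), case i != m*
    P_new[m_star] = P[m_star] + beta * (p_max - P[m_star])  # Eq. (6), case i = m*
    return P_new, Q
```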

2.3 BFGS with backtracking line search

BFGS is probably the most popular quasi-Newton optimization algorithm. The version that was used in our experiments is the standard BFGS with backtracking line search that can be found in popular texts such as [10]. The algorithm produces a sequence of iterates:

$$\begin{aligned} x^{(k+1)} = x^{(k)} + \alpha _k p^{(k)}, \end{aligned}$$

where \(p^{(k)}\) is the search direction and \(\alpha _k\) is the step size. The search direction is given at each step as:

$$\begin{aligned} p^{(k)} = -H_k \nabla _x f^{(k)}, \end{aligned}$$

where \(H_k\) is a symmetric, positive definite approximation of the inverse Hessian matrix of f(x) evaluated at \(x^{(k)}\). This matrix undergoes rank-2 updates at each iteration as follows:

$$\begin{aligned} H_{k+1} = H_k + \frac{\left( s_k^T y_k + y_k^T H_k y_k \right) \left( s_k s_k^T \right) }{\left( s_k^T y_k \right) ^2 } - \frac{H_k y_k s_k^T + s_k y_k^T H_k}{s_k^T y_k}, \end{aligned}$$

where \(y_k = \nabla f^{(k+1)}-\nabla f^{(k)}\), and \(s_k = x^{(k+1)} - x^{(k)}\). As initial approximation \(H_0\), the identity matrix can be used. The step size \(\alpha _k\) is determined through backtracking line search, satisfying the Armijo condition of sufficient reduction:

$$\begin{aligned} f^{(k+1)} \leqslant f^{(k)} + \rho _1 \alpha _k \nabla _x f^{(k)\,T} p^{(k)}, \end{aligned}$$

where \(0< \rho _1 < 1\) is a user-defined parameter. The algorithm restarts from new initial conditions whenever it reaches a local minimum, i.e.:

$$\begin{aligned} \Vert \nabla _x f^{(k+1)} \Vert \leqslant \varepsilon _g, \end{aligned}$$

for a prescribed threshold \(\varepsilon _g > 0\). In case the available computational budget is spent in the current batch of the AP, the algorithm continues its run from the point it stopped as soon as the next batch is started. The reader is referred to Chapter 6 in [10] for a more detailed presentation.
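
A compact sketch of the method is given below, assuming callables f and grad for the objective and its gradient; the step-size reduction factor and iteration limits are illustrative choices, not values prescribed in this work.

```python
import numpy as np

def bfgs_backtracking(f, grad, x0, rho1=1e-4, eps_g=1e-6, max_iter=1000,
                      shrink=0.5, alpha0=1.0):
    """Sketch of BFGS with Armijo backtracking line search (Sect. 2.3)."""
    x = np.asarray(x0, dtype=float)
    H = np.eye(x.size)                           # initial inverse-Hessian approximation
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) <= eps_g:           # local minimum reached (restart point)
            break
        p = -H @ g                               # search direction
        alpha, fx = alpha0, f(x)
        while f(x + alpha * p) > fx + rho1 * alpha * (g @ p) and alpha > 1e-12:
            alpha *= shrink                      # backtrack until the Armijo condition holds
        x_new = x + alpha * p
        g_new = grad(x_new)
        s, y = x_new - x, g_new - g
        sy = s @ y
        if sy > 1e-12:                           # rank-2 update of the inverse Hessian
            H = (H + ((sy + y @ H @ y) * np.outer(s, s)) / sy**2
                   - (np.outer(H @ y, s) + np.outer(s, H @ y)) / sy)
        x, g = x_new, g_new
    return x, f(x)
```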

2.4 Nelder–Mead nonlinear simplex algorithm

The Nelder–Mead algorithm [10], henceforth abbreviated as NM, is based on the concept of n-simplex, which is the convex hull of \(n+1\) points \(x_1,x_2,\ldots ,x_{n+1} \in {\mathbb {R}}^n\) with the property that all vectors \(x_i - x_1\), \(i = 2,\ldots ,n+1\), are linearly independent. The simplex is usually denoted as \(S = S(x_1,x_2,\ldots ,x_{n+1})\), and it geometrically defines a polyhedron with vertices at the points \(x_i\).

At each iteration of the NM algorithm, at least one vertex of the simplex moves to a new position in order to improve its value. The vertices are assumed to be labeled in ascending function value, \(f_1 \leqslant f_2 \leqslant \cdots \leqslant f_n \leqslant f_{n+1}\), where \(f_i = f(x_i)\), \(i=1,2,\ldots ,n+1\). Given the average of the best n vertices of the simplex:

$$\begin{aligned} {\bar{x}} = \frac{1}{n} \sum \limits _{i=1}^{n} x_i, \end{aligned}$$

the algorithm produces new points as follows:

$$\begin{aligned} x = (1+\rho ) \, {\bar{x}} - \rho \, x_{n+1}, \end{aligned}$$
(8)

where \(\rho \in \{-0.5, 0.5, 1, 2\}\) is a parameter that determines the corresponding operator. Assuming k to be the current iteration index, a full cycle of the algorithm consists of the following steps:

  (1) Reflection: Apply Eq. (8) with \(\rho = 1\). If the produced reflection point \(x_R\) has \(f_1^{(k)} \leqslant f_{R} < f_n^{(k)}\), then it replaces \(x_{n+1}^{(k)}\) in the simplex.

  (2) Expansion: If the reflection point has \(f_R < f_1^{(k)}\), then apply Eq. (8) with \(\rho = 2\). If the produced expansion point \(x_{E}\) has \(f_{E} < f_{R}\), then \(x_{E}\) replaces \(x_{n+1}^{(k)}\), otherwise \(x_{R}\) is retained.

  (3) External contraction: If the reflection point has \(f_n^{(k)} \leqslant f_{R} < f_{n+1}^{(k)}\), then apply Eq. (8) with \(\rho = 0.5\). If the produced external contraction point \(x_{EC}\) has \(f_{EC} \leqslant f_{R}\), then \(x_{EC}\) replaces \(x_{n+1}^{(k)}\) in the simplex.

  (4) Internal contraction: If the reflection point has \(f_{n+1}^{(k)} \leqslant f_{R}\), then a step is taken in the opposite direction by applying Eq. (8) with \(\rho = -0.5\). If the produced internal contraction point \(x_{IC}\) has \(f_{IC} < f_{n+1}^{(k)}\), then it replaces \(x_{n+1}^{(k)}\).

  (5) Shrink: If all the previous operators fail, the simplex shrinks toward its best vertex:

    $$\begin{aligned} x_{i}^{(k+1)} = x_{1}^{(k)} + \frac{x_i^{(k)}-x_1^{(k)}}{2}, \qquad i=2,3,\ldots , n+1, \end{aligned}$$

    reducing its volume.

In our implementation, the algorithm was restarted whenever the difference between best and worst vertex value was lower than a prescribed threshold \(\varepsilon _f > 0\), i.e., \(f_{n+1}^{(k)} - f_{1}^{(k)} \leqslant \varepsilon _f\), or a number of iterations \(k_{imp}\) without any improvement of the best solution is exceeded. In order to avoid very frequent restarts especially at the later stages of the optimization procedure where improvements are not easily achieved, the value of \(k_{imp}\) is doubled after each restart. The reader is referred to Chapter 9 of [10] for a detailed presentation of the NM algorithm.
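
The sketch below captures one full cycle of the NM operators described above, with trial points generated through Eq. (8); the restart logic and the parameters \(\varepsilon _f\) and \(k_{imp}\) are omitted for brevity.

```python
import numpy as np

def nm_iteration(simplex, fvals, f):
    """One Nelder-Mead cycle (Sect. 2.4): simplex is an (n+1, n) array of vertices,
    fvals holds their objective values, and f is the objective function."""
    simplex, fvals = np.asarray(simplex, float), np.asarray(fvals, float)
    order = np.argsort(fvals)                    # sort vertices by ascending value
    simplex, fvals = simplex[order], fvals[order]
    x_best, x_worst = simplex[0], simplex[-1]
    xbar = simplex[:-1].mean(axis=0)             # centroid of the n best vertices

    def trial(rho):                              # Eq. (8)
        x = (1.0 + rho) * xbar - rho * x_worst
        return x, f(x)

    def shrink():                                # step (5): shrink toward the best vertex
        simplex[1:] = x_best + (simplex[1:] - x_best) / 2.0
        fvals[1:] = [f(v) for v in simplex[1:]]

    x_r, f_r = trial(1.0)                        # (1) reflection
    if fvals[0] <= f_r < fvals[-2]:
        simplex[-1], fvals[-1] = x_r, f_r
    elif f_r < fvals[0]:                         # (2) expansion
        x_e, f_e = trial(2.0)
        simplex[-1], fvals[-1] = (x_e, f_e) if f_e < f_r else (x_r, f_r)
    elif fvals[-2] <= f_r < fvals[-1]:           # (3) external contraction
        x_ec, f_ec = trial(0.5)
        if f_ec <= f_r:
            simplex[-1], fvals[-1] = x_ec, f_ec
        else:
            shrink()
    else:                                        # (4) internal contraction
        x_ic, f_ic = trial(-0.5)
        if f_ic < fvals[-1]:
            simplex[-1], fvals[-1] = x_ic, f_ic
        else:
            shrink()
    return simplex, fvals
```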

2.5 Particle swarm optimization

Particle swarm optimization, henceforth denoted as PSO, is placed among the most popular stochastic metaheuristics for numerical optimization [12]. It employs a swarm of search points, \(S = \{ x_1,x_2, \ldots , x_N \}\), where \(x_i = (x_{i1}, x_{i2},\ldots , x_{in})^T \in X\), \(i=1,2,\ldots ,N\), are also called the particles. The swarm is randomly initialized in X. The algorithm iteratively proceeds by allowing each particle \(x_i\) to move in X using an adaptable position shift, called the velocity, \(v_i = (v_{i1},v_{i2},\ldots , v_{in})^T\). The absolute value of each velocity component is typically constrained by a user-defined threshold \(v_{\max }\), which is usually defined as a fraction of the corresponding search-space range.

During its move, each particle retains in memory the best position it has ever visited (in terms of function value). Assuming k to be the current iteration index, the best position of the ith particle is defined as:

$$\begin{aligned} p_i^{(k)} = {{\,\mathrm{arg min}\,}}_{x \in \{x_i^{(1)},\ldots ,x_i^{(k)}\}} f(x). \end{aligned}$$

The best positions of all particles are collected in the best-positions set \(P = \{ p_1, p_2, \ldots , p_N \}\). Moreover, the algorithm has an inherent mechanism of diffusing information among the particles. For this purpose, each particle has a predefined neighborhood, which is a subset of particles with which it shares its findings. In practice, the neighborhood consists of a subset of particle indices. Two popular PSO variants are the gbest model, where each particle assumes the whole swarm as its neighborhood, and the lbest model where the ith particle assumes as neighbors the \((i+1)\)th and \((i-1)\)th particles.

The particles update their positions at each iteration as follows:

$$\begin{aligned} v_{ij}^{(k+1)}= & {} \chi \, \left[ v_{ij}^{(k)} + \mathtt {rand(\,)} \, c_1 \, \left( p_{ij}^{(k)} - x_{ij}^{(k)} \right) + \mathtt {rand}(\,) \, c_2 \, \left( p_{g_i j}^{(k)} - x_{ij}^{(k)} \right) \right] \\ x_{ij}^{(k+1)}= & {} x_{ij}^{(k)} + v_{ij}^{(k+1)} \end{aligned}$$

where \(i = 1,2,\ldots ,N\), and \(j=1,2,\ldots ,n\); \(g_i\) is the index of the best particle in its neighborhood; \(c_1, c_2 > 0\) are the so-called cognitive and social learning rates; \(\chi \) is the constriction coefficient that prevents velocity explosion; and \(\mathtt {rand(\,)}\) is a function producing a random number in [0, 1] at each call. In case a particle component violates the search space boundary, it is restricted to that boundary.

After evaluating the new particle positions, their best positions are updated as follows:

$$\begin{aligned} p_i^{(k+1)} = \left\{ \begin{array}{ll} x_i^{(k+1)}, &{} \quad \text {if} \,\, f(x_i^{(k+1)}) < f(p_i^{(k)}) \\ p_i^{(k)}, &{} \quad \text {otherwise}, \end{array} \right. \end{aligned}$$

where \(i = 1,2,\ldots ,N\). The parameters of the algorithm are usually set to default values \(\chi = 0.729\), \(c_1=c_2=2.05\) [4]. The reader is referred to Chapter 4 of [12] for further details on PSO design.
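
A minimal sketch of one constriction-coefficient PSO iteration is given below, assuming NumPy arrays for the swarm; componentwise random numbers are used, and the \(v_{\max }\) velocity clamping is omitted for brevity. The lbest flag switches between the gbest model and a simple ring neighborhood.

```python
import numpy as np

def pso_step(X, V, Pbest, fP, f, low, high, chi=0.729, c1=2.05, c2=2.05, lbest=False):
    """One PSO iteration (Sect. 2.5). X, V, Pbest: (N, n) arrays of positions,
    velocities, and best positions; fP: best-position values; [low, high]: bounds."""
    N, n = X.shape
    if lbest:                                    # ring topology: neighbors (i-1, i, i+1)
        g = np.array([min((i - 1) % N, i, (i + 1) % N, key=lambda k: fP[k])
                      for i in range(N)])
    else:                                        # gbest: the whole swarm is the neighborhood
        g = np.full(N, int(np.argmin(fP)))
    r1, r2 = np.random.rand(N, n), np.random.rand(N, n)
    V = chi * (V + c1 * r1 * (Pbest - X) + c2 * r2 * (Pbest[g] - X))
    X = np.clip(X + V, low, high)                # restrict violated components to the boundary
    fX = np.apply_along_axis(f, 1, X)
    improved = fX < fP                           # best-position update
    Pbest, fP = Pbest.copy(), fP.copy()
    Pbest[improved], fP[improved] = X[improved], fX[improved]
    return X, V, Pbest, fP
```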

2.6 Lennard–Jones potential function

The Lennard–Jones cluster optimization problem (abbreviated as LJ) comprises the detection of stable clusters of atoms [8, 11]. The geometry of the cluster is related to the global minimum of its potential surface. The LJ potential function defines the corresponding global minimization problem. The problem is differentiable, has numerous local minima, and its dimension scales linearly with the number of atoms. Due to its complexity, it has been identified as a challenging real-world problem for various types of algorithms. This is the reason for selecting it as our testbed for the experimental assessment of the proposed APs against their constituent algorithms.

Putting it formally, let \(\eta \geqslant 2\) be the number of atoms, each given by its real-valued coordinates \(a_{ij}\) in the 3-dimensional Euclidean space:

$$\begin{aligned} A_1 = (a_{11},a_{12},a_{13}), \,\, A_2 = (a_{21},a_{22},a_{23}), \,\, \dots , \,\, A_{\eta } = (a_{\eta 1},a_{\eta 2},a_{\eta 3}). \end{aligned}$$

The total energy of the cluster is given by:

$$\begin{aligned} E(A_1,\ldots ,A_{\eta }) = 4 \, \epsilon \, \sum \limits _{i=1}^{\eta -1} \sum \limits _{j=i+1}^{\eta } \left( \left( \frac{\sigma }{r_{ij}} \right) ^{12} - \left( \frac{\sigma }{r_{ij}} \right) ^{6} \right) , \end{aligned}$$

where \(\epsilon \) is the pair well-depth, \(\sigma \) is the equilibrium pair separation, and \(r_{ij}\) is the Euclidean distance (\(l_2\)-norm) between atoms \(A_i\) and \(A_j\). A common setting followed in our work is \(\epsilon = \sigma = 1\) [8]. The objective function \(E(A_1,\ldots ,A_{\eta })\) is differentiable, its dimension for \(\eta \) atoms is \(n=3 \, \eta \), and the coordinates of the atoms are assumed to lie in \(X \triangleq [-3,3]^n\).
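
A compact sketch of the resulting objective function is shown below, assuming that the decision vector x concatenates the \(\eta \) atom coordinates as described above.

```python
import numpy as np

def lj_energy(x, epsilon=1.0, sigma=1.0):
    """Lennard-Jones cluster energy for a flat decision vector x of length 3*eta."""
    A = np.asarray(x, dtype=float).reshape(-1, 3)       # atom coordinates A_1, ..., A_eta
    diff = A[:, None, :] - A[None, :, :]                # pairwise coordinate differences
    r = np.sqrt((diff ** 2).sum(axis=-1))               # pairwise Euclidean distances
    i, j = np.triu_indices(len(A), k=1)                 # each pair (i < j) counted once
    sr6 = (sigma / r[i, j]) ** 6
    return 4.0 * epsilon * np.sum(sr6 ** 2 - sr6)
```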

3 Proposed resource allocation approach

The proposed approach combines the general AP framework described in Sect. 2.1 with the adaptive pursuit strategy presented in Sect. 2.2 for resource allocation in a parallel computation environment. More specifically, we consider a master–worker model where the master node acts as bookkeeper and coordinator among the algorithms, while each worker node is devoted to one algorithm of the portfolio. Note that the notion of “node” stands for a processing thread and not necessarily for a physical node.

Firstly, the user needs to specify the constituent algorithms \({\mathcal {A}}_1,\ldots , {\mathcal {A}}_M\), of the algorithm portfolio. For simplicity, we henceforth assume that each algorithm \({\mathcal {A}}_j\) is assigned to the worker node \({\mathcal {N}}_j\), \(j=1,2,\ldots ,M\), while \({\mathcal {N}}_0\) stands for the master node. Secondly, the user needs to specify the total computational budget \(t_{\max }\), which can be either the maximum number of function evaluations or the maximum running time. The first one is the typical choice, although the latter may be desirable in cases of commercial clusters where users pay for a specific amount of running time for their applications. In our presentation below, we assume that \(t_{\max }\) refers to function evaluations, while the adaptation to running time is trivial.

The next step is to define the number of batches, \(b_{\max } \ll t_{\max }\). The standard choice is a user-defined fixed number of batches. In this case, the number of function evaluations per batch is fixed to \(t_b = t_{\max } / b_{\max }\). Given that, in the next batch, an algorithm can continue its run from the point it stopped in the previous one, the exact value of \(b_{\max }\) does not have a direct impact on the dynamics of the constituent algorithms. However, it can implicitly affect them, because at the beginning of a new batch each algorithm receives the overall best solution so far from the master node. If this solution is incorporated in the algorithm, it can reasonably be expected to affect its dynamics.

Thus, using a large number of batches results in frequent communication of the overall best solution to each algorithm. In turn, this can rapidly bias the constituent algorithms towards the best detected solution. On the other hand, small values of \(b_{\max }\) allow each algorithm to explore the search space before receiving information on the findings of other solvers. Running the algorithms individually without any interaction corresponds to the special case \(b_{\max } = 1\). Random selection of \(b_{\max }\) is another option if an informed decision is difficult. In our experimental analysis in Sect. 4, we show that random selection of batches can give interesting results in a multi-experiment setting.

After setting \(b_{\max }\), the computational budget \(t_b\) per batch is available and can be allocated to the algorithms, taking care to satisfy the conditions of Eqs. (2) and (3). In the first batch, all algorithms receive the same portion of \(t_b\), i.e., \(\tau _j^{(1)} = t_b/M\), for all \(j=1,2,\ldots ,M\). Moreover, the initial probabilities, rewards, and estimates are set to:

$$\begin{aligned} P_j^{(1)}=1/M, \qquad R_j^{(1)} = Q_j^{(1)} = 0, \qquad j=1,2,\ldots ,M. \end{aligned}$$

After finishing all these preliminary tasks in the master node \({\mathcal {N}}_0\), the worker nodes are invoked. Each worker \({\mathcal {N}}_j\) runs its assigned algorithm \({\mathcal {A}}_j\) until its allocated computational budget is exceeded. Obviously, this step is fully parallelizable. After finishing their first batch, all algorithms send their best solutions so far to the master node. Thus, the master node receives M candidate solutions, \(x_j^{(1)}\), \(j=1,2,\ldots ,M\), and sets the best one among them as the overall best solution \(x^*\) so far.

Note that an algorithm may restart (even multiple times) before consuming its allocated budget in the current batch. This is observed especially in local search algorithms that rapidly converge to local minima. On the other hand, an algorithm may not converge within its available budget limits. In both cases, the best candidate solution detected by the algorithm from its start is the one sent to the master node. In the next batch, the algorithm can continue its run from the point it stopped.

The next step is the calculation of the reward of each algorithm. Obviously, the reward is tightly connected to performance, i.e., to the achieved solution value. Given that our main goal is minimization, lower solution values are preferable and the corresponding algorithms shall be awarded most of the credit. Thus, the reward of each algorithm shall be inversely proportional to the solution value it has achieved up to the current batch. Direct use of the inverse of the solution values is not recommended due to possible near-zero, negative, or very large values.

Algorithm 2 The proposed algorithm portfolio with adaptive pursuit-based resource allocation (master and worker procedures)

Instead, we suggest the use of linear ranking for the obtained solution values. More specifically, let b denote the current batch, and \(f_j^{(b)}\) be the objective values of the corresponding candidate solutions \(x_j^{(b)}\), \(j=1,2,\ldots ,M\), achieved by our solvers so far. The values are sorted in descending order:

$$\begin{aligned} \begin{array}{rccccccc} \text {Solution value:} ~~~ &{} f_{j_1}^{(b)} &{} \geqslant &{} f_{j_2}^{(b)} &{} \geqslant &{} \cdots &{} \geqslant &{} f_{j_M}^{(b)} \\ &{} \updownarrow &{} &{} \updownarrow &{} &{} &{} &{} \updownarrow \\ \text {Algorithm:} ~~~ &{} {\mathcal {A}}_{j_1} &{} &{} {\mathcal {A}}_{j_2} &{} &{} \cdots &{} &{} {\mathcal {A}}_{j_M} \end{array} \end{aligned}$$

Assuming that \(\rho _j^{(b)} \in \{1,2,\ldots ,M \}\) is the position of algorithm \({\mathcal {A}}_j\) in the descending order, the reward of \({\mathcal {A}}_j\) is defined as:

$$\begin{aligned} R_j^{(b)} = \frac{\rho _j^{(b)}}{\sum \limits _{i=1}^{M} \rho _i^{(b)}}, \qquad j=1,2,\ldots ,M. \end{aligned}$$

This way, all rewards lie in the interval (0, 1) and sum to 1, while better algorithms receive higher values. Moreover, the reward values depend only on the relative positions of the algorithms in the descending order, not on the actual solution values. This approach prevents rewards from taking extremely large or vanishing values.
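
A short sketch of this linear-ranking reward, assuming the best-so-far objective values of the M solvers are collected in an array, is given below.

```python
import numpy as np

def ranking_rewards(f_values):
    """Linear-ranking rewards of Sect. 3: the algorithm with the worst (largest)
    value gets position 1, the best gets position M; rewards sum to 1."""
    f_values = np.asarray(f_values, dtype=float)
    order = np.argsort(-f_values)                 # indices sorted by descending value
    rho = np.empty(len(f_values))
    rho[order] = np.arange(1, len(f_values) + 1)  # position in the descending order
    return rho / rho.sum()
```

For example, ranking_rewards([3.2, -10.5, 0.0]) yields approximately (0.167, 0.5, 0.333), i.e., the solver with the smallest value receives the largest reward.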

Based on the calculated rewards, the reward estimates \(Q_j^{(b+1)}\), \(j=1,2,\ldots ,M\), are updated as follows:

$$\begin{aligned} Q_j^{(b+1)} = (1-\gamma ) \, Q_j^{(b)} + \gamma \, R_j^{(b)}. \end{aligned}$$

Note that in adaptive pursuit only the estimate of the applied algorithm is updated. In our case, all algorithms are applied and, thus, all estimates are updated. Obviously, higher values of the reward estimates are better. Eventually, the new selection probabilities are determined according to the adaptive pursuit approach of Eqs. (5) and (6). Assuming that \(j^*\) is the index of the algorithm with the highest reward estimate, i.e.:

$$\begin{aligned} j^* = \mathop {{{\,\mathrm{arg max}\,}}}\limits _{j \in \{1,\ldots ,M\}} Q_j^{(b+1)}, \end{aligned}$$

the selection probabilities become:

$$\begin{aligned} P_j^{(b+1)} = \left\{ \begin{array}{ll} P_{j^*}^{(b)} + \beta \, \left( p_{\max } - P_{j^*}^{(b)} \right) , &{} \quad \text {if } \,\, j=j^*, \\ P_{j}^{(b)} + \beta \, \left( p_{\min } - P_{j}^{(b)} \right) , &{} \quad \text {if } \,\, j \ne j^*, \end{array} \right. \end{aligned}$$

where \(p_{\min }\) is the user-defined lower bound of the probabilities; \(p_{\max }\) is the corresponding upper bound determined by Eq. (7); and \(0 < \beta \leqslant 1\) is the learning rate.

The current batch of the algorithm is completed with the assignment of the computational budget of the next batch. The assignment is proportional to the selection probabilities of the algorithms. Thus, assuming \(t_b\) to be the available budget for batch \(b+1\), each algorithm is assigned the following number of function evaluations (rounded down):

$$\begin{aligned} \tau _j^{(b+1)} = \left\lfloor t_b \, P_j^{(b+1)} \right\rfloor , \qquad j=1,2,\ldots ,M. \end{aligned}$$

In case some function evaluations are left unassigned due to rounding, they are directly assigned to the best algorithm \({\mathcal {A}}_{j^*}\). At this point the current batch cycle is finished, and the algorithm portfolio proceeds to the next batch by running all algorithms on the worker nodes.
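
The budget assignment of the next batch can then be sketched as follows, with j_star denoting the index of the best algorithm determined above.

```python
import numpy as np

def allocate_budget(P, t_b, j_star):
    """Proportional budget allocation (Sect. 3): floor the products t_b * P_j and
    hand any evaluations left over from rounding to the best algorithm j_star."""
    tau = np.floor(t_b * np.asarray(P, dtype=float)).astype(int)
    tau[j_star] += t_b - tau.sum()
    return tau
```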

The proposed approach is outlined in Algorithm 2 for the master node (upper part) as well as the worker nodes (lower part). Steps 2–11 of the master node procedure initialize the algorithm portfolio and the relevant quantities. Steps 12–22 constitute the main batch cycle, which starts by running the algorithms on the worker nodes in Step 13. The corresponding procedure in Steps 1–5 of the worker nodes takes place at this point. More specifically, the algorithm receives the available budget (Step 2). Then, it runs until it exceeds its assigned budget, returning the best solution it has detected so far (Step 3). Finally, this solution is sent to the master node (Step 4). The master node receives the solutions (Step 14), and calculates the rewards (Step 15). After updating the overall best solution and the batch counter (Steps 16 and 17, respectively), the new reward estimates are calculated (Step 18), along with the new selection probabilities (Step 19). Eventually, the budget for the next batch is proportionally allocated (Step 20), and the overall best solution so far is communicated to all algorithms in the worker nodes (Step 21).

The algorithm portfolio continues its run until all batches are completed, i.e., the overall available computational budget is spent. Finally, we shall note that the same procedure can be applied also in serial computing environments, by simply running the algorithms in Step 13 of the master node, one after the other.

4 Experimental assessment

The experimental assessment of the proposed AP model against its constituent algorithms was based on extensive experimentation with different variants on the minimization of the Lennard–Jones potential function. In the following paragraphs, we firstly explain the configuration of the tested APs; secondly, we present the experimental setup; and, finally, we analyze the obtained results.

The considered APs consisted of the three state-of-the-art solvers as described in Sect. 2, which belong to the three main classes of numerical optimization solvers. Given the essential structural and operational differences among them, our choice adheres to the complementarity requirement discussed in Sect. 2.1.

Table 1 Parameter settings of all algorithms

4.1 Algorithm and problem configuration

The primary goal in our experimental assessment was to probe the performance of the proposed AP framework and compare it against its constituent algorithms. The main questions under investigation were:

  (Q1) Can the proposed algorithm portfolio outperform all its constituent algorithms?

  (Q2) What is the effect of different learning rate values \(\beta \) in the adaptive pursuit strategy?

  (Q3) What is the effect of changing a constituent algorithm in the portfolio?

  (Q4) What is the impact of a fixed number of batches against a random number of batches?

In our experiments, we considered the following algorithms and parametrizations, which are summarized in Table 1:

  (1) BFGS with (Armijo) backtracking line search as given in Sect. 2.3. The algorithm was considered with its common parameter \(\rho _1 = 10^{-4}\).

  (2) NM with its default parameters as given in Sect. 2.4.

  (3) PSO with its default parameters as given in Sect. 2.5. Regarding the swarm size and the neighborhood setting, we initially considered two different sizes, namely 50 and 100, each one combined with both the gbest and the lbest model. In all experiments, the combination gbest-50 was superior, while lbest-100 had the worst performance. These two versions were employed in our experiments, both individually as well as in the designed APs.

  (4) AP consisting of BFGS, NM, and PSO, with a fixed number of batches and the adaptive pursuit strategy for resource allocation. This is our proposed AP scheme, and it was considered with both the gbest and the lbest PSO version. Regarding the learning rate \(\beta \), three different values were considered, namely 0.2, 0.5, and 0.8.

  (5) AP consisting of BFGS, NM, and PSO, but with a randomized number of batches and learning rate. This fully randomized AP was considered in order to assess the effect of uninformative parameter setting, and it was considered with both the gbest and the lbest PSO version.

Regarding the LJ testbed, we considered the problems of \(\eta =20\), 30, 40, 50, 60, 70, and 80 atoms, which imply minimization problems of dimensions ranging from \(n=60\) (20 atoms) up to \(n=240\) (80 atoms). In all cases, the decision variables, i.e., the coordinates of all atoms, were ordered according to their indices in the decision vector \(x \in X \triangleq [-3,3]^n\). The total computational budget for each problem was proportional to the number of atoms:

$$\begin{aligned} t_{\max } = 3 \times \eta \times 50{,}000, \end{aligned}$$

which implies that 50,000 evaluations were considered per dimension component. Finally, when fixed in the APs, the number of batches was set equal to the number of atoms of the corresponding problem. Although there is no established method to optimally set the number of batches, the discussion in Sect. 3 regarding its effect on the algorithm dynamics motivated our decision. It is reasonable to consider that the problem’s dimension is relevant, especially if we take into consideration that the total budget in our experiments is a linear function of the dimension. Given that the dimension in LJ problems is defined as three times the number of atoms, it seems reasonable to assume one batch of function evaluations per atom (on average). This empirical rule has motivated our choice, while preprocessing (short runs) of the APs suggested it can be a viable choice in our experimental settings.
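
As a worked instance of these settings (for illustration only), consider the 40-atom problem with \(n = 120\):

$$\begin{aligned} t_{\max } = 3 \times 40 \times 50{,}000 = 6{,}000{,}000, \qquad b_{\max } = 40, \qquad t_b = \frac{t_{\max }}{b_{\max }} = 150{,}000, \end{aligned}$$

so each of the \(M=3\) solvers receives \(\tau _j^{(1)} = t_b/3 = 50{,}000\) function evaluations in the first (fair) batch.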

All algorithms were statistically analyzed and compared within a multiple-experiments framework. Each algorithm was run 25 times using random initial conditions. The best detected solution was recorded at each experiment. The algorithms were then compared using statistical hypothesis testing methods, including the Kruskal–Wallis, Friedman, and Wilcoxon rank-sum tests. The first test (Kruskal–Wallis) allows for comparisons among multiple samples in order to assess the null hypothesis that the samples come from the same distribution. It follows the same reasoning as 1-way ANOVA, extended to non-parametric testing (it is also referred to as 1-way ANOVA on ranks). Besides the p value that accepts or rejects the null hypothesis, the specific test offers a relative ranking of the samples (studied algorithms) according to their mean ranks. This property is very useful for building insight regarding the most successful algorithms. Using the Kruskal–Wallis test, we were able to identify that our designed portfolios have different performance among them, and to rank them accordingly.

The second test (Friedman) is the non-parametric counterpart of 2-way ANOVA. This test was used in order to assess the effect of using lbest PSO instead of gbest PSO in the designed APs. In contrast to the Kruskal–Wallis test, Friedman’s test allows for testing two factors in the samples. Finally, in order to test different learning rates among portfolios of the same type, we conducted Wilcoxon rank-sum tests that allow head-to-head comparisons between different APs (in pairs). The Wilcoxon rank-sum test assesses two samples regarding the null hypothesis that they come from distributions with equal medians, against the alternative of different medians. Thus, it provides information regarding the similarity of two samples, while further evaluation of their medians provides hints on the best one among them. However, it does not provide information on the dispersion of the samples, i.e., their variance differences. This information can be useful in the case of statistically equivalent samples (in terms of medians), which however exhibit quite different dispersion, implying different robustness profiles. In order to tackle this issue, we combined the Wilcoxon test with the Ansari–Bradley test, which tests for same medians but different dispersion (in fact, this test is a non-parametric alternative to the 2-sample F-test of equal variances).

Table 2 Average relative error from the optimal solution for all algorithms and problems
Table 3 Kruskal–Wallis ranks of all algorithms (at 0.05 level of significance)

4.2 Experimental results

The primary question under investigation (Q1 in Sect. 4.1) was the ability of the designed APs to outperform each one of their constituent algorithms. For this purpose, all approaches were run according to the experimental configuration presented in the previous section. At each experiment, the best solution value per algorithm was tracked, and the corresponding relative error from the globally optimal solution was used as the main performance measure.

Table 2 reports the average relative error per algorithm and problem, along with the overall average over all problems (last column). The results are clearly in favor of the proposed APs. In all cases, they achieved considerably lower relative errors (around \(5\%\)) than all individual algorithms. In three cases (20, 30, and 80 atoms), APs with the gbest PSO model achieved the smallest errors. In two cases (40 and 50 atoms), they were outperformed by APs that comprised lbest PSO, while in two other cases (60 and 70 atoms) the two AP types achieved equal relative errors. The overall best (smallest) average relative error was achieved by \(\hbox {AP}_{0.8}^{(g)}\). Interestingly, the randomized APs scored the two smallest errors for 60 and 70 atoms, competing with \(\hbox {AP}_{0.2}^{(l)}\) and \(\hbox {AP}_{0.2}^{(g)}\), respectively.

The superiority of the gbest-based APs is apparent in cases of lower dimension. Although this effect cannot be attributed solely to the PSO algorithm, it is widely known that the gbest model is greedier than the lbest one, achieving higher convergence speed at the risk of premature convergence, especially in high-dimensional problems. Taking a closer look at the results, we can observe that even the individual gradient-free algorithms achieved their worst error values for the cases of 40 to 60 atoms. The combination of high dimension and restricted computational budget apparently renders the specific problems quite challenging for the algorithms. These are exactly the cases where lbest-based APs outperformed the gbest-based ones. Thus, the impact of the PSO model on performance cannot be neglected.

Fig. 1 Average position of each algorithm in the ordering from best to worst according to their Kruskal–Wallis ranks at the 0.05 level of significance (smaller values are better)

In order to shed further light on the relative quality of the algorithms, we used the Kruskal–Wallis statistical test at the 0.05 level of significance to rank all algorithms according to their performance. Table 3 reports the mean ranks per algorithm and problem. The last two lines of the table report the average rank (AR) of the algorithms, as well as their average position (AO) in the ordering from the best (first) to the worst (last) algorithm, over all test problems. Figure 1 depicts the AO values ordered from best to worst, elucidating the relative quality of the algorithms. Undoubtedly, the observed results verify those for the average relative errors in Table 2. Moreover, they corroborate that using the proposed APs instead of their constituent algorithms is beneficial even under arbitrary (randomized) setting of their parameters (Q4 in Sect. 4.1).

The observed differences between APs containing the gbest against the lbest PSO model raise the question of how crucial this choice is (Q3 in Sect. 4.1). In order to probe this issue, we used the Friedman statistical test at the 0.05 level of significance to compare all gbest-based APs against the lbest-based APs. The Friedman test was preferred over Kruskal–Wallis because it accommodates changes in two factors in the data, although it tests for differences in only one of them. In our case, the two factors were the PSO model in the APs (the tested factor) and the learning rate \(\beta \) (not taken into account). The obtained p values of the Friedman test per problem are reported in Table 4. All tests reveal a lack of significance in using one PSO model against the other. This is a strong indication that the overall performance of the APs is the outcome of interactions among all their algorithms rather than a sole quality of the specific PSO model.

Table 4 Friedman test p values at 0.05 level of significance between APs containing gbest PSO against those containing lbest PSO
Table 5 Wilcoxon tests and Ansari–Bradley tests (in parenthesis) at 0.05 level of significance for the differences between APs of same PSO type but different learning rate \(\beta \) (“N”: no significance, “Y”: significance exists, “–”: not applicable)

Another point of interest is the effect of the learning rate \(\beta \) of the employed adaptive pursuit approach (Q2 in Sect. 4.1). In this case, we isolated APs of the same PSO type (gbest or lbest, respectively) but different values of \(\beta \), and tested them against each other. We used two statistical tests for these head-to-head comparisons. The first one was the Wilcoxon rank-sum test, which tests whether two samples come from distributions of equal medians or not. The second one was the Ansari–Bradley test, which tests whether two samples coming from distributions of the same median exhibit significant dispersion differences. The two tests together promote a more accurate interpretation of the observed performance differences.

Table 5 reports the results of the two tests for each pair of algorithms and test problem. Lack of significance in the Wilcoxon test is denoted with “N”, while “Y” denotes the opposite. The corresponding notation is used for the Ansari–Bradley test, given in the parentheses. The results verify that the learning rate \(\beta \) has a significant effect in only a few cases, especially in the variance of the results, with the lbest-based APs being more likely to be affected. Note that all significant cases under the Ansari–Bradley test were observed for 50 and 60 atoms, two problems that proved to be quite challenging as mentioned above, and mainly between \(\hbox {AP}_{0.8}^{(l)}\) (the worst version of the lbest-based AP) and its counterparts with smaller values of \(\beta \). Nevertheless, the choice of a sub-optimal learning rate evidently induces only mild disruption to the algorithm’s performance.

Fig. 2 Average percentage of computational budget assigned to each constituent algorithm of the AP

The \(\hbox {AP}_{0.8}^{(l)}\) version intrigued our interest regarding the reasons for its deficient performance, since \(\hbox {AP}_{0.8}^{(g)}\), i.e., its gbest counterpart with the same learning rate, appeared to be the best-performing AP as observed in Fig. 1. Suspecting that this effect can be partially attributed to the computational budget assignment, we analyzed the average percentage of the computational budget assigned to each constituent algorithm per AP and problem. The results are illustrated in Fig. 2. We can clearly observe that \(\hbox {AP}_{0.8}^{(l)}\) systematically assigns smaller fractions of the computational budget to PSO, while increasing that of BFGS. This is a direct consequence of using \(\beta = 0.8\), which rapidly increases the selection probabilities of the best-performing algorithm in the adaptive pursuit scheme. In our APs, this means that the best algorithm rapidly dominated the rest of the algorithms. Given that BFGS in \(\hbox {AP}_{0.8}^{(l)}\), as a local search algorithm, reaches local minima in the early stages of each run faster than the exploration-oriented lbest PSO model, it had the chance to be assigned remarkably high percentages of the computational budget. In turn, this resulted in overall reduced exploration properties of the AP and, hence, inferior performance. In the gbest counterpart, \(\hbox {AP}_{0.8}^{(g)}\), this effect was counterbalanced by the gbest PSO model, which also has intense exploitation properties and, hence, was capable of achieving good solutions, thereby retaining a much higher percentage of the computational budget than the rest of the algorithms. This observation emphasizes the influence of the PSO model in the portfolio.

Summarizing, the results offer sound evidence that the proposed APs can outperform their constituent algorithms. They achieved better relative error values, while exhibiting remarkable tolerance to parameter values. Even random selection of the parameters at each run offered satisfactory average performance. Moreover, it was shown that the PSO model was not a critical performance factor overall, although it evidently affected the performance from one problem to the other. Finally, we underlined the role of including PSO in the APs, as well as the marginal effect of the learning rate.

5 Conclusions

We proposed a resource allocation scheme for APs, based on adaptive pursuit procedures. The proposed approach was demonstrated through a parallel AP consisting of three state-of-the-art optimization algorithms, namely BFGS with backtracking line search, the Nelder–Mead nonlinear simplex method, and particle swarm optimization. The selected algorithms belong to the three essential categories of numerical optimization, namely gradient-based, direct search, and swarm intelligence algorithms, respectively. This is the first AP in the relevant literature that combines algorithms of all these categories.

Each algorithm occupied one processing unit (worker) in our master–worker parallel computation model. The computational budget, in terms of function evaluations, was allocated to the constituent algorithms in batches according to the estimated performance of the algorithms. The estimations were based on online assessment of the algorithms according to the quality of their solutions using the adaptive pursuit strategy.

The proposed AP scheme was tested against its constituent algorithms on the atomic configuration problem through the minimization of Lennard–Jones potentials. The problem is quite challenging, exhibiting a number of local minima that scales exponentially with the number of atoms. Different configurations of the proposed APs were studied and statistically analyzed. The results revealed that the proposed APs consistently outperform their constituent algorithms. Almost always, they achieved smaller error values, with only mild dependence on their parameters. Results using arbitrary (randomized) parameter values further verified the parameter tolerance of the proposed APs. Moreover, changing one of the constituent algorithms from one variant to another induced only mild disturbance in performance. In the same vein, only a marginal effect of the learning rate was observed.

The obtained results give grounds for optimism about the potential of the proposed APs. Rich ground is left for further developments, including improvements of the resource allocation scheme, large-scale experiments, and further combinations of different constituent algorithms.