1 Introduction

Search is a core concept of Artificial Intelligence and Pattern Recognition. In this field, meta-heuristic algorithms are critical as they do not require any prior on the fitness function, and allows to explore large search spaces. This ability is extensively used for model fitting and several important tasks in Artificial Intelligence. However, meta-heuristics are very dependent on their parameterization and often, require experts to determine which hyper-parametersFootnote 1 to modify and how should they be tuned, meaning that for a non-expert it might be hard to find good settings.

For meta-heuristics, hyper-parameters appear frequently and have a direct impact in the convergence speed and accuracy of the algorithms [7]. Also, hyper-parameters affect the execution time and memory cost of running the algorithm, the quality of the model resulting from the training process, or its ability to generalize to the unseen data. For these reasons, finding the best possible hyper-parameters becomes crucial [13].

Hyper-parameter Optimization (HPO), is a field of paramount importance in machine learning and optimization [5, 17]. HPO basically stands for using optimization algorithms in order to perform automatic hyper-parameter configuration. HPO is a difficult optimization problem. Until recently, HPO in several areas, such as machine learning, was considered a combination of science and art due to the high computational cost of such task [1].

There are two main approaches to HPO: manual or automatic methods. Manually approach assumes that there exists an understanding of how the hyper-parameters affect the algorithm. On the other hand, automatic approach greatly reduce the need for this understanding, but they come at the expense of the costlier computation.

In the latest years there has been an increase in the efforts to address automatic HPO with very good results in contrast with manually choosing the hyper-parameters [3, 14]. In the literature, among the most popular algorithms are Random Search (RS) [2], Sequential Model-based Algorithm Configuration (SMAC) [8] and Tree Parzen Estimators (TPE) [3]. HPO involves two nested cycles of optimization when configuring hyper-parameters for an arbitrary meta-heuristic algorithm. Here we refer to the regular optimization algorithm simply as base-algorithm. We call parent-algorithm to the optimization algorithm that deals with choosing the hyper-parameters of the base-algorithm.

When performing HPO, the base-algorithm may need to deal with thousands of dimensions and expensive fitness functions. For example, training a neural network is an NP-hard optimization problem [12] with tens of thousands of dimensions. Setting the hyper-parameters for such training algorithms is a challenge for practitioners because standard recommendations in the literature are most of the time useless. Trial and error is very tedious, is not reproducible for others and is prone to produce over-fitting. Additionally, insights on the hyper-surface structure usually are not available due to the large amount of dimensions, hence common sense may not be applicable.

When sufficient computational resources are provided, practitioners may choose to use a parent-algorithm (such as RS, SMAC or TPE) to perform HPO. There are several examples of successful applications of such parent-algorithms [3]. However, for the engineering mind this approach faces to major drawbacks:

  • the practitioner needs to know the specific details of two (probably different) optimization algorithms and

  • parent-algorithms also have hyper-parameters that need to be configured, manually most of the time.

For meta-heuristics, that do not impose restrictions (such as derivatives or convexity) on the fitness function, tuning its own hyper-parameters may be a good approach to perform HPO. In this case, the parent-algorithm hyper-parameters also need to be somehow configured. However, the small number of hyper-parameters, usually less than ten, makes standard recommendations in the literature more suitable. Even if standard recommendations do not produce good results, the practitioner may use its experience about the meta-heuristic and three-dimensional insights to come up with a working set of hyper-parameters for the parent-algorithm.

In this paper, we deal with the problem of automated HPO of Particle Swarm Optimization (PSO) meta-heuristic algorithm by using the same meta-heuristic as parent-algorithm. We tested this approach with several artificial benchmark functions for optimization. Our meta-heuristic HPO of meta-heuristics approach is able to find hyper-parameters that leads to more stable and accurate optimization of the base-algorithm when compared with RS, SMAC and TPE as parent-algorithms. The main limitation of this work is that we only consider continuous hyper-parameters.

In the Sect. 2 we discuss the state of the art on hyper-parameter optimization algorithms. In Sect. 3 we describe PSO algorithm in detail and in Sect. 4 we present the benchmark test functions to be optimized by PSO. Section 5 presents the experimental results when comparing the stability and accuracy of our HPO approach. Finally, some conclusions and recommendations are given. Throughout this article we use x for scalars, \(\varvec{w}\) for vectors and X for sets.

2 Automatic Hyper-parameter Optimization

There are at least two kind of methods to perform automatic HPO: model-free and model-based methods. Model-based techniques build a surrogate model of the hyper-parameter space through its careful exploration and exploitation. Alternatively, model-free algorithms do not utilize the knowledge about the solution space extracted during optimization.

2.1 Model-Free Methods

Model-free approaches are characterized by not utilizing the knowledge about the solution space extracted during the optimization. This lack of adaptability makes them simple to implement at the expense of poor results for large hyper-parameter spaces.

RS: is a trivial to implement alternative to Grid Search [2], more convenient to use and faster to converge to an acceptable set of hyper-parameters. There are approaches designed to enhance RS capabilities, for example, the random sampling can be intensified in the neighborhood of the best hyper-parameters. Finally, there are hybrid algorithms to couple RS with other techniques for refining its performance,for example, manual updates provided by an expert [2, 9].

2.2 Model-Based Methods

Model-based methods build a model of the fitness function, and then elaborate hyper-parameter values by performing optimization within this model. Most of such techniques use a Bayesian regression model turning this problem into a trade-off between the exploration and exploitation.

SMAC is a model-based method for optimizing algorithm hyper-parameters [8]. SMAC is effective for HPO of machine learning algorithms, scaling better to high dimensions and discrete input dimensions than other algorithms [11].

TPE method sequentially construct models to approximate the performance of hyper-parameters based on historical measurements, and then subsequently choose new hyper-parameters to test based on this model. This optimization approach is described in detail in [3].

3 Particle Swarm Optimization

PSO is a population based meta-heuristic algorithm [4]. PSO has several hyper-parameters such as: global contribution (\(\alpha \)), local contribution (\(\beta \)), maximum speed (\(v_{max}\)) and minimum speed (\(v_{min}\)). Let q be the number of particles in the population, a particle’s position in the search space and \(f(\varvec{w}_i)\) the fitness value of the particle.

The speed of a particle in the population is given by Eq. 1 where is the position of the particle with the lowest fitness function value in the population and is the position where \(\varvec{w}_i\) achieved its lowest fitness function value. Particle \(\varvec{w}_*\) is known as best-global particle and \(\varvec{w}^*_{i}\) as best-so-far particle of \(\varvec{w}_i\).

$$\begin{aligned} \varvec{v}_i \leftarrow \gamma \varvec{v}_i + \alpha \varvec{\omega }_1 \odot \varvec{w}_* + \beta \varvec{\omega }_2 \odot \varvec{w}_{i*} \end{aligned}$$
(1)

The parameter \(\gamma \) is known as speed parameter such as \(v_{min} \le \gamma \le v_{max}\). The vectors \(\varvec{\omega }_1\) and \(\varvec{\omega }_2\) are randomly generated by means of an uniform distribution such as \(\omega _{1, i}\sim U(0,1)\) and \(\omega _{2, i}\sim U(0,1)\). Finally, a particle’s position is updated according to Eq. 2:

$$\begin{aligned} \varvec{w}_i \leftarrow \varvec{w}_i + \varvec{v}_i \end{aligned}$$
(2)

Due to the diversity of PSO algorithms in the literature, we provide further details on the implementation used in this work as described in Algorithm 1. In step 1 the population is randomly generated. After that, in steps 2 and 3 the best-so-far positions of each particle and the best-global position of the population are set. The algorithm is considered in a convergence state after reaching a given number of fitness function evaluations determined by the value of \(\eta \) in step 5.

figure a

For each iteration of PSO algorithm, the best-so-far and best particles are updated in steps 10 and 13. Then, for each particle in the population a new position is calculated in steps 7–8 and the speed parameter is linearly decreased in step 17. Finally, the best particle of the population is returned in step 20.

PSO have at least four continuous hyper-parameters: local contribution, global contribution, maximum speed and minimum speed. In addition, PSO also have several other hyper-parameters of non-continuous nature, e.g. the number of particles in the population, the number of evaluations of the fitness function to archive convergence, the lower and upper bounds of the search space, etc. Here we will only deal with the four continuous hyper-parameters while tuning the others manually. Let be a vector that contains the hyper-parameters of PSO algorithm and \(\bar{f}(\bar{\varvec{w}}_i)\) the quality of a given set of hyper-parameters, the hyper-parameter configuration problem can be defined as:

(3)

In Eq. 3, \(\bar{f}(\bar{\varvec{w}}_i) = f(\varvec{w}_*)\) where \(\varvec{w}_*\) is the result of optimizing \(f(\varvec{w}_j)\) by means of PSO with hyper-parameters \(\bar{\varvec{w}}_i\). Notice here the two nested cycles of optimization: in the outer cycle we try to minimize \(\bar{f}(\bar{\varvec{w}}_i)\) while in the inner cycle we try to minimize \(f(\varvec{w})\). While the outer cycle may be solved with GS, TPE or SMAC, here both optimization cycles are solved by PSO.

Several standard recommendations for PSO hyper-parameters are available in the literature. Here we will consider such recommendations for hyper-parameter configuration in order to compare such recommendation with automatic HPO. Some of the PSO recommendations come in the form of a general heuristic, for example, \(\alpha + \beta \le 4\) [16]. Here we will consider the following more specific recommendations:

  • From [16], we denote STD.R.1 to the hyper-parameter set: \(\alpha = 0.5, \beta = 0.5, v_{min} = 0, v_{max} = 1.2\).

  • From [6], we denote STD.R.2 to the hyper-parameter set:, \(\alpha = 2, \beta = 2, v_{min} = 0.9, v_{max} = 0.9\).

4 Optimization Benchmark Problems

In this section we describe a set of optimization problems to be solved by base-algorithms. In the literature, several benchmark test functions for optimization have been suggested. We considered five properties regarding to the test functions: continuous/discontinuous, differentiable/non-differentiable, separable/non-separable, scalable/non-scalable and uni-modal/multi-modal [10, 15]. The following list enumerate benchmark test function used in this work and its properties, in all cases consider that .

  1. 1.

    De Jung function:

    $$\begin{aligned} f_1(\varvec{w}) = \sum _{i=0}^{n}{{\varvec{w}_i}^2} \end{aligned}$$
    (4)

    has a global minima in \(f_1(0, ..., 0) = 0\). De jung is continuous, differentiable, separable, scalable and multi-modal. A commonly used search domain is \(0 \le \varvec{w}_i \le 10\).

  2. 2.

    Ackley function:

    $$\begin{aligned} f_2(\varvec{w}) = -20e^{0.02\sqrt{n^-1\sum _{i=0}^n{\varvec{w}_i^2}}} - e^{n^{-1}\sum _{i=0}^n{\cos (2\pi \varvec{w}_i)}} + 20 + e \end{aligned}$$
    (5)

    has a global minima in \(f_2(0, ..., 0) = 0\). Ackley is continuous, differentiable, non-separable, scalable and multi-modal. A commonly used search domain is \(-35 \le \varvec{w}_i \le 35\).

  3. 3.

    Grienwangk function:

    $$\begin{aligned} f_3(\varvec{w}) = \sum _{i=0}^n{\varvec{w}_i^2 / 4000} - \prod _{i=0}^n{\cos (\varvec{w}_i / \sqrt{i})} + 1 \end{aligned}$$
    (6)

    has a global minima in \(f_3(0, ..., 0) = 0\). Griendwangk is continuous, differentiable, non-separable, scalable and multi-modal. A commonly used search domain is \(-100 \le \varvec{w}_i \le 100\).

  4. 4.

    Rastrigin function:

    $$\begin{aligned} f_4(\varvec{w}) = 10n + \sum _{i=0}^n{\varvec{w}_i^2-10\cos (2\pi \varvec{w}_i)} \end{aligned}$$
    (7)

    has a global minima in \(f_4(0, ..., 0) = 0\). Rastrigin is continuous, differentiable, separable, scalable and multi-modal. A commonly used search domain is \(-5.12 \le \varvec{w}_i \le 5.12\).

  5. 5.

    Rosenbrock function:

    $$\begin{aligned} f_5(\varvec{w}) = \sum _{i=0}^{n - 1}[100(\varvec{w}_{i+1} - \varvec{w}_i^2)^2 + (\varvec{w}_i - 1)^2] \end{aligned}$$
    (8)

    has a global minima in \(f_5(1, ..., 1) = 0\). Rosenbrock is continuous, differentiable, non-separable, non-scalable and multi-modal. A commonly used search domain is \(-30 \le \varvec{w}_i \le 30\).

  6. 6.

    Schwefel function:

    $$\begin{aligned} f_6(\varvec{w}) = -1/n\sum _{i=0}^n{\varvec{w}_i\sin {\sqrt{|\varvec{w}_i|}}} \end{aligned}$$
    (9)

    has a global minima in \(f_6() = -418.983\). Schwefel is continuous, differentiable, separable, scalable and multi-modal. A commonly used search domain is \(-500 \le \varvec{w}_i \le 500\).

  7. 7.

    Styblinski-Tang function:

    $$\begin{aligned} f_7(\varvec{w}) = 0.5* \sum _{i=0}^n{\varvec{w}_i^4 + 16\varvec{w}_i^2 + 5\varvec{w}_i} \end{aligned}$$
    (10)

    has a global minima in \(f_7(-2.903534, ..., -2.903534) = -78.332\). Styblinski-Tang is continuous, differentiable, non-separable, non-scalable and multi-modal. A commonly used search domain is \(-5 \le \varvec{w}_i \le 5\).

  8. 8.

    Step function:

    $$\begin{aligned} f_8(\varvec{w}) = \sum _{i=0}^n{(\lfloor \varvec{w}_i + 0.5\rfloor )^2} \end{aligned}$$
    (11)

    has a global minima in \(f_8(0, ..., 0) = 0\). Step is discontinuous, non-differentiable, separable, scalable and uni-modal. A commonly used search domain is \(-100 \le \varvec{w}_i \le 100\).

  9. 9.

    Alpine function:

    $$\begin{aligned} f_9(\varvec{w}) = \sum _{i=0}^n{|\varvec{w}_i\sin (\varvec{w}_i) + 0.1\varvec{w}_i|} \end{aligned}$$
    (12)

    has a global minima in \(f_9(0, ..., 0) = 0\). Alpine is continuous, non-differentiable, separable, scalable and uni-modal. A commonly used search domain is \(10 \le \varvec{w}_i \le 10\).

5 Results and Discussion

This section describes the experimental setup and results of the proposed PSO continuous hyper-parameter optimization approach. We compare quality of PSO hyper-parameters from configured with RS, SMAC, TPE and PSO it self. Moreover, we present a comparison of automatic hyper-parameter optimization with standard recommendation in the literature: STD.R.1 and STD.R.2.

Base PSO was configured with a population size of 40 particles and a search domain based on the literature recommendation (see Sect. 4). For PSO as base-algorithm, we consider convergence when the number of fitness function evaluations is above 4.0E+4. For parent-algorithms (RS, SMAC, TPE and PSO), we consider convergence when number of fitness function evaluations is above 100. For PSO as parent-algorithm the population size is of 10 particles and the search domain is defined in the interval [0, 2]. Hyper-parameters for PSO as parent-algorithm are configured according to STD.R.1.

Fig. 1.
figure 1

Accuracy of PSO when considering different parent-algorithms for HPO.

Figure 1 shows the accuracy of PSO when considering different benchmark optimization problems after performing hyper-parameter configuration by means of different parent-algorithms. The bottom and top lines in the box plots represents the first and third quartiles, the line in the middle represents the median of 10 measurements and the whiskers represent standard deviation. The small square represents the average of the measurements and the (x) marks represent the maximum and minimum value.

In Table 1 we a summary of the accuracy of PSO when considering different benchmark optimization problems. The presented accuracy values are the averages of 10 measurements. In addition, we show the global best for each optimization problem.

Table 1. Accuracy of PSO algorithm with hyper-parameters optimized by different parent-algorithms.

We use the non-parametric Friedman test of differences among repeated measures to find statistical significant differences between parent-algorithms SMAC, TPE, RS, PSO and standard recommendations STD.R.1 and STD.R.2. The results of this test rendered a Chi-square value of 10.54 which was significant for a p-value \( < 0.01\). This way we reject the null hypothesis, hence, the means of the results of two or more algorithms are not the same.

Furthermore, we conduct a Benferroni-Dunn test to compare the control parent-algorithm (PSO) with the other alternatives for hyper-parameter configuration. This way, we reject the null hypothesis of similar means for PSO vs. STD.R.1 (Chi-square value of 2.83 which was significant for a p-value \(= 0.022\)) and for PSO vs. STD.R.2 (Chi-square value of 4.09 which was significant for a p-value \( < 0.01\)). For PSO vs. SMAC, PSO vs. TPE and PSO vs. RS we accept the null hypothesis which means that in such cases the mean of the results of the control method against each other groups is equal.

As can be seen, for each optimization problem, automatic HPO always outperforms standard recommendation in the literature with statistical significant differences. In addition, PSO obtains as average better results than RS, SMAC and TPE although we don’t see statistical significant differences. Also, the computational cost (execution time) of the parent-algorithms are very similar.

6 Conclusions and Recommendations

In this paper, we compare different strategies to perform HPO of PSO. We observe that automatic HPO provides better results than standard recommendation in the literature. Moreover, optimizing PSO hyper-parameters with PSO it self proved to be a competitive approach when compared with popular HPO algorithms such as RS, SMAC or TPE. This approach has the advantage that the practitioner do not need to consider several optimization algorithms in order to find proper hyper-parameters. However, the main limitation of this work is that only continuous hyper-parameters can be configured with this approach. Future research should investigate hybrid meta-heuristic approaches that allow to consider not only continuous hyper-parameters.