1 Introduction

Machine Learning (ML) and optimisation are two different fundamental research fields in computer science [21, 22]. Due to the rapid progress in the performance of computing and communication techniques, those two research areas have drawn widespread attention in a wide variety of applications and grown rapidly [66, 145]. Although both fields belong to different communities, they are fundamentally based on artificial intelligence and the techniques from ML and optimisation interact frequently with each other as well as themselves in order to improve their learning and/or search capabilities.

ML and data mining are disciplines that focus on the question of how to build computer programs that automatically improve from experience [113]. A typical ML task aims to extract the hidden patterns within data based on an algorithm, which is usually designed by an expert and comprises a set of adjustable hyper-parameters. Thanks to the rapid advance of technology and communication, these algorithms have been successfully applied to many real-world problems ranging from robotics [169], natural language processing [187], computer vision [67] to recommendation systems [37].

Optimisation targets the modelling, design and implementation of solution techniques for finding the best (optimal) entity from a set of candidate alternatives for a given computational search problem [22]. An optimisation process typically starts from a single or a set (population) of initial solutions and moves towards the best solution step by step guided by an objective function. Many computationally hard real-world problems have been tackled by various optimisation techniques, such as travelling salesman problems [108], scheduling problems [150], vehicle routing problems [139], timetabling problems [160], etc.

ML and optimisation have been extensively studied and achieved extraordinary progress in their respective fields; however, they both still suffer from several limitations, such as, choosing the best set of parameter values and algorithm components based on human experience for improved performance, solving the problems requiring expensive computation (e.g., evaluation of each case/solution taking a long time), developed solution becoming tailored only to the specific cases handled and not generalising well to the unseen cases, and more. Figure 1 represents different interactions between ML and optimisation that have been proposed to address those issues. As presented in Fig. 1, ML and optimisation have been interacted with themselves to improve their respective performance (interactions 3 and 4). Thus, we define self-interaction as the use of the techniques in a domain to assist the techniques in the same domain. In addition, ML and optimisation have been enhanced by each other (interactions 1 and 2). We define dual interaction as the application of the techniques from two domains to improve each other.

Fig. 1
figure 1

The interactions between ML and optimisation

As represented in interaction 1, optimisation techniques have been naturally incorporated into ML as a powerful tool for decades [79, 146, 189]. Many ML and data mining tasks can be formulated as optimisation problems, such as the training of an ML model by building a loss function and minimising it with respect to model parameters [146], or to optimise the preprocessing of a dataset [189], and also for clustering purposes [79]. Besides, optimisation techniques have also been introduced into ML to tackle the design issues of ML algorithms. At present, many ML algorithms are still designed by experts that manually choose suitable algorithm components and hyper-parameters in order to achieve good performance. This is not coincident with the real aim of artificial intelligence, which is to learn knowledge from experience automatically without human intervention. This has motivated the use of optimisation techniques to tune hyper-parameters of an ML algorithm at a different level [58]. From another point of view, researchers in ML community are addressing the issue by introducing another ML algorithm to extract meta-knowledge that can be further used to guide ML tasks in the base level, which is shown in interaction 3. These approaches are typically called meta-learning [142] or learning-to-learn [170]. Furthermore, interaction 3 can also be found in areas such as algorithm selection [25, 141], transfer learning [130] or model training [7, 102].

Similarly to ML, many optimisation algorithms also often rely on experts to fix the parameter settings that provide good performance, although there are limited theoretical studies on the parameter setting of some particular algorithms. To choose more reliable parameter values and make the optimisation process more automated, as shown in interaction 4, optimisation techniques have been introduced to improve a base optimisation algorithm, such as using a meta-optimisation algorithm at the high level to tune the parameters of a base algorithm [49] or embedding these parameters into a solution representation and adaptively altering them together with the solution [128]. From another perspective, ML techniques have also been introduced to tackle the same issues [11, 195], which belongs to interaction 2. A typical optimisation process generates plenty of data including solution states, associated objective values, etc. The richness of the information within the data has motivated the use of ML techniques to extract hidden knowledge and further use it to configure the parameter settings or algorithmic components [11, 195] for improving the optimisation process and the performance of the overall algorithm. In addition, the learned knowledge has also been used to assist in solving unseen optimisation problems in order to accelerate the optimisation process [80, 100] and improve the quality of solutions [9, 164].

In the past few years, a number of review papers have been published about the dual interactions [18, 33, 39], individual interactions [5, 24, 84, 97, 177] and the interactions that focused on a specific ML or optimisation algorithm [16, 56, 58, 79, 117, 118, 125, 189, 193, 197] between ML and optimisation. Differently, our study provides a global overview of the interaction between ML and optimisation as well as those within themselves as shown in Fig. 1. Since the whole research field is very extensive, in this study, we mainly focus on the first three interactions where ML sits at the core. However, we also provide a brief overview of the interaction 4 to distinguish between different research areas. We clarify the same terminologies having different meanings used in both fields. We also investigate representative approaches and recent advances, presenting a taxonomy for each interaction. Besides, we analyse the advantages and disadvantages of various approaches from different interactions tackling a common task, such as the use of ML and optimisation techniques to tune hyper-parameters for ML algorithms. Furthermore, identified research gaps and potential research directions are presented for future work.

The remainder of this paper is organised as follows. Section 2 provides some background knowledge of ML and optimisation. Section 3 reviews the approaches that use optimisation to improve ML. Section 4 investigates the approaches that use ML to enhance optimisation. A survey of recent progress in meta-learning is presented in Sect. 5. Section 6 presents a brief overview of the approaches that use optimisation to improve optimisation. Finally, some research gaps and potential research directions are discussed in Sect. 7.

2 Background

Since researchers and practitioners in ML and optimisation mainly focus on their own separate areas, this section makes a brief introduction to those fields as a background. First, Sects. 2.1 and 2.2 provide basic knowledge on ML and optimisation, respectively. Then, Sect. 2.3 distinguishes the terminologies used in both fields.

2.1 Machine learning

ML, which aims to extract knowledge from data, is one of today’s hottest research topics in computer science. According to [21], ML tasks can be generally classified into three categories:

  • Supervised learning The values of input variables and their corresponding values of output variables are known. The learning process is to automatically find some regularities between input variables and output variables. Depending on different learning tasks, supervised learning can be further categorised into classification and regression. Classical supervised learning algorithms include logistic regression, neural networks (NNs), support vector machines (SVMs), decision trees (DTs), k-nearest neighbours (k-NN), etc.

  • Unsupervised learning The values of input variables are known while the values of output variables are unknown. The learning task is to find some hidden patterns within the data based on input variables. An example of unsupervised learning is clustering, which learns the distribution of data to gather the samples into different groups. Representative unsupervised learning algorithms are k-means, autoencoders, generative adversarial networks, etc.

  • Reinforcement learning (RL) aims at choosing the most suitable action at a specific state in an environment to maximise the cumulative reward [167]. Classic RL algorithms include Q-Learning, Monte Carlo RL, SARSA, etc.

Although ML has been dramatically developed, it is still a young field with a number of unexplored research opportunities and challenging problems. A few ML tasks target at learning knowledge in extreme scenarios, such as imbalanced data classification [75], big data mining [186], data stream mining [60, 161] and few-shot learning [98]. Imbalanced data exhibits an unequal distribution between its classes [75], on which standard ML classifiers cannot perform well because they would pay more attention to majority classes to achieve better overall performance. Big data mining concerns large-volume, complex, growing data sets with multiple, autonomous sources [186]. Data stream mining focuses on learning from streaming data containing concept drifts, which need an efficient learner with continuous learning ability [60]. Few-shot learning works towards learning knowledge quickly from very few examples, while conventional ML algorithms need a lot of data for training and they may suffer from overfitting based on such little data [155]. In addition, most existing ML algorithms are still designed by experts that select suitable hyper-parameters or algorithm components for a specific ML task. However, the real artificial intelligence aims to learn knowledge without human interventions. Therefore how to automate the design of ML algorithms is another challenging problem in ML field. These challenging problems require more sophisticated techniques like meta-optimisation or meta-learning to quickly adapt an ML algorithm without human intervention.

2.2 Optimisation

Optimisation focuses on finding the best (optimal) solution from a group of alternative candidate solutions, which has the general form [22]:

$$\begin{aligned} \begin{aligned}&\mathrm{minimise}\; f_{0}(x)\\&\mathrm{subject}\, \; \mathrm{to}\; f_{i}(x)\le b_{i},\; i=1,\ldots ,m \end{aligned} \end{aligned}$$
(1)

Here, x represents the decision variables of the optimisation problem and optimisation algorithms search for the x values that optimise (minimise) \( f_{0} \), which is the objective function, subject to a set of constraints denoted as \( f_{i} \).

There are many orthogonal categories of optimisation, such as exact versus inexact methods, single-objective versus multi-objective optimisation, unconstrained versus constrained optimisation, deterministic versus stochastic optimisation, and convex versus non-convex optimisation. More generally, it can be classified into continuous versus discrete optimisation according to the type of decision variables. The common methods for continuous optimisation consist of simplex algorithm, gradient-based methods, such as gradient descent method and its variants, Newton’s method and its variants, conjugate gradient method, interior-point methods [6], gradient-free methods such as Bayesian optimisation, heuristics and metaheuristics. These gradient-free methods are also commonly used to solve discrete optimisation problems. Besides, common methods for discrete optimisation also include approximation algorithms, mathematical programming, dynamic programming and hyper-heuristics. To make the content in the following sections more understandable, we briefly introduce some optimisation algorithms here.

Gradient-based methods search among the solution space according to the gradient of a differentiable objective function [146]. Bayesian optimisation is a sequential design strategy for global optimisation of black-box functions that does not require derivatives [114]. A heuristic is an intuitive approach that may quickly lead to a near-optimal solution, derived based on the problem domain knowledge. A metaheuristic represents “a high-level problem independent algorithmic framework that provides a set of guidelines or strategies to develop heuristic optimisation algorithms” [158]. Typical methods include single point based search methods, such as tabu search [68] and iterated local search [106] and multi-point (population)-based search methods, including evolutionary algorithms (EAs) [13], ant colony optimisation algorithm [50], particle swarm optimisation (PSO) [88]. Hyper-heuristics focus on automating the design of heuristic methods to solve hard computational search problems [28]. Hyper-heuristics consist of two separate layers. At the high level, the algorithm automatically selects or generates a number of low-level heuristics performing search over the space of heuristics, while at the low-level where the domain specific components sit, the selected or generated heuristics are invoked performing search over the space of solutions. Selection hyper-heuristics contain two key components: heuristic selection and move acceptance. The improvement is achieved based on often a single point based search which makes iterative moves from a single solution to another via low level operators/heuristics. On the other hand, generation hyper-heuristics build new heuristics based on the predefined components for a given problem operating in a train and test fashion.

Table 1 Clarification of the most common terminologies used in both ML and optimisation

Although optimisation has been widely researched, it still suffers from a few weaknesses, such as high computational cost, getting trapped at local optima and choosing suitable parameters for different problems or even instances. These issues have been investigated for years and still not completely solved. In addition, nowadays, real-world optimisation problems are becoming more and more complex with increasing number of decision variables as well as constraints. Efficient and effective advanced intelligent optimisation algorithms are needed more than ever.

2.3 Terminology

Some common terminologies are both used in ML and optimisation fields, while they may represent different concepts. To avoid confusion, we clarify those common key terminologies in Table 1, providing some examples. Note that the number of distinctive terminologies in ML and optimisation is so high that we limit this section to the most common terminologies that are used in both ML and optimisation.

3 Optimisation for machine learning

A typical ML approach usually starts with collecting raw data. Then, data preparation and preprocessing are conducted to shape and clean the data and extract relevant features [53]. After that, a suitable ML algorithm is chosen and a set of hyper-parameters are set by an expert. Then, a modelFootnote 1 that captures the hidden patterns within the data is built by adjusting the model parameters according to a loss function. Finally, the learned model can be used to do prediction tasks. Optimisation plays a crucial role in many stages of the ML cycle. For example, at the stage of data preprocessing, optimisation algorithms can help to search for relevant features and examples for the subsequent learning process [171, 189]. Besides, in the phase of hyper-parameter tuning, instead of setting these hyper-parameters by ML experts, optimisation algorithms can be used to find suitable hyper-parameters in order to make the learning process more automatic. In addition, in the training phase, model parameters learning can be formulated as an optimisation problem by building up a loss function and minimising it. To automate a whole ML cycle, a more general research field has been formed these years known as AutoML, which covered a few above mentioned research problems, such as data preprocessing, algorithm selection and hyper-parameter tuning, etc. A recent survey presented an overview of automated ML, in which a number of approaches that used optimisation algorithms to automate ML tasks, such as feature selection or generation, ML algorithm selection and optimisation algorithm selection for model training, were investigated [192]. A recent book provided a tutorial-level overview of the methods underlying automatic ML [81]. These works also presented some advanced ML techniques for automated ML, such as meta-learning, which we discuss in Sect. 5. Besides, a few competitions on AutoML have also been organised in recent years [70, 71]. Other than the stages within a common ML cycle, optimisation has also been used to improve some specific ML areas, such as clustering, ensemble learning.

This section delves into the use of optimisation for ML in five different subsections depending on where optimisation is introduced in a typical ML approach as shown in Fig. 2.

Fig. 2
figure 2

Interaction 1: optimisation for machine learning

3.1 Data preprocessing

Data preprocessing techniques aim to transform raw data into a reasonable shape for the subsequent learning, which includes data cleaning, dimensionality reduction, instance reduction, discretisation, data balancing, etc. Some of those strategies can be formulated as an optimisation problem. Dimensionality reduction focuses on finding a subset of features from original feature space or extract new features, which are informative, non-redundant and can facilitate the following learning process [72]. Similarly, instance reduction selects a subset of instances from original instance space or generate new instances [26]. Discretisation looks for the best set of cut points to transform numerical or continuous attributes into discrete ones, which can facilitate the subsequent learning and help researchers understand data easily [64]. Data balancing usually oversamples and/or downsamples a new set of examples when the data suffers from the class imbalanced problem [63]. These problems can be transformed into optimisation problems, in which an optimal set of features, instances or intervals are searched in a finite feature, instance or interval space. These optimisation problems have been tackled by means of heuristics [133], metaheuristics [63, 137, 163, 171, 189] and hyper-heuristics [115]. Compared to heuristic-based data preprocessing, metaheuristic-based methods have better global search ability by balancing exploration and exploitation. Hyper-heuristic-based approaches provide a more general framework, which automatically chooses suitable heuristics to select features or instances during the search. From the attribute perspective, we can find heuristic optimisation methods [133], metaheuristic methods [189] and also hyper-heuristic optimisation methods [115]. A survey on state-of-the-art evolutionary computation approaches for feature selection was presented in [189], which highlighted the use of genetic algorithm (GA) [77], Genetic Programming, PSO and ant colony optimisation and different ways of representing a solution, search mechanism and performance measure. Then, from the instance perspective, most optimisation techniques are based on EAs [171], while hyper-heuristics have not been introduced yet, which might be of use similarly to feature selection [115] and improve the generalisation ability of instance selection algorithms. As for discretisation, EAs have been frequently used to search the best set of cut points for each attribute, in which binary encoding was usually utilised to determine whether the predefined cut points were adopted [137, 163]. Data balancing can be seen as an instance selection problem, so that, it has been tackled by EAs similarly, in which EAs were used to search for a new set of examples from the imbalanced data by downsampling or oversampling with the purpose of balancing data and achieving good learning performance [63].

3.2 Algorithm selection in ML

Algorithm selection is an important task for solving ML problems, because different ML algorithms may have different performance on a specific problem. For example, SVMs may broadly perform better than NNs when the number of training examples is small. Since there are a number of ML algorithms, choosing a suitable one for a specific ML task has been a difficulty for ML researchers, especially for ML beginners. To address the issue, some approaches introduce optimisation techniques to search for a suitable ML algorithm from a pool of those for a specific ML problem. Several representative methods have been further developed into pieces of software for automating the design of ML algorithms, such as Auto-WEKA [168], Auto-WEKA 2.0 [93], auto-sklearn [54] and TPOT [126]. Some of these approaches used Bayesian optimisation techniques [156] to search for suitable ML algorithms and corresponding hyper-parameters as well [54, 93, 168]. The method in [126] transformed an ML approach into tree-based pipelines including features, algorithms and hyper-parameters, and utilised Genetic Programming [94] to optimise these pipelines.

3.3 Hyper-parameter tuning

Hyper-parameters are adjustable parameters that are related to the property of an ML algorithm, such as the number of clusters in a k-means clustering, the number of hidden layers in NNs and the kernel function for SVMs. These hyper-parameters have a great influence on the performance of the subsequent model training. The hyper-parameter tuning for ML algorithms mainly relies on the knowledge of ML experts. To make the learning process more automatic, optimisation algorithms can be introduced to search for promising hyper-parameters without human intervention. Hyper-parameter tuning is a kind of black-box optimisation, in which the objective function is unknown and usually approximated by the performance on the validation sets. This problem has been tackled by grid search, random search, metaheuristics, Bayesian optimisation and hyper-heuristics, and more. Depending on their generalisation ability, those approaches can be categorised into hyper-parameter tuning for a specific dataset and a type of datasets.

3.3.1 Hyper-parameter tuning for a specific dataset

A typical ML task learns knowledge from a specific dataset, which is usually divided into training set, validation set and testing set. The training set is used for extracting knowledge, the validation set serves to tune hyper-parameters and prevent overfitting, and the testing set provides an evaluation of the learned knowledge. To achieve good generalisation performance on a dataset, hyper-parameters are often tuned by experts according to the performance of the model learned from training set on the validation set. However, to remove human intervention, a few optimisation or search techniques have been introduced to do that task. Grid search and random search are two strategies that are commonly used [19]. In addition, metaheuristics and Bayesian optimisation, which are two effective global optimisation methods for black-box functions, are also used for hyper-parameter tuning in the ML field [59, 156].

  • Grid search and random search Grid search and random search are the two most popular hyper-parameter tuning algorithms because they are easy to implement and parallelisation is trivial. Grid search performs exhaustive search in the hyper-parameter space in order to find the optimal set of hyper-parameters, while this process can be time consuming and even virtually impossible in a very high dimensional space. To address the issue, a more efficient search method that explores the hyper-parameter space randomly was proposed, which is called random search. This approach is able to achieve promising hyper-parameters with less computation time when compared to grid search [19].

  • Metaheuristic-based search Rather than simply search exhaustively or randomly, metaheuristic-based approaches perform search according to the generalisation performance, which may be beneficial for quick convergence. Such search usually starts from a single or a set of random hyper-parameters. They are altered using some operators, such as mutation or crossover, and improved by selection mechanism according to their generalisation performance on the validation datasets. For example, the approach in [59] encoded the kernel parameter and regularisation parameter of a SVM into a candidate solution (chromosome) of an EA, and a generalisation performance measure was chosen as the fitness function. Other than SVM, several previous studies provided an overview of the approaches that used metaheuristics to tune the hyper-parameters of NNs [56, 125, 193].

  • Bayesian optimisation From the point of view of probability, Bayesian optimisation based method for hyper-parameter tuning models a learning algorithm’s generalisation performance as a sample from a Gaussian process and evaluates promising hyper-parameters based on the model [156]. In most cases, these approaches are better than previous methods because they use cheaper models to exploit uncertainty to balance exploration against exploitation, which can obtain better results in fewer evaluations. However, they suffer from selecting hyper-parameters of Gaussian processes. In [156], the authors identified good practices for Bayesian optimisation of ML algorithms.

3.3.2 Hyper-parameter tuning for a type of datasets

Some ML tasks learn knowledge from different but related datasets that include similar structural and statistical characteristics and belong to the same domain, such as microarray gene expression data [17]. Even though tuning hyper-parameters of an ML algorithm for each particular dataset could achieve good performance, this procedure can be time consuming. To tackle the issue, some approaches introduce optimisation algorithms to tune hyper-parameters for a type of datasets from the same domain so that we do not have to tune them for every specific task. Hyper-heuristics have been mainly utilised because of their higher generalisation ability compared to other optimisation techniques. Different from metaheuristics based methods, hyper-heuristic-based approaches usually search for a set of promising rules, strategies or methods, which can be further used to design an ML algorithm for different but related datasets.

Most of the existing methods used EA-based hyper-heuristics to select a set of generic hyper-parameters of an ML algorithm for a type of datasets, in which the selection mechanism was based on natural selection and the low-level heuristics were encoded into a gene sequence [17, 46]. After the whole run of an EA, the hyper-parameters for designing an ML algorithm tailored to a type of datasets can be obtained. Although these methods can be time consuming, they have been demonstrated to achieve better generalisation performance for a type of datasets so that there is no need to spend additional time tuning hyper-parameters on unseen datasets in the same type. In [17], a hyper-heuristic EA for designing DTs was proposed. In this study, each individual in the EA was encoded as an integer vector and the genes of each individual were related to different rules and parameters of designing a DT, such as the split rules, stopping criteria. The fitness was decided by validation performance. After the whole run of EA, a specific DT induction algorithm for a type of datasets can be obtained. The approach in [46] proposed a similar approach to automatically design Bayesian network classifier based on hyper-heuristic EA. In this approach, each individual of EA contained 11 genes, the first of which represented one of 12 search algorithms for generating the structure of a network, the other of which represented the strategies and parameters related to the corresponding search algorithm.

3.4 Model training

Model training is the core of an ML approach. Usually, this task can be transformed into an optimisation problem by building a loss function based on an ML model and minimising it with respect to the model parameters. For most ML models, the parameters are continuous variables so that a differentiable loss function can be built in order to allow gradient-based optimisation algorithms to solve the problem [134, 146]. However, gradient-based algorithms may be prone to fall into local optima. To address the issue, metaheuristics are sometimes introduced to search for the optimal model parameters utilising their global search ability. In addition, if model parameters are discrete variables or the loss function is non-differentiable, metaheuristics have also been used to train an ML model.

3.4.1 Gradient-based optimisation

The optimisation problems for training an ML model can be broadly divided into convex and non-convex optimisation [82, 162]. For example, training a SVM model can be formulated into a convex quadratic programming problem, which can be solved by a sequential minimal optimisation algorithm [134]. In terms of non-convex optimisation, learning the weights of deep NNs can be transformed into a non-convex optimisation problem, which can be tackled by various gradient-based algorithms. These algorithms can be classified into batch gradient descent, stochastic gradient descent (SGD) and mini-batch gradient descent based on how much data was used to compute the gradient in each iteration. These vanilla algorithms suffer from a few problems, such as fluctuation in SGD and choosing suitable learning rate. To overcome these shortcomings, a number of improvements have been proposed, such as Momentum [135], Adaptive Moment Estimation (Adam) [91], etc. An overview of different variants of gradient descent algorithms for training NNs was provided in [146].

3.4.2 Metaheuristic

To prevent model parameters from falling into local optima, metaheuristics have been introduced to train an ML model due to their global search ability. Even though these methods are usually more time consuming, they may achieve better performance. For example, in [56, 125, 193], several overviews of approaches that optimise the weights of NNs based on various metaheuristics were provided. For some ML algorithms, some of the model parameters are discrete, and therefore gradient-based optimisation algorithms do not work on training these models, instead metaheuristics can be used. For example, the splitting attributes of a DT are discrete model parameters. They can be encoded into a chromosome of an EA and optimised through the evolution process [16].

3.5 Other ML areas improved by optimisation

Apart from the previous stages in which optimisation have been incorporated in the conventional ML cycle, optimisation has also been used for some specific ML tasks, such as clustering [79, 116], ensemble learning [34, 140] or association rules mining [4, 44]. Section 3.5.1 provides a brief overview of optimisation based clustering. Section 3.5.2 investigates the approaches that use optimisation techniques to improve ensemble learning.

Fig. 3
figure 3

(this taxonomy is adapted from [84])

Interaction 2: machine learning for optimisation

3.5.1 Clustering based on optimisation

Different from previously discussed typical ML methods, clustering is an unsupervised learning ML task, whose goal is to determine a finite set of clusters to describe a dataset according to similarities among its objects [86]. It can be considered as an optimisation problem that maximises the homogeneity within each cluster and the heterogeneity among different clusters [8]. Optimisation techniques have been used to directly perform clustering [69, 79, 152] or to assist the existing clustering algorithms [151]. Most approaches used metaheuristics to directly perform clustering by transforming it into an optimisation problem [69, 79, 152]. The study in [79] provided a survey of using EAs to perform clustering. A taxonomy that highlighted some very important aspects in the context of evolutionary data clustering was presented in this review. A recent survey provided an investigation on multi-objective evolutionary clustering techniques [116]. Conversely, some other approaches introduced metaheuristics to assist the existing clustering methods, such as using GAs to initialise the k-means algorithm [151].

Some other approaches utilise hyper-heuristics to do clustering, in which suitable heuristics are automatically chosen during iterations. Compared to metaheuristic-based methods, these approaches are more generic for different clustering problems and have a lower probability of falling into local optima. For example, in [38], the authors proposed a hyper-heuristic-based approach for web document clustering. The approach used random selection and roulette wheel selection to choose the low-level heuristic based on their performance calculated by Bayesian information criteria. Furthermore, two acceptance strategies were used, replacing the worst and restricted competition replacement. The approach in [172] proposed a hyper-heuristic clustering algorithm, which chose three metaheuristics, simulated annealing, tabu search, GAs, and one classic clustering method, k-means, as low-level heuristics. One of the methods was selected iteratively based on their previous performance, which may prevent a fixed algorithm from being trapped in local optima.

3.5.2 Ensemble learning based on optimisation

Ensemble learning combines the predictions of individual ML models to obtain better performance [140]. Conventional ensemble methods include bagging, boosting and random forest, decomposition methods, negative correlation learning methods, fuzzy ensemble methods, multiple kernel learning ensemble methods, etc [140]. As [74] pointed out, the key to successful ensemble learning is to construct individual predictors which are accurate and disagree with one another. This means a good ensemble learning approach should consist of a set of accurate and diverse predictors. Therefore, it is natural to incorporate multi-objective optimisation techniques into ensemble learning to search for a set of individual predictors that are both accurate and diverse. Since multi-objective EAs can produce a set of trade-off predictors in the form of a non-dominated set [118], they have been used frequently to assist ensemble learning [34, 35, 61]. An example can be found in [34], this approach used multi-objective EAs to find the optimal trade-off between diversity and accuracy of an ensemble of NN classifiers. Each particular NN was treated as an individual in the population and the final ensemble was the non-dominated set of NNs in the final generation.

4 Machine learning for optimisation

There are many approaches that use ML techniques to improve optimisation algorithms, mainly metaheuristics and hyper-heuristics. The aims of introducing ML to optimisation are to speed up the search process and to improve the quality of solutions. The general approach of this field is to use ML techniques to extract knowledge from the data gathered from optimisation processes. Then, the extracted knowledge, which is usually represented by a model or rules, is used to tune or substitute for a component of an optimisation algorithm. Multiple overviews have recently covered the interactions between ML and metaheuristics [33, 39, 84, 197], however, to the best of our knowledge, there is no survey presenting a categorisation on the way that ML is used for enhancing hyper-heuristics. In addition, ML techniques can also be used to choose the best performing algorithm for a particular optimisation problem. Section 4.1 provides a brief overview of using ML to enhance metaheuristics based on previous review papers. Section 4.2 investigates the approaches that used ML to improve hyper-heuristics. Section 4.3 presents the latest advances on algorithm selection in optimisation based on ML techniques. Overall, the ways of applying ML to improve optimisation are shown in Fig. 3.

4.1 Machine learning for metaheuristic optimisation

A metaheuristic is a high-level problem independent algorithmic framework that provides a set of strategies to develop heuristic optimisation algorithms [158]. This framework is general, and it can be applied to different optimisation problems in the same domain or even in different domains. However, this problem solving process can be time consuming and the method may be prone to fall into local optimum. To tackle these problems, ML techniques have been used to accelerate the search process and improve the quality of solutions. There are several previous review works about improving metaheuristics by ML [33, 39, 84, 197]. They provided similar taxonomies and one of them proposed in [84] depended on three different criteria, namely localisation, aim and kind of knowledge. Based on this taxonomy, we summarise the recent advances in the literature.

Depending on the localisation where ML techniques are embedded in an optimisation algorithm, the approaches that used ML to improve metaheuristics can be categorised into eight classes, specifically evaluation, decision variables, parameters, initialisation, population, operators, local search and problem instances.

  • Evaluation For some cases in optimisation, the objective function is hard to be expressed by a mathematical model or computationally expensive, for example, many engineering problems require simulations to evaluate a solution. To solve the problem, ML techniques are usually used to approximate the objective function of a optimisation problem, which is also a branch of surrogate models [15], or reduce the number of solutions to be evaluated. A recent approach in [131] used generalised regression NNs to approximate the fitness function in PSO to speed up evaluation. Several overviews of using surrogate models to assist evolutionary computation were provided in [41, 48, 83]. To reduce the number of solutions to be evaluated, some other methods used a clustering algorithm to group the population of a metaheuristic and then only chose representative individual from each cluster to be evaluated [90]. Some other approaches focused on multi-objective optimisation problems. These methods used ML techniques, such as principal component analysis [148] or feature selection techniques [104], to reduce the number of objectives in a multi-objective optimisation problem for the sake of less computational burden and complexity.

  • Decision variables An optimisation problem usually includes a set of decision variables, whose values need to be determined to solve the problem. Conventional metaheuristics seldom take into account the scalability of the number of decision variables. However, this factor sometimes may influence the performance of an optimisation process, especially for problems with large-scale decision variables. To fill the gap, some approaches utilise ML techniques to deal with decision variables so that they can be optimised efficiently. A recent EA-based approach introduced a clustering algorithm to separate the decision variables into convergence-related and diversity-related ones, and two different EA strategies were applied to dealing with the two classes of decision variables, respectively [198].

  • Parameters A metaheuristic typically consists of many adjustable parameters, which have great impact on the performance of an optimisation process. To choose promising parameters automatically, ML techniques have been introduced to choose suitable ones before an optimisation process (parameter tuning) or even adjust them adaptively during the run of an optimisation algorithm (parameter control). For example, in EAs, we have population size, crossover probability and mutation probability, among others. These parameters can be adaptively set using ML algorithms. One instance was given in [196], in which a clustering algorithm was used to find different subgroups within the population of a GA, then crossover probability and mutation probability were adaptively adjusted based on the relative size of the cluster containing the best chromosome and the one containing the worst chromosome.

  • Initialisation Normally, a metaheuristic starts searching from a single or a number of randomly generated initial solutions, which may include some bad solutions. ML algorithms can be used to help generate a number of relatively promising initial solutions, which can accelerate convergence. For example, a clustering based initial population strategy is proposed for travelling salesman problem [45], in which the cities were clustered into several groups based on k-means and then an initial solution can be obtained by using a GA to find the local optimal path for each group and a global optimal path connecting different groups.

  • Population Some metaheuristics are population-based methods, such as EAs and PSO, in which a population comprised with a number of solutions evolves iteratively. The diversity and quality of the population highly influence the results. To improve the quality of solutions, ML techniques can be introduced to maintain population diversity or predict promising regions. Clustering has been the most frequently used ML method to maintain population diversity. The approach in [164] used a clustering algorithm to group the population of a GA into several sub-populations and operate selection within each sub-population. To predict promising region in the search space, several ML algorithms were used, such as clustering algorithms [87] and SVMs [191], etc. The approach in [191] learned the mapping between best individuals and their corresponding fitness function values based on SVMs in each generation of an EA, and then searched for the promising individuals according to the learned model.

  • Operators A metaheuristic normally has some operators to alter solutions, such as selection, crossover and mutation operator in EAs. To accelerate the convergence of an optimisation process, ML techniques have been applied to adaptively choosing suitable operators for each problem state during searching. RL is a frequently used ML algorithm. For instance, the approach in [195] proposed an adaptive evolutionary programming algorithm based on RL, in which the optimal mutation operator was chosen for each individual based on immediate and delayed performance of mutation operators.

  • Local search metaheuristics are often hybridised with local search to exploit local region of candidate solutions in metaheuristics. ML techniques can be applied to helping local search. For example, in [109], clustering algorithm was used to cluster the population into several groups and the best individual in each group was refined by local search. However, the local search can be computational expensive. To address the issue, the approach in [183] proposed a pruning strategy for bee colony optimisation algorithm [184, 185], which employs the bi-directional extension based frequent closed pattern mining algorithm to only allow relatively better bees to undergo local optimisation.

  • Problem instances An instance indicates a specific description that can be used as an input for a given problem in optimisation. For some optimisation problems, the number of existing real-world or even synthetic instances are limited. More importantly, there might be many problem instances but they might not differ much from each other when their characteristics (instance features) are considered. It is not trivial and reliable to evaluate an optimisation algorithm based on such insufficient instances, since they cannot possibly cover all the representative regions in the instance space. To address that issue, ML techniques have been used to generate synthetic instances with varying characteristics from different regions of the instance space. It should be noted that this method does not assume any particular optimisation method and useful purely for a better and informed performance comparison of optimisation methods, including metaheuristics and hyper-heuristics. The approach in [103] is an illustration of this research topic, which used DTs to classify the real and generated instances into their corresponding classes and the instance generator was iteratively modified according to the feedback from the classifier in order to generate more real instances.

Based on the type of knowledge that ML techniques extract from the optimisation processes, the approaches can be categorised into the ones that use a priori knowledge and the ones that use dynamic knowledge [84]. From the perspective of ML, these could also be classified as metaheuristics that use online (dynamic) and offline (a priori) ML techniques. For the methods that use dynamic knowledge or online ML techniques, knowledge is extracted in each iteration to guide the following search process. For example, as discussed previously, the approach in [195] used RL to learn optimal policy of choosing mutation operators. The policy was updated dynamically based on the performance of mutation operators every iteration. For the approaches that use a priori knowledge or offline ML techniques, knowledge is extracted by an offline ML algorithm from solving different instances in a specific optimisation problem domain and use the knowledge to help solving unseen instances in the same problem domain [190, 191].

Focusing on the main goal for which ML is introduced in metaheuristics, we can classify these approaches into two categories, to speed up search process and to improve the quality of solutions. Some of previously discussed approaches aimed to speed up search process, such as using ML algorithms to approximate costly objective function [80] and to generate promising initial solutions [190]. Some of them aimed to improve the quality of solutions, such as using ML techniques to maintain population diversity [164] and to choose suitable operators for individuals [195].

4.2 Machine learning for hyper-heuristic optimisation

The main difference between metaheuristics and hyper-heuristics is that metaheuristics use a pre-designed algorithmic framework comprised by specific parameters and operators to solve a problem, while hyper-heuristics utilise selection or generation mechanisms to automatically design an optimisation algorithm by selecting or generating suitable heuristics during the search process. Hyper-heuristics typically have a stronger generalisation ability than metaheuristics and can be applied to different optimisation problems without significant modifications. In the literature, there is a certain degree of confusion between hyper-heuristics and the approaches discussed in Sect. 4.1 that used ML algorithms to adaptively select suitable parameters and operators for metaheuristics. Even though researchers in different fields use different terms to describe them, actually they share the same goal of automatically designing an algorithm for an optimisation problem. Concentrating on the difference between metaheuristics and hyper-heuristics, we divide this section into two subsections. The first presents an overview of the approaches that use ML techniques to improve selection or generation mechanism in hyper-heuristics, which is missing in metaheuristics. The second reviews the methods that aim to accelerate the evaluation of hyper-heuristics, which can also be found in a hybridised metaheuristic.

4.2.1 Learning to select heuristics

There are many approaches that apply ML techniques to learning how to select heuristics in hyper-heuristics. Depending on the ML techniques they used, these methods can be generally classified into three categories, RL-based approaches, tensor analysis based approaches and apprenticeship learning approaches.

4.2.2 4.2.1.1 RL-based approaches

In this branch, these methods use RL to learn a heuristic selection mechanism in an online manner, in which the selection mechanism keeps changing according to the performance of low-level heuristics. They can be further split into two classes, using only RL, hybridising RL with other algorithms.

  • Using only RL A simple and intuitive way of learning the best low-level heuristic during the optimisation process may be based on classical RL, so that good performing heuristics are positively rewarded while bad performing heuristics are negatively punished. Two approaches in [95, 121] used such simple RL scheme to reward and punish the weights of low-level heuristics based on their previous performance and selected low-level heuristic using roulette wheel method according to their weights in each iteration. Rather than using such simple RL scheme, a more complex RL method, Q-learning, was introduced to learn to select heuristics in [36], which strictly follows the criteria of RL. In [47], the author proposed a RL-based hyper-heuristic method that fulfilled the criteria of RL. In their approach, several variants of RL were investigated to test their performance on selecting heuristics.

  • Hybridising RL with other methods Only using RL as heuristic selection mechanism may result in sticking to a specific heuristic if the heuristic is heavily rewarded previously. To address the issue, Tabu search was introduced to prevent a specific heuristic from being chosen for certain times [31]. Previous approaches that used RL to select heuristics only focused on which heuristic was most suitable in each iteration. However, choosing a suitable sequence of heuristics may benefit more for a consecutive optimisation process. Therefore, a Markov chain and RL-based hyper-heuristic method was proposed in [110], in which each state of Markov chain was represented by a low-level heuristic and the transition weights of Markov chain was updated adaptively by RL based on the probability of generating dominating solutions. This method did not only learn which low-level heuristic was effective, but also focused on which sequence of low-level heuristics was promising. Besides, RL can be utilised to improve the heuristic-based selection mechanism of a hyper-heuristic. The approach in [51] used RL to improve Choice Function heuristic selection mechanism [40] by adaptively adjusting the weights of measures in the choice function. If no improvement was made, the intensification weight would be richly rewarded; otherwise, the diversification weight would be heavily punished.

From a critical viewpoint, the author of [3] presented a theoretical analysis of the limitation of using RL as selection mechanism in hyper-heuristics. Their theoretical results showed that if the probability of improving the candidate solution in each point of the search process is less than 50% which is a mild assumption, then additive RL mechanisms perform asymptotically similar to the simple random mechanism.

4.2.3 4.2.1.2 Tensor analysis based approaches

Tensor analysis, also known as tensor calculus, is a collective term for the techniques that investigate high dimensional data and extract latent patterns and correlations between various modes of data [11]. It is used for elegant and compact formulation and presentation of equations and identities in mathematics, science and engineering. It is also used for algebraic manipulation of mathematical expressions and proving identities in a neat and succinct way [157]. Some hyper-heuristic approaches use such technique to extract knowledge for heuristic selection. Rather than directly choosing a promising heuristic in each iteration, these approaches select a suitable group of heuristics for a move acceptance method based on tensor analysis in order to group heuristics that work well with each move acceptance method [10, 11]. The approach in [10] is an offline learning process, in which low-level heuristics were grouped based on tensor analysis once at the beginning of the search. In [11], the authors further extended the method to an online learning manner and made it more flexible. In this method, the low-level heuristics were partitioned dynamically during the search process and the selected heuristics can be overlapped between different groups.

4.2.4 4.2.1.3 Apprenticeship learning approaches

Promising hyper-heuristics, such as AdapHH [112] which was the winner of a cross domain heuristic search challenge among twenty competitors, consist of effective manually designed heuristic selection mechanism. Instead of directly improving these methods, some approaches use ML algorithms to learn selection mechanism from them and may achieve better performance on some problem instances. These approaches can be seen as apprenticeship learning, which is a process of learning by observing an expert [1]. These apprenticeship learning methods use different ML algorithms, such as NNs and DTs, to learn heuristic selection mechanism from experts, such as AdapHH and choice function [40]. Most of them learn in an offline manner. The approach in [173] trained a time delay NN to extract heuristic selection mechanisms from the training examples that were generated by a promising heuristic selection method, choice function. Then, the time delay NN was used as a classifier to select suitable low-level heuristic for solving unseen problems. In [9], several DTs were trained to imitate heuristic selection of a promising hyper-heuristic method, AdapHH. Before training, several datasets were constructed by applying AdapHH to an optimisation problem and collecting search states. Then, predictors were trained based on these datasets by means of supervised learning. The previous approaches used promising hyper-heuristics as experts, while some other methods chose some evaluation functions as experts. In [127], the author used NNs to learn the hidden patterns between problem states and the promising low-level heuristics, in which the problem states were represented by constraint density and tightness. The approach in [194] learned recurrent neural networks (RNNs) [78] to predict next suitable heuristic based on a set of promising heuristic sequences according to the final log return. The trained RNNs can be used later to generate a sequence of heuristics on an unseen problem.

4.2.5 Learn to accelerate evaluation

In addition to learning heuristic selection mechanism, there are also a few approaches aiming at speeding up the whole hyper-heuristic process with the help of ML techniques. Generally, they learn to evaluate the solutions by classification or regression methods.

  • Classification As discussed in Sect. 4.1, evaluating solutions sometimes can be time consuming. Different from the approaches that use ML to speed up metaheuristics, accelerating hyper-heuristics can be treated as a classification problem, because hyper-heuristics include an acceptance mechanism that decides whether a solution is accepted or rejected. Therefore, a classifier can be trained to classify a solution into an acceptable one or an unacceptable one without explicitly computing the objective function. In [99], the authors focused on finding global hidden patterns in large data sets of heuristic sequence. The approach used NNs and logistic regression to classify the intermediate solutions generated by the hyper-heuristic into “good” solutions and “bad” solutions. Another method in [100] transformed a schedule as a pattern and trained a NN to learn if the pattern is good or not. Inspired by [100], another approach utilised a NN to learn if the relative changes in a schedule would improve the performance [174], which was faster than [100].

  • Regression Another way to accelerate the evaluation for hyper-heuristics is to train a regression model to approximate computationally expensive objective function, which follows the same way that uses ML techniques to accelerate evaluation for metaheuristics and is also related to surrogate model. The approach in [159] proposed an evolvability metric estimation method based on a parallel perceptron, which accelerated the online heuristic selection process. In this study, the fitness value of an incoming solution was estimated by a single layer of perceptrons.

Fig. 4
figure 4

Interaction 3: machine learning for machine learning

4.3 Machine learning for algorithm selection in optimisation

The algorithm selection in optimisation was first proposed by [142] aiming at answering the question: which algorithm is likely to perform best for my (optimisation) problem? This problem can be transformed into a learning task by mapping the features of problem instances to the best performing algorithm or algorithm performance. The algorithm selection in this section is related to the algorithm selection mechanism of hyper-heuristics. The former aims to perform algorithm selection before solving a problem while the latter executes algorithm selection during the iterative search process. However, the former can also be transformed into an online selection approach, which selects a suitable algorithm during the search, and solved by hyper-heuristics. However, it is noteworthy that the representation of the solutions generated by different algorithms during the search should be in the same form. For example, we cannot directly pass the solutions represented by a binary string of an EA to a PSO algorithm which represents the solutions by real numbers. This insight prevents most ML algorithm selection problems (Sect. 3.2) from being solved by hyper-heuristics, because most ML algorithms use different models to solve a problem. However, algorithm selection for clustering can be an exception, since the solutions generated by different clustering algorithms can be kept in the same form as clusters, which is discussed in Sect. 3.5.1. There has been a growing number of studies on algorithm selection in optimisation, recently. The approach in [154] extended the framework of algorithm selection in [142] and introduced a case study on graph colouring problem. Rather than only choosing the best performing algorithm, this method aimed to use the meta-data to identify the strengths and weaknesses of algorithms, in which a few ML techniques, such as Naive Bayes classifiers, SVMs or NNs, were used based on the type of performance metrics. A recent work in [120] focused on algorithm selection on continuous black-box optimisation problem. This method first used exploratory landscape analysis to produce a set of instance features, then applied correlation analysis to select relevant features, finally used SVMs to learn the mapping between instance features and the best performing algorithm or some other algorithm performance measures. A case study on algorithm selection for the generalised assignment problem was provided by [42], in which Random Forest was used to learn the mapping between instance features and suitable algorithms. This work also discussed a few practical issues when applying algorithm selection, such as deciding whether to conduct algorithm selection or not. A latest survey provided an overview of algorithm selection on continuous and discrete optimisation problems with the help of ML techniques and also discussed several related problems, such as automated algorithm configuration and algorithm schedules [89].

5 Machine learning for machine learning

In Sect. 3, we have discussed how to make an ML approach more automatic from the point of view of introducing optimisation techniques. However, searching for a suitable ML algorithm and its corresponding hyper-parameters through optimisation techniques sometimes can be time consuming. Due to the powerful learning ability of ML techniques, researchers filled the gap by applying another ML algorithm at the meta-level to learn meta-knowledge from fulfilling base-level ML tasks and then using the learned meta-knowledge to guide unseen ones. Such approach is called meta-learning, also known as learning to learn. Several previous works provide overviews of meta-learning [5, 24, 97, 175, 177]. The most recent one in [175] categorised meta-learning techniques depending on the type of meta-data they leveraged, in which the meta-data were broadly classified into model evaluations, task properties and prior models. From another point of view, we provide a categorisation based on the type of meta-knowledge as shown in Fig. 4. This section presents an investigation on the recent progress in meta-learning. Most of these approaches focus on NNs, which are further discussed in the following subsections.

5.1 Learn to select algorithms

Rather than determination of an ML algorithm by experts, the meta-learning methods in this field work towards using ML techniques to learn how to automatically select appropriate algorithms for an ML task. This question can also be tackled by optimisation techniques as discussed in Sect. 3.2; however, using optimisation algorithms to search for a suitable algorithm can be time consuming, because we need to perform an ML algorithm during each optimisation iteration. To address the issue, some meta-learning approaches utilise ML algorithms to learn to select algorithms, in which the meta-model can efficiently choose suitable algorithms after training. Generally, these methods learned a classifier that determined a specific ML algorithm or a ranking of ML algorithms based on the characteristics of a task [25, 141], or a regression model that captured the mapping between the characteristics of ML tasks and algorithm performance [62]. A recent approach took into account runtime and incorporated multi-objective measures that comprised of accuracy and runtime into two algorithm selection methods, average ranking and active testing, to accelerate the selection process [2].

5.2 Learn to tune hyper-parameters

The meta-learning approaches in this class concentrate on learning how to tune hyper-parameters for an ML algorithm. Different from the methods that use optimisation algorithms to search for promising hyper-parameters in Sect. 3.3, most of these approaches aim to learn the mapping between hyper-parameters and generalisation performance, which is usually represented by algorithm’s performance on the validation datasets or the model parameters leading to good generalisation performance. Then hyper-parameter tuning can be conducted efficiently based on the learned mapping. Compared to the optimisation based methods that search among the hyper-parameter space, these meta-learning approaches are much more time-saving because they do not need to train a model for each set of hyper-parameters during searching, instead they evaluate a model by predicting the generalisation performance based on the learned mapping model. Even though training the mapping would take some time, hyper-parameter tuning can be fast after training. A recent approach in [105] introduced NNs to learn the mapping between hyper-parameters and the approximate optimal parameters of a base model. In their study, the optimal base model can be directly given without time consuming training for each set of hyper-parameters, which can accelerate the hyper-parameter tuning. Furthermore, using NNs to approximate the black-box function between hyper-parameters and the optimal model parameters allowed gradient descent algorithms to be introduced to find promising hyper-parameters. Another two approaches trained regression models, such as support vector regression and NNs, to predict the performance or weights of NNs with respect to their architecture in order to speed up the search for the promising architecture of NNs [14, 27].

5.3 Learning to train a model

These approaches focus on learning an efficient meta-learner that can be used to optimise the parameters of an ML model. Conventional ML algorithms typically use a manually designed optimisation algorithm, such as gradient-based methods, to train a model by building a loss function and minimising it with respect to the model parameters as discussed in Sect. 3.4. However, such process sometimes can be time consuming and stuck in local optima. To overcome these weaknesses, meta-learning methods in this class work towards learning an efficient optimisation algorithm for the sake of faster convergence and better performance. These approaches can be categorised into learning an optimiser and fast adaptive parameters based on whether the model parameters are iteratively updated.

  • Learning an optimiser These approaches target at learning an optimiser that can be used to update the model parameters iteratively based on a sequence of gradients and objective values for a class of ML tasks. The two most representative approaches changed the design of an optimisation algorithm within an ML method into a learning problem [7, 102]. The method in [7] trained a RNN as an optimiser to update the weights of a base ML model based on the gradient information. In [102], the authors transformed a parameter optimisation problem into a RL one, in which a policy was trained as an optimiser to update the model parameters iteratively. These approaches both outperformed the existing hand-designed optimisation algorithms on some ML problems, while they were not as stable as hand-designed algorithms due to the limited learning ability of the meta-learner. Also, they were restricted to solve a class of problems. To address the issue, the approach in [181] proposed learned optimisers that scale and generalise by introducing a hierarchical RNN architecture, which had high generalisation ability to new tasks.

  • Adaptive parameters Rather than learning an optimiser that updates the model parameters iteratively, some approaches aim at learning to generate quick adaptive model parameters. Since these methods predict model parameters in a feedforward manner without the backward propagation of errors, they are much faster than those in the first branch while the quality of the generated parameters may be not guaranteed. Some methods in this branch focused on generating adaptive weights for different time steps in RNNs [73, 149]. Different from conventional RNNs that share weights across all the time steps, these approaches meta-learned another RNN that predicted specific weights for each time step of the base RNN. The adaptive weights allowed the base RNN to process each element in the input sequence differently, which can improve the performance. Some other approaches focused on producing adaptive model parameters for different tasks. In [119], the model parameters were divided into slow weights and fast weights. Slow weights were generic weights across tasks, while fast weights were task-specific weights, which were outputted by a trained meta-learner.

5.4 Learning an algorithm component

An algorithm component can be seen as a discrete hyper-parameter of an ML algorithm, such as an activation function in NNs and a splitting criterion in DTs. Sections 3.3 and 5.2 focused on selecting such hyper-parameters from a pool of manually designed algorithm components. However, some meta-learning approaches aim to learn an algorithm component by embedding another ML model into a base one to represent the algorithm component and training them jointly. These approaches can achieve better performance because the algorithm components are automatically learned for a specific ML task while hand-designed ones are generic. A recent approach in [176] explored task-specific activation functions based on NNs rather than choosing a suitable hand-designed one. The experimental results showed that the explored new activation functions were different from the common ones and achieved better performance. Some other approaches worked towards exploring a promising policy or strategy in an ML algorithm, which is usually designed by experts. In [12], a meta-learning method was proposed to learn the strategy of selecting examples to label for active learning based on Matching Networks [178]. Another approach in [188] learned a splitting criterion for designing a DT based on a RNN, in which the learned RNN controller was used to predict the splitting feature at each non-leaf node. In [43], a strategy to control the magnitude of gradient updates was proposed for weakly labelled data. A confidence network was meta-trained to weight every update for the parameters of a base model in order to alleviate the influence of noisy data.

5.5 Learning to transfer knowledge

In ML, many different ML tasks are related to each other, such as handwriting recognition in different languages. Similarly, in the optimisation field, a lot of different optimisation problems are closely connected, for example different instances of the travelling salesman problem. These relevant tasks or problems contain some useful generic knowledge that can be utilised to assist other similar tasks or problems. In order to achieve faster convergence and better performance, some methods discussed in Sect. 4 focus on extracting common knowledge from solving a number of instances in a specific optimisation problem domain and apply the knowledge to solving unseen instances in the same domain. For the same purpose, in the ML field, some meta-learning approaches extract the meta-knowledge from one ML task and transfer it to a different but related task, or extract the common meta-knowledge across a particular class of tasks and apply it to unseen tasks in the same class. Even though the approaches in different fields solve different problems, they share the same goal of improving algorithms by transferring knowledge.

  • Transfer knowledge across tasks Some approaches extract the meta-knowledge from one ML task and transfer it to another different but related task. In [180], the authors proposed a meta-learning method to model the minority class of imbalanced datasets. They transferred the knowledge extracted from data-rich classes to data-poor classes by learning the mapping between the parameters of many-shot model and those of few-shot model on data-rich classes and transferring it to data-poor classes. This approach presented a new idea of learning from imbalanced data. A comprehensive overview of this field was provided in [130]. This study focused on categorising and reviewing transfer learning for classification, regression and clustering problems.

  • Learn common meta-knowledge for a class of tasks Some other approaches aim to extract the common meta-knowledge, which can be shared within a class of tasks. For example, the approach in [144] learned a set of basic function blocks that were shared across different tasks. These function blocks were iteratively selected by a trained router to construct a model for a specific task. A similar method in [57] learned a group of shared basic policies for a distribution of RL tasks, which can be quickly switched between by a trained master policy for unseen tasks. Another approach for RL learned a shared baseline reward function for a class of tasks [101]. The task-specific reward functions were estimated as a probabilistic distribution conditioned on the baseline reward function.

    Many other approaches in this research area concentrate on a specific challenging problem, few-shot learning, in which a concept need to be learned from only one or few examples. This problem cannot be tackled by conventional ML algorithms due to the limited number of examples. To address the issue, many meta-learning methods were proposed in this field. They aim to extract the common meta-knowledge across different few-shot learning tasks within a particular distribution and apply it to unseen tasks in the same distribution. These approaches can be roughly categorised into two classes. The first learns a common feature extractor for a distribution of few-shot learning tasks and do classification by measuring the similarity between the extracted features of examples [65, 92, 111, 147, 155, 166, 178]. These approaches usually chose convolutional neural networks (CNNs) [96] or long short-term memory (LSTM) [76] as a common feature extractor and used Euclidean distance, cosine distance or learned distance metric to measure similarity. The second class focuses on fast parameterisation of a base model by learning a common promising initialisation, parameter updating rules or adaptive parameters for a distribution of tasks [20, 55, 119, 124, 136, 138]. The learned common initialisation or parameter updating rules allowed the model parameters to be quickly tuned to task-specific ones based on few examples.

Fig. 5
figure 5

Interaction 4: optimisation for optimisation

6 Optimisation for optimisation

Similarly to meta-learning in ML, optimisation can also be improved by self-interaction. There are various ways how an optimisation algorithm can be used to improve itself or an other optimisation algorithm as shown in Fig. 5. Generally, these approaches perform search at a high-level and operate in an offline or online manner. Since there are extensive studies providing comprehensive reviews of the relevant well-established research fields ranging from automated algorithm generation to memetic computing [23, 28, 29, 52, 85, 122, 123, 153, 165], we briefly cover a few selected studies in this section to point out those different research areas.

Hyper-heuristics are search methods to select or generate (meta)heuristics; hence, they can be considered as optimisation methods improving individual performance of constituent heuristics for improved performance [28, 30]. Automated generation of heuristics is of interest in many fields and Genetic Programming is one of the most commonly used tools as an optimisation technique to generate components of optimisation approaches [29]. The studies in [23, 123] covered many approaches using Genetic Programming (hyper-heuristics) for solving various scheduling problems. As an example of a selection hyper-heuristic, the approach in [32] employed tabu search for detecting the best permutation of graph colouring heuristics to cooperatively construct near-optimal exam and course timetables. The results showed that mixing different heuristics rather than using one individually yield improved solutions. As a separate well-established field of research, there are many studies on cooperative search methods, that is, parallel/distributed implementation of multiple optimisation techniques improving upon the performance of each individual technique run on its own [52].

The majority of optimisation methods come with a set of algorithmic parameters influencing their performances and various algorithms might perform well on different instances and problems. Therefore, obtaining the best parameter setting of an optimisation algorithm can be formulated as another optimisation problem. Optimising the parameter settings as well as choosing the best algorithm for a problem can be performed by an optimisation algorithm instead of a ML technique as discussed in Sect. 4. Some approaches (e.g., meta-optimisation, automated algorithm configuration) aim to use an optimisation method to tune the parameters of another optimisation algorithm before the search or even configure the optimisation algorithm and its components [49, 165]. For example, an EA can be used to search the best parameter settings for or configure another EA [49]. On the other hand, some other approaches, such as, co-evolutionary algorithms and hyper-heuristics often control and modify the algorithmic parameter settings adaptively during the search process for improved performance. An illustration of parameter control can be found in [128], which embedded the operator choices as well as their parameter settings of a memetic algorithm into the solution representation and co-evolved them adaptively along with the solutions to the problem instance in hand. As another example [112] presented an effective selection hyper-heuristic which not only chose the best operator to invoke but also set its parameter(s) adaptively at each decision point while solving a given problem instance. Both examples also show that the optimisation algorithms used in the optimisation-for-optimisation category can carry out multiple improvement tasks simultaneously.

7 Remarks and discussion

This paper has provided a global overview of the interactions between ML and optimisation as well as their self-interplays. Different from the existing review works that presented comprehensive investigations on an individual or dual interactions, we have aimed to draw a whole picture of the interplays between these two fields and link different interplays by comparing the similar approaches. Background and latest advances of both research area have been provided. In addition to investigating the representative methods, we have presented and discussed individual taxonomies for each interaction. We have also analysed the advantages and disadvantages of the approaches in different interactions that use different techniques to tackle the common tasks. Therefore, this paper serves as a useful tutorial for researchers in ML and optimisation fields to collaborate with each other for the sake of improvement. It has also provided the guidance for non-experts in these two fields on how to automatically design an algorithm without much experience.

Even though a lot of works have been done in the whole research area, there still remains some issues to be explored. Several guidelines on future research directions can be suggested.

  • When a high-level ML or optimisation algorithm is introduced into a base-level ML or optimisation one, that is likely to introduce additional parameters to be tuned. Especially, when we use a high-level algorithm to design a base-level one, this would add the extra work of designing the high-level algorithm. Since our focus is on the base-level ML tasks, we usually do not pay much attention to the design of the high-level method and choose a regular configuration for the high-level algorithm to make it effective. A potential research direction would be simultaneous automated design of high-level and base-level algorithms.

  • The existing approaches usually focus on a single interaction between ML and optimisation, while very few of them consider multiple interactions, which means using an improved ML or optimisation algorithm to further enhance other ML or optimisation algorithms. It could be a promising research direction because multiple interactions mean multiple improvements. For example, the research work in [7] introduced ML techniques to learn a promising optimisation algorithm and the learned optimisation method is further used to quickly train an ML model. Auto-sklearn [54] introduced meta-learning techniques to be complementary to Bayesian optimisation for optimising an ML framework. Another research work in this direction can be found in [143].

  • There are many software tools for ML and optimisation respectively, such as Weka 3 [182] and scikit-learn [132] for ML, HeuristicLab [179] and Opt4J [107] for optimisation. However, there is no API that enables them to smoothly interact with each other. Since more and more approaches about the interplays between ML and optimisation appear in the literature, the researchers would benefit from such an API that connects ML and optimisation software. In addition, there are several pieces of software for automating the design of ML approaches based on optimisation techniques, such as Auto-WEKA 2.0 [93], auto-sklearn [54] and TPOT [126]. They can automatically select suitable ML algorithms and tune corresponding hyper-parameters for a specific dataset, while they still cannot handle challenging ML problems, such as imbalanced classification or few-shot learning. Therefore, a more powerful tool that can combine ML and optimisation library and tackle challenging problems need to be developed in the future.

  • With the rapid development of technologies, the real-world problems in both ML and optimisation field are becoming more and more complex and the scale of them are getting larger and larger. These problems include or generate massive data, which cannot be mined efficiently by conventional ML algorithms. However, very few existing approaches of the interactions between ML and optimisation have taken into account big data issues [129]. Therefore, more advanced and efficient ML techniques need to be introduced or developed to overcome the big data challenges for better interactions.

  • Although ML techniques have been widely applied to optimisation in the literature, very few of them have covered the theoretical analysis of proposed approaches. They mainly focus on numerical experiments, which cannot guarantee the convergence and stability of the proposed algorithms. Therefore, the theoretical analysis of algorithm’s convergence and stability need to be addressed as a future research direction.

  • Many approaches in meta-learning can be trained end-to-end by embedding a high-level ML algorithm into a low-level one. However, when ML algorithms are introduced into optimisation methods to solve a combinatorial optimisation problem, most approaches separate the learning and optimisation process, because the decision variables of combinatorial optimisation are discrete, which is not consistent with the continuous parameters of most ML algorithms. They usually perform ML algorithms to extract knowledge first and then use it to do optimisation tasks. Such an approach is less efficient compared to end-to-end ones. It would be beneficial if we can train an ML model and solve an optimisation problem jointly. One potential way to do that is treating the parameters of an ML model as the decision variables of an optimisation problem and optimise them jointly in an end-to-end manner, so that an ML model can be trained during the optimisation process.

  • The interactions that use ML techniques to improve optimisation mostly rely on conventional ML algorithms, such as NNs, SVMs and k-means etc. As more and more advanced ML methods have emerged recently, such as deep learning and the techniques that learn from very few examples, we could introduce these advanced techniques to further improve optimisation algorithms. In addition, algorithm selection is another issue when we introduce ML algorithms into optimisation since there are plenty of ML algorithms can be used. How to choose a suitable ML method for a specific optimisation problem need to be tackled.

  • There are various ML algorithms which are embedded with human-designed algorithm components. Due to the higher learning ability of ML algorithms, they have been widely applied to automatically learning ML algorithm components that are designed by experts. The learned algorithm components achieved better performance than human-designed ones, such as the activation function of NNs, the splitting criterion of DTs, initialisation of an algorithm, optimisation algorithm for training a model, selection strategy for active learning, etc. A future area of research would be using meta-learning techniques to automatically learn those components, which is still underdeveloped.