1 Introduction

Hyperparameter optimization and algorithm selection are ubiquitous tasks in machine learning that usually have to be conducted for every individual research task and real-world application. Choosing the right model and hyperparameter configuration can often turn very poor predictions into state-of-the-art performance.

Hyperparameter optimization tries to find the hyperparameter configuration that minimizes a certain black-box function \(y(x)\), which is commonly the cross-validation loss of a model learned on some training data using the hyperparameter configuration \(x\). Despite its omnipresence, hyperparameter optimization is usually a difficult task, as the optimization cannot be carried out by minimizing a loss function with nice mathematical properties such as differentiability or convexity. Consider, for example, the number of hidden layers and hidden neurons of a simple feed-forward neural network. When learning the neural network, both of these hyperparameters have to be set, as the final prediction performance depends heavily on the correct setting of the model complexity. If the model is too complex (i.e. has many layers and neurons), it will very likely overfit the training data or get stuck in a local minimum. However, if the model complexity is not high enough, it might underfit the training data and miss vital information. Thus, the correct setting of these hyperparameters is vital for any serious application of machine learning; however, as mentioned above, the difficulty is that we have no loss function which we can optimize directly to learn the best choices for these hyperparameters.

The majority of efforts to solve this problem are based on the sequential model-based optimization (SMBO) framework, which has its roots in the area of black-box optimization. SMBO is an iterative approach which trains a surrogate model \(\varPsi \) on the observed meta-level instances of \(y\). Then, it can be used in order to predict the performance of an algorithm on a specific data set given the hyperparameter settings and data set descriptors. We use this method to find promising hyperparameter configurations, evaluate \(y\) for these configurations, and finally retrain \(\varPsi \). The overall process is repeated T many times, and in the end, we take the best hyperparameter configuration found so far. In comparison to exhaustive search methods, SMBO tries to adaptively steer the optimization into promising regions in the hyperparameter space.

More recently, SMBO has been used in conjunction with meta-learning in order to create a “meta-learning system”. According to Lemke et al. (2015), a meta-learning system must fulfill two properties:

  1. A meta-learning system must include a learning subsystem which adapts with experience.

  2. Experience is gained by exploiting meta-knowledge extracted

     (a) in a previous learning episode on a single data set, and/or

     (b) from different domains or problems.

Our contributions to SMBO lead to a system that fulfills all of these requirements. Our system adapts with experience by updating the surrogate model which represents the meta-knowledge. Furthermore, we exploit meta-knowledge extracted on the new data set and from previous problems.

Throughout this paper, the term meta-data refers to observations of the performances of different hyperparameter configurations evaluated on various data sets. The inclusion of such meta-data in SMBO-based hyperparameter optimization can be accomplished in different ways, with the most commonly used approach being to pretrain the surrogate model on these observations. Such surrogate models, which are pretrained on meta-data, are called “transfer surrogate models” because they are capable of using the meta-data to transfer knowledge from previously seen data sets to new ones.

Another approach learns a particular initialization, i.e. a set of hyperparameter configurations with which the search is started because they are likely to work well on the new data set. This approach has been shown to produce better results, which seems plausible given the nature of the problem.

Usually, researchers gain more experience in choosing well-performing hyperparameter configurations for their models by running these models on a variety of data sets and hyperparameters. Consequently, if a new data set arrives, hyperparameter optimization guided by an expert will incorporate this knowledge when choosing which initial configurations to test. This intuition makes it clear that incorporating hyperparameter performance on other data sets into the surrogate model used within SMBO helps to speed up the hyperparameter optimization and to steer it towards regions where we expect to find good hyperparameter configurations.

A number of publications show that using meta-data is beneficial, including but not limited to Bardenet et al. (2013), Yogatama and Mann (2014), Swersky et al. (2013), Schilling et al. (2015), Wistuba et al. (2015, 2016), Feurer et al. (2015).

In this paper, we integrate two pieces of work, Schilling et al. (2016), Wistuba et al. (2016), which both learn individual Gaussian processes on subsets of the meta-data. Each Gaussian process is learned on all the observed performances of a single data set, i.e. the hyperparameter configurations and their corresponding performances on this specific data set. Finally, all processes are then combined into a single surrogate model. In this way, we can achieve scalability to large amounts of meta-data because the training effort of the surrogate model is no longer cubic in the number of data sets used to create the meta-data. We compare both papers and extend the ideas therein by learning a transfer acquisition function using an ensemble of Gaussian processes. The resulting approach shows better empirical performance than the state-of-the-art for hyperparameter optimization while maintaining the scalability properties of a simple product of experts model, as we demonstrate in a set of thoroughly conducted experiments. We then also show that the acquisition function is an elegant way of dealing with different performance scales for different data sets and a decaying use of meta-data.

Our contributions in this work are:

  • The unification of our previous work (Schilling et al. 2016; Wistuba et al. 2016) as the scalable Gaussian process transfer surrogate framework.

  • Identification of typical problems faced when using meta-data in surrogate models.

  • Proposal of the transfer acquisition function framework, which uses meta-learning in the acquisition function instead of the surrogate model to overcome these issues.

  • Extensive empirical evaluations comparing all approaches, including previous and new methods, with additional discussion.

This work is structured as follows: in the next section we review the related work that has been done on SMBO and its combination with meta-data. In Sect. 3 we formally define the problem of hyperparameter optimization. We then present the detailed definition of SMBO and Gaussian processes (often used as an important component of SMBO) in Sects. 4 and 5. In Sect. 6, we present SGPT, our scalable transfer surrogate framework which unifies our previous work (Schilling et al. 2016; Wistuba et al. 2016). Section 7 then contains our extension of this work by using the meta-data within the acquisition function of SMBO via our proposed “transfer acquisition function” (TAF). We evaluate our proposed method and compare it to current state-of-the-art methods in Sect. 8, and then conclude the paper in Sect. 9.

2 Related work

The algorithm selection problem is an important problem in many domains and was first introduced in the 1970s (Rice 1976). Application domains include hard combinatorial problems such as SAT (Xu et al. 2008) and TSP (Kanda et al. 2012), software design (Cavazos and O’Boyle 2006), numerical optimization (Kamel et al. 1993), optimization (Nareyek 2004) and many more. In our work, we limit ourselves to the domain of machine learning although there are generalizations that subsume hyperparameter optimization in the broad category of algorithm configuration (Eggensperger et al. 2018). Thus, we investigate both the problem of finding the right algorithm as well as finding suitable hyperparameters for this algorithm.

Existing work can be grouped with respect to different properties. One way to group previous work is by methodological approach, where we have on the one hand approaches that search the hyperparameter space exhaustively and on the other hand methods that use black-box optimization techniques such as sequential model-based optimization (Jones et al. 1998). Other methods under this grouping make use of search algorithms from artificial intelligence such as genetic algorithms.

One can also distinguish between approaches based on those which use meta-data and those that do not. Meta-learning permits us to transfer past experiences with particular algorithms and hyperparameter configurations from one data set to another. Plenty of work has been done in that area and can be found in some recently published books and surveys: Brazdil et al. (2009), Vilalta and Drissi (2002), Lemke et al. (2015).

In the next section we discuss further related work as classified by methodology.

2.1 Exhaustive search methods

The most widely used method to optimize hyperparameters in machine learning is the “grid search”. For a grid search, we choose a finite subset of hyperparameter configurations and evaluate them all in a brute force manner, ideally within a parallel computing environment. In some cases, grid search is manually steered by choosing a coarse grid at first to find regions where hyperparameter performance is generally good, with such regions being investigated more closely using a fine-grained grid. This mixture of grid search and manual search techniques appears in many publications: for instance Hinton (2010), Larochelle et al. (2007). The downsides of grid search are rather obvious: if the dimension of the hyperparameter space is large and no prior knowledge about hyperparameter performance is given, grid search requires many evaluations to deliver good results, often at the expense of many useless computations.

Bergstra and Bengio (2012) introduce Random Search, which essentially replaces the fixed set of points by sampling points from some probability distribution. This has two main advantages: first, one may encode prior beliefs over the hyperparameter space by defining the probability distributions to draw from; secondly, random search works better in scenarios of low effective dimensionality, which is the case if hyperparameter performance stays almost constant in one dimension of the hyperparameter space while changing drastically in another.
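
As a minimal sketch of random search for a hypothetical RBF-SVM: the log-uniform priors, the ranges, and the budget of 50 trials below are assumptions chosen for illustration only, not choices taken from the cited work.

    import math
    import random

    def sample_configuration(rng):
        # Assumed log-uniform priors for C and gamma, a common choice because
        # performance often varies on a multiplicative scale.
        return {
            "C": math.exp(rng.uniform(math.log(2 ** -5), math.log(2 ** 6))),
            "gamma": math.exp(rng.uniform(math.log(1e-4), math.log(1e3))),
        }

    def random_search(evaluate, n_trials=50, seed=0):
        # Evaluate n_trials randomly drawn configurations and return the best.
        rng = random.Random(seed)
        best_x, best_y = None, float("inf")
        for _ in range(n_trials):
            x = sample_configuration(rng)
            y = evaluate(x)  # e.g. cross-validation error of the SVM
            if y < best_y:
                best_x, best_y = x, y
        return best_x, best_y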

2.2 Model specific methods

Many methods for hyperparameter optimization exist which have been designed for a specific algorithm. These methods are usually based on genetic algorithms (Friedrichs and Igel 2005a; de Souza et al. 2006), although some are deterministic (Keerthi et al. 2007). Beyond this, there are many other methods for specific scenarios, including but not limited to methods for general regression and time-series models (McQuarrie and Tsai 1998), for regression when the sample size is small (Chapelle et al. 2002), for Bayesian topical trend analysis (Masada et al. 2009) and for log-linear models (Foo et al. 2007). Moreover, Schneider et al. deal with hyperparameter learning in probabilistic prototype-based models (Schneider et al. 2010), Seeger employs hyperparameter learning for large scale hierarchical kernel methods (Seeger 2006), and Kapoor et al. are concerned with optimizing hyperparameters for graph-based semi-supervised classification models (Kapoor et al. 2005).

The major limitation of all of these methods is that they are specifically tailored to one particular model and only work well in certain scenarios. This is a drawback that more widely applicable SMBO-based methods can alleviate.

2.3 Sequential model-based optimization

In order to overcome the issues of exhaustive search methods or model-specific methods, black-box optimization has been used in the context of sequential model-based optimization (SMBO) (Jones et al. 1998). SMBO learns a surrogate model on the observed hyperparameter performance, which is then queried to provide predictions for unobserved hyperparameter configurations. The predicted performance and the uncertainty of the surrogate model are then used within the expected improvement acquisition function to choose which of the many unobserved hyperparameter configurations to test next. The main strand of research along these lines has been devoted to finding suitable surrogate models, for example a Gaussian process (Rasmussen et al. 2005) as used in the so-called Spearmint method (Snoek et al. 2012). Other surrogate models, such as the random forests proposed in SMAC (Hutter et al. 2011), have also been investigated. While this earlier work focuses on optimizing hyperparameters only, Auto-WEKA (Thornton et al. 2013) has shown that algorithm selection can be considered in a similar fashion to hyperparameters, and so the existing work is capable of choosing both algorithms and hyperparameters in combination.

Additionally, research on including meta-data, i.e. observations of hyperparameter performance on other data sets, has been gaining a lot of attention. Bardenet et al. (2013) use \(\text {SVM}^{\text {RANK}}\) as a surrogate and thus consider the hyperparameter selection as a ranking task rather than a regression task. In this way, they are able to overcome the issues of different data sets having different performance levels. To estimate uncertainties, they train a Gaussian process on the output of the \(\text {SVM}^{\text {RANK}}\), in order to compute expected improvement. Moreover, Gaussian processes with a meta-kernel have been proposed (Swersky et al. 2013; Yogatama and Mann 2014). Finally, neural networks have been used as surrogate models in combination with a factorization machine (Rendle 2010) in the input layer (Schilling et al. 2015).

2.4 Learning curve predictions

The idea of using meta-data in SMBO is to find better performing prediction models within a smaller fraction of time. Another idea, applicable to models that are learned in an iterative fashion, is to predict the learning curve, i.e. the performance of the resulting model after a number of epochs. For example, Domhan et al. (2015) predict the performance of a hyperparameter configuration based on the partially observed learning curve after a few iterations. If the final performance is likely to be worse than the current best configuration, the process is stopped, the configuration is discarded, and the optimization continues with a different configuration. Swersky et al. (2014) propose a similar approach which never discards a configuration, but instead learns models for various hyperparameter configurations at the same time and switches from one learning process to another if it turns out to be more promising.

There are other hyperparameter optimization approaches not related to SMBO that follow the same idea. There are, for example, population-based approaches, such as Successive Halving (Jamieson and Talwalkar 2016) and Hyperband (Li et al. 2016), which choose a set of hyperparameter configurations at random and incrementally train the learning algorithms in parallel. From time to time, the weakest configurations are discarded. The Racing Algorithm by Maron and Moore (1997) follows a similar principle but is focused on lazy learners, where the expensive part is testing rather than training.

2.5 Meta-initializations

There are several strategies to find a set of initial configurations for hyperparameter optimization methods. Reif et al. (2012) propose to initialize a hyperparameter search based on genetic algorithms with the best hyperparameters on other data sets, where the similarity of data sets is defined through meta-features. Feurer et al. (2014) propose the same idea for SMBO which was later extended (Feurer et al. 2015; Wistuba et al. 2015). The drawback of these approaches is that they do not consider whether the initial hyperparameter configurations are very close to each other and therefore may waste computation time by choosing too similar hyperparameters initially. Thus, one of our previous works proposes to learn a set of initial hyperparameter configurations by optimizing a meta-loss that maximizes the overall improvement on the meta-data (Wistuba et al. 2015).

2.6 Meta-features

Meta-features are descriptive characteristics of a data set and thus an essential component of all traditional meta-learning methods that learn across problems. In this work, we use pairwise comparisons of the performance of two hyperparameter configurations on one data set compared to another. This is a very special instance of landmarkers (Pfahringer et al. 2000). Landmark features are created by applying very fast machine learning algorithms (e.g. decision stumps, linear regression) to the data, with the performance added as a meta-feature. In contrast, our approach uses only the performance of algorithms and hyperparameter configurations which we have evaluated during our optimization process, and thus no additional time is spent estimating these landmarkers. This idea has already been employed by others (Leite et al. 2012; Sun and Pfahringer 2013; Wistuba et al. 2015). In contrast to their work, we propose a way of using these meta-features also in cases with continuous hyperparameters, since for continuous hyperparameters it is very unlikely that we have seen the same hyperparameter configurations for all data sets. The approach of pairwise comparisons proposed in the literature works only if we either only want to find the best algorithm and ignore the hyperparameters (Sun and Pfahringer 2013) or discretize the hyperparameters (Leite et al. 2012; Wistuba et al. 2015). We overcome this problem by predicting the performance of a hyperparameter configuration if it is not part of our meta-data set.

2.7 Other approaches

Furthermore, there also exist strategies to optimize hyperparameters that are based on optimization techniques from artificial intelligence such as tabu search (Cawley 2001), particle swarm optimization (Guo et al. 2008) and evolutionary algorithms (Friedrichs and Igel 2005b). Since none of these strategies use information from previous experiments, meta-data can be added analogously to the SMBO counterpart by using it in the initialization (Gomes et al. 2012; Reif et al. 2012). Another interesting recent proposition is the use of bandit optimization techniques for automatic machine learning (Hoffman et al. 2014).

3 Problem definition

In this section we will formally define the problem of hyperparameter optimization and introduce the notation that will be used in the remainder of the paper. We will follow the notation that was introduced by Bergstra and Bengio (2012), but extend it to account for a more general problem of hyperparameter optimization by also including model choice and other tasks.

Let \({\mathcal {D}}\) denote the space of all data sets and let \(\mathcal {M}\) denote the space of all models. Thus, \({\mathcal {D}}\) consists of all possible data sets, where instances might have a vector representation, but can also be images, time-series, or similar representations.

We then let \(\mathcal {M}\) define the space of all machine learning models. This includes all parametric models, as well as trees and other models with parameters and a specific structure, such as neural networks. The configuration space \(\mathcal {X}\) encodes the choice of algorithms and hyperparameters. Then, let us define a general algorithm \(\mathcal {A}\) as a mapping

$$\begin{aligned} \mathcal {A}: \mathcal {X}\times {\mathcal {D}} \longrightarrow \mathcal {M}\end{aligned}$$
(1)

that takes as input a choice of hyperparameters \(x \in \mathcal {X}\) and a data set \(D \in {\mathcal {D}}\) to then deliver a model \(M \in \mathcal {M}\) learned on the training partition \({D}^{\text {train}}\) of data set D. In this formulation, the model choice as well as the hyperparameter setting are both combined in the choice of x. Additionally, preprocessing tasks, the choice of optimization technique, and other such settings can be treated as hyperparameters. While allowing such treatment may arguably not be the best option, we follow the lead of Thornton et al. (2013). We assume that \(\mathcal {X}\) has a fixed dimensionality p for notational purposes, although with new models and learning algorithms being researched every day, the dimensionality of \(\mathcal {X}\) is constantly growing. Given a concrete setting of x, \(\mathcal {A}\) searches through the model space \(\mathcal {M}\) to find a model that minimizes the empirical loss \({\mathcal {L}}\) on the training partition of data D, \({D}^{\text {train}}\), considering a regularization \({\mathcal {R}}\) to avoid overfitting:

$$\begin{aligned} {\mathcal {A}}\left( x,D\right) =\mathop {\hbox {arg min}}\limits _{M\in \mathcal {M}}{\mathcal {L}}\left( M,{D}^{\text {train}}\right) +{\mathcal {R}}\left( M\right) . \end{aligned}$$
(2)

Now we can define the task of hyperparameter optimization as finding the configuration \(x^{\star }\), that yields a model learned on the training partition which minimizes the loss on the validation partition \({D}^{\text {valid}}\) of the data set:

$$\begin{aligned} x^{\star }=\mathop {\hbox {arg min}}\limits _{x\in \mathcal {X}}\ {\mathcal {L}}\left( {\mathcal {A}}\left( x,{D}^{\text {train}}\right) ,{D}^{\text {valid}}\right) = \mathop {\hbox {arg min}}\limits _{x\in \mathcal {X}}\ y\left( x,D\right) . \end{aligned}$$
(3)

Note that, to shorten the notation a little bit, we introduce the function \(y\) instead of the more cumbersome expression in the middle. In the literature, the contours of \(y\) are also called response surface (Czogiel et al. 2006), a term which we will also use throughout the paper. Figure 1 shows a response surface of an RBF-SVM on the well-known Iris data set.

Fig. 1 Response surface of an RBF-SVM on the well-known Iris data set. Hyperparameters are the cost of slack C and the kernel width \(\gamma \)

As \(y\) is an unknown black-box function, it cannot be minimized using standard optimization techniques. Usually, \(y\) is optimized by doing a grid search, which exhaustively searches through \(\mathcal {X}\) for an optimum. Grid search is conducted by defining a finite subset \(G \subset \mathcal {X}\) which is usually the Cartesian product of a few points in each dimension of \(\mathcal {X}\), \(\mathcal {X}_{i}\),

$$\begin{aligned} G=\prod _{i=1}^p G_i \qquad \quad G_i \subset \mathcal {X}_i \quad |G_i |< \infty \end{aligned}$$
(4)

and then evaluating \(y\) for all of these points. If learning the model takes a lot of time, this exhaustive approach can be a very time-consuming process that results in a lot of useless computations, as grid search is usually not conducted in an adaptive way, where observations of \(y\) are taken into account to design the grid. In recent years, as hyperparameter optimization has become more and more of an issue in machine learning, researchers have turned to the black-box optimization techniques described next.
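
Before moving on, Eq. 4 can be made concrete with a small sketch that builds G as the Cartesian product of per-dimension candidate sets for a hypothetical RBF-SVM; the candidate values below are assumptions for illustration only.

    from itertools import product

    # Finite candidate sets G_i per hyperparameter dimension (assumed values).
    G_C = [2 ** e for e in range(-5, 7)]        # cost of slack C
    G_gamma = [1e-4, 1e-3, 1e-2, 0.1, 1, 10]    # RBF kernel width

    # G = G_C x G_gamma, to be evaluated exhaustively (Eq. 4).
    grid = [{"C": C, "gamma": gamma} for C, gamma in product(G_C, G_gamma)]

    def grid_search(evaluate):
        # Evaluate y for every grid point and return the best configuration.
        return min(grid, key=evaluate)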

4 Sequential model-based optimization

Sequential model-based optimization (SMBO) is a technique that iteratively fits a so-called surrogate model, henceforth denoted by \(\varPsi \), on the observed values of \(y\) such that \(\varPsi \approx y\). For brevity, we will denote the set

$$\begin{aligned} \mathcal {H}= \{ (x_1,y(x_1,D)),...,(x_t,y(x_t,D)) \} \end{aligned}$$
(5)

of t many observed values as the observation history for data set D. Having estimated a surrogate model, it is queried for not yet observed hyperparameter configurations. Its output is evaluated by an acquisition function a, which chooses the next hyperparameter configuration to test; this choice naturally depends on the surrogate prediction as well as on \(y^{\text {min}}\), the best value found so far. This process is repeated until either a certain number of trials has been conducted or the cross-validation performance has achieved an adequate level. An overview of the whole process can be seen in Algorithm 1.

Algorithm 1 Sequential model-based optimization (SMBO)
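
The following is a minimal sketch of the SMBO loop of Algorithm 1. The surrogate object with fit and predict (posterior mean, standard deviation) methods, the finite candidate set, and the acquisition signature are all assumptions of this sketch, not part of the original algorithm.

    def smbo(y, candidates, surrogate, acquisition, n_trials):
        # Assumed interfaces: surrogate.fit(X, Y), surrogate.predict(x) -> (mu, sigma),
        # acquisition(mu, sigma, y_min) -> score; candidates is a list of hashable
        # or comparable configurations (e.g. tuples).
        history = []                                   # observation history H (Eq. 5)
        for _ in range(n_trials):
            if history:
                X = [x for x, _ in history]
                Y = [v for _, v in history]
                surrogate.fit(X, Y)                    # refit Psi on all observations
                y_min = min(Y)                         # best value found so far
                untried = [x for x in candidates if x not in X]
                # Pick the candidate maximizing the acquisition function a.
                x_next = max(untried,
                             key=lambda x: acquisition(*surrogate.predict(x), y_min))
            else:
                x_next = candidates[0]                 # no model yet: start anywhere
            history.append((x_next, y(x_next)))        # expensive evaluation of y
        return min(history, key=lambda h: h[1])        # best configuration found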

Many different surrogate models have been proposed in recent years, for instance Gaussian processes in different variations (Bardenet et al. 2013; Snoek et al. 2012; Swersky et al. 2013; Yogatama and Mann 2014). Additionally, random forests (Hutter et al. 2011) and neural networks (Schilling et al. 2015) have been employed. What all these surrogate models have in common is that they are relatively easy and fast to evaluate, at least in comparison to evaluating \(y\), while still being able to learn complex functions. Using linear models to estimate a response surface such as the one in Fig. 1, for example, would evidently lead to poor results. Additionally, a growing observation history enables the surrogate model to better approximate the true response surface of \(y\). One key ingredient of surrogate models in the SMBO framework is that, besides predicting the validation performance accurately, they also have to give an estimation of their uncertainty, i.e. predict a probability distribution instead of a single value. In our work, the surrogate model \(\varPsi \) will predict for each configuration \(x\) a posterior mean \(\mu \left( \varPsi \left( x\right) \right) \) and a standard deviation \(\sigma \left( \varPsi \left( x\right) \right) \).

Once the surrogate is fitted to the current observation history, we can query it for new hyperparameter configurations and then decide which hyperparameter configuration to choose next; however, this decision is based on an acquisition function a, which helps in balancing both exploration and exploitation throughout the hyperparameter optimization. Exploration, on the one hand, is the process of exploring the hyperparameter space, i.e. going into regions where we have next to no observations of \(y\). Naturally, this is what we want to do at the start of the SMBO procedure. Exploitation, on the other hand, is conducted if we have tested enough configurations and trust the surrogate model's predictions. At this stage, we expect to only test new configurations in the vicinity of the currently best one. It is clear, however, that the search will result in a bad local minimum if no exploration and only exploitation is done. It is easiest to understand how exploitation and exploration are achieved through the acquisition function by having a look at the GP-LCB acquisition function (Srinivas et al. 2010):

$$\begin{aligned} a_{\text {GP-LCB}}\left( x\right) =-\mu \left( \varPsi \left( x\right) \right) +\beta _{t}\sigma \left( \varPsi \left( x\right) \right) . \end{aligned}$$
(6)

We fix a trade-off \(\beta _{t}\) between the predicted value and the uncertainty. For higher \(\beta _{t}\), we prefer exploration over exploitation because we give \(x\) with higher uncertainty a higher weight and vice versa. For a fixed \(\beta _{t}\), good candidates \(x\) are those with very small predicted posterior mean values or those with a high uncertainty. If the score is dominated by the posterior mean, we are dealing with an exploitation scenario, if it is dominated by the uncertainty, we are dealing with exploration.
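
A minimal sketch of Eq. 6; the default value for \(\beta _{t}\) below is an arbitrary placeholder, not a recommended setting.

    def gp_lcb(mu, sigma, beta_t=2.0):
        # GP-LCB score of Eq. 6; larger beta_t favors exploration (high
        # predictive uncertainty), smaller beta_t favors exploitation
        # (low predicted loss). beta_t=2.0 is only a placeholder here.
        return -mu + beta_t * sigma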

While GP-LCB is perhaps the acquisition function that best illustrates how exploration and exploitation are balanced, we use the expected improvement in our experiments, which achieves a similar effect but works slightly differently. At the end of this section, we show an example of how a balanced trade-off between exploration and exploitation is achieved by expected improvement.

As mentioned earlier, a very common choice of acquisition function is the expected improvement (EI), which was first used by Jones et al. (1998). The improvement of a new hyperparameter configuration \(x\) can be defined as

$$\begin{aligned} I\left( x\right) =\max \left\{ y^{\text {min}}-{\hat{Y}}\left( x\right) ,0\right\} \end{aligned}$$
(7)

where \({\hat{Y}}\left( x\right) \) is a random variable that covers our current belief over the performance of \(x\), i.e. \({\hat{Y}}\left( x\right) \) is actually the prediction of our surrogate model \(\varPsi \). Then, the expected improvement is simply the expected value of the improvement function given our observation history \(\mathcal {H}\):

$$\begin{aligned} \mathrm {E}\left[ I\left( x\right) \right] =\mathrm {E}\left[ \max \left\{ y^{\text {min}}-{\hat{Y}}\left( x\right) ,0\right\} \ \Big |\ \mathcal {H}\right] . \end{aligned}$$
(8)

If \({\hat{Y}}\left( x\right) \) follows a Gaussian distribution with mean and variance representing the mean and variance of the surrogate model,

$$\begin{aligned} {\hat{Y}}\left( x\right) \sim {\mathcal {N}} \left( \mu (\varPsi (x)),\sigma ^2(\varPsi (x))\right) , \end{aligned}$$
(9)

the expected improvement can be computed analytically. In order to do so, let us first define Z as the best performance \(y^{\text {min}}\) standardized by our currently estimated distribution

$$\begin{aligned} Z=\frac{y^{\text {min}}-\mu \left( \varPsi \left( x\right) \right) }{\sigma \left( \varPsi \left( x\right) \right) } . \end{aligned}$$
(10)

Then, \(\mathrm {E}\left[ I\left( x\right) \right] \) can be computed as follows

$$\begin{aligned} \mathrm {E}\left[ I\left( x\right) \right] ={\left\{ \begin{array}{ll} \sigma \left( \varPsi \left( x\right) \right) \left( Z\cdot \varPhi \left( Z\right) +\phi \left( Z\right) \right) &{} \text {if }\sigma ^2\left( \varPsi \left( x\right) \right) >0\\ 0 &{} \text {otherwise} , \end{array}\right. } \end{aligned}$$
(11)

where \(\phi \left( \cdot \right) \) and \(\varPhi \left( \cdot \right) \) denote the Gaussian density and the cumulative distribution function of a standard Gaussian distribution, respectively.
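
A short sketch of the closed-form expected improvement of Eqs. 10 and 11, assuming scipy is available; mu and sigma are the posterior mean and standard deviation of the surrogate model at x.

    from scipy.stats import norm

    def expected_improvement(mu, sigma, y_min):
        # Closed-form EI (Eqs. 10-11) for a Gaussian predictive distribution
        # with mean mu and standard deviation sigma.
        if sigma <= 0.0:
            return 0.0
        z = (y_min - mu) / sigma                          # Eq. 10
        return sigma * (z * norm.cdf(z) + norm.pdf(z))    # Eq. 11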

An overview of how SMBO works can be seen in Fig. 2, where a Gaussian process (black solid line with uncertainty indicated by dashed gray lines) is initially learned on three data points to approximate the ground truth (yellow solid line). The blue line at the bottom indicates the score of the acquisition function, and the cross indicates the maximum of the acquisition function, which is the argument that will be evaluated next by SMBO. In the first three steps, SMBO is doing exploration: the maximum of the acquisition function lies at arguments where we have a very high uncertainty about our prediction. In the last step, uncertainty is low and the maximum of the acquisition function is strongly determined by the posterior mean, and thus we are doing exploitation.

Fig. 2 Overview of SMBO using four trials. In this simple example, we assume that the dimensionality of the hyperparameter configuration space \(\mathcal {X}\) is one. The plots show the mapping from the hyperparameter configuration (x-axis) to the corresponding loss (y-axis). We start with three observations. Sequentially, different hyperparameter configurations are evaluated, more knowledge about the function is gathered, and we slowly get closer to the global minimum

4.1 SMBO using meta-information

Using meta-information, i.e. information about hyperparameter performance on other data sets, during the process of hyperparameter optimization has attracted a lot of interest in recent years. The motivation is very natural: the more experiments we run on diverse data sets, the better feeling we get for hyperparameters and how they affect the final validation performance. Thus, an experienced researcher usually starts the search for good hyperparameters in a subspace of \(\mathcal {X}\) where improvements are likely. In order to use meta-information in SMBO, we now denote the observation history as

$$\begin{aligned} \mathcal {H}= \left\{ \left( x_1,y(x_1,D_1)\right) ,\ldots ,\left( x_{K_1},y(x_{K_1},D_1)\right) ,\ldots ,\left( x_{K_M},y(x_{K_M},D_M)\right) \right\} . \end{aligned}$$
(12)

Note that the hyperparameter configurations evaluated on diverse data sets have to include identical settings.

The majority of work has focused on coming up with surrogate models that can effectively integrate the meta-knowledge, for example by being trained on the observation history prior to starting SMBO on a new data set. In order to do so, we have to add so-called meta-features, which describe the characteristics of a data set, as otherwise the surrogate model would be unable to differentiate between instances if the same hyperparameter configuration has been used. Compared to the work in traditional meta-learning, only a few (three or four) meta-features have been used for surrogate models (Bardenet et al. 2013; Yogatama and Mann 2014; Schilling et al. 2015). We extended the number of meta-features in our previous work (Schilling et al. 2016; Wistuba et al. 2016) and will use the same meta-features in this work for all methods. We list these meta-features in Table 1. To simplify the notation, we will from now on assume that by writing D as a variable of \(y\), the meta-features of D are automatically included.

Several surrogate models are explicitly designed for handling meta-data (Bardenet et al. 2013; Yogatama and Mann 2014; Schilling et al. 2015). Given the importance of these approaches for this work, we explain how they work and how meta-data is used in detail in the upcoming subsections.

4.1.1 Surrogate collaborative tuning (SCoT)

Bardenet et al. (2013) were the first to propose a surrogate model in conjunction with meta-data, showing how to learn a single surrogate model over observations from many data sets. Since the same algorithm applied to different data sets leads to loss values that can differ significantly in scale, they recommend tackling this problem using a ranking model instead of a regression model. Finally, they propose to use \(\text {SVM}^{\text {RANK}}\) with an RBF kernel to learn a ranking of hyperparameter configurations per data set. The ranking itself does not provide uncertainty estimations which are needed for the acquisition function, and thus, Bardenet et al. finally fit a Gaussian process to the ranking in order to provide this.

4.1.2 Gaussian process with multi-kernel learning (MKL-GP)

Yogatama and Mann (2014) propose to learn a Gaussian process using meta-data. To overcome the problem of different scales on different data sets, they propose to standardize the loss per data set by subtracting the mean and scaling it to unit variance. Furthermore, they employ a linear combination of a squared exponential kernel with automatic relevance determination (SE-ARD) for observations on the same data set and a nearest neighbor kernel for modeling similarities between data sets. They define the kernel as

$$\begin{aligned} k_\text {MKL}\left( \left( x_{i},D_{k}\right) ,\left( x_{j},D_{l}\right) \right) =&\alpha \delta \left( D_{k}=D_{l}\right) k_{\text {SE-ARD}}\left( x_{i},x_{j}\right) \nonumber \\&+\left( 1-\alpha \right) \delta \left( D_{l}\in {\mathcal {N}}\left( D_{k}\right) \right) k_{\text {NN}}\left( x_{i},x_{j}\right) \end{aligned}$$
(13)

where the SE-ARD kernel is defined as

$$\begin{aligned} k_{\text {SE-ARD}}\left( x_{i},x_{j}\right) =\exp \left( -\frac{1}{2}\sum _{k=1}^{p}\frac{\left( x_{i,k}-x_{j,k}\right) ^{2}}{\sigma _{k}^{2}}\right) . \end{aligned}$$
(14)

The \(\delta \) function returns one if its predicate is true and zero otherwise. The data set similarity kernel is set to

$$\begin{aligned} k_{\text {NN}}\left( x_{i},x_{j}\right) =1-\frac{1}{B}\left\| x_{i}-x_{j}\right\| , \end{aligned}$$
(15)

where B must be chosen such that \(k_{\text {NN}}\) is always non-negative and \({\mathcal {N}}\left( D\right) \) denotes the set of most similar data sets with respect to a distance function. The distance between two data sets is defined as the Euclidean distance between their meta-features, and is used to determine the neighboring data sets. \(x_{i}\in \mathcal {X}\) is the vectorial representation of the configuration. The tuple \(\left( x_{i},D_{k}\right) \) indicates that \(x_{i}\) has been evaluated on data set \(D_{k}\). The similarity between two tuples \(\left( x_{i},D_{k}\right) \) and \(\left( x_{j},D_{l}\right) \) is the weighted sum of the SE-ARD and the NN kernel. The SE-ARD kernel is only considered if \(x_{i}\) and \(x_{j}\) are evaluated on the same data set. The NN kernel is only considered if the settings are evaluated on very similar data sets.
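The following is a rough sketch of the combined kernel of Eqs. 13-15, assuming precomputed data set neighborhoods; the parameter values (alpha, the length scales, B) are placeholders for illustration, not the values used by Yogatama and Mann (2014).

    import numpy as np

    def k_se_ard(x_i, x_j, length_scales):
        # SE-ARD kernel on hyperparameter vectors (Eq. 14).
        d = (np.asarray(x_i) - np.asarray(x_j)) / np.asarray(length_scales)
        return np.exp(-0.5 * np.sum(d ** 2))

    def k_nn(x_i, x_j, B):
        # Nearest-neighbor kernel on hyperparameter vectors (Eq. 15);
        # B must be large enough to keep the value non-negative.
        return 1.0 - np.linalg.norm(np.asarray(x_i) - np.asarray(x_j)) / B

    def k_mkl(x_i, d_k, x_j, d_l, neighbors, alpha, length_scales, B):
        # Multi-kernel of Eq. 13: the SE-ARD part only applies within the same
        # data set, the NN part only if d_l is among the neighbors of d_k
        # (neighbors is an assumed dict from data set id to a set of ids,
        # determined via Euclidean distances of meta-features).
        same = 1.0 if d_k == d_l else 0.0
        near = 1.0 if d_l in neighbors[d_k] else 0.0
        return (alpha * same * k_se_ard(x_i, x_j, length_scales)
                + (1.0 - alpha) * near * k_nn(x_i, x_j, B))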

4.1.3 Factorized multilayer perceptron (FMLP)

Schilling et al. (2015) proposed to use a modified multilayer perceptron as a surrogate model. Meta-instances are extended by meta-features and data set indicators. Data set indicators are nothing else but \(M+1\) additional binary predictors, one for each data set. The indicator is one if the meta-instance belongs to the corresponding data set, and zero otherwise. The modified multilayer perceptron uses a special activation function in the first layer. Instead of using a linear signal function, the authors propose

$$\begin{aligned} {\text {logistic}}\left( w_{0}+\sum _{i=1}^{p}w_{i}x_{i}+\sum _{i=1}^{p}\sum _{j=i+1}^{p}v_{i}^{T}v_{j}x_{i}x_{j}\right) \end{aligned}$$
(16)

where logistic is the logistic function,

$$\begin{aligned} {\text {logistic}}\left( s\right) =\left( 1+e^{-s}\right) ^{-1} , \end{aligned}$$
(17)

and \(V\in {\mathbb {R}}^{p\times k}\) are latent variables. This model is based on factorization machines (Rendle 2010), which are very commonly employed prediction models for recommender systems. The underlying idea is to learn a latent representation for each data set to model similarities between data sets.
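
A minimal sketch of the factorized first-layer signal of Eqs. 16-17. The weight shapes follow the text (w in R^p, V in R^{p x k}); how this unit is embedded into the full network is omitted, and the use of the factorization-machine identity for the pairwise term is an implementation choice of this sketch.

    import numpy as np

    def factorized_unit(x, w0, w, V):
        # First-layer activation of the FMLP (Eq. 16): logistic of a linear
        # term plus pairwise interactions factorized via latent vectors V.
        linear = w0 + np.dot(w, x)
        # sum_{i<j} <v_i, v_j> x_i x_j via the factorization-machine identity,
        # computed in O(p * k).
        xv = V.T @ x                                          # shape (k,)
        pairwise = 0.5 * (np.sum(xv ** 2) - np.sum((V ** 2).T @ (x ** 2)))
        return 1.0 / (1.0 + np.exp(-(linear + pairwise)))     # Eq. 17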

Table 1 The list of meta-features used in our experiments for all methods

5 Gaussian processes

As mentioned earlier in this paper, our main focus is on learning Gaussian product ensembles over different parts of the meta-data. However, we first want to remind the reader of the definition and learning of Gaussian processes, and to discuss their advantages and disadvantages.

Given a training data set consisting of inputs (for us the hyperparameters) \(X = (x_1,...,x_n)\) and their associated outputs \(y = (y(x_1),...,y(x_n))\), a Gaussian process assumes that the labels follow a multivariate Gaussian

$$\begin{aligned} y\sim {\mathcal {N}}(\mu (X),k(X,X)) \end{aligned}$$
(18)

where \(\mu (X)\) is a mean function that will be set to zero without any loss of generality (Rasmussen et al. 2005), and \(k(X,X)\) is a positive semidefinite covariance matrix expressed through a kernel function k that computes the similarity of each pair of instances. A very commonly used kernel function is the squared exponential kernel, which is defined as

$$\begin{aligned} k(x,x') = \exp \left( \frac{- ||x - x' ||^2}{2\sigma _l^2} \right) + \delta (x=x')\sigma _{y}^2 , \end{aligned}$$
(19)

where \(\sigma _l\) is the kernel width and \(\sigma _{y}\) a small noise constant that is applied on the diagonal to ensure numerical stability. Finally, \(\delta \) is the indicator function that returns one if its input is true and zero otherwise.

In order to make predictions with a Gaussian process, assume we have training data \((X,y)\) and a new test instance \(x_{\star }\) where the output is unknown. Then, we are interested in the conditional distribution of \(y_{\star }\) given \(x_{\star }\) and the training data. From the definition of Gaussian processes we know that the old and new targets are jointly Gaussian

$$\begin{aligned} \begin{pmatrix} y\\ y_{\star } \end{pmatrix} \sim {\mathcal {N}} \left( 0, \begin{pmatrix} K &{} k_{\star } \\ k_{\star }^\top &{} k_{\star \star }\end{pmatrix} \right) , \end{aligned}$$
(20)

where \(K := k(X,X)\), \(k_{\star } := k(X,x_{\star })\) and \( k_{\star \star } := k(x_{\star },x_{\star })\) for brevity. Then, the conditional distribution of our test labels is a multivariate Gaussian

$$\begin{aligned} p(y_{\star }\, | \, x_{\star }, X,y,\theta ) = {\mathcal {N}}\left( \mu (x_{\star }),\sigma ^2(x_{\star })\right) \end{aligned}$$
(21)

with kernel hyperparameters \(\theta \) and mean and variance (Rasmussen et al. 2005)

$$\begin{aligned} \mathrm {E}\left[ y_{\star }\right]&= \mu (x_{\star }) = k_{\star }^\top K^{-1} y \end{aligned}$$
(22)
$$\begin{aligned} {\text {Var}}\left[ y_{\star }\right]&= \sigma ^2(x_{\star }) = k_{\star \star } - k_{\star }^\top K^{-1} k_{\star } . \end{aligned}$$
(23)

Note that the computational expense lies in inverting the kernel matrix K of the training data which has dimensionality \(n \times n\). However, once we have estimated the inverse, we can easily predict means and variances for every input that we are interested in.
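
A compact sketch of GP prediction with the squared exponential kernel (Eqs. 19, 22, 23); the length scale and noise values are placeholders, and a linear solve is used instead of forming the explicit inverse of K.

    import numpy as np

    def se_kernel(A, B, length_scale=1.0):
        # Squared exponential kernel matrix between the rows of A and B
        # (Eq. 19, without the noise term).
        sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-sq_dist / (2.0 * length_scale ** 2))

    def gp_predict(X, y, x_star, length_scale=1.0, noise=1e-6):
        # Posterior mean (Eq. 22) and variance (Eq. 23) at a single test point;
        # noise is added on the diagonal of K for numerical stability.
        K = se_kernel(X, X, length_scale) + noise * np.eye(len(X))
        k_star = se_kernel(X, x_star[None, :], length_scale).ravel()
        k_star_star = 1.0 + noise            # k(x_star, x_star) for the SE kernel
        mean = k_star @ np.linalg.solve(K, y)
        var = k_star_star - k_star @ np.linalg.solve(K, k_star)
        return mean, var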

Another nice aspect of Gaussian processes is that they are hyperparameter free, as the kernel hyperparameters \(\theta \) can effectively be learned by maximizing their marginal likelihood (Rasmussen et al. 2005) which is given by

$$\begin{aligned} \log p(y\, | \, X,\theta ) = -\frac{1}{2} y^\top K^{-1} y-\frac{1}{2} \log |K |. \end{aligned}$$
(24)

The above term can be maximized using standard optimization techniques such as gradient ascent. As we will optimize \(\theta \) on the meta-data prior to starting the hyperparameter optimization for the new data set, we drop them in order to make the notation more compact.

From the nature of SMBO, our observation history is incremented by one additional instance with each SMBO trial that we undertake. For the kernel matrix, this means only one additional column and row, consisting of the kernel function evaluated for the new point and all the old points. However, after adding the new observation, we have to invert K again.

To speed up the overall inversion process, we can use the Cholesky decomposition on K, where K is decomposed as a product of triangular matrices:

$$\begin{aligned} K = LL^\top . \end{aligned}$$
(25)

Then, the predicted Gaussian for \(y_{\star }\) resolves to

$$\begin{aligned} p(y_{\star } \, | \, x_{\star },X,y) = {\mathcal {N}}(y_{\star } \, | \, k_{\star }^\top \alpha \, , \, k_{\star \star } - l^\top l) \end{aligned}$$
(26)

with \(\alpha = \text {solve}\left( L^\top , \text {solve}\left( L, y\right) \right) \) and \(l = \text {solve}\left( L, k_{\star }\right) \) for \(\text {solve}\) being the operation of solving a system of equations. As soon as a new instance has to be included into the Gaussian process, we can simply update the matrix L. This is done by setting the new \(L_{\text {new}}\) as

$$\begin{aligned} L_{\text {new}}=\left( \begin{array}{cc} L &{} 0\\ l^\top &{} l_{*} \end{array}\right) \end{aligned}$$
(27)

and setting

$$\begin{aligned} l_{*}=\sqrt{k_{\star \star }-\left\| l\right\| _{2}^{2}+\sigma _{y}^{2}} . \end{aligned}$$
(28)

This way we effectively reduce the computation from \({\mathcal {O}}(n^3)\) to \({\mathcal {O}}(n^2)\) where n is the number of observations in the current observation history, due to L being a lower triangular matrix. However, if we are aiming to include a vast amount of meta-data, learning a Gaussian process will become an issue because the run time is still quadratic. We address this issue in the next section.
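
A sketch of the incremental Cholesky update of Eqs. 25-28 using scipy's triangular solver; the noise constant corresponds to \(\sigma _{y}\) in Eq. 19 and its value here is a placeholder.

    import numpy as np
    from scipy.linalg import solve_triangular

    def cholesky_update(L, k_star, k_star_star, noise=1e-6):
        # Grow the Cholesky factor L of K by one row/column after observing a
        # new point (Eqs. 27-28), in O(n^2) instead of refactorizing in O(n^3).
        l = solve_triangular(L, k_star, lower=True)         # l = solve(L, k_star)
        l_star = np.sqrt(k_star_star - l @ l + noise)        # Eq. 28
        n = L.shape[0]
        L_new = np.zeros((n + 1, n + 1))                     # Eq. 27
        L_new[:n, :n] = L
        L_new[n, :n] = l
        L_new[n, n] = l_star
        return L_new

    def gp_predict_from_cholesky(L, y, k_star, k_star_star):
        # Predictive mean and variance via Eq. 26.
        alpha = solve_triangular(L.T, solve_triangular(L, y, lower=True), lower=False)
        l = solve_triangular(L, k_star, lower=True)
        return k_star @ alpha, k_star_star - l @ l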

6 Scalable Gaussian process transfer surrogate framework

In the previous section we discussed the learning of Gaussian processes, where the most computationally expensive step lies in the inversion of the kernel matrix K, which is of size \(n \times n\) if we are facing n training instances. Given the scenario that we are in possession of large-scale meta-data, learning a Gaussian process becomes infeasible, as inverting K can only be done in \({\mathcal {O}}(n^{3})\) computations. Nevertheless, Gaussian processes are a natural choice as surrogate models for SMBO, as they naturally predict uncertainties and are basically hyperparameter free.

Beyond the computational challenges, learning a Gaussian process on all training instances makes the strong assumption that each training instance and data set are equally important. This issue is usually addressed by adding meta-features which leads to an indirect representation of similarity between data sets and their influence. We want to propose a framework that tackles both issues by making Gaussian processes scalable and making the influence of each data set within the meta-data explicit.

Therefore, in order to still learn Gaussian processes, we propose to subdivide the meta-data into M many individual parts and learn a single Gaussian process independently on each of the parts, including a single, additional Gaussian process for all the new observations that we will see during the SMBO trials. Formally, we divide our meta-data

$$\begin{aligned} X=(X^{(1)},...,X^{(M)})\qquad y=(y^{(1)},...,y^{(M)}), \end{aligned}$$
(29)

in a way where all \(X^{(i)}\) are pairwise disjoint. However, instead of taking an arbitrary subdivision of our meta-data, we simply divide it by the data sets we have already observed. This means, for each data set \(D_i\), we create a subset \(X^{(i)},\,y^{(i)}\) which contains all meta-instances of data set \(D_i\). As a result, we have M Gaussian processes learned, one for each data set, such that for every \(i=1,...M\)

$$\begin{aligned} p_i\left( y_{\star }\,|\, x_{\star }, X^{(i)},y^{(i)}\right) ={\mathcal {N}}\left( \mu _{i}(x_{\star }),\sigma _{i}^{2}(x_{\star })\right) . \end{aligned}$$
(30)

As mentioned earlier, we also learn a Gaussian process for the new observations, which will be updated after every SMBO trial. We will simply use the index \(M+1\) for the target Gaussian process.

Fig. 3 The proposed framework for our scalable transfer surrogate based on Gaussian processes. A Gaussian process is learned per data set and they are finally combined in a weighted sum

Algorithm 2 Scalable Gaussian process transfer surrogate framework (SGPT)

We derive our scalable Gaussian process transfer surrogate framework (SGPT) by combining all \(M+1\) Gaussian processes into a weighted, normalized sum as sketched in Fig. 3. We define the following mean and precision

$$\begin{aligned} \mu \left( x_{\star }\right)= & {} \frac{\sum _{i=1}^{M+1}w_{i}\mu _{i}(x_{\star })}{\sum _{i=1}^{M+1}w_{i}} \end{aligned}$$
(31)
$$\begin{aligned} \sigma ^{-2}(x_{\star })= & {} \sum _{i=1}^{M+1}v_{i}\sigma _{i}^{-2}(x_{\star }) . \end{aligned}$$
(32)

The final framework is summarized in Algorithm 2. It consists of two different parts. The first involves the training of the individual processes, and the second combines the processes for prediction. As mentioned before, training involves dividing the meta-instances into M subsets, one subset for each data set on which we observed evaluations. Thus, every Gaussian process becomes the expert for the respective data set. The prediction uses these experts plus one additional expert that is estimated on the observed performances on the new data set. Based on Eqs. 31 and 32, the mean and uncertainty are estimated. In the following subsections, we will discuss how to derive possible options for choosing w and v, which we introduced in Eqs. 31 and 32. Each version is a possible surrogate model \(\varPsi \) that can be used in SMBO (see Algorithm 1).
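
A sketch of the prediction step of the SGPT framework (Eqs. 31-32). The expert interface (a predict method returning posterior mean and variance) is an assumption of this sketch; the weight vectors w and v are supplied by one of the schemes discussed in the following subsections.

    import numpy as np

    def sgpt_predict(experts, w, v, x_star):
        # Combine the M+1 per-data-set Gaussian process experts into a single
        # prediction: a w-weighted, normalized mean (Eq. 31) and a v-weighted
        # sum of precisions (Eq. 32).
        means, variances = zip(*(e.predict(x_star) for e in experts))
        means, variances = np.asarray(means), np.asarray(variances)
        w, v = np.asarray(w, dtype=float), np.asarray(v, dtype=float)
        mean = np.dot(w, means) / w.sum()          # Eq. 31
        precision = np.dot(v, 1.0 / variances)     # Eq. 32
        return mean, 1.0 / precision               # posterior mean and variance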

6.1 Product of experts

In this section we want to formally derive values for the parameters w and v for the scalable Gaussian process transfer surrogate framework (Algorithm 2). Following our previous work (Schilling et al. 2016), when applying the independence assumption, we can write the joint likelihood in Eq. 21 as a product of individual likelihoods

$$\begin{aligned} p(y_{\star }\,|\, x_{\star }, X,y)=\prod _{i=1}^{M+1} p_{i}\left( y_{\star }\,|\, x_{\star }, X^{(i)},y^{(i)}\right) , \end{aligned}$$
(33)

which is also called a product of experts model and was introduced by Hinton (1999). Additionally, weighting coefficients \(\beta _{i}\) have been proposed for the product of experts model, leading to the generalized product of experts

$$\begin{aligned} p(y_{\star }\,|\, x_{\star }, X,y)=\prod _{i=1}^{M+1}p_{i}^{\beta _i}\left( y_{\star }\,|\, x_{\star }, X^{(i)},y^{(i)}\right) , \end{aligned}$$
(34)

where the initial formulation is obtained by setting all \(\beta _{i}=1\) (Hinton 1999). Usually, the coefficients \(\beta _i\) in the generalized product of experts are chosen to sum up to one.

Computing the product of all these Gaussian densities, we obtain a Gaussian distribution with the following mean and precision:

$$\begin{aligned} \mu (x_{\star })&=\sigma ^{2}\left( x_{\star }\right) \sum _{i=1}^{M+1}\beta _{i}\sigma _{i}^{-2}(x_{\star })\mu _{i}(x_{\star })\end{aligned}$$
(35)
$$\begin{aligned} \sigma ^{-2}\left( x_{\star }\right)&=\sum _{i=1}^{M+1}\beta _{i}\sigma _{i}^{-2}(x_{\star }) . \end{aligned}$$
(36)

Substituting the precision into the formula for the mean, the predicted mean resolves to

$$\begin{aligned} \mu (x_{\star })=\frac{\sum _{i=1}^{M+1}\beta _{i}\sigma _{i}^{-2}(x_{\star })\mu _{i}(x_{\star })}{\sum _{i=1}^{M+1}\beta _{i}\sigma _{i}^{-2}(x_{\star })}, \end{aligned}$$
(37)

which is a sum of means, weighted by the product of \(\beta _{i}\) and the individual precisions. For our experiments, we set

$$\begin{aligned} \beta _i = \frac{1}{M+1} \quad \forall i=1,\ldots ,M+1 , \end{aligned}$$
(38)

which does not influence the predicted mean as the terms cancel out; however, this effectively increases the uncertainty, which the generalized product of experts usually tends to underestimate (Deisenroth and Ng 2015). To sum up, generalized products of experts are an instance of scalable transfer surrogates when setting

$$\begin{aligned} w_{i}= & {} \beta _{i}\sigma _{i}^{-2}\left( x_{\star }\right) \end{aligned}$$
(39)
$$\begin{aligned} v_{i}= & {} \beta _{i} \end{aligned}$$
(40)

as weight parameters in Algorithm 2.
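
A small, self-contained sketch of the generalized product-of-experts weights of Eqs. 38-40; since w depends on the experts' predictive precisions at the query point, the weights have to be recomputed per candidate x. The expert variances are assumed to be supplied by the per-data-set Gaussian processes.

    import numpy as np

    def gpoe_weights(expert_variances):
        # beta_i = 1/(M+1) (Eq. 38), w_i = beta_i * sigma_i^{-2}(x) (Eq. 39),
        # v_i = beta_i (Eq. 40), given the experts' variances at the query point.
        variances = np.asarray(expert_variances, dtype=float)
        beta = np.full(len(variances), 1.0 / len(variances))   # Eq. 38
        w = beta / variances                                    # Eq. 39
        v = beta                                                # Eq. 40
        return w, v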

6.2 Kernel regression

In the previous section, we have derived parameters w and v for Algorithm 2 under the assumption that every data set has equal importance for the task of finding optimal hyperparameter configurations for our new data set \(D_{M+1}\); however, in order to find good hyperparameter configurations on a new data set \(D_{M+1}\), it is intuitive to give more weight to the influence of data sets that have a similar hyperparameter response surface. Hence, setting \(w_{i}\) to larger values for these experts to increase their influence makes a lot of sense. Assuming that we know the similarity \(k\left( \chi _{i},\chi _{j}\right) \) between two data sets \(D_{i}\) and \(D_{j}\), where \(\chi _{i}\) and \(\chi _{j}\) are the data set descriptors of \(D_{i}\) and \(D_{j}\), respectively, we proposed to set the value of \(w_{i}\) to the similarity between the data set \(D_{i}\) and the new data set \(D_{M+1}\) (Wistuba et al. 2016):

$$\begin{aligned} w_{i}=k\left( \chi _{i},\chi _{M+1}\right) . \end{aligned}$$
(41)

The concrete kernel that we apply is the Epanechnikov quadratic kernel (Nadaraya 1964)

$$\begin{aligned} k_{\rho }\left( \chi _{i},\chi _{j}\right) =\gamma \left( \frac{\left\| \chi _{i}-\chi _{j}\right\| _{2}}{\rho }\right) , \end{aligned}$$
(42)

where the \(\gamma \) function is given by

$$\begin{aligned} \gamma \left( t\right) ={\left\{ \begin{array}{ll} \frac{3}{4}\left( 1-t^{2}\right) &{} \text {if }t\le 1\\ 0 &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(43)

and \(\rho >0\) is the bandwidth. Setting w like this, our scalable Gaussian process transfer surrogate framework is equivalent to kernel regression with the Nadaraya-Watson kernel-weighted average for the mean prediction. Furthermore, we propose to rely on the uncertainty of the surrogate model for the new data set only:

$$\begin{aligned} v_{i}={\left\{ \begin{array}{ll} 1 &{} i=M+1\\ 0 &{} \text {otherwise} . \end{array}\right. } \end{aligned}$$
(44)

We would like to use the true similarity between the new data set and all other data sets, but since this is not available, we will evaluate two different common techniques to approximate it. One is based on meta-features, i.e. simple statistical or information-theoretic properties that are extracted from a data set and describe its characteristics (Bardenet et al. 2013; Reif et al. 2012; Smith-Miles 2009). We use the meta-features listed in Table 1. For a more detailed explanation of meta-features, we refer to Michie et al. (1994).

Using these meta-features, however, has one drawback: the meta-features are constant, which means that the knowledge of the target data set enters the model only via the target Gaussian process which is updated after every trial. Therefore, we propose an alternative using a pairwise hyperparameter performance comparison (Leite et al. 2012; Wistuba et al. 2015). The idea is to select pairs \(\left( x_{i},x_{j}\right) \) of evaluated hyperparameter configurations on the new data set \(D_{M+1}\) and count how often \(D_{M+1}\) and another data set \(D_{k}\) agree on the ranking of these configurations. After evaluating t many hyperparameter configurations during SMBO on the new data set, we estimate the data set descriptors for each data set \(D_k\) as

$$\begin{aligned} \left( \chi _{k}\right) _{j+\left( i-1\right) t}={\left\{ \begin{array}{ll} \frac{1}{t\left( t-1\right) } &{} \text {if }\mu _{k}\left( x_{i}\right) >\mu _{k}\left( x_{j}\right) \\ 0 &{} \text {otherwise} \end{array}\right. }. \end{aligned}$$
(45)

While the value of \(y\left( \cdot ,D_{M+1}\right) \) is known for these t hyperparameter configurations, this is not necessarily true for the data sets \(D_{1},\ldots ,D_{M}\). Hence, we use the prediction of each individual expert instead.

Computing the Euclidean distance of two meta-feature vectors then yields the number of discordant pairs normalized by dividing by the number of all pairs. This is basically a distance function based on the Kendall rank correlation coefficient (Kendall 1938). In this way, during the SMBO process the coefficients are adapted after each iteration, where the data sets that agree on more hyperparameter pairs with the target data set are weighted higher. This has been shown to improve the performance drastically.
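
A sketch of this kernel-regression weighting: Epanechnikov kernel weights over data set descriptors (Eqs. 41-43) and the pairwise-ranking descriptors of Eq. 45 built from each expert's predicted means on the t configurations evaluated so far. The bandwidth rho, the 0-based indexing, and the choice of giving the target expert the kernel's self-similarity weight are assumptions of this sketch.

    import numpy as np

    def epanechnikov(chi_i, chi_j, rho):
        # Epanechnikov quadratic kernel on data set descriptors (Eqs. 42-43).
        t = np.linalg.norm(np.asarray(chi_i) - np.asarray(chi_j)) / rho
        return 0.75 * (1.0 - t ** 2) if t <= 1.0 else 0.0

    def ranking_descriptor(mu):
        # Pairwise-ranking descriptor of Eq. 45 (0-based variant); mu[i] is the
        # observed or predicted performance of the i-th evaluated configuration.
        t = len(mu)
        chi = np.zeros(t * t)
        if t < 2:
            return chi
        for i in range(t):
            for j in range(t):
                if mu[i] > mu[j]:
                    chi[j + i * t] = 1.0 / (t * (t - 1))
        return chi

    def kernel_regression_weights(expert_mus, target_mu, rho):
        # w_i = k(chi_i, chi_{M+1}) as in Eq. 41; v puts all uncertainty weight
        # on the target expert only (Eq. 44).
        chi_target = ranking_descriptor(target_mu)
        all_mus = list(expert_mus) + [target_mu]
        w = [epanechnikov(ranking_descriptor(mu), chi_target, rho) for mu in all_mus]
        v = [0.0] * len(expert_mus) + [1.0]
        return np.asarray(w), np.asarray(v)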

7 Transfer acquisition function framework

The transfer surrogate models of the previous section provide very good results, as we will see in the experiments section. Nevertheless, these models face two important problems. First, the evaluation scores live on different scales on different data sets, and each expert reconstructs the scale of its own data set, so the weighted average will likely not match the scale of the new data set. One way to overcome this is to normalize the performance meta-data per data set and also create an approximated normalization for the new data set. This is more of a work-around than a solution because the problem remains to some degree and the approximated normalization for the new data set is inaccurate in the very beginning.

The second problem stems from the fact that transfer surrogate models assume that the meta-data is equally important throughout the whole optimization process. It seems natural, however, that with more knowledge about the new data set the influence of meta-data could diminish. In fact, we have seen that it is important to reduce the influence over time as soon as all available knowledge is consumed (Wistuba et al. 2016). Thus, we propose to transfer the meta-data within the acquisition function instead of the surrogate model which means that we will use a surrogate model that directly reconstructs the response surface of the new data set, and does not consider any meta-data. We reconstruct this response surface by using a Gaussian process that is trained on all available observations of the response surface for the new data set. We propose a transfer acquisition function framework (TAF) that is very similar to the aforementioned transfer surrogate framework presented in Algorithm 2. We define a novel acquisition function that incorporates meta-data as

$$\begin{aligned} a\left( x\right)= & {} \frac{w_{M+1}\mathrm {E}\left[ I_{M+1}\left( x\right) \right] +\sum _{i=1}^{M}w_{i}I_{i}\left( x\right) }{\sum _{i=1}^{M+1}w_{i}} \end{aligned}$$
(46)

with

$$\begin{aligned} I_{i}\left( x\right) = \max \left\{ y_{D_{i}}^{\text {min}}-{\hat{Y}}_{i}\left( x\right) ,0\right\} , \end{aligned}$$
(47)

where \({\hat{Y}}_{i}\left( x\right) \sim {\mathcal {N}}\left( \mu _{i}(x),\sigma _{i}^{2}(x)\right) \) and \(y_{D_{i}}^{\text {min}}\) is the best value achieved so far on \(D_{i}\). Thus, the acquisition function becomes the weighted average of the expected improvement (see Eq. 8) on the new data set plus the predicted improvement on all other data sets from previous experiments. While the framework looks very similar to the one proposed in Eqs. 31 and 32, the implications are entirely different. Whereas we previously tried to learn a mapping between data characteristics and algorithm and hyperparameter behaviour from the meta-data, we now try to learn a mapping between the data characteristics and the improvement. Instead of predicting the performance for each hyperparameter configuration and hence minimizing a regression loss by employing meta-data, each hyperparameter configuration is scored by two components: first, the expected improvement on the new data set is taken into account, which is subject to high uncertainty especially in the early trials; secondly, the predicted improvement that the hyperparameter configuration has achieved on other data sets is considered. Both components complement each other. In the early phase of the optimization, when the expected improvement provided by the surrogate for the new data set is still very unreliable, the improvement prediction from the meta-data favors hyperparameter configurations that have good average performance. Over time, more information about the new data set is collected and the expected improvement prediction becomes reliable. At this point, most of the meta-data has been consumed, i.e. many hyperparameter configurations in regions that have been good in previous experiments have been tried and thus, the improvement is comparably small. Thus, the meta-data starts to play a minor role in the acquisition function.

The weights for the transfer acquisition function framework can be set analogously to the propositions in Sects. 6.1 and 6.2.
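A minimal sketch of how Eqs. 46 and 47 can be evaluated is given below. It assumes one Gaussian process surrogate fitted on the observations of the new data set and one surrogate per meta data set, each exposing a scikit-learn-style predict interface; the function and argument names are illustrative and not part of our implementation.

```python
import numpy as np
from scipy.stats import norm

def transfer_acquisition(x, new_gp, y_min_new, meta_gps, y_min_meta, weights):
    """Sketch of the transfer acquisition function in Eq. 46.

    new_gp     : surrogate fitted on the observations of the new data set
    y_min_new  : best score observed on the new data set so far
    meta_gps   : list of M surrogates, one per meta data set D_i
    y_min_meta : list of best scores y_{D_i}^min achieved so far on each D_i
    weights    : array of length M+1; weights[-1] = w_{M+1} for the new data set
    """
    x = np.asarray(x, dtype=float).reshape(1, -1)

    # Expected improvement on the new data set (Eq. 8).
    mu, sigma = new_gp.predict(x, return_std=True)
    mu, sigma = mu[0], max(sigma[0], 1e-12)
    z = (y_min_new - mu) / sigma
    ei_new = (y_min_new - mu) * norm.cdf(z) + sigma * norm.pdf(z)

    # Predicted improvement on every other data set (Eq. 47).
    improvements = [
        max(y_min_i - gp_i.predict(x)[0], 0.0)
        for gp_i, y_min_i in zip(meta_gps, y_min_meta)
    ]

    numerator = weights[-1] * ei_new + float(np.dot(weights[:-1], improvements))
    return numerator / float(np.sum(weights))
```

In SMBO, this score would be evaluated for all candidate configurations and the maximizer chosen as the next configuration to test.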

8 Experiments and results

8.1 Meta-data set creation

We have created in total two meta-data sets, both for the purpose of learning classification models.

The first meta-data set contains the hyperparameter performances observed when running an SVM on 50 different data sets, all taken from the UCI repository. We selected the data sets at random and decided to use an SVM as a classifier since it is one of the most popular classification tools. We learn a linear SVM, an SVM with RBF kernel, and lastly an SVM with a polynomial kernel. The hyperparameters are then the choice of kernel, the cost of the slack variables C and, depending on which kernel we choose, the kernel width \(\gamma \) for RBF and the degree d for the polynomial kernel. We chose C from \(C\in \left\{ 2^{-5},\ldots ,2^{6}\right\} \), \(\gamma \) was searched in \(\gamma \in \left\{ 10^{-4},10^{-3},10^{-2},0.05,0.1,0.5,1,2,5,10,20,50,10^{2},10^{3}\right\} \) and the polynomial degree was optimized over \(d\in \left\{ 2,\ldots ,10\right\} \). This amounts to 288 different configurations per data set in total. If a data set was already split, we merged all splits and created a new 80% training and 20% validation split. For running the SVM we used the implementation by Tsochantaridis et al. (2004). The creation of this meta-data set took 160 CPU hours.
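For illustration, the following snippet enumerates this grid; the tuple layout is our own and only serves to verify the count of 288 configurations.

```python
from itertools import product

C_values = [2.0 ** e for e in range(-5, 7)]                                   # 12 values
gammas = [1e-4, 1e-3, 1e-2, 0.05, 0.1, 0.5, 1, 2, 5, 10, 20, 50, 1e2, 1e3]    # 14 values
degrees = list(range(2, 11))                                                  # 9 values

grid = (
    [("linear", C, None) for C in C_values]                    # 12 configurations
    + [("rbf", C, g) for C, g in product(C_values, gammas)]    # 12 * 14 = 168
    + [("poly", C, d) for C, d in product(C_values, degrees)]  # 12 *  9 = 108
)
assert len(grid) == 288
```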

The second meta-data set extends this setting so that it contains both hyperparameter performances on different data sets and the performances of a set of different algorithms. In order to accomplish this, we used WEKA (Holmes et al. 1994) to run 19 different classifiers on 59 data sets, for a total of 21,871 hyperparameter configurations to evaluate per data set. An overview of the employed classifiers is given in Table 2. In total, this sums up to roughly 1.3 million experiments. The overall computation of this meta-data set took about 900 CPU hours.

The meta-target for both meta-data sets is the classification error. We cannot guarantee that all approaches would behave identically for other error metrics such as the logarithmic loss. We did not conduct experiments for other loss metrics, but we do not expect different results, since the problem remains the same: finding the global minimum of the function \(y\).

In a leave-one-data-set-out cross-validation, we choose one data set as the target data set. No meta-instance of this data set is available at the beginning of the search; all of its observations need to be acquired by SMBO. To show that the optimization strategies can deal with yet unobserved hyperparameter configurations, only a subset of all meta-instances is used for training purposes. The evaluation on the meta-test set is done using all meta-instances of the target data set. Thus, all methods also need to evaluate hyperparameter configurations they have never seen in their meta-data.
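The evaluation protocol can be summarized by the following sketch, in which the meta-data is represented as a dictionary of precomputed evaluations; all names are hypothetical and only meant to make the evaluation loop explicit.

```python
def leave_one_data_set_out(meta_data, run_smbo, n_trials):
    """Sketch of the evaluation protocol: each data set becomes the target once.

    meta_data : dict mapping data set name -> {configuration: validation error}
    run_smbo  : optimizer that receives the meta-data of the remaining data sets
                plus an evaluation oracle for the target and returns its trial history
    """
    results = {}
    for target in meta_data:
        # Meta-training: all observations except those of the target data set.
        # (In the actual experiments, only a subset of these observations is used.)
        meta_train = {d: obs for d, obs in meta_data.items() if d != target}
        # The optimizer may only query the target data set through this oracle.
        oracle = lambda config, d=target: meta_data[d][config]
        results[target] = run_smbo(meta_train, oracle, n_trials)
    return results
```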

Table 2 Overview of all classifiers used within the WEKA meta-data set

8.2 Competing optimization strategies

We compare our proposed optimization strategies to a large set of state-of-the-art optimization strategies, which we describe in detail below.

Random Search This is a relatively simple baseline that chooses hyperparameter configurations at random. Nevertheless, Bergstra and Bengio (2012) have shown that this strategy can outperform grid search, especially for algorithms with hyperparameters that have a low effective dimensionality.

Independent Gaussian Process (I-GP) The use of Gaussian processes as a surrogate model goes back to the paper of Jones et al. (1998). It was first proposed for the problem of hyperparameter optimization for machine learning by Snoek et al. (2012) under the name Spearmint. In our experiments we use a squared-exponential kernel with automatic relevance determination (SE-ARD) for this and for all other optimization methods that are based on a Gaussian process. This surrogate model does not use any knowledge from previous experiments.
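As a rough sketch, such a surrogate could be set up with scikit-learn, where an anisotropic RBF kernel (one length scale per hyperparameter dimension) corresponds to SE-ARD; this is only an illustration and not the Spearmint implementation referenced above.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

def make_se_ard_gp(n_dims):
    # One length scale per hyperparameter dimension = automatic relevance determination.
    kernel = ConstantKernel(1.0) * RBF(length_scale=np.ones(n_dims))
    return GaussianProcessRegressor(kernel=kernel, normalize_y=True)

# X: observed hyperparameter configurations, y: their validation errors.
# gp = make_se_ard_gp(X.shape[1]).fit(X, y)
# mean, std = gp.predict(X_candidates, return_std=True)
```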

Independent Random Forest (I-RF) This surrogate is very similar to I-GP but uses a random forest instead of a Gaussian process. It was proposed by Hutter et al. (2011) under the name SMAC and is applied in auto-sklearn (Feurer et al. 2015) and Auto-WEKA (Thornton et al. 2013).

Initialization for I-GP and I-RF (I-GP (init) and I-RF (init)) Because I-GP and I-RF do not consider any meta-data, we evaluate both surrogate models also with a meta-initialization (Wistuba et al. 2015).

Surrogate Collaborative Tuning (SCoT) SCoT (Bardenet et al. 2013) is the first transfer surrogate model that was proposed. It tries to rank the hyperparameter configurations instead of reconstructing the hyperparameter response surface. It uses \(\text {SVM}^{\text {RANK}}\) to predict a ranking. Since the ranker does not provide any uncertainties, a Gaussian process is fitted on the output of the ranker. The authors originally proposed to use the RBF kernel for \(\text {SVM}^{\text {RANK}}\), but due to its computational complexity we follow the lead of Yogatama and Mann (2014) and use the linear kernel instead.

Gaussian Process with Multi-Kernel Learning (MKL-GP) Yogatama and Mann (2014) proposed another transfer surrogate model. This transfer surrogate model learns a Gaussian process with a specific kernel combination on all instances. The kernel is a linear combination of the SE-ARD kernel and a kernel modelling the similarity between data sets based on a set of meta-features. To tackle the problem of different scales of hyperparameter response surfaces for different data sets, they propose to normalize the target.

Factorized Multilayer Perceptron (FMLP) FMLP (Schilling et al. 2015) is another transfer surrogate model that uses a specific neural network to learn the similarity between data sets implicitly in a latent representation.

Scalable Gaussian Process Transfer Surrogate (SGPT) This is the framework we propose in this work. We distinguish different instances of it depending on how we choose the weights. SGPT-PoE is the version that chooses the weights according to Sect. 6.1 and hence is based on the product of experts. We also include the kernel regression method introduced in Sect. 6.2, once with the meta-feature data set descriptors (SGPT-M) and once with the pairwise hyperparameter performance descriptors (SGPT-R).

Transfer Acquisition Function (TAF) In Sect. 7 we proposed an acquisition function that makes use of meta-data. We combine it with the surrogate model used by I-GP and distinguish different versions depending on the choice of weights, analogously to SGPT.

The kernel parameters used in the Gaussian process are learned by maximizing the marginal likelihood on the meta-training set (Eq. 24). All hyperparameters of the tuning strategies are optimized in a leave-one-data-set-out cross-validation on the meta-training set.

The reported results were estimated using a leave-one-data-set-out cross-validation and are the average of ten repetitions. For strategies with random initialization (Random, I-GP, I-RF), we report the average over one thousand repetitions due to the higher variance. For those strategies that use meta-features, we use the meta-features described in Table 1.

8.3 Evaluation metrics

We compare all optimization strategies with respect to three different evaluation measures.

8.3.1 Average rank

At each time step t, all optimization strategies are ranked for each data set \(D_{i}\) according to the best score achieved so far; the better the score, the smaller the rank. In case of ties, the average rank is assigned. Finally, the ranks are averaged over all data sets, yielding the average rank.
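A small sketch of this metric, assuming a matrix of best scores per data set and strategy (names are illustrative):

```python
import numpy as np
from scipy.stats import rankdata

def average_rank(best_scores):
    """best_scores: array of shape (n_data_sets, n_strategies) containing the best
    score each strategy has achieved on each data set at time step t (lower is better).
    Ties receive the average of the tied ranks."""
    ranks = np.vstack([rankdata(row, method="average") for row in best_scores])
    return ranks.mean(axis=0)  # one average rank per strategy
```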

8.3.2 Average distance to the global minimum

For an optimization strategy, the average distance to the global minimum (ADTM) at time step t is defined as follows. For each data set D, the distance between the best score achieved so far and the global best score on this data set is computed. The final score is the average over all data sets \(D\in {\mathcal {D}}\). To account for different scales on different data sets, the hyperparameter performance is scaled to \(\left[ 0,1\right] \). Formally,

$$\begin{aligned} \text {ADTM}=\frac{1}{\left| {\mathcal {D}}\right| }\sum _{D\in {\mathcal {D}}}\min _{x\in \mathcal {X}_{t}}\frac{y_{D}\left( x\right) -y_{D}^{\text {min}}}{y_{D}^{\text {max}}-y_{D}^{\text {min}}} \end{aligned}$$
(48)

where \(y_{D}^{\text {min}}\) and \(y_{D}^{\text {max}}\) are the best and worst performance on the precomputed grid, respectively, and \(\mathcal {X}_{t}\) is the set of all hyperparameter configurations tried until time step t. For this evaluation metric, lower values are better, and zero is the best possible value.
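A direct transcription of Eq. 48, again with illustrative names:

```python
import numpy as np

def adtm(evaluated, y_min, y_max):
    """Average distance to the global minimum (Eq. 48).

    evaluated : list with one array per data set containing the performances
                y_D(x) of all configurations tried up to time step t
    y_min, y_max : per data set, the best and worst performance on the grid
    """
    distances = [
        (np.min(y_t) - lo) / (hi - lo)
        for y_t, lo, hi in zip(evaluated, y_min, y_max)
    ]
    return float(np.mean(distances))
```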

8.3.3 Fraction of unsolved data sets

This metric estimates the proportion of data sets for which the optimum according to the precomputed grid has not been found yet. It ranges from 1, if no optimum has been found on any data set, to 0, if the optimum has been found on every data set.
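The corresponding computation is straightforward; the sketch below mirrors the ADTM example above:

```python
import numpy as np

def fraction_unsolved(evaluated, y_min):
    """evaluated: list with one array of tried performances per data set,
    y_min: per data set, the best performance on the precomputed grid."""
    unsolved = [np.min(y_t) > lo for y_t, lo in zip(evaluated, y_min)]
    return float(np.mean(unsolved))
```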

Fig. 4 SGPT (and hence also TAF) clearly outperforms the state of the art, which trains a single Gaussian process on the full meta-data, with respect to scalability. FMLP, which is based on a neural network, has linear training time and provides faster results for larger meta-data sets

8.4 Scalability experiment

As discussed in Sect. 5, Gaussian processes are computationally expensive. Training time is cubic in the number of training instances and still quadratic when updating the model. Our proposed surrogate model SGPT makes use of Gaussian processes in a scalable way. Assume we are given d data sets, on each of which n observations of hyperparameter performance have been made. A typical way of using meta-data for SMBO (Bardenet et al. 2013; Yogatama and Mann 2014) is to train a Gaussian process on all instances, which has an asymptotic training time of \({\mathcal {O}}\left( d^{3}n^{3}\right) \). We propose to learn an independent Gaussian process for each data set, which reduces the training time to \({\mathcal {O}}\left( dn^{3}\right) \) and is therefore no longer cubic in the number of data sets. Still, the complexity of both methods is cubic in the number of instances per data set. In an empirical evaluation we show that our method is nevertheless feasible, while the state of the art exceeds acceptable run times.

We have created an artificial meta-data set with \(d=50\) data sets and 5 hyperparameters. The number of instances per data set n varies from 10 to 190. We estimated the run time for a Gaussian process on the full data and for SGPT for different n. The results are visualized in Fig. 4. At a point where the full GP needs almost 7 hours of training, SGPT needs only about 2 minutes. One could even further improve the scalability by learning multiple Gaussian processes per data set. To achieve this, the subsets \(X^{(i)},\,y^{(i)}\) defined in Eq. 29 have to be divided further. One could for example learn an individual Gaussian process for each of the three SVM kernels and then apply the method of Sect. 6.1.
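The difference between the two training schemes can be sketched as follows, using synthetic data and smaller dimensions than in Fig. 4 to keep the example fast; the setup is purely illustrative.

```python
import time
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(0)
d, n, p = 50, 40, 5          # data sets, observations per data set, hyperparameters
X = rng.uniform(size=(d, n, p))
y = rng.uniform(size=(d, n))

# State of the art: one Gaussian process on all d*n meta-instances, O((d*n)^3).
start = time.time()
full_gp = GaussianProcessRegressor().fit(X.reshape(d * n, p), y.reshape(d * n))
print("full GP:", time.time() - start, "s")

# SGPT-style: one independent Gaussian process per data set, O(d * n^3) in total.
start = time.time()
experts = [GaussianProcessRegressor().fit(X[i], y[i]) for i in range(d)]
print("per-data-set GPs:", time.time() - start, "s")
```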

As discussed earlier, the cubic run time makes Gaussian processes unattractive for large meta-data sets. Hence, our main goal was to achieve run times for Gaussian processes that are competitive with other models such as the neural network used by FMLP. Figure 4 shows that our approach needs a run time very similar to that of FMLP; for fewer instances, it is even faster.

8.5 Predictive performance in SMBO

We were able to provide theoretical and empirical evidence that our method scales better than the current state-of-the-art methods. Now our aim is to provide empirical evidence that our proposed methods are also competitive in terms of predictive performance, both for the task of hyperparameter optimization and for combined algorithm selection and hyperparameter optimization.

First, we compare our proposed variations of SGPT and explain the results in Sect. 8.5.1. Then, in Sect. 8.5.2, we compare SGPT to six state-of-the-art methods. Afterwards, we analyze our proposed acquisition function and empirically compare it to its SGPT counterpart in Sect. 8.5.3. In Sect. 8.5.4, we compare the best methods in a final comparison. We use Sect. 8.5.5 to conduct a significance analysis of our results and conclude the experimental section with a comparison considering the training time in Sect. 8.5.6.

Fig. 5 SGPT-R is outperforming the other two approaches due to the decaying influence of the meta-data for the task of hyperparameter optimization

Fig. 6 The adaptive weights also allow SGPT-R to outperform its variants for the task of combined algorithm selection and hyperparameter optimization

8.5.1 Evaluating the scalable Gaussian process transfer surrogate framework

In Sect. 6, we derived or proposed three different ways of using meta-data in SGPT, depending on how we set the weights. Before comparing to the state-of-the-art methods, our aim is to focus on these three variations. We evaluated all methods for the task of hyperparameter optimization as well as for the task of combined algorithm selection and hyperparameter optimization. The results are presented in Figs. 5 and 6. SGPT-R outperforms the other two alternatives. Our explanation for this is the use of adaptive weights. Firstly, they enable the method to quickly identify data sets that behave similarly; additionally, the influence of the meta-data decays over time. Hence, as soon as enough knowledge is gathered from the meta-data, the system is able to rely more strongly on the predictions of the surrogate for the new data set. This insight was one of our key motivations for proposing TAF. In the following, we will not use SGPT-M and SGPT-PoE in further comparisons to avoid overcrowded figures.

Fig. 7 Our proposed approach SGPT-R is outperforming all competitor methods with respect to all three metrics. For all metrics, the lower the better

Fig. 8 SGPT-R is outperforming the competitor methods for the task of combined algorithm selection and hyperparameter optimization only in the first iterations. Then, FMLP provides the best results. On this larger meta-data set we were not able to compare to the methods based on a GP trained on the whole meta-data (i.e. SCoT and MKL-GP)

8.5.2 Comparison to other competitor methods

We saw in the last section that SGPT-R is the best variation of our SGPT framework. We now compare it to the competitor methods on the task of hyperparameter optimization as well as the task of combined algorithm selection and hyperparameter optimization. As before, we use all three evaluation measures in the comparisons. The results for the task of hyperparameter optimization are presented in Fig. 7. SGPT-R outperforms all competitor methods with respect to all three evaluation metrics.

Fig. 9 TAF and SGPT deliver similarly competitive results on this meta-data set

Fig. 10 TAF provides a clear improvement over SGPT thanks to its adaptive use of meta-data and better way of dealing with different data set scales

The results for the task of combined algorithm selection and hyperparameter optimization are presented in Fig. 8. For this task, we do not compare our methods to SCoT and MKL-GP because they are based on a Gaussian process which is trained on the full meta-data set. Since the meta-data set is too large, we were not able to conduct these experiments. Instead, we compare our methods to a Gaussian process and a random forest that use a meta-initialization.

SGPT-R provides very good results for the first trials but is then not able to progress further and find better hyperparameter configurations. The likely reason is that the WEKA meta-data set poses a more challenging optimization problem than the SVM meta-data set. Hence, more trials are needed in comparison to the SVM meta-data, which consequently increases the number of discordant pairs, such that the distance of the new data set to the others increases too quickly. Then, following the definition of Eqs. 42 and 45, the meta-data will not be considered any more and our method performs like a Gaussian process without meta-data. Thus, the strongest competitor, FMLP, is able to outperform it after some time. The other methods also get close to the performance of SGPT-R.

Fig. 11 SGPT-R and TAF-R provide similar performance, both outperforming the other competitors

Fig. 12 On this data set, TAF-R is able to show that it is more robust than SGPT-R

8.5.3 Evaluating the transfer acquisition framework (TAF)

Our motivation for introducing TAF was to get rid of the problem of differently scaled data sets and to answer the question of how to adaptively employ meta-data. TAF is a direct extension of SGPT that uses the meta-data in the acquisition function instead of the surrogate model. To provide empirical evidence that we overcame the problems faced by many transfer surrogate models, we first compare TAF to the different versions of SGPT. As a reminder, the postfix “-PoE” is used for the variant that chooses the weights according to Sect. 6.1 and is based on the product of experts. The postfixes “-M” and “-R” indicate the variants based on kernel regression with the meta-feature data set descriptors and the pairwise hyperparameter performance descriptors, respectively (Sect. 6.2). We have conducted our experiments on our two meta-data sets and obtained the results summarized in Figs. 9 and 10. TAF-R obtains the best results. We see a strong improvement of TAF-M and TAF-PoE over SGPT-M and SGPT-PoE, respectively. As mentioned before, especially SGPT-M uses static weights that do not account for the progress of the optimization process. Hence, the impact of the meta-data remains constant, which is unfavorable. The weights of TAF-M are the same and thus also remain constant, but the improvement on the meta-data shrinks over time as better configurations \(x\) are found. Hence, we observe an adaptive use of meta-data that leads to the large improvement over SGPT-PoE and SGPT-M. Since SGPT-R already has a mechanism that decays the influence of the meta-data, SGPT-R and the TAF approaches provide similarly good results on the SVM meta-data set in Fig. 9, which likely results from the better way of tackling the problem of different scales.

The improvement of TAF-R over SGPT-R becomes significant on the WEKA meta-data set, as presented in Fig. 10. We saw in the previous experiments that SGPT-R has a good start on this problem but then gets stuck and has problems improving the solution further. TAF-R is able to overcome this issue. Having a similarly good start, it is able to continue its search and finds the optimum for more than 50% of the data sets within 300 trials.

8.5.4 Overall comparison

We conclude our experiments by comparing the best methods from the TAF and SGPT frameworks with the strongest state-of-the-art competitors. Figure 11 presents the results for finding optimal hyperparameter configurations for a kernel SVM. Both TAF-R and SGPT-R outperform the competitor methods with respect to all three metrics and are approximately comparable. For the WEKA data set the story is different. In Fig. 12 it becomes clear that SGPT-R has a good start, but then fails to further improve the solution. FMLP is able to outperform it after some time, and other, simpler competitor methods also reach comparable performance. Thanks to its better adaptive use of meta-data and its handling of different data set scales, TAF-R is once again the best method. While it is a little worse than SGPT-R at the beginning, TAF-R quickly outperforms SGPT-R. Additionally, TAF-R is always stronger than the runner-up method FMLP. Thus, we consider our motivation for proposing TAF confirmed. Even though it looks very similar to SGPT, TAF-R is more robust for the aforementioned reasons. We were able to show theoretically and empirically that our approach is faster than a Gaussian process learned on the full meta-data. In our experiments, we show that its run time is approximately as low as that of the fastest transfer surrogate models. Hence, our proposed approach is not only effective, it is also efficient.

8.5.5 Significance analysis

Friedman Test To analyze the significance of our results, we apply a Friedman test (1937, 1940) with the corresponding post-hoc tests. This was proposed by Demšar (2006), who analyzed multiple statistical tests and, as a conclusion, recommended this method for comparisons of multiple methods over multiple data sets. A very important aspect is that Demšar proposed this test for classifiers and not for optimization methods, and there are vital differences between these problems. Good methods might find a good solution, or even the best one, in a shorter time than worse methods; however, if the worse methods are given more time, they will inevitably catch up. It is clear that with enough trials there will be no significant difference between a random strategy and any other strategy. Since we are not aware of any better way to analyze the statistical significance of our results, we nonetheless follow Demšar’s recommendation for all methods.

Fig. 13 Comparison of all optimizers against each other with the two-tailed Nemenyi test on the SVM meta-data set. Groups that are not significantly different (at \(p=0.05\)) are connected. The left plot shows the results after the first trial, the right plot after the 30th trial. After 30 trials there is no significant difference between our proposed methods TAF and SGPT and the potentially best method

Fig. 14 Comparison of all optimizers against each other with the two-tailed Nemenyi test on the WEKA meta-data set. Groups that are not significantly different (at \(p=0.05\)) are connected. The left plot shows the results after 30 trials, the right plot after 200 trials

For our analysis we added one further method that we call Best. This is an oracle method that chooses for any data set the best performing hyperparameter configuration and algorithm in the first trial.

We estimated the critical value for both meta-data sets and rejected the null hypothesis that the measured average ranks are not significantly different from the mean rank at \(\alpha =0.05\). Since the Friedman test does not indicate which methods differ, we use the Nemenyi test (1962) for pairwise comparisons. We computed the critical values for \(p=0.05\). We conducted the significance tests for four different scenarios: after the first and the 30th trial on the SVM meta-data set, and after the 30th and the 200th trial on the WEKA meta-data set. The results are visualized in Figs. 13 and 14. Groups of methods that are not significantly different are connected.
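A sketch of this procedure using scipy and the third-party scikit-posthocs package (assumed to be available; not the implementation used for the reported results) could look as follows:

```python
import numpy as np
from scipy.stats import friedmanchisquare
import scikit_posthocs as sp  # third-party package, assumed to be installed

# best_scores: shape (n_data_sets, n_strategies), best score of each strategy
# on each data set after a fixed number of trials (placeholder data here).
rng = np.random.default_rng(0)
best_scores = rng.uniform(size=(50, 6))

# Friedman test: null hypothesis that all strategies perform equally well.
stat, p_value = friedmanchisquare(*(best_scores[:, j] for j in range(best_scores.shape[1])))

if p_value < 0.05:
    # Pairwise Nemenyi post-hoc test; returns a matrix of pairwise p-values.
    pairwise_p = sp.posthoc_nemenyi_friedman(best_scores)
    print(pairwise_p)
```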

Fig. 15 Comparison of TAF-R against all other optimizers using the Bayesian hierarchical test on the SVM meta-data set. Probabilities above 95% are significant. With increasing number of trials the probability for equality of both methods increases due to a limited number of considered configurations

On the SVM meta-data set the critical difference is 1.7 for \(p=0.05\). The post-hoc tests detect significant differences between TAF-R and SGPT-R on the one hand and all methods but FMLP on the other hand after the first trial, and significant differences to I-GP, I-RF and Random after 30 trials. As discussed earlier, the optimal value is fixed, such that more trials mean that more methods achieve good solutions. Nevertheless, even after 30 trials, every method that uses meta-data still outperforms every method that does not. Finally, after 30 trials there is no significant difference between our proposed methods and the potentially best method Best.

Fig. 16 Comparison of TAF-R against all other optimizers using the Bayesian hierarchical test on the WEKA meta-data set. Probabilities above 95% are significant. With increasing number of trials the probability for equality of both methods increases due to a limited number of considered configurations

On the WEKA meta-data set the critical difference is 1.48 for \(p=0.05\). The post-hoc tests detect that I-RF is significantly worse than any other method. After 30 trials, SGPT-R provides significantly better results than I-GP, I-RF and I-RF with initialization. The experimental data is not sufficient to reach any conclusion regarding FMLP, TAF-R and I-GP with initialization. After 200 trials, no statistically significant statement can be made about any methods except I-RF and Best.

Bayesian Hierarchical Test In addition to the Friedman test, we conduct a second significance test, namely the recently proposed Bayesian hierarchical test (Corani et al. 2016). As proposed by Corani et al., we assume that two classifiers are practically equivalent if their mean difference in accuracy lies within the interval \(\left( -\, 0.01,0.01\right) \). The test produces three probabilities: the first two are the probabilities that one or the other classifier is significantly better. Additionally, it provides the probability that both classifiers are practically equivalent, i.e. that their difference in accuracy lies within the region of practical equivalence (rope). Figures 15 and 16 present the posterior probabilities of the hierarchical model for each trial. For a significance level of \(\alpha = 0.05\), the results are significant when the posterior probability is above 95%. However, the advantage of the Bayesian hierarchical test is that posterior probabilities can be meaningfully interpreted even when they do not exceed the 95% threshold (Corani et al. 2016), simply by computing the odds ratios of the different outcomes. We see that TAF-R provides better, sometimes significantly better, results than Random, I-GP, I-RF, SCoT and MKL-GP on both meta-data sets. In comparison to FMLP and SGPT-R the results are not significant. However, the test indicates that these methods are either equal or that TAF-R is better. There is little indication that TAF-R performs worse, as the respective probability only increases when compared to SGPT-R on the WEKA meta-data set in some of the early trials.

Fig. 17 Comparison of TAF-R to its competitor methods considering the run time. TAF-R still performs best. In contrast to the previous experiments, FMLP is worse and I-RF better

8.5.6 Performance with respect to run time

In the previous experiments we assumed that the training time is always the same no matter which configuration is chosen. The main reason for this is that none of the methods considers the run time needed for evaluating a configuration when choosing which configuration to test next; all of them try to find the global minimum in as few trials as possible. In practice, however, it is of course very important to consider the run time as well. Since for the WEKA meta-data set not only the hyperparameters change but also the model, we conducted a run time experiment for this particular meta-data set. The results are reported in Fig. 17. In our experiments we used various data sets which differ in the number of predictors and instances. To ensure that each data set has equal influence on the average results, we normalized the run time. The run time for each data set is normalized by dividing it by the average training time needed for a model trained on this data set. Hence, a normalized run time of 1 in Fig. 17 corresponds to the performance of all methods after investing the amount of time needed to train a single model on average. We report ADTM and the fraction of unsolved data sets for each method starting at the time when at least one configuration has been evaluated for all data sets. We report the average rank starting at the time when every method has evaluated at least one configuration for all data sets.
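The normalization and the resulting per-data-set trajectories can be sketched as follows (illustrative names; the averaging across data sets happens afterwards):

```python
import numpy as np

def normalized_runtime_trajectory(runtimes, best_errors, avg_train_time):
    """runtimes       : run time of each evaluated configuration on one data set, in order
    best_errors    : best validation error observed after each evaluation
    avg_train_time : average time needed to train a single model on this data set

    Returns (normalized cumulative time, best error so far) pairs, so that a
    normalized time of 1 corresponds to the cost of training one average model."""
    t = np.cumsum(runtimes) / avg_train_time
    return list(zip(t, np.minimum.accumulate(best_errors)))
```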

Our proposed method TAF-R still yields the best results. The biggest differences to the previous results are the changes of FMLP and I-RF. I-RF improved a lot: it evaluates far more configurations per time unit than any other method. This leads to worse results under the previous evaluation protocol but pays off under this one. For FMLP it is exactly the other way round. It mainly focuses on time-consuming configurations. While it needs only a few configurations to find good models, it needs more time than most competitor methods.

9 Conclusions

In this work, we proposed a new transfer surrogate model framework which is able to scale to large meta-data sets. Such scalability will become a necessity in the near future: with more experiments conducted every day, the meta-data that can be employed grows as well. To ensure scalability, our transfer surrogate model is built on Gaussian processes which are learned individually on the observed performances of a single data set only and are then combined into a joint surrogate model on the basis of the product of experts as well as kernel regression models. Additionally, as Gaussian processes are basically hyperparameter-free, we have created a strong yet scalable surrogate model that does not require a second-stage hyperparameter optimization, as opposed to other surrogates in the related work. Overall, we derived different instances of our framework and evaluated them with respect to optimization performance and scalability on two meta-data sets covering both hyperparameter optimization and model choice. We showed empirically that our method is able to outperform existing methods with respect to both measures by choosing well-performing hyperparameter configurations while maintaining a small computational overhead.

Nevertheless, the transfer surrogate model still suffers from the problem caused by different performance scales on different data sets, which introduces a bias when selecting the next hyperparameter configuration to test. As a consequence of this issue, we proposed to decay the influence of the meta-data as the hyperparameter optimization progresses, and we found empirically that this strategy proves advantageous. Unfortunately, this decay relies on heuristics, which motivated us to use the meta-data within the acquisition function instead of the surrogate model to achieve the same effect in a more principled way. Analogously to the transfer surrogate framework, we derived a transfer acquisition framework that keeps the assets of our previously proposed surrogate model but overcomes its disadvantages and directly uses the improvement on the meta-data. In a conclusive evaluation, we were able to confirm our intuition empirically on two meta-data sets, as the transfer acquisition framework shows even higher performance than the transfer surrogate model on the same meta-data sets. Consequently, we recommend using the transfer acquisition framework for hyperparameter optimization, as it is fast and powerful in delivering well-performing hyperparameter configurations.

Possible future work involves the consideration of learning curves. As mentioned in the related work section, there exist many approaches which use the convergence behavior of the learning algorithm to predict the final performance. This prediction itself will help improve our method. Additionally, learning curve forecasting methods will benefit from observations on other data sets. Furthermore, we want to investigate the case where one is not solely interested in reducing the loss of a predictor but also in other costs such as training time, prediction time or hardware constraints (Abdulrahman et al. 2018), which is relevant for edge devices. We saw in our experiments that minimizing only the classification error, regardless of the training time needed, can lead to non-optimal results. Therefore, we consider minimizing a loss function which is a combination of classification error and training time, as suggested by Abdulrahman et al. (2018). Finally, we want to conduct experiments to see whether our method can be migrated to deep learning methods.