Scalable Gaussian process-based transfer surrogates for hyperparameter optimization
Abstract
Algorithm selection as well as hyperparameter optimization are tedious tasks that have to be dealt with when applying machine learning to real-world problems. Sequential model-based optimization (SMBO), based on so-called “surrogate models”, has been employed to allow for faster and more direct hyperparameter optimization. A surrogate model is a machine learning regression model which is trained on the meta-level instances in order to predict the performance of an algorithm on a specific data set, given the hyperparameter settings and data set descriptors. Gaussian processes, for example, make good surrogate models as they provide probability distributions over labels. Recent work on SMBO also includes metadata, i.e. observed hyperparameter performances on other data sets, in the process of hyperparameter optimization. This can, for example, be accomplished by learning transfer surrogate models on all available instances of meta-knowledge; however, the increasing amount of meta-information can make Gaussian processes infeasible, as they require the inversion of a large covariance matrix which grows with the number of instances. Consequently, instead of learning a joint surrogate model on all of the metadata, we propose to learn individual surrogate models on the observations of each data set and then combine all surrogates into a joint one using ensembling techniques. The final surrogate is a weighted sum of all data-set-specific surrogates plus an additional surrogate that is solely learned on the target observations. Within our framework, any surrogate model can be used; we explore Gaussian processes in this scenario. We present two different strategies for finding the weights used in the ensemble: the first is based on a probabilistic product of experts approach, and the second is based on kernel regression.
Additionally, we extend the framework to directly estimate the acquisition function in the same setting, using a novel technique which we name the “transfer acquisition function”. In an empirical evaluation including comparisons to the current state-of-the-art on two publicly available metadata sets, we demonstrate that our proposed approach not only scales to large metadata, but also finds stronger prediction models.
Keywords
Hyperparameter optimization · Gaussian processes · Sequential model-based optimization · Meta-learning
1 Introduction
Hyperparameter optimization and algorithm selection are ubiquitous tasks in machine learning that usually have to be conducted for every individual research task and real-world application. Choosing the correct model and hyperparameter configuration can turn very poor predictions into state-of-the-art performance.
Hyperparameter optimization tries to find the hyperparameter configuration that minimizes a certain black-box function \(y(x)\), which is commonly a cross-validation loss of a model learned on some training data using the hyperparameter configuration x. Despite its omnipresence, hyperparameter optimization is usually a difficult task, as the optimization cannot be carried out by minimizing a loss function with nice mathematical properties such as differentiability or convexity. Consider for example the number of hidden layers and hidden neurons for a simple feed-forward neural network. When learning the neural network, both of these hyperparameters have to be set, as the final prediction performance depends heavily on the correct setting of the model complexity. If the model is too complex (i.e. many layers and neurons), the model will very likely overfit the training data or get stuck in a local minimum. However, if the model complexity is not high enough, it might underfit the training data and miss information vital to the prediction task. Thus, the correct setting of these hyperparameters is vital for any serious application of machine learning; however, as mentioned above, the difficulty with this task is that we have no loss function which we can optimize to learn the specific best choices for these hyperparameters.
The majority of efforts to solve this problem are based on the sequential model-based optimization (SMBO) framework, which has its roots in the area of black-box optimization. SMBO is an iterative approach which trains a surrogate model \(\varPsi \) on the observed meta-level instances of \(y\). The surrogate can then be used to predict the performance of an algorithm on a specific data set, given the hyperparameter settings and data set descriptors. We use this prediction to find promising hyperparameter configurations, evaluate \(y\) for these configurations, and finally retrain \(\varPsi \). The overall process is repeated T times, and in the end we take the best hyperparameter configuration found so far. In comparison to exhaustive search methods, SMBO tries to adaptively steer the optimization into promising regions of the hyperparameter space.
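The loop described above can be sketched as follows. This is a minimal illustration, not the exact implementation used in the paper; `fit_surrogate` and `acquisition` are placeholder callables for the surrogate training routine and the acquisition function:

```python
import random

def smbo(y, candidates, fit_surrogate, acquisition, T):
    """Generic SMBO loop: repeatedly fit a surrogate on the observation
    history, pick the candidate maximizing the acquisition function,
    evaluate y there, and finally return the best configuration found."""
    history = []  # list of (configuration, observed loss) pairs
    x0 = random.choice(candidates)  # start from one random evaluation
    history.append((x0, y(x0)))
    for _ in range(T - 1):
        surrogate = fit_surrogate(history)
        observed = [h[0] for h in history]
        unobserved = [x for x in candidates if x not in observed]
        # choose the most promising unobserved configuration
        x_next = max(unobserved, key=lambda x: acquisition(surrogate, x, history))
        history.append((x_next, y(x_next)))
    return min(history, key=lambda h: h[1])  # best (x, y(x)) so far
```

With a constant acquisition function this degenerates to an arbitrary search order, which illustrates that the quality of SMBO hinges entirely on the surrogate and acquisition choices discussed below.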
1. A meta-learning system must include a learning subsystem which adapts with experience.
2. Experience is gained by exploiting meta-knowledge extracted
   (a) in a previous learning episode on a single data set, and/or
   (b) from different domains or problems.
Throughout this paper, the term metadata refers to observations of the performances of different sets of hyperparameter configurations evaluated on a variety of different data sets. The inclusion of such metadata in SMBO-based hyperparameter optimization can be accomplished in different ways, with the most commonly used approach being to pre-train the surrogate model on these observations. Such surrogate models, which are pre-trained on metadata, are called “transfer surrogate models” because they are capable of using the metadata to transfer knowledge from previously seen data sets to new ones.
Another approach learns a particular initialization, i.e. a set of hyperparameter configurations, which is most likely to work well on the data. This approach has been shown to produce better results; this seems plausible due to the nature of the problem.
Usually, researchers gain experience in choosing well-performing hyperparameter configurations for their models by running these models on a variety of data sets and hyperparameters. Consequently, if a new data set arrives, hyperparameter optimization guided by an expert will incorporate this knowledge when choosing which initial configurations to test. This intuition makes it clear that incorporating hyperparameter performance on other data sets into the surrogate model used within SMBO helps speed up and steer the hyperparameter optimization towards regions where we expect to find good hyperparameter configurations.
A number of publications show that using metadata is beneficial, including but not limited to Bardenet et al. (2013), Yogatama and Mann (2014), Swersky et al. (2013), Schilling et al. (2015), Wistuba et al. (2015, 2016), Feurer et al. (2015).
In this paper, we integrate two pieces of work, Schilling et al. (2016) and Wistuba et al. (2016), which both learn individual Gaussian processes on subsets of the metadata. Each Gaussian process is learned on all the observed performances of a single data set, i.e. the hyperparameter configurations and their corresponding performances on this specific data set. Finally, all processes are combined into a single surrogate model. In this way, we achieve scalability to large amounts of metadata because the training effort of the surrogate model is no longer cubic in the total number of meta-instances across all data sets. We compare both papers and extend the ideas therein by learning a transfer acquisition function using an ensemble of Gaussian processes. The resulting approach shows better empirical performance than the state-of-the-art for hyperparameter optimization while maintaining the scalability properties of a simple product of experts model, as we demonstrate in a set of thoroughly conducted experiments. We also show that the acquisition function is an elegant way of dealing with different performance scales across data sets and allows the influence of the metadata to decay over time.

Our contributions are:
1. The unification of our previous work (Schilling et al. 2016; Wistuba et al. 2016) as the scalable Gaussian process transfer surrogate framework.
2. Identification of typical problems faced when using metadata in surrogate models.
3. The proposal of using meta-learning in the acquisition function instead of the surrogate model, via the transfer acquisition function framework, to overcome these issues.
4. Extensive empirical evaluations comparing all approaches, including previous and new methods, with additional discussion.
2 Related work
The algorithm selection problem is an important problem in many domains and was first introduced in the 1970s (Rice 1976). Application domains include hard combinatorial problems such as SAT (Xu et al. 2008) and TSP (Kanda et al. 2012), software design (Cavazos and O’Boyle 2006), numerical optimization (Kamel et al. 1993), optimization (Nareyek 2004) and many more. In our work, we limit ourselves to the domain of machine learning although there are generalizations that subsume hyperparameter optimization in the broad category of algorithm configuration (Eggensperger et al. 2018). Thus, we investigate both the problem of finding the right algorithm as well as finding suitable hyperparameters for this algorithm.
Existing work can be grouped with respect to different properties. One way to group previous work is by methodological approach, where we have on the one hand approaches that search the hyperparameter space exhaustively and on the other hand methods that use black-box optimization techniques such as sequential model-based optimization (Jones et al. 1998). Other methods under this grouping make use of search algorithms from artificial intelligence such as genetic algorithms.
One can also distinguish between approaches that use metadata and those that do not. Meta-learning permits us to transfer past experiences with particular algorithms and hyperparameter configurations from one data set to another. Plenty of work has been done in this area and can be found in recently published books and surveys: Brazdil et al. (2009), Vilalta and Drissi (2002), Lemke et al. (2015).
In the next section we discuss further related work as classified by methodology.
2.1 Exhaustive search methods
The most widely used method to optimize hyperparameters in machine learning is grid search. For a grid search, we choose a finite subset of hyperparameter configurations and evaluate them all in a brute-force manner, ideally within a parallel computing environment. In some cases, grid search is manually steered by choosing a coarse grid at first to find regions where hyperparameter performance is generally good, with such regions then being investigated more closely using a fine-grained grid. This mixture of grid search and manual search techniques appears in many publications: for instance Hinton (2010), Larochelle et al. (2007). The downsides of grid search are rather obvious: if the dimension of the hyperparameter space is large and no prior knowledge about hyperparameter performance is given, grid search requires many evaluations to deliver good results, often at the expense of many useless computations.
Bergstra and Bengio (2012) introduce random search, which essentially replaces the fixed set of points by sampling points from some probability distribution. This has two main advantages: first, one may encode prior beliefs over the hyperparameter space by defining the probability distributions to draw from. Second, random search works better in scenarios of low effective dimensionality, which is the case if hyperparameter performance stays almost constant in one dimension of the hyperparameter space and changes drastically in another dimension.
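A minimal sketch of random search with user-defined sampling distributions; the log-uniform ranges in `sample_config` are illustrative placeholders, not taken from any particular publication:

```python
import random

def random_search(y, sample_config, T, seed=None):
    """Evaluate T configurations drawn from user-supplied distributions
    and return the best configuration together with its loss."""
    rng = random.Random(seed)
    best_x, best_y = None, float("inf")
    for _ in range(T):
        x = sample_config(rng)
        score = y(x)
        if score < best_y:
            best_x, best_y = x, score
    return best_x, best_y

# Example prior: log-uniform distributions for an SVM cost C and kernel
# width gamma (hypothetical ranges for illustration only).
def sample_config(rng):
    return {"C": 2 ** rng.uniform(-5, 6), "gamma": 10 ** rng.uniform(-4, 3)}
```

Because each draw is independent, random search also parallelizes trivially, unlike the sequential methods discussed next.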
2.2 Model-specific methods
Many methods for hyperparameter optimization exist which have been designed for a specific algorithm. These methods are usually based on genetic algorithms (Friedrichs and Igel 2005a; de Souza et al. 2006), although some are deterministic (Keerthi et al. 2007). Beyond this, there are many other methods for specific scenarios, including but not limited to methods for general regression and time-series models (McQuarrie and Tsai 1998), for regression when the sample size is small (Chapelle et al. 2002), for Bayesian topical trend analysis (Masada et al. 2009) and for log-linear models (Foo et al. 2007). Moreover, Schneider et al. deal with hyperparameter learning in probabilistic prototype-based models (Schneider et al. 2010), Seeger employs hyperparameter learning for large-scale hierarchical kernel methods (Seeger 2006), and Kapoor et al. are concerned with optimizing hyperparameters for graph-based semi-supervised classification models (Kapoor et al. 2005).
The major limitation of all of these methods is that they are specifically tailored to one particular model and only work well in certain scenarios. This is a drawback that more widely applicable SMBObased methods can alleviate.
2.3 Sequential model-based optimization
In order to overcome the issues of exhaustive search methods or model-specific methods, black-box optimization has been used in the context of sequential model-based optimization (SMBO) (Jones et al. 1998). SMBO learns a surrogate model on the observed hyperparameter performance, which is then queried to provide predictions for unobserved hyperparameter configurations. The predicted performance and the uncertainty of the surrogate model are then used within the expected improvement acquisition function to choose which of the many unobserved hyperparameter configurations to test next. The main strand of research along these lines has been committed to finding surrogate models, for example a Gaussian process (Rasmussen et al. 2005), which provided the so-called Spearmint method (Snoek et al. 2012). Other surrogate models, such as the random forests proposed in SMAC (Hutter et al. 2011), have also been investigated. While this earlier work focuses on optimizing hyperparameters only, Auto-WEKA (Thornton et al. 2013) has shown that algorithm selection can be considered in a similar fashion to hyperparameters, and so the existing work is capable of choosing both algorithms and hyperparameters in combination.
Additionally, research on including metadata, i.e. observations of hyperparameter performance on other data sets, has been gaining a lot of attention. Bardenet et al. (2013) use \(\text {SVM}^{\text {RANK}}\) as a surrogate and thus consider the hyperparameter selection as a ranking task rather than a regression task. In this way, they are able to overcome the issues of different data sets having different performance levels. To estimate uncertainties, they train a Gaussian process on the output of the \(\text {SVM}^{\text {RANK}}\), in order to compute expected improvement. Moreover, Gaussian processes with a meta-kernel have been proposed (Swersky et al. 2013; Yogatama and Mann 2014). Finally, neural networks have been used as surrogate models in combination with a factorization machine (Rendle 2010) in the input layer (Schilling et al. 2015).
2.4 Learning curve predictions
The idea of using metadata in SMBO is to find better-performing prediction models within a smaller fraction of time. Another idea applicable to models that are learned in an iterative fashion is to predict the learning curve, i.e. the performance of the resulting model, after a number of epochs. For example, Domhan et al. (2015) predict the performance of the hyperparameter configuration based on the partially observed learning curve after a few iterations. If the final performance is likely to be worse than the current best configuration, then the process is stopped and the configuration discarded, and the optimization continues with another, different configuration. Swersky et al. (2014) propose a similar approach which never discards a configuration, but instead learns the models for various hyperparameter configurations at the same time and switches from one learning process to another if it turns out to be more promising.
Other hyperparameter optimization approaches not related to SMBO follow the same idea. There are, for example, some population-based approaches, such as Successive Halving (Jamieson and Talwalkar 2016) and Hyperband (Li et al. 2016), which choose a set of hyperparameter configurations at random and incrementally train the learning algorithms in parallel. From time to time, the weakest configurations are discarded. The Racing Algorithm by Maron and Moore (1997) follows a similar principle but is focused on lazy learners, where the expensive part is testing rather than training.
2.5 Meta-initializations
There are several strategies to find a set of initial configurations for hyperparameter optimization methods. Reif et al. (2012) propose to initialize a hyperparameter search based on genetic algorithms with the best hyperparameters on other data sets, where the similarity of data sets is defined through meta-features. Feurer et al. (2014) propose the same idea for SMBO, which was later extended (Feurer et al. 2015; Wistuba et al. 2015). The drawback of these approaches is that they do not consider whether the initial hyperparameter configurations are very close to each other, and they may therefore waste computation time by initially choosing hyperparameters that are too similar. Thus, one of our previous works proposes to learn a set of initial hyperparameter configurations by optimizing a meta-loss that maximizes the overall improvement on the metadata (Wistuba et al. 2015).
2.6 Meta-features
Meta-features are descriptive characteristics of a data set and thus an essential component of all traditional meta-learning methods that learn across problems. In this work, we use pairwise comparisons of the performance of two hyperparameter configurations on one data set compared to another. This is a very special instance of landmarkers (Pfahringer et al. 2000). Landmark features are created by applying very fast machine learning algorithms (e.g. decision stumps, linear regression) to the data, with the resulting performance added as a meta-feature. In contrast, our approach uses only the performance of algorithms and hyperparameter configurations which we have evaluated during our optimization process, and thus, no additional time is spent on estimating these landmarkers. This idea has already been employed by others (Leite et al. 2012; Sun and Pfahringer 2013; Wistuba et al. 2015). In contrast to their work, we propose a way of using these meta-features also in cases with continuous hyperparameters, since for continuous hyperparameters it is very unlikely that we have seen the same hyperparameter configurations for all data sets. The approach of pairwise comparisons proposed in the literature works only if we either want to find the best algorithm and ignore the hyperparameters (Sun and Pfahringer 2013) or discretize the hyperparameters (Leite et al. 2012; Wistuba et al. 2015). We overcome this problem by predicting the performance of a hyperparameter configuration if it is not part of our metadata set.
2.7 Other approaches
Furthermore, there also exist strategies to optimize hyperparameters that are based on optimization techniques from artificial intelligence, such as tabu search (Cawley 2001), particle swarm optimization (Guo et al. 2008) and evolutionary algorithms (Friedrichs and Igel 2005b). Since none of these strategies use information from previous experiments, metadata can be added analogously to the SMBO counterpart by using it in the initialization (Gomes et al. 2012; Reif et al. 2012). Another interesting recent proposition is the use of bandit optimization techniques for automatic machine learning (Hoffman et al. 2014).
3 Problem definition
In this section we will formally define the problem of hyperparameter optimization and introduce the notation that will be used in the remainder of the paper. We will follow the notation that was introduced by Bergstra and Bengio (2012), but extend it to account for a more general problem of hyperparameter optimization by also including model choice and other tasks.
Let \({\mathcal {D}}\) denote the space of all data sets and let \(\mathcal {M}\) denote the space of all models. Thus, \({\mathcal {D}}\) consists of all possible data sets, where instances might have a vector representation, but can also be images, time-series, or similar representations.
4 Sequential modelbased optimization
Many different surrogate models have been proposed in recent years, for instance Gaussian processes (Bardenet et al. 2013; Snoek et al. 2012; Swersky et al. 2013; Yogatama and Mann 2014) in different variations. Additionally, random forests (Hutter et al. 2011) and neural networks (Schilling et al. 2015) have been employed. What all these surrogate models have in common is that they are relatively easy and fast to evaluate, at least in comparison to evaluating \(y\), while still being able to learn complex functions. Using linear models to estimate a response surface such as the one in Fig. 1 would clearly lead to poor results. Additionally, a growing observation history enables the surrogate model to better approximate the true response surface of \(y\). One key ingredient of surrogate models in the SMBO framework is that, besides predicting the validation performance accurately, they also have to give an estimation of their uncertainty, i.e. predict a probability distribution instead of a single value. In our work, the surrogate model \(\varPsi \) will predict for each configuration \(x\) a posterior mean \(\mu \left( \varPsi \left( x\right) \right) \) and a standard deviation \(\sigma \left( \varPsi \left( x\right) \right) \).
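As an illustration of how a Gaussian process surrogate produces both a posterior mean and a standard deviation, the following minimal sketch implements the standard closed-form GP regression equations with a plain RBF kernel of unit signal variance. This is a simplification for exposition (the experiments in this paper use an SE-ARD kernel, and a practical implementation would use a Cholesky solve instead of an explicit inverse):

```python
import numpy as np

def gp_posterior(X_train, y_train, X_test, length_scale=1.0, noise=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP with an
    RBF kernel of unit signal variance, i.e. k(x, x) = 1."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / length_scale ** 2)
    K = k(X_train, X_train) + noise * np.eye(len(X_train))
    K_s = k(X_test, X_train)
    K_inv = np.linalg.inv(K)
    mu = K_s @ K_inv @ y_train                      # posterior mean
    # posterior variance: k(x*, x*) - k_*^T K^{-1} k_*, with k(x*, x*) = 1
    var = 1.0 - np.einsum("ij,jk,ik->i", K_s, K_inv, K_s)
    return mu, np.sqrt(np.maximum(var, 0.0))
```

At observed points the mean reproduces the training targets and the standard deviation collapses towards zero, while it grows away from observations, which is exactly the uncertainty signal the acquisition function exploits.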
While GP-LCB is the best acquisition function to explain how exploration and exploitation can be balanced, we use the expected improvement in our experiments, which achieves a similar effect but works slightly differently. At the end of this section, we show an example of how a balanced trade-off between exploration and exploitation is achieved by expected improvement.
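For minimization, expected improvement has the well-known closed form \(EI(x) = (y_{\min } - \mu )\,\varPhi (z) + \sigma \,\phi (z)\) with \(z = (y_{\min } - \mu )/\sigma \), where \(\varPhi \) and \(\phi \) are the standard normal CDF and PDF. It can be computed directly from the surrogate's mean and standard deviation; a minimal sketch:

```python
import math

def expected_improvement(mu, sigma, y_best):
    """EI for minimization: E[max(y_best - Y, 0)] with Y ~ N(mu, sigma^2).
    Large sigma (exploration) and small mu (exploitation) both raise EI."""
    if sigma <= 0.0:
        return max(y_best - mu, 0.0)  # degenerate, deterministic prediction
    z = (y_best - mu) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))   # standard normal CDF
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (y_best - mu) * cdf + sigma * pdf
```

Note that a configuration with a mediocre predicted mean can still score highly if its predictive uncertainty is large, which is how expected improvement trades exploration against exploitation.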
4.1 SMBO using meta-information
The majority of work has focused on devising surrogate models that can effectively integrate the meta-knowledge, for example by being trained on the observation history prior to starting SMBO on a new data set. In order to do so, we have to add so-called meta-features that describe the characteristics of a data set, as otherwise the surrogate model would be unable to differentiate between instances if the same hyperparameter configuration has been used. Compared to the work in traditional meta-learning, only a few (three or four) meta-features have been used for surrogate models (Bardenet et al. 2013; Yogatama and Mann 2014; Schilling et al. 2015). We extended the number of meta-features in our previous work (Schilling et al. 2016; Wistuba et al. 2016) and will use the same meta-features in this work for all methods. We have listed all of these meta-features in Table 1. To simplify the notation, we will from now on assume that by writing D as a variable of \(y\), the meta-features of D are automatically included.
Several surrogate models are explicitly designed for handling metadata (Bardenet et al. 2013; Yogatama and Mann 2014; Schilling et al. 2015). Given the importance of these approaches for this work, we explain how they work and how metadata is used in detail in the upcoming subsections.
4.1.1 Surrogate collaborative tuning (SCoT)
Bardenet et al. (2013) were the first to propose a surrogate model in conjunction with metadata, showing how to learn a single surrogate model over observations from many data sets. Since the same algorithm applied to different data sets leads to loss values that can differ significantly in scale, they recommend tackling this problem using a ranking model instead of a regression model. Finally, they propose to use \(\text {SVM}^{\text {RANK}}\) with an RBF kernel to learn a ranking of hyperparameter configurations per data set. The ranking itself does not provide uncertainty estimations which are needed for the acquisition function, and thus, Bardenet et al. finally fit a Gaussian process to the ranking in order to provide this.
4.1.2 Gaussian process with multi-kernel learning (MKL-GP)
4.1.3 Factorized multilayer perceptron (FMLP)
Table 1  The list of meta-features used in our experiments for all methods

- Number of classes
- Number of instances
- Log number of instances
- Number of features
- Log number of features
- Data set dimensionality
- Log data set dimensionality
- Inverse data set dimensionality
- Log inverse data set dimensionality
- Class cross entropy
- Class probability min
- Class probability max
- Class probability mean
- Class probability standard deviation
- Kurtosis min
- Kurtosis max
- Kurtosis mean
- Kurtosis standard deviation
- Skewness min
- Skewness max
- Skewness mean
- Skewness standard deviation
5 Gaussian processes
As mentioned earlier in this paper, our main focus is on learning product ensembles of Gaussian processes over different parts of the metadata. However, we first want to remind the reader of the definition and learning of Gaussian processes, and to discuss their advantages and disadvantages.
From the nature of SMBO, our observation history is incremented by one additional instance with each SMBO trial that we undertake. For the kernel matrix, this means only one additional column and row, consisting of the kernel function evaluated for the new point and all the old points. However, after adding the new observation, we have to invert K again.
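Instead of inverting K from scratch after every trial, the inverse can also be grown incrementally: if \(K^{-1}\) is known and one row and column are appended, the new inverse follows from the Schur complement in \({\mathcal {O}}(n^{2})\) operations. A sketch of this standard block-inversion update (an optimization the SMBO loop could use; the paper's implementation may simply re-invert):

```python
import numpy as np

def grow_inverse(K_inv, b, c):
    """Given K^{-1} for a symmetric n x n matrix K, return the inverse of
    the (n+1) x (n+1) bordered matrix [[K, b], [b^T, c]] via the Schur
    complement s = c - b^T K^{-1} b, in O(n^2) instead of O(n^3)."""
    Kb = K_inv @ b
    s = c - b @ Kb                        # Schur complement (a scalar)
    n = len(b)
    new_inv = np.empty((n + 1, n + 1))
    new_inv[:n, :n] = K_inv + np.outer(Kb, Kb) / s
    new_inv[:n, n] = -Kb / s
    new_inv[n, :n] = -Kb / s              # symmetric off-diagonal block
    new_inv[n, n] = 1.0 / s
    return new_inv
```

Here `b` is the vector of kernel evaluations between the new point and all old points, and `c` is the kernel value of the new point with itself (plus observation noise).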
6 Scalable Gaussian process transfer surrogate framework
In the previous section we discussed the learning of Gaussian processes, where the most computationally expensive step is the inversion of the kernel matrix K, which is of size \(n \times n\) given n training instances. In the scenario where we are in possession of large-scale metadata, learning a Gaussian process becomes infeasible, as inverting K requires \({\mathcal {O}}(n^{3})\) computations; however, Gaussian processes are a natural choice as surrogate models for SMBO, as they naturally predict uncertainties and are essentially hyperparameter-free.
Beyond the computational challenges, learning a Gaussian process on all training instances makes the strong assumption that each training instance and data set is equally important. This issue is usually addressed by adding meta-features, which leads to an indirect representation of similarity between data sets and their influence. We propose a framework that tackles both issues by making Gaussian processes scalable and by making the influence of each data set within the metadata explicit.
6.1 Product of experts
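A generalized product-of-experts combination multiplies the experts' Gaussian predictions, each raised to a weight \(\beta _i\); the product is again Gaussian, with precision equal to the weighted sum of the experts' precisions. A minimal sketch of this aggregation (the specific weighting scheme used in the paper may differ):

```python
import numpy as np

def poe_combine(mus, sigmas, betas):
    """Generalized product-of-experts aggregation: expert i contributes a
    Gaussian N(mu_i, sigma_i^2) raised to the power beta_i. The normalized
    product is Gaussian with precision sum_i beta_i / sigma_i^2 and mean
    equal to the precision-weighted average of the expert means."""
    mus, sigmas, betas = map(np.asarray, (mus, sigmas, betas))
    precisions = betas / sigmas ** 2
    var = 1.0 / precisions.sum()
    mu = var * (precisions * mus).sum()
    return mu, np.sqrt(var)
```

A useful property of this combination is that confident experts (small \(\sigma _i\)) dominate the joint mean, while disagreeing or uncertain experts mainly inflate the joint variance.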
6.2 Kernel regression
Computing the squared Euclidean distance of two meta-feature vectors then yields the number of discordant pairs, normalized by dividing by the number of all pairs. This is essentially a distance function based on the Kendall rank correlation coefficient (Kendall 1938). In this way, during the SMBO process the coefficients are adapted after each iteration, where the data sets that agree on more hyperparameter pairs with the target data set are weighted higher. This has been shown to improve the performance drastically.
7 Transfer acquisition function framework
The transfer surrogate models of the previous section provide very good results as we will see in the experiments section. Nevertheless, these models face two important problems: first, each data set has different scales of evaluation scores which are reconstructed by each of the experts, and thus, the weighted average will likely not meet the scale of the new data set. One way to overcome this is to normalize the performance metadata by data set and also create an approximated normalization of the new data set. This is more a workaround than a solution because the problem remains to some degree and the approximated normalization for the new data set is inaccurate in the very beginning.
The weights for the transfer acquisition function framework can be set analogously to the propositions in Sects. 6.1 and 6.2.
8 Experiments and results
8.1 Metadata set creation
We have created two metadata sets in total, both for the purpose of learning classification models.
The first metadata set contains the predictive hyperparameter performances for running an SVM on 50 different data sets, all taken from the UCI repository. We selected the data sets at random and decided to use an SVM as a classifier since it is one of the most popular classification tools. We learn a linear SVM, an SVM with RBF kernel, and lastly an SVM with a polynomial kernel. The hyperparameters are then the choice of kernel, the cost of the slack variables C and—depending on which kernel we choose—the kernel width \(\gamma \) for RBF and the degree d for the polynomial kernel. We chose C from \(C\in \left\{ 2^{-5},\ldots ,2^{6}\right\} \), \(\gamma \) was searched in \(\gamma \in \left\{ 10^{-4},10^{-3},10^{-2},0.05,0.1,0.5,1,2,5,10,20,50,10^{2},10^{3}\right\} \) and the polynomial degree was optimized over \(d\in \left\{ 2,\ldots ,10\right\} \). This amounts to 288 different configurations per data set in total. If a data set was already split, we merged all splits and created a new 80% training and 20% validation split. For running the SVM we used the implementation by Tsochantaridis et al. (2004). The creation of this metadata set took 160 CPU hours.
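As a sanity check, the kernel-specific grids multiply out to the stated 288 configurations per data set, assuming 12 values of C; the linear kernel uses only C, the RBF kernel adds the 14 values of \(\gamma \), and the polynomial kernel adds the 9 degrees:

```python
# Grid sizes as described above (exponents spaced by 1 for C)
C_values = [2 ** e for e in range(-5, 7)]                         # 12 values
gamma_values = [1e-4, 1e-3, 1e-2, 0.05, 0.1, 0.5,
                1, 2, 5, 10, 20, 50, 1e2, 1e3]                    # 14 values
degrees = list(range(2, 11))                                      # 9 values

n_linear = len(C_values)                       # linear kernel: C only
n_rbf = len(C_values) * len(gamma_values)      # RBF kernel: C and gamma
n_poly = len(C_values) * len(degrees)          # polynomial kernel: C and d
total = n_linear + n_rbf + n_poly
print(total)  # 288
```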
The other metadata set is extended so that it contains both hyperparameter performance on different data sets and performance of a set of different algorithms. In order to accomplish this, we used WEKA (Holmes et al. 1994) to run 19 different classifiers on 59 data sets for a total of 21,871 hyperparameter configurations to evaluate per data set. An overview of the employed classifiers can be seen in Table 2. In total, this sums up to roughly 1.3 million experiments. The overall computation of this metadata set took about 900 CPU hours.
The meta-target for both metadata sets is the classification error. We cannot assume that all approaches would work for other error metrics such as the logarithmic loss. We did not conduct experiments for other loss metrics, but we do not expect different results since the problem remains the same: finding the global minimum of the function \(y\).
Table 2  Overview of all classifiers used within the WEKA metadata set

REP tree, Random tree, Random forest, LMT, J48, Decision stump, ZeroR, PART, OneR, RIPPER, Decision table, KStar, IBk, SMO, Simple logistic, MLP, Logistic regression, Naive Bayes, Bayes net
8.2 Competing optimization strategies
We compare our proposed optimization strategies to a large set of state-of-the-art optimization strategies, each of which is described in detail below.
Random Search This is a relatively simple baseline that chooses hyperparameter configurations at random. Nevertheless, Bergstra and Bengio (2012) have shown that this strategy can outperform grid search, especially for algorithms with hyperparameters that have a low effective dimensionality.
Independent Gaussian Process (IGP) The use of Gaussian processes as a surrogate model goes back to Jones et al. (1998). It was first proposed for the problem of hyperparameter optimization in machine learning by Snoek et al. (2012) under the name Spearmint. In our experiments we use a squared-exponential kernel with automatic relevance determination (SE-ARD), as we do for all other optimization methods that are based on a Gaussian process. This surrogate model does not use any knowledge from previous experiments.
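The SE-ARD kernel assigns one length scale per input dimension, so irrelevant hyperparameter dimensions can effectively be switched off by learning a large length scale for them. A minimal sketch of the kernel function:

```python
import numpy as np

def se_ard(x1, x2, length_scales, signal_var=1.0):
    """Squared-exponential kernel with automatic relevance determination:
    k(x1, x2) = s^2 * exp(-0.5 * sum_d ((x1_d - x2_d) / l_d)^2),
    with one length scale l_d per input dimension."""
    x1, x2, ell = map(np.asarray, (x1, x2, length_scales))
    return signal_var * np.exp(-0.5 * np.sum(((x1 - x2) / ell) ** 2))
```

In practice, the length scales and signal variance are fitted by maximizing the marginal likelihood of the Gaussian process; a large learned \(l_d\) signals that dimension d barely influences the response surface.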
Independent Random Forest (IRF) This surrogate model is very similar to IGP but uses a random forest instead of a Gaussian process. It was proposed by Hutter et al. (2011) under the name SMAC and is applied in auto-sklearn (Feurer et al. 2015) and Auto-WEKA (Thornton et al. 2013).
Initialization for IGP and IRF (IGP (init) and IRF (init)) Because IGP and IRF do not consider any metadata, we also evaluate both surrogate models with a meta-initialization (Wistuba et al. 2015).
Surrogate Collaborative Tuning (SCoT) SCoT (Bardenet et al. 2013) was the first transfer surrogate model to be proposed. It tries to rank the hyperparameter configurations instead of reconstructing the hyperparameter response surface, using \(\text {SVM}^{\text {RANK}}\) to predict a ranking. Since the ranker does not provide any uncertainties, a Gaussian process is fitted on the output of the ranker. The authors originally proposed to use the RBF kernel for \(\text {SVM}^{\text {RANK}}\), but due to the computational complexity we follow the lead of Yogatama and Mann (2014) and use the linear kernel instead.
Gaussian Process with Multi-Kernel Learning (MKL-GP) Yogatama and Mann (2014) proposed another transfer surrogate model, which learns a Gaussian process with a specific kernel combination on all instances. The kernel is a linear combination of the SE-ARD kernel and a kernel modeling the similarity between data sets based on a set of meta-features. To tackle the problem of differently scaled hyperparameter response surfaces on different data sets, they propose to normalize the target.
Factorized Multilayer Perceptron (FMLP) FMLP (Schilling et al. 2015) is another transfer surrogate model that uses a specific neural network to learn the similarity between data sets implicitly in a latent representation.
Scalable Gaussian Process Transfer Surrogate (SGPT) This is the framework we propose in this work. We distinguish different instances of it depending on how the weights are chosen. SGPT-PoE chooses the weights according to Sect. 6.1 and hence is based on a product of experts. We also include the kernel regression method introduced in Sect. 6.2, with the meta-feature data set descriptors (SGPT-M) and with the pairwise hyperparameter performance descriptors (SGPT-R).
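Structurally, the final surrogate is a weighted sum of per-data-set surrogates plus one surrogate for the target observations (see the framework description above). The sketch below shows only that combination step; the actual weighting schemes are those of Sects. 6.1 and 6.2, and all names here are ours:

```python
import numpy as np

def ensemble_predict(x, experts, weights):
    """Combine per-data-set surrogates into one joint prediction.

    `experts` is a list of callables, each returning the predictive mean
    of one data-set-specific Gaussian process at x (the last one may be
    the surrogate trained on the target observations only).
    `weights` are non-negative and are normalized to sum to one."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return sum(wi * f(x) for wi, f in zip(w, experts))
```

Because each expert is trained independently, adding a new meta-data set only adds one more small Gaussian process instead of enlarging a single joint covariance matrix.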
Transfer Acquisition Function (TAF) In Sect. 7 we proposed an acquisition function that makes use of metadata. We combine it with the surrogate model used by IGP and distinguish different versions depending on how the weights are chosen, as for SGPT.
The kernel parameters of the Gaussian processes are learned by maximizing the marginal likelihood on the meta-training set (Eq. 24). All hyperparameters of the tuning strategies are optimized in a leave-one-data-set-out cross-validation on the meta-training set.
The reported results were estimated using a leave-one-data-set-out cross-validation and are the average of ten repetitions. For strategies with random initialization (Random, IGP, IRF), we report the average over one thousand repetitions due to the higher variance. For those strategies that use meta-features, we use the meta-features described in Table 1.
8.3 Evaluation metrics
We compare all optimization strategies with respect to three different evaluation measures.
8.3.1 Average rank
At each time step t, all optimization strategies are ranked for each data set \(D_{i}\) according to the best score achieved so far; the better the score, the smaller the rank. In case of ties, the average rank is used. Finally, the ranks are averaged over all data sets, yielding the average rank.
8.3.2 Average distance to the global minimum
8.3.3 Fraction of unsolved data sets
8.4 Scalability experiment
As discussed in Sect. 5, Gaussian processes are computationally expensive: training time is cubic in the number of training instances and still quadratic when updating the model. Our proposed surrogate model SGPT makes use of Gaussian processes in a scalable way. Consider d data sets, on each of which n observations of hyperparameter performance have been made. A typical way of using metadata for SMBO (Bardenet et al. 2013; Yogatama and Mann 2014) is to train a Gaussian process on all instances, which has an asymptotic training time of \({\mathcal {O}}\left( d^{3}n^{3}\right) \). We propose instead to learn an independent Gaussian process for each data set, which reduces the training time to \({\mathcal {O}}\left( dn^{3}\right) \) and is no longer cubic in the number of data sets. Still, the complexity of both methods is cubic in the number of instances per data set. In an empirical evaluation we show that our method is nevertheless feasible, while the state-of-the-art exceeds an acceptable run time.
We created an artificial metadata set with \(d=50\) data sets and 5 hyperparameters. The number of instances per data set n varies from 10 to 190. We estimated the run time of a Gaussian process on the full data and of SGPT for different n. The results are visualized in Fig. 4. At a point where the full GP needs almost 7 hours of training, SGPT needs only about 2 minutes. One could further improve the scalability by learning multiple Gaussian processes per data set. To achieve this, the subsets \(X^{(i)},\,y^{(i)}\) defined in Eq. 29 have to be divided further. One could, for example, learn an individual Gaussian process for each of the three SVM kernels and then apply the method of Sect. 6.1.
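The complexity argument can be made concrete with a toy cost model that counts only the dominant cubic term; `full_gp_cost` and `sgpt_cost` are illustrative names of ours:

```python
def full_gp_cost(d, n):
    """Asymptotic training cost of one joint GP fitted on all d*n
    meta-instances: inverting a (d*n) x (d*n) covariance matrix."""
    return (d * n) ** 3

def sgpt_cost(d, n):
    """Asymptotic cost of d independent GPs with n instances each:
    d inversions of an n x n covariance matrix."""
    return d * n ** 3
```

The ratio of the two costs is exactly \(d^{2}\); for \(d=50\) data sets this is a factor of 2500, consistent with the hours-versus-minutes gap observed in Fig. 4.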
As discussed earlier, the cubic run time makes Gaussian processes unattractive for large metadata sets. Hence, our main goal was to achieve run times for Gaussian processes that are competitive with other models such as the neural networks used by FMLP. Figure 4 shows that our approach needs a run time very similar to that of FMLP; for fewer instances, it is even faster.
8.5 Predictive performance in SMBO
We have provided theoretical and empirical evidence that our method scales better than the current state-of-the-art methods. Now our aim is to provide empirical evidence that our proposed methods are also competitive in terms of predictive performance, both for the task of hyperparameter optimization and for combined algorithm selection and hyperparameter optimization.
8.5.1 Evaluating the scalable Gaussian process transfer surrogate framework
8.5.2 Comparison to other competitor methods
The results for the task of combined algorithm selection and hyperparameter optimization are presented in Fig. 8. For this task, we do not compare our methods to SCoT and MKL-GP because they are based on a Gaussian process trained on the full metadata set; since this metadata set is too large, we were not able to conduct these experiments. Instead, we compare our methods to a Gaussian process and a random forest that use a meta-initialization.
8.5.3 Evaluating the transfer acquisition framework (TAF)
Our motivation for introducing TAF was to get rid of the problem of differently scaled data sets and to address the question of how to employ metadata adaptively. It is a direct extension of SGPT that uses metadata in the acquisition function instead of in the surrogate model. To provide empirical evidence that we overcame the problems faced by many transfer surrogate models, we first compare TAF to the different versions of SGPT. As a reminder, the postfix “PoE” denotes the variant that chooses the weights according to Sect. 6.1 and is based on a product of experts. The postfixes “M” and “R” indicate the variants based on kernel regression with the meta-feature data set descriptors and the pairwise hyperparameter performance descriptors, respectively (Sect. 6.2). We conducted our experiments on our two metadata sets and obtained the results summarized in Figs. 9 and 10. TAF-R obtains the best results. We see a strong improvement of TAF-M and TAF-PoE over SGPT-M and SGPT-PoE, respectively. As mentioned before, SGPT-M in particular uses static weights that do not account for the progress of the optimization process; hence, the impact of the metadata remains constant, which is unfavorable. The weights of TAF-M are the same and thus also remain constant, but the improvement on the metadata shrinks over time as better \(x\) are found. Hence, we observe an adaptive use of metadata that leads to the large improvement over SGPT-PoE and SGPT-M. Since SGPT-R already has a mechanism that decays the influence of the metadata, SGPT-R and the TAF approaches provide similarly good results on the SVM metadata set in Fig. 9, which likely results from the better way of tackling the problem of different scales.
The improvement of TAF-R over SGPT-R becomes significant on the WEKA metadata set, as presented in Fig. 10. We saw in the previous experiments that SGPT-R gets off to a good start on this problem but then stagnates and struggles to further improve the solution. TAF-R overcomes this issue: after a similarly good start, it continues its search and finds the optimum for more than 50% of the data sets within 300 trials.
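The adaptive behavior discussed above can be sketched schematically. We assume here, for illustration only, that the transfer acquisition function combines the target's expected improvement with weighted, non-negative predicted improvements on the source data sets; names and signatures are ours and not the exact formulation of Sect. 7:

```python
import math

def expected_improvement(mu, sigma, y_best):
    """Closed-form expected improvement for a Gaussian predictive
    distribution, for minimization of y."""
    if sigma <= 0.0:
        return max(0.0, y_best - mu)
    z = (y_best - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (y_best - mu) * cdf + sigma * pdf

def transfer_acquisition(ei_target, source_improvements, weights):
    """Hypothetical sketch of a transfer acquisition function: a weighted
    combination of the target EI (weights[0]) and the predicted
    improvements on the source data sets (weights[1:]). As better
    configurations are found, the clipped source improvements shrink
    toward zero, so the metadata's influence decays automatically even
    though the weights themselves stay constant."""
    total = weights[0] * ei_target + sum(
        w * max(0.0, imp) for w, imp in zip(weights[1:], source_improvements))
    return total / sum(weights)
```

Once no candidate improves on the incumbents of the source data sets, the source terms vanish and the acquisition reduces to the (scaled) target EI, which mirrors the adaptive decay observed for TAF-M and TAF-PoE.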
8.5.4 Overall comparison
We conclude our experiments by comparing the best methods from the TAF and SGPT frameworks with the strongest state-of-the-art competitors. Figure 11 presents the results for finding optimal hyperparameter configurations for a kernel SVM. Both TAF-R and SGPT-R outperform the competitor methods with respect to all three metrics and are approximately comparable to each other. For the WEKA data set the story is different. In Fig. 12 it becomes clear that SGPT-R has a good start but then fails to further improve the solution. FMLP outperforms it after some time, and other, simpler competitor methods also reach comparable performance. Thanks to its better way of adaptively using metadata and of handling different data set scales, TAF-R is once again the best method. While it is slightly worse than SGPT-R at the beginning, TAF-R quickly outperforms it. Additionally, TAF-R is always stronger than the runner-up method FMLP. Thus, we consider our motivation for proposing TAF confirmed: even though it looks very similar to SGPT, TAF-R is more robust for the aforementioned reasons. We showed theoretically and empirically that our approach is faster than a Gaussian process learned on the full metadata, and our experiments show that its run time is approximately as low as that of the fastest transfer surrogate models. Hence, our proposed approach is not only effective, it is also efficient.
8.5.5 Significance analysis
For our analysis we added one further method, which we call Best. This is an oracle method that, for any data set, chooses the best performing hyperparameter configuration and algorithm in the first trial.
On the WEKA metadata set, the critical difference is 1.48 for \(p=0.05\). The post-hoc tests detect that IRF is significantly worse than any other method. After 30 trials, SGPT-R provides significantly better results than IGP, IRF and IRF with initialization. The experimental data is not sufficient to reach any conclusion regarding FMLP, TAF-R and IGP with initialization. After 200 trials, no statistically significant statement can be made about any method except IRF and Best.
8.5.6 Performance with respect to run time
In the previous experiments we assumed that the training time is always the same, no matter which configuration is chosen. The main reason for this is that none of the methods considers the run time needed for evaluating a configuration when choosing which configuration to test next; all of them try to find the global minimum in as few trials as possible. In practice, however, it is of course very important to consider the run time as well. Since for the WEKA metadata set not only the hyperparameters but also the model changes, we conducted a run time experiment for this particular metadata set. The results are reported in Fig. 17. In our experiments we used various data sets which differ in the number of predictors and instances. To ensure that each data set has equal influence on the average results, we normalized the run time: the run time for each data set is divided by the average training time needed for a model to be trained on this data set. Hence, a normalized run time of 1 in Fig. 17 shows the performance of all methods after investing the amount of time that is needed to train a single model on average. We report ADTM and the fraction of unsolved data sets for each method starting at the time when at least one configuration has been evaluated for all data sets. We report the average rank starting at the time when every method has evaluated at least one configuration for all data sets.
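As a minimal sketch of this normalization (function and variable names are ours):

```python
def normalized_runtime(elapsed_seconds, model_train_times):
    """Normalize the optimization time spent on one data set by the
    average time needed to train a single model on that data set.
    A value of 1 thus corresponds to the cost of training one model of
    average training time, so fast and slow data sets contribute
    equally to the averaged curves."""
    avg = sum(model_train_times) / len(model_train_times)
    return elapsed_seconds / avg
```

For example, 30 seconds of optimization on a data set whose models take 10 and 20 seconds to train corresponds to a normalized run time of 2.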
Our proposed method TAF-R still yields the best results. The biggest differences to the previous results are the changes for FMLP and IRF. IRF improves considerably: it evaluates far more configurations per time unit than any other method, which led to worse results under the previous evaluation protocol but pays off under this one. For FMLP it is exactly the other way round: it mainly focuses on time-consuming configurations. While it needs only a few configurations to find good models, it needs more time than most competitor methods.
9 Conclusions
In this work, we proposed a new transfer surrogate model framework that is able to scale to large metadata sets. Such considerations will become a necessity in the near future: with more experiments conducted every day, the metadata that can be employed also grows. To ensure scalability, our transfer surrogate model is built on Gaussian processes that are learned individually on the observed performances of a single data set each, and then combined into a joint surrogate model on the basis of product of experts as well as kernel regression models. Additionally, as Gaussian processes are essentially hyperparameter-free, we have created a strong but scalable surrogate model that does not require a second-stage hyperparameter optimization, as opposed to other surrogates in the related work. Overall, we derived different instances of our framework and evaluated them with respect to optimization performance and scalability on two metadata sets covering both hyperparameter optimization and model choice. We show empirically that our method is able to outperform existing methods with respect to both measures, choosing well-performing hyperparameter configurations while maintaining a small computational overhead.
Nevertheless, the transfer surrogate model still suffers from the problem caused by different performance scales on different data sets, which introduces a bias when selecting the next hyperparameter configuration to test. To mitigate this issue, we proposed to decay the influence of the metadata as the hyperparameter optimization progresses, and we found empirically that this strategy proves advantageous. Unfortunately, this decay relies on heuristics, which was our motivation to use the metadata within the acquisition function instead of the surrogate model to achieve a more principled effect. Analogously to the transfer surrogate framework, we derived a transfer acquisition framework that keeps the advantages of our previously proposed surrogate model but overcomes its disadvantages by directly using the improvement on the metadata. In a conclusive evaluation, we confirmed our intuition empirically on two metadata sets: the transfer acquisition framework shows even higher performance than the transfer surrogate model on the same metadata sets. Consequently, we recommend using the transfer acquisition framework for hyperparameter optimization, as it is fast and powerful in delivering well-performing hyperparameter configurations.
Possible future work involves the consideration of learning curves. As mentioned in the related work section, there exist many approaches that use the convergence behavior of the learning algorithm to predict the final performance. This prediction itself will help improve our method; additionally, learning curve forecasting methods will benefit from observations on other data sets. Furthermore, we want to investigate the case where one is not solely interested in reducing the loss of a predictor but also in other costs such as training time, prediction time or hardware constraints (Abdulrahman et al. 2018), which is relevant for edge devices. We saw in our experiments that trying to minimize only the classification error, regardless of the training time needed, can lead to non-optimal results. Therefore, we consider minimizing a loss function that combines classification error and training time, as suggested by Abdulrahman et al. (2018). Finally, we want to conduct experiments to see whether our method can be migrated to deep learning methods.
Acknowledgements
We would like to thank Prof. Pavel Brazdil and the anonymous reviewers for their valuable feedback that helped us improve this work. We are grateful for Bradley Baker’s help in perfecting our paper. We acknowledge the co-funding of our work by the German Research Foundation (DFG) under Grant SCHM 2583/6-1.
References
 Abdulrahman, S. M., Brazdil, P., van Rijn, J. N., & Vanschoren, J. (2018). Speeding up algorithm selection using average ranking and active testing by introducing runtime. In P. Brazdil & C. Giraud-Carrier (Eds.), Special issue on metalearning and algorithm selection. Machine Learning Journal, 107, 1.
 Bardenet, R., Brendel, M., Kégl, B., & Sebag, M. (2013). Collaborative hyperparameter tuning. In Proceedings of the 30th international conference on machine learning (pp. 199–207). ICML 2013, Atlanta, GA, USA, 16–21 June 2013.
 Bergstra, J., & Bengio, Y. (2012). Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 13, 281–305.
 Brazdil, P., Giraud-Carrier, C. G., Soares, C., & Vilalta, R. (2009). Metalearning—Applications to data mining. Cognitive technologies. Springer. https://doi.org/10.1007/9783540732631.
 Cavazos, J., & O’Boyle, M. F. P. (2006). Method-specific dynamic compilation using logistic regression. In Proceedings of the 21st annual ACM SIGPLAN conference on object-oriented programming, systems, languages, and applications (pp. 229–240). OOPSLA 2006, October 22–26, 2006, Portland, Oregon, USA.
 Cawley, G. C. (2001). Model selection for support vector machines via adaptive step-size Tabu search. In Proceedings of the international conference on artificial neural networks and genetic algorithms.
 Chapelle, O., Vapnik, V., & Bengio, Y. (2002). Model selection for small sample regression. Machine Learning, 48(1–3), 9–23.
 Corani, G., Benavoli, A., Demsar, J., Mangili, F., & Zaffalon, M. (2016). Statistical comparison of classifiers through Bayesian hierarchical modelling. CoRR abs/1609.08905. http://arxiv.org/abs/1609.08905.
 Czogiel, I., Luebke, K., & Weihs, C. (2006). Response surface methodology for optimizing hyper parameters. Tech. rep. https://eldorado.tudortmund.de/bitstream/2003/22205/1/tr0906.pdf.
 de Souza, B. F., de Carvalho, A., Calvo, R., & Ishii, R. P. (2006). Multi-class SVM model selection using particle swarm optimization. In Sixth international conference on hybrid intelligent systems (pp. 31–31). HIS’06, IEEE.
 Deisenroth, M. P., & Ng, J. W. (2015). Distributed Gaussian processes. International Conference on Machine Learning (ICML), 2, 5.
 Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7, 1–30. http://www.jmlr.org/papers/v7/demsar06a.html.
 Domhan, T., Springenberg, J. T., & Hutter, F. (2015). Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In Proceedings of the twenty-fourth international joint conference on artificial intelligence (pp. 3460–3468). IJCAI 2015, Buenos Aires, Argentina, July 25–31, 2015.
 Eggensperger, K., Lindauer, M., Hoos, H. H., Hutter, F., & Leyton-Brown, K. (2018). Efficient benchmarking of algorithm configuration procedures via model-based surrogates. In P. Brazdil & C. Giraud-Carrier (Eds.), Special issue on metalearning and algorithm selection. Machine Learning Journal, 107, 1.
 Feurer, M., Klein, A., Eggensperger, K., Springenberg, J. T., Blum, M., & Hutter, F. (2015). Efficient and robust automated machine learning. In Advances in neural information processing systems, Vol. 28: Annual conference on neural information processing systems 2015, December 7–12, 2015, Montreal, Quebec, Canada (pp. 2962–2970). http://papers.nips.cc/paper/5872efficientandrobustautomatedmachinelearning.
 Feurer, M., Springenberg, J. T., & Hutter, F. (2014). Using meta-learning to initialize Bayesian optimization of hyperparameters. In ECAI workshop on meta-learning and algorithm selection (MetaSel) (pp. 3–10).
 Feurer, M., Springenberg, J. T., & Hutter, F. (2015). Initializing Bayesian hyperparameter optimization via meta-learning. In Proceedings of the twenty-ninth AAAI conference on artificial intelligence, January 25–30, 2015, Austin, Texas, USA (pp. 1128–1135).
 Foo, C.-S., Do, C. B., & Ng, A. (2007). Efficient multiple hyperparameter learning for log-linear models. In Advances in neural information processing systems (pp. 377–384).
 Friedman, M. (1937). The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association, 32(200), 675–701. https://doi.org/10.1080/01621459.1937.10503522.
 Friedman, M. (1940). A comparison of alternative tests of significance for the problem of m rankings. The Annals of Mathematical Statistics, 11(1), 86–92.
 Friedrichs, F., & Igel, C. (2005). Evolutionary tuning of multiple SVM parameters. Neurocomputing, 64, 107–117. https://doi.org/10.1016/j.neucom.2004.11.022.
 Gomes, T. A. F., Prudêncio, R. B. C., Soares, C., Rossi, A. L. D., & Carvalho, A. C. P. L. F. (2012). Combining meta-learning and search techniques to select parameters for support vector machines. Neurocomputing, 75(1), 3–13. https://doi.org/10.1016/j.neucom.2011.07.005.
 Guo, X. C., Yang, J. H., Wu, C. G., Wang, C. Y., & Liang, Y. C. (2008). A novel LS-SVMs hyper-parameter selection based on particle swarm optimization. Neurocomputing, 71(16–18), 3211–3215. https://doi.org/10.1016/j.neucom.2008.04.027.
 Hinton, G. E. (1999). Products of experts. In Artificial neural networks, 1999. ICANN 99. Ninth international conference on (Conf. Publ. No. 470) (Vol. 1, pp. 1–6). IET.
 Hinton, G. (2010). A practical guide to training restricted Boltzmann machines. Momentum, 9(1), 926.
 Hoffman, M. D., Shahriari, B., & de Freitas, N. (2014). On correlation and budget constraints in model-based bandit optimization with application to automatic machine learning. In Proceedings of the seventeenth international conference on artificial intelligence and statistics (pp. 365–374). AISTATS 2014, Reykjavik, Iceland, April 22–25, 2014.
 Holmes, G., Donkin, A., & Witten, I. H. (1994). WEKA: A machine learning workbench. In Intelligent information systems, 1994. Proceedings of the 1994 second Australian and New Zealand conference on (pp. 357–361). IEEE.
 Hutter, F., Hoos, H. H., & Leyton-Brown, K. (2011). Sequential model-based optimization for general algorithm configuration. In Proceedings of the 5th international conference on learning and intelligent optimization, LION’05 (pp. 507–523). Berlin, Heidelberg: Springer.
 Jamieson, K. G., & Talwalkar, A. (2016). Non-stochastic best arm identification and hyperparameter optimization. In Proceedings of the 19th international conference on artificial intelligence and statistics (pp. 240–248). AISTATS 2016, Cadiz, Spain, May 9–11, 2016. http://jmlr.org/proceedings/papers/v51/jamieson16.html.
 Jones, D. R., Schonlau, M., & Welch, W. J. (1998). Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4), 455–492. https://doi.org/10.1023/A:1008306431147.
 Kamel, M. S., Enright, W. H., & Ma, K. S. (1993). ODEXPERT: An expert system to select numerical solvers for initial value ODE systems. ACM Transactions on Mathematical Software, 19(1), 44–62.
 Kanda, J., Soares, C., Hruschka, E. R., & de Carvalho, A. C. P. L. F. (2012). A meta-learning approach to select meta-heuristics for the traveling salesman problem using MLP-based label ranking. In Neural information processing—19th international conference (pp. 488–495). ICONIP 2012, Doha, Qatar, November 12–15, 2012, Proceedings, Part III.
 Kapoor, A., Ahn, H., Qi, Y., & Picard, R. W. (2005). Hyperparameter and kernel learning for graph based semi-supervised classification. In Advances in neural information processing systems (pp. 627–634).
 Keerthi, S., Sindhwani, V., & Chapelle, O. (2007). An efficient method for gradient-based adaptation of hyperparameters in SVM models. In Twenty-first annual conference on neural information processing systems. Vancouver, Canada.
 Kendall, M. G. (1938). A new measure of rank correlation. Biometrika, 30(1/2), 81–93. https://doi.org/10.2307/2332226.
 Larochelle, H., Erhan, D., Courville, A., Bergstra, J., & Bengio, Y. (2007). An empirical evaluation of deep architectures on problems with many factors of variation. In Proceedings of the 24th international conference on machine learning (pp. 473–480). ACM.
 Leite, R., Brazdil, P., & Vanschoren, J. (2012). Selecting classification algorithms with active testing. In Machine learning and data mining in pattern recognition—8th international conference (pp. 117–131). MLDM 2012, Berlin, Germany, July 13–20, 2012. Proceedings.
 Lemke, C., Budka, M., & Gabrys, B. (2015). Metalearning: A survey of trends and technologies. Artificial Intelligence Review, 44(1), 117–130. https://doi.org/10.1007/s10462-013-9406-y.
 Li, L., Jamieson, K. G., DeSalvo, G., Rostamizadeh, A., & Talwalkar, A. (2016). Efficient hyperparameter optimization and infinitely many armed bandits. CoRR abs/1603.06560. http://arxiv.org/abs/1603.06560.
 Maron, O., & Moore, A. W. (1997). The racing algorithm: Model selection for lazy learners. Artificial Intelligence Review, 11(1–5), 193–225. https://doi.org/10.1023/A:1006556606079.
 Masada, T., Fukagawa, D., Takasu, A., Hamada, T., Shibata, Y., & Oguri, K. (2009). Dynamic hyperparameter optimization for Bayesian topical trend analysis. In Proceedings of the 18th ACM conference on information and knowledge management (pp. 1831–1834). ACM.
 McQuarrie, A. D., & Tsai, C. L. (1998). Regression and time series model selection. Singapore: World Scientific.
 Michie, D., Spiegelhalter, D. J., Taylor, C. C., & Campbell, J. (Eds.). (1994). Machine learning, neural and statistical classification. Upper Saddle River, NJ: Ellis Horwood.
 Nadaraya, E. A. (1964). On estimating regression. Theory of Probability & Its Applications, 9(1), 141–142. https://doi.org/10.1137/1109020.
 Nareyek, A. (2004). Choosing search heuristics by non-stationary reinforcement learning (pp. 523–544). Boston, MA: Springer.
 Nemenyi, P. (1962). Distribution-free multiple comparisons. Biometrics, 18, 263.
 Pfahringer, B., Bensusan, H., & Giraud-Carrier, C. (2000). Meta-learning by landmarking various learning algorithms. In Proceedings of the seventeenth international conference on machine learning (pp. 743–750). Morgan Kaufmann.
 Rasmussen, C. E., & Williams, C. K. I. (2005). Gaussian processes for machine learning (Adaptive computation and machine learning). Cambridge, MA: The MIT Press.
 Reif, M., Shafait, F., & Dengel, A. (2012). Meta-learning for evolutionary parameter optimization of classifiers. Machine Learning, 87(3), 357–380. https://doi.org/10.1007/s10994-012-5286-7.
 Rendle, S. (2010). Factorization machines. In Data mining (ICDM), 2010 IEEE 10th international conference on (pp. 995–1000). IEEE.
 Rice, J. R. (1976). The algorithm selection problem. Advances in Computers, 15, 65–118. https://doi.org/10.1016/S0065-2458(08)60520-3.
 Schilling, N., Wistuba, M., & Schmidt-Thieme, L. (2016). Scalable hyperparameter optimization with products of Gaussian process experts. In Joint European conference on machine learning and knowledge discovery in databases (pp. 33–48). Springer.
 Schilling, N., Wistuba, M., Drumond, L., & Schmidt-Thieme, L. (2015). Hyperparameter optimization with factorized multilayer perceptrons. In Machine learning and knowledge discovery in databases—European conference. ECML PKDD 2015, Porto, Portugal, September 7–11, 2015. Proceedings, Part II.
 Schneider, P., Biehl, M., & Hammer, B. (2010). Hyperparameter learning in probabilistic prototype-based models. Neurocomputing, 73(7), 1117–1124.
 Seeger, M. (2006). Cross-validation optimization for large scale hierarchical classification kernel methods. In Advances in neural information processing systems (pp. 1233–1240).
 Smith-Miles, K. A. (2009). Cross-disciplinary perspectives on meta-learning for algorithm selection. ACM Computing Surveys, 41(1), 6:1–6:25. https://doi.org/10.1145/1456650.1456656.
 Snoek, J., Larochelle, H., & Adams, R. P. (2012). Practical Bayesian optimization of machine learning algorithms. In Advances in neural information processing systems 25: 26th Annual conference on neural information processing systems 2012 (pp. 2960–2968). Proceedings of a meeting held December 3–6, 2012, Lake Tahoe, Nevada, USA.
 Srinivas, N., Krause, A., Kakade, S., & Seeger, M. W. (2010). Gaussian process optimization in the bandit setting: No regret and experimental design. In Proceedings of the 27th international conference on machine learning (ICML-10) (pp. 1015–1022), June 21–24, 2010, Haifa, Israel.
 Sun, Q., & Pfahringer, B. (2013). Pairwise meta-rules for better meta-learning-based algorithm ranking. Machine Learning, 93(1), 141–161.
 Swersky, K., Snoek, J., & Adams, R. P. (2013). Multi-task Bayesian optimization. In Advances in neural information processing systems 26: 27th annual conference on neural information processing systems 2013 (pp. 2004–2012). Proceedings of a meeting held December 5–8, 2013, Lake Tahoe, Nevada, USA.
 Swersky, K., Snoek, J., & Adams, R. P. (2014). Freeze-thaw Bayesian optimization. Computing Research Repository. arXiv:1406.3896.
 Thornton, C., Hutter, F., Hoos, H. H., & Leyton-Brown, K. (2013). Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms. In Proceedings of the 19th ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’13 (pp. 847–855). ACM, New York, NY, USA. https://doi.org/10.1145/2487575.2487629.
 Tsochantaridis, I., Hofmann, T., Joachims, T., & Altun, Y. (2004). Support vector machine learning for interdependent and structured output spaces. In Proceedings of the twenty-first international conference on machine learning (p. 104). ACM.
 Vilalta, R., & Drissi, Y. (2002). A perspective view and survey of meta-learning. Artificial Intelligence Review, 18(2), 77–95. https://doi.org/10.1023/A:1019956318069.
 Wistuba, M., Schilling, N., & Schmidt-Thieme, L. (2015). Learning data set similarities for hyperparameter optimization initializations. In Proceedings of the 2015 international workshop on meta-learning and algorithm selection (pp. 15–26), Porto, Portugal, September 7th, 2015.
 Wistuba, M., Schilling, N., & Schmidt-Thieme, L. (2015). Learning hyperparameter optimization initializations. In International conference on data science and advanced analytics, DSAA 2015, Paris, France, October 19–21, 2015.
 Wistuba, M., Schilling, N., & Schmidt-Thieme, L. (2015). Sequential model-free hyperparameter tuning. In 2015 IEEE international conference on data mining (pp. 1033–1038). ICDM 2015, Atlantic City, NJ, USA, November 14–17, 2015. https://doi.org/10.1109/ICDM.2015.20.
 Wistuba, M., Schilling, N., & Schmidt-Thieme, L. (2016). Two-stage transfer surrogate model for automatic hyperparameter optimization. In Joint European conference on machine learning and knowledge discovery in databases (pp. 199–214). Springer.
 Xu, L., Hutter, F., Hoos, H. H., & Leyton-Brown, K. (2008). SATzilla: Portfolio-based algorithm selection for SAT. Journal of Artificial Intelligence Research (JAIR), 32, 565–606.
 Yogatama, D., & Mann, G. (2014). Efficient transfer learning method for automatic hyperparameter tuning. In International conference on artificial intelligence and statistics (AISTATS 2014).