Automated Machine Learning, pp 97–111
Hyperopt-Sklearn
Abstract
Hyperopt-sklearn is a software project that provides automated algorithm configuration of the Scikit-learn machine learning library. Following Auto-Weka, we take the view that the choice of classifier and even the choice of preprocessing module can be taken together to represent a single large hyperparameter optimization problem. We use Hyperopt to define a search space that encompasses many standard components (e.g. SVM, RF, KNN, PCA, TF-IDF) and common patterns of composing them together. We demonstrate, using search algorithms in Hyperopt and standard benchmarking data sets (MNIST, 20-Newsgroups, Convex Shapes), that searching this space is practical and effective. In particular, we improve on best-known scores for the model space for both MNIST and Convex Shapes at the time of release.
5.1 Introduction
Relative to deep networks, algorithms such as Support Vector Machines (SVMs) and Random Forests (RFs) have a small-enough number of hyperparameters that manual tuning and grid or random search provide satisfactory results. Taking a step back though, there is often no particular reason to use either an SVM or an RF when they are both computationally viable. A model-agnostic practitioner may simply prefer to go with the one that provides greater accuracy. In this light, the choice of classifier can be seen as a hyperparameter alongside the C-value in the SVM and the max-tree-depth of the RF. Indeed, the choice and configuration of preprocessing components may likewise be seen as part of the model selection/hyperparameter optimization problem.
The Auto-Weka project [19] was the first to show that an entire library of machine learning approaches (Weka [8]) can be searched within the scope of a single run of hyperparameter tuning. However, Weka is a GPL-licensed Java library, and was not written with scalability in mind, so we feel there is a need for alternatives to Auto-Weka. Scikit-learn [16] is another library of machine learning algorithms. It is written in Python (with many modules in C for greater speed), and is BSD-licensed. Scikit-learn is widely used in the scientific Python community and supports many machine learning application areas.
This chapter introduces Hyperopt-Sklearn: a project that brings the benefits of automated algorithm configuration to users of Python and scikit-learn. Hyperopt-Sklearn uses Hyperopt [3] to describe a search space over possible configurations of scikit-learn components, including preprocessing, classification, and regression modules. One of the main design features of this project is to provide an interface that is familiar to users of scikit-learn. With very few changes, hyperparameter search can be applied to an existing code base. This chapter begins with a background of Hyperopt and the configuration space it uses within scikit-learn, followed by example usage and experimental results with this software.
This chapter is an extended version of our 2014 paper introducing hyperopt-sklearn, presented at the 2014 ICML Workshop on AutoML [10].
5.2 Background: Hyperopt for Optimization

To use Hyperopt, a user must define or choose three things:

1. A search domain,
2. An objective function,
3. An optimization algorithm.
The search domain is specified via random variables, whose distributions should be chosen so that the most promising combinations have high prior probability. The search domain can include Python operators and functions that combine random variables into more convenient data structures for the objective function. Any conditional structure is defined within this domain. The objective function maps a joint sampling of these random variables to a scalar-valued score that the optimization algorithm will try to minimize.
An example search domain using Hyperopt is depicted below.
Here there are four parameters, one for selecting which case is active, and one for each of the three cases. The first case contains a positive valued parameter that is sensitive to log scaling. The second case contains a bounded real valued parameter. The third case contains a categorical parameter with three options.
Having chosen a search domain, an objective function, and an optimization algorithm, Hyperopt’s fmin function carries out the optimization, and stores results of the search to a database (e.g. either a simple Python list or a MongoDB instance). The fmin call carries out the simple analysis of finding the best-performing configuration, and returns that to the caller. The fmin call can use multiple workers when using the MongoDB backend, to implement parallel model selection on a compute cluster.
5.3 Scikit-Learn Model Selection as a Search Problem
Model selection is the process of estimating which machine learning model performs best from among a possibly infinite set of options. As an optimization problem, the search domain is the set of valid assignments to the configuration parameters (hyperparameters) of the machine learning model. The objective function is typically the measure of success (e.g. accuracy, F1-Score, etc.) on held-out examples. Often the negative degree of success (loss) is used to set up the task as a minimization problem, and cross-validation is applied to produce a more robust final score. Practitioners usually address this optimization by hand, by grid search, or by random search. In this chapter we discuss solving it with the Hyperopt optimization library. The basic approach is to set up a search space with random variable hyperparameters, use scikit-learn to implement the objective function that performs model training and model validation, and use Hyperopt to optimize the hyperparameters.
Scikit-learn includes many algorithms for learning from data (classification or regression), as well as many algorithms for preprocessing data into the vectors expected by these learning algorithms. Classifiers include, for example, K-Nearest-Neighbors, Support Vector Machines, and Random Forest algorithms. Preprocessing algorithms include transformations such as component-wise Z-scaling (StandardScaler) and Principal Component Analysis (PCA). A full classification algorithm typically includes a series of preprocessing steps followed by a classifier. For this reason, scikit-learn provides a pipeline data structure to represent and use a sequence of preprocessing steps and a classifier as if they were just one component (typically with an API similar to the classifier). Although hyperopt-sklearn does not formally use scikit-learn’s pipeline object, it provides related functionality. Hyperopt-sklearn provides a parameterization of a search space over pipelines, that is, of sequences of preprocessing steps and classifiers or regressors.
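For readers unfamiliar with the pipeline structure mentioned above, here is a small scikit-learn example. The particular steps, the digits data set, and the number of PCA components are arbitrary choices for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Two preprocessing steps followed by a classifier, used as one component.
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('pca', PCA(n_components=20)),
    ('svm', SVC(kernel='rbf')),
])

# The pipeline exposes the same fit/score interface as a bare classifier.
pipe.fit(X_train, y_train)
test_accuracy = pipe.score(X_test, y_test)
print(test_accuracy)
```

A search space over pipelines then amounts to choosing which steps appear in the list and how each one is configured.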
Although the total number of hyperparameters in the full configuration space is large, the number of active hyperparameters describing any one model is much smaller: a model consisting of PCA and a RandomForest, for example, would have only 12 active hyperparameters (1 for the choice of preprocessing, 2 internal to PCA, 1 for the choice of classifier and 8 internal to the RF). Hyperopt’s description language allows us to differentiate between non-conditional hyperparameters (which must always be assigned) and conditional hyperparameters (which may remain unassigned when they would be unused). We make use of this mechanism extensively so that Hyperopt’s search algorithms do not waste time learning by trial and error that e.g. RF hyperparameters have no effect on SVM performance. Even internally within classifiers, there are instances of conditional parameters: KNN has conditional parameters depending on the distance metric, and LinearSVC has 3 binary parameters (loss, penalty, and dual) that admit only 4 valid joint assignments. Hyperopt-sklearn also includes a blacklist of (preprocessing, classifier) pairs that do not work together, e.g. PCA and MinMaxScaler were incompatible with MultinomialNB, TF-IDF could only be used for text data, and the tree-based classifiers were not compatible with the sparse features produced by the TF-IDF preprocessor. Allowing for a 10-way discretization of real-valued hyperparameters, and taking these conditional hyperparameters into account, a grid search of our search space would still require an infeasible number of evaluations (on the order of 10^{12}).
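The LinearSVC case can be checked empirically. The sketch below tries all eight combinations of the three binary parameters on a small synthetic problem and keeps those that scikit-learn accepts. The value names `hinge`, `squared_hinge`, `l1`, and `l2` follow current scikit-learn releases, and the toy data set is an assumption of this example.

```python
import warnings
from itertools import product

from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=60, n_features=5, random_state=0)

valid = []
for loss, penalty, dual in product(['hinge', 'squared_hinge'],
                                   ['l1', 'l2'],
                                   [True, False]):
    try:
        with warnings.catch_warnings():
            # Silence convergence warnings on this toy problem.
            warnings.simplefilter('ignore')
            LinearSVC(loss=loss, penalty=penalty, dual=dual).fit(X, y)
    except ValueError:
        continue  # scikit-learn rejects this joint assignment
    valid.append((loss, penalty, dual))

print(len(valid), 'valid joint assignments out of 8')
for combo in valid:
    print(combo)
```

Encoding only the valid assignments as one categorical choice, rather than three independent binary choices, keeps the search from spending evaluations on configurations that can never run.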
Finally, the search space becomes an optimization problem when we also define a scalar-valued search objective. By default, Hyperopt-sklearn uses scikit-learn’s score method on validation data to define the search criterion. For classifiers, this is the accuracy, the complement of the so-called “Zero-One Loss”: the fraction of correct label predictions among data that has been withheld from the data set used for training (and also from the data used for testing after the model selection search process).
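To make the criterion concrete, the following scikit-learn snippet computes a classifier's score on a validation split that is disjoint from both the training and the test data. The split sizes and the KNN classifier are arbitrary illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import zero_one_loss
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Set the test data aside first, then carve a validation set out of the rest.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

clf = KNeighborsClassifier().fit(X_train, y_train)

accuracy = clf.score(X_valid, y_valid)               # fraction predicted correctly
loss = zero_one_loss(y_valid, clf.predict(X_valid))  # fraction predicted incorrectly
print(accuracy, loss)
```

Since the two quantities sum to one, maximizing the validation score and minimizing the zero-one loss select the same configuration.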
5.4 Example Usage
Following Scikit-learn’s convention, hyperopt-sklearn provides an Estimator class with a fit method and a predict method. The fit method of this class performs hyperparameter optimization, and after it has completed, the predict method applies the best model to given test data. Each evaluation during optimization performs training on a large fraction of the training set, estimates test set accuracy on a validation set, and returns that validation set score to the optimizer. At the end of search, the best configuration is retrained on the whole data set to produce the classifier that handles subsequent predict calls.
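The protocol described above (search on a train/validation split, then retrain the winner on all of the data) can be mimicked in a few lines. The class below is a hypothetical stand-in written for illustration: it is not hyperopt-sklearn's Estimator, it searches only over an RBF SVM, and it uses plain random search in place of Hyperopt's algorithms.

```python
import random

from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


class TinySearchEstimator:
    """Hypothetical estimator that follows the fit/predict protocol."""

    def __init__(self, max_evals=20, valid_size=0.2, seed=0):
        self.max_evals = max_evals
        self.valid_size = valid_size
        self.seed = seed

    def fit(self, X, y):
        rng = random.Random(self.seed)
        # Hold out part of the training data for validation.
        X_tr, X_va, y_tr, y_va = train_test_split(
            X, y, test_size=self.valid_size, random_state=self.seed)
        best_score, best_model = -1.0, None
        for _ in range(self.max_evals):
            # Random search stands in for Hyperopt's search algorithms here.
            candidate = SVC(C=10 ** rng.uniform(-3, 3),
                            gamma=10 ** rng.uniform(-4, 1))
            score = candidate.fit(X_tr, y_tr).score(X_va, y_va)
            if score > best_score:
                best_score, best_model = score, candidate
        # Retrain the best configuration on the whole training set.
        self.best_model_ = clone(best_model).fit(X, y)
        self.best_valid_score_ = best_score
        return self

    def predict(self, X):
        return self.best_model_.predict(X)


X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

estim = TinySearchEstimator().fit(X_train, y_train)
predictions = estim.predict(X_test)
print(estim.best_valid_score_, (predictions == y_test).mean())
```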
One of the important goals of hyperopt-sklearn is that it is easy to learn and to use. To facilitate this, the syntax for fitting a classifier to data and making predictions is very similar to scikit-learn. Here is the simplest example of using this software.
The HyperoptEstimator object contains the information of what space to search as well as how to search it. It can be configured to use a variety of hyperparameter search algorithms and also supports using a combination of algorithms. Any algorithm that supports the same interface as the algorithms in hyperopt can be used here. This is also where you, the user, can specify the maximum number of function evaluations you would like to be run as well as a timeout (in seconds) for each run.
Each search algorithm can bring its own bias to the search space, and it may not be clear that one particular strategy is the best in all cases. Sometimes it can be helpful to use a mixture of search algorithms.
Searching effectively over the entire space of classifiers available in scikit-learn can use a lot of time and computational resources. Sometimes you might have a particular subspace of models that you are more interested in. With hyperopt-sklearn it is possible to specify a narrower search space to allow it to be explored in greater depth.
Combinations of different spaces can also be used.
The support vector machine provided by scikit-learn has a number of different kernels that can be used (linear, rbf, poly, sigmoid). Changing the kernel can have a large effect on the performance of the model, and each kernel has its own unique hyperparameters. To account for this, hyperopt-sklearn treats each kernel choice as a unique model in the search space. If you already know which kernel works best for your data, or you are just interested in exploring models with a particular kernel, you may specify it directly rather than going through the svc.
It is also possible to specify which kernels you are interested in by passing a list to the svc.
In a similar manner to classifiers, the space of preprocessing modules can be fine-tuned. Multiple successive stages of preprocessing can be specified through an ordered list. An empty list means that no preprocessing will be done on the data.
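The semantics of an ordered preprocessing list can be illustrated with scikit-learn alone. The helper `build_model` is a hypothetical name introduced here, and the chosen stages are arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)


def build_model(preprocessing, classifier):
    # Stages are applied in list order; an empty list leaves the data untouched.
    return make_pipeline(*preprocessing, classifier)


two_stage = build_model([MinMaxScaler(), PCA(n_components=2)], SVC())
no_preproc = build_model([], SVC())

for model in (two_stage, no_preproc):
    model.fit(X, y)
    print(len(model.steps), model.score(X, y))
```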
Combinations of different spaces can be used here as well.
Some types of preprocessing will only work on specific types of data. For example, the TfidfVectorizer that scikit-learn provides is designed to work with text data and would not be appropriate for other types of data. To address this, hyperopt-sklearn comes with a few predefined spaces of classifiers and preprocessing tailored to specific data types.
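To show why text needs its own preprocessing, here is a tiny scikit-learn model that pairs TfidfVectorizer with a classifier that accepts its sparse output. The four-document corpus and its labels are invented for illustration.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus: raw strings, not numeric feature vectors.
docs = [
    'the rocket launch reached orbit',
    'astronauts repaired the space telescope',
    'the striker scored in the final match',
    'the team won the championship game',
]
labels = ['space', 'space', 'sport', 'sport']

# TF-IDF turns raw strings into a sparse document-term matrix.
features = TfidfVectorizer().fit_transform(docs)
print(issparse(features), features.shape[0])

# MultinomialNB accepts sparse, non-negative features directly.
text_model = make_pipeline(TfidfVectorizer(), MultinomialNB())
text_model.fit(docs, labels)
prediction = text_model.predict(['the rocket left orbit'])
print(prediction)
```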
So far in all of these examples, every hyperparameter available to the model is being searched over. It is also possible for you to specify the values of specific hyperparameters, and those parameters will remain constant during the search. This could be useful, for example, if you knew you wanted to use whitened PCA data and a degree-3 polynomial kernel SVM.
It is also possible to specify ranges of individual parameters. This is done using the standard hyperopt syntax. These will override the defaults defined within hyperopt-sklearn.
All of the components available to the user can be found in the components.py file. A complete working example of using hyperopt-sklearn to find a model for the 20 newsgroups data set is shown below.
5.5 Experiments
We conducted experiments to establish that hyperopt-sklearn can find accurate models on a range of data sets in a reasonable amount of time. Results were collected on three data sets: MNIST, 20-Newsgroups, and Convex Shapes. MNIST is a well-known data set of 70K 28 × 28 greyscale images of hand-drawn digits [12]. 20-Newsgroups is a 20-way classification data set of 20K newsgroup messages ([13], we did not remove the headers for our experiments). Convex Shapes is a binary classification task of distinguishing pictures of convex white-colored regions in small (32 × 32) black-and-white images [11].
Fig. 5.2 (right) shows that search could find different, good models. This figure was constructed by running hyperopt-sklearn with different initial conditions (number of evaluations, choice of optimization algorithm, and random number seed) and keeping track of what final model was chosen after each run. Although support vector machines were always among the best, the parameters of the best SVMs looked very different across data sets. For example, on the image data sets (MNIST and Convex) the SVMs chosen never had a sigmoid or linear kernel, while on 20 newsgroups the linear and sigmoid kernels were often best.
5.6 Discussion and Future Work
Hyperopt-sklearn scores relative to selections from literature on the three data sets used in our experiments. On MNIST, hyperopt-sklearn is one of the best-scoring methods that does not use image-specific domain knowledge (these scores and others may be found at http://yann.lecun.com/exdb/mnist/). On 20 Newsgroups, hyperopt-sklearn is competitive with similar approaches from the literature (scores taken from [7]). In the 20 Newsgroups data set, the score reported for hyperopt-sklearn is the weighted-average F1 score provided by sklearn. The other approaches shown here use the macro-average F1 score. On Convex Shapes, hyperopt-sklearn outperforms previous automated algorithm configuration approaches [6] and manual tuning [11].
| MNIST approach | Accuracy | 20 Newsgroups approach | F-Score | Convex Shapes approach | Accuracy |
|---|---|---|---|---|---|
| Committee of convnets | 99.8% | CFC | 0.928 | hyperopt-sklearn | 88.7% |
| hyperopt-sklearn | 98.7% | hyperopt-sklearn | 0.856 | hp-dbnet | 84.6% |
| libSVM grid search | 98.6% | SVMTorch | 0.848 | dbn-3 | 81.4% |
| Boosted trees | 98.5% | LibSVM | 0.843 | | |
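Because the 20 Newsgroups column mixes weighted-average and macro-average F1 scores, it helps to see how the two differ. The toy labels below are invented for illustration and are unrelated to the experiments; they only show that the two averages can disagree on imbalanced data.

```python
from sklearn.metrics import f1_score

# Imbalanced toy labels: eight examples of class 0, two of class 1.
y_true = [0] * 8 + [1] * 2
# All class-0 examples are correct; one class-1 example is misclassified.
y_pred = [0] * 8 + [0, 1]

# Macro average: every class contributes equally.
macro = f1_score(y_true, y_pred, average='macro')
# Weighted average: each class contributes in proportion to its support.
weighted = f1_score(y_true, y_pred, average='weighted')

print(round(macro, 3), round(weighted, 3))
```

Here the per-class F1 scores are 16/17 for class 0 and 2/3 for class 1, so the weighted average is pulled toward the majority class and exceeds the macro average. Scores computed with different averaging schemes are therefore only roughly comparable.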
The model with the best performance on the MNIST Digits data set uses deep artificial neural networks. Small receptive fields of convolutional winner-take-all neurons build up the large network. Each neural column becomes an expert on inputs preprocessed in different ways, and the predictions of 35 deep neural columns are averaged to come up with a single final prediction [4]. This model is much more advanced than those available in scikit-learn. The previously best known model in the scikit-learn search space is a radial-basis SVM on centered data that scores 98.6%, and hyperopt-sklearn matches that performance [15].
The CFC model that performed quite well on the 20 newsgroups document classification data set is a Class-Feature-Centroid classifier. Centroid approaches are typically inferior to an SVM, due to the centroids found during training being far from the optimal location. The CFC method reported here uses a centroid built from the inter-class term index and the inner-class term index. It uses a novel combination of these indices along with a denormalized cosine measure to calculate the similarity score between the centroid and a text vector [7]. This style of model is not currently implemented in hyperopt-sklearn, and our experiments suggest that existing hyperopt-sklearn components cannot be assembled to match its level of performance. Perhaps when it is implemented, Hyperopt may find a set of parameters that provides even greater classification accuracy.
On the Convex Shapes data set, our Hyperopt-sklearn experiments revealed a more accurate model than was previously believed to exist in any search space, let alone a search space of such standard components. This result underscores the difficulty and importance of hyperparameter search.
Hyperopt-sklearn provides many opportunities for future work: more classifiers and preprocessing modules could be included in the search space, and there are more ways to combine even the existing components. Other types of data require different preprocessing, and other prediction problems exist beyond classification. In expanding the search space, care must be taken to ensure that the benefits of new models outweigh the greater difficulty of searching a larger space. There are some parameters that scikit-learn exposes that are more implementation details than actual hyperparameters that affect the fit (such as algorithm and leaf_size in the KNN model). Care should be taken to identify these parameters in each model and they may need to be treated differently during exploration.
It is possible for a user to add their own classifier to the search space as long as it fits the scikit-learn interface. This currently requires some understanding of how hyperopt-sklearn’s code is structured and it would be nice to improve the support for this so minimal effort is required by the user. It is also possible for the user to specify alternate scoring methods besides the default accuracy or F-measure, as there can be cases where these are not best suited to the particular problem.
Computational wall time spent on search is of great practical importance, and hyperopt-sklearn currently spends a significant amount of time evaluating points that are unpromising. Techniques for recognizing bad performers early could speed up search enormously [5, 18].
5.7 Conclusions
This chapter has introduced Hyperopt-sklearn, a Python package for automated algorithm configuration of standard machine learning algorithms provided by Scikit-Learn. Hyperopt-sklearn provides a unified interface to a large subset of the machine learning algorithms available in scikit-learn and with the help of Hyperopt’s optimization functions it is able to both rival and surpass human experts in algorithm configuration. We hope that it provides practitioners with a useful tool for the development of machine learning systems, and automated machine learning researchers with benchmarks for future work in algorithm configuration.
Acknowledgements
This research was supported by the NSERC Banting Fellowship program, the NSERC Engage Program and by D-Wave Systems. Thanks also to Hristijan Bogoevski for early drafts of a hyperopt-to-scikit-learn bridge.
Bibliography
 1. J. Bergstra, R. Bardenet, Y. Bengio, and B. Kegl. Algorithms for hyperparameter optimization, NIPS, 24:2546–2554, 2011.
 2. J. Bergstra, D. Yamins, and D. D. Cox. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures, In Proc. ICML, 2013a.
 3. J. Bergstra, D. Yamins, and D. D. Cox. Hyperopt: A Python library for optimizing the hyperparameters of machine learning algorithms, SciPy’13, 2013b.
 4. D. Ciresan, U. Meier, and J. Schmidhuber. Multi-column Deep Neural Networks for Image Classification, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 3642–3649, 2012.
 5. T. Domhan, T. Springenberg, and F. Hutter. Extrapolating Learning Curves of Deep Neural Networks, ICML AutoML Workshop, 2014.
 6. K. Eggensperger, M. Feurer, F. Hutter, J. Bergstra, J. Snoek, H. Hoos, and K. Leyton-Brown. Towards an empirical foundation for assessing Bayesian optimization of hyperparameters, NIPS workshop on Bayesian Optimization in Theory and Practice, 2013.
 7. H. Guan, J. Zhou, and M. Guo. A class-feature-centroid classifier for text categorization, Proceedings of the 18th international conference on World wide web, 201–210. ACM, 2009.
 8. M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The weka data mining software: an update, ACM SIGKDD explorations newsletter, 11(1):10–18, 2009.
 9. F. Hutter, H. Hoos, and K. Leyton-Brown. Sequential model-based optimization for general algorithm configuration, LION-5, 2011. Extended version as UBC Tech report TR-2010-10.
 10. B. Komer, J. Bergstra, and C. Eliasmith. Hyperopt-sklearn: automatic hyperparameter configuration for scikit-learn, ICML AutoML Workshop, 2014.
 11. H. Larochelle, D. Erhan, A. Courville, J. Bergstra, and Y. Bengio. An empirical evaluation of deep architectures on problems with many factors of variation, ICML, 473–480, 2007.
 12. Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition, Proceedings of the IEEE, 86(11):2278–2324, November 1998.
 13. T. Mitchell. 20 newsgroups data set, http://qwone.com/~jason/20Newsgroups/, 1996.
 14. J. Mockus, V. Tiesis, and A. Zilinskas. The application of Bayesian methods for seeking the extremum, L.C.W. Dixon and G.P. Szego, editors, Towards Global Optimization, volume 2, pages 117–129. North Holland, New York, 1978.
 15. The MNIST Database of handwritten digits: http://yann.lecun.com/exdb/mnist/
 16. F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine Learning in Python, Journal of Machine Learning Research, 12:2825–2830, 2011.
 17. J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms, Neural Information Processing Systems, 2012.
 18. K. Swersky, J. Snoek, and R. P. Adams. Freeze-Thaw Bayesian Optimization, arXiv:1406.3896, 2014.
 19. C. Thornton, F. Hutter, H. H. Hoos, and K. Leyton-Brown. Auto-WEKA: Automated selection and hyperparameter optimization of classification algorithms, KDD, 847–855, 2013.
Copyright information
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.