QIM: Quantifying Hyperparameter Importance for Deep Learning
Abstract
Recently, Deep Learning (DL) has attracted great attention because it achieves breakthroughs in many areas such as image processing and face identification. The performance of DL models critically depends on hyperparameter settings. However, existing approaches that quantify the importance of these hyperparameters are time-consuming.
In this paper, we propose a fast approach, called QIM, to quantify the importance of DL hyperparameters. It leverages Plackett–Burman design to collect as little data as possible while still correctly quantifying hyperparameter importance. We conducted experiments on the popular deep learning framework – Caffe – with different datasets to evaluate QIM. The results show that QIM can rank the importance of the DL hyperparameters correctly at very low cost.
Keywords
Deep learning · Plackett–Burman design · Hyperparameter
1 Introduction
Deep learning (DL) is a subfield of machine learning (ML) that focuses on extracting features from data through multiple layers of abstraction. DL algorithms behave very differently across model variants such as deep belief networks [8], convolutional networks [13], and stacked denoising autoencoders [17], all of which have up to hundreds of hyperparameters that significantly affect the performance of DL algorithms.
There has been a recent surge of interest in more sophisticated hyperparameter optimization methods [1, 3, 9, 15]. For example, [3] has applied Bayesian optimization techniques for designing convolutional vision architectures by learning a probabilistic model over the hyperparameter search space. However, none of these approaches provides scientists with answers to questions like the following: how important is each of the hyperparameters, and how do their values affect performance? The answer to such questions is the key to scientific discoveries. However, not much work has been done on quantifying the relative importance of the hyperparameters that do matter.

We propose a PB-design-based approach, called QIM, to quantify the importance of the hyperparameters of DL algorithms.

We leverage Caffe to implement two versions of the DL algorithm to evaluate QIM. The results show that QIM is able to correctly assess the importance of the hyperparameters of DL while being \(3\times \) faster than other approaches.
This paper is organized as follows. Section 2 describes the background of the DL and the PB design approach. Section 3 introduces our QIM. Section 4 describes the experimental setup for evaluating QIM. Section 5 presents the results and analysis. Section 6 describes the related work and Sect. 7 concludes the paper.
2 Background
2.1 Deep Learning (DL)
Generally, DL is a type of machine learning (ML) but is much more powerful than traditional ML. The power of DL is obtained by learning to represent the world as a nested hierarchy of concepts, with each concept defined in relation to simpler concepts, as shown in Fig. 1. Each DL algorithm generally includes two sub-algorithms: forward-propagation and back-propagation. Most DL algorithms come with many hyperparameters that control many aspects of the learning algorithm's behavior. Properly setting the values of the hyperparameters is utterly important, but it is also difficult. The hyperparameters assessed in this paper are learning rate, momentum, weight decay, gamma, power, and stepsize.
Learning rate is a crucial hyperparameter for the stochastic gradient descent (SGD) algorithm [2], which is used in the back-propagation algorithm. Momentum is designed to accelerate the learning process. Weight decay is designed to prevent overfitting; in other words, it governs the regularization term that is added to the network's loss. The remaining hyperparameters, gamma, power, and stepsize, are used to adjust the value of the learning rate over time.
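In Caffe, these knobs appear directly in the solver configuration. A minimal, illustrative solver.prototxt fragment might look like the following (the values shown are common defaults, not the settings used in this study):

```
base_lr: 0.01        # initial learning rate for SGD
momentum: 0.9        # accelerates learning along persistent gradient directions
weight_decay: 0.0005 # strength of the regularization term added to the loss
lr_policy: "step"    # "step" uses gamma and stepsize; "inv" uses gamma and power
gamma: 0.1           # factor by which the learning rate is scaled
stepsize: 10000      # iterations between learning-rate drops (for lr_policy "step")
# power: 0.75        # decay exponent (for lr_policy "inv")
```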
2.2 PB Design
Table 1. The PB design matrix with 8 experiments (columns x1–x7 are the parameters or factors).

Assembly  x1  x2  x3  x4  x5  x6  x7 
1  +1  +1  +1  −1  +1  −1  −1 
2  −1  +1  +1  +1  −1  +1  −1 
3  −1  −1  +1  +1  +1  −1  +1 
4  +1  −1  −1  +1  +1  +1  −1 
5  −1  +1  −1  −1  +1  +1  +1 
6  +1  −1  +1  −1  −1  +1  +1 
7  +1  +1  −1  +1  −1  −1  +1 
8  −1  −1  −1  −1  −1  −1  −1 
Note that we want to quantify the importance of only 5 parameters, but we construct a matrix with columns for 7 parameters; this is required by the PB design approach. However, we can use the quantities computed for the dummy parameters (\(m_6\) and \(m_7\)) to estimate the experimental error.
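To make the role of the dummy columns concrete, here is a minimal sketch (not the paper's implementation) of how main effects are extracted from the 8-run matrix and how the dummy columns \(x_6\) and \(x_7\) yield an error estimate; the response values are hypothetical:

```python
import numpy as np

# Plackett-Burman N=8 design from Table 1: rows are runs, columns x1..x7.
PB8 = np.array([
    [+1, +1, +1, -1, +1, -1, -1],
    [-1, +1, +1, +1, -1, +1, -1],
    [-1, -1, +1, +1, +1, -1, +1],
    [+1, -1, -1, +1, +1, +1, -1],
    [-1, +1, -1, -1, +1, +1, +1],
    [+1, -1, +1, -1, -1, +1, +1],
    [+1, +1, -1, +1, -1, -1, +1],
    [-1, -1, -1, -1, -1, -1, -1],
])

def pb_effects(y):
    """Main effect of each column: mean response at +1 minus mean at -1."""
    y = np.asarray(y, dtype=float)
    return PB8.T @ y / (len(y) / 2)

# Hypothetical accuracies for the 8 runs (illustration only).
y = [0.62, 0.55, 0.71, 0.68, 0.58, 0.66, 0.60, 0.50]
effects = pb_effects(y)
real, dummy = effects[:5], effects[5:]   # x6, x7 carry no real parameter
noise = np.sqrt(np.mean(dummy ** 2))     # error estimate from dummy columns
```

A real effect is then considered meaningful only if it stands out against `noise`.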
3 QIM
3.1 Overview
3.2 Identifying the Value Range for Each Hyperparameter
In order to use QIM correctly, we need to know the value range of each hyperparameter for a given DL algorithm. We propose a method named tentative search (TS) to decide the value ranges of the hyperparameters. As shown in Fig. 3, we iteratively decrease or increase the value of a parameter by a certain stepsize while keeping the values of all other parameters fixed, and measure the performance. If, as we keep increasing the value of the hyperparameter, the gradient between the last two points reaches zero (as segment CD shows), we choose the value of the hyperparameter corresponding to point C as the upper bound of the parameter. The other case is that the DL algorithm fails to run successfully when we increase or decrease the value of the parameter further, which indicates that we already found the upper or lower bound of the hyperparameter in the previous try. In either case, we can find the bounds of the hyperparameters.
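The TS procedure for the upper bound can be sketched as follows; `run` is a hypothetical callable that trains the DL model with the candidate value (all other hyperparameters fixed) and returns its performance, raising an exception if training fails:

```python
def upper_bound(run, base, step, tol=1e-3, max_tries=50):
    """Tentative search (TS) for a hyperparameter's upper bound.

    Increase the value by `step` until either (a) the performance curve
    flattens, i.e. the gradient between the last two points is ~0, in
    which case the previous value (point C) is the bound, or (b) training
    fails, in which case the bound was found in the previous try.
    """
    value, prev = base, None
    for _ in range(max_tries):
        try:
            score = run(value)
        except Exception:                  # training diverged or crashed
            return value - step            # bound found in the previous try
        if prev is not None and abs(score - prev) / step < tol:
            return value - step            # curve flattened: point C is the bound
        prev, value = score, value + step
    return value                           # budget exhausted; best guess so far
```

The lower bound is found symmetrically by decreasing the value by `step`.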
3.3 QIM
We now describe QIM in detail, starting with the hyperparameters used in this study. The common hyperparameters used for all four types of experiments are base_lr, momentum, weight_decay, and gamma. The hyperparameter power is only used for lenet-cifar10 and lenet-mnist, and stepsize only for auto-cifar10 and auto-mnist. Since \(4 \times 1<5< 4 \times 2\), we use the \(N=8\) PB design shown in Table 1, which leaves us two dummy parameters. To improve the confidence of QIM, we design 16-run trials instead of the 8-run one proposed by PB design. This is achieved by adding a mirrored (sign-flipped) row for each row in Table 1. For each type of experiment, we run 16 trials with the hyperparameter settings corresponding to the PB matrix. Then the importance of each hyperparameter is computed by using Eq. (1).
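The mirroring step and the final importance computation can be sketched as below. Since Eq. (1) is not reproduced in this excerpt, the sketch uses the standard squared-contribution form of the PB contrast as an assumed stand-in:

```python
import numpy as np

# 8-run PB matrix from Table 1 (rows are runs, columns x1..x7).
PB8 = np.array([
    [+1, +1, +1, -1, +1, -1, -1],
    [-1, +1, +1, +1, -1, +1, -1],
    [-1, -1, +1, +1, +1, -1, +1],
    [+1, -1, -1, +1, +1, +1, -1],
    [-1, +1, -1, -1, +1, +1, +1],
    [+1, -1, +1, -1, -1, +1, +1],
    [+1, +1, -1, +1, -1, -1, +1],
    [-1, -1, -1, -1, -1, -1, -1],
])

# Mirroring: append the sign-flipped copy of every row, giving 16 runs.
PB16 = np.vstack([PB8, -PB8])

def importance(y):
    """Percentage importance of each factor from the 16 run responses."""
    y = np.asarray(y, dtype=float)
    effects = PB16.T @ y / (len(y) / 2)   # main-effect contrasts
    return 100 * effects ** 2 / np.sum(effects ** 2)
```

The mirrored (foldover) rows cancel certain confounding patterns of the base design, which is why they improve confidence in the estimated effects.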
4 Experimental Setup
5 Evaluation
We first report the results with the supervised learning algorithm and then with the unsupervised learning algorithm.
5.1 Supervised Learning
The supervised learning algorithm used in this study is lenet. We feed lenet with the CIFAR10 and MNIST datasets respectively.
Results on CIFAR10 — Case 1. Figure 4 shows the importance obtained by QIM and ANOVA on CIFAR10. The importance rank given by QIM, from the most to the least important, is base_lr, weight_decay, power, gamma, and momentum. ANOVA gives a similar rank except for the hyperparameter power. In this experiment, QIM introduces an error of 10.52 %. This indicates that the importance rank obtained by QIM is generally correct and can be used in practice. Moreover, we find that the importance of base_lr is much higher than those of the other hyperparameters, which implies that base_lr dominates the performance of DL with lenet on CIFAR10 (Fig. 6).
Results on MNIST — Case 2. The task is to classify the images into 10 digit classes. Figure 5 compares the hyperparameter importance results of QIM and ANOVA. As can be seen, both methods rank weight_decay as the most important parameter and power as the least important one. QIM treats base_lr as less important than ANOVA does, while the two approaches give similar importance to both momentum and gamma. In this experiment, QIM introduces an error of 5.12 %, which is smaller than the error in the first case. This indicates that the importance rank of the hyperparameters obtained by QIM is more convincing than in the first case.
5.2 Unsupervised Pretraining
As an unsupervised pretraining model, a deep autoencoder [17] is trained on CIFAR10 and MNIST respectively.
Results on CIFAR10 — Case 3. QIM gives the top two importance ranks to base_lr and momentum, which is consistent with the results of ANOVA. The error rate of QIM in this experiment is 5.4 %. For the less important hyperparameters assessed by ANOVA, such as weight_decay, gamma, power, and stepsize, QIM also gives a similar importance rank but with different absolute importance values. Comparing to case 1, we find that the learning algorithm used in DL significantly affects the importance of its hyperparameters as well.
Results on MNIST — Case 4. QIM and ANOVA rank the same top three important hyperparameters, namely base_lr, momentum, and weight_decay, as shown in Fig. 7. In this experiment, the error of QIM is 14.32 %, which seems high. However, QIM assesses the importance of the hyperparameters consistently with ANOVA but with fewer iterations.
5.3 Time Cost
Figure 8 compares the time used by QIM and ANOVA to rank the importance of hyperparameters of DL. As can be seen, QIM takes about one third of the time used by ANOVA on average. As evaluated above, QIM can correctly rank the hyperparameter importance. This indicates that QIM is indeed a fast and efficient approach for quantifying the importance of hyperparameters.
6 Related Work
There are many studies focusing on optimizing hyperparameters of DL algorithms [1, 3, 4, 9]. In low-dimensional problems with numerical hyperparameters, the best available hyperparameter optimization methods use Bayesian optimization [6] based on Gaussian process models, whereas in high-dimensional and discrete spaces, tree-based models [4], and in particular random forests [9, 16], are more successful [7]. Such modern hyperparameter optimization methods have achieved considerable recent success. For example, Bayesian optimization found a better instantiation of nine convolutional network hyperparameters than a domain expert, thereby achieving the lowest error reported on the CIFAR10 benchmark at the time [15]. However, these studies do not quantify the importance of the hyperparameters, while QIM does.
7 Conclusion
In this work, we propose an efficient PB-design-based approach, named QIM, to quantify the importance of the hyperparameters of DL algorithms. With 5–15 % error, QIM effectively assesses the importance of each hyperparameter with a much smaller number of computation iterations. We empirically validate QIM with two deep models on two datasets. The results show that QIM can rank the importance of hyperparameters of DL algorithms correctly in all four cases.
Acknowledgements
We thank the reviewers for their thoughtful comments and suggestions. This work is supported by national key research and development program under No.2016YFB1000204, the major scientific and technological project of Guangdong province (2014B010115003), Shenzhen Technology Research Project (JSGG20160510154636747), Shenzhen Peacock Project (KQCX20140521115045448), outstanding technical talent program of CAS, and NSFC under grant no U1401258.
References
1. Bardenet, R., Brendel, M., Kégl, B., Sebag, M.: Collaborative hyperparameter tuning. In: 30th International Conference on Machine Learning (ICML 2013), vol. 28, pp. 199–207. ACM Press (2013)
2. Bengio, Y., Goodfellow, I.J., Courville, A.: Deep Learning. MIT Press book in preparation (2015). Draft chapters available at http://www.iro.umontreal.ca/~bengioy/dlbook
3. Bergstra, J., Bengio, Y.: Random search for hyper-parameter optimization. J. Mach. Learn. Res. 13(1), 281–305 (2012)
4. Bergstra, J.S., Bardenet, R., Bengio, Y., Kégl, B.: Algorithms for hyper-parameter optimization. In: Advances in Neural Information Processing Systems, pp. 2546–2554 (2011)
5. Bourlard, H., Kamp, Y.: Auto-association by multilayer perceptrons and singular value decomposition. Biol. Cybern. 59(4–5), 291–294 (1988)
6. Brochu, E., Cora, V.M., De Freitas, N.: A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599 (2010)
7. Eggensperger, K., Feurer, M., Hutter, F., Bergstra, J., Snoek, J., Hoos, H., Leyton-Brown, K.: Towards an empirical foundation for assessing Bayesian optimization of hyperparameters. In: NIPS Workshop on Bayesian Optimization in Theory and Practice (2013)
8. Hinton, G.E., Osindero, S., Teh, Y.W.: A fast learning algorithm for deep belief nets. Neural Comput. 18(7), 1527–1554 (2006)
9. Hutter, F., Hoos, H.H., Leyton-Brown, K.: Sequential model-based optimization for general algorithm configuration. In: Coello, C.A.C. (ed.) LION 2011. LNCS, vol. 6683, pp. 507–523. Springer, Heidelberg (2011). doi:10.1007/978-3-642-25566-3_40
10. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: convolutional architecture for fast feature embedding. In: Proceedings of the ACM International Conference on Multimedia, pp. 675–678. ACM (2014)
11. Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images (2009)
12. LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W., Jackel, L.D.: Backpropagation applied to handwritten zip code recognition. Neural Comput. 1(4), 541–551 (1989)
13. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)
14. Plackett, R.L., Burman, J.P.: The design of optimum multifactorial experiments. Biometrika 33(4), 305–325 (1946)
15. Snoek, J., Larochelle, H., Adams, R.P.: Practical Bayesian optimization of machine learning algorithms. In: Advances in Neural Information Processing Systems, pp. 2951–2959 (2012)
16. Thornton, C., Hutter, F., Hoos, H.H., Leyton-Brown, K.: Auto-WEKA: combined selection and hyperparameter optimization of classification algorithms. In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 847–855. ACM (2013)
17. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., Manzagol, P.A.: Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res. 11, 3371–3408 (2010)