QIM: Quantifying Hyperparameter Importance for Deep Learning

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9966)

Abstract

Recently, Deep Learning (DL) has attracted significant attention because it has achieved breakthroughs in many areas, such as image processing and face identification. The performance of DL models depends critically on their hyperparameter settings. However, existing approaches that quantify the importance of these hyperparameters are time-consuming.

In this paper, we propose QIM, a fast approach to quantifying the importance of DL hyperparameters. It leverages Plackett-Burman design to collect as little experimental data as possible while still correctly quantifying hyperparameter importance. We conducted experiments on Caffe, a popular deep learning framework, with different datasets to evaluate QIM. The results show that QIM can correctly rank the importance of the DL hyperparameters at very low cost.
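To make the idea concrete, below is a minimal sketch of how a Plackett-Burman design can screen hyperparameter importance from very few training runs. This is an illustration of the general technique, not the authors' QIM implementation: the `pb8_design` and `train_and_evaluate` functions, the four hyperparameters, and their low/high levels are all hypothetical placeholders.

```python
import numpy as np

def pb8_design(n_factors):
    """8-run Plackett-Burman design for up to 7 two-level factors.
    Rows 1-7 are cyclic shifts of the standard PB generator for N=8;
    the final row is all -1. Columns are orthogonal and balanced."""
    assert 1 <= n_factors <= 7
    gen = np.array([1, 1, 1, -1, 1, -1, -1])
    rows = [np.roll(gen, i) for i in range(7)] + [-np.ones(7, dtype=int)]
    return np.array(rows)[:, :n_factors]

def train_and_evaluate(hp):
    """Placeholder for training a DL model (e.g. a Caffe job) with the
    hyperparameters `hp` and returning validation accuracy. This toy
    response makes learning_rate dominate, purely for demonstration."""
    return (0.70
            + 0.15 * (hp["learning_rate"] > 0.01)
            + 0.03 * (hp["momentum"] > 0.7)
            + 0.01 * (hp["batch_size"] > 100))

# Hypothetical low (-1) and high (+1) levels for each hyperparameter.
levels = {
    "learning_rate": (0.001, 0.1),
    "momentum":      (0.5, 0.9),
    "weight_decay":  (1e-5, 1e-3),
    "batch_size":    (32, 256),
}

names = list(levels)
design = pb8_design(len(names))

# One training run per design row: 8 runs, versus 2^4 = 16 for a full
# two-level factorial over these 4 hyperparameters.
responses = np.array([
    train_and_evaluate({n: levels[n][1] if s > 0 else levels[n][0]
                        for n, s in zip(names, row)})
    for row in design
])

# Main effect of each factor: mean response at +1 minus mean at -1.
# |effect| serves as the importance score; larger means more important.
effects = design.T @ responses / (len(design) / 2)
for name, imp in sorted(zip(names, np.abs(effects)), key=lambda t: -t[1]):
    print(f"{name}: {imp:.4f}")
```

Because the design columns are orthogonal, all main effects are estimated simultaneously from the same 8 runs, and the cost advantage grows quickly with the number of hyperparameters screened (8 runs versus 128 for a full factorial over 7 factors).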

Keywords

Deep learning · Plackett-Burman design · Hyperparameter

Acknowledgements

We thank the reviewers for their thoughtful comments and suggestions. This work is supported by the National Key Research and Development Program under No. 2016YFB1000204, the Major Scientific and Technological Project of Guangdong Province (2014B010115003), Shenzhen Technology Research Project (JSGG20160510154636747), Shenzhen Peacock Project (KQCX20140521115045448), the Outstanding Technical Talent Program of CAS, and NSFC under Grant No. U1401258.

Copyright information

© IFIP International Federation for Information Processing 2016

Authors and Affiliations

  1. Beihang University, Beijing, China
  2. Shenzhen Institute of Advanced Technology, CAS, Shenzhen, China
