Scalable Hyperparameter Optimization with Products of Gaussian Process Experts

  • Nicolas Schilling
  • Martin Wistuba
  • Lars Schmidt-Thieme
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9851)


In machine learning, hyperparameter optimization is a challenging but necessary task that is usually approached in computationally expensive ways such as grid search. For this reason, surrogate-based black-box optimization techniques such as sequential model-based optimization have been proposed, which allow for faster hyperparameter optimization. Recent research proposes to also integrate hyperparameter performances on past data sets, making hyperparameter optimization faster and more efficient still.
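As a minimal sketch of the sequential model-based optimization mentioned above: at each step, the surrogate's posterior mean and variance at a candidate hyperparameter configuration are scored by an acquisition function, commonly expected improvement. The function below is a generic, illustrative implementation of that criterion, not code from this paper:

```python
import math

def expected_improvement(mean, var, best):
    # Expected improvement for minimization, given a Gaussian surrogate
    # posterior N(mean, var) at a candidate configuration and the best
    # objective value observed so far.
    std = math.sqrt(var)
    z = (best - mean) / std
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))        # Phi(z)
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # phi(z)
    return (best - mean) * cdf + std * pdf
```

The SMBO loop then evaluates the configuration maximizing this score, adds the observation to the meta data, and refits the surrogate.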

In this paper, we use products of Gaussian process experts as surrogate models for hyperparameter optimization. Gaussian processes are a natural choice, as they offer good predictive accuracy together with estimates of their own uncertainty, and their hyperparameters can be tuned very effectively. However, on large meta data sets, learning a single Gaussian process is not feasible, as it involves inverting a large kernel matrix. This directly limits their usefulness for hyperparameter optimization whenever large-scale hyperparameter performances on past data sets are available.

Using products of Gaussian process experts circumvents these scalability issues; however, this usually comes at the price of lower predictive accuracy. In our experiments, we show empirically that products of experts nevertheless perform very well compared to a variety of published surrogate models. Thus, we propose a surrogate model that performs as well as the current state of the art, scales to large meta knowledge, has no hyperparameters of its own, and is very easy to parallelize. The software related to this paper is available at
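To illustrate the idea behind products of Gaussian process experts: the meta data is split into small subsets, a standard GP is trained on each, and the experts' Gaussian predictions are pooled by adding precisions and precision-weighting means. The sketch below assumes this standard pooling rule with an RBF kernel on toy data; all names and the data are illustrative, not the authors' implementation:

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0):
    # Squared-exponential kernel between the rows of a and b.
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def gp_predict(X, y, x_star, noise=1e-2):
    # Standard GP regression: solving with the n-by-n kernel matrix is the
    # O(n^3) step that makes a single GP infeasible on large meta data.
    K = rbf_kernel(X, X) + noise * np.eye(len(X))
    k_star = rbf_kernel(x_star, X)
    mean = k_star @ np.linalg.solve(K, y)
    var = (rbf_kernel(x_star, x_star).diagonal()
           - np.einsum('ij,ji->i', k_star, np.linalg.solve(K, k_star.T)))
    return mean, np.maximum(var, 1e-9)

def poe_predict(experts, x_star):
    # Product of experts: precisions add, means are precision-weighted,
    # so each expert only ever inverts its own small kernel matrix.
    precisions, weighted_means = [], []
    for X, y in experts:
        m, v = gp_predict(X, y, x_star)
        precisions.append(1.0 / v)
        weighted_means.append(m / v)
    prec = np.sum(precisions, axis=0)
    return np.sum(weighted_means, axis=0) / prec, 1.0 / prec

# Toy meta data: each expert sees a disjoint third of the observations.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X[:, 0])
experts = [(X[i::3], y[i::3]) for i in range(3)]
x_star = np.array([[0.5], [1.5]])
mean, var = poe_predict(experts, x_star)
```

Because each expert is independent, the per-expert predictions can be computed in parallel, which is the source of the scalability and parallelism claimed above.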


Keywords: Hyperparameter optimization · Sequential model-based optimization · Product of experts



The authors gratefully acknowledge the co-funding of their work by the German Research Foundation (DFG) under grant SCHM 2583/6-1.



Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  • Nicolas Schilling
  • Martin Wistuba
  • Lars Schmidt-Thieme

  1. Information Systems and Machine Learning Lab, Hildesheim, Germany
