Machine Learning, Volume 107, Issue 8–10, pp. 1283–1302

High-dimensional penalty selection via minimum description length principle

  • Kohei Miyaguchi
  • Kenji Yamanishi
Part of the following topical collections:
  1. Special Issue of the ECML PKDD 2018 Journal Track


Abstract

We tackle the problem of penalty selection for regularization on the basis of the minimum description length (MDL) principle. In particular, we consider the case in which the design space of the penalty function is high-dimensional. In this situation, the luckiness-normalized-maximum-likelihood (LNML) minimization approach is favorable, because LNML quantifies the goodness of regularized models with any form of penalty function in view of the MDL principle, and guides us to a good penalty function through the high-dimensional space. However, the minimization of LNML entails two major challenges: (1) computing the normalizing factor of LNML and (2) minimizing LNML in high-dimensional spaces. In this paper, we present a novel regularization selection method (MDL-RS), in which a tight upper bound of LNML (uLNML) is minimized with a local convergence guarantee. Our main contribution is the derivation of uLNML, a uniform-gap upper bound of LNML in analytic form. This addresses both challenges approximately, because it allows us to accurately approximate LNML and then efficiently minimize the approximation. Experimental results show that MDL-RS improves the generalization performance of regularized estimates, especially when the model has redundant parameters.
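For context, the LNML code length referred to above can be written in the standard form found in the MDL literature; this is a sketch in generic notation, not necessarily the paper's own. Here $w_\lambda$ denotes the penalty (luckiness) function with hyperparameter $\lambda$, and $\hat\theta$ is the penalized maximum-likelihood estimator:

```latex
% LNML code length of data x^n under the model family {p(.|theta)}
% with penalty (luckiness) function w_lambda.
\mathrm{LNML}(x^n;\lambda)
  = -\log p\bigl(x^n \mid \hat\theta(x^n)\bigr)
    + w_\lambda\bigl(\hat\theta(x^n)\bigr)
    + \log \underbrace{\int p\bigl(y^n \mid \hat\theta(y^n)\bigr)\,
        e^{-w_\lambda(\hat\theta(y^n))}\,\mathrm{d}y^n}_{\text{normalizing factor}},
\qquad
\hat\theta(x^n) = \operatorname*{arg\,min}_{\theta}
  \bigl\{ -\log p(x^n \mid \theta) + w_\lambda(\theta) \bigr\}.
```

The two challenges named in the abstract are visible here: the normalizing factor is an integral over all data sequences, which is generally intractable, and the resulting objective must then be minimized over a high-dimensional space of penalty parameters $\lambda$.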


Keywords: Minimum description length principle · Luckiness normalized maximum likelihood · Regularized empirical risk minimization · Penalty selection · Concave–convex procedure



Funding

Funding was provided by Core Research for Evolutional Science and Technology (Grant No. JPMJCR1304).



Copyright information

© The Author(s) 2018

Authors and Affiliations

  1. The University of Tokyo, Bunkyo-ku, Japan
