High-dimensional penalty selection via minimum description length principle
We tackle the problem of penalty selection for regularization on the basis of the minimum description length (MDL) principle. In particular, we consider the case in which the design space of the penalty function is high-dimensional. In this situation, minimizing the luckiness normalized maximum likelihood (LNML) code length is a favorable approach, because LNML quantifies the goodness of regularized models with any form of penalty function in view of the MDL principle and guides us to a good penalty function through the high-dimensional space. However, the minimization of LNML entails two major challenges: (1) the computation of the normalizing factor of LNML and (2) its minimization in high-dimensional spaces. In this paper, we present a novel regularization selection method (MDL-RS), in which a tight upper bound of LNML (uLNML) is minimized with a local convergence guarantee. Our main contribution is the derivation of uLNML, which is a uniform-gap upper bound of LNML in an analytic expression. This addresses both challenges approximately: the analytic bound lets us approximate LNML accurately without computing the normalizing factor, and it can be minimized efficiently. The experimental results show that MDL-RS improves the generalization performance of regularized estimates, particularly when the model has redundant parameters.
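For readers unfamiliar with the quantity being minimized, the following is a sketch of the standard LNML code length in its usual textbook form. The notation is an assumption made here for illustration and does not come from the abstract: $x$ is the observed data, $p(x \mid \theta)$ the model, $g(\theta; \lambda)$ the penalty function parameterized by $\lambda$, and $Z(\lambda)$ the normalizing factor referred to in challenge (1).

```latex
% Sketch of the LNML code length (notation assumed, not taken from the abstract).
\begin{align}
  \mathrm{LNML}(x; \lambda)
    &= \min_{\theta} \bigl\{ -\log p(x \mid \theta) + g(\theta; \lambda) \bigr\}
       + \log Z(\lambda), \\
  Z(\lambda)
    &= \int \max_{\theta} \, p(y \mid \theta) \, e^{-g(\theta; \lambda)} \, \mathrm{d}y .
\end{align}
```

Under this reading, the first term is the regularized empirical risk, while $\log Z(\lambda)$ integrates over all possible data sets $y$, which is why it is hard to compute and why an analytic upper bound is attractive.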
Keywords: Minimum description length principle; Luckiness normalized maximum likelihood; Regularized empirical risk minimization; Penalty selection; Concave–convex procedure
Funding was provided by Core Research for Evolutional Science and Technology (Grant No. JPMJCR1304).