The decomposed normalized maximum likelihood code-length criterion for selecting hierarchical latent variable models

  • Kenji Yamanishi
  • Tianyi Wu
  • Shinya Sugawara
  • Makoto Okada


We propose a new model selection criterion based on the minimum description length principle, named the decomposed normalized maximum likelihood (DNML) criterion. The criterion applies to a large class of hierarchical latent variable models, such as naïve Bayes models, stochastic block models, latent Dirichlet allocation, and Gaussian mixture models, to which many conventional information criteria cannot be straightforwardly applied because latent variable models are non-identifiable. A further advantage is that DNML can be evaluated exactly, without asymptotic approximation, at small computational cost. We justify DNML theoretically in terms of hierarchical minimax regret and estimation optimality. Experiments on synthetic and benchmark data demonstrate the validity of our method in terms of computational efficiency and model selection accuracy. In particular, DNML dominates other existing criteria when the sample size is small and when the data are noisy.
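The abstract notes that the criterion can be evaluated exactly, without asymptotic approximation, at small computational cost. As an illustrative sketch only (not the authors' implementation), the snippet below computes the exact NML code length of a multinomial sample using the Kontkanen–Myllymäki linear-time recurrence for the parametric complexity, and combines such code lengths into a DNML-style sum L(x | z) + L(z) for discrete data with latent cluster labels; the function names and the simple one-level discrete setting are assumptions made for this sketch.

```python
import math

def multinomial_complexity(m, n):
    """Exact parametric complexity C(m, n) of an m-category multinomial
    over n observations (Kontkanen-Myllymaki linear-time recurrence)."""
    if n == 0 or m == 1:
        return 1.0
    c_prev = 1.0  # C(1, n)
    c_curr = sum(math.comb(n, k) * (k / n) ** k * ((n - k) / n) ** (n - k)
                 for k in range(n + 1))  # C(2, n), closed-form sum
    for j in range(1, m - 1):            # C(j+2, n) = C(j+1, n) + (n/j) C(j, n)
        c_prev, c_curr = c_curr, c_curr + (n / j) * c_prev
    return c_curr

def nml_codelength(counts):
    """Exact NML code length (in nats) of a multinomial sample with the
    given per-category counts (zero counts allowed)."""
    n = sum(counts)
    neg_loglik = -sum(c * math.log(c / n) for c in counts if c > 0)
    return neg_loglik + math.log(multinomial_complexity(len(counts), n))

def dnml_codelength(x, z, K, V):
    """DNML-style decomposition L(x | z) + L(z) for discrete data
    x in {0..V-1} with latent labels z in {0..K-1} (illustrative only)."""
    z_counts = [sum(1 for zi in z if zi == k) for k in range(K)]
    length = nml_codelength(z_counts)            # L(z)
    for k in range(K):                           # L(x | z), per cluster
        xk = [xi for xi, zi in zip(x, z) if zi == k]
        if xk:
            length += nml_codelength([sum(1 for v in xk if v == s)
                                      for s in range(V)])
    return length
```

Minimizing `dnml_codelength` over candidate numbers of clusters K would mirror the model selection procedure the abstract describes; the decomposition shown here is a simplified stand-in for the general hierarchical case treated in the paper.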


Keywords: Model selection · Latent variable model · Minimum description length · Normalized maximum likelihood coding




Copyright information

© The Author(s), under exclusive licence to Springer Science+Business Media LLC, part of Springer Nature 2019

Authors and Affiliations

  1. The University of Tokyo, Tokyo, Japan
  2. Tokyo University of Science, Tokyo, Japan
