The decomposed normalized maximum likelihood code-length criterion for selecting hierarchical latent variable models


We propose a new model selection criterion based on the minimum description length principle, called the decomposed normalized maximum likelihood (DNML) criterion. Our criterion can be applied to a large class of hierarchical latent variable models, such as naïve Bayes models, stochastic block models, latent Dirichlet allocation and Gaussian mixture models, to which many conventional information criteria cannot be straightforwardly applied because latent variable models are non-identifiable. Our method also has the advantage that it can be evaluated exactly, without asymptotic approximation, at small time complexity. We theoretically justify DNML in terms of hierarchical minimax regret and estimation optimality. Our experiments on synthetic and benchmark data demonstrate the validity of our method in terms of computational efficiency and model selection accuracy. In particular, we show that our criterion dominates other existing criteria when the sample size is small and when the data are noisy.






  1. Airoldi EM, Blei DM, Fienberg SE, Xing EP (2008) Mixed membership stochastic blockmodels. J Mach Learn Res 9:1981–2014


  2. Akaike H (1974) A new look at the statistical model identification. IEEE Trans Autom Control 19(6):716–723


  3. Ana CC (2007) Improving methods for single-label text categorization. PhD thesis, Instituto Superior Tecnico, Universidade Tecnica de Lisboa

  4. Barron A, Cover T (1991) Minimum complexity density estimation. IEEE Trans. Inf. Theory 37(4):1034–1053


  5. Blei DM, Jordan MI (2003) Modeling annotated data. In: Proceedings of the 26th annual international ACM SIGIR conference on research and development in information retrieval. pp 127–134

  6. Blei DM, Lafferty JD (2009) Topic models. In: Srivastava AN, Sahami M (eds) Text mining: classification, clustering, and applications, vol 10. Taylor & Francis Group, London, pp 71–93


  7. Blei DM, Ng AY, Jordan MI (2003) Latent dirichlet allocation. J Mach Learn Res 3:993–1022


  8. Boulle M, Clerot F, Hue C (2016) Revisiting enumerative two-part crude mdl for bernoulli and multinomial distributions. arXiv:1608.05522

  9. Celeux G, Forbes F, Robert CP, Titterington DM (2006) Deviance information criteria for missing data models. Bayesian Anal. 1(4):651–673


  10. Cover T, Thomas M (1991) Elements of information theory. Wiley, Hoboken


  11. Griffiths TL, Steyvers M (2004) Finding scientific topics. Proc Natl Acad Sci 101:5228–5235


  12. Grünwald P (2007) The minimum description length principle. MIT Press, Cambridge


  13. Hirai S, Yamanishi K (2013) Efficient computation of normalized maximum likelihood codes for Gaussian mixture models with its applications to clustering. IEEE Trans. Inf. Theory 59(11):7718–7727


  14. Hirai S, Yamanishi K (2017) An upper bound on normalized maximum likelihood codes for gaussian mixture models. arXiv:1709.00925

  15. Ito Y, Oeda S, Yamanishi K (2016) Rank selection for non-negative matrix factorization with normalized maximum likelihood coding. In: Proceedings of the 2016 SIAM international conference on data mining. SIAM, pp 720–728

  16. Kemp C, Tenenbaum JB, Griffiths TL, Yamada T, Ueda N (2006) Learning systems of concepts with an infinite relational model. Proc Assoc Adv Artif Intell 3:381–388


  17. Kontkanen P, Myllymäki P (2007) A linear-time algorithm for computing the multinomial stochastic complexity. Inf Process Lett 103(6):227–233


  18. Kontkanen P, Myllymäki P, Buntine W, Rissanen J, Tirri H (2005) An MDL framework for data clustering. In: Grünwald P, Myung I, Pitt MA (eds) Advances in minimum description length: theory and applications. MIT Press, Cambridge, pp 323–353


  19. McLachlan G, Peel D (2000) Finite mixture models. Wiley, Hoboken


  20. Miettinen P, Vreeken J (2014) Mdl4bmf: minimum description length for Boolean matrix factorization. ACM Trans Knowl Discov Data 8(18):18:1–1:31


  21. Miller JW, Harrison MT (2013) A simple example of dirichlet process mixture inconsistency for the number of components. In: Burges CJC, Bottou L, Welling M, Ghahramani Z, Weinberger KQ (eds) Advances in neural information processing systems 26. Curran Associates, Inc., pp 199–206


  22. Rissanen J (1978) Modeling by shortest data description. Automatica 14(5):465–471


  23. Rissanen J (1996) Fisher information and stochastic complexity. IEEE Trans Inf Theory 42(1):40–47


  24. Rissanen J (1998) Stochastic complexity in statistical inquiry, vol 15. World Scientific, Singapore


  25. Rissanen J (2012) Optimal estimation of parameters. Cambridge University Press, Cambridge


  26. Rosen-Zvi M, Griffiths T, Steyvers M, Smyth P (2004) The author-topic model for authors and documents. In: Proceedings of the 20th conference on uncertainty in artificial intelligence. AUAI Press, pp 487–494

  27. Sakai Y, Yamanishi K (2013) An nml-based model selection criterion for general relational data modeling. In: Proceedings of 2013 IEEE international conference on big data. IEEE, pp 421–429

  28. Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6(2):461–464


  29. Shtar’kov YM (1987) Universal sequential coding of single messages. Probl Pereda Inf 23(3):3–17


  30. Silander T, Roos T, Kontkanen P, Myllymäki P (2008) Factorized normalized maximum likelihood criterion for learning bayesian network structures. In: Proceedings of the fourth European workshop on probabilistic graphical models. pp 257–264

  31. Snijders TAB, Nowicki K (1997) Estimation and prediction for stochastic blockmodels for graphs with latent block structure. J Classif 14(1):75–100


  32. Spiegelhalter DJ, Best NG, Carlin BP, Van Der Linde A (2002) Bayesian measures of model complexity and fit. J R Stat Soc Ser B (Stat Methodol) 64(4):583–639


  33. Taddy M (2012) On estimation and selection for topic models. In: Proceedings of artificial intelligence and statistics. pp 1184–1193

  34. Tatti N, Vreeken J (2012) The long and the short of it: summarizing event sequences with serial episodes. In: Proceedings of the 18th ACM SIGKDD international conference on knowledge discovery and data mining. pp 462–470

  35. Teh YW, Jordan MI, Beal MJ, Blei DM (2012) Hierarchical Dirichlet processes. J Am Stat Assoc 101(476):1566–1581


  36. Van Leeuwen M, Vreeken J, Arno S (2009) Identifying the components. Data Min Knowl Discov 19(2):176–193


  37. Wallach HM, Mimno DM, McCallum A (2009) Rethinking LDA: why priors matter. In: Bengio Y, Schuurmans D, Lafferty JD, Williams CKI, Culotta A (eds) Advances in neural information processing systems 22. Curran Associates, Inc., pp 1973–1981


  38. Wu T, Sugawara S, Yamanishi K (2017) Decomposed normalized maximum likelihood codelength criterion for selecting hierarchical latent variable models. In: Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining. pp 1165–1174

  39. Yamanishi K (1992) A learning criterion for stochastic rules. Mach Learn 9(2–3):165–203



Author information



Corresponding author

Correspondence to Kenji Yamanishi.

Additional information


This paper builds upon and extends work published as Wu et al. (2017). This work was supported by JST CREST under Grant JPMJCR1304.

Responsible editor: Johannes Fürnkranz.


A Proof of Theorem 2

According to the theory of minimax regret in Shtar’kov (1987), the minimum in (17) is attained by

$$\begin{aligned} L^{*}({\varvec{x}})= & {} -\log \frac{p_{_{\mathrm{NML}}}({\varvec{x}}|\bar{\varvec{z}}({\varvec{x}}))p(\bar{\varvec{z}}({\varvec{x}});\hat{\theta } _{2}(\bar{\varvec{z}}({\varvec{x}})))}{\sum _{{\varvec{y}}}p_{_{\mathrm{NML}}}({\varvec{y}}|\bar{\varvec{w}}({\varvec{y}}))p(\bar{\varvec{w}}({\varvec{y}});\hat{\theta }_{2}(\bar{\varvec{w}}({\varvec{y}})) )}\\= & {} -\log \frac{ p_{_{\mathrm{NML}}}({\varvec{x}}|\bar{\varvec{z}}({\varvec{x}}))p_{_{\mathrm{NML}}}(\bar{\varvec{z}}({\varvec{x}}))}{\sum _{{\varvec{y}}}p_{_{{\mathrm{NML}}}}({\varvec{y}}|\bar{\varvec{w}}({\varvec{y}}))p_{_{\mathrm{NML}}}(\bar{\varvec{w}}({\varvec{y}}) )}\\= & {} \left\{ -\log p_{_{\mathrm{NML}}}({\varvec{x}}|\bar{\varvec{z}}({\varvec{x}}))-\log p_{_{\mathrm{NML}}}(\bar{\varvec{z}}({\varvec{x}}))\right\} \\&+\log {\sum _{{\varvec{y}}}p_{_{\mathrm{NML}}}({\varvec{y}}|\bar{\varvec{w}}({\varvec{y}}))p_{_{\mathrm{NML}}}(\bar{\varvec{w}}({\varvec{y}}) )} \nonumber \\= & {} L_{_{\mathrm{DNML}}}({\varvec{x}})+\log {\sum _{{\varvec{y}}}p_{_{\mathrm{NML}}}({\varvec{y}}|\bar{\varvec{w}}({\varvec{y}}))p_{_{\mathrm{NML}}}(\bar{\varvec{w}}({\varvec{y}}) )} , \end{aligned}$$

where \(p_{_{\mathrm{NML}}}({\varvec{z}})=p({\varvec{z}};\hat{\theta }_{2}({\varvec{z}}))/C_{\varvec{z}},\) and \(C_{\varvec{z}}=\sum _{{\varvec{z}}}p({\varvec{z}};\hat{\theta }_{2}({\varvec{z}})).\) Then we have

$$\begin{aligned} R^{n}_{X}=\log {\sum _{{\varvec{y}}}p_{_{\mathrm{NML}}}({\varvec{y}}|\bar{\varvec{w}}({\varvec{y}}))p_{_{\mathrm{NML}}}(\bar{\varvec{w}}({\varvec{y}}) )}+\log C_{\varvec{z}}. \end{aligned}$$

Since the DNML code-length is a prefix code-length for which \({\varvec{x}}\) is uniquely decodable, Kraft’s inequality (see Cover and Thomas 1991) gives

$$\begin{aligned} \sum _{{\varvec{x}}}e^{-L_{_{\mathrm{DNML}}}({\varvec{x}})}\le 1. \end{aligned}$$

This is equivalent to

$$\begin{aligned} \sum _{{\varvec{x}}}p_{_{\mathrm{NML}}}({\varvec{x}}|\bar{\varvec{z}}({\varvec{x}}))p_{_{\mathrm{NML}}}(\bar{\varvec{z}}({\varvec{x}}))\le 1. \end{aligned}$$

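Plugging (36) into (35) shows that the first term on the right-hand side of (35) is at most \(\log 1=0\). Writing \(C^{n}_{Z}\) for the normalizer \(C_{\varvec{z}}\), this yields the following upper bound on \(R^{n}_{X}\):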

$$\begin{aligned} R^{n}_{X}\le \log C^{n}_{Z}. \end{aligned}$$

This upper bound is attained by the DNML code-length. This completes the proof. \(\square \)
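The inequality (36) can be checked numerically on a toy model. The following is a minimal sketch, not the authors' code: it assumes binary observations, K = 2 latent classes, categorical distributions at both levels, and takes \(\bar{\varvec{z}}({\varvec{x}})\) to be the assignment that maximizes the product of the two NML distributions (one hypothetical choice among many). All function names are illustrative.

```python
from itertools import product

def max_likelihood(counts):
    """Maximized categorical likelihood prod_l (c_l / n)^(c_l) of a count vector."""
    n = sum(counts)
    value = 1.0
    for c in counts:
        if c > 0:
            value *= (c / n) ** c
    return value

def complexity(n, num_symbols):
    """Brute-force normalizer: maximized likelihood summed over all sequences."""
    return sum(
        max_likelihood([seq.count(s) for s in range(num_symbols)])
        for seq in product(range(num_symbols), repeat=n)
    )

def p_nml_z(z, K):
    """p_NML(z) = p(z; theta2_hat(z)) / C_z."""
    return max_likelihood([z.count(k) for k in range(K)]) / complexity(len(z), K)

def p_nml_x_given_z(x, z, K, L):
    """p_NML(x | z): one categorical NML distribution per latent class."""
    value = 1.0
    for k in range(K):
        xs = [xi for xi, zi in zip(x, z) if zi == k]
        counts = [xs.count(l) for l in range(L)]
        value *= max_likelihood(counts) / complexity(len(xs), L)
    return value

def joint(x, z, K, L):
    """Product of the two NML distributions, as in the DNML code-length."""
    return p_nml_x_given_z(x, z, K, L) * p_nml_z(z, K)

n, K, L = 3, 2, 2
xs_all = list(product(range(L), repeat=n))
zs_all = list(product(range(K), repeat=n))

# Summing the product over every (x, z) pair gives 1 up to rounding.
total = sum(joint(x, z, K, L) for x in xs_all for z in zs_all)
# Keeping only one z per x (here the maximizer) can therefore not exceed 1.
kraft_sum = sum(max(joint(x, z, K, L) for z in zs_all) for x in xs_all)
print(total, kraft_sum)
```

Because the sum over all pairs is 1, any rule that selects a single \(\bar{\varvec{z}}({\varvec{x}})\) per \({\varvec{x}}\) keeps the sum in (36) at or below 1, which is the step the proof relies on.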

B Proof of Theorem 3

The proof basically follows that of Theorem 5.2 in Rissanen (2012, pp. 58–62). The only difference is that our model is restricted to the specific form (20), in which latent variables are included and the parameters are separated into \(\theta _{1}\) and \(\theta _{2}\). In our case, the product of two NML distributions, rather than a single NML distribution, must be taken into consideration.

We begin with the following lemma of Rissanen:

Lemma 1

(Rissanen 2012) Let \(p({\varvec{x}}; \theta )\) be a probability mass function that is continuous with respect to \(\theta \) for any \({\varvec{x}}\). Let K be fixed. Let \(\hat{\theta }({\varvec{x}})\) be the ML estimator of \(\theta \) and \(\bar{\theta }({\varvec{x}})\) be another estimator. Let \(\hat{p}({\varvec{x}})=p({\varvec{x}};\hat{\theta }({\varvec{x}}))/\hat{C}\) where \(\hat{C}=\sum _{{\varvec{x}}}p({\varvec{x}};\hat{\theta }({\varvec{x}}) )\). Let \(\bar{p}({\varvec{x}})=p({\varvec{x}};\bar{\theta }({\varvec{x}}))/\bar{C}\) where \(\bar{C}=\sum _{{\varvec{x}}}p({\varvec{x}};\bar{\theta }({\varvec{x}}) )\). Let \(\varDelta (p_{\theta }||\bar{p})\buildrel {\mathrm{def}} \over =D(p_{\theta }||\bar{p})-D(p_{\theta }||\hat{p})\). Then for \({\varvec{x}}\in \bar{A}=\{{\varvec{x}}: \hat{\theta }({\varvec{x}})\ne \bar{\theta }({\varvec{x}})\}\), we have

$$\begin{aligned} \sup _{\theta =\bar{\theta }({\varvec{x}}):{\varvec{x}}\in \bar{A}} \varDelta (p_{\theta } || \bar{p})\ge 0. \end{aligned}$$

In the first step, we consider the case where K is fixed. Let \(\bar{\theta }_{1},\bar{\theta }_{2}\) be fixed estimators of \(\theta _{1},\theta _{2}\), respectively. Let \(\bar{\theta }=(\bar{\theta }_{1},\bar{\theta }_{2})\). We define \(p_{\theta }\) and \(\bar{p}\) by

$$\begin{aligned} p_{\theta }({\varvec{x}},{\varvec{z}})= & {} p({\varvec{x}}|{\varvec{z}}, \theta _{1})p({\varvec{z}}; \theta _{2}),\\ \bar{p}({\varvec{x}},{\varvec{z}})= & {} \frac{p({\varvec{x}}|{\varvec{z}};\bar{\theta } _{1}({\varvec{x}},{\varvec{z}}))}{\bar{C}_{X|{\varvec{z}}}}\cdot \frac{p({\varvec{z}};\bar{\theta } _{2}({\varvec{z}}))}{\bar{C}_{Z}}, \end{aligned}$$

where \(\bar{C}_{X|{\varvec{z}}}=\sum _{{\varvec{x}}}p({\varvec{x}}|{\varvec{z}};\bar{\theta }_{1}({\varvec{x}},{\varvec{z}}))\) and \(\bar{C}_{Z}=\sum _{{\varvec{z}}}p({\varvec{z}}; \bar{\theta }_{2}({\varvec{z}}))\). Let \(\hat{\theta }=(\hat{\theta }_{1}, \hat{\theta }_{2})\) be the ML estimator of \(\theta \). Then \(\hat{p}({\varvec{x}},{\varvec{z}})\), \(\hat{C}_{X|{\varvec{z}}}\) and \(\hat{C}_{Z}\) are defined similarly.

Let us define \(\varDelta (p_{\theta }||\bar{p})\buildrel {\mathrm{def}} \over =D(p_{\theta }||\bar{p})-D(p_{\theta }||\hat{p})\). Let \(\bar{A}=\{({\varvec{x}},{\varvec{z}}):\hat{\theta }({\varvec{x}},{\varvec{z}})\ne \bar{\theta }({\varvec{x}},{\varvec{z}})\}\). We can decompose \(\varDelta (p_{\theta }||\bar{p})\) as follows:

$$\begin{aligned} \varDelta (p_{\theta }||\bar{p})= & {} E_{Z}[D(p_{\theta _{1}}||\bar{p}_{X|Z})]+D(p_{\theta _{2}}||\bar{p}_{Z})-\left( E_{Z}[D(p_{\theta _{1}}||\hat{p}_{X|Z})]+D(p_{\theta _{2}}||\hat{p}_{Z}) \right) \nonumber \\= & {} E_{Z}[\varDelta (p_{\theta _{1}}||\bar{p}_{X|Z})]+\varDelta (p_{\theta _{2}}||\bar{p}_{Z}), \end{aligned}$$

where \(p_{\theta _{1}}({\varvec{x}}|{\varvec{z}})=p({\varvec{x}}|{\varvec{z}};\theta _{1}), p_{\theta _{2}}({\varvec{z}})=p({\varvec{z}};\theta _{2})\), \(\bar{p}_{X|Z}({\varvec{x}}|{\varvec{z}})=p({\varvec{x}}|{\varvec{z}}; \bar{\theta }_{1}({\varvec{x}},{\varvec{z}}))/\bar{C}_{X|{\varvec{z}}}, \bar{p}_{Z}({\varvec{z}})=p({\varvec{z}};\bar{\theta }_{2}({\varvec{z}}))/\bar{C}_{Z}\) and \(E_{Z}\) denotes the expectation taken with respect to \(p_{\theta _{2}}\).

Applying Lemma 1 to \(E_{Z}[\varDelta (p_{\theta _{1}}||\bar{p}_{X|Z})]\) and \(\varDelta (p_{\theta _{2}}||\bar{p}_{Z})\) in (37), respectively, we are able to prove that

$$\begin{aligned} \sup _{\theta =\bar{\theta }({\varvec{x}},{\varvec{z}}):({\varvec{x}},{\varvec{z}})\in \bar{A}} \varDelta (p_{\theta } || \bar{p})\ge 0. \end{aligned}$$

Therefore, we have

$$\begin{aligned} \min _{\bar{\theta }}\max _{\theta }\varDelta (p_{\theta }||\bar{p})= & {} \min _{\bar{\theta }}\max _{\theta }(D(p_{\theta }||\bar{p})-D(p_{\theta }||\hat{p}))\nonumber \\\ge & {} \inf _{\bar{\theta }}\sup _{\theta =\bar{\theta }({\varvec{x}},{\varvec{z}}): ({\varvec{x}},{\varvec{z}})\in \bar{A}}\varDelta (p_{\theta }||\bar{p})\nonumber \\\ge & {} 0. \end{aligned}$$

Setting \(\bar{\theta }=\hat{\theta }\) makes (38) zero, which attains the minimum.

Next, we include the estimation of K in the process. Letting \(\bar{\theta }\) be an estimator of \(\theta \), we consider the form:

$$\begin{aligned} \bar{p}({\varvec{x}},{\varvec{z}}; K)=\bar{p}({\varvec{x}}|{\varvec{z}}; K)\bar{p}({\varvec{z}};K). \end{aligned}$$

Let \(\bar{K}\) be a given estimator of K and define \(\bar{p}\) by

$$\begin{aligned} \bar{p}({\varvec{x}},{\varvec{z}})=\frac{\bar{p}({\varvec{x}},{\varvec{z}}; \bar{K}({\varvec{x}},{\varvec{z}}))}{\sum _{{\varvec{y}},{\varvec{w}}}\bar{p}({\varvec{y}},{\varvec{w}}; \bar{K}({\varvec{y}},{\varvec{w}}))}, \end{aligned}$$

where \(\bar{p}({\varvec{x}},{\varvec{z}})\) forms a probability mass function. Specifically, we employ the ML estimator \(\hat{\theta }\) for \(\theta \) and the DNML estimator \(\hat{K}\) for K: \(\hat{K}({\varvec{x}},{\varvec{z}})=\arg \max _{K}\hat{p}({\varvec{x}},{\varvec{z}};K)\). Then we write the associated distribution as \(\hat{p}({\varvec{x}},{\varvec{z}})\). We also define \(p_{\theta ,K}({\varvec{x}},{\varvec{z}})\) as

$$\begin{aligned} p_{\theta ,K}({\varvec{x}},{\varvec{z}})=p({\varvec{x}}|{\varvec{z}}, \theta _{1}, K)p({\varvec{z}};\theta _{2},K). \end{aligned}$$

Lemma 1 can be applied again to the case where the estimator of K is normalized. Let \(\bar{B}=\{({\varvec{x}},{\varvec{z}}):\hat{K}({\varvec{x}},{\varvec{z}})\ne \bar{K}({\varvec{x}},{\varvec{z}})\}\). Then repeating the argument to evaluate (38), we have

$$\begin{aligned} \sup _{_{ \begin{array}{c}\theta =\bar{\theta }({\varvec{x}},{\varvec{z}}): ({\varvec{x}},{\varvec{z} })\in \bar{A}\\ K=\bar{K}({\varvec{x}},{\varvec{z}}):({\varvec{x}},{\varvec{z}})\in \bar{B}\end{array}}} \varDelta (p_{\theta ,K} || \bar{p})\ge 0. \end{aligned}$$

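Therefore, we have

$$\begin{aligned} \min _{\bar{\theta },\bar{K}}\max _{\theta ,K}\varDelta (p_{\theta ,K}||\bar{p})\ge & {} \inf _{\bar{\theta },\bar{K}}\sup _{_{ \begin{array}{c}\theta =\bar{\theta }({\varvec{x}},{\varvec{z}}): ({\varvec{x}},{\varvec{z}})\in \bar{A}\\ K=\bar{K}({\varvec{x}},{\varvec{z}}):({\varvec{x}},{\varvec{z}})\in \bar{B}\end{array}}} \varDelta (p_{\theta ,K} || \bar{p})\nonumber \\\ge & {} 0. \end{aligned}$$

Setting \(\bar{K}=\hat{K}\) makes (39) zero, which achieves the minimum. This completes the proof. \(\square \)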

C Proof of Theorem 5

In hierarchical latent variable models, the code-length term \(-\log p({\varvec{x}}, {\varvec{z}}; \hat{\theta }({\varvec{x}}, {\varvec{z}}) )\) can be decomposed into the sum of \( -\log p({\varvec{x}} | {\varvec{z}}; \hat{\theta }({\varvec{x}}, {\varvec{z}})) \) and \( -\log p({\varvec{z}}; \hat{\theta }({\varvec{z}}))\). This part is therefore common to both DNML and NML. We denote this term as \(L_{data}\).

The logarithm of the probability distribution of a finite mixture model can be written as \(\log p({\varvec{x}},{\varvec{z}}) = \sum _k \left\{ z_k \log \pi _k + z_k \log p(x | z_k = 1)\right\} \). Its Fisher information matrix \(I_{X,Z} \) is derived as a block-diagonal matrix whose diagonal components are \(I_{\mathrm{MN}},\pi _1^ { K^1_{X} } I^1_{X},\cdots ,\pi _K ^ {K^K_{X} } I^K_{X}\),

where \(I_{\mathrm{MN}}\) and \(I^k_{ X}\) are the Fisher information matrices for the multinomial distribution and for the kth base distribution.

Using the asymptotic approximation formula (8) for the parametric complexity, we can compute the NML code-length as

For the DNML code-length,

Subtracting \({L}_{_{\mathrm{DNML}}}({\varvec{x}}, {\varvec{z}}) \) from \( {L}_{_{\mathrm{NML}}}({\varvec{x}}, {\varvec{z}}) \), we obtain (27). This completes the proof. \(\square \)
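For orientation, the asymptotic expansion of the parametric complexity that is standard in the MDL literature (Rissanen 1996) is recalled here. That formula (8) takes exactly this form is an assumption, with \(k\) denoting the number of free parameters, \(n\) the sample size, \(I(\theta )\) the Fisher information matrix and \(C_{n}\) the parametric complexity:

$$\begin{aligned} \log C_{n}=\frac{k}{2}\log \frac{n}{2\pi }+\log \int \sqrt{|I(\theta )|}\,d\theta +o(1). \end{aligned}$$

Since \(I_{X,Z}\) is block-diagonal, its determinant is the product of the determinants of its diagonal blocks, which is presumably the property that lets the NML complexity term be compared block by block with the two complexity terms of the DNML code-length.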

D Derivation of DNML code-length for NB

The likelihood function for the complete variable model for NB is written as


where \(z_{ik}=1\ (z_{k}=i)\) and \(z_{ik}=0\ (z_{k}\ne i)\).

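When latent variables \({\varvec{z}}\) are given, the conditional maximum likelihood \(p({\varvec{x}}| {\varvec{z}}; \hat{\varPhi } ({\varvec{x}}, {\varvec{z}})) \) is obtained by maximizing (40) with respect to \(\varPhi \) as follows:

$$\begin{aligned} p({\varvec{x}}| {\varvec{z}}; \hat{\varPhi } ({\varvec{x}}, {\varvec{z}})) = \prod _k \prod _d \prod _l \left( \frac{ n_{kdl} }{ n_{kd} } \right) ^ { n_{kdl} } . \end{aligned}$$
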
Taking the negative logarithm of (41), we get the first term in (29). The second term represents the logarithm of the parametric complexity of \(p({\varvec{x}}| {\varvec{z}}; \hat{\varPhi }) \) and is given as follows:

$$\begin{aligned} \sum _{{\varvec{x}}} p({\varvec{x}}| {\varvec{z}}; \hat{\varPhi }) =&\sum _{{\varvec{x}}} \prod _k \prod _d \prod _l \left( \frac{ n_{kdl} }{ n_{kd} } \right) ^ { n_{kdl} } \\ =&\prod _k \prod _d \sum _{x_{kd}^n} \prod _l \left( \frac{ n_{kdl} }{ n_{kd}} \right) ^ { n_{kdl} } \\ =&\prod _k \prod _d C_{\mathrm{MN}} (n_k, L_d) . \end{aligned}$$

Since NB is a finite mixture model, the last two terms in (29) are derived from (24). For the time complexity, since \(n_{kd}, n_k\) can be computed via a single pass through data and \( C_{\mathrm{MN}} (n_k, L_d) \) can be computed in linear time in \(n_k\) and \(L_{d}\) (by Theorem 4), the total time complexity is linear in n and K.
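The four terms described in this derivation can be sketched in code. This is a minimal illustration under stated assumptions, not the authors' implementation: features are categorical and coded as integers, the multinomial complexity uses the recurrence of Kontkanen and Myllymäki (2007), which is presumably what Theorem 4 refers to, and the last two terms are taken to be the multinomial NML code-length of \({\varvec{z}}\), since (24) and (29) are not reproduced here. The function names are hypothetical and code-lengths are in nats.

```python
from math import comb, log

def multinomial_complexity(n, L):
    """C_MN(n, L) for n observations over L categories.

    Uses C(n, 1) = 1, a direct sum for L = 2, and the recurrence
    C(n, L + 2) = C(n, L + 1) + (n / L) * C(n, L).
    Plain floats are used, so this sketch suits moderate n only.
    """
    if n == 0 or L == 1:
        return 1.0
    prev = 1.0  # C(n, 1)
    curr = sum(comb(n, h) * (h / n) ** h * ((n - h) / n) ** (n - h)
               for h in range(n + 1))  # C(n, 2)
    for j in range(1, L - 1):
        prev, curr = curr, curr + (n / j) * prev
    return curr

def nb_dnml_codelength(X, z, K, L_sizes):
    """Assumed DNML code-length (nats) of data X with class assignments z."""
    n = len(X)
    n_k = [0] * K
    counts = [[[0] * L_d for L_d in L_sizes] for _ in range(K)]
    for x_i, k in zip(X, z):  # single pass through the data
        n_k[k] += 1
        for d, value in enumerate(x_i):
            counts[k][d][value] += 1
    total = log(multinomial_complexity(n, K))  # complexity term for z
    for k in range(K):
        if n_k[k] == 0:
            continue  # an empty class contributes nothing
        total -= n_k[k] * log(n_k[k] / n)  # -log p(z; ML estimate)
        for d, L_d in enumerate(L_sizes):
            # -log p(x | z; ML estimate) for class k, feature d
            total -= sum(c * log(c / n_k[k]) for c in counts[k][d] if c > 0)
            # log C_MN(n_k, L_d)
            total += log(multinomial_complexity(n_k[k], L_d))
    return total

X = [(0, 1), (0, 0), (1, 1), (1, 1)]
z = [0, 0, 1, 1]
print(nb_dnml_codelength(X, z, K=2, L_sizes=[2, 2]))
```

To select K with such a function, one would evaluate it for each candidate K (with \({\varvec{z}}\) estimated for that K) and keep the smallest code-length.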

E Derivation of DNML code-length for LDA

We begin by deriving \({L}_{ _{\mathrm{NML}}}({\varvec{x}}| {\varvec{z}}; K)\). Let \(\varTheta =\{\theta _d\}\) and \(\varPhi =\{\phi _k\}\). The likelihood function for the complete variable model for LDA is calculated as

$$\begin{aligned} p({\varvec{x}}, {\varvec{z}}; \varTheta , \varPhi , K) = \prod _{d=1}^{D} \prod _{i=1}^{n_{d}} \prod _{k=1}^{K} \left\{ \theta _{dk} \left( \prod _{v=1}^{V} \phi _{kv} ^ {x_{div}} \right) \right\} ^ { z_{dik} }. \end{aligned}$$

When we are given \({\varvec{z}}\), the maximum of the conditional likelihood function \(p({\varvec{x}} | {\varvec{z}}; \hat{\varPhi }({\varvec{x}}, {\varvec{z}}), K)\) is calculated by maximizing (42) with respect to \(\varTheta \) and \(\varPhi \) as follows:

$$\begin{aligned} p({\varvec{x}} | {\varvec{z}}; \hat{\varPhi }({\varvec{x}}, {\varvec{z}}), K) =&\prod _{k=1}^K \prod _{v=1}^V \left( \frac{n_{kv}}{n_k}\right) ^{n_{kv}} . \end{aligned}$$

Normalizing (43) and taking its negative logarithm, we obtain the first two terms in (30). Next, we consider \({L}_{_{\mathrm{NML}}}({\varvec{z}}; K)\). Since each document is a mixture of topics in LDA, \(p({\varvec{z}}; \varTheta ) \) can be decomposed into \(\prod _d p({\varvec{z}}_d; \theta _d)\), where \({\varvec{z}}_d\) allocates data to document d. Under this decomposition, \(p({\varvec{z}}_d; \theta _d)\) for each d comprises a finite mixture model. Then, the NML code-length for \({\varvec{z}}\) is obtained as \(\sum _d {L}_{_\mathrm{NML}}({\varvec{z}}_d; K)\), which yields the last two terms in (30).
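The decomposition described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that the normalizer of (43) factorizes into one multinomial complexity \(C_{\mathrm{MN}}(n_k, V)\) per topic, by the same argument as in the NB derivation, and that each \({L}_{_\mathrm{NML}}({\varvec{z}}_d; K)\) is a multinomial NML code-length over K topics. Function names and the input layout are hypothetical, and code-lengths are in nats.

```python
from math import comb, log

def multinomial_complexity(n, L):
    """C_MN(n, L) via C(n, L + 2) = C(n, L + 1) + (n / L) * C(n, L)."""
    if n == 0 or L == 1:
        return 1.0
    prev = 1.0  # C(n, 1)
    curr = sum(comb(n, h) * (h / n) ** h * ((n - h) / n) ** (n - h)
               for h in range(n + 1))  # C(n, 2)
    for j in range(1, L - 1):
        prev, curr = curr, curr + (n / j) * prev
    return curr

def neg_log_max_lik(counts):
    """-log prod_l (c_l / n)^(c_l), the maximized categorical likelihood."""
    n = sum(counts)
    return -sum(c * log(c / n) for c in counts if c > 0)

def lda_dnml_codelength(docs, topics, K, V):
    """docs[d][i] is a word id in 0..V-1, topics[d][i] its topic in 0..K-1."""
    n_kv = [[0] * V for _ in range(K)]
    total = 0.0
    for words, zs in zip(docs, topics):
        n_dk = [0] * K
        for v, k in zip(words, zs):
            n_kv[k][v] += 1
            n_dk[k] += 1
        # NML code-length of z_d: one multinomial over K topics per document
        total += neg_log_max_lik(n_dk)
        total += log(multinomial_complexity(len(words), K))
    for k in range(K):
        # NML code-length of the words assigned to topic k
        total += neg_log_max_lik(n_kv[k])
        total += log(multinomial_complexity(sum(n_kv[k]), V))
    return total

docs = [[0, 1, 1], [2, 2, 0]]
topics = [[0, 1, 1], [1, 1, 0]]
print(lda_dnml_codelength(docs, topics, K=2, V=3))
```

The per-document loop mirrors the sum \(\sum _d {L}_{_\mathrm{NML}}({\varvec{z}}_d; K)\), and the per-topic loop mirrors the normalized form of (43).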


Cite this article

Yamanishi, K., Wu, T., Sugawara, S. et al. The decomposed normalized maximum likelihood code-length criterion for selecting hierarchical latent variable models. Data Min Knowl Disc 33, 1017–1058 (2019).



  • Model selection
  • Latent variable model
  • Minimum description length
  • Normalized maximum likelihood coding