Appendix 1: Consistency of the MICL criterion
This section is devoted to the proof of consistency of our \(\text {MICL}\) criterion with a fixed number of components. The first part deals with non-nested models and requires a bias-entropy compensation assumption. The second part covers nested models, i.e., when the competing model contains the true model. In what follows, we consider the true model \(\varvec{m}^{(0)} = \big (g^{(0)}, \varvec{\omega }^{(0)}\big )\), whose set of relevant variables is \({\varOmega }^{(0)} = \left\{ j : \omega ^{(0)}_j = 1 \right\} \) and whose parameter is \(\varvec{\theta }^{(0)}\).
Case of non-nested model. We first introduce the entropy term given by
$$\begin{aligned} \xi \big (\varvec{\theta }; \mathbf {z}, \varvec{m}\big ) = \sum _{i = 1}^n \sum _{k = 1}^g z_{ik} \ln \tau _{ik}\big (\varvec{\theta }\mid \varvec{m}\big ), \end{aligned}$$
where \(\tau _{ik}\big (\varvec{\theta }\mid \varvec{m}\big ) = \dfrac{\tau _k \phi \big (\varvec{x}_i \mid \theta _k, \varvec{m}\big )}{\sum _{h=1}^{g}\tau _h \phi \big (\varvec{x}_i \mid \theta _h, \varvec{m}\big )}.\)
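To make the notation concrete, the following minimal sketch computes \(\tau _{ik}\big (\varvec{\theta }\mid \varvec{m}\big )\) and \(\xi \big (\varvec{\theta }; \mathbf {z}, \varvec{m}\big )\). It assumes, as in the Gaussian model used throughout, diagonal within-component covariances and that the parameters of the irrelevant variables are common to all components, so that these variables cancel in the ratio; the function names are illustrative, not taken from any package.

```python
import numpy as np
from scipy.stats import norm

def posterior_probs(x, tau, mu, sigma2, omega):
    """tau_{ik}(theta | m): posterior membership probabilities.

    x      : (n, d) data matrix
    tau    : (g,) mixing proportions
    mu     : (g, d) component means
    sigma2 : (g, d) component variances (diagonal model)
    omega  : (d,) 0/1 indicators of relevant variables; irrelevant variables
             are assumed to have common parameters, so they cancel in the ratio.
    """
    rel = np.flatnonzero(omega)
    # log tau_k + sum_{j relevant} log phi(x_ij | mu_kj, sigma2_kj), for each k
    log_num = np.log(tau) + np.stack(
        [norm.logpdf(x[:, rel], mu[k, rel], np.sqrt(sigma2[k, rel])).sum(axis=1)
         for k in range(len(tau))], axis=1)
    log_num -= log_num.max(axis=1, keepdims=True)   # log-sum-exp stabilization
    num = np.exp(log_num)
    return num / num.sum(axis=1, keepdims=True)

def entropy_term(tau_ik, z):
    """xi(theta; z, m) = sum_i sum_k z_ik ln tau_ik (a non-positive quantity)."""
    return float(np.sum(z * np.log(np.clip(tau_ik, 1e-300, None))))
```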
Proposition 1
Assume that \(\varvec{m}^{(1)}\) is a model such that \(\varvec{m}^{(0)}\) is not nested within \(\varvec{m}^{(1)}\). Assume moreover that
$$\begin{aligned}&- \mathbb {E}\left[ \ln \dfrac{\sum _{k = 1}^{g^{(0)}}\tau _k \prod _{j = 1}^d\phi \big (x_{1j} \mid \mu ^{(0)}_{kj}, \sigma ^{(0)2}_{kj}\big )\mathbbm {1}_{G^{(0)}_k}\big (\varvec{x}_1\big )}{p\big ( \varvec{x}_1 \mid \varvec{\theta }^{(0)},\varvec{m}^{(0)}\big )}\right] \nonumber \\&\quad \le \mathbf {KL}\Big [\varvec{m}^{(0)}||\varvec{m}^{(1)}\Big ], \end{aligned}$$
(22)
where \(\mathbf {KL}\Big [\varvec{m}^{(0)}||\varvec{m}^{(1)}\Big ]\) is the Kullback-Leibler divergence of \(p\big (\cdot \mid \varvec{\theta }^{(0)},\varvec{m}^{(0)}\big )\) from \(p\big (\cdot \mid \varvec{\theta }^{(1)},\varvec{m}^{(1)}\big )\) and
$$\begin{aligned} G^{(0)}_k = \left\{ \varvec{x} \in \mathbb R^d : k = \underset{1 \le h \le g^{(0)}}{\text {argmax}}\, \tau _h \prod _{j = 1}^d\phi \big (x_{j} \mid \mu ^{(0)}_{hj}, \sigma ^{(0)2}_{hj}\big )\right\} . \end{aligned}$$
When \(n \rightarrow \infty \), we have
$$\begin{aligned} \mathbb {P}\bigg (\text {MICL}\big (\varvec{m}^{(1)}\big ) > \text {MICL}\big (\varvec{m}^{(0)}\big )\bigg ) \longrightarrow 0. \end{aligned}$$
Proof
For any model \(\varvec{m}\), we have the following inequalities,
$$\begin{aligned} \text {ICL}\big (\varvec{m}\big ) \le \text {MICL}\big (\varvec{m}\big ) \le \ln p\big (\mathbf {x}\mid \varvec{m}\big ). \end{aligned}$$
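Indeed, recalling that \(\text {ICL}\big (\varvec{m}\big )\) evaluates the complete-data log-likelihood at the MAP partition \(\widehat{\mathbf {z}}\), that \(\text {MICL}\big (\varvec{m}\big )\) maximizes it over all partitions (as recalled in the nested case below), and that the marginal likelihood sums over all partitions,
$$\begin{aligned} \ln p\big (\mathbf {x}, \widehat{\mathbf {z}}\mid \varvec{m}\big ) \le \underset{\mathbf {z}}{\max }\, \ln p\big (\mathbf {x}, \mathbf {z}\mid \varvec{m}\big ) \le \ln \sum _{\mathbf {z}} p\big (\mathbf {x}, \mathbf {z}\mid \varvec{m}\big ) = \ln p\big (\mathbf {x}\mid \varvec{m}\big ). \end{aligned}$$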
It follows that
$$\begin{aligned}&\mathbb {P}\bigg \{\text {MICL}\big (\varvec{m}^{(1)}\big ) - \text {MICL}\big (\varvec{m}^{(0)}\big )> 0\bigg \}\\&\quad \le \mathbb {P}\bigg \{\ln p\big (\mathbf {x}\mid \varvec{m}^{(1)}\big ) - \text {ICL}\big (\varvec{m}^{(0)}\big ) > 0\bigg \}. \end{aligned}$$
Now set \({\varDelta }\nu = \nu ^{(1)} - \nu ^{(0)}\) where \(\nu ^{(1)}\) and \(\nu ^{(0)}\) are the numbers of free parameters in the models \(\varvec{m}^{(1)}\) and \(\varvec{m}^{(0)}\) respectively. Using Laplace’s approximation, we have
$$\begin{aligned} \text {ICL}\big (\varvec{m}^{(0)}\big )= & {} \ln p\left( \mathbf {x}\mid \widehat{\varvec{\theta }}^{(0)}, \varvec{m}^{(0)}\right) + \xi \left( \widehat{\varvec{\theta }}^{(0)}; \widehat{\mathbf {z}}^{(0)}, \varvec{m}^{(0)}\right) \nonumber \\&- \dfrac{\nu ^{(0)}}{2} \ln n+ \mathcal {O}_p(1), \end{aligned}$$
where \(\widehat{\varvec{\theta }}^{(0)}\) and \(\widehat{\mathbf {z}}^{(0)}\) are respectively the MLE and the partition given by the corresponding MAP rule. In the same way, we have
$$\begin{aligned} \ln p\big (\mathbf {x}\mid \varvec{m}^{(1)}\big ) = \ln p\big (\mathbf {x}\mid \widehat{\varvec{\theta }}^{(1)}, \varvec{m}^{(1)}\big ) - \dfrac{\nu ^{(1)}}{2} \ln n + \mathcal {O}_p(1), \end{aligned}$$
where \(\widehat{\varvec{\theta }}^{(1)}\) is the MLE of \(\varvec{\theta }^{(1)}\). Note that
$$\begin{aligned}&\ln p\big (\mathbf {x}\mid \varvec{m}^{(1)}\big ) - \text {ICL}\big (\varvec{m}^{(0)}\big )\\&= \dfrac{A_n}{2} + n B_n - \dfrac{{\varDelta }\nu }{2} \ln n +\, \mathcal {O}_p(1), \end{aligned}$$
where
$$\begin{aligned} A_n = 2 \ln \dfrac{p\left( \mathbf {x}\mid \widehat{\varvec{\theta }}^{(1)},\varvec{m}^{(1)}\right) }{ p\left( \mathbf {x}\mid \varvec{\theta }^{(1)}, \varvec{m}^{(1)}\right) } - 2 \ln \dfrac{p\left( \mathbf {x}\mid \widehat{\varvec{\theta }}^{(0)}, \varvec{m}^{(0)}\right) }{p\left( \mathbf {x}\mid \varvec{\theta }^{(0)}, \varvec{m}^{(0)}\right) }, \end{aligned}$$
and
$$\begin{aligned} B_n = \dfrac{1}{n} \ln \dfrac{p\left( \mathbf {x}\mid \varvec{\theta }^{(1)},\varvec{m}^{(1)}\right) }{p\left( \mathbf {x}\mid \varvec{\theta }^{(0)},\varvec{m}^{(0)}\right) } - \dfrac{1}{n}\xi \left( \widehat{\varvec{\theta }}^{(0)}; \widehat{\mathbf {z}}^{(0)},\varvec{m}^{(0)}\right) . \end{aligned}$$
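This decomposition is obtained by subtracting the two Laplace expansions above and by adding and subtracting \(\ln p\big (\mathbf {x}\mid \varvec{\theta }^{(1)},\varvec{m}^{(1)}\big )\) and \(\ln p\big (\mathbf {x}\mid \varvec{\theta }^{(0)},\varvec{m}^{(0)}\big )\):
$$\begin{aligned}&\ln p\big (\mathbf {x}\mid \varvec{m}^{(1)}\big ) - \text {ICL}\big (\varvec{m}^{(0)}\big )\\&\quad = \underbrace{\ln \dfrac{p\big (\mathbf {x}\mid \widehat{\varvec{\theta }}^{(1)},\varvec{m}^{(1)}\big )}{p\big (\mathbf {x}\mid \varvec{\theta }^{(1)},\varvec{m}^{(1)}\big )} - \ln \dfrac{p\big (\mathbf {x}\mid \widehat{\varvec{\theta }}^{(0)},\varvec{m}^{(0)}\big )}{p\big (\mathbf {x}\mid \varvec{\theta }^{(0)},\varvec{m}^{(0)}\big )}}_{A_n/2} \\&\qquad + \underbrace{\ln \dfrac{p\big (\mathbf {x}\mid \varvec{\theta }^{(1)},\varvec{m}^{(1)}\big )}{p\big (\mathbf {x}\mid \varvec{\theta }^{(0)},\varvec{m}^{(0)}\big )} - \xi \big (\widehat{\varvec{\theta }}^{(0)}; \widehat{\mathbf {z}}^{(0)},\varvec{m}^{(0)}\big )}_{n B_n} - \dfrac{{\varDelta }\nu }{2}\ln n + \mathcal {O}_p(1). \end{aligned}$$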
When \(n \rightarrow \infty \), \(A_n\) converges in distribution to a \(\chi ^2_{{\varDelta }\nu }\) random variable and, by the law of large numbers, \(B_n\) tends to
$$\begin{aligned}&-\mathbf {KL}\Big [\varvec{m}^{(0)}||\varvec{m}^{(1)}\Big ]\\&- \mathbb {E}\left[ \ln \dfrac{\sum _{k = 1}^{g^{(0)}}\tau _k \prod _{j = 1}^d\phi \big (x_{1j} \mid \mu ^{(0)}_{kj}, \sigma ^{(0)2}_{kj}\big )\mathbbm {1}_{G^{(0)}_k}\big (\varvec{x}_1\big )}{p\big ( \varvec{x}_1 \mid \varvec{\theta }^{(0)},\varvec{m}^{(0)}\big )}\right] \end{aligned}$$
in probability. Thus, under assumption (22), \(\text {MICL}\) is consistent since, when \(n \rightarrow \infty \), we have
$$\begin{aligned}&\mathbb {P}\bigg \{\text {MICL}\big (\varvec{m}^{(1)}\big ) - \text {MICL}\big (\varvec{m}^{(0)}\big )> 0 \bigg \}\\&\quad \le \mathbb {P}\bigg [A_n + \mathcal {O}_p(1)> {\varDelta }\nu \ln n\bigg ] + \mathbb {P}\bigg [B_n > 0 \bigg ]\longrightarrow 0. \end{aligned}$$
Case of nested model. Recall that \(\text {MICL}\big (\varvec{m}^{(0)}\big ) = \ln p\big (\mathbf {x}, \mathbf {z}^{(0)}\mid \varvec{m}^{(0)}\big )\), where \(\mathbf {z}^{(0)} = \underset{\mathbf {z}}{\text {argmax}} \ln p\big (\mathbf {x}, \mathbf {z}\mid \varvec{m}^{(0)}\big )\). We have
$$\begin{aligned} \mathbf {z}^{(0)} {=} \underset{\mathbf {z}}{\text {argmax}}\Big \{\ln p\left( \mathbf {z}\mid g^{(0)}\right) {+} \underset{j \in {\varOmega }_0}{\sum }\ln p\big (\mathbf {x}_{\bullet j} \mid \omega _j^{(0)}, g^{(0)}, \mathbf {z}\big ) \Big \}, \end{aligned}$$
where \({\varOmega }_0 = {\varOmega }^{(0)} = \left\{ j : \omega ^{(0)}_j = 1\right\} \). Let \(\varvec{m}^{(1)} = \left( g^{(0)}, {\varOmega }_1\right) \) where \({\varOmega }_1 = {\varOmega }_0 \cup {\varOmega }_{01}\) and \( {\varOmega }_{01} = \left\{ j : \omega _j^{(1)} = 1, \omega _j^{(0)} = 0 \right\} \). Then, in the same way, we have \(\text {MICL}\big (\varvec{m}^{(1)}\big ) = \ln p\big (\mathbf {x}, \mathbf {z}^{(1)}\!\mid \varvec{m}^{(1)}\big )\), where
$$\begin{aligned} \mathbf {z}^{(1)} {=} \underset{\mathbf {z}}{\text {argmax}}\left[ \ln p(\mathbf {z}\mid g^{(0)}) {+} \underset{j \in {\varOmega }_1}{\sum }\ln p\left( \mathbf {x}_{\bullet j} {\mid } \omega _j^{(1)}, g^{(0)}, \mathbf {z}\right) \right] . \end{aligned}$$
For \(j \in {\varOmega }_{01}\), Laplace's approximation gives
$$\begin{aligned} \ln p\left( \mathbf {x}_{\bullet j} \mid \omega _j^{(1)}, g^{(0)}, \mathbf {z}\right)= & {} \sum _{i =1}^n \sum _{k = 1}^g z_{ik}\ln \phi \left( x_{ij} \mid \tilde{\mu }^{(1)}_{kj}, \tilde{\sigma }^{(1)2}_{kj}\right) \nonumber \\&- g^{(0)} \ln n + \mathcal {O}_p(1), \end{aligned}$$
where
$$\begin{aligned} \left( \tilde{\mu }^{(1)}_{kj}, \tilde{\sigma }^{(1)2}_{kj}\right) = \underset{\mu ^{(1)}_{kj}, \sigma ^{(1)2}_{kj}}{\text {argmax}} \sum _{i =1}^n z_{ik}\ln \phi \left( x_{ij} \mid \mu ^{(1)}_{kj}, \sigma ^{(1)2}_{kj}\right) . \end{aligned}$$
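As an illustration of this approximation, the following sketch evaluates \(\sum _{i}\sum _{k} z_{ik}\ln \phi \big (x_{ij} \mid \tilde{\mu }^{(1)}_{kj}, \tilde{\sigma }^{(1)2}_{kj}\big ) - g^{(0)} \ln n\) for one variable and a fixed partition, plugging in the class-wise sample means and variances as the within-class MLEs; the function name and the small variance floor are illustrative choices, not part of the paper.

```python
import numpy as np
from scipy.stats import norm

def relevant_variable_loglik(xj, z, var_floor=1e-8):
    """BIC-type approximation of ln p(x_.j | omega_j = 1, g, z):
    within-class Gaussian MLEs plugged in, minus (2g/2) ln n = g ln n.

    xj : (n,) values of variable j
    z  : (n, g) hard partition written as a 0/1 matrix
    """
    n, g = z.shape
    loglik = 0.0
    for k in range(g):
        xk = xj[z[:, k] == 1]
        if xk.size == 0:
            continue                            # empty class: no contribution
        mu_k = xk.mean()                        # within-class MLE of mu_kj
        s2_k = max(xk.var(), var_floor)         # within-class MLE of sigma2_kj
        loglik += norm.logpdf(xk, mu_k, np.sqrt(s2_k)).sum()
    return loglik - g * np.log(n)
```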
Proposition 2
Assume that \(\varvec{m}^{(1)}\) is a model such that \(g^{(1)} = g^{(0)}\) and \({\varOmega }_1 = {\varOmega }_0 \cup {\varOmega }_{01}\) where \({\varOmega }_{01} \ne \emptyset \), i.e., the model \(\varvec{m}^{(0)}\) is nested within the model \(\varvec{m}^{(1)}\) with the same number of components. When \(n \rightarrow \infty \),
$$\begin{aligned} \mathbb {P}\bigg (\text {MICL}\big (\varvec{m}^{(1)}\big ) > \text {MICL}\big (\varvec{m}^{(0)}\big )\bigg ) \longrightarrow 0. \end{aligned}$$
Proof
We have
$$\begin{aligned}&\mathbb {P}\bigg \{\text {MICL}\big (\varvec{m}^{(1)}\big )> \text {MICL}\big (\varvec{m}^{(0)}\big )\bigg \}\\&\quad \le \mathbb {P}\left\{ \sum _{j \in {\varOmega }_{01}} \ln \dfrac{p\big (\mathbf {x}_{\bullet j} \mid \omega ^{(1)}_j, g^{(0)}, \mathbf {z}^{(1)}\big )}{ p\big (\mathbf {x}_{\bullet j} \mid \omega ^{(0)}_j, g^{(0)}, \mathbf {z}^{(0)}\big )} > 0 \right\} , \end{aligned}$$
and, for each \(j \in {\varOmega }_{01}\), when \(n \rightarrow \infty \),
$$\begin{aligned} 2 \sum _{i =1}^n \sum _{k =1}^{g^{(0)}} z^{(1)}_{ik}\ln \dfrac{\phi \left( x_{ij} \mid \tilde{\mu }^{(1)}_{kj}, \tilde{\sigma }^{(1)2}_{kj}\right) }{\phi \left( x_{ij} \mid \mu ^{(0)}_{1j},\sigma ^{(0)2}_{1j}\right) } \longrightarrow \chi ^2_{2g^{(0)}} \quad \text {in distribution}. \end{aligned}$$
We have
$$\begin{aligned}&\mathbb {P}\bigg (\sum _{j \in {\varOmega }_{01}} \ln \dfrac{p\big (\mathbf {x}_{\bullet j} \mid \omega ^{(1)}_j, g^{(0)}, \mathbf {z}^{(1)}\big )}{ p\big (\mathbf {x}_{\bullet j} \mid \omega ^{(0)}_j, g^{(0)}, \mathbf {z}^{(0)}\big )}> 0 \bigg )\\&\quad = \mathbb {P}\Big (\chi ^2_{2(g^{(0)}-1)} - 2 (g^{(0)}-1) \ln n > 0\Big )\\&\qquad \longrightarrow 0 \quad \text { by Chebyshev's inequality}. \end{aligned}$$
Appendix 2: Details on the partition step
At iteration [r], the partition \(\mathbf {z}^{[r]}\) is defined as a partition which increases the value of the integrated complete-data likelihood for the current model \(\varvec{m}^{[r]}\). This partition is obtained by an iterative method initialized with the partition \(\mathbf {z}^{[r-1]}\). Each iteration consists in sampling one individual uniformly at random and assigning it to the class maximizing the integrated complete-data likelihood, the other class memberships being kept unchanged.
At iteration [r] of the global algorithm, the algorithm used to obtain \(\mathbf {z}^{[r]}\) is initialized at the partition \(\mathbf {z}^{(0)}=\mathbf {z}^{[r-1]}\). It performs S iterations, where iteration (s) consists of the two steps below (a code sketch of the whole loop is given after them):
Individual sampling
\(i^{(s)} \sim \mathcal {U}\{1,\ldots ,n\}\)
Partition optimization Define the set of partitions \(\mathcal {Z}^{(s)}=\{\mathbf {z}: \varvec{z}_i=\varvec{z}_i^{(s-1)},\; \forall i\ne i^{(s)}\}\) and compute the optimized partition \(\mathbf {z}^{(s)}\) defined by
$$\begin{aligned} \mathbf {z}^{(s)} = \text {argmax}_{\mathbf {z}\in \mathcal {Z}^{(s)}} \ln p\left( \mathbf {x},\mathbf {z}|\varvec{m}^{[r]}\right) . \end{aligned}$$
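A minimal sketch of this partition step is given below, under the assumption that a routine `complete_data_log_evidence` computing \(\ln p\big (\mathbf {x},\mathbf {z}\mid \varvec{m}^{[r]}\big )\) is supplied by the caller (this name is a hypothetical placeholder, not a function of the paper). Each accepted move can only increase this quantity, so the criterion is non-decreasing over the S iterations.

```python
import numpy as np

def partition_step(x, z_init, complete_data_log_evidence, n_iter, g, rng=None):
    """Stochastic greedy update of the partition z (one call per iteration [r]).

    x                          : (n, d) data matrix
    z_init                     : (n,) initial labels in {0, ..., g-1}, i.e. z^{[r-1]}
    complete_data_log_evidence : callable (x, z) -> ln p(x, z | m^{[r]})
    n_iter                     : number S of single-individual updates
    g                          : number of classes of the current model
    """
    rng = np.random.default_rng(rng)
    z = z_init.copy()
    best = complete_data_log_evidence(x, z)
    for _ in range(n_iter):
        i = rng.integers(len(z))        # individual sampling: i^{(s)} ~ U{1, ..., n}
        current = z[i]
        for k in range(g):              # partition optimization over Z^{(s)}:
            if k == current:            # only the membership of i^{(s)} may change
                continue
            z[i] = k
            value = complete_data_log_evidence(x, z)
            if value > best:
                best, current = value, k
        z[i] = current                  # keep the best class found for i^{(s)}
    return z
```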