# Entropic risk minimization for nonparametric estimation of mixing distributions


## Abstract

We discuss a nonparametric estimation method for the mixing distributions in mixture models. The problem is formalized as a minimization of a one-parameter objective functional, which becomes the maximum likelihood estimation or the kernel vector quantization in special cases. Generalizing the theorem for the nonparametric maximum likelihood estimation, we prove the existence and discreteness of the optimal mixing distribution and provide an algorithm to calculate it. It is demonstrated that with an appropriate choice of the parameter, the proposed method is less prone to overfitting than the maximum likelihood method. We further discuss the connection between the unifying estimation framework and the rate-distortion problem.

### Keywords

Mixture models, Nonparametric estimation, Entropic risk measure, Rate-distortion theory

## 1 Introduction

In this paper, we define an objective functional with a single parameter \(\beta \), called the entropic risk measure (Rudloff et al. 2008), and propose a nonparametric mixture estimation method as its minimization. With specific choices of \(\beta \), the method reduces to the maximum likelihood estimation (MLE) (Lindsay 1983, 1995) and the kernel vector quantization (KVQ) (Tipping and Schölkopf 2001). We generalize Lindsay’s theorem to the proposed method and prove the discreteness of the optimal mixing distribution for general \(\beta \). Then, we provide an algorithm, an extension of the procedure in Nowozin and Bakir (2008), to calculate the optimal mixing distribution for the entropic risk measure. Numerical experiments indicate that an appropriate choice of \(\beta \) reduces the generalization error. We discuss the estimation bias and variance to show that the range of optimal \(\beta \) depends on the sample size. We also discuss the relation between the proposed mixture estimation method and the rate-distortion problem (Berger 1971).

The paper is organized as follows. Section 2 introduces the entropic risk measure as the objective functional for estimating the mixture model. Section 3 proves the discreteness of the optimal mixing distribution with an overview of Lindsay’s proof, and presents a concrete estimation algorithm for the mixing distribution. Section 4 examines its properties through numerical experiments for the Gaussian mixture model. In Sect. 5, we consider the range of \(\beta \) that improves the generalization ability and describe the relation to the rate-distortion theory. Section 6 discusses extensions to objective functionals other than the entropic risk measure, and Sect. 7 concludes this paper.

## 2 Mixture model and objective functional

If \(q(\theta )\) is a single point distribution, computing \(q(\theta )\) is a point estimation; and if \(q(\theta )\) is a parametric distribution, the inference of \(q(\theta )\) is known as the empirical Bayesian approach. Instead, we consider the problem nonparametrically.

^{1}

^{2}Note that \(F_{\beta }(q)\) is continuous with respect to \(\beta \in \mathbf {R}\), and convex in \(q\) for \(\beta \ge -1\). We will discuss its convexity in Sect. 3.1. The above optimization problem becomes the MLE for \(\beta =0\) and the KVQ for \(\beta \rightarrow \infty \). We will discuss other choices of the convex objective functional in Sect. 6.
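The two limiting cases can be checked numerically. The sketch below is illustrative only: it assumes the objective takes the standard entropic-risk form \(F_{\beta }=\frac{1}{\beta }\log \frac{1}{n}\sum _{i}r_{i}^{-\beta }\) applied to the mixture density values \(r_{i}=r(x_{i};q)\), and the function name `entropic_risk` and the sample values in `r` are hypothetical.

```python
import math

def entropic_risk(r, beta):
    """Assumed form of F_beta for density values r_i = r(x_i; q) > 0.

    F_beta = (1/beta) * log((1/n) * sum_i r_i**(-beta)),
    with the beta -> 0 limit -(1/n) * sum_i log r_i (the MLE objective).
    """
    n = len(r)
    logs = [math.log(v) for v in r]
    if abs(beta) < 1e-12:
        return -sum(logs) / n
    # log-sum-exp of -beta * log r_i, shifted by the maximum for stability
    a = [-beta * v for v in logs]
    m = max(a)
    lse = m + math.log(sum(math.exp(v - m) for v in a))
    return (lse - math.log(n)) / beta

# hypothetical density values at four sample points
r = [0.30, 0.05, 0.20, 0.10]
mle = entropic_risk(r, 0.0)        # negative mean log-likelihood
near_zero = entropic_risk(r, 1e-6) # should be close to the MLE objective
large = entropic_risk(r, 200.0)    # should approach -log(min_i r_i), the KVQ-type limit
```

Under this assumed form, the value is nondecreasing in \(\beta \), so a large \(\beta \) puts almost all emphasis on the worst-fitted sample point.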

## 3 Optimal mixing distribution

### 3.1 Discreteness of the optimal mixing distribution

In this section, we generalize Lindsay’s theorem (Lindsay 1983, 1995) to prove that the optimal mixing distribution \(q(\theta )\) which minimizes \(F_{\beta }\) in Eq. (2) is discrete. Furthermore, this enables us to rely on the decoupled approach in Nowozin and Bakir (2008). We will see this in Sect. 3.2.

If \({\varvec{r}}^{*}\) minimizes \(F_{\beta }({\varvec{r}})\), \(F_{\beta , {\varvec{r}}^{*}}^{\prime }({\varvec{r}}_{1}) \ge 0\) must hold for any \({\varvec{r}}_{1}\). It has been proved for \(\beta =0\) that there exists a unique \({\varvec{r}}\) that minimizes \(F_{\beta }\) at the boundary of the convex hull of the set \(\{{\varvec{p}}_{\theta } = (p(x_{1}|\theta ),\ldots , p(x_{n}|\theta )) | \theta \in \Omega \}\), where \(\Omega \) is the parameter space (Lindsay 1983, 1995). This result can be generalized for the case \(\beta \ge -1\) because of the convexity of \(F_{\beta }\). From Caratheodory’s theorem, this means that the optimal \({\varvec{r}}\) is expressed by a convex combination, \(\sum _{l=1}^{k}\pi _{l}{\varvec{p}}_{\theta _{l}}\), with \(\pi _{l}\ge 0\), \(\sum _{l=1}^{k}\pi _{l}=1\) and \(k\le n\), indicating that the optimal mixing distribution is \(q(\theta )=\sum _{l=1}^{k}\pi _{l}\delta (\theta -\theta _{l})\), \(\theta _{l}\in \Omega \), which is a discrete distribution where the number of the support points is no more than \(n\).
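The nonnegativity condition on the directional derivative can be written out under an assumed form of the objective. The following worked step is a sketch rather than the authors' derivation: it assumes \(F_{\beta }({\varvec{r}})=\frac{1}{\beta }\log \frac{1}{n}\sum _{i=1}^{n}r_{i}^{-\beta }\), takes \(F_{\beta ,{\varvec{r}}_{0}}^{\prime }({\varvec{r}}_{1})\) to be the derivative along the segment from \({\varvec{r}}_{0}\) to \({\varvec{r}}_{1}\), and assumes that the normalized weights \(w_{i}\propto r_{0,i}^{-\beta }\) are what the text refers to as Eq. (9).

$$\begin{aligned} F_{\beta ,{\varvec{r}}_{0}}^{\prime }({\varvec{r}}_{1}) = \left. \frac{d}{d\alpha }F_{\beta }\left( (1-\alpha ){\varvec{r}}_{0}+\alpha {\varvec{r}}_{1}\right) \right| _{\alpha =0} = -\sum _{i=1}^{n}w_{i}\frac{r_{1,i}-r_{0,i}}{r_{0,i}} = 1-\sum _{i=1}^{n}w_{i}\frac{r_{1,i}}{r_{0,i}}, \qquad w_{i}=\frac{r_{0,i}^{-\beta }}{\sum _{j=1}^{n}r_{0,j}^{-\beta }}. \end{aligned}$$

Taking \({\varvec{r}}_{0}={\varvec{r}}^{*}\) and \({\varvec{r}}_{1}={\varvec{p}}_{\theta }\), the requirement \(F_{\beta ,{\varvec{r}}^{*}}^{\prime }({\varvec{p}}_{\theta })\ge 0\) reads \(\sum _{i=1}^{n}w_{i}\,p(x_{i}|\theta )/r_{i}^{*}\le 1\) for every \(\theta \in \Omega \). For \(\beta =0\) the weights are uniform, and this is the familiar gradient condition for the nonparametric MLE.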

### 3.2 Optimization of mixing distribution

In this section, we derive an estimation algorithm for \(q(\theta )\) following Nowozin and Bakir (2008). This algorithm iterates the subproblem that augments a new point to the support of \(q(\theta )\) and the learning of the finite mixture model. The minimization of \(F_{\beta }\) over finite mixture models is implemented with a simple updating rule by the expectation-maximization (EM) algorithm (Dempster et al. 1977; Barber 2012).

#### 3.2.1 Learning procedure

^{3}and \(F_{\beta }({\varvec{r}})\) can be decreased by adding more weight \(\pi _{l'}\) on \(\theta _{l'}\). Thus, the optimal condition for the mixing distribution \(q(\theta )\) is summarized as

The above algorithm updates \(\{\theta _{l}\}\) as well as \(\{\pi _{l}\}\) in Step 4. This is an extension of Nowozin and Bakir (2008), where only \(\{\pi _{l}\}\) is updated in the algorithm. Algorithm 1 requires a constant \(\epsilon \) and depends strongly on it, especially when only \(\{\pi _{l}\}\) is updated. In the numerical experiments in the next section, we set \(\epsilon = 0.01\). From the assertion in Sect. 3.1, this learning procedure is guaranteed to stop before the support size of \(q(\theta )\) exceeds \(n\), provided that it is started with an empty support set and both \(\{\theta _{l}\}\) and \(\{\pi _{l}\}\) are updated.
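A minimal sketch of a support-augmenting loop in the spirit of this procedure is given below for the "means fixed" variant with \(\beta \le 0\). It is not a transcription of Algorithm 1. It assumes the entropic-risk form of \(F_{\beta }\), weights \(w_{i}\propto r_{i}^{-\beta }\), a one-dimensional Gaussian component with width \(\gamma \), a finite candidate set for new support points, and a backtracked mixing step when a point is added. All function names (`gauss`, `mixture`, `risk`, `weights`, `fit_support`) are hypothetical.

```python
import math

def gauss(x, theta, gamma):
    # 1-D Gaussian component with width parameter gamma (variance 1 / (2 * gamma))
    return math.sqrt(gamma / math.pi) * math.exp(-gamma * (x - theta) ** 2)

def mixture(xs, thetas, pis, gamma):
    return [sum(p * gauss(x, t, gamma) for t, p in zip(thetas, pis)) for x in xs]

def risk(r, beta):
    # assumed form of F_beta; beta = 0 gives the negative mean log-likelihood
    if beta == 0:
        return -sum(math.log(v) for v in r) / len(r)
    return math.log(sum(v ** (-beta) for v in r) / len(r)) / beta

def weights(r, beta):
    # assumed sample weights: w_i proportional to r_i ** (-beta)
    u = [v ** (-beta) for v in r]
    s = sum(u)
    return [v / s for v in u]

def fit_support(xs, candidates, beta, gamma, eps=0.01, max_outer=50, inner=100):
    # start from the single candidate with the smallest risk (an assumption)
    best = min(candidates, key=lambda t: risk(mixture(xs, [t], [1.0], gamma), beta))
    thetas, pis = [best], [1.0]
    for _ in range(max_outer):
        r = mixture(xs, thetas, pis, gamma)
        w = weights(r, beta)

        def gain(t):
            # sum_i w_i p(x_i | theta) / r_i; values above 1 indicate a descent direction
            return sum(wi * gauss(x, t, gamma) / ri for wi, x, ri in zip(w, xs, r))

        new = max(candidates, key=gain)
        if gain(new) <= 1.0 + eps:
            break  # no candidate improves the objective by more than the tolerance
        if new not in thetas:
            # mix in the new point, halving the step until the risk decreases
            alpha, base = 0.5, risk(r, beta)
            while alpha > 1e-12:
                trial = [(1 - alpha) * p for p in pis] + [alpha]
                if risk(mixture(xs, thetas + [new], trial, gamma), beta) < base:
                    thetas, pis = thetas + [new], trial
                    break
                alpha /= 2
        # weight-only updates with the support points held fixed
        for _ in range(inner):
            r = mixture(xs, thetas, pis, gamma)
            w = weights(r, beta)
            pis = [sum(wi * p * gauss(x, t, gamma) / ri for wi, x, ri in zip(w, xs, r))
                   for t, p in zip(thetas, pis)]
    return thetas, pis
```

For example, `fit_support(xs, xs, beta=-0.2, gamma=0.5)` restricts candidate support points to the sample itself, as in the exemplar-based setting. The tolerance `eps` plays the role of the constant \(\epsilon \) discussed above.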

#### 3.2.2 EM updates for finite mixtures

^{4}

We can prove that the above update monotonically decreases the objective \(F_{\beta }\) for \(\beta \le 0\).^{5}
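The monotone decrease for \(\beta \le 0\) can be observed with a small weighted EM iteration. The sketch below is an assumed reconstruction, not the paper's update equations: it takes \(w_{i}\propto r_{i}^{-\beta }\), uses weighted responsibilities \(w_{i}\pi _{l}p(x_{i}|\theta _{l})/r_{i}\), and updates \(\pi _{l}\) and the means of one-dimensional Gaussian components with fixed width \(\gamma \). The names `component`, `objective`, `em_step` and `run_em` are hypothetical.

```python
import math

def component(x, theta, gamma):
    # 1-D Gaussian with width parameter gamma (variance 1 / (2 * gamma))
    return math.sqrt(gamma / math.pi) * math.exp(-gamma * (x - theta) ** 2)

def objective(xs, thetas, pis, beta, gamma):
    # assumed form of F_beta; beta = 0 gives the negative mean log-likelihood
    r = [sum(p * component(x, t, gamma) for t, p in zip(thetas, pis)) for x in xs]
    if beta == 0:
        return -sum(math.log(v) for v in r) / len(r)
    return math.log(sum(v ** (-beta) for v in r) / len(r)) / beta

def em_step(xs, thetas, pis, beta, gamma):
    joint = [[p * component(x, t, gamma) for t, p in zip(thetas, pis)] for x in xs]
    r = [sum(row) for row in joint]
    # sample weights, assumed proportional to r_i ** (-beta)
    u = [v ** (-beta) for v in r]
    total = sum(u)
    w = [v / total for v in u]
    # weighted responsibilities w_i * pi_l * p(x_i | theta_l) / r_i
    resp = [[wi * j / ri for j in row] for wi, row, ri in zip(w, joint, r)]
    cols = list(zip(*resp))
    new_pis = [sum(col) for col in cols]
    # weighted mean of the data for each component
    new_thetas = [sum(c * x for c, x in zip(col, xs)) / sum(col) for col in cols]
    return new_thetas, new_pis

def run_em(xs, thetas, pis, beta, gamma, iters=30):
    # record the objective before each update to inspect monotonicity
    trace = []
    for _ in range(iters):
        trace.append(objective(xs, thetas, pis, beta, gamma))
        thetas, pis = em_step(xs, thetas, pis, beta, gamma)
    return thetas, pis, trace
```

With \(\beta =0\) the weights are uniform and the step is the ordinary EM update for a Gaussian mixture with known variance.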

^{6}To directly minimize \(F_{\beta }(q)\) with respect to \(\{{\varvec{\theta }}, {\varvec{\pi }}\}\), we switch to the following updating rules:

### 3.3 Pre-imaging for generation of support points

We discuss the relationship between the proposed algorithm and kernel-based learning algorithms. In this subsection, we focus on the case in which the component \(p(x|\theta )\) is a location family and is represented as \(p(x|\theta )\propto f(x-\theta )\) for some function \(f\), such as the Gaussian density in Eq. (12).
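For a Gaussian location family, one way to search for a new support point is a fixed-point iteration of the kind used for pre-image problems in kernel methods. The sketch below is an assumption about how this could be done, not the procedure of this subsection: it maximizes \(g(\theta )=\sum _{i}a_{i}\exp (-\gamma (x_{i}-\theta )^{2})\) for given nonnegative coefficients \(a_{i}\) (for instance \(a_{i}=w_{i}/r_{i}\)) in one dimension. The function name `preimage` and its arguments are hypothetical.

```python
import math

def preimage(xs, a, gamma, theta0, iters=50):
    """Fixed-point search for a local maximizer of
    g(theta) = sum_i a_i * exp(-gamma * (x_i - theta) ** 2), with a_i >= 0."""
    theta = theta0
    trace = []
    for _ in range(iters):
        k = [ai * math.exp(-gamma * (x - theta) ** 2) for ai, x in zip(a, xs)]
        trace.append(sum(k))  # value of g at the current theta
        # stationarity condition: theta is the k-weighted mean of the data
        theta = sum(ki * x for ki, x in zip(k, xs)) / sum(k)
    return theta, trace
```

Each iterate is a convex combination of the sample points, and for the Gaussian kernel the value of \(g\) does not decrease from one iterate to the next. The result is a local maximizer, so several starting points may be needed in practice.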

## 4 Numerical experiments

We applied Algorithm 1 to the synthetic data drawn from \(p^{*}(x)\) in Eq. (13) and estimated \(q(\theta )\). We also applied the original version of the algorithm in Nowozin and Bakir (2008), where each \(\theta _{l}\) is not updated in Step 4 of Algorithm 1 but fixed once generated in Step 3, and only \(\{\pi _{l}\}\) are updated. Results for this case will be indicated as “means fixed.”

This posterior probability and the number of components will be used in connection with the rate-distortion function in Sect. 5.2. All results were averaged over \(100\) trials for different data sets generated by Eq. (13).

### 4.1 Prediction with known kernel width

First, we assumed that the kernel width \(\gamma \) in Eq. (12) was known and \(p(x|\theta )\) was set to \(p(x|\theta ) = \frac{1}{2\pi } \exp \left( -\frac{||x-\theta ||^{2}}{2}\right) \). The distribution \(p^{*}(x)\) is realized in this case by the mixing distribution \(q(\theta )=\frac{1}{2}\delta (\theta - \theta _{1}^{*}) + \frac{1}{2} \delta (\theta - \theta _{2}^{*})\). An example of the estimated mixture model for \(\beta = -0.2\) and \(\gamma =0.5\) is demonstrated in Fig. 2a.

We see that the average training error is minimized at \(\beta = 0\) as expected, while the minimum of average prediction error is attained around \(\beta = -0.2\). Figure 2d shows the average of the maximum errors of Eq. (16). As expected, it monotonically decreases with respect to \(\beta \), which is consistent with the fact that the estimation approaches the KVQ as \(\beta \rightarrow \infty \).

The number of components \(\hat{k}\) as well as the number of hard clusters increases as \(\beta \) becomes larger. The discussion in Sect. 2 suggests that as \(\beta \) grows, more components are estimated to increase the entropy of the mixture \(r(x;q)\). This regularization reduces the average prediction error when \(\beta \) takes a slightly negative value, as we have just observed in Fig. 2c. In Sect. 5.1, we will discuss the effective range of \(\beta \) which reduces the generalization error.

### 4.2 Mismatched kernel width

Next we assumed \(\gamma \ne 0.5\), that is, the variance of a component has a mismatch.

When \(\gamma <0.5\), the true distribution in Eq. (13) cannot be realized with the model in Eq. (1). This will induce a larger objective \(F_{\beta }\) and a larger training error by the order of \(O(1)\). The top panels of Fig. 4a–c show examples of estimated mixtures and the average prediction error as a function of \(\beta \) for \(\gamma =0.05\), \(\gamma =0.2\) and \(\gamma =0.4\), respectively. We see that the prediction error is much larger for \(\gamma =0.05\) and \(\gamma =0.2\) than for \(\gamma =0.4\).

The results in Fig. 4c (\(\gamma =0.4\)) and Fig. 4d (\(\gamma =0.6\)) are similar to those presented in Sect. 4.1 because \(\gamma \) is close to the true value, \(0.5\). The prediction error increases when the mismatch of \(\gamma \) is large. The above results imply that it is possible to use cross-validation to select \(\beta \) and the kernel width \(\gamma \), although a practical procedure needs to be explored further. The next section is devoted to discussing the selection of these parameters.

## 5 Selection of parameters

This section first discusses the optimal parameter \(\beta \) that minimizes the average generalization error. The relationship between the mixture estimation and the rate-distortion problem is then described.

### 5.1 Effective range of \(\beta \)

^{7}As can be seen from the results for \(\beta =0\) in both figures, the assumption in Eq. (19) is reasonable. We also see that the generalization error is minimized by a \(\beta \) with a smaller absolute value as \(n\) increases. This tendency becomes more apparent for \(\gamma =2.0\), that is, when the variance is underestimated.

### 5.2 Connection to the rate-distortion problem

The convex clustering in Lashkari and Golland (2007) corresponds to a special case of our proposal, that is, \(\beta =0\) (MLE), and the support points of \(q(\theta )\) are fixed to the training data set \(\{x_{1},\ldots , x_{n}\}\). For this restricted version of the problem, it is pointed out that the kernel width, \(\gamma \) in Eq. (12), has a relationship to the rate-distortion (RD) function of source \(\hat{p}(x)\) (the empirical distribution) and distortion associated with \(p(x|\theta )\) (for example, the squared distortion for Gaussian) (Lashkari and Golland 2007). In this section, we investigate the relationship of our proposal with the RD theory in a general case where the support points of \(q(\theta )\) are not restricted to the sample points and \(\beta \ne 0\). The relationship of mixture modeling with the RD theory is also partly discussed in Banerjee et al. (2005) for the finite mixture of exponential family distributions under the constraint that the cardinality of the support of \(q(\theta )\) is fixed.

*i.e.*, the mutual information \(I(X;\Theta )\) under the constraint of the average distortion measure. This is formulated by the Lagrange multiplier method as follows and is reformulated as the optimization problem in Eq. (21) (Berger 1971):

Figure 6b overwrites Fig. 6a (without the Shannon lower bound) with the RD curve for the empirical distribution given by a data set (\(n=50\)) generated by \(p^{*}(x)\). We used Algorithm 1 with \(\beta =0\) for estimating \(q(\theta )\) for each \(s\) and interpolated linearly to draw the RD curve. Figure 6c, d show RD curves for \(\beta =-0.2\) and \(\beta =0.5\), respectively, drawn by using Algorithm 1 for each \(s\). Note here that, in the above interpretation of the proposed optimization as an RD problem, the source depends on \(w_{i}\), which depends on \(q(\theta )\) as in Eq. (9) whereas in the original RD problem, the source does not depend on the reconstruction distribution. Hence the above pair of rate and distortion does not necessarily inherit properties of the usual RD function such as convexity, except for \(\beta =0\). In fact, the RD curve for \(\beta =0.5\) loses convexity as in Fig. 6d.

Figures 6b–d also show the average and twice the standard deviation of the linearly interpolated RD curves for \(100\) empirical data sets. We can see that when compared to the MLE (\(\beta =0\)), the RD curves for \(\beta =-0.2\) have small variation around the point \((D, R)=(2, 1)\), and those for \(\beta =0.5\) are, on average, close to the RD curve for the true Gaussian mixture in the small distortion region, that is, for a small variance of the Gaussian component. These results are consistent with the observation in Sects. 4.1 and 4.2 that \(\beta \ne 0\) can reduce the generalization error of the MLE. For \(\beta >0\), the learning algorithm developed in Sect. 3 can be considered as an algorithm for computing Rényi’s analog of the rate-distortion function, which previously appeared in Arikan and Merhav (1998) in the context of guessing.

## 6 Further topics

1. MLE: \(-\sum _{i=1}^{n}\log r_{i}\)
2. KVQ: \(-\rho \)
3. Margin-minus-variance: \(-\rho + \frac{C}{n}\sum _{i=1}^{n}\left( r_{i}-\rho \right) ^{2}\)
4. Mean-minus-variance: \(-\frac{1}{n}\sum _{i=1}^{n} r_{i} +\frac{C}{n}\sum _{i=1}^{n} \left( r_{i}-\frac{1}{n}\sum _{j=1}^{n}r_{j}\right) ^{2}\)

The beta divergence in Murata et al. (2004) and Eguchi and Kato (2010) is a generalization of the Kullback-Leibler divergence, which consists of a cross entropy term as above \(d_{\gamma }\), and is identical with the power divergence in Basu et al. (1998).

The expression (8) of \(F_{\beta }\) can be viewed as a weighted version of the log-likelihood function. When \(\beta < 0\), Eq. (9) provides a downweighting for outlying observations. This downweighting is equivalent to what is referred to in Basu et al. (1998) as a relative-to-the-model downweighting. This implies that the robustness of the estimation, the main feature of minimization of these divergences, carries over to \(F_{\beta }\) minimization for \(\beta < 0\). We observed that this can alleviate overfitting in Sect. 4.1 where the generalization error is minimized with a slightly negative value of \(\beta \). It is an interesting direction to explore a class of robustness-inducing objective functionals of \(q(\theta )\).
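The downweighting effect can be illustrated with a few numbers. The sketch below assumes that the weights of Eq. (9) are proportional to \(r_{i}^{-\beta }\) and normalized to sum to one. The function name `sample_weights` and the density values in `r` are hypothetical, with the last entry playing the role of an outlying observation that the current mixture fits poorly.

```python
def sample_weights(r, beta):
    # assumed weights: w_i proportional to r_i ** (-beta), normalized to sum to 1
    u = [v ** (-beta) for v in r]
    total = sum(u)
    return [v / total for v in u]

# hypothetical density values; the last point is poorly fitted (an outlier)
r = [0.30, 0.25, 0.20, 0.001]
w_neg = sample_weights(r, -0.5)   # beta < 0: the outlier receives less than uniform weight
w_zero = sample_weights(r, 0.0)   # beta = 0: uniform weights, as in the MLE
w_pos = sample_weights(r, 0.5)    # beta > 0: the outlier receives more than uniform weight
```

Under this assumption, negative \(\beta \) shifts weight toward well-fitted points, while positive \(\beta \) concentrates it on the worst-fitted ones, in line with the KVQ limit.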

## 7 Conclusion

In this article, a nonparametric estimation method of mixing distributions is discussed. We have proposed an objective functional for the learning of mixing distributions of mixture models which unifies the MLE and the KVQ with the parameter \(\beta \). By extending Lindsay’s result, we proved that the optimal mixing distribution is a discrete distribution with distinct support points no greater in number than the sample size, and we provided a simple algorithm to calculate it. It has been demonstrated through numerical experiments and analyzed theoretically that the estimated distribution is less prone to overfitting for some range of \(\beta \). We have further discussed the nature of the objective functional in relation to the RD theory. Finally, we have pointed out several open problems. We believe these results open a new direction for further research.

## Footnotes

1. The original KVQ assumes the kernel function which is connected to the probability model as \(p(x|\theta )\propto K(x,\theta )\), and the possible support points of \(q(\theta )\) are fixed to the sample set \(\{x_{1},\ldots ,x_{n}\}\). That is, \(\hat{q}(\theta )=\sum _{i=1}^{n}q_{i}\delta (\theta - x_{i})\), \(q_{i}\ge 0\), \(\sum _{i=1}^{n}q_{i}=1\), where \(\delta (\cdot )\) is Dirac’s delta function.
2. The entropic risk measure was originally defined only for \(\beta >0\).
3. This is proved simply from the fact that \(F_{\beta , {\varvec{r}}^{*}}^{\prime }({\varvec{r}}^{*}) = \sum _{l=1}^{k}\pi _{l}F_{\beta , {\varvec{r}}^{*}}^{\prime }({\varvec{p}}_{\theta _{l}}) = 0\).
4. \(F_{\beta }({\varvec{\theta }},{\varvec{\pi }},{\varvec{w}})\) is related to the fact that the conjugate function of the log-sum-exp function is the entropy function, \(-\sum _{i=1}^{n}w_{i}\log w_{i}\) (Boyd and Vandenberghe 2004).
5. The algorithm was defined for \(-1\le \beta <0\), but it can be easily checked that the algorithm works for \(\beta =0\).
6. For \(\beta >0\), it holds that
   $$\begin{aligned} \max _{{\varvec{w}}\in \Delta } F_{\beta }({\varvec{\theta }},{\varvec{\pi }},{\varvec{w}}) = F_{\beta }({\varvec{r}}), \end{aligned}$$
   and the maximum is attained by \({\varvec{w}}\) satisfying Eq. (9).
7. We did not show but a similar trend was observed for the case where \(\{\theta _{l}\}\) are fixed (see Sect. 4).

## Notes

### Acknowledgments

The authors are grateful for helpful comments and suggestions by anonymous reviewers. This work was supported by JSPS KAKENHI Grant Numbers 23700175, 24560490, 25120008, and 25120014.

### References

- Amari, S., Fujita, N., & Shinomoto, S. (1992). Four types of learning curves. *Neural Computation*, *4*(4), 605–618.
- Arikan, E., & Merhav, N. (1998). Guessing subject to distortion. *IEEE Transactions on Information Theory*, *44*(3), 1041–1056.
- Banerjee, A., Merugu, S., Dhillon, I. S., & Ghosh, J. (2005). Clustering with Bregman divergences. *Journal of Machine Learning Research*, *6*, 1705–1749.
- Barber, D. (2012). *Bayesian reasoning and machine learning*. Cambridge: Cambridge University Press.
- Barron, A. R., Roos, T., & Watanabe, K. (2014). Bayesian properties of normalized maximum likelihood and its fast computation. In *Proceedings of the 2014 IEEE International Symposium on Information Theory* (pp. 1667–1671).
- Basu, A., Harris, I. R., Hjort, N. L., & Jones, M. C. (1998). Robust and efficient estimation by minimising a density power divergence. *Biometrika*, *85*(3), 549–559.
- Berger, T. (1971). *Rate distortion theory: A mathematical basis for data compression*. Englewood Cliffs, NJ: Prentice-Hall.
- Boyd, S., & Vandenberghe, L. (2004). *Convex optimization*. Cambridge: Cambridge University Press.
- Dempster, A., Laird, N., & Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. *Journal of the Royal Statistical Society*, *39–B*, 1–38.
- Eguchi, S., & Kato, S. (2010). Entropy and divergence associated with power function and the statistical application. *Entropy*, *12*, 262–274.
- Eguchi, S., Komori, O., & Kato, S. (2011). Projective power entropy and maximum Tsallis entropy distributions. *Entropy*, *13*, 1746–1764.
- Fujisawa, H., & Eguchi, S. (2008). Robust parameter estimation with a small bias against heavy contamination. *Journal of Multivariate Analysis*, *99*(9), 2053–2081.
- Hartigan, J. A. (1985). A failure of likelihood asymptotics for normal mixtures. In *Proceedings of the Berkeley Conference in Honor of J. Neyman and J. Kiefer* (Vol. 2, pp. 807–810).
- Lashkari, D., & Golland, P. (2007). Convex clustering with exemplar-based models. In *Advances in neural information processing systems 19*.
- Lindsay, B. G. (1983). The geometry of mixture likelihoods: A general theory. *The Annals of Statistics*, *11*(1), 86–94.
- Lindsay, B. G. (1995). *Mixture models: Theory geometry and applications*. Hayward, CA: Institute of Mathematical Statistics.
- Murata, N., Takenouchi, T., Kanamori, T., & Eguchi, S. (2004). Information geometry of U-boost and Bregman divergence. *Neural Computation*, *16*(7), 1437–1481.
- Nowozin, S., & Bakir, G. (2008). A decoupled approach to exemplar-based unsupervised learning. In *Proceedings of the 24th International Conference on Machine Learning* (ICML).
- Renyi, A. (1961). On measures of entropy and information. In *Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability* (Vol. 1, pp. 547–561). University of California Press, Berkeley.
- Rose, K. (1994). A mapping approach to rate-distortion computation and analysis. *IEEE Transactions on Information Theory*, *40*(6), 1939–1952.
- Rudloff, B., Sass, J., & Wunderlich, R. (2008). Entropic risk constraints for utility maximization. In C. Tammer & F. Heyde (Eds.), *Festschrift in celebration of Prof. Dr. Wilfried Grecksch’s 60th Birthday* (pp. 149–180). Aachen: Shaker.
- Schölkopf, B., Mika, S., Burges, C. J. C., Knirsch, P., Müller, K. R., Ratsch, G., et al. (1999). Input space versus feature space in kernel-based methods. *IEEE Transactions on Neural Networks*, *10*, 1000–1017.
- Tipping, M., & Schölkopf, B. (2001). A kernel approach for vector quantization with guaranteed distortion bounds. In *Proceedings of the International Conference on Artificial Intelligence and Statistics* (AISTATS).
- Tsallis, C. (2009). *Introduction to nonextensive statistical mechanics*. New York: Springer.
- Watanabe, S. (2005). Algebraic geometry of singular learning machines and symmetry of generalization and training errors. *Neurocomputing*, *67*, 198–213.

## Copyright information

**Open Access** This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.