
MLiT: mixtures of Gaussians under linear transformations


Abstract

The curse of dimensionality hinders the effectiveness of density estimation in high-dimensional spaces. Many techniques have been proposed in the past to discover embedded, locally linear manifolds of lower dimensionality, including the mixture of principal component analyzers, the mixture of probabilistic principal component analyzers and the mixture of factor analyzers. In this paper, we propose a novel mixture model for reducing dimensionality based on a linear transformation which is neither restricted to be orthogonal nor aligned along the principal directions. For experimental validation, we have used the proposed model for classification of five “hard” data sets and compared its accuracy with that of other popular classifiers. The proposed method outperformed the mixture of probabilistic principal component analyzers on four of the five data sets, with improvements in accuracy ranging from 0.5 to 3.2%. Moreover, on all data sets, it outperformed the Gaussian mixture model, with improvements ranging from 0.2 to 3.4%.


References

  1. Bellman R (ed) (1961) Adaptive control processes—a guided tour. Princeton University Press, Princeton, New Jersey

  2. Tipping ME, Bishop CM (1999) Probabilistic principal component analysis. J R Stat Soc Ser B (Stat Methodol) 61(3):611–622

  3. Roweis S (1997) EM algorithms for PCA and SPCA. In: Advances in neural information processing systems, vol 10. The MIT Press, Colorado, pp 626–632

  4. Bartholomew DJ (ed) (1987) Latent variable models and factor analysis. Charles Griffin, London

  5. Basilevsky A (ed) (1994) Statistical factor analysis and related methods. Wiley, New York

  6. Schölkopf B, Smola A, Müller K-R (1998) Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput 10:1299–1319

  7. Chin T-J, Suter D (2007) Incremental kernel principal component analysis. IEEE Trans Image Process 16(6):1662–1674

  8. Hinton GE, Dayan P, Revow M (1997) Modeling the manifolds of images of handwritten digits. IEEE Trans Neural Netw 8(1):65–74

  9. Tipping ME, Bishop CM (1999) Mixtures of probabilistic principal component analyzers. Neural Comput 11(2):443–482

  10. Ghahramani Z, Hinton GE (1997) The EM algorithm for mixtures of factor analyzers. Technical Report CRG-TR-96-1, University of Toronto (the original paper for the mixture of factor analyzers)

  11. Ridder DD, Franc V (2003) Robust subspace mixture models using t-distribution. In: 14th British Machine Vision Conference (BMVC), London, UK, pp 319–328

  12. Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc 39(1):1–38

  13. Bishop CM (2006) Pattern recognition and machine learning. Springer, New York

  14. Kittler JV (1998) Combining classifiers: a theoretical framework. Pattern Anal Appl 1(1):18–27

  15. Neal R, Hinton G (1999) A view of the EM algorithm that justifies incremental, sparse, and other variants. In: Jordan MI (ed) Learning in graphical models. MIT Press, Cambridge, MA, pp 355–368

  16. Figueiredo MAF, Jain AK (2002) Unsupervised learning of finite mixture models. IEEE Trans Pattern Anal Mach Intell 24(3):381–396 (avoid singularity by applying deterministic annealing)

  17. Bolton RJ, Krzanowski WJ (1999) A characterization of principal components for projection pursuit. Am Stat 53(2):108–109

  18. Asuncion A, Newman DJ (2007) UCI machine learning repository

  19. Jain AK, Duin RPW, Jianchang M (2000) Statistical pattern recognition: a review. IEEE Trans Pattern Anal Mach Intell 22(1):4–37

  20. Breiman L, Spector P (1992) Submodel selection and evaluation in regression: the x-random case. Int Stat Rev 60(3):291–319

  21. Bilmes JA (1998) A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models. Technical Report TR-97-021, International Computer Science Institute, Berkeley, CA

  22. Schoenberg R (1997) Constrained maximum likelihood. Comput Econ 10:251–266

  23. Golub GH, van Loan CF (1996) Matrix computations, 3rd edn. Johns Hopkins University Press, Baltimore

Acknowledgments

The authors wish to thank the Australian Research Council and iOmniscient Pty Ltd that have partially supported this work under the Linkage Project funding scheme, grant LP0668325.

Author information

Correspondence to Ahmed Fawzi Otoom.

Appendices

Appendix 1

1.1 The expected complete-data log likelihood for MLiT

In this appendix, we show the derivation of the two main terms in (9),

$$ Q(\theta,\theta^g)=\sum_{Y} \left[\ln(p(Y,Z|\theta))\,p(Z|Y,\theta^g)\right] $$
(25)

namely, the complete-data log-likelihood, \(\ln p(Y,Z|\theta)\), and the posterior for the latent variables, \(p(Z|Y,\theta^g)\). While we report this derivation here for completeness, it can easily be derived from the equivalent derivation for conventional Gaussian mixtures.

We begin with the derivation of an expression for \(\ln p(Y,Z|\theta)\). We first apply the product rule

$$ \ln p(Y,Z|\theta)=\ln\left(p(Y|Z,\theta)\,p(Z|\theta)\right). $$
(26)

Given the following assumptions: (a) the mutual independence of the observations \(y_i\); (b) the dependence of each \(y_i\) only on its own latent variable \(z_i\); and (c) the mutual independence of the \(z_i\), we have

$$ \begin{aligned} \ln(p(Y|Z,\theta)p(Z|\theta))&=\ln\left(\prod_{i=1}^N (p(y_i|Z,\theta))p(Z|\theta)\right)\\ &=\ln\left(\prod_{i=1}^N (p(y_i|z_i=l,\theta))p(Z|\theta)\right)\\ &=\ln\left(\prod_{i=1}^N (p(y_i|z_i=l,\theta))\prod_{i=1}^N p(z_i=l|\theta)\right)\\ &=\ln\left(\prod_{i=1}^N[(p(y_i|z_i=l,\theta))p(z_i=l|\theta)]\right)\\ &=\sum_{i=1}^N\ln\left[(p(y_i|z_i=l,\theta))p(z_i=l|\theta)\right] \end{aligned} $$
(27)

The argument of the logarithm can be written as

$$ p(y_i|z_i=l,\theta)\,p(z_i=l|\theta)={{\mathcal{N}}}\left(\Omega_{l} y_i|\mu_{l},\Sigma_{l}\right)\alpha_{l}. $$
(28)

Therefore, we have the final equivalence

$$ \ln(p(Y,Z|\theta))=\sum_{i=1}^N\ln\left[\alpha_{l}{{\mathcal{N}}}(\Omega_l y_i|\mu_{l},\Sigma_{l})\right]. $$
(29)
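For concreteness, the summand of (29) can be evaluated directly once \(\Omega_l\), \(\mu_l\), \(\Sigma_l\) and \(\alpha_l\) are given. The following minimal NumPy/SciPy sketch is our own illustration (the function name and array shapes are not from the paper), assuming Y holds the N observations row-wise and that all of them are assigned to component l:

```python
import numpy as np
from scipy.stats import multivariate_normal

def complete_data_loglik_terms(Y, Omega, mu, Sigma, alpha):
    """Evaluate the summand of (29), ln[alpha_l N(Omega_l y_i | mu_l, Sigma_l)],
    for every observation y_i, under the assumption that every z_i = l.

    Y     : (N, P) observations in the original space
    Omega : (D, P) linear transformation of component l
    mu    : (D,)   mean of component l in the transformed space
    Sigma : (D, D) covariance of component l
    alpha : scalar mixing weight of component l
    """
    X = Y @ Omega.T                                       # projected observations, shape (N, D)
    log_dens = multivariate_normal.logpdf(X, mean=mu, cov=Sigma)
    return np.log(alpha) + log_dens                       # shape (N,); summing over i gives (29) when every z_i = l
```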

The next term needed is the posterior \(p(Z|Y,\theta^g)\). Under the assumptions above, the probability of an entire assignment, Z, conditioned on the observations is

$$ p(Z|Y,\theta^g)=\prod_{i=1}^N p(z_i=l|y_i,\theta^g). $$
(30)

Let us then apply Bayes' theorem to \(p(z_i=l|y_i,\theta^g)\)

$$ p(z_i=l|y_i,\theta^g)={\frac{p(z_i=l|\theta^g)p(y_i|z_i=l,\theta^g)} {p(y_i|\theta^g)}} $$
(31)

The right-hand side of the above contains three terms. The numerator contains the terms that we have already computed for (28). The denominator is the marginal probability of \(y_i\) over all the components. Therefore, the above equation becomes

$$ p(z_i=l|y_i,\theta^g)={\frac{\alpha_{l}^g {{\mathcal{N}}}(\Omega_l y_i|\mu_{l}^g,\Sigma_{l}^g)} {\sum_{k=1}^M\alpha_k^g{{\mathcal{N}}}(\Omega_k y_i|\mu_{k}^g,\Sigma_{k}^g)}} $$
(32)

justifying our formula for the responsibilities. We can now substitute both terms (29) and (30) into \(Q(\theta,\theta^g)\) to obtain

$$ Q(\theta,\theta^g)=\sum_{Y}\left[\left(\sum_{i=1}^N\ln[\alpha_{l} {\mathcal{N}}(\Omega_l y_i|\mu_{l},\Sigma_{l})]\right)\left(\prod_{i=1}^N p(z_i=l|y_i,\theta^g)\right)\right] $$
(33)

The remaining simple steps leading to (10) can be repeated from [21].
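As an illustration of the E-step implied by (32), the sketch below computes the responsibilities of all components in the log domain for numerical stability; the function name, the list-based parameter layout and the use of SciPy are our own assumptions, not the paper's implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def responsibilities(Y, Omegas, mus, Sigmas, alphas):
    """E-step as in (32): posterior p(z_i = l | y_i, theta^g) for every
    observation (rows) and component (columns).

    Y      : (N, P) observations
    Omegas : list of M (D, P) transformations, one per component
    mus    : list of M (D,) means
    Sigmas : list of M (D, D) covariances
    alphas : (M,) mixing weights
    """
    M = len(alphas)
    log_joint = np.column_stack([
        np.log(alphas[l])
        + multivariate_normal.logpdf(Y @ Omegas[l].T, mean=mus[l], cov=Sigmas[l])
        for l in range(M)
    ])                                                     # (N, M): log of alpha_l N(Omega_l y_i | ...)
    return np.exp(log_joint - logsumexp(log_joint, axis=1, keepdims=True))
```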

Appendix 2

2.1 A constrained expectation–maximization algorithm for MLiT

In this appendix, we present the regularized solution based on this constrained optimization [MLiT (C)]. Again, we update one column vector, \(w_j\), at a time while keeping the others fixed. As the constraint \(g(\Omega_l)=0\) on \(\Omega_l\), we choose an entrywise L1-norm constraint, \(g(\Omega_l)=\|\Omega_l\|_{L1}-s=0\), where the constant s is the chosen value for the norm, or scale:

$$ g(\Omega_l)=\sum_{i=1}^D\sum_{k=1}^P\left|w_{ik}\right|-s=0 $$
(34)

where \(\dim(\Omega_l)=D\times P\). However, as we update only one column \(w_j\) at a time, we need to impose this constraint in a column-wise manner. Therefore, we turn (34) into the stronger constraint \(g(w_j)\)

$$ g(w_j)=\sum_{i=1}^D \left|w_{ij}\right|-c=0, \quad c=s/P $$
(35)
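To make the constraints concrete, the following minimal NumPy sketch checks (34) and its column-wise counterpart (35) for a candidate transformation; the function name and the tolerance are our own illustrative choices, not part of the paper.

```python
import numpy as np

def satisfies_l1_constraints(Omega, s, tol=1e-8):
    """Check the entrywise L1-norm constraint (34) on Omega (shape D x P)
    and the stronger column-wise constraint (35) with c = s / P."""
    D, P = Omega.shape
    c = s / P
    whole = abs(np.abs(Omega).sum() - s) <= tol                          # (34): whole-matrix norm equals s
    per_column = np.all(np.abs(np.abs(Omega).sum(axis=0) - c) <= tol)    # (35): every column norm equals c
    return whole, per_column
```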

A solution for (17) under constraint (35) may be obtained by using a constrained optimization solver such as CML [22]. However, we prefer to provide an inline solution for reasons of speed and independence. Constraint (35) can be represented in an expanded notation as

$$ g(w_j)=\pm w_{1j}\,{\pm} w_{2j}{\pm}\cdots {\pm} w_{Dj}-c=0 $$
(36)

under condition

$$ \left|w_{1j}\right|,\left|w_{2j}\right|,\ldots,\left|w_{Dj}\right|\leq c. $$
(37)

Equation (36) shows the \(2^D\) different combinations of signs that the constraint can take, leading to \(2^D\) different linear constraints that do not use absolute values. Although the number of such constraints is significant, it is tolerable for typical values of D. In contrast, alternative approaches for computing equality-constrained solutions impose constraints on the rows of \(\Omega\) [23]. The number of constraints so required is in the order of \(2^P\), exponential in the high dimension, P, and therefore unmanageable. This is the ultimate justification for our choice of a column-wise solution for the maximization of \(Q(\theta,\theta^g)\) in \(\Omega_l\). Our constrained approach first solves (17) under each of the linear constraints in (36). If solutions are found, they are then tested post hoc for satisfaction of (37).
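A minimal sketch of this enumeration, under our own interface assumptions, is the following: the caller supplies a routine that returns the candidate \(w_j\) for one sign pattern (for instance, a routine like the closed-form Lagrangian update sketched at the end of this appendix), and only candidates passing the post hoc check (37) are kept. The function names and signatures are hypothetical.

```python
import itertools
import numpy as np

def constrained_column_candidates(solve_under_signs, D, c):
    """Enumerate the 2**D sign patterns of (36), solve (17) under each resulting
    linear equality constraint, and keep only solutions that also satisfy (37).

    solve_under_signs : callable mapping a (+1/-1) sign vector of length D to a
                        candidate column w_j (or None if no solution is found);
                        this callable is an assumption of ours, not the paper's API.
    c                 : column-wise norm budget, c = s / P
    """
    feasible = []
    for signs in itertools.product((1.0, -1.0), repeat=D):
        w = solve_under_signs(np.array(signs))            # candidate column under one linear constraint
        if w is not None and np.all(np.abs(w) <= c):      # post hoc check of (37)
            feasible.append(w)
    return feasible                                        # EM can continue only if this list is non-empty
```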

As an example, let us consider one of the constraints in (36)

$$ g(w_j)=w_{1j}+w_{2j}+\cdots +w_{Dj}-c=0 $$
(38)

Then, consider the Lagrangian equation

$$ \begin{aligned} h(w_j)&=f(w_j)+\lambda g(w_j) \\ &=f(w_j)+\lambda\left(\sum_{i=1}^D w_{ij}-c\right) \end{aligned} $$
(39)

where \(f(w_j)\) is \(Q(\theta,\theta^g)\) in (10) [or, likewise, (16)], λ is the Lagrange multiplier, and \(g(w_j)\) is the constraint. Equation (39) can be written as

$$ \begin{aligned} h(w_j)&=\sum_{i=1}^N \left[-{\frac{1}{2}}\ln|\Sigma_l| -{\frac{1}{2}}\left(w_{1}y_{i1}+\cdots+w_{j}y_{ij}+\cdots+w_{P}y_{iP}-\mu_l\right)^T\right.\\ &\quad\left.\times\,\Sigma_l^{-1}\left(w_{1}y_{i1}+\cdots+w_{j}y_{ij}+\cdots+w_{P}y_{iP}-\mu_l\right)\right] p(l|y_{i},\theta^g)\\ &\quad+\lambda\left(\sum_{i=1}^D w_{ij}-c\right) \end{aligned} $$
(40)

where the external sum is omitted since all of its terms but one vanish after the differentiation.

The derivative of (40) in w j is

$$ \begin{aligned} {\frac{\partial h(w_j)}{\partial w_{j}}}&= \sum_{i=1}^{N}\left((\Sigma_{l}^{g})^{-1}\left(w_1^g y_{i1} +\cdots+ w_{j}y_{ij}+\cdots+w_{P}^{g}y_{iP}-\mu_{l}^{g}\right)y_{ij}\, p(l|y_i,\theta^{g})\right)+\lambda\overline{1}\\ &=0 \end{aligned} $$
(41)

where \(\overline{1}\) stands for a D × 1 vector of all ones. By pre-multiplying (41) by \(\Sigma_l^g\), we obtain:

$$ \begin{aligned} {\frac{\partial h(w_j)}{\partial w_{j}}}&= \sum_{i=1}^N \left(w_{1}^{g} y_{i1}+\cdots+ w_{j}y_{ij}+\cdots+w_{P}^{g} y_{iP}-\mu_{l}^{g}\right) y_{ij}\, p(l | y_{i},\theta^{g})+ \lambda \Sigma_{l}^{g}\overline{1} \\ &=0 \end{aligned} $$
(42)

By setting \(r=\sum_{i=1}^N y_{ij}^2\,p(l|y_i,\theta^g)\) and collectively naming b all the terms in \(\left[w_{k}^g\right]_{k=1,\ldots,P,\;k\neq j}\), we obtain

$$ \begin{aligned} {\frac{\partial(h(w_j))}{\partial w_{j}}}&= rw_j+b+\lambda\Sigma_l^g\overline{1}\\ &=0 \end{aligned} $$
(43)

The solution for w j can then be written as

$$ w_j={\frac{-b-\lambda\Sigma_l^g\overline{1}}{r}} $$
(44)

Let us now call \(s_{nm}\) the D × D elements of \(\Sigma_l^g\) in row–column notation and make (44) explicit in its D rows

$$ \left\{\begin{array}{l} w_{1j}={\frac{-b_{1}-\lambda(s_{11}+\cdots+s_{1D})}{r}} \\ \vdots \\ w_{Dj}={\frac{-b_{D}-\lambda(s_{D1}+\cdots+s_{DD})}{r}} \end{array}\right. $$
(45)
We are eventually in a position to impose the \(g(w_j)\) constraint by adding up the left and right members of all D rows of (45). On the left-hand side, we obtain c, a known value; on the right-hand side, a linear function of λ. Therefore, λ can be solved for immediately as

$$ \lambda={\frac{-(cr+b_{1}+\cdots+b_{D})}{\sum_{n=1}^{D}\sum_{m=1}^{D} s_{nm}}} $$
(46)

With λ thus computed, its value is replaced in (44) to obtain the desired constrained solution for \(w_j\). For each of the remaining constraints, λ is calculated in the same way and the corresponding constrained solution of \(w_j\) is obtained. We note that, for the EM algorithm to continue iterating, at least one solution for \(w_j\) satisfying (36) and (37) is needed. We also note that the specific choice of s is not critical: changing the scale would lead to correspondingly scaled densities for all components and classes, and to equivalent classification outcomes; possible differences can be imputed to numerical resolution. In the experiments, we tried different values of s on a logarithmic scale and chose the one yielding the highest classification accuracy.
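For illustration, the closed-form update for one sign pattern can be sketched as below; variable names and array shapes are our own assumptions. The all-ones sign vector reproduces the all-positive constraint (38) and formulas (44)-(46), while the other patterns of (36) are handled analogously by replacing the all-ones vector with the corresponding sign vector. A candidate column is then kept only if it also passes the post hoc check (37), as in the enumeration sketched above.

```python
import numpy as np

def constrained_wj_update(Y, W, j, mu, Sigma, resp, c, signs):
    """Closed-form constrained update of column w_j for one sign pattern of (36),
    following (41)-(46): build r and b, solve for lambda, then for w_j.

    Y     : (N, P) observations
    W     : (D, P) current transformation Omega_l^g (its columns are w_k^g)
    j     : index of the column being updated
    mu    : (D,)   current mean mu_l^g
    Sigma : (D, D) current covariance Sigma_l^g
    resp  : (N,)   responsibilities p(l | y_i, theta^g)
    c     : column-wise norm budget, c = s / P
    signs : (D,)   entries +1/-1 selecting one linear constraint from (36);
                   the all-ones vector corresponds to (38)
    """
    yj = Y[:, j]
    r = np.sum(resp * yj ** 2)                            # r in (43)
    # b: all terms involving the fixed columns w_k^g, k != j, and the mean mu_l^g
    partial = Y @ W.T - np.outer(yj, W[:, j])             # (N, D): sum over k != j of w_k^g y_ik
    b = ((partial - mu) * (resp * yj)[:, None]).sum(axis=0)
    v = Sigma @ signs                                     # generalizes Sigma_l^g times the all-ones vector
    lam = -(c * r + signs @ b) / (signs @ v)              # lambda, cf. (46) for the all-ones pattern
    return (-b - lam * v) / r                             # w_j, cf. (44)
```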


Cite this article

Otoom, A.F., Gunes, H., Perez Concha, O. et al. MLiT: mixtures of Gaussians under linear transformations. Pattern Anal Applic 14, 193–205 (2011). https://doi.org/10.1007/s10044-011-0205-2
