Abstract
The curse of dimensionality hinders density estimation in high-dimensional spaces. Many techniques have been proposed to discover embedded, locally linear manifolds of lower dimensionality, including the mixture of principal component analyzers, the mixture of probabilistic principal component analyzers and the mixture of factor analyzers. In this paper, we propose a novel mixture model for dimensionality reduction based on a linear transformation that is neither restricted to be orthogonal nor aligned along the principal directions. For experimental validation, we have used the proposed model for the classification of five “hard” data sets and compared its accuracy with that of other popular classifiers. The proposed method outperformed the mixture of probabilistic principal component analyzers on four of the five data sets, with accuracy improvements ranging from 0.5 to 3.2%. Moreover, on all data sets its accuracy exceeded that of the Gaussian mixture model, with improvements ranging from 0.2 to 3.4%.
References
Bellman R (1961) Adaptive control processes: a guided tour. Princeton University Press, Princeton, New Jersey
Tipping ME, Bishop CM (1999) Probabilistic principal component analysis. J R Stat Soc Ser B (Stat Methodol) 61(3):611–622
Roweis S (1997) EM algorithms for PCA and SPCA. In: Advances in neural information processing systems, vol 10. The MIT Press, Colorado, pp 626–632
Bartholomew DJ (1987) Latent variable models and factor analysis. Charles Griffin, London
Basilevsky A (1994) Statistical factor analysis and related methods. Wiley, New York
Schölkopf B, Smola A, Müller K-R (1998) Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput 10:1299–1319
Chin T-J, Suter D (2007) Incremental kernel principal component analysis. IEEE Trans Image Process 16(6):1662–1674
Hinton GE, Dayan P, Revow M (1997) Modeling the manifolds of images of handwritten digits. IEEE Trans Neural Netw 8(1):65–74
Tipping ME, Bishop CM (1999) Mixtures of probabilistic principal component analyzers. Neural Comput 11(2):443–482
Ghahramani Z, Hinton GE (1997) The EM algorithm for mixtures of factor analyzers. Technical Report CRG-TR-96-1, University of Toronto (the original paper for the mixture of factor analyzers)
Ridder DD, Franc V (2003) Robust subspace mixture models using t-distribution. In: 14th British Machine Vision Conference (BMVC), London, UK, pp 319–328
Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B 39(1):1–38
Bishop CM (2006) Pattern recognition and machine learning. Springer, New York
Kittler JV (1998) Combining classifiers: a theoretical framework. Pattern Anal Appl 1(1):18–27
Neal R, Hinton G (1999) A view of the EM algorithm that justifies incremental, sparse, and other variants. In: Jordan MI (ed) Learning in graphical models. MIT Press, Cambridge, MA, pp 355–368
Figueiredo MAF, Jain AK (2002) Unsupervised learning of finite mixture models. IEEE Trans Pattern Anal Mach Intell 24(3):381–396 (avoid singularity by applying deterministic annealing)
Bolton RJ, Krzanowski WJ (1999) A characterization of principal components for projection pursuit. Am Stat 53(2):108–109
Asuncion A, Newman DJ (2007) UCI machine learning repository
Jain AK, Duin RPW, Jianchang M (2000) Statistical pattern recognition: a review. IEEE Trans Pattern Anal Mach Intell 22(1):4–37
Breiman L, Spector P (1992) Submodel selection and evaluation in regression: the x-random case. Int Stat Rev 60(3):291–319
Bilmes JA (1998) A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models. Technical Report TR-97-021, International Computer Science Institute, Berkeley
Schoenberg R (1997) Constrained maximum likelihood. Comput Econ 10:251–266
Golub GH, van Loan CF (1996) Matrix computations. Johns Hopkins University Press, 3rd edn
Acknowledgments
The authors wish to thank the Australian Research Council and iOmniscient Pty Ltd, which have partially supported this work under the Linkage Project funding scheme, grant LP0668325.
Appendices
Appendix 1
1.1 The expected complete-data log likelihood for MLiT
In this appendix, we show the derivation of the two main terms in (9), namely the complete-data log-likelihood, \(\ln p(Y,Z|\theta)\), and the posterior for the latent variables, \(p(Z|Y,\theta^g)\). Although we report this derivation here for completeness, it can be readily derived from the equivalent derivation for conventional Gaussian mixtures.
We begin with the derivation of an expression for \(\ln p(Y,Z|\theta)\). We first apply Bayes' theorem
Given the following assumptions: (a) the mutual independence of the observations \(y_i\); (b) the dependence of each \(y_i\) only on its own latent variable \(z_i\); and (c) the mutual independence of the \(z_i\), we have
The argument of the logarithm can be written as
Therefore, we have the final equivalence
The next term needed is the posterior, \(p(Z|Y,\theta^g)\). Under the assumptions above, the probability of an entire assignment, Z, conditioned on the observations is
Let us then apply Bayes' theorem, again, to \(p(z_i = l\,|\,y_i,\theta^g)\)
On the right-hand side of the above, there are three terms: the numerator contains the terms that we have already computed for (28), while the denominator is the marginal probability of \(y_i\) over all the components. Therefore, the above equation becomes
justifying our formula for the responsibilities. We can now substitute both terms (29) and (30) into \(Q(\theta,\theta^g)\) to obtain
The following simple steps leading to (10) can be repeated from [21].
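For reference, in a conventional Gaussian mixture, which the MLiT derivation parallels as noted above, the two quantities just derived take the familiar forms (in our own notation, with K components and \(z_{il}\) the 1-of-K indicator of \(z_i\); in MLiT the Gaussian density is replaced by the corresponding transformed component density):
\[
\ln p(Y,Z\,|\,\theta)=\sum_{i=1}^{N}\sum_{l=1}^{K} z_{il}\,\bigl[\ln \pi_l+\ln \mathcal{N}(y_i\,|\,\mu_l,\Sigma_l)\bigr],
\qquad
p(z_i=l\,|\,y_i,\theta^g)=\frac{\pi_l^{g}\,\mathcal{N}(y_i\,|\,\mu_l^{g},\Sigma_l^{g})}{\sum_{m=1}^{K}\pi_m^{g}\,\mathcal{N}(y_i\,|\,\mu_m^{g},\Sigma_m^{g})}.
\]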
Appendix 2
2.1 A constrained expectation–maximization algorithm for MLiT
In this appendix, we present the regularized solution based on this constrained optimization [MLiT (C)]. Again, we update one column vector, \(w_j\), at a time while keeping the others fixed. As the constraint on \(\Omega_l\), \(g(\Omega_l)=0\), we choose an entrywise L1-norm constraint, \(g(\Omega_l)=\|\Omega_l\|_{L1}-s=0\), where the constant s is the chosen value for the norm, or scale:
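Written out entrywise (this explicit form is our reading of the constraint above, with \(\omega_{dp}\) denoting the entries of \(\Omega_l\)), the constraint is
\[
g(\Omega_l)=\sum_{d=1}^{D}\sum_{p=1}^{P}\left|\omega_{dp}\right|-s=0,
\]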
where \(\dim(\Omega_l)=D\times P\). However, as we update only one column \(w_j\) at a time, we need to impose this constraint in a column-wise manner. Therefore, we turn (34) into the stronger constraint \(g(w_j)\)
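A column-wise version of the constraint can be written as follows (we use a generic constant c for the per-column value of the norm, since its exact relation to s is not restated here):
\[
g(w_j)=\sum_{d=1}^{D}\left|w_{dj}\right|-c=0,
\]
with \(w_{dj}\) the d-th entry of \(w_j\).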
A solution for (17) under constraint (35) may be obtained by using a constrained optimization solver such as CML [22]. However, we prefer to provide an inline solution for reasons of speed and independence. Constraint (35) can be represented in expanded notation as
under condition
Equation (36) shows the \(2^D\) different combinations of signs that the constraint can take, leading to \(2^D\) different linear constraints that do not involve absolute values. Although the number of such constraints is significant, it is tolerable for typical values of D. In contrast, alternative approaches for computing equality-constrained solutions impose constraints on the rows of \(\Omega_l\) [23]. The number of constraints so required is in the order of \(2^P\), exponential in the high dimension, P, and therefore unmanageable. This is the ultimate justification for our choice of a column-wise solution for the maximization of \(Q(\theta,\theta^g)\) in \(\Omega_l\). Our constrained approach first solves (17) under each of the linear constraints in (36). If solutions are found, they are then tested post hoc for satisfaction of (37).
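To make the enumeration concrete, a minimal Python sketch (with variable names of our own choosing) of the sign-pattern expansion behind (36) and of the post hoc check of (37) is:

import itertools
import numpy as np

def sign_patterns(D):
    # All 2**D sign vectors sigma in {+1, -1}^D. Each sigma turns the
    # entrywise L1 constraint sum_d |w_dj| = c into the linear constraint
    # sigma^T w_j = c, which contains no absolute values.
    return [np.array(sigma) for sigma in itertools.product((1.0, -1.0), repeat=D)]

def satisfies_sign_condition(w_j, sigma, tol=1e-12):
    # Post hoc check of (37): the linear constraint coincides with the
    # L1 constraint only if every non-zero entry of w_j carries the sign
    # assumed by sigma.
    nonzero = np.abs(w_j) > tol
    return bool(np.all(np.sign(w_j[nonzero]) == sigma[nonzero]))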
For exemplification, let us consider one of the constraints in (36)
Then, consider the Lagrangian equation
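In the standard form (the sign convention for \(\lambda\) here is our own), the Lagrangian reads
\[
L(w_j,\lambda)=f(w_j)-\lambda\,g(w_j),
\]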
where \(f(w_j)\) is \(Q(\theta,\theta^g)\) as given in (10) [or, equivalently, in (16)], \(\lambda\) is the Lagrange multiplier, and \(g(w_j)\) is the constraint. Equation (39) can be written as
where the outer sum is dropped since all of its terms but one vanish after differentiation.
The derivative of (40) with respect to \(w_j\) is
where \(\overline{1}\) stands for a D × 1 vector of all ones. By pre-multiplying (41) by \(\Sigma_l^g\), we obtain:
By setting \(r=\sum_{i=1}^N y_{ij}^2\,p(l|y_i,\theta^g)\), and collectively denoting by b all the terms in \(\left[w_{k}^g\right]_{k=1\ldots P,\,k\neq j}\), we obtain
The solution for \(w_j\) can then be written as
Let us now denote by \(s_{nm}\) the D × D elements of \(\Sigma_l^g\) in row–column notation, and make (44) explicit in its D rows
We are now in a position to impose the constraint \(g(w_j)\) by summing all left- and right-hand members of (45). On the left we obtain c, a known value; on the right, a linear function of \(\lambda\). Therefore, \(\lambda\) can be solved for immediately as
With \(\lambda\) thus computed, its value is substituted into (44) to obtain the desired constrained solution for \(w_j\). For each of the remaining constraints, \(\lambda\) is calculated in the same way and the corresponding constrained solutions for \(w_j\) are obtained. We note that, for the EM algorithm to continue iterating, at least one constrained solution for \(w_j\) satisfying both (36) and (37) is needed. We also note that the specific choice of s is not critical: changing the scale simply leads to correspondingly scaled densities for all components and classes, and hence to equivalent classification outcomes; any residual differences can be attributed to numerical resolution. In the experiments, we tried different values of s on a logarithmic scale and chose the one yielding the highest classification accuracy.
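Under the assumption, ours rather than the paper's, that the pre-multiplied stationarity condition summarized in (42)–(44) reduces to \(r\,w_j = b - \lambda\,\Sigma_l^g\sigma\) for the sign vector \(\sigma\) of the active constraint, the column update could be sketched in Python as follows; the quantities Sigma_l, b, r and c are assumed to be precomputed, and `patterns` is the output of the sign_patterns helper shown earlier.

import numpy as np

def constrained_column_update(Sigma_l, b, r, c, patterns, tol=1e-12):
    # For each sign vector sigma, combine the linearized constraint
    # sigma^T w_j = c with the (assumed) stationarity condition
    #   r * w_j = b - lam * Sigma_l @ sigma,
    # solve for the Lagrange multiplier lam in closed form, and retain
    # only candidates whose entry signs match sigma (post hoc check of (37)).
    candidates = []
    for sigma in patterns:
        denom = sigma @ Sigma_l @ sigma
        if abs(denom) < tol:
            continue  # degenerate direction: no closed-form lambda
        lam = (sigma @ b - r * c) / denom
        w_j = (b - lam * (Sigma_l @ sigma)) / r
        nonzero = np.abs(w_j) > tol
        if np.all(np.sign(w_j[nonzero]) == sigma[nonzero]):
            candidates.append(w_j)
    # Among the feasible candidates, the caller keeps the one that
    # maximizes Q(theta, theta^g); here we simply return them all.
    return candidates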
Cite this article
Otoom, A.F., Gunes, H., Perez Concha, O. et al. MLiT: mixtures of Gaussians under linear transformations. Pattern Anal Applic 14, 193–205 (2011). https://doi.org/10.1007/s10044-011-0205-2