Abstract
The curse of dimensionality hinders density estimation in high-dimensional spaces. Many techniques have been proposed to discover embedded, locally linear manifolds of lower dimensionality, including the mixture of principal component analyzers, the mixture of probabilistic principal component analyzers and the mixture of factor analyzers. In this paper, we propose a novel mixture model for dimensionality reduction based on a linear transformation that is neither restricted to be orthogonal nor aligned along the principal directions. For experimental validation, we have used the proposed model for the classification of five “hard” data sets and compared its accuracy with that of other popular classifiers. The proposed method outperformed the mixture of probabilistic principal component analyzers on four of the five data sets, with accuracy improvements ranging from 0.5 to 3.2%. Moreover, on all data sets its accuracy exceeded that of the Gaussian mixture model, with improvements ranging from 0.2 to 3.4%.
References
Bellman R (1961) Adaptive control processes: a guided tour. Princeton University Press, Princeton, New Jersey
Tipping ME, Bishop CM (1999) Probabilistic principal component analysis. J R Stat Soc Ser B (Stat Methodol) 61(3):611–622
Roweis S (1997) EM algorithms for PCA and SPCA. In: Advances in neural information processing systems, vol 10. The MIT Press, Colorado, pp 626–632
Bartholomew DJ (1987) Latent variable models and factor analysis. Charles Griffin, London
Basilevsky A (1994) Statistical factor analysis and related methods. Wiley, New York
Schölkopf B, Smola A, Müller K-R (1998) Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput 10:1299–1319
Chin T-J, Suter D (2007) Incremental kernel principal component analysis. IEEE Trans Image Process 16(6):1662–1674
Hinton GE, Dayan P, Revow M (1997) Modeling the manifolds of images of handwritten digits. IEEE Trans Neural Netw 8(1):65–74
Tipping ME, Bishop CM (1999) Mixtures of probabilistic principal component analyzers. Neural Comput 11(2):443–482
Ghahramani Z, Hinton GE (1997) The EM algorithm for mixtures of factor analyzers. Technical Report CRG-TR-96-1, University of Toronto (the original paper for the mixture of factor analyzers)
Ridder DD, Franc V (2003) Robust subspace mixture models using t-distribution. In: 14th British Machine Vision Conference (BMVC), London, UK, pp 319–328
Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B 39(1):1–38
Bishop CM (2006) Pattern recognition and machine learning. Springer, New York
Kittler JV (1998) Combining classifiers: a theoretical framework. Pattern Anal Appl 1(1):18–27
Neal R, Hinton G (1999) A view of the EM algorithm that justifies incremental, sparse, and other variants. In: Jordan MI (ed) Learning in graphical models. MIT Press, Cambridge, MA, pp 355–368
Figueiredo MAF, Jain AK (2002) Unsupervised learning of finite mixture models. IEEE Trans Pattern Anal Mach Intell 24(3):381–396 (avoid singularity by applying deterministic annealing)
Bolton RJ, Krzanowski WJ (1999) A characterization of principal components for projection pursuit. Am Stat 53(2):108–109
Asuncion A, Newman DJ (2007) UCI machine learning repository
Jain AK, Duin RPW, Jianchang M (2000) Statistical pattern recognition: a review. IEEE Trans Pattern Anal Mach Intell 22(1):4–37
Breiman L, Spector P (1992) Submodel selection and evaluation in regression: the x-random case. Int Stat Rev 60(3):291–319
Bilmes JA (1998) A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models. Technical Report TR-97-021, International Computer Science Institute, Berkeley
Schoenberg R (1997) Constrained maximum likelihood. Comput Econ 10:251–266
Golub GH, van Loan CF (1996) Matrix computations. Johns Hopkins University Press, 3rd edn
Acknowledgments
The authors wish to thank the Australian Research Council and iOmniscient Pty Ltd, which have partially supported this work under the Linkage Project funding scheme, grant LP0668325.
Appendices
Appendix 1
1.1 The expected complete-data log likelihood for MLiT
In this appendix, we show the derivation of the two main terms in (9), namely the complete-data log-likelihood, \(\ln p(Y,Z|\theta)\), and the posterior for the latent variables, \(p(Z|Y,\theta^g)\). Although we report this derivation here for completeness, it can be readily derived from the equivalent derivation for conventional Gaussian mixtures.
We begin with the derivation of an expression for \(\ln p(Y,Z|\theta)\). We first apply Bayes' theorem
Given the following assumptions: (a) the mutual independence of the observations \(y_i\); (b) the dependence of each \(y_i\) only on its own latent variable \(z_i\); and (c) the mutual independence of the \(z_i\), we have
The argument of the logarithm can be written as
Therefore, we have the final equivalence
The next term needed is the posterior, \(p(Z|Y,\theta^g)\). Under the assumptions above, the probability of an entire assignment, Z, conditioned on the observations is
Let us then apply Bayes' theorem, again, to \(p(z_i = l\,|\,y_i,\theta^g)\)
On the right-hand side of the above, there are three terms: the numerator contains the terms that we have already computed for (28), while the denominator is the marginal probability of \(y_i\) over all the components. Therefore, the above equation becomes
justifying our formula for the responsibilities. We can now substitute both terms (29) and (30) into \(Q(\theta,\theta^g)\) to obtain
The following simple steps leading to (10) can be repeated from [21].
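For reference, in a conventional Gaussian mixture, which the MLiT derivation parallels as noted above, the two quantities just derived take the familiar forms (in our own notation, with K components and \(z_{il}\) the 1-of-K indicator of \(z_i\); in MLiT the Gaussian density is replaced by the corresponding transformed component density):
\[
\ln p(Y,Z\,|\,\theta)=\sum_{i=1}^{N}\sum_{l=1}^{K} z_{il}\,\bigl[\ln \pi_l+\ln \mathcal{N}(y_i\,|\,\mu_l,\Sigma_l)\bigr],
\qquad
p(z_i=l\,|\,y_i,\theta^g)=\frac{\pi_l^{g}\,\mathcal{N}(y_i\,|\,\mu_l^{g},\Sigma_l^{g})}{\sum_{m=1}^{K}\pi_m^{g}\,\mathcal{N}(y_i\,|\,\mu_m^{g},\Sigma_m^{g})}.
\]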
Appendix 2
2.1 A constrained expectation–maximization algorithm for MLiT
In this appendix, we present the regularized solution based on this constrained optimization [MLiT (C)]. Again, we update one column vector, \(w_j\), at a time while keeping the others fixed. As the constraint on \(\Omega_l\), \(g(\Omega_l)=0\), we choose an entrywise L1-norm constraint, \(g(\Omega_l)=\|\Omega_l\|_{L1}-s=0\), where the constant s is the chosen value for the norm, or scale:
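Written out entrywise (this explicit form is our reading of the constraint above, with \(\omega_{dp}\) denoting the entries of \(\Omega_l\)), the constraint is
\[
g(\Omega_l)=\sum_{d=1}^{D}\sum_{p=1}^{P}\left|\omega_{dp}\right|-s=0,
\]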
where \(\dim(\Omega_l)=D\times P\). However, as we update only one column \(w_j\) at a time, we need to impose this constraint in a column-wise manner. Therefore, we turn (34) into the stronger constraint \(g(w_j)\)
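A column-wise version of the constraint can be written as follows (we use a generic constant c for the per-column value of the norm, since its exact relation to s is not restated here):
\[
g(w_j)=\sum_{d=1}^{D}\left|w_{dj}\right|-c=0,
\]
with \(w_{dj}\) the d-th entry of \(w_j\).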
A solution for (17) under constraint (35) may be obtained by using a constrained optimization solver such as CML [22]. However, we prefer to provide an inline solution for reasons of speed and independence. Constraint (35) can be represented in expanded notation as
under condition
Equation (36) shows the \(2^D\) different combinations of signs that the constraint can take, leading to \(2^D\) different linear constraints that do not involve absolute values. Although the number of such constraints is significant, it is tolerable for typical values of D. In contrast, alternative approaches for computing equality-constrained solutions impose constraints on the rows of \(\Omega_l\) [23]. The number of constraints so required is in the order of \(2^P\), exponential in the high dimension, P, and therefore unmanageable. This is the ultimate justification for our choice of a column-wise solution for the maximization of \(Q(\theta,\theta^g)\) in \(\Omega_l\). Our constrained approach first solves (17) under each of the linear constraints in (36). If solutions are found, they are then tested post hoc for satisfaction of (37).
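To make the enumeration concrete, a minimal Python sketch (with variable names of our own choosing) of the sign-pattern expansion behind (36) and of the post hoc check of (37) is:

import itertools
import numpy as np

def sign_patterns(D):
    # All 2**D sign vectors sigma in {+1, -1}^D. Each sigma turns the
    # entrywise L1 constraint sum_d |w_dj| = c into the linear constraint
    # sigma^T w_j = c, which contains no absolute values.
    return [np.array(sigma) for sigma in itertools.product((1.0, -1.0), repeat=D)]

def satisfies_sign_condition(w_j, sigma, tol=1e-12):
    # Post hoc check of (37): the linear constraint coincides with the
    # L1 constraint only if every non-zero entry of w_j carries the sign
    # assumed by sigma.
    nonzero = np.abs(w_j) > tol
    return bool(np.all(np.sign(w_j[nonzero]) == sigma[nonzero]))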
For exemplification, let us consider one of the constraints in (36)
Then, consider the Lagrangian equation
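In the standard form (the sign convention for \(\lambda\) here is our own), the Lagrangian reads
\[
L(w_j,\lambda)=f(w_j)-\lambda\,g(w_j),
\]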
where \(f(w_j)\) is \(Q(\theta,\theta^g)\) as given in (10) [or, equivalently, in (16)], \(\lambda\) is the Lagrange multiplier, and \(g(w_j)\) is the constraint. Equation (39) can be written as
where the outer sum is dropped since all of its terms but one vanish after differentiation.
The derivative of (40) with respect to \(w_j\) is
where \(\overline{1}\) stands for a D × 1 vector of all ones. By pre-multiplying (41) by \(\Sigma_l^g\), we obtain:
By setting \(r=\sum_{i=1}^N y_{ij}^2\,p(l|y_i,\theta^g)\), and collectively denoting by b all the terms in \(\left[w_{k}^g\right]_{k=1\ldots P,\,k\neq j}\), we obtain
The solution for \(w_j\) can then be written as
Let us now denote by \(s_{nm}\) the D × D elements of \(\Sigma_l^g\) in row–column notation, and make (44) explicit in its D rows
We are now in a position to impose the constraint \(g(w_j)\) by summing all left- and right-hand members of (45). On the left we obtain c, a known value; on the right, a linear function of \(\lambda\). Therefore, \(\lambda\) can be solved for immediately as
With \(\lambda\) thus computed, its value is substituted into (44) to obtain the desired constrained solution for \(w_j\). For each of the remaining constraints, \(\lambda\) is calculated in the same way and the corresponding constrained solutions for \(w_j\) are obtained. We note that, for the EM algorithm to continue iterating, at least one constrained solution for \(w_j\) satisfying both (36) and (37) is needed. We also note that the specific choice of s is not critical: changing the scale simply leads to correspondingly scaled densities for all components and classes, and hence to equivalent classification outcomes; any residual differences can be attributed to numerical resolution. In the experiments, we tried different values of s on a logarithmic scale and chose the one yielding the highest classification accuracy.
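Under the assumption, ours rather than the paper's, that the pre-multiplied stationarity condition summarized in (42)–(44) reduces to \(r\,w_j = b - \lambda\,\Sigma_l^g\sigma\) for the sign vector \(\sigma\) of the active constraint, the column update could be sketched in Python as follows; the quantities Sigma_l, b, r and c are assumed to be precomputed, and `patterns` is the output of the sign_patterns helper shown earlier.

import numpy as np

def constrained_column_update(Sigma_l, b, r, c, patterns, tol=1e-12):
    # For each sign vector sigma, combine the linearized constraint
    # sigma^T w_j = c with the (assumed) stationarity condition
    #   r * w_j = b - lam * Sigma_l @ sigma,
    # solve for the Lagrange multiplier lam in closed form, and retain
    # only candidates whose entry signs match sigma (post hoc check of (37)).
    candidates = []
    for sigma in patterns:
        denom = sigma @ Sigma_l @ sigma
        if abs(denom) < tol:
            continue  # degenerate direction: no closed-form lambda
        lam = (sigma @ b - r * c) / denom
        w_j = (b - lam * (Sigma_l @ sigma)) / r
        nonzero = np.abs(w_j) > tol
        if np.all(np.sign(w_j[nonzero]) == sigma[nonzero]):
            candidates.append(w_j)
    # Among the feasible candidates, the caller keeps the one that
    # maximizes Q(theta, theta^g); here we simply return them all.
    return candidates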
Cite this article
Otoom, A.F., Gunes, H., Perez Concha, O. et al. MLiT: mixtures of Gaussians under linear transformations. Pattern Anal Applic 14, 193–205 (2011). https://doi.org/10.1007/s10044-011-0205-2