Abstract
This work presents a family of parsimonious Gaussian process models which allow one to build, from a finite sample, a model-based classifier in an infinite-dimensional space. The proposed parsimonious models are obtained by constraining the eigen-decomposition of the Gaussian processes modeling each class. In particular, this allows the use of non-linear mapping functions which project the observations into infinite-dimensional spaces. It is also demonstrated that the classifier can be built directly from the observation space through a kernel function. The proposed classification method is thus able to classify data of various types, such as categorical data, functional data or networks. Furthermore, it is possible to classify mixed data by combining different kernels. The methodology is also extended to the unsupervised classification case, and an EM algorithm is derived for the inference. Experimental results on various data sets demonstrate the effectiveness of the proposed method. A Matlab toolbox implementing the proposed classification methods is provided as supplementary material.
References
Akaike, H.: A new look at the statistical model identification. IEEE Trans. Autom. Control 19(6), 716–723 (1974)
Andrews, J.L., McNicholas, P.D.: Model-based clustering, classification, and discriminant analysis via mixtures of multivariate t-distributions. Stat. Comput. 22(5), 1021–1029 (2012)
Biernacki, C., Celeux, G., Govaert, G.: Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Trans. Pattern Anal. Mach. Intell. 22(7), 719–725 (2000)
Bouguila, N., Ziou, D., Vaillancourt, J.: Novel mixtures based on the Dirichlet distribution: application to data and image classification. In: Machine Learning and Data Mining in Pattern Recognition, pp. 172–181. Springer, Berlin (2003)
Bouveyron, C., Brunet, C.: Simultaneous model-based clustering and visualization in the Fisher discriminative subspace. Stat. Comput. 22(1), 301–324 (2012)
Bouveyron, C., Brunet-Saumard, C.: Model-based clustering of high-dimensional data: a review. Comput. Stat. Data Anal. 71, 52–78 (2013)
Bouveyron, C., Girard, S.: Robust supervised classification with mixture models: learning from data with uncertain labels. Pattern Recognit. 42(11), 2649–2658 (2009)
Bouveyron, C., Jacques, J.: Model-based clustering of time series in group-specific functional subspaces. Adv. Data Anal. Classif. 5(4), 281–300 (2011)
Bouveyron, C., Girard, S., Schmid, C.: High-dimensional discriminant analysis. Commun. Stat. 36, 2607–2623 (2007a)
Bouveyron, C., Girard, S., Schmid, C.: High-dimensional data clustering. Comput. Stat. Data Anal. 52, 502–519 (2007b)
Canu, S., Grandvalet, Y., Guigue, V., Rakotomamonjy, A.: SVM and kernel methods Matlab toolbox. In: Perception Systèmes et Information. INSA de Rouen, Rouen (2005)
Caponnetto, A., Micchelli, C.A., Pontil, M., Ying, Y.: Universal multi-task kernels. J. Mach. Learn. Res. 9, 1615–1646 (2008)
Cattell, R.: The scree test for the number of factors. Multivar. Behav. Res. 1(2), 245–276 (1966)
Celeux, G., Govaert, G.: Clustering criteria for discrete data and latent class models. J. Classif. 8(2), 157–176 (1991)
Chapelle, O., Schölkopf, B., Zien, A. (eds.): Semi-Supervised Learning. MIT Press, Cambridge. http://www.kyb.tuebingen.mpg.de/ssl-book (2006)
Couto, J.: Kernel k-means for categorical data. In: Advances in Intelligent Data Analysis VI, vol. 3646 of Lecture Notes in Computer Science, pp. 739–739. Springer, Berlin (2005)
Cuturi, M., Vert, J.P.: The context-tree kernel for strings. Neural Netw. 18(8), 1111–1123 (2005)
Dempster, A., Laird, N., Rubin, D.: Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B 39(1), 1–38 (1977)
Dundar, M.M., Landgrebe, D.A.: Toward an optimal supervised classifier for the analysis of hyperspectral data. IEEE Trans. Geosci. Remote Sens. 42(1), 271–277 (2004)
Evgeniou, T., Micchelli, C.A., Pontil, M.: Learning multiple tasks with kernel methods. J. Mach. Learn. Res. 6, 615–637 (2005)
Fisher, R.A.: The use of multiple measurements in taxonomic problems. Ann. Eugenics 7, 179–188 (1936)
Forbes, F., Wraith, D.: A new family of multivariate heavy-tailed distributions with variable marginal amounts of tailweight: application to robust clustering. Stat. Comput. (to appear) (2014)
Franczak, B.C., Browne, R.P., McNicholas, P.D.: Mixtures of shifted asymmetric Laplace distributions. IEEE Trans. Pattern Anal. Mach. Intell. 36(6), 1149–1157 (2014)
Girolami, M.: Mercer kernel-based clustering in feature space. IEEE Trans. Neural Netw. 13(3), 780–784 (2002)
Gönen, M., Alpaydin, E.: Multiple kernel learning algorithms. J. Mach. Learn. Res. 12, 2211–2268 (2011)
Hofmann, T., Schölkopf, B., Smola, A.: Kernel methods in machine learning. Ann. Stat. 36(3), 1171–1220 (2008)
Kadri, H., Rakotomamonjy, A., Bach, F., Preux, P.: Multiple operator-valued kernel learning. In: Neural Information Processing Systems (NIPS), pp. 1172–1180 (2012)
Kuss, M., Rasmussen, C.: Assessing approximate inference for binary Gaussian process classification. J. Mach. Learn. Res. 6, 1679–1704 (2005)
Lee, S., McLachlan, G.J.: Finite mixtures of multivariate skew t-distributions: some recent and new results. Stat. Comput. 24(2), 181–202 (2013)
Lehoucq, R., Sorensen, D.: Deflation techniques for an implicitly restarted Arnoldi iteration. SIAM J. Matrix Anal. Appl. 17(4), 789–821 (1996)
Lin, T.I.: Robust mixture modeling using multivariate skew t distribution. Stat. Comput. 20, 343–356 (2010)
Lin, T.I., Lee, J.C., Hsieh, W.J.: Robust mixture modeling using the skew t distribution. Stat. Comput. 17, 81–92 (2007)
Mahé, P., Vert, J.P.: Graph kernels based on tree patterns for molecules. Mach. Learn. 75(1), 3–35 (2009)
McLachlan, G.: Discriminant Analysis and Statistical Pattern Recognition. Wiley, New York (1992)
McLachlan, G., Peel, D., Bean, R.: Modelling high-dimensional data by mixtures of factor analyzers. Comput. Stat. Data Anal. 41, 379–388 (2003)
McNicholas, P., Murphy, B.: Parsimonious Gaussian mixture models. Stat. Comput. 18(3), 285–296 (2008)
Mika, S., Rätsch, G., Weston, J., Schölkopf, B., Müllers, K.R.: Fisher discriminant analysis with kernels. In: Neural Networks for Signal Processing IX, pp. 41–48 (1999)
Minka, T.: Expectation propagation for approximate Bayesian inference. In: Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, pp. 362–369. Morgan Kaufmann, San Francisco (2001)
Montanari, A., Viroli, C.: Heteroscedastic factor mixture analysis. Stat. Model. 10(4), 441–460 (2010)
Murphy, T.B., Dean, N., Raftery, A.E.: Variable selection and updating in model-based discriminant analysis for high dimensional data with food authenticity applications. Ann. Appl. Stat. 4(1), 219–223 (2010)
Murua, A., Wicker, N.: Kernel-based Mixture Models for Classification. Technical Report, University of Montréal (2014)
Pekalska, E., Haasdonk, B.: Kernel discriminant analysis for positive definite and indefinite kernels. IEEE Trans. Pattern Anal. Mach. Intell. 31(6), 1017–1032 (2009)
Ramsay, J.O., Silverman, B.W.: Functional Data Analysis. Springer Series in Statistics, 2nd edn. Springer, New York (2005)
Rasmussen, C., Williams, C.: Gaussian Processes for Machine Learning Matlab Toolbox. MIT, Cambridge (2006a)
Rasmussen, C., Williams, C.: Gaussian Processes for Machine Learning. MIT, Cambridge (2006b)
Schölkopf, B., Smola, A.: Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge (2001)
Schölkopf, B., Smola, A., Müller, K.-R.: Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput. 10(5), 1299–1319 (1998)
Schölkopf, B., Tsuda, K., Vert, J.-P. (eds.): Kernel Methods in Computational Biology. MIT, Cambridge (2004)
Schwarz, G.: Estimating the dimension of a model. Ann. Stat. 6, 461–464 (1978)
Shawe-Taylor, J., Cristianini, N.: Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge (2004)
Shorack, G.R., Wellner, J.A.: Empirical Processes with Applications to Statistics. Wiley, New York (1986)
Smola, A., Kondor, R.: Kernels and regularization on graphs. In: Proceedings of Conference on Learning Theory and Kernel Machines, pp. 144–158 (2003)
Wang, J., Lee, J., Zhang, C.: Kernel trick embedded Gaussian mixture model. In: Proceedings of the 14th International Conference on Algorithmic Learning Theory, pp. 159–174 (2003)
Xu, Z., Huang, K., Zhu, J., King, I., Lyu, M.R.: A novel kernel-based maximum a posteriori classification method. Neural Netw. 22, 977–987 (2009)
Acknowledgments
The authors would like to thank the associate editor and the referee for their helpful remarks and comments on the manuscript.
Appendix: Proofs
Proof of Proposition 1
Recalling that \(d_{\max }=\max (d_{1},\ldots , d_{k})\), the classification function can be rewritten as:
where \(\gamma =(r-d_{\max })\log (\lambda )\) is a constant term which does not depend on the index \(i\) of the class. In view of the assumptions, \(D_{i}(\varphi (x))\) can also be rewritten as:
Introducing the norm \(\Vert \cdot \Vert _{L_{2}}\) associated with the scalar product \(\langle \cdot ,\cdot \rangle _{L_{2}}\) and in view of Proposition 1 of Shorack and Wellner (1986, p. 208), we finally obtain:
which is the desired result. \(\square \)
Proof of Proposition 2
The proof involves three steps.
(i)
Computation of the projection \(\langle \varphi (x)-\hat{\mu }_{i},\hat{q}_{ij}\rangle _{L_{2}}\): Since \((\hat{\lambda }_{ij},\hat{q}_{ij})\) is a solution of the Fredholm-type equation, it follows that, for all \(t\in J\),
$$\begin{aligned} \hat{\lambda }_{ij}\hat{q}_{ij}(t)&= \int _{J}\hat{{\varSigma }}_{i}(s,t)\hat{q}_{ij}(s)ds\nonumber \\&= \frac{1}{n_{i}}\sum _{x_{\ell }\in C_{i}}\langle \varphi (x_{\ell })-\hat{\mu }_{i},\hat{q}_{ij}\rangle _{L_{2}}(\varphi (x_{\ell })(t)-\hat{\mu }_{i}(t)). \end{aligned}$$(8)

This implies that \(\hat{q}_{ij}\) lies in the linear subspace spanned by the \((\varphi (x_{\ell })-\hat{\mu }_{i})\), \(x_{\ell }\in C_{i}\). As a consequence, the rank of the operator \(\hat{{\varSigma }}_{i}\) is finite and at most \(r_{i}=\min (n_{i},r)\). There therefore exist coefficients \(\beta _{ij\ell }\in {\mathbb {R}}\) such that:
$$\begin{aligned} \hat{q}_{ij}=\frac{1}{\sqrt{n_{i}\hat{\lambda }_{ij}}}\sum _{x_{\ell }\in C_{i}}\beta _{ij\ell }(\varphi (x_{\ell })-\hat{\mu }_{i}) \end{aligned}$$(9)

leading to:
$$\begin{aligned} \langle \varphi (x)-\hat{\mu }_{i},\hat{q}_{ij}\rangle _{L_{2}}=\frac{1}{\sqrt{n_{i}\hat{\lambda }_{ij}}}\sum _{x_{\ell }\in C_{i}}\beta _{ij\ell }\rho _{i}(x,x_{\ell }), \end{aligned}$$(10)

for all \(j=1,\ldots ,r_{i}\). The estimated classification function therefore has the following form:
$$\begin{aligned} \hat{D}_{i}(\varphi (x))&=\frac{1}{n_{i}}\sum _{j=1}^{d_{i}}\frac{1}{\hat{\lambda }_{ij}}\left( \frac{1}{\hat{\lambda }_{ij}}-\frac{1}{\hat{\lambda }}\right) \left( \sum _{x_{\ell }\in C_{i}}\beta _{ij\ell }\rho _{i}(x,x_{\ell })\right) ^{2}\\&\qquad +\frac{1}{\hat{\lambda }}\rho _{i}(x,x)\\&\qquad +\sum _{j=1}^{d_{i}}\log (\hat{\lambda }_{ij})+(d_{\max }-d_{i})\log (\hat{\lambda })-2\log (\hat{\pi }_{i}), \end{aligned}$$

for all \(i=1,\ldots ,k\).
(ii)
Computation of the \(\beta _{ij\ell }\) and \(\hat{\lambda }_{ij}\): Substituting (9) into the Fredholm-type equation (8), it follows that

$$\begin{aligned}&\frac{1}{n_{i}}\sum _{x_{\ell },x_{\ell '}\in C_{i}}\beta _{ij\ell '}(\varphi (x_{\ell })-\hat{\mu }_{i})\rho _{i}(x_{\ell },x_{\ell '})\\&\qquad \qquad =\hat{\lambda }_{ij}\sum _{x_{\ell '}\in C_{i}}\beta _{ij\ell '}(\varphi (x_{\ell '})-\hat{\mu }_{i}). \end{aligned}$$

Finally, projecting this equation on \(\varphi (x_{m})-\hat{\mu }_{i}\) for \(x_{m}\in C_{i}\) yields

$$\begin{aligned}&\frac{1}{n_{i}}\sum _{x_{\ell },x_{\ell '}\in C_{i}}\beta _{ij\ell '}\rho _{i}(x_{\ell },x_{m})\rho _{i}(x_{\ell },x_{\ell '})\\&\qquad \qquad =\hat{\lambda }_{ij}\sum _{x_{\ell '}\in C_{i}}\beta _{ij\ell '}\rho _{i}(x_{m},x_{\ell '}). \end{aligned}$$

Recalling that \(M_{i}\) is the \(n_{i}\times n_{i}\) matrix defined by \((M_{i})_{\ell ,\ell '}=\rho _{i}(x_{\ell },x_{\ell '})/n_{i}\) and introducing the vector \(\beta _{ij}\in {\mathbb {R}}^{n_{i}}\) defined by \((\beta _{ij})_{\ell }=\beta _{ij\ell }\), the above equation can be rewritten as \(M_{i}^{2}\beta _{ij}=\hat{\lambda }_{ij}M_{i}\beta _{ij}\) or, after simplification, \(M_{i}\beta _{ij}=\hat{\lambda }_{ij}\beta _{ij}\). As a consequence, \(\hat{\lambda }_{ij}\) is the \(j\)th largest eigenvalue of \(M_{i}\) and \(\beta _{ij}\) is the associated eigenvector, for all \(1\le j\le d_{i}\). Let us note that the constraint \(\Vert \hat{q}_{ij}\Vert =1\) can be rewritten as \(\beta _{ij}^{t}\beta _{ij}=1\).
(iii)
Computation of \(\hat{\lambda }\): Remarking that \({\mathrm {trace}}(\hat{{\varSigma }}_{i})={\mathrm {trace}}(M_{i})+\sum _{j=r_{i}+1}^{r}\hat{\lambda }_{ij}\), it follows that:
$$\begin{aligned} \hat{\lambda }=\frac{1}{\sum _{i=1}^{k}\hat{\pi }_{i}(r_{i}-d_{i})}\sum _{i=1}^{k}\hat{\pi }_{i}\left( {\mathrm {trace}}(M_{i})-\sum _{j=1}^{d_{i}}\hat{\lambda }_{ij}\right) , \end{aligned}$$

and the proposition is proved. \(\square \)
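Proposition 2 suggests a direct numerical recipe: all quantities are computable from the Gram matrix alone. The sketch below illustrates this under stated assumptions (a Gaussian kernel, random data, and the names `class_kernel_matrix` and `estimate_parameters` are illustrative, not the toolbox API): each \(M_{i}\) is the doubly centered Gram matrix divided by \(n_{i}\), the pairs \((\hat{\lambda }_{ij},\beta _{ij})\) are its leading eigenpairs, and \(\hat{\lambda }\) pools the discarded eigenvalues as in step (iii) with \(r_{i}=n_{i}\).

```python
import numpy as np

def class_kernel_matrix(X, kernel):
    """M_i with (M_i)[l, l'] = rho_i(x_l, x_l') / n_i, where rho_i is the
    kernel centered at the empirical class mean in the feature space."""
    n = len(X)
    K = np.array([[kernel(a, b) for b in X] for a in X])
    row = K.mean(axis=1, keepdims=True)
    rho = K - row - row.T + K.mean()           # double centering
    return rho / n

def estimate_parameters(classes, kernel, d):
    """Per-class eigenpairs (lambda_ij, beta_ij) of M_i, plus the pooled
    noise estimate lambda_hat of Proposition 2, step (iii), with r_i = n_i."""
    pis = np.array([len(X) for X in classes], dtype=float)
    pis /= pis.sum()                           # empirical proportions pi_i
    eigs, num, den = [], 0.0, 0.0
    for X, di, pi in zip(classes, d, pis):
        M = class_kernel_matrix(X, kernel)
        vals, vecs = np.linalg.eigh(M)
        vals, vecs = vals[::-1], vecs[:, ::-1]   # decreasing order
        eigs.append((vals[:di], vecs[:, :di]))   # beta_ij^t beta_ij = 1
        num += pi * (np.trace(M) - vals[:di].sum())
        den += pi * (len(X) - di)
    return eigs, num / den

rng = np.random.default_rng(0)
gauss = lambda a, b: np.exp(-np.sum((a - b) ** 2))
classes = [rng.normal(size=(12, 2)), rng.normal(loc=2.0, size=(15, 2))]
eigs, lam = estimate_parameters(classes, gauss, d=[2, 2])
assert lam > 0    # discarded variance of a PSD kernel matrix is non-negative
```

Note that only evaluations of the kernel are needed: the mapping \(\varphi \) is never formed, which is what makes the infinite-dimensional model tractable.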
Proof of Proposition 3
It is sufficient to prove that \(\hat{q}_{ij}\) and \(\hat{\lambda }_{ij}\) are respectively the \(j\)th normed eigenvector and eigenvalue of \(\hat{{\varSigma }}_{i}\). First,
and remarking that \(\beta _{ij}\) is an eigenvector of \(M_{i}\), it follows that:
Second, straightforward algebra shows that
and the result is proved. \(\square \)
Proof of Proposition 4
For all \(\ell =1,\ldots ,L\), the \(\ell \)th coordinate of the mapping function \(\varphi (x)\) is defined as the \(\ell \)th coordinate of the function \(x\) expressed in the truncated basis \(\{b_{1},\ldots ,b_{L}\}\). More specifically,
for all \(t\in [0,1]\) and thus, for all \(j=1,\ldots ,L\), we have
As a consequence, \(\varphi (x)=B^{-1}\gamma (x)\) and \(K(x,y)=\gamma (x)^{t}B^{-1}\gamma (y)\). Introducing
it follows that \(\rho _{i}(x,y)=(\gamma (x)-\bar{\gamma }_{i})^{t}B^{-1}(\gamma (y)-\bar{\gamma }_{i})\). Let us first show that \(\hat{q}_{ij}\) is an eigenvector of \(B^{-1}\hat{{\varSigma }}_{i}\). Recalling that
we have
Remarking that \(\beta _{ij}\) is an eigenvector of \(M_{i}\), it follows that:
Let us finally compute the norm of \(\hat{q}_{ij}\):
and the result is proved. \(\square \)
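The identities \(\varphi (x)=B^{-1}\gamma (x)\) and \(K(x,y)=\gamma (x)^{t}B^{-1}\gamma (y)\) used in this proof can be checked numerically for functions lying in the span of the basis. The sketch below is illustrative only: the monomial basis on \([0,1]\) and the trapezoidal quadrature are assumptions, not choices made in the paper.

```python
import numpy as np

L = 4
t = np.linspace(0.0, 1.0, 2001)
dt = t[1] - t[0]
basis = np.vstack([t ** l for l in range(L)])   # b_l(t) = t^(l-1) on [0, 1]

def inner(f, g):
    """Trapezoidal approximation of the L2[0, 1] scalar product."""
    h = f * g
    return float((h[:-1] + h[1:]).sum() * dt / 2)

# Gram matrix of the truncated basis: B[j, l] = <b_j, b_l>_{L2}
B = np.array([[inner(basis[j], basis[l]) for l in range(L)] for j in range(L)])

# two functions lying in the span of the basis, with known coordinates
cx = np.array([1.0, -2.0, 0.5, 0.0])
cy = np.array([0.0, 1.0, 1.0, -1.0])
x, y = cx @ basis, cy @ basis

# gamma_l(x) = <x, b_l>_{L2}; phi(x) = B^{-1} gamma(x) recovers the coordinates
gamma = lambda f: np.array([inner(f, basis[l]) for l in range(L)])
phi_x = np.linalg.solve(B, gamma(x))
assert np.allclose(phi_x, cx, atol=1e-2)        # up to quadrature error

# K(x, y) = gamma(x)^t B^{-1} gamma(y) equals the L2 scalar product <x, y>
K_xy = gamma(x) @ np.linalg.solve(B, gamma(y))
assert np.isclose(K_xy, inner(x, y), atol=1e-3)
```

The tolerances only absorb quadrature error; with an orthonormal basis \(B\) would be the identity and \(\varphi (x)=\gamma (x)\).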
Proof of Proposition 5
Let us first set the estimate of \(\mu _{i}\) at iteration \(s\) to its empirical counterpart, conditionally on the current value of the model parameter \(\theta ^{(s-1)}\):
where \(n_{i}^{(s-1)}=\sum _{j=1}^{n}t_{ij}^{(s-1)}\). Replacing \(\hat{\mu }_{i}\) by \(\hat{\mu }_{i}^{(s)}\) in the proof of Proposition 2, we get:
According to Bayes’ rule and Eq. (1), the computation of \(t_{ij}^{(s)}={\mathbb {E}}(Z_{j}=i|x_{j},\theta ^{(s-1)})\) conditionally on the current value of the model parameter \(\theta ^{(s-1)}\) can finally be written as:
for \(j=1,\ldots ,n\) and \(i=1,\ldots ,k\). The result follows. \(\square \)
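Since \(\hat{D}_{i}(\varphi (x))=-2\log (\hat{\pi }_{i}f_{i}(x))\) up to an additive constant (see the classification function in Proposition 2), this E-step reduces to a softmax over the halved, negated costs. A numerically stable sketch, where `D` holds hypothetical cost values and `e_step` is an illustrative name:

```python
import numpy as np

def e_step(D):
    """Posterior probabilities t_ij from the cost functions: since
    D_i(phi(x_j)) = -2 log(pi_i f_i(x_j)) + cst, Bayes' rule gives
    t_ij = 1 / sum_l exp((D[i, j] - D[l, j]) / 2)."""
    logit = -0.5 * D                                  # proportional to log(pi_i f_i)
    logit = logit - logit.max(axis=0, keepdims=True)  # stabilize the exponentials
    w = np.exp(logit)
    return w / w.sum(axis=0, keepdims=True)

# hypothetical cost values for k = 2 classes and n = 2 observations
D = np.array([[2.0, 10.0],
              [4.0, 1.0]])
T = e_step(D)
assert np.allclose(T.sum(axis=0), 1.0)   # each column is a posterior over classes
```

Subtracting the columnwise maximum before exponentiating leaves the ratios unchanged but avoids overflow when the costs are large.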
Bouveyron, C., Fauvel, M. & Girard, S. Kernel discriminant analysis and clustering with parsimonious Gaussian process models. Stat Comput 25, 1143–1162 (2015). https://doi.org/10.1007/s11222-014-9505-x