Abstract
This work presents a family of parsimonious Gaussian process models which allow one to build, from a finite sample, a model-based classifier in an infinite-dimensional space. The proposed parsimonious models are obtained by constraining the eigen-decomposition of the Gaussian processes modeling each class. In particular, this allows the use of non-linear mapping functions which project the observations into infinite-dimensional spaces. It is also demonstrated that the classifier can be built directly from the observation space through a kernel function. The proposed classification method is thus able to classify data of various types, such as categorical data, functional data or networks. Furthermore, it is possible to classify mixed data by combining different kernels. The methodology is also extended to the unsupervised classification case, and an EM algorithm is derived for the inference. Experimental results on various data sets demonstrate the effectiveness of the proposed method. A Matlab toolbox implementing the proposed classification methods is provided as supplementary material.
References
Akaike, H.: A new look at the statistical model identification. IEEE Trans. Autom. Control 19(6), 716–723 (1974)
Andrews, J.L., McNicholas, P.D.: Model-based clustering, classification, and discriminant analysis via mixtures of multivariate t-distributions. Stat. Comput. 22(5), 1021–1029 (2012)
Biernacki, C., Celeux, G., Govaert, G.: Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Trans. Pattern Anal. Mach. Intell. 22(7), 719–725 (2000)
Bouguila, N., Ziou, D., Vaillancourt, J.: Novel mixtures based on the Dirichlet distribution: application to data and image classification. In: Machine Learning and Data Mining in Pattern Recognition, pp. 172–181. Springer, Berlin (2003)
Bouveyron, C., Brunet, C.: Simultaneous model-based clustering and visualization in the Fisher discriminative subspace. Stat. Comput. 22(1), 301–324 (2012)
Bouveyron, C., Brunet-Saumard, C.: Model-based clustering of high-dimensional data: a review. Comput. Stat. Data Anal. 71, 52–78 (2013)
Bouveyron, C., Girard, S.: Robust supervised classification with mixture models: learning from data with uncertain labels. Pattern Recognit. 42(11), 2649–2658 (2009)
Bouveyron, C., Jacques, J.: Model-based clustering of time series in group-specific functional subspaces. Adv. Data Anal. Classif. 5(4), 281–300 (2011)
Bouveyron, C., Girard, S., Schmid, C.: High-dimensional discriminant analysis. Commun. Stat. 36, 2607–2623 (2007a)
Bouveyron, C., Girard, S., Schmid, C.: High-dimensional data clustering. Comput. Stat. Data Anal. 52, 502–519 (2007b)
Canu, S., Grandvalet, Y., Guigue, V., Rakotomamonjy, A.: SVM and kernel methods Matlab toolbox. In: Perception Systèmes et Information. INSA de Rouen, Rouen (2005)
Caponnetto, A., Micchelli, C.A., Pontil, M., Ying, Y.: Universal multi-task kernels. J. Mach. Learn. Res. 9, 1615–1646 (2008)
Cattell, R.: The scree test for the number of factors. Multivar. Behav. Res. 1(2), 245–276 (1966)
Celeux, G., Govaert, G.: Clustering criteria for discrete data and latent class models. J. Classif. 8(2), 157–176 (1991)
Chapelle, O., Schölkopf, B., Zien, A. (eds.): Semi-Supervised Learning. MIT Press, Cambridge. http://www.kyb.tuebingen.mpg.de/ssl-book (2006)
Couto, J.: Kernel k-means for categorical data. In: Advances in Intelligent Data Analysis VI, vol. 3646 of Lecture Notes in Computer Science, pp. 739–739. Springer, Berlin (2005)
Cuturi, M., Vert, J.P.: The context-tree kernel for strings. Neural Netw. 18(8), 1111–1123 (2005)
Dempster, A., Laird, N., Rubin, D.: Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B 39(1), 1–38 (1977)
Dundar, M.M., Landgrebe, D.A.: Toward an optimal supervised classifier for the analysis of hyperspectral data. IEEE Trans. Geosci. Remote Sens. 42(1), 271–277 (2004)
Evgeniou, T., Micchelli, C.A., Pontil, M.: Learning multiple tasks with kernel methods. J. Mach. Learn. Res. 6, 615–637 (2005)
Fisher, R.A.: The use of multiple measurements in taxonomic problems. Ann. Eugenics 7, 179–188 (1936)
Forbes, F., Wraith, D.: A new family of multivariate heavy-tailed distributions with variable marginal amounts of tailweight: application to robust clustering. Stat. Comput. (to appear) (2014)
Franczak, B.C., Browne, R.P., McNicholas, P.D.: Mixtures of shifted asymmetric Laplace distributions. IEEE Trans. Pattern Anal. Mach. Intell. 36(6), 1149–1157 (2014)
Girolami, M.: Mercer kernel-based clustering in feature space. IEEE Trans. Neural Netw. 13(3), 780–784 (2002)
Gönen, M., Alpaydin, E.: Multiple kernel learning algorithms. J. Mach. Learn. Res. 12, 2211–2268 (2011)
Hofmann, T., Schölkopf, B., Smola, A.: Kernel methods in machine learning. Ann. Stat. 36(3), 1171–1220 (2008)
Kadri, H., Rakotomamonjy, A., Bach, F., Preux, P.: Multiple operator-valued kernel learning. In: Neural Information Processing Systems (NIPS), pp. 1172–1180 (2012)
Kuss, M., Rasmussen, C.: Assessing approximate inference for binary Gaussian process classification. J. Mach. Learn. Res. 6, 1679–1704 (2005)
Lee, S., McLachlan, G.J.: Finite mixtures of multivariate skew t-distributions: some recent and new results. Stat. Comput. 24(2), 181–202 (2013)
Lehoucq, R., Sorensen, D.: Deflation techniques for an implicitly restarted Arnoldi iteration. SIAM J. Matrix Anal. Appl. 17(4), 789–821 (1996)
Lin, T.I.: Robust mixture modeling using multivariate skew t distribution. Stat. Comput. 20, 343–356 (2010)
Lin, T.I., Lee, J.C., Hsieh, W.J.: Robust mixture modeling using the skew t distribution. Stat. Comput. 17, 81–92 (2007)
Mahé, P., Vert, J.P.: Graph kernels based on tree patterns for molecules. Mach. Learn. 75(1), 3–35 (2009)
McLachlan, G.: Discriminant Analysis and Statistical Pattern Recognition. Wiley, New York (1992)
McLachlan, G., Peel, D., Bean, R.: Modelling high-dimensional data by mixtures of factor analyzers. Comput. Stat. Data Anal. 41, 379–388 (2003)
McNicholas, P., Murphy, B.: Parsimonious Gaussian mixture models. Stat. Comput. 18(3), 285–296 (2008)
Mika, S., Rätsch, G., Weston, J., Schölkopf, B., Müllers, K.R.: Fisher discriminant analysis with kernels. In: Neural Networks for Signal Processing IX, pp. 41–48 (1999)
Minka, T.: Expectation propagation for approximate Bayesian inference. In: Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, pp. 362–369. Morgan Kaufmann, San Francisco (2001)
Montanari, A., Viroli, C.: Heteroscedastic factor mixture analysis. Stat. Model. 10(4), 441–460 (2010)
Murphy, T.B., Dean, N., Raftery, A.E.: Variable selection and updating in model-based discriminant analysis for high dimensional data with food authenticity applications. Ann. Appl. Stat. 4(1), 219–223 (2010)
Murua, A., Wicker, N.: Kernel-based Mixture Models for Classification. Technical Report, University of Montréal (2014)
Pekalska, E., Haasdonk, B.: Kernel discriminant analysis for positive definite and indefinite kernels. IEEE Trans. Pattern Anal. Mach. Intell. 31(6), 1017–1032 (2009)
Ramsay, J.O., Silverman, B.W.: Functional Data Analysis. Springer Series in Statistics, 2nd edn. Springer, New York (2005)
Rasmussen, C., Williams, C.: Gaussian Processes for Machine Learning Matlab Toolbox. MIT, Cambridge (2006a)
Rasmussen, C., Williams, C.: Gaussian Processes for Machine Learning. MIT, Cambridge (2006b)
Schölkopf, B., Smola, A.: Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge (2001)
Schölkopf, B., Smola, A., Müller, K.-R.: Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput. 10(5), 1299–1319 (1998)
Schölkopf, B., Tsuda, K., Vert, J.-P. (eds.): Kernel Methods in Computational Biology. MIT, Cambridge (2004)
Schwarz, G.: Estimating the dimension of a model. Ann. Stat. 6, 461–464 (1978)
Shawe-Taylor, J., Cristianini, N.: Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge (2004)
Shorack, G.R., Wellner, J.A.: Empirical Processes with Applications to Statistics. Wiley, New York (1986)
Smola, A., Kondor, R.: Kernels and regularization on graphs. In: Proceedings of Conference on Learning Theory and Kernel Machines, pp. 144–158 (2003)
Wang, J., Lee, J., Zhang, C.: Kernel trick embedded Gaussian mixture model. In: Proceedings of the 14th International Conference on Algorithmic Learning Theory, pp. 159–174 (2003)
Xu, Z., Huang, K., Zhu, J., King, I., Lyu, M.R.: A novel kernel-based maximum a posteriori classification method. Neural Netw. 22, 977–987 (2009)
Acknowledgments
The authors would like to thank the associate editor and the referee for their helpful remarks and comments on the manuscript.
Appendix: Proofs
Proof of Proposition 1
Recalling that \(d_{\max }=\max (d_{1},\ldots , d_{k})\), the classification function can be rewritten as:
where \(\gamma =(r-d_{\max })\log (\lambda )\) is a constant term which does not depend on the index \(i\) of the class. In view of the assumptions, \(D_{i}(\varphi (x))\) can also be rewritten as:
Introducing the norm \(\Vert \cdot \Vert _{L_{2}}\) associated with the scalar product \(\langle \cdot ,\cdot \rangle _{L_{2}}\) and in view of Proposition 1 of Shorack and Wellner (1986, p. 208), we finally obtain:
which is the desired result. \(\square \)
Proof of Proposition 2
The proof involves three steps.
(i)
Computation of the projection \(\langle \varphi (x)-\hat{\mu }_{i},\hat{q}_{ij}\rangle _{L_{2}}\): Since \((\hat{\lambda }_{ij},\hat{q}_{ij})\) is a solution of the Fredholm-type equation, it follows that, for all \(t\in J\),
$$\begin{aligned} \hat{\lambda }_{ij}\hat{q}_{ij}(t)&= \int _{J}\hat{{\varSigma }}_{i}(s,t)\hat{q}_{ij}(s)ds\nonumber \\&= \frac{1}{n_{i}}\sum _{x_{\ell }\in C_{i}}\langle \varphi (x_{\ell })-\hat{\mu }_{i},\hat{q}_{ij}\rangle _{L_{2}}(\varphi (x_{\ell })(t)-\hat{\mu }_{i}(t)). \end{aligned}$$(8)

This implies that \(\hat{q}_{ij}\) lies in the linear subspace spanned by the \((\varphi (x_{\ell })-\hat{\mu }_{i})\), \(x_{\ell }\in C_{i}\). As a consequence, the rank of the operator \(\hat{{\varSigma }}_{i}\) is finite and at most \(r_{i}=\min (n_{i},r)\). There therefore exist coefficients \(\beta _{ij\ell }\in {\mathbb {R}}\) such that:
$$\begin{aligned} \hat{q}_{ij}=\frac{1}{\sqrt{n_{i}\hat{\lambda }_{ij}}}\sum _{x_{\ell }\in C_{i}}\beta _{ij\ell }(\varphi (x_{\ell })-\hat{\mu }_{i}) \end{aligned}$$(9)

leading to:
$$\begin{aligned} \langle \varphi (x)-\hat{\mu }_{i},\hat{q}_{ij}\rangle _{L_{2}}=\frac{1}{\sqrt{n_{i}\hat{\lambda }_{ij}}}\sum _{x_{\ell }\in C_{i}}\beta _{ij\ell }\rho _{i}(x,x_{\ell }), \end{aligned}$$(10)

for all \(j=1,\ldots ,r_{i}\). The estimated classification function therefore has the following form:
$$\begin{aligned} \hat{D}_{i}(\varphi (x))&=\frac{1}{n_{i}}\sum _{j=1}^{d_{i}}\frac{1}{\hat{\lambda }_{ij}}\left( \frac{1}{\hat{\lambda }_{ij}}-\frac{1}{\hat{\lambda }}\right) \left( \sum _{x_{\ell }\in C_{i}}\beta _{ij\ell }\rho _{i}(x,x_{\ell })\right) ^{2}\\&\qquad +\frac{1}{\hat{\lambda }}\rho _{i}(x,x)\\&\qquad +\sum _{j=1}^{d_{i}}\log (\hat{\lambda }_{ij})+(d_{\max }-d_{i})\log (\hat{\lambda })-2\log (\hat{\pi }_{i}), \end{aligned}$$

for all \(i=1,\ldots ,k\).
(ii)
Computation of the \(\beta _{ij\ell }\) and \(\hat{\lambda }_{ij}\): Substituting (9) into the Fredholm-type equation (8), it follows that

$$\begin{aligned}&\frac{1}{n_{i}}\sum _{x_{\ell },x_{\ell '}\in C_{i}}\beta _{ij\ell '}(\varphi (x_{\ell })-\hat{\mu }_{i})\rho _{i}(x_{\ell },x_{\ell '})\\&\qquad \qquad =\hat{\lambda }_{ij}\sum _{x_{\ell '}\in C_{i}}\beta _{ij\ell '}(\varphi (x_{\ell '})-\hat{\mu }_{i}). \end{aligned}$$

Finally, projecting this equation on \(\varphi (x_{m})-\hat{\mu }_{i}\) for \(x_{m}\in C_{i}\) yields

$$\begin{aligned}&\frac{1}{n_{i}}\sum _{x_{\ell },x_{\ell '}\in C_{i}}\beta _{ij\ell '}\rho _{i}(x_{\ell },x_{m})\rho _{i}(x_{\ell },x_{\ell '})\\&\qquad \qquad =\hat{\lambda }_{ij}\sum _{x_{\ell '}\in C_{i}}\beta _{ij\ell '}\rho _{i}(x_{m},x_{\ell '}). \end{aligned}$$

Recalling that \(M_{i}\) is the \(n_{i}\times n_{i}\) matrix defined by \((M_{i})_{\ell ,\ell '}=\rho _{i}(x_{\ell },x_{\ell '})/n_{i}\) and introducing the vector \(\beta _{ij}\in {\mathbb {R}}^{n_{i}}\) defined by \((\beta _{ij})_{\ell }=\beta _{ij\ell }\), the above equation can be rewritten as \(M_{i}^{2}\beta _{ij}=\hat{\lambda }_{ij}M_{i}\beta _{ij}\) or, after simplification, \(M_{i}\beta _{ij}=\hat{\lambda }_{ij}\beta _{ij}\). As a consequence, \(\hat{\lambda }_{ij}\) is the \(j\)th largest eigenvalue of \(M_{i}\) and \(\beta _{ij}\) is the associated eigenvector, for all \(1\le j\le d_{i}\). Let us note that the constraint \(\Vert \hat{q}_{ij}\Vert =1\) can be rewritten as \(\beta _{ij}^{t}\beta _{ij}=1\).
(iii)
Computation of \(\hat{\lambda }\): Remarking that \({\mathrm {trace}}(\hat{{\varSigma }}_{i})={\mathrm {trace}}(M_{i})+\sum _{j=r_{i}+1}^{r}\hat{\lambda }_{ij}\), it follows that:
$$\begin{aligned} \hat{\lambda }=\frac{1}{\sum _{i=1}^{k}\hat{\pi }_{i}(r_{i}-d_{i})}\sum _{i=1}^{k}\hat{\pi }_{i}\left( {\mathrm {trace}}(M_{i})-\sum _{j=1}^{d_{i}}\hat{\lambda }_{ij}\right) , \end{aligned}$$

and the proposition is proved. \(\square \)
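Proposition 2 suggests a direct numerical recipe: all quantities are computable from the Gram matrix alone. The sketch below illustrates this under stated assumptions (a Gaussian kernel, random data, and the names `class_kernel_matrix` and `estimate_parameters` are illustrative, not the toolbox API): each \(M_{i}\) is the doubly centered Gram matrix divided by \(n_{i}\), the pairs \((\hat{\lambda }_{ij},\beta _{ij})\) are its leading eigenpairs, and \(\hat{\lambda }\) pools the discarded eigenvalues as in step (iii) with \(r_{i}=n_{i}\).

```python
import numpy as np

def class_kernel_matrix(X, kernel):
    """M_i with (M_i)[l, l'] = rho_i(x_l, x_l') / n_i, where rho_i is the
    kernel centered at the empirical class mean in the feature space."""
    n = len(X)
    K = np.array([[kernel(a, b) for b in X] for a in X])
    row = K.mean(axis=1, keepdims=True)
    rho = K - row - row.T + K.mean()           # double centering
    return rho / n

def estimate_parameters(classes, kernel, d):
    """Per-class eigenpairs (lambda_ij, beta_ij) of M_i, plus the pooled
    noise estimate lambda_hat of Proposition 2, step (iii), with r_i = n_i."""
    pis = np.array([len(X) for X in classes], dtype=float)
    pis /= pis.sum()                           # empirical proportions pi_i
    eigs, num, den = [], 0.0, 0.0
    for X, di, pi in zip(classes, d, pis):
        M = class_kernel_matrix(X, kernel)
        vals, vecs = np.linalg.eigh(M)
        vals, vecs = vals[::-1], vecs[:, ::-1]   # decreasing order
        eigs.append((vals[:di], vecs[:, :di]))   # beta_ij^t beta_ij = 1
        num += pi * (np.trace(M) - vals[:di].sum())
        den += pi * (len(X) - di)
    return eigs, num / den

rng = np.random.default_rng(0)
gauss = lambda a, b: np.exp(-np.sum((a - b) ** 2))
classes = [rng.normal(size=(12, 2)), rng.normal(loc=2.0, size=(15, 2))]
eigs, lam = estimate_parameters(classes, gauss, d=[2, 2])
assert lam > 0    # discarded variance of a PSD kernel matrix is non-negative
```

Note that only evaluations of the kernel are needed: the mapping \(\varphi \) is never formed, which is what makes the infinite-dimensional model tractable.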
Proof of Proposition 3
It is sufficient to prove that \(\hat{q}_{ij}\) and \(\hat{\lambda }_{ij}\) are respectively the \(j\)th normed eigenvector and eigenvalue of \(\hat{{\varSigma }}_{i}\). First,
and remarking that \(\beta _{ij}\) is an eigenvector of \(M_{i}\), it follows that:
Second, straightforward algebra shows that
and the result is proved. \(\square \)
Proof of Proposition 4
For all \(\ell =1,\ldots ,L\), the \(\ell \)th coordinate of the mapping function \(\varphi (x)\) is defined as the \(\ell \)th coordinate of the function \(x\) expressed in the truncated basis \(\{b_{1},\ldots ,b_{L}\}\). More specifically,
for all \(t\in [0,1]\) and thus, for all \(j=1,\ldots ,L\), we have
As a consequence, \(\varphi (x)=B^{-1}\gamma (x)\) and \(K(x,y)=\gamma (x)^{t}B^{-1}\gamma (y)\). Introducing
it follows that \(\rho _{i}(x,y)=(\gamma (x)-\bar{\gamma }_{i})^{t}B^{-1}(\gamma (y)-\bar{\gamma }_{i})\). Let us first show that \(\hat{q}_{ij}\) is an eigenvector of \(B^{-1}\hat{{\varSigma }}_{i}\). Recalling that
we have
Remarking that \(\beta _{ij}\) is an eigenvector of \(M_{i}\), it follows that:
Let us finally compute the norm of \(\hat{q}_{ij}\):
and the result is proved. \(\square \)
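The identities \(\varphi (x)=B^{-1}\gamma (x)\) and \(K(x,y)=\gamma (x)^{t}B^{-1}\gamma (y)\) used in this proof can be checked numerically for functions lying in the span of the basis. The sketch below is illustrative only: the monomial basis on \([0,1]\) and the trapezoidal quadrature are assumptions, not choices made in the paper.

```python
import numpy as np

L = 4
t = np.linspace(0.0, 1.0, 2001)
dt = t[1] - t[0]
basis = np.vstack([t ** l for l in range(L)])   # b_l(t) = t^(l-1) on [0, 1]

def inner(f, g):
    """Trapezoidal approximation of the L2[0, 1] scalar product."""
    h = f * g
    return float((h[:-1] + h[1:]).sum() * dt / 2)

# Gram matrix of the truncated basis: B[j, l] = <b_j, b_l>_{L2}
B = np.array([[inner(basis[j], basis[l]) for l in range(L)] for j in range(L)])

# two functions lying in the span of the basis, with known coordinates
cx = np.array([1.0, -2.0, 0.5, 0.0])
cy = np.array([0.0, 1.0, 1.0, -1.0])
x, y = cx @ basis, cy @ basis

# gamma_l(x) = <x, b_l>_{L2}; phi(x) = B^{-1} gamma(x) recovers the coordinates
gamma = lambda f: np.array([inner(f, basis[l]) for l in range(L)])
phi_x = np.linalg.solve(B, gamma(x))
assert np.allclose(phi_x, cx, atol=1e-2)        # up to quadrature error

# K(x, y) = gamma(x)^t B^{-1} gamma(y) equals the L2 scalar product <x, y>
K_xy = gamma(x) @ np.linalg.solve(B, gamma(y))
assert np.isclose(K_xy, inner(x, y), atol=1e-3)
```

The tolerances only absorb quadrature error; with an orthonormal basis \(B\) would be the identity and \(\varphi (x)=\gamma (x)\).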
Proof of Proposition 5
Let us first set the estimate of \(\mu _{i}\) at iteration \(s\) to its empirical counterpart, conditionally on the current value of the model parameter \(\theta ^{(s-1)}\):
where \(n_{i}^{(s-1)}=\sum _{j=1}^{n}t_{ij}^{(s-1)}\). Replacing \(\hat{\mu }_{i}\) by \(\hat{\mu }_{i}^{(s)}\) in the proof of Proposition 2, we get:
According to Bayes’ rule and Eq. (1), the computation of \(t_{ij}^{(s)}={\mathbb {E}}(Z_{j}=i|x_{j},\theta ^{(s-1)})\) conditionally on the current value of the model parameter \(\theta ^{(s-1)}\) can finally be written as:
for \(j=1,\ldots ,n\) and \(i=1,\ldots ,k\). The result follows. \(\square \)
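Since \(\hat{D}_{i}(\varphi (x))=-2\log (\hat{\pi }_{i}f_{i}(x))\) up to an additive constant (see the classification function in Proposition 2), this E-step reduces to a softmax over the halved, negated costs. A numerically stable sketch, where `D` holds hypothetical cost values and `e_step` is an illustrative name:

```python
import numpy as np

def e_step(D):
    """Posterior probabilities t_ij from the cost functions: since
    D_i(phi(x_j)) = -2 log(pi_i f_i(x_j)) + cst, Bayes' rule gives
    t_ij = 1 / sum_l exp((D[i, j] - D[l, j]) / 2)."""
    logit = -0.5 * D                                  # proportional to log(pi_i f_i)
    logit = logit - logit.max(axis=0, keepdims=True)  # stabilize the exponentials
    w = np.exp(logit)
    return w / w.sum(axis=0, keepdims=True)

# hypothetical cost values for k = 2 classes and n = 2 observations
D = np.array([[2.0, 10.0],
              [4.0, 1.0]])
T = e_step(D)
assert np.allclose(T.sum(axis=0), 1.0)   # each column is a posterior over classes
```

Subtracting the columnwise maximum before exponentiating leaves the ratios unchanged but avoids overflow when the costs are large.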
Bouveyron, C., Fauvel, M. & Girard, S. Kernel discriminant analysis and clustering with parsimonious Gaussian process models. Stat Comput 25, 1143–1162 (2015). https://doi.org/10.1007/s11222-014-9505-x