
Kernel discriminant analysis and clustering with parsimonious Gaussian process models

Published in Statistics and Computing 25, 1143–1162 (2015)

Abstract

This work presents a family of parsimonious Gaussian process models which allow one to build, from a finite sample, a model-based classifier in an infinite-dimensional space. The proposed parsimonious models are obtained by constraining the eigen-decomposition of the Gaussian processes modeling each class. In particular, this makes it possible to use non-linear mapping functions which project the observations into infinite-dimensional spaces. It is also demonstrated that the classifier can be built directly from the observation space through a kernel function. The proposed classification method is thus able to classify data of various types, such as categorical data, functional data or networks. Furthermore, mixed data can be classified by combining different kernels. The methodology is also extended to the unsupervised classification case, and an EM algorithm is derived for the inference. Experimental results on various data sets demonstrate the effectiveness of the proposed method. A Matlab toolbox implementing the proposed classification methods is provided as supplementary material.


Notes

  1. http://archive.ics.uci.edu/ml/.

References

  • Akaike, H.: A new look at the statistical model identification. IEEE Trans. Autom. Control 19(6), 716–723 (1974)

  • Andrews, J.L., McNicholas, P.D.: Model-based clustering, classification, and discriminant analysis via mixtures of multivariate t-distributions. Stat. Comput. 22(5), 1021–1029 (2012)

  • Biernacki, C., Celeux, G., Govaert, G.: Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Trans. Pattern Anal. Mach. Intell. 22(7), 719–725 (2001)

  • Bouguila, N., Ziou, D., Vaillancourt, J.: Novel mixtures based on the Dirichlet distribution: application to data and image classification. In: Machine Learning and Data Mining in Pattern Recognition, pp. 172–181. Springer, Berlin (2003)

  • Bouveyron, C., Brunet, C.: Simultaneous model-based clustering and visualization in the Fisher discriminative subspace. Stat. Comput. 22(1), 301–324 (2012)

  • Bouveyron, C., Brunet-Saumard, C.: Model-based clustering of high-dimensional data: a review. Comput. Stat. Data Anal. 71, 52–78 (2013)

  • Bouveyron, C., Girard, S.: Robust supervised classification with mixture models: learning from data with uncertain labels. Pattern Recognit. 42(11), 2649–2658 (2009)

  • Bouveyron, C., Jacques, J.: Model-based clustering of time series in group-specific functional subspaces. Adv. Data Anal. Classif. 5(4), 281–300 (2011)

  • Bouveyron, C., Girard, S., Schmid, C.: High-dimensional discriminant analysis. Commun. Stat. 36, 2607–2623 (2007a)

  • Bouveyron, C., Girard, S., Schmid, C.: High-dimensional data clustering. Comput. Stat. Data Anal. 52, 502–519 (2007b)

  • Canu, S., Grandvalet, Y., Guigue, V., Rakotomamonjy, A.: SVM and kernel methods Matlab toolbox. In: Perception Systèmes et Information. INSA de Rouen, Rouen (2005)

  • Caponnetto, A., Micchelli, C.A., Pontil, M., Ying, Y.: Universal multi-task kernels. J. Mach. Learn. Res. 68, 1615–1646 (2008)

  • Cattell, R.: The scree test for the number of factors. Multivar. Behav. Res. 1(2), 245–276 (1966)

  • Celeux, G., Govaert, G.: Clustering criteria for discrete data and latent class models. J. Classif. 8(2), 157–176 (1991)

  • Chapelle, O., Schölkopf, B., Zien, A. (eds.): Semi-Supervised Learning. MIT Press, Cambridge. http://www.kyb.tuebingen.mpg.de/ssl-book (2006)

  • Couto, J.: Kernel k-means for categorical data. In: Advances in Intelligent Data Analysis VI, vol. 3646 of Lecture Notes in Computer Science, pp. 739–739. Springer, Berlin (2005)

  • Cuturi, M., Vert, J.P.: The context-tree kernel for strings. Neural Netw. 18(8), 1111–1123 (2005)

  • Dempster, A., Laird, N., Rubin, D.: Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B 39(1), 1–38 (1977)

  • Dundar, M.M., Landgrebe, D.A.: Toward an optimal supervised classifier for the analysis of hyperspectral data. IEEE Trans. Geosci. Remote Sens. 42(1), 271–277 (2004)

  • Evgeniou, T., Micchelli, C.A., Pontil, M.: Learning multiple tasks with kernel methods. J. Mach. Learn. Res. 6, 615–637 (2005)

  • Fisher, R.A.: The use of multiple measurements in taxonomic problems. Ann. Eugenics 7, 179–188 (1936)

  • Forbes, F., Wraith, D.: A new family of multivariate heavy-tailed distributions with variable marginal amounts of tailweight: application to robust clustering. Stat. Comput. (to appear) (2014)

  • Franczak, B.C., Browne, R.P., McNicholas, P.D.: Mixtures of shifted asymmetric Laplace distributions. IEEE Trans. Pattern Anal. Mach. Intell. 36(6), 1149–1157 (2014)

  • Girolami, M.: Mercer kernel-based clustering in feature space. IEEE Trans. Neural Netw. 13(3), 780–784 (2002)

  • Gönen, M., Alpaydin, E.: Multiple kernel learning algorithms. J. Mach. Learn. Res. 12, 2211–2268 (2011)

  • Hofmann, T., Schölkopf, B., Smola, A.: Kernel methods in machine learning. Ann. Stat. 36(3), 1171–1220 (2008)

  • Kadri, H., Rakotomamonjy, A., Bach, F., Preux, P.: Multiple operator-valued kernel learning. In: Neural Information Processing Systems (NIPS), pp. 1172–1080 (2012)

  • Kuss, M., Rasmussen, C.: Assessing approximate inference for binary Gaussian process classification. J. Mach. Learn. Res. 6, 1679–1704 (2005)

  • Lee, S., McLachlan, G.J.: Finite mixtures of multivariate skew t-distributions: some recent and new results. Stat. Comput. 24(2), 181–202 (2013)

  • Lehoucq, R., Sorensen, D.: Deflation techniques for an implicitly restarted Arnoldi iteration. SIAM J. Matrix Anal. Appl. 17(4), 789–821 (1996)

  • Lin, T.I.: Robust mixture modeling using multivariate skew t distribution. Stat. Comput. 20, 343–356 (2010)

  • Lin, T.I., Lee, J.C., Hsieh, W.J.: Robust mixture modeling using the skew t distribution. Stat. Comput. 17, 81–92 (2007)

  • Mahé, P., Vert, J.P.: Graph kernels based on tree patterns for molecules. Mach. Learn. 75(1), 3–35 (2009)

  • McLachlan, G.: Discriminant Analysis and Statistical Pattern Recognition. Wiley, New York (1992)

  • McLachlan, G., Peel, D., Bean, R.: Modelling high-dimensional data by mixtures of factor analyzers. Comput. Stat. Data Anal. 41, 379–388 (2003)

  • McNicholas, P., Murphy, B.: Parsimonious Gaussian mixture models. Stat. Comput. 18(3), 285–296 (2008)

  • Mika, S., Rätsch, G., Weston, J., Schölkopf, B., Müllers, K.R.: Fisher discriminant analysis with kernels. In: Neural Networks for Signal Processing (NIPS), pp. 41–48 (1999)

  • Minka, T.: Expectation propagation for approximate Bayesian inference. In: Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, pp. 362–369. Morgan Kaufmann, San Francisco (2001)

  • Montanari, A., Viroli, C.: Heteroscedastic factor mixture analysis. Stat. Model. 10(4), 441–460 (2010)

  • Murphy, T.B., Dean, N., Raftery, A.E.: Variable selection and updating in model-based discriminant analysis for high dimensional data with food authenticity applications. Ann. Appl. Stat. 4(1), 219–223 (2010)

  • Murua, A., Wicker, N.: Kernel-based Mixture Models for Classification. Technical Report, University of Montréal (2014)

  • Pekalska, E., Haasdonk, B.: Kernel discriminant analysis for positive definite and indefinite kernels. IEEE Trans. Pattern Anal. Mach. Intell. 31(6), 1017–1032 (2009)

  • Ramsay, J.O., Silverman, B.W.: Functional Data Analysis. Springer Series in Statistics, 2nd edn. Springer, New York (2005)

  • Rasmussen, C., Williams, C.: Gaussian Processes for Machine Learning Matlab Toolbox. MIT Press, Cambridge (2006a)

  • Rasmussen, C., Williams, C.: Gaussian Processes for Machine Learning. MIT Press, Cambridge (2006b)

  • Schölkopf, B., Smola, A.: Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge (2001)

  • Schölkopf, B., Smola, A., Müller, K.-R.: Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput. 10(5), 1299–1319 (1998)

  • Schölkopf, B., Tsuda, K., Vert, J.-P. (eds.): Kernel Methods in Computational Biology. MIT Press, Cambridge (2004)

  • Schwarz, G.: Estimating the dimension of a model. Ann. Stat. 6, 461–464 (1978)

  • Shawe-Taylor, J., Cristianini, N.: Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge (2004)

  • Shorack, G.R., Wellner, J.A.: Empirical Processes with Applications to Statistics. Wiley, New York (1986)

  • Smola, A., Kondor, R.: Kernels and regularization on graphs. In: Proceedings of Conference on Learning Theory and Kernel Machines, pp. 144–158 (2003)

  • Wang, J., Lee, J., Zhang, C.: Kernel trick embedded Gaussian mixture model. In: Proceedings of the 14th International Conference on Algorithmic Learning Theory, pp. 159–174 (2003)

  • Xu, Z., Huang, K., Zhu, J., King, I., Lyu, M.R.: A novel kernel-based maximum a posteriori classification method. Neural Netw. 22, 977–987 (2009)


Acknowledgments

The authors would like to thank the associate editor and the referee for their helpful remarks and comments on the manuscript.

Author information

Corresponding author

Correspondence to C. Bouveyron.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (tar 80 KB)

Appendix: Proofs

Proof of Proposition 1

Recalling that \(d_{\max }=\max (d_{1},\ldots , d_{k})\), the classification function can be rewritten as:

$$\begin{aligned} D_{i}(\varphi (x))&= \sum _{j=1}^{r}\frac{1}{\lambda _{ij}}\langle \varphi (x)-\mu _{i},q_{ij}\rangle _{L_{2}}^{2}\\&\quad +\sum _{j=1}^{d_{i}}\log (\lambda _{ij})+\sum _{j=d_{i}+1}^{d_{\max }}\log (\lambda )-2\log (\pi _{i})+\gamma , \end{aligned}$$

where \(\gamma =(r-d_{\max })\log (\lambda )\) is a constant term which does not depend on the index \(i\) of the class. In view of the assumptions, \(D_{i}(\varphi (x))\) can also be rewritten as:

$$\begin{aligned} D_{i}(\varphi (x))&= \sum _{j=1}^{d_{i}}\frac{1}{\lambda _{ij}}\langle \varphi (x)-\mu _{i},q_{ij}\rangle _{L_{2}}^{2}\\&+\frac{1}{\lambda }\sum _{j=d_{i}+1}^{r}\langle \varphi (x)-\mu _{i},q_{ij}\rangle _{L_{2}}^{2}\\&+ \sum _{j=1}^{d_{i}}\log (\lambda _{ij})+(d_{\max }-d_{i})\log (\lambda )\\&-2\log (\pi _{i})+\gamma . \end{aligned}$$

Introducing the norm \(||.||_{L_{2}}\) associated with the scalar product \(\langle .,.\rangle _{L_{2}}\), and in view of Proposition 1 of Shorack and Wellner (1986, p. 208), we finally obtain:

$$\begin{aligned} D_{i}(\varphi (x))&= \sum _{j=1}^{d_{i}}\left( \frac{1}{\lambda _{ij}}-\frac{1}{\lambda }\right) \langle \varphi (x)-\mu _{i},q_{ij}\rangle _{L_{2}}^{2}\\&+\frac{1}{\lambda }||\varphi (x)-\mu _{i}||_{L_{2}}^{2}\\&+ \sum _{j=1}^{d_{i}}\log (\lambda _{ij})+(d_{\max }-d_{i})\log (\lambda )\\&-2\log (\pi _{i})+\gamma , \end{aligned}$$

which is the desired result. \(\square \)
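
To make the result concrete, the following minimal NumPy sketch evaluates the classification function of Proposition 1 in a finite-dimensional toy setting (the infinite-dimensional case only changes the inner product). It is an illustration under our own naming conventions, not the authors' Matlab toolbox, and it omits the class-independent constant \(\gamma \); a new observation is assigned to the class with the smallest cost.

```python
import numpy as np

def cost_D(x, mu_i, Q_i, lambdas_i, lam, d_max, pi_i):
    """Evaluate D_i(x) of Proposition 1 (up to the class-independent constant gamma).

    Q_i       : (p, d_i) matrix whose columns are the leading eigenvectors q_ij
    lambdas_i : (d_i,) leading eigenvalues lambda_i1 >= ... >= lambda_i d_i
    lam       : common noise eigenvalue lambda
    """
    d_i = lambdas_i.size
    diff = x - mu_i
    proj = Q_i.T @ diff                                    # <x - mu_i, q_ij>, j = 1..d_i
    out = np.sum((1.0 / lambdas_i - 1.0 / lam) * proj**2)  # signal subspace term
    out += diff @ diff / lam                               # ||x - mu_i||^2 / lambda
    out += np.sum(np.log(lambdas_i)) + (d_max - d_i) * np.log(lam)
    return out - 2.0 * np.log(pi_i)

# toy usage: one class in R^5 with a 2-dimensional signal subspace
rng = np.random.default_rng(0)
Q_i, _ = np.linalg.qr(rng.normal(size=(5, 2)))             # orthonormal q_i1, q_i2
x = rng.normal(size=5)
print(cost_D(x, np.zeros(5), Q_i, np.array([3.0, 2.0]), lam=0.5, d_max=2, pi_i=0.5))
```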

Proof of Proposition 2

The proof involves three steps.

  1. (i)

    Computation of the projection \(\langle \varphi (x)-\hat{\mu }_{i},\hat{q}_{ij}\rangle _{L_{2}}\): Since \((\hat{\lambda }_{ij},\hat{q}_{ij})\) is a solution of the Fredholm-type equation, it follows that, for all \(t\in J\),

    $$\begin{aligned} \hat{\lambda }_{ij}\hat{q}_{ij}(t)&= \int _{J}\hat{{\varSigma }}_{i}(s,t)\hat{q}_{ij}(s)\,ds\\&= \frac{1}{n_{i}}\sum _{x_{\ell }\in C_{i}}\langle \varphi (x_{\ell })-\hat{\mu }_{i},\hat{q}_{ij}\rangle _{L_{2}}\left( \varphi (x_{\ell })(t)-\hat{\mu }_{i}(t)\right) . \end{aligned}$$
    (8)

    This implies that \(\hat{q}_{ij}\) lies in the linear subspace spanned by the \((\varphi (x_{\ell })-\hat{\mu }_{i}),\,x_{\ell }\in C_{i}\). As a consequence, the rank of the operator \(\hat{{\varSigma }}_{i}\) is finite and is at most \(r_{i}=\min (n_{i},r)\). There therefore exist coefficients \(\beta _{ij\ell }\in {\mathbb {R}}\) such that:

    $$\begin{aligned} \hat{q}_{ij}=\frac{1}{\sqrt{n_{i}\hat{\lambda }_{ij}}}\sum _{x_{\ell }\in C_{i}}\beta _{ij\ell }(\varphi (x_{\ell })-\hat{\mu }_{i}) \end{aligned}$$
    (9)

    leading to:

    $$\begin{aligned} \langle \varphi (x)-\hat{\mu }_{i},\hat{q}_{ij}\rangle _{L_{2}}=\frac{1}{\sqrt{n_{i}\hat{\lambda }_{ij}}}\sum _{x_{\ell }\in C_{i}}\beta _{ij\ell }\rho _{i}(x,x_{\ell }), \end{aligned}$$
    (10)

    for all \(j=1,\ldots ,r_{i}\). The estimated classification function therefore takes the following form:

    $$\begin{aligned} \hat{D}_{i}(\varphi (x))&=\frac{1}{n_{i}}\sum _{j=1}^{d_{i}}\frac{1}{\hat{\lambda }_{ij}}\left( \frac{1}{\hat{\lambda }_{ij}}-\frac{1}{\hat{\lambda }}\right) \left( \sum _{x_{\ell }\in C_{i}}\beta _{ij\ell }\rho _{i}(x,x_{\ell })\right) ^{2}\\&\qquad +\frac{1}{\hat{\lambda }}\rho _{i}(x,x)\\&\qquad +\sum _{j=1}^{d_{i}}\log (\hat{\lambda }_{ij})+(d_{\max }-d_{i})\log (\hat{\lambda })-2\log (\hat{\pi }_{i}), \end{aligned}$$

    for all \(i=1,\ldots ,k\).

  2. (ii)

    Computation of the \(\beta _{ij\ell }\) and \(\hat{\lambda }_{ij}\): Substituting (9) into the Fredholm-type equation (8), it follows that

    $$\begin{aligned}&\frac{1}{n_{i}}\sum _{x_{\ell },x_{\ell '}\in C_{i}}\beta _{ij\ell '}(\varphi (x_{\ell })-\hat{\mu }_{i})\rho _{i}(x_{\ell },x_{\ell '})\\&\qquad \qquad =\hat{\lambda }_{ij}\sum _{x_{\ell '}\in C_{i}}\beta _{ij\ell '}(\varphi (x_{\ell '})-\hat{\mu }_{i}). \end{aligned}$$

    Finally, projecting this equation on \(\varphi (x_{m})-\hat{\mu }_{i}\) for \(x_{m}\in C_{i}\) yields

    $$\begin{aligned}&\frac{1}{n_{i}}\sum _{x_{\ell },x_{\ell '}\in C_{i}}\beta _{ij\ell '}\rho _{i}(x_{\ell },x_{m})\rho _{i}(x_{\ell },x_{\ell '})\\&\qquad \qquad =\hat{\lambda }_{ij}\sum _{x_{\ell '}\in C_{i}}\beta _{ij\ell '}\rho _{i}(x_{m},x_{\ell '}). \end{aligned}$$

    Recalling that \(M_{i}\) is the \(n_{i}\times n_{i}\) matrix defined by \((M_{i})_{\ell ,\ell '}=\rho _{i}(x_{\ell },x_{\ell '})/n_{i}\) and introducing the vector \(\beta _{ij}\in {\mathbb {R}}^{n_{i}}\) defined by \((\beta _{ij})_{\ell }=\beta _{ij\ell }\), the above equation can be rewritten as \(M_{i}^{2}\beta _{ij}=\hat{\lambda }_{ij}M_{i}\beta _{ij}\) or, after simplification, \(M_{i}\beta _{ij}=\hat{\lambda }_{ij}\beta _{ij}\). As a consequence, \(\hat{\lambda }_{ij}\) is the \(j\)th largest eigenvalue of \(M_{i}\) and \(\beta _{ij}\) is the associated eigenvector, for all \(1\le j\le d_{i}\). Let us note that the constraint \(\Vert \hat{q}_{ij}\Vert =1\) can be rewritten as \(\beta _{ij}^{t}\beta _{ij}=1\) (steps (i)–(iii) are transcribed numerically in the sketch following this proof).

  3. (iii)

    Computation of \(\hat{\lambda }\): Remarking that \({\mathrm {trace}}(\hat{{\varSigma }}_{i})={\mathrm {trace}}(M_{i})+\sum _{j=r_{i}+1}^{r}\hat{\lambda }_{ij}\), it follows:

    $$\begin{aligned} \hat{\lambda }=\frac{1}{\sum _{i=1}^{k}\hat{\pi }_{i}(r_{i}-d_{i})}\sum _{i=1}^{k}\hat{\pi }_{i}\left( {\mathrm {trace}}(M_{i})-\sum _{j=1}^{d_{i}}\hat{\lambda }_{ij}\right) , \end{aligned}$$

    and the proposition is proved. \(\square \)
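
The following minimal NumPy sketch transcribes steps (i)–(iii): the eigen-elements \((\hat{\lambda }_{ij},\beta _{ij})\) are obtained from the matrix \(M_{i}\), and the noise term \(\hat{\lambda }\) is pooled across classes. It is an illustration with our own function and variable names (it assumes the centered kernel values \(\rho _{i}(x_{\ell },x_{\ell '})\) are already available as a matrix), not the authors' toolbox.

```python
import numpy as np

def eigen_elements(rho_i, d_i):
    """Steps (i)-(ii): leading eigen-elements of M_i = rho_i / n_i.

    rho_i : (n_i, n_i) matrix of centered kernel values rho_i(x_l, x_l')
    d_i   : number of leading eigenvalues retained for class i
    """
    n_i = rho_i.shape[0]
    M_i = rho_i / n_i
    evals, evecs = np.linalg.eigh(M_i)            # eigenvalues in ascending order
    lambdas_i = evals[::-1][:d_i]                 # d_i largest eigenvalues lambda_ij
    betas_i = evecs[:, ::-1][:, :d_i]             # columns beta_ij, with beta_ij^t beta_ij = 1
    residual_i = np.trace(M_i) - lambdas_i.sum()  # trace(M_i) - sum_{j<=d_i} lambda_ij
    return lambdas_i, betas_i, residual_i

def pooled_noise(residuals, pis, r_is, d_is):
    """Step (iii): pooled noise estimate lambda_hat over the k classes."""
    num = sum(pi * res for pi, res in zip(pis, residuals))
    den = sum(pi * (r_i - d_i) for pi, r_i, d_i in zip(pis, r_is, d_is))
    return num / den
```

The estimated classification function \(\hat{D}_{i}\) of step (i) is then evaluated by plugging \(\hat{\lambda }_{ij}\), \(\beta _{ij\ell }\) and the kernel values \(\rho _{i}(x,x_{\ell })\) into the display above.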

Proof of Proposition 3

It is sufficient to prove that \(\hat{q}_{ij}\) and \(\hat{\lambda }_{ij}\) are, respectively, the \(j\)th normalized eigenvector and eigenvalue of \(\hat{{\varSigma }}_{i}\). First,

$$\begin{aligned} \hat{{\varSigma }}_{i}\hat{q}_{ij}&=\frac{1}{\sqrt{n_{i}\hat{\lambda }_{ij}}}\frac{1}{n_{i}}\sum _{x_{\ell '}\in C_{i}}(x_{\ell '}-\bar{\mu }_{i})(x_{\ell '}-\bar{\mu }_{i})^{t}\sum _{x_{\ell }\in C_{i}}\beta _{ij\ell }(x_{\ell }-\bar{\mu }_{i})\\&=\frac{1}{\sqrt{n_{i}\hat{\lambda }_{ij}}}\frac{1}{n_{i}}\sum _{x_{\ell '},x_{\ell }\in C_{i}}\beta _{ij\ell }(x_{\ell '}-\bar{\mu }_{i})(x_{\ell '}-\bar{\mu }_{i})^{t}(x_{\ell }-\bar{\mu }_{i})\\&=\frac{1}{\sqrt{n_{i}\hat{\lambda }_{ij}}}\frac{1}{n_{i}}\sum _{x_{\ell '},x_{\ell }\in C_{i}}\beta _{ij\ell }(x_{\ell '}-\bar{\mu }_{i})\rho _{i}(x_{\ell },x_{\ell '})\\&=\frac{1}{\sqrt{n_{i}\hat{\lambda }_{ij}}}\sum _{x_{\ell '},x_{\ell }\in C_{i}}(M_{i})_{\ell ,\ell '}\beta _{ij\ell }(x_{\ell '}-\bar{\mu }_{i})\\&=\frac{1}{\sqrt{n_{i}\hat{\lambda }_{ij}}}\sum _{x_{\ell '}\in C_{i}}(M_{i}\beta _{ij})_{\ell '}(x_{\ell '}-\bar{\mu }_{i}), \end{aligned}$$

and remarking that \(\beta _{ij}\) is an eigenvector of \(M_{i}\), it follows:

$$\begin{aligned} \hat{{\varSigma }}_{i}\hat{q}_{ij}=\hat{\lambda }_{ij}\frac{1}{\sqrt{n_{i}\hat{\lambda }_{ij}}}\sum _{x_{\ell '}\in C_{i}}\beta _{ij\ell '}(x_{\ell '}-\bar{\mu }_{i})=\hat{\lambda }_{ij}\hat{q}_{ij}. \end{aligned}$$

Second, straightforward algebra shows that

$$\begin{aligned} ||\hat{q}_{ij}||^{2}&=\frac{1}{n_{i}\hat{\lambda }_{ij}}\sum _{x_{\ell }\in C_{i}}\beta _{ij\ell }(x_{\ell }-\bar{\mu }_{i})^{t}\sum _{x_{\ell '}\in C_{i}}\beta _{ij\ell '}(x_{\ell '}-\bar{\mu }_{i})\\&=\frac{1}{n_{i}\hat{\lambda }_{ij}}\sum _{x_{\ell '},x_{\ell }\in C_{i}}\beta _{ij\ell }\beta _{ij\ell '}(x_{\ell }-\bar{\mu }_{i})^{t}(x_{\ell '}-\bar{\mu }_{i})\\&=\frac{1}{\hat{\lambda }_{ij}}\sum _{x_{\ell '},x_{\ell }\in C_{i}}(M_{i})_{\ell ,\ell '}\beta _{ij\ell }\beta _{ij\ell '}\\&=\frac{1}{\hat{\lambda }_{ij}}\beta _{ij}^{t}M_{i}\beta _{ij}=1, \end{aligned}$$

and the result is proved. \(\square \)
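
In the finite-dimensional case with the linear kernel, Proposition 3 can be checked numerically: the vector \(\hat{q}_{ij}\) reconstructed from \(\beta _{ij}\) is a unit-norm eigenvector of \(\hat{{\varSigma }}_{i}\) with eigenvalue \(\hat{\lambda }_{ij}\). The short NumPy sketch below (our notation, purely illustrative) performs this sanity check.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 4))                # n_i = 20 observations of class i in R^4
n_i = X.shape[0]
Xc = X - X.mean(axis=0)                     # centered observations x_l - mu_bar_i

Sigma_i = Xc.T @ Xc / n_i                   # sample covariance (4 x 4)
M_i = Xc @ Xc.T / n_i                       # (M_i)_{l,l'} = rho_i(x_l, x_l') / n_i

eval_M, evec_M = np.linalg.eigh(M_i)
lam_ij, beta_ij = eval_M[-1], evec_M[:, -1] # leading eigen-pair of M_i

# q_hat_ij = (n_i * lambda_ij)^{-1/2} * sum_l beta_ijl (x_l - mu_bar_i)
q_ij = Xc.T @ beta_ij / np.sqrt(n_i * lam_ij)

assert np.allclose(Sigma_i @ q_ij, lam_ij * q_ij)   # eigen-equation of Proposition 3
assert np.isclose(np.linalg.norm(q_ij), 1.0)        # unit norm
```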

Proof of Proposition 4

For all \(\ell =1,\ldots ,L\), the \(\ell \)th coordinate of the mapping function \(\varphi (x)\) is defined as the \(\ell \)th coordinate of the function \(x\) expressed in the truncated basis \(\{b_{1},\ldots ,b_{L}\}\). More specifically,

$$\begin{aligned} x(t)=\sum _{\ell =1}^{L}\varphi _{\ell }(x)b_{\ell }(t), \end{aligned}$$

for all \(t\in [0,1]\) and thus, for all \(j=1,\ldots ,L\), we have

$$\begin{aligned} \gamma _{j}(x)&= \int _{0}^{1}x(t)b_{j}(t)dt \\&= \sum _{\ell =1}^{L}\varphi _{\ell }(x)\int _{0}^{1}b_{j}(t)b_{\ell }(t)dt=\sum _{\ell =1}^{L}B_{j\ell }\varphi _{\ell }(x). \end{aligned}$$

As a consequence, \(\varphi (x)=B^{-1}\gamma (x)\) and \(K(x,y)=\gamma (x)^{t}B^{-1}\gamma (y)\). Introducing

$$\begin{aligned} \bar{\gamma }_{i}=\frac{1}{n_{i}}\sum _{x_{j}\in C_{i}}\gamma (x_{j}), \end{aligned}$$

it follows that \(\rho _{i}(x,y)=(\gamma (x)-\bar{\gamma }_{i})^{t}B^{-1}(\gamma (y)-\bar{\gamma }_{i})\). Let us first show that \(\hat{q}_{ij}\) is an eigenvector of \(B^{-1}\hat{{\varSigma }}_{i}\). Recalling that

$$\begin{aligned} \hat{q}_{ij}=\frac{1}{\sqrt{n_{i}\hat{\lambda }_{ij}}}B^{-1}\sum _{x_{\ell }\in C_{i}}\beta _{ij\ell }(\gamma (x_{\ell })-\bar{\gamma }_{i}), \end{aligned}$$

we have

$$\begin{aligned} B^{-1}\hat{{\varSigma }}_{i}\hat{q}_{ij}&= \frac{1}{\sqrt{n_{i}\hat{\lambda }_{ij}}}B^{-1}\frac{1}{n_{i}}\sum _{x_{\ell '}\in C_{i}}(\gamma (x_{\ell '})-\bar{\gamma }_{i})(\gamma (x_{\ell '})-\bar{\gamma }_{i})^{t}B^{-1}\sum _{x_{\ell }\in C_{i}}\beta _{ij\ell }(\gamma (x_{\ell })-\bar{\gamma }_{i})\\&= \frac{1}{\sqrt{n_{i}\hat{\lambda }_{ij}}}B^{-1}\frac{1}{n_{i}}\sum _{x_{\ell '},x_{\ell }\in C_{i}}\beta _{ij\ell }(\gamma (x_{\ell '})-\bar{\gamma }_{i})(\gamma (x_{\ell '})-\bar{\gamma }_{i})^{t}B^{-1}(\gamma (x_{\ell })-\bar{\gamma }_{i})\\&= \frac{1}{\sqrt{n_{i}\hat{\lambda }_{ij}}}B^{-1}\frac{1}{n_{i}}\sum _{x_{\ell '},x_{\ell }\in C_{i}}\beta _{ij\ell }(\gamma (x_{\ell '})-\bar{\gamma }_{i})\rho _{i}(x_{\ell },x_{\ell '})\\&= \frac{1}{\sqrt{n_{i}\hat{\lambda }_{ij}}}B^{-1}\sum _{x_{\ell '},x_{\ell }\in C_{i}}(M_{i})_{\ell ,\ell '}\beta _{ij\ell }(\gamma (x_{\ell '})-\bar{\gamma }_{i})\\&= \frac{1}{\sqrt{n_{i}\hat{\lambda }_{ij}}}B^{-1}\sum _{x_{\ell '}\in C_{i}}(M_{i}\beta _{ij})_{\ell '}(\gamma (x_{\ell '})-\bar{\gamma }_{i}). \end{aligned}$$

Remarking that \(\beta _{ij}\) is an eigenvector of \(M_{i}\), it follows:

$$\begin{aligned} B^{-1}\hat{{\varSigma }}_{i}\hat{q}_{ij}&= \hat{\lambda }_{ij}\frac{1}{\sqrt{n_{i}\hat{\lambda }_{ij}}}B^{-1}\sum _{x_{\ell '}\in C_{i}}\beta _{ij\ell '}(\gamma (x_{\ell '})-\bar{\gamma }_{i})\\&= \hat{\lambda }_{ij}\hat{q}_{ij}. \end{aligned}$$

Let us finally compute the norm of \(\hat{q}_{ij}\):

$$\begin{aligned} ||\hat{q}_{ij}||^{2}&=\frac{1}{n_{i}\hat{\lambda }_{ij}}\sum _{x_{\ell }\in C_{i}}\beta _{ij\ell }(\gamma (x_{\ell })-\bar{\gamma }_{i})^{t}B^{-1}\sum _{x_{\ell '}\in C_{i}}\beta _{ij\ell '}(\gamma (x_{\ell '})-\bar{\gamma }_{i})\\&=\frac{1}{n_{i}\hat{\lambda }_{ij}}\sum _{x_{\ell '},x_{\ell }\in C_{i}}\beta _{ij\ell }\beta _{ij\ell '}(\gamma (x_{\ell })-\bar{\gamma }_{i})^{t}B^{-1}(\gamma (x_{\ell '})-\bar{\gamma }_{i})\\&=\frac{1}{\hat{\lambda }_{ij}}\sum _{x_{\ell '},x_{\ell }\in C_{i}}(M_{i})_{\ell ,\ell '}\beta _{ij\ell }\beta _{ij\ell '}\\&=\frac{1}{\hat{\lambda }_{ij}}\beta _{ij}^{t}M_{i}\beta _{ij}=1, \end{aligned}$$

and the result is proved. \(\square \)
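
A minimal numerical illustration of Proposition 4: for a (possibly non-orthonormal) basis with Gram matrix \(B_{j\ell }=\int _{0}^{1}b_{j}(t)b_{\ell }(t)dt\), the kernel \(K(x,y)=\gamma (x)^{t}B^{-1}\gamma (y)\) computed from the projection coefficients \(\gamma \) coincides with the \(L_{2}\) inner product of the two curves. The sketch below assumes a monomial basis of size \(L=4\) for concreteness; the helper names are ours and the code is illustrative only.

```python
import numpy as np

# Monomial basis b_j(t) = t^(j-1), j = 1..L, on [0, 1]:
# B_{jl} = int_0^1 t^(j-1) t^(l-1) dt = 1 / (j + l - 1)   (a Hilbert matrix)
L = 4
j = np.arange(1, L + 1)
B = 1.0 / (j[:, None] + j[None, :] - 1)

def gamma(phi):
    """Projection coefficients gamma_j(x) = int x(t) b_j(t) dt = (B phi)_j,
    for a curve x(t) = sum_l phi_l b_l(t) given by its basis coefficients phi."""
    return B @ phi

def K(phi_x, phi_y):
    """Kernel K(x, y) = gamma(x)^t B^{-1} gamma(y) of Proposition 4."""
    return gamma(phi_x) @ np.linalg.solve(B, gamma(phi_y))

phi_x = np.array([1.0, -0.5, 0.2, 0.0])   # x(t) = 1 - 0.5 t + 0.2 t^2
phi_y = np.array([0.0, 1.0, 0.0, -0.3])   # y(t) = t - 0.3 t^3
# K(x, y) equals the L2 inner product int_0^1 x(t) y(t) dt of the two curves
print(K(phi_x, phi_y), phi_x @ B @ phi_y)
```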

Proof of Proposition 5

Let us first set the estimate of \(\mu _{i}\) at iteration \(s\) to its empirical counterpart, conditionally on the current value of the model parameter \(\theta ^{(s-1)}\):

$$\begin{aligned} \hat{\mu }_{i}^{(s)}(t)=\frac{1}{n_{i}^{(s)}}\sum _{j=1}^{n}t_{ij}^{(s-1)}\varphi (x_{j})(t), \end{aligned}$$

where \(n_{i}^{(s)}=\sum _{j=1}^{n}t_{ij}^{(s-1)}\). Replacing \(\hat{\mu }_{i}\) by \(\hat{\mu }_{i}^{(s)}\) in the proof of Proposition 2, we get:

$$\begin{aligned} \hat{q}_{ij}=\frac{1}{\sqrt{n_{i}\hat{\lambda }_{ij}}}\sum _{x_{\ell }\in C_{i}}\beta _{ij\ell }\sqrt{t_{\ell i}}(\varphi (x_{\ell })-\hat{\mu }_{i}). \end{aligned}$$
(11)

According to Bayes’ rule and Eq. (1), the computation of \(t_{ij}^{(s)}={\mathbb {E}}(Z_{j}=i|x_{j},\theta ^{(s-1)})\) conditionally on the current value of the model parameter \(\theta ^{(s-1)}\) can finally be written as:

$$\begin{aligned} t_{ij}^{(s)}&= {\mathbb {E}}(Z_{j}=i|x_{j},\theta ^{(s-1)})=P(Z_{j}=i|x_{j},\theta ^{(s-1)})\\&= {{\frac{\exp \left( -\frac{1}{2}D_{i}^{(s-1)}\left( \varphi (x_{j})\right) \right) }{\sum _{\ell =1}^{k}\exp \left( -\frac{1}{2}D_{\ell }^{(s-1)}\left( \varphi (x_{j})\right) \right) }},} \end{aligned}$$

for \(j=1,\ldots ,n\) and \(i=1,\ldots ,k\). The result follows. \(\square \)
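
In practice, the E-step of Proposition 5 is a softmax of the quantities \(-D_{i}^{(s-1)}(\varphi (x_{j}))/2\) over the classes. The NumPy sketch below is an illustrative transcription in our own notation; the max-subtraction is a standard numerical-stability device that leaves the ratios unchanged and is not part of the original derivation.

```python
import numpy as np

def e_step(D):
    """Responsibilities t_ij of Proposition 5 from the cost functions.

    D : (k, n) array with D[i, j] = D_i^{(s-1)}(phi(x_j))
    Returns t : (k, n) array with t[i, j] = P(Z_j = i | x_j, theta^{(s-1)}).
    """
    logits = -0.5 * D
    logits -= logits.max(axis=0, keepdims=True)   # stability shift, cancels in the ratio
    w = np.exp(logits)
    return w / w.sum(axis=0, keepdims=True)

# toy usage: 3 classes, 5 observations
rng = np.random.default_rng(2)
t = e_step(rng.gamma(shape=2.0, size=(3, 5)))
print(t.sum(axis=0))   # each column sums to one, as posterior probabilities must
```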


Cite this article

Bouveyron, C., Fauvel, M. & Girard, S. Kernel discriminant analysis and clustering with parsimonious Gaussian process models. Stat Comput 25, 1143–1162 (2015). https://doi.org/10.1007/s11222-014-9505-x
