Machine Learning, Volume 66, Issue 2–3, pp 259–294

Statistical properties of kernel principal component analysis

  • Gilles Blanchard
  • Olivier Bousquet
  • Laurent Zwald

Abstract

The main goal of this paper is to prove inequalities on the reconstruction error for kernel principal component analysis. With respect to previous work on this topic, our contribution is twofold: (1) we give bounds that explicitly take into account the empirical centering step in this algorithm, and (2) we show that a “localized” approach allows one to obtain more accurate bounds. In particular, we show faster rates of convergence towards the minimum reconstruction error; more precisely, we prove that the convergence rate can typically be faster than n^{-1/2}. We also obtain a new relative bound on the error.

A secondary goal, for which we present similar contributions, is to obtain convergence bounds for the partial sums of the largest or smallest eigenvalues of the kernel Gram matrix towards the corresponding sums of eigenvalues of the associated kernel operator. These quantities are naturally linked to the KPCA procedure; furthermore, these results can have applications to the study of various other kernel algorithms.

The results are presented in a functional analytic framework, which is well suited to dealing rigorously with reproducing kernel Hilbert spaces of infinite dimension.

Keywords

Kernel principal components analysis · Fast convergence rates · Kernel spectrum estimation · Covariance operator · Kernel integral operator

Copyright information

© Springer Science + Business Media, LLC 2007

Authors and Affiliations

  • Gilles Blanchard (1)
  • Olivier Bousquet (2)
  • Laurent Zwald (3)

  1. Fraunhofer FIRST (IDA), Berlin, Germany
  2. Pertinence, France
  3. Département de Mathématiques, Université Paris-Sud, France
