Statistical properties of kernel principal component analysis
The main goal of this paper is to prove inequalities on the reconstruction error for kernel principal component analysis (KPCA). With respect to previous work on this topic, our contribution is twofold: (1) we give bounds that explicitly take into account the empirical centering step in this algorithm, and (2) we show that a “localized” approach yields more accurate bounds. In particular, we show faster rates of convergence towards the minimum reconstruction error; more precisely, we prove that the convergence rate can typically be faster than $n^{-1/2}$. We also obtain a new relative bound on the error.
A secondary goal, for which we present similar contributions, is to obtain convergence bounds for the partial sums of the largest or smallest eigenvalues of the kernel Gram matrix towards the eigenvalues of the corresponding kernel operator. These quantities are naturally linked to the KPCA procedure; furthermore, these results can have applications to the study of various other kernel algorithms.
The results are presented in a functional analytic framework, which is suited to deal rigorously with reproducing kernel Hilbert spaces of infinite dimension.
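The link between the empirical reconstruction error and partial sums of Gram matrix eigenvalues can be illustrated with a minimal NumPy sketch. It relies on the standard identity that the empirical reconstruction error of the projection onto the top d KPCA directions equals the sum of the remaining eigenvalues of the centered Gram matrix divided by n. The Gaussian kernel, the synthetic data, and the function names `rbf_gram`, `center_gram`, and `empirical_reconstruction_error` are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def rbf_gram(X, gamma=1.0):
    """Gaussian kernel Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def center_gram(K):
    """Empirical centering in feature space: K_c = H K H with H = I - (1/n) 1 1^T."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def empirical_reconstruction_error(K_c, d):
    """Mean squared residual after projecting onto the top-d KPCA directions,
    computed as the sum of the n - d smallest eigenvalues of K_c / n."""
    n = K_c.shape[0]
    eigvals = np.linalg.eigvalsh(K_c / n)   # returned in ascending order
    eigvals = np.clip(eigvals, 0.0, None)   # remove tiny negative round-off
    return float(np.sum(eigvals[: n - d]))

# Synthetic data (an assumption for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
K_c = center_gram(rbf_gram(X, gamma=0.5))
errors = [empirical_reconstruction_error(K_c, d) for d in range(6)]
```

With d = 0 the error reduces to the trace of the centered Gram matrix divided by n, and it can only decrease as d grows. The bounds discussed in the abstract concern how fast such empirical quantities approach their population counterparts.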
Keywords: Kernel principal components analysis; Fast convergence rates; Kernel spectrum estimation; Covariance operator; Kernel integral operator