Abstract
In the previous chapters, we studied the problem of fitting a low-dimensional linear or affine subspace to a collection of points. In practical applications, however, a linear or affine subspace may not be able to capture nonlinear structures in the data. For instance, consider the set of all images of a face obtained by rotating it about its main axis of symmetry. While all such images live in a high-dimensional space whose dimension is the number of pixels, there is only one degree of freedom in the data, namely the angle of rotation. In fact, the space of all such images is a one-dimensional circle embedded in a high-dimensional space, whose structure is not well captured by a one-dimensional line. More generally, a collection of face images observed from different viewpoints is not well approximated by a single linear or affine subspace, as illustrated in the following example.
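To make the geometry concrete, here is a minimal numpy sketch of the circle example (not the book's face-image data; the dimensions, the random orthonormal embedding, and all names are illustrative assumptions). It samples a circle, embeds it in a 100-dimensional ambient space, and fits the best one-dimensional linear subspace by PCA; the line captures only about half of the variance, confirming that a single linear component cannot represent the circle.

```python
import numpy as np

# One degree of freedom (the rotation angle), high-dimensional ambient space.
N = 200
theta = np.linspace(0, 2 * np.pi, N, endpoint=False)
circle = np.stack([np.cos(theta), np.sin(theta)])       # 2 x N points on a circle

# Embed the circle in D = 100 dimensions with a random orthonormal map,
# mimicking images that live in a high-dimensional pixel space.
rng = np.random.default_rng(0)
D = 100
Q, _ = np.linalg.qr(rng.standard_normal((D, 2)))        # D x 2, orthonormal columns
X = Q @ circle                                          # D x N data matrix

# Best one-dimensional linear fit (PCA with a single principal component).
Xc = X - X.mean(axis=1, keepdims=True)
U, S, _ = np.linalg.svd(Xc, full_matrices=False)
explained = S[0] ** 2 / np.sum(S ** 2)
print(f"variance explained by one component: {explained:.2f}")  # ~0.50
```

Because the circle's covariance has two equal nonzero eigenvalues, the best line can explain at most half of the variance, no matter how it is chosen.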
Notes
- 1. In principle, we should use the notation \(\hat{\Sigma }_{\phi (\boldsymbol{x})}\) to indicate that it is an estimate of the actual covariance matrix. But for simplicity, we will drop the hat in the sequel and simply use \(\Sigma _{\phi (\boldsymbol{x})}\). The same goes for the eigenvectors and the principal components.
- 2. The remaining M − N eigenvectors of \(\Phi \Phi ^{\top }\) are associated with the eigenvalue zero (this fact is checked numerically in the sketch following these notes).
- 3. In PCA, we center the data by subtracting its mean. Here, we first subtract the mean of the embedded data and then compute the kernel, whence the name centered kernel.
- 4. In PCA, if X is the data matrix, then XJ is the centered (mean-subtracted) data matrix.
- 5. “Almost every” means except for a set of measure zero.
- 6.
- 7. Notice that \(A = JX^{\top }XJ\), where \(J = I - \frac{1}{N}\boldsymbol{1}\boldsymbol{1}^{\top }\) is the centering matrix (see the sketch following these notes).
- 8. By scaled low-dimensional representation we mean replacing \(\boldsymbol{y}_{j}\) by \(d_{jj}\boldsymbol{y}_{j}\).
- 9.
- 10. Notice that the above objective is closely related to the MAP-EM algorithm for a mixture of isotropic Gaussians discussed in Appendix B.3.2.
- 11. AT&T Laboratories, Cambridge, http://www.cl.cam.ac.uk/Research/DTG/attarchive/facedatabase.html.
- 12. A graph is connected when there is a path between every pair of vertices.
- 13. This constraint is needed to prevent the trivial solution \(U = \boldsymbol{0}\). Alternatively, we could enforce \(U^{\top }U = \mbox{diag}(\vert \mathcal{G}_{1}\vert, \vert \mathcal{G}_{2}\vert, \ldots, \vert \mathcal{G}_{n}\vert )\). However, this is impossible, because we do not know \(\vert \mathcal{G}_{i}\vert\).
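Notes 2, 4, 7, and 8 state small linear-algebra facts about the centering matrix and the Gram matrix used in kernel PCA and MDS. The following minimal numpy sketch checks them on random data; the dimensions, the stand-in matrix \(\Phi\), and the square-root eigenvalue scaling convention for MDS are assumptions made for illustration, not the book's code.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 10, 6                          # feature dimension M > number of points N
Phi = rng.standard_normal((M, N))     # stand-in for the embedded data matrix

# Notes 4 and 7: J = I - (1/N) 1 1^T is the centering matrix; multiplying
# by J on the right subtracts the mean from every column.
J = np.eye(N) - np.ones((N, N)) / N
assert np.allclose(Phi @ J, Phi - Phi.mean(axis=1, keepdims=True))

# Note 7: A = J Phi^T Phi J is the Gram matrix of the centered data.
A = J @ Phi.T @ Phi @ J
assert np.allclose(A, (Phi @ J).T @ (Phi @ J))

# Note 2: Phi^T Phi and Phi Phi^T share their nonzero eigenvalues, and the
# remaining M - N eigenvalues of the M x M matrix Phi Phi^T are zero.
small = np.linalg.eigvalsh(Phi.T @ Phi)   # N eigenvalues, ascending
big = np.linalg.eigvalsh(Phi @ Phi.T)     # M eigenvalues, ascending
assert np.allclose(big[:M - N], 0, atol=1e-9)
assert np.allclose(big[M - N:], small, atol=1e-9)

# Note 8 (one common reading): the scaled MDS representation multiplies the
# j-th coordinate of the embedding by the square root of the j-th eigenvalue.
lam, V = np.linalg.eigh(A)
lam, V = lam[::-1], V[:, ::-1]            # sort eigenpairs in descending order
d = 3
Y = np.sqrt(np.clip(lam[:d], 0, None))[:, None] * V[:, :d].T   # d x N embedding
print("embedding shape:", Y.shape)
```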