Machine Learning

, Volume 56, Issue 1–3, pp 9–33 | Cite as

Clustering Large Graphs via the Singular Value Decomposition

  • P. Drineas
  • A. Frieze
  • R. Kannan
  • S. Vempala
  • V. Vinay
Article

Abstract

We consider the problem of partitioning a set of m points in the n-dimensional Euclidean space into k clusters (usually m and n are variable, while k is fixed), so as to minimize the sum of squared distances between each point and its cluster center. This formulation is usually the objective of the k-means clustering algorithm (Kanungo et al. (2000)). We prove that this problem in NP-hard even for k = 2, and we consider a continuous relaxation of this discrete problem: find the k-dimensional subspace V that minimizes the sum of squared distances to V of the m points. This relaxation can be solved by computing the Singular Value Decomposition (SVD) of the m × n matrix A that represents the m points; this solution can be used to get a 2-approximation algorithm for the original problem. We then argue that in fact the relaxation provides a generalized clustering which is useful in its own right.

Finally, we show that the SVD of a random submatrix—chosen according to a suitable probability distribution—of a given matrix provides an approximation to the SVD of the whole matrix, thus yielding a very fast randomized algorithm. We expect this algorithm to be the main contribution of this paper, since it can be applied to problems of very large size which typically arise in modern applications.

Singular Value Decomposition randomized algorithms k-means clustering 

References

  1. Achlioptas, D., & McSherry, F. (2001). Fast computation of low rank approximations. In Proceedings of the 33rd Annual Symposium on Theory of Computing (pp. 337–346).Google Scholar
  2. Achlioptas, D., & McSherry, F. (2003). Fast computation of low rank matrix approximations. Manuscript.Google Scholar
  3. Andrews, H. C., & Patterson, C. L. (1976a). Singular value decomposition image coding. IEEE Trans. on Communications, 4, 425–432.Google Scholar
  4. Andrews, H. C., & Patterson, C. L. (1976b). Singular value decompositions and digital image processing. IEEE Trans. ASSP 26–53.Google Scholar
  5. Azar, Y., Fiat, A., Karlin, A., McSherry, F., & Saia, J. (2001). Spectral analysis of data. In Proc. of the 33rd ACM Symposium on Theory of Computing (pp. 619–626).Google Scholar
  6. Berry, M. J., & Linoff, G. (1997). Data mining techniques. John-Wiley.Google Scholar
  7. Bar-Yossef, Z. (2002). The complexity of massive dataset computations. Ph.D. thesis, University of California, Berkeley.Google Scholar
  8. Bar-Yossef, Z. (2003). Sampling lower bounds via information theory. In Proceedings of the 35th Annual Symposium on Theory of Computing (pp. 335–344).Google Scholar
  9. Charikar, M., & Guha, S. (1999). Improved combinatorial algorithms for the facility location and k-median problems. In Proceedings of the 40th IEEE Symposium on Foundations of Computer Science (pp. 378–388).Google Scholar
  10. Charikar, M., Guha, S., Shmoys, D., & Tardos, E. (2002). A constant factor approximation algorithm for the k-median problem. Journal of Computer and System Sciences, 65:1, 129–149.Google Scholar
  11. Drineas, P., & Kannan, R. (2001). Fast Monte-Carlo algorithms for approximate matrix multiplication. In Proceedings of the 42nd Annual Symposium on Foundations of Computer Science (pp. 452–459).Google Scholar
  12. Frieze, A., Kannan, R., & Vempala, S. (1998). Fast Monte-Carlo algorithms for finding low rank approximations. In Proceedings of the 39th Annual Symposium on Foundations of Computer Science (pp. 370–378).Google Scholar
  13. Furedi, Z., & Komlos, J. (1981). The eigenvalues of random symmetric matrices. Combinatorica, 1, 233–241.Google Scholar
  14. Gersho, A., & Gray, R. M. (1991). Vector quantization and signal compression. Kluwer Academic.Google Scholar
  15. Gibson, D., Kleinberg, J., & Raghavan, P. (1998). Clustering categorical data: An approach based on dynamical systems. Very Large Data Bases (VLDB) 311–322.Google Scholar
  16. Goldreich, O., Goldwasser, S., & Ron, D. (1998). Property testing and its connection to learning and approximation. Journal of the ACM, 5:4, 653–750.Google Scholar
  17. Golub, G., & Van Loan, C. (1989). Matrix computations. Johns Hopkins University Press.Google Scholar
  18. Huang, T., & Narendra, P. (1974). Image restoration by singular value decomposition. Applied Optics, 14:9, 2213–2216.Google Scholar
  19. Jain, A. K., & Dubes, R. C. (1988). Algorithms for clustering data. Prentice Hall.Google Scholar
  20. Jain, K., & Vazirani, V. (1999). Primal-dual approximation algorithms for metric facility location and k-median problems. In Proceedings of the 40th IEEE Symposium on Foundations of Computer Science (pp. 2–13).Google Scholar
  21. Jambu, M., & Lebeaux, M.-O. (1983). Cluster analysis and data analysis. North Holland.Google Scholar
  22. Kanungo, T., Mount, D. M., Netanyahu, N. S., Piatko, C. D., Silverman, R., & Wu, A. Y. (2000). The analysis of a simple k-means clustering algorithm. In Symposium on Computational Geometry (pp. 100–109).Google Scholar
  23. Kleinberg, J. (1998). Authoritative sources in a hyperlinked environment. In Proceedings of 9th Annual ACM-SIAM Symposium on Discrete Algorithms (pp. 668–677).Google Scholar
  24. McDiarmid, C. J. H. (1989). On the method of bounded differences. In Surveys in Combinatorics: Invited Papers at the 12th British Combinatorial Conference (pp. 148–188).Google Scholar
  25. Ostrovsky, R., & Rabani, Y. (2002). Polynomial time approximation schemes for geometric k-clustering. Journal of the ACM, 49:2, 139–156.Google Scholar
  26. Papadimitriou, C. H., Raghavan, P., Tamaki, H., & Vempala, S. (2000). Latent semantic indexing: A probabilistic analysis. Journal of Computer and System Sciences, 61:2, 217–235.Google Scholar
  27. Parlett, B. (1997). The symmetric eigenvalue problem. Classics in Applied Mathematics, SIAM.Google Scholar

Copyright information

© Kluwer Academic Publishers 2004

Authors and Affiliations

  • P. Drineas
    • 1
  • A. Frieze
    • 2
  • R. Kannan
    • 3
  • S. Vempala
    • 4
  • V. Vinay
    • 5
  1. 1.Computer Science DepartmentRensselaer Polytechnic InstituteTroy
  2. 2.Department of Mathematical SciencesCarnegie Mellon UniversityPittsburgh
  3. 3.Computer Science DepartmentYale UniversityNew Haven
  4. 4.Department of MathematicsM.I.T.Cambridge
  5. 5.Indian Institute of ScienceBangaloreIndia

Personalised recommendations