Knowledge and Information Systems, Volume 32, Issue 2, pp 351–382

Improving clustering by learning a bi-stochastic data similarity matrix

  • Fei Wang
  • Ping Li (Email author)
  • Arnd Christian König
  • Muting Wan
Regular paper


An idealized clustering algorithm seeks to learn a cluster-adjacency matrix such that, if two data points belong to the same cluster, the corresponding entry is 1; otherwise, the entry is 0. This integer (1/0) constraint makes it difficult to find the optimal solution. We propose a relaxation of the cluster-adjacency matrix, by deriving a bi-stochastic matrix from a data similarity (e.g., kernel) matrix according to the Bregman divergence. Our general method is named the Bregmanian Bi-Stochastication (BBS) algorithm. We focus on two popular choices of the Bregman divergence: the Euclidean distance and the Kullback–Leibler (KL) divergence. Interestingly, the BBS algorithm using the KL divergence is equivalent to the Sinkhorn–Knopp (SK) algorithm for deriving bi-stochastic matrices. We show that the BBS algorithm using the Euclidean distance is closely related to relaxed k-means clustering and, through extensive experiments on public data sets, that it can often produce noticeably better clustering results than the SK algorithm (and other algorithms such as Normalized Cut).
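To make the Sinkhorn–Knopp step mentioned above concrete, the following is a minimal sketch of the classical alternating row/column normalization, which the abstract identifies with the KL variant of BBS. It is not the authors' implementation: the function name `sinkhorn_knopp`, the parameters `n_iter` and `tol`, and the small example matrix are all illustrative assumptions, and the sketch assumes a strictly positive similarity matrix. The Euclidean variant of BBS is not shown.

```python
import numpy as np

def sinkhorn_knopp(K, n_iter=1000, tol=1e-9):
    """Scale a strictly positive matrix toward a bi-stochastic one.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    P = np.asarray(K, dtype=float).copy()
    for _ in range(n_iter):
        P /= P.sum(axis=1, keepdims=True)  # make every row sum to 1
        P /= P.sum(axis=0, keepdims=True)  # make every column sum to 1
        # Columns are exact after the last step, so only rows need checking.
        if np.abs(P.sum(axis=1) - 1.0).max() < tol:
            break
    return P

# Toy symmetric similarity matrix (made-up values, not data from the paper).
K = np.array([[1.0, 0.5, 0.1],
              [0.5, 1.0, 0.2],
              [0.1, 0.2, 1.0]])
P = sinkhorn_knopp(K)
```

The resulting `P` has non-negative entries with rows and columns that each sum to approximately 1, which is the bi-stochastic relaxation of the 1/0 cluster-adjacency matrix described above.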


Clustering, Bi-stochastic matrix, Bregman divergence





Copyright information

© Springer-Verlag London Limited 2011

Authors and Affiliations

  • Fei Wang (1)
  • Ping Li (1, Email author)
  • Arnd Christian König (2)
  • Muting Wan (1)
  1. Department of Statistical Science, Cornell University, Ithaca, USA
  2. Microsoft Research, Microsoft Corporation, Redmond, USA
