Correlation Clustering
 Nikhil Bansal,
 Avrim Blum,
 Shuchi Chawla
Abstract
We consider the following clustering problem: we have a complete graph on n vertices (items), where each edge (u, v) is labeled either + or − depending on whether u and v have been deemed to be similar or different. The goal is to produce a partition of the vertices (a clustering) that agrees as much as possible with the edge labels. That is, we want a clustering that maximizes the number of + edges within clusters, plus the number of − edges between clusters (equivalently, minimizes the number of disagreements: the number of − edges inside clusters plus the number of + edges between clusters). This formulation is motivated by a document clustering problem in which one has a pairwise similarity function f learned from past data, and the goal is to partition the current set of documents in a way that correlates with f as much as possible; it can also be viewed as a kind of “agnostic learning” problem.
An interesting feature of this clustering formulation is that one does not need to specify the number of clusters k as a separate parameter, as in measures such as k-median or min-sum or min-max clustering. Instead, in our formulation, the optimal number of clusters could be any value between 1 and n, depending on the edge labels. We look at approximation algorithms for both minimizing disagreements and for maximizing agreements. For minimizing disagreements, we give a constant factor approximation. For maximizing agreements we give a PTAS, building on ideas of Goldreich, Goldwasser, and Ron (1998) and de la Vega (1996). We also show how to extend some of these results to graphs with edge labels in [−1, +1], and give some results for the case of random noise.
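The two objectives in the abstract are easy to state in code. The following sketch (not the paper's algorithms, just the objective functions) counts agreements and disagreements for a given clustering of a ±-labeled complete graph; the function and variable names are our own illustrative choices.

```python
def disagreements(labels, clustering):
    """Number of '-' edges inside clusters plus '+' edges between clusters.

    labels: dict mapping each vertex pair (u, v) to '+' or '-'.
    clustering: dict mapping each vertex to a cluster id.
    """
    bad = 0
    for (u, v), sign in labels.items():
        same_cluster = clustering[u] == clustering[v]
        if (sign == '-' and same_cluster) or (sign == '+' and not same_cluster):
            bad += 1
    return bad


def agreements(labels, clustering):
    """Agreements and disagreements sum to the total number of edges."""
    return len(labels) - disagreements(labels, clustering)


# Toy instance on vertices {1, 2, 3}: 1 and 2 are similar, 3 differs from both.
labels = {(1, 2): '+', (1, 3): '-', (2, 3): '-'}
clustering = {1: 0, 2: 0, 3: 1}  # put {1, 2} together, {3} alone

print(disagreements(labels, clustering))  # → 0 (this clustering is perfect)
print(agreements(labels, clustering))     # → 3
```

Note that the number of clusters (here 2) falls out of the labels rather than being fixed in advance, which is exactly the feature highlighted above.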
 Alon, N., Fischer, E., Krivelevich, M., & Szegedy, M. (2000). Efficient testing of large graphs. Combinatorica, 20:4, 451–476.
 Alon, N., & Spencer, J. H. (1992). The probabilistic method. John Wiley and Sons.
 Arora, S., Frieze, A., & Kaplan, H. (2002). A new rounding procedure for the assignment problem with applications to dense graph arrangements. Mathematical Programming, 92:1, 1–36.
 Arora, S., Karger, D., & Karpinski, M. (1999). Polynomial time approximation schemes for dense instances of NP-hard problems. JCSS, 58:1, 193–210.
 Ben-David, S., Long, P. M., & Mansour, Y. (2001). Agnostic boosting. In Proceedings of the 2001 Conference on Computational Learning Theory (pp. 507–516).
 Blum, A., & Karger, D. (1997). An Õ(n^{3/14})-coloring algorithm for 3-colorable graphs. IPL: Information Processing Letters, 61.
 Charikar, M., & Guha, S. (1999). Improved combinatorial algorithms for the facility location and k-median problems. In Proceedings of the 40th Annual Symposium on Foundations of Computer Science.
 Charikar, M., Guruswami, V., & Wirth, A. (2003). Clustering with qualitative information. In Proceedings of the 44th Annual Symposium on Foundations of Computer Science (pp. 524–533).
 Cohen, W., & McCallum, A. (2001). Personal communication.
 Cohen, W., & Richman, J. (2001). Learning to match and cluster entity names. In ACM SIGIR'01 Workshop on Mathematical/Formal Methods in IR.
 Cohen, W., & Richman, J. (2002). Learning to match and cluster large high-dimensional data sets for data integration. In Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD).
 Condon, A., & Karp, R. (1999). Algorithms for graph partitioning on the planted partition model. Random Structures and Algorithms, 18:2, 116–140.
 de la Vega, F. (1996). MAXCUT has a randomized approximation scheme in dense graphs. Random Structures and Algorithms, 8:3, 187–198.
 Demaine, E., & Immorlica, N. (2003). Correlation clustering with partial information. In Proceedings of APPROX.
 Emanuel, D., & Fiat, A. (2003). Correlation clustering—Minimizing disagreements on arbitrary weighted graphs. In Proceedings of ESA.
 Garey, M., & Johnson, D. (2000). Computers and intractability: A guide to the theory of NP-completeness. W. H. Freeman and Company.
 Goldreich, O., Goldwasser, S., & Ron, D. (1998). Property testing and its connection to learning and approximation. JACM, 45:4, 653–750.
 Hochbaum, D., & Shmoys, D. (1986). A unified approach to approximation algorithms for bottleneck problems. JACM, 33, 533–550.
 Jain, K., & Vazirani, V. (2001). Approximation algorithms for metric facility location and k-median problems using the primal-dual schema and Lagrangian relaxation. JACM, 48:2, 274–296.
 Kearns, M. (1998). Efficient noise-tolerant learning from statistical queries. JACM, 45:6, 983–1006.
 Kearns, M. J., Schapire, R. E., & Sellie, L. M. (1994). Toward efficient agnostic learning. Machine Learning, 17:2/3, 115–142.
 McSherry, F. (2001). Spectral partitioning of random graphs. In Proceedings of the 42nd Annual Symposium on Foundations of Computer Science (pp. 529–537).
 Parnas, M., & Ron, D. (2002). Testing the diameter of graphs. Random Structures and Algorithms, 20:2, 165–183.
 Schulman, L. (2000). Clustering for edge-cost minimization. In Proceedings of the 32nd Annual ACM Symposium on Theory of Computing (pp. 547–555).
 Shamir, R., & Tsur, D. (2002). Improved algorithms for the random cluster graph model. In Proceedings of the Scandinavian Workshop on Algorithmic Theory (pp. 230–239).
 Swamy, C. (2004). Correlation clustering: Maximizing agreements via semidefinite programming. In Proceedings of the Symposium on Discrete Algorithms.
 Journal
 Machine Learning
 Volume 56, Issue 1–3, pp. 89–113
 Cover Date
 2004-07-01
 DOI
 10.1023/B:MACH.0000033116.57574.95
 Print ISSN
 0885-6125
 Online ISSN
 1573-0565
 Publisher
 Kluwer Academic Publishers-Plenum Publishers
 Keywords

 clustering
 approximation algorithm
 document classification
 Authors

 Nikhil Bansal ^{(1)}
 Avrim Blum ^{(1)}
 Shuchi Chawla ^{(1)}
 Author Affiliations

 1. Computer Science Department, Carnegie Mellon University, Pittsburgh, PA, 15213, USA