We consider the following clustering problem: we have a complete graph on n vertices (items), where each edge (u, v) is labeled either + or − depending on whether u and v have been deemed similar or different. The goal is to produce a partition of the vertices (a clustering) that agrees as much as possible with the edge labels. That is, we want a clustering that maximizes the number of + edges within clusters, plus the number of − edges between clusters (equivalently, minimizes the number of disagreements: the number of − edges inside clusters plus the number of + edges between clusters). This formulation is motivated by a document clustering problem in which one has a pairwise similarity function f learned from past data, and the goal is to partition the current set of documents in a way that correlates with f as much as possible; it can also be viewed as a kind of “agnostic learning” problem.
An interesting feature of this clustering formulation is that one does not need to specify the number of clusters k as a separate parameter, as in measures such as k-median or min-sum or min-max clustering. Instead, in our formulation, the optimal number of clusters could be any value between 1 and n, depending on the edge labels. We look at approximation algorithms for both minimizing disagreements and maximizing agreements. For minimizing disagreements, we give a constant factor approximation. For maximizing agreements we give a PTAS, building on ideas of Goldreich, Goldwasser, and Ron (1998) and de la Vega (1996). We also show how to extend some of these results to graphs with edge labels in [−1, +1], and give some results for the case of random noise.
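To make the objective concrete, the following is a minimal sketch (not from the paper; the function and variable names are hypothetical) that counts the disagreements of a given clustering against the ± edge labels. Agreements are simply the total number of edges minus the disagreements.

```python
def disagreements(labels, clustering):
    """Count disagreements of a clustering against +/- edge labels.

    labels: dict mapping an edge (u, v) to '+' or '-' (complete graph assumed).
    clustering: list of sets of vertices forming a partition.

    A disagreement is a '-' edge inside a cluster or a '+' edge
    between two different clusters.
    """
    # Map each vertex to the index of its cluster.
    cluster_of = {v: i for i, cluster in enumerate(clustering) for v in cluster}
    count = 0
    for (u, v), sign in labels.items():
        same_cluster = cluster_of[u] == cluster_of[v]
        if (sign == '-' and same_cluster) or (sign == '+' and not same_cluster):
            count += 1
    return count


# Example: complete graph on {a, b, c, d}.
labels = {('a', 'b'): '+', ('a', 'c'): '+', ('b', 'c'): '-',
          ('a', 'd'): '-', ('b', 'd'): '-', ('c', 'd'): '-'}

# Clustering {a, b, c} | {d}: one disagreement (the '-' edge b-c inside).
print(disagreements(labels, [{'a', 'b', 'c'}, {'d'}]))  # → 1
```

Note that the number of clusters is not an input: the all-singletons partition and the single-cluster partition are both candidates, and each clustering is scored purely by how well it matches the labels.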
- Alon, N., Fischer, E., Krivelevich, M., & Szegedy, M. (2000). Efficient testing of large graphs. Combinatorica, 20:4, 451–476.
- Alon, N., & Spencer, J. H. (1992). The probabilistic method. John Wiley and Sons.
- Arora, S., Frieze, A., & Kaplan, H. (2002). A new rounding procedure for the assignment problem with applications to dense graph arrangements. Mathematical Programming, 92:1, 1–36.
- Arora, S., Karger, D., & Karpinski, M. (1999). Polynomial time approximation schemes for dense instances of NP-Hard problems. JCSS, 58:1, 193–210.
- Ben-David, S., Long, P. M., & Mansour, Y. (2001). Agnostic boosting. In Proceedings of the 2001 Conference on Computational Learning Theory (pp. 507–516).
- Blum, A., & Karger, D. (1997). An Õ(n^{3/14})-coloring algorithm for 3-colorable graphs. Information Processing Letters, 61.
- Charikar, M., & Guha, S. (1999). Improved combinatorial algorithms for the facility location and k-median problems. In Proceedings of the 40th Annual Symposium on Foundations of Computer Science.
- Charikar, M., Guruswami, V., & Wirth, A. (2003). Clustering with qualitative information. In Proceedings of the 44th Annual Symposium on Foundations of Computer Science (pp. 524–533).
- Cohen, W., & McCallum, A. (2001). Personal communication.
- Cohen, W., & Richman, J. (2001). Learning to match and cluster entity names. In ACM SIGIR'01 Workshop on Mathematical/Formal Methods in IR.
- Cohen, W., & Richman, J. (2002). Learning to match and cluster large high-dimensional data sets for data integration. In Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD).
- Condon, A., & Karp, R. (1999). Algorithms for graph partitioning on the planted partition model. Random Structures and Algorithms, 18:2, 116–140.
- de la Vega, F. (1996). MAX-CUT has a randomized approximation scheme in dense graphs. Random Structures and Algorithms, 8:3, 187–198.
- Demaine, E., & Immorlica, N. (2003). Correlation clustering with partial information. In Proceedings of APPROX.
- Emanuel, D., & Fiat, A. (2003). Correlation clustering—Minimizing disagreements on arbitrary weighted graphs. In Proceedings of ESA.
- Garey, M., & Johnson, D. (2000). Computers and intractability: A guide to the theory of NP-completeness. W. H. Freeman and Company.
- Goldreich, O., Goldwasser, S., & Ron, D. (1998). Property testing and its connection to learning and approximation. JACM, 45:4, 653–750.
- Hochbaum, D., & Shmoys, D. (1986). A unified approach to approximation algorithms for bottleneck problems. JACM, 33, 533–550.
- Jain, K., & Vazirani, V. (2001). Approximation algorithms for metric facility location and k-Median problems using the primal-dual schema and Lagrangian relaxation. JACM, 48:2, 274–296.
- Kearns, M. (1998). Efficient noise-tolerant learning from statistical queries. JACM, 45:6, 983–1006.
- Kearns, M. J., Schapire, R. E., & Sellie, L. M. (1994). Toward efficient agnostic learning. Machine Learning, 17:2/3, 115–142.
- McSherry, F. (2001). Spectral partitioning of random graphs. In Proceedings of the 42nd Annual Symposium on Foundations of Computer Science (pp. 529–537).
- Parnas, M., & Ron, D. (2002). Testing the diameter of graphs. Random Structures and Algorithms, 20:2, 165–183.
- Schulman, L. (2000). Clustering for edge-cost minimization. In Proceedings of the 32nd Annual ACM Symposium on Theory of Computing (pp. 547–555).
- Shamir, R., & Tsur, D. (2002). Improved algorithms for the random cluster graph model. In Proceedings of the Scandinavian Workshop on Algorithmic Theory (pp. 230–239).
- Swamy, C. (2004). Correlation clustering: Maximizing agreements via semidefinite programming. In Proceedings of the Symposium on Discrete Algorithms.
Correlation Clustering
Volume 56, Issue 1-3, pp. 89–113
Kluwer Academic Publishers-Plenum Publishers
Keywords: approximation algorithm, document classification