Machine Learning

, Volume 56, Issue 1–3, pp 89–113 | Cite as

Correlation Clustering

  • Nikhil Bansal
  • Avrim Blum
  • Shuchi Chawla

Abstract

We consider the following clustering problem: we have a complete graph on n vertices (items), where each edge (u, v) is labeled either + or − depending on whether u and v have been deemed to be similar or different. The goal is to produce a partition of the vertices (a clustering) that agrees as much as possible with the edge labels. That is, we want a clustering that maximizes the number of + edges within clusters, plus the number of − edges between clusters (equivalently, minimizes the number of disagreements: the number of − edges inside clusters plus the number of + edges between clusters). This formulation is motivated from a document clustering problem in which one has a pairwise similarity function f learned from past data, and the goal is to partition the current set of documents in a way that correlates with f as much as possible; it can also be viewed as a kind of “agnostic learning” problem.

An interesting feature of this clustering formulation is that one does not need to specify the number of clusters k as a separate parameter, as in measures such as k-median or min-sum or min-max clustering. Instead, in our formulation, the optimal number of clusters could be any value between 1 and n, depending on the edge labels. We look at approximation algorithms for both minimizing disagreements and for maximizing agreements. For minimizing disagreements, we give a constant factor approximation. For maximizing agreements we give a PTAS, building on ideas of Goldreich, Goldwasser, and Ron (1998) and de la Veg (1996). We also show how to extend some of these results to graphs with edge labels in [−1, +1], and give some results for the case of random noise.

clustering approximation algorithm document classification 

References

  1. Alon, N., Fischer, E., Krivelevich, M., & Szegedy, M. (2000). Efficient testing of large graphs. Combinatorica, 20:4, 451–476.Google Scholar
  2. Alon, N., & Spencer, J. H. (1992). The probabilistic method. John Wiley and Sons.Google Scholar
  3. Arora, S., Frieze, A., & Kaplan, H. (2002). A new rounding procedure for the assignment problem with applications to dense graph arrangements. Mathematical Programming, 92:1, 1–36.Google Scholar
  4. Arora, S., Karger, D., & Karpinski, M. (1999). Polynomial time approximation schemes for dense instances of NP-Hard problems. JCSS, 58:1, 193–210.Google Scholar
  5. Ben-David, S., Long, P. M., & Mansour, Y. (2001). Agnostic boosting. In Proceedings of the 2001 Conference on Computational Learning Theory (pp. 507–516).Google Scholar
  6. Blum, A., & Karger, D. (1997). An Õ(n 3/14)-coloring algorithm for 3-colorable graphs. IPL: Information Processing Letters, 61.Google Scholar
  7. Charikar, M., & Guha, S. (1999). Improved combinatorial algorithms for the facility location and k-median problems. In Proceedings of the 40th Annual Symposium on Foundations of Computer Science.Google Scholar
  8. Charikar, M., Guruswami, V., & Wirth, A. (2003). Clustering with qualitative information. In Proceedings of the 44th Annual Symposium on Foundations of Computer Science (pp. 524–533).Google Scholar
  9. Cohen, W., & McCallum, A. (2001). Personal communication.Google Scholar
  10. Cohen, W., & Richman, J. (2001). Learning to match and cluster entity names. In ACM SIGIR'01 Workshop on Mathematical/Formal Methods in IR.Google Scholar
  11. Cohen, W., & Richman, J. (2002). Learning to match and cluster large high-dimensional data sets for data integration. In Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD).Google Scholar
  12. Condon, A., & Karp, R. (1999). Algorithms for graph partitioning on the planted partition model. Random Structures and Algorithms, 18:2, 116–140.Google Scholar
  13. de la Vega, F. (1996). MAX-CUT has a randomized approximation scheme in dense graphs. Random Structures and Algorithms, 8:3, 187–198.Google Scholar
  14. Demaine, E., & Immorlica, N. (2003). Correlation clustering with partial information. In Proceedings of APPROX.Google Scholar
  15. Emanuel, D., & Fiat, A. (2003). Correlation clustering—Minimizing disagreements on arbitrary weighted graphs. In Proceedings of ESA.Google Scholar
  16. Garey, M., & Johnson, D. (2000). Computers and intractability: A guide to the theory of NP-completeness. W. H. Freeman and Company.Google Scholar
  17. Goldreich, O., Goldwasser, S., & Ron, D. (1998). Property testing and its connection to learning and approximation. JACM, 45:4, 653–750.Google Scholar
  18. Hochbaum, D., & Shmoys, D. (1986). A unified approach to approximation algorithms for bottleneck problems. JACM, 33, 533–550.Google Scholar
  19. Jain, K., & Vazirani, V. (2001). Approximation algorithms for metric facility location and k-Median problems using the primal-dual schema and Lagrangian relaxation. JACM, 48:2, 274–296.Google Scholar
  20. Kearns, M. (1998). Efficient noise-tolerant learning from statistical queries. JACM, 45:6, 983–1006.Google Scholar
  21. Kearns, M. J., Schapire, R. E., & Sellie, L. M. (1994). Toward efficient agnostic learning. Machine Learning, 17:2/3, 115–142.Google Scholar
  22. McSherry, F. (2001). Spectral partitioning of random graphs. In Proceedings of the 42th Annual Symposium on Foundations of Computer Science (pp. 529–537).Google Scholar
  23. Parnas, M., & Ron, D. (2002). Testing the diameter of graphs. Random Structures and Algorithms, 20:2, 165–183.Google Scholar
  24. Schulman, L. (2000). Clustering for edge-cost minimization. In Proceedings of the 32nd Annual ACM Symposium on Theory of Computing (pp. 547–555).Google Scholar
  25. Shamir, R., & Tsur, D. (2002). Improved algorithms for the random cluster graph model. In Proceedings of the Scandinavian Workshop on Algorithmic Theory (pp. 230–239).Google Scholar
  26. Swamy, C. (2004). Correlation clustering: Maximizing agreements via semidefinite programming. In Proceedings of the Symposium on Discrete Algorithms.Google Scholar

Copyright information

© Kluwer Academic Publishers 2004

Authors and Affiliations

  • Nikhil Bansal
    • 1
  • Avrim Blum
    • 1
  • Shuchi Chawla
    • 1
  1. 1.Computer Science DepartmentCarnegie Mellon UniversityPittsburghUSA

Personalised recommendations