## Abstract

We consider the following clustering problem: we have a complete graph on *n* vertices (items), where each edge (*u*, *v*) is labeled either + or − depending on whether *u* and *v* have been deemed to be similar or different. The goal is to produce a partition of the vertices (a clustering) that agrees as much as possible with the edge labels. That is, we want a clustering that maximizes the number of + edges within clusters plus the number of − edges between clusters (equivalently, one that minimizes the number of disagreements: the number of − edges inside clusters plus the number of + edges between clusters). This formulation is motivated by a document clustering problem in which one has a pairwise similarity function *f* learned from past data, and the goal is to partition the current set of documents in a way that correlates with *f* as much as possible; it can also be viewed as a kind of “agnostic learning” problem.
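To make the objective concrete, the following minimal sketch counts disagreements and agreements for a given clustering. The function names, the dictionary-based edge representation, and the four-vertex instance are illustrative assumptions, not taken from the paper.

```python
from itertools import combinations


def count_disagreements(labels, clustering):
    """Count '-' edges inside clusters plus '+' edges between clusters.

    labels:     dict mapping frozenset({u, v}) -> '+' or '-' (hypothetical encoding)
    clustering: dict mapping each vertex -> cluster id
    """
    disagreements = 0
    for u, v in combinations(sorted(clustering), 2):
        same_cluster = clustering[u] == clustering[v]
        label = labels[frozenset((u, v))]
        if same_cluster and label == "-":
            disagreements += 1  # '-' edge inside a cluster
        elif not same_cluster and label == "+":
            disagreements += 1  # '+' edge between clusters
    return disagreements


def count_agreements(labels, clustering):
    """Every edge is either an agreement or a disagreement."""
    n = len(clustering)
    return n * (n - 1) // 2 - count_disagreements(labels, clustering)


# Illustrative instance: a-b and b-c are similar, every other pair is different.
vertices = ["a", "b", "c", "d"]
plus_edges = {frozenset("ab"), frozenset("bc")}
labels = {
    frozenset(pair): ("+" if frozenset(pair) in plus_edges else "-")
    for pair in combinations(vertices, 2)
}

merged = {"a": 0, "b": 0, "c": 0, "d": 1}      # clusters {a, b, c} and {d}
singletons = {"a": 0, "b": 1, "c": 2, "d": 3}  # every vertex alone

print(count_disagreements(labels, merged))      # 1 (the '-' edge a-c is inside a cluster)
print(count_disagreements(labels, singletons))  # 2 (the '+' edges a-b and b-c are cut)
print(count_agreements(labels, merged))         # 5
```

Because the graph is complete, agreements and disagreements always sum to *n*(*n* − 1)/2, so the two objectives share the same optimal clusterings even though they can differ from the point of view of approximation.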

An interesting feature of this clustering formulation is that one does not need to specify the number of clusters *k* as a separate parameter, as one does for measures such as *k*-median, min-sum, or min-max clustering. Instead, in our formulation, the optimal number of clusters could be any value between 1 and *n*, depending on the edge labels. We look at approximation algorithms both for minimizing disagreements and for maximizing agreements. For minimizing disagreements, we give a constant-factor approximation. For maximizing agreements, we give a PTAS, building on ideas of Goldreich, Goldwasser, and Ron (1998) and de la Vega (1996). We also show how to extend some of these results to graphs with edge labels in [−1, +1], and give some results for the case of random noise.

## References

- Alon, N., Fischer, E., Krivelevich, M., & Szegedy, M. (2000). Efficient testing of large graphs. *Combinatorica, 20:4*, 451–476.
- Alon, N., & Spencer, J. H. (1992). *The probabilistic method*. John Wiley and Sons.
- Arora, S., Frieze, A., & Kaplan, H. (2002). A new rounding procedure for the assignment problem with applications to dense graph arrangements. *Mathematical Programming, 92:1*, 1–36.
- Arora, S., Karger, D., & Karpinski, M. (1999). Polynomial time approximation schemes for dense instances of NP-Hard problems. *JCSS, 58:1*, 193–210.
- Ben-David, S., Long, P. M., & Mansour, Y. (2001). Agnostic boosting. In *Proceedings of the 2001 Conference on Computational Learning Theory* (pp. 507–516).
- Blum, A., & Karger, D. (1997). An *Õ*(*n*^{3/14})-coloring algorithm for 3-colorable graphs. *IPL: Information Processing Letters, 61*.
- Charikar, M., & Guha, S. (1999). Improved combinatorial algorithms for the facility location and *k*-median problems. In *Proceedings of the 40th Annual Symposium on Foundations of Computer Science*.
- Charikar, M., Guruswami, V., & Wirth, A. (2003). Clustering with qualitative information. In *Proceedings of the 44th Annual Symposium on Foundations of Computer Science* (pp. 524–533).
- Cohen, W., & McCallum, A. (2001). Personal communication.
- Cohen, W., & Richman, J. (2001). Learning to match and cluster entity names. In *ACM SIGIR'01 Workshop on Mathematical/Formal Methods in IR*.
- Cohen, W., & Richman, J. (2002). Learning to match and cluster large high-dimensional data sets for data integration. In *Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD)*.
- Condon, A., & Karp, R. (1999). Algorithms for graph partitioning on the planted partition model. *Random Structures and Algorithms, 18:2*, 116–140.
- de la Vega, F. (1996). MAX-CUT has a randomized approximation scheme in dense graphs. *Random Structures and Algorithms, 8:3*, 187–198.
- Demaine, E., & Immorlica, N. (2003). Correlation clustering with partial information. In *Proceedings of APPROX*.
- Emanuel, D., & Fiat, A. (2003). Correlation clustering—Minimizing disagreements on arbitrary weighted graphs. In *Proceedings of ESA*.
- Garey, M., & Johnson, D. (2000). *Computers and intractability: A guide to the theory of NP-completeness*. W. H. Freeman and Company.
- Goldreich, O., Goldwasser, S., & Ron, D. (1998). Property testing and its connection to learning and approximation. *JACM, 45:4*, 653–750.
- Hochbaum, D., & Shmoys, D. (1986). A unified approach to approximation algorithms for bottleneck problems. *JACM, 33*, 533–550.
- Jain, K., & Vazirani, V. (2001). Approximation algorithms for metric facility location and k-Median problems using the primal-dual schema and Lagrangian relaxation. *JACM, 48:2*, 274–296.
- Kearns, M. (1998). Efficient noise-tolerant learning from statistical queries. *JACM, 45:6*, 983–1006.
- Kearns, M. J., Schapire, R. E., & Sellie, L. M. (1994). Toward efficient agnostic learning. *Machine Learning, 17:2/3*, 115–142.
- McSherry, F. (2001). Spectral partitioning of random graphs. In *Proceedings of the 42nd Annual Symposium on Foundations of Computer Science* (pp. 529–537).
- Parnas, M., & Ron, D. (2002). Testing the diameter of graphs. *Random Structures and Algorithms, 20:2*, 165–183.
- Schulman, L. (2000). Clustering for edge-cost minimization. In *Proceedings of the 32nd Annual ACM Symposium on Theory of Computing* (pp. 547–555).
- Shamir, R., & Tsur, D. (2002). Improved algorithms for the random cluster graph model. In *Proceedings of the Scandinavian Workshop on Algorithmic Theory* (pp. 230–239).
- Swamy, C. (2004). Correlation clustering: Maximizing agreements via semidefinite programming. In *Proceedings of the Symposium on Discrete Algorithms*.