Abstract
We address the scalability of correlation clustering (CC) by reducing the number of variables in its semidefinite programming (SDP) formulation. In place of the SDP we obtain a nonlinear programming formulation with far fewer variables, which we solve with the limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) method. We demonstrate the potential of the nonlinear formulation on large graph datasets with more than ten thousand vertices and nine million edges, and show experimentally that the scalable formulation does not compromise the quality of the obtained clusters. We compare its results with those of the original CC formulation, with a constrained spectral clustering method that uses the edge labels of the graph and differs only in how clusters are obtained by defining the cut on the given graph, and with a variant of constrained spectral clustering known as self-taught spectral clustering.
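The variable reduction described in the abstract can be sketched as a low-rank (Burer–Monteiro style) factorization of the SDP matrix: the n × n PSD matrix X with unit diagonal is replaced by X = VVᵀ with V ∈ ℝ^{n×k}, k ≪ n, and the resulting smooth nonconvex objective is handed to L-BFGS. The sketch below is an illustrative reconstruction under those assumptions, not the authors' exact formulation; the function name `cc_lowrank` and the rounding step are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize


def cc_lowrank(edges, n, k=8, seed=0):
    """Low-rank relaxation of the correlation-clustering SDP (a sketch).

    The SDP variable X (PSD, diag(X) = 1) is replaced by X = V V^T with
    V in R^{n,k}; unit-diagonal constraints become unit-norm rows of V,
    enforced here by normalizing rows inside the loss.

    edges: iterable of (i, j, s) with s = +1 (similar) / -1 (dissimilar).
    Returns the n x k matrix of unit row vectors.
    """
    rng = np.random.default_rng(seed)
    i_idx = np.array([e[0] for e in edges])
    j_idx = np.array([e[1] for e in edges])
    sign = np.array([e[2] for e in edges], dtype=float)

    def loss_grad(v_flat):
        V = v_flat.reshape(n, k)
        norms = np.linalg.norm(V, axis=1, keepdims=True) + 1e-12
        U = V / norms                        # unit rows -> diag(U U^T) = 1
        inner = np.sum(U[i_idx] * U[j_idx], axis=1)
        # Maximize agreements: s=+1 wants inner near 1, s=-1 near -1,
        # so we minimize the negated weighted sum.
        f = -np.sum(sign * inner)
        # Gradient w.r.t. the unit rows U ...
        G = np.zeros_like(U)
        np.add.at(G, i_idx, -sign[:, None] * U[j_idx])
        np.add.at(G, j_idx, -sign[:, None] * U[i_idx])
        # ... then through the normalization u = v/|v|: J = (I - u u^T)/|v|
        gV = (G - U * np.sum(G * U, axis=1, keepdims=True)) / norms
        return f, gV.ravel()

    res = minimize(loss_grad, rng.standard_normal(n * k),
                   jac=True, method="L-BFGS-B")
    V = res.x.reshape(n, k)
    return V / (np.linalg.norm(V, axis=1, keepdims=True) + 1e-12)
```

Clusters can then be recovered by rounding the rows of the returned matrix, e.g. with k-means. The payoff is in the variable count: n·k unknowns instead of the n(n+1)/2 entries of the SDP matrix, which is what makes graphs with tens of thousands of vertices tractable.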
Notes
In a maximization problem, a \(\lambda\)-approximation algorithm A for the problem with cost function c is an algorithm such that, for every input I, the solution A(I) satisfies \(c(A(I))\ge \lambda \max_S c(S)\) for some \(\lambda \le 1\).
The approximation value (\(\lambda\)) is the ratio of the cost of the solution produced by the approximation algorithm to the cost of an optimal solution; in a maximization problem \(\lambda \le 1\), and in a minimization problem \(\lambda \ge 1\).
By a large dataset we mean a graph with a large number of nodes as well as edges.
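The sign of the convention in the notes above is easy to check on hypothetical numbers (the values 75, 100 and 120 below are invented purely for illustration):

```python
def approx_ratio(alg_cost, opt_cost):
    """lambda = cost of the algorithm's solution / cost of an optimum."""
    return alg_cost / opt_cost


# Maximization: the algorithm achieves 75 agreements, the optimum is 100,
# so lambda = 0.75 <= 1.
lam_max = approx_ratio(75, 100)

# Minimization: the algorithm incurs 120 disagreements, the optimum is 100,
# so lambda = 1.2 >= 1.
lam_min = approx_ratio(120, 100)
```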
References
Arasu A, Ré C, Suciu D (2009) Large-scale deduplication with constraints using dedupalog. In: International conference on data engineering, pp. 952–963
Bansal N, Blum A, Chawla S (2002) Correlation clustering. In: Foundations of computer science (FOCS) 2002, pp. 238–247
Ben-David S, Long PM, Mansour Y (2001) Agnostic boosting. In: Proceedings of the 14th annual conference on computational learning theory and 5th European conference on computational learning theory, COLT ’01/EuroCOLT ’01. Springer, pp. 507–516
Burer S, Monteiro R (2003) A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Math Program 95(2):329–357
Cesa-Bianchi N, Gentile C, Vitale F, Zappella G (2012) A correlation clustering approach to link classification in signed networks. J Mach Learn Res 23:1–34
Charikar M, Guruswami V, Wirth A (2005) Clustering with qualitative information. J Comput Syst Sci 71(3):360–383
Chen Y, Sanghavi S, Xu H (2012) Clustering sparse graphs. Adv Neural Inf Process Syst 25:2204–2212
Chierichetti F, Dalvi N, Kumar R (2014) Correlation clustering in mapreduce. In: KDD ’14 proceedings of the 20th ACM SIGKDD international conference on knowledge discovery and data mining
Cohn D, Caruana R, Mccallum A (2007) Constrained clustering: advances in algorithms, chap. Semi-supervised clustering with user feedback. Data mining and knowledge discovery series, pp. 17–31
Giotis I, Guruswami V (2006) Correlation clustering with a fixed number of clusters. Theory Comput Open Access J 2:249–266
Goemans MX, Williamson DP (1995) Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. J Assoc Comput Mach 42(6):1115–1145
Immorlica N, Wirth A (2007) Constrained clustering: advances in algorithms, theory and applications, chap. correlation clustering. In: Wagstaff KL, Davidson I, Basu S (eds) Data mining and knowledge discovery series. Chapman & Hall, pp. 313–327
Kamvar SD, Klein D, Manning CD (2003) Spectral learning. In: Proceedings of the 18th international joint conference on artificial intelligence, IJCAI’03. Morgan Kaufmann Publishers Inc, pp. 561–566
Kearns MJ, Schapire RE, Sellie LM (1992) Toward efficient agnostic learning. In: Proceedings of the fifth annual workshop on Computational learning theory, COLT ’92. ACM, pp. 341–352
Liu DC, Nocedal J (1989) On the limited memory BFGS method for large scale optimization. Math Program 45(3):503–528
Lu Z, Leen TK (2007) Constrained clustering: advances in algorithms, chap. pairwise constraints as priors in probabilistic clustering. In: Wagstaff KL, Davidson I, Basu S (eds) Data mining and knowledge discovery series. Chapman & Hall, pp. 59–90
Pan X, Papailiopoulos D, Oymak S, Recht B, Ramchandran K, Jordan MI (2015) Parallel correlation clustering on big graphs. In: Cortes C, Lawrence ND, Lee DD, Sugiyama M, Garnett R (eds) Advances in Neural Information Processing Systems 28. Curran Associates, Inc, pp. 82–90
Pensa R, Robardet C, Boulicaut JF (2007) Constrained clustering: advances in algorithms, chap. constraint-driven co-clustering of 0/1 Data. In: Wagstaff KL, Davidson I, Basu S (eds) Data mining and knowledge discovery series. Chapman & Hall, pp. 123–148
Rand WM (1971) Objective criteria for the evaluation of clustering methods. J Am Stat Assoc 66(336):846–850
Rebagliati N, Verri A (2011) Spectral clustering with more than k eigenvectors. Neurocomputing http://staffweb.cms.gre.ac.uk/~wc06/partition/
Shanno DF, Phua KH (1980) Minimization of unconstrained multivariate functions. ACM Trans Math Softw 6:618–622
Shental N, Bar-Hillel A, Hertz T, Weinshal D (2004) Computing Gaussian mixture models with EM using equivalence constraints. In: Thrun S, Saul L, Schölkopf B (eds) Advances in neural information processing systems 16. MIT press, Cambridge
Swamy C (2004) Correlation clustering: maximizing agreements via semidefinite programming. In: Proceedings of the fifteenth annual ACM-SIAM symposium on discrete algorithms, pp. 526–527
Tang W, Zhong S (2008) Computational methods of feature selection, chap. pairwise constraints-guided dimensionality reduction. In: Liu H, Motoda H (eds) Data mining and knowledge discovery series. Taylor & Francis Group CRC, pp. 295–312
Vandenberghe L, Boyd S (1996) Semidefinite programming. SIAM Rev 38(1):49–95
Wacquet G, Caillault EP, Hamad D, HéBert PA (2013) Constrained spectral embedding for K-way data clustering. Pattern Recognit Lett 34:1009–1017
Wagstaff K, Cardie C, Rogers S, Schrödl S (2001) Constrained k-means clustering with background knowledge. In: Proceedings of the eighteenth international conference on machine learning, ICML ’01. Morgan Kaufmann Publishers Inc, pp. 577–584
Wang X, Davidson I (2010) Flexible constrained spectral clustering. In: KDD ’10: proceedings of the 16th ACM SIGKDD international conference on knowledge discovery and data mining, pp. 563–572
Wang X, Qian B, Davidson I (2012) On constrained spectral clustering and its applications. Data Min Knowl Discov 28(1):1–30
Wang X, Wang J, Qian B, Wang F, Davidson I (2014) Self-taught spectral clustering via constraint augmentation. In: Proceedings of the 2014 SIAM international conference on data mining, Philadelphia, Pennsylvania, USA, April 24–26, pp. 416–424
Wang YX, Xu H (2016) Noisy sparse subspace clustering. J Mach Learn Res 17(1):320–360
Xu C, Tao D, Xu C (2015) Multi-view self-paced learning for clustering. In: Proceedings of the 24th international conference on artificial intelligence, IJCAI’15. AAAI Press, pp. 3974–3980
Xu Q, Desjardins M (2005) Constrained spectral clustering under a local proximity structure assumption. In: Proceedings of the 18th international conference of the Florida artificial intelligence research society (FLAIRS-05). AAAI Press
Zhang XL (2014) Convex discriminative multitask clustering. IEEE Trans Pattern Anal Mach Intell 37:28–40
Cite this article
Samal, M., Saradhi, V.V. & Nandi, S. Scalability of correlation clustering. Pattern Anal Applic 21, 703–719 (2018). https://doi.org/10.1007/s10044-017-0598-7