Abstract
We address the scalability of correlation clustering (CC) by reducing the number of variables in its semidefinite programming (SDP) formulation. In place of the SDP we obtain a nonlinear programming formulation with far fewer variables, which we solve with the limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) method. We demonstrate the potential of the nonlinear formulation on large graph datasets with more than ten thousand vertices and nine million edges, and show experimentally that the scalable formulation does not compromise the quality of the obtained clusters. We compare its results with those of the original CC formulation, with a constrained spectral clustering method that uses the edge labels of the graph and differs only in how clusters are obtained by defining the cut on the given graph, and with a variant of constrained spectral clustering known as self-taught spectral clustering.
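The variable reduction described in the abstract can be sketched as a low-rank (Burer–Monteiro style) factorization of the SDP matrix: the n × n PSD matrix X with unit diagonal is replaced by X = VVᵀ with V ∈ ℝ^{n×k}, k ≪ n, and the resulting smooth nonconvex objective is handed to L-BFGS. The sketch below is an illustrative reconstruction under those assumptions, not the authors' exact formulation; the function name `cc_lowrank` and the rounding step are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize


def cc_lowrank(edges, n, k=8, seed=0):
    """Low-rank relaxation of the correlation-clustering SDP (a sketch).

    The SDP variable X (PSD, diag(X) = 1) is replaced by X = V V^T with
    V in R^{n,k}; unit-diagonal constraints become unit-norm rows of V,
    enforced here by normalizing rows inside the loss.

    edges: iterable of (i, j, s) with s = +1 (similar) / -1 (dissimilar).
    Returns the n x k matrix of unit row vectors.
    """
    rng = np.random.default_rng(seed)
    i_idx = np.array([e[0] for e in edges])
    j_idx = np.array([e[1] for e in edges])
    sign = np.array([e[2] for e in edges], dtype=float)

    def loss_grad(v_flat):
        V = v_flat.reshape(n, k)
        norms = np.linalg.norm(V, axis=1, keepdims=True) + 1e-12
        U = V / norms                        # unit rows -> diag(U U^T) = 1
        inner = np.sum(U[i_idx] * U[j_idx], axis=1)
        # Maximize agreements: s=+1 wants inner near 1, s=-1 near -1,
        # so we minimize the negated weighted sum.
        f = -np.sum(sign * inner)
        # Gradient w.r.t. the unit rows U ...
        G = np.zeros_like(U)
        np.add.at(G, i_idx, -sign[:, None] * U[j_idx])
        np.add.at(G, j_idx, -sign[:, None] * U[i_idx])
        # ... then through the normalization u = v/|v|: J = (I - u u^T)/|v|
        gV = (G - U * np.sum(G * U, axis=1, keepdims=True)) / norms
        return f, gV.ravel()

    res = minimize(loss_grad, rng.standard_normal(n * k),
                   jac=True, method="L-BFGS-B")
    V = res.x.reshape(n, k)
    return V / (np.linalg.norm(V, axis=1, keepdims=True) + 1e-12)
```

Clusters can then be recovered by rounding the rows of the returned matrix, e.g. with k-means. The payoff is in the variable count: n·k unknowns instead of the n(n+1)/2 entries of the SDP matrix, which is what makes graphs with tens of thousands of vertices tractable.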
Notes
In a maximization problem, a \(\lambda\)-approximation algorithm A for the problem with cost function c is an algorithm such that, for every input I, the solution A(I) satisfies \(c(A(I))\ge \lambda \max_S c(S)\) for some \(\lambda \le 1\).
The approximation value (\(\lambda\)) is the ratio of the cost of the solution produced by the approximation algorithm to the cost of an optimal solution; in a maximization problem \(\lambda \le 1\), and in a minimization problem \(\lambda \ge 1\).
By a large dataset we mean a graph with a large number of nodes as well as edges.
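The sign of the convention in the notes above is easy to check on hypothetical numbers (the values 75, 100 and 120 below are invented purely for illustration):

```python
def approx_ratio(alg_cost, opt_cost):
    """lambda = cost of the algorithm's solution / cost of an optimum."""
    return alg_cost / opt_cost


# Maximization: the algorithm achieves 75 agreements, the optimum is 100,
# so lambda = 0.75 <= 1.
lam_max = approx_ratio(75, 100)

# Minimization: the algorithm incurs 120 disagreements, the optimum is 100,
# so lambda = 1.2 >= 1.
lam_min = approx_ratio(120, 100)
```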
References
Arasu A, Ré C, Suciu D (2009) Large-scale deduplication with constraints using dedupalog. In: International conference on data engineering, pp. 952–963
Bansal N, Blum A, Chawla S (2002) Correlation clustering. In: Foundations of computer science (FOCS) 2002, pp. 238–247
Ben-David S, Long PM, Mansour Y (2001) Agnostic boosting. In: Proceedings of the 14th annual conference on computational learning theory and 5th European conference on computational learning theory, COLT ’01/EuroCOLT ’01. Springer, pp. 507–516
Burer S, Monteiro R (2003) A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Math Program 95(2):329–357
Cesa-Bianchi N, Gentile C, Vitale F, Zappella G (2012) A correlation clustering approach to link classification in signed networks. J Mach Learn Res 23:1–34
Charikar M, Guruswami V, Wirth A (2005) Clustering with qualitative information. J Comput Syst Sci 71(3):360–383
Chen Y, Sanghavi S, Xu H (2012) Clustering sparse graphs. Adv Neural Inf Process Syst 25:2204–2212
Chierichetti F, Dalvi N, Kumar R (2014) Correlation clustering in mapreduce. In: KDD ’14 proceedings of the 20th ACM SIGKDD international conference on knowledge discovery and data mining
Cohn D, Caruana R, Mccallum A (2007) Constrained clustering: advances in algorithms, chap. Semi-supervised clustering with user feedback. Data mining and knowledge discovery series, pp. 17–31
Giotis I, Guruswami V (2006) Correlation clustering with a fixed number of clusters. Theory Comput Open Access J 2:249–266
Goemans MX, Williamson DP (1995) Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. J Assoc Comput Mach 42(6):1115–1145
Immorlica N, Wirth A (2007) Constrained clustering: advances in algorithms, theory and applications, chap. correlation clustering. In: Wagstaff KL, Davidson I, Basu S (eds) Data mining and knowledge discovery series. Chapman & Hall, pp. 313–327
Kamvar SD, Klein D, Manning CD (2003) Spectral learning. In: Proceedings of the 18th international joint conference on artificial intelligence, IJCAI’03. Morgan Kaufmann Publishers Inc, pp. 561–566
Kearns MJ, Schapire RE, Sellie LM (1992) Toward efficient agnostic learning. In: Proceedings of the fifth annual workshop on Computational learning theory, COLT ’92. ACM, pp. 341–352
Liu DC, Nocedal J (1989) On the limited memory BFGS method for large scale optimization. Math Program 45(3):503–528
Lu Z, Leen TK (2007) Constrained clustering: advances in algorithms, chap. pairwise constraints as priors in probabilistic clustering. In: Wagstaff KL, Davidson I, Basu S (eds) Data mining and knowledge discovery series. Chapman & Hall, pp. 59–90
Pan X, Papailiopoulos D, Oymak S, Recht B, Ramchandran K, Jordan MI (2015) Parallel correlation clustering on big graphs. In: Cortes C, Lawrence ND, Lee DD, Sugiyama M, Garnett R (eds) Advances in Neural Information Processing Systems 28. Curran Associates, Inc, pp. 82–90
Pensa R, Robardet C, Boulicaut JF (2007) Constrained clustering: advances in algorithms, chap. constraint-driven co-clustering of 0/1 Data. In: Wagstaff KL, Davidson I, Basu S (eds) Data mining and knowledge discovery series. Chapman & Hall, pp. 123–148
Rand WM (1971) Objective criteria for the evaluation of clustering methods. J Am Stat Assoc 66(336):846–850
Rebagliati N, Verri A (2011) Spectral clustering with more than k eigenvectors. Neurocomputing http://staffweb.cms.gre.ac.uk/~wc06/partition/
Shanno DF, Phua KH (1980) Minimization of unconstrained multivariate functions. ACM Trans Math Softw 6:618–622
Shental N, Bar-Hillel A, Hertz T, Weinshal D (2004) Computing Gaussian mixture models with EM using equivalence constraints. In: Thrun S, Saul L, Schölkopf B (eds) Advances in neural information processing systems 16. MIT press, Cambridge
Swamy C (2004) Correlation clustering: maximizing agreements via semidefinite programming. In: Proceedings of the fifteenth annual ACM-SIAM symposium on discrete algorithms, pp. 526–527
Tang W, Zhong S (2008) Computational methods of feature selection, chap. pairwise constraints-guided dimensionality reduction. In: Liu H, Motoda H (eds) Data mining and knowledge discovery series. Taylor & Francis Group CRC, pp. 295–312
Vandenberghe L, Boyd S (1996) Semidefinite programming. SIAM Rev 38(1):49–95
Wacquet G, Caillault EP, Hamad D, HéBert PA (2013) Constrained spectral embedding for K-way data clustering. Pattern Recognit Lett 34:1009–1017
Wagstaff K, Cardie C, Rogers S, Schrödl S (2001) Constrained k-means clustering with background knowledge. In: Proceedings of the eighteenth international conference on machine learning, ICML ’01. Morgan Kaufmann Publishers Inc, pp. 577–584
Wang X, Davidson I (2010) Flexible constrained spectral clustering. In: KDD ’10: proceedings of the 16th ACM SIGKDD international conference on knowledge discovery and data mining, pp. 563–572
Wang X, Qian B, Davidson I (2012) On constrained spectral clustering and its applications. Data Min Knowl Discov 28(1):1–30
Wang X, Wang J, Qian B, Wang F, Davidson I (2014) Self-taught spectral clustering via constraint augmentation. In: Proceedings of the 2014 SIAM international conference on data mining, Philadelphia, Pennsylvania, USA, April 24–26, pp. 416–424
Wang YX, Xu H (2016) Noisy sparse subspace clustering. J Mach Learn Res 17(1):320–360
Xu C, Tao D, Xu C (2015) Multi-view self-paced learning for clustering. In: Proceedings of the 24th international conference on artificial intelligence, IJCAI’15. AAAI Press, pp. 3974–3980
Xu Q, Desjardins M (2005) Constrained spectral clustering under a local proximity structure assumption. In: Proceedings of the 18th international conference of the Florida artificial intelligence research society (FLAIRS-05). AAAI Press
Zhang XL (2014) Convex discriminative multitask clustering. IEEE Trans Pattern Anal Mach Intell 37:28–40
Cite this article
Samal, M., Saradhi, V.V. & Nandi, S. Scalability of correlation clustering. Pattern Anal Applic 21, 703–719 (2018). https://doi.org/10.1007/s10044-017-0598-7