Scalable Clustering Algorithms with Balancing Constraints

Published in Data Mining and Knowledge Discovery.

Abstract

Clustering methods for data-mining problems must be extremely scalable. In addition, several data mining applications demand that the clusters obtained be balanced, i.e., of approximately the same size or importance. In this paper, we propose a general framework for scalable, balanced clustering. The data clustering process is broken down into three steps: sampling of a small representative subset of the points, clustering of the sampled data, and populating the initial clusters with the remaining data followed by refinements. First, we show that a simple uniform sampling from the original data is sufficient to get a representative subset with high probability. While the proposed framework allows a large class of algorithms to be used for clustering the sampled set, we focus on some popular parametric algorithms for ease of exposition. We then present algorithms to populate and refine the clusters. The algorithm for populating the clusters is based on a generalization of the stable marriage problem, whereas the refinement algorithm is a constrained iterative relocation scheme. The complexity of the overall method is O(kN log N) for obtaining k balanced clusters from N data points, which compares favorably with other existing techniques for balanced clustering. In addition to providing balancing guarantees, the clustering performance obtained using the proposed framework is comparable to and often better than the corresponding unconstrained solution. Experimental results on several datasets, including high-dimensional (>20,000) ones, are provided to demonstrate the efficacy of the proposed framework.
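The populate step described above — assigning the remaining points to the initial clusters under balance constraints — can be sketched as a greedy, capacity-constrained assignment. The sketch below is a simplified stand-in for the paper's stable-marriage-based scheme, not the authors' algorithm: point-center pairs are considered in order of increasing distance, and a point joins its closest center that still has room.

```python
import math

def balanced_populate(points, centers, capacity):
    """Greedy capacity-constrained assignment (illustrative sketch).

    Assumes capacity * len(centers) >= len(points), so every point
    can be placed. Returns, for each point, the index of the cluster
    it was assigned to.
    """
    # All (distance, point, center) triples, cheapest first.
    pairs = sorted(
        (math.dist(p, c), i, j)
        for i, p in enumerate(points)
        for j, c in enumerate(centers)
    )
    assignment = [None] * len(points)
    load = [0] * len(centers)
    for _, i, j in pairs:
        # Assign each point once, never exceeding a cluster's capacity.
        if assignment[i] is None and load[j] < capacity:
            assignment[i] = j
            load[j] += 1
    return assignment
```

Unlike plain nearest-center assignment, no cluster can absorb more than `capacity` points, which is the balancing guarantee the framework is after; the paper's refinement step then improves this initial feasible solution by constrained relocations.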

Notes

  1. Note that both KMeans-type partitional algorithms and graph-partitioning approaches can be readily generalized to handle weighted objects (Banerjee et al., 2005b; Strehl and Ghosh, 2003). Our framework therefore extends easily to situations where balancing is desired on a derived quantity such as net revenue per cluster, since such situations can be handled by assigning a corresponding weight to each object.

  2. We used the sprandn function in Matlab to generate the artificial data.
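Note 2's artificial data can be approximated without Matlab. The function below is an illustrative plain-Python analogue of `sprandn` (the name and the dict-of-keys representation are choices made here, not the authors' setup): a sparse m-by-n matrix with approximately `density * m * n` normally distributed nonzeros.

```python
import random

def sprandn(m, n, density, seed=None):
    """Stdlib sketch of Matlab's sprandn: a sparse m-by-n matrix
    stored as a {(row, col): value} dict, with round(density * m * n)
    nonzero entries drawn from a standard normal distribution.
    Assumes 0 <= density <= 1."""
    rng = random.Random(seed)
    nnz = round(density * m * n)
    entries = {}
    # Draw positions until the requested number of distinct nonzeros
    # is reached; duplicates simply overwrite an earlier draw.
    while len(entries) < nnz:
        pos = (rng.randrange(m), rng.randrange(n))
        entries[pos] = rng.gauss(0.0, 1.0)
    return entries
```

In practice `scipy.sparse.random` with normally distributed values would serve the same purpose.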

References

  • Ahalt, S.C., Krishnamurthy, A.K., Chen, P., and Melton, D.E. 1990. Competitive learning algorithms for vector quantization. Neural Networks, 3(3):277–290.

  • Baeza-Yates, R. and Ribeiro-Neto, B. 1999. Modern Information Retrieval. New York: Addison Wesley.

  • Banerjee, A., Dhillon, I., Ghosh, J., and Sra, S. 2005a. Clustering on the unit hypersphere using von Mises-Fisher distributions. Journal of Machine Learning Research, 6:1345–1382.

  • Banerjee, A. and Ghosh, J. 2004. Frequency sensitive competitive learning for balanced clustering on high-dimensional hyperspheres. IEEE Transactions on Neural Networks, 15(3):702–719.

  • Banerjee, A., Merugu, S., Dhillon, I., and Ghosh, J. 2005b. Clustering with Bregman divergences. Journal of Machine Learning Research, 6:1705–1749.

  • Bennett, K.P., Bradley, P.S., and Demiriz, A. 2000. Constrained k-means clustering. Technical Report, Microsoft Research, TR-2000-65.

  • Bradley, P.S., Fayyad, U.M., and Reina, C. 1998a. Scaling clustering algorithms to large databases. In Proc. of the 4th International Conference on Knowledge Discovery and Data Mining (KDD), pp. 9–15.

  • Bradley, P.S., Fayyad, U.M., and Reina, C. 1998b. Scaling EM (Expectation-Maximization) clustering to large databases. Technical report, Microsoft Research.

  • Cover, T.M. and Thomas, J.A. 1991. Elements of Information Theory. Wiley-Interscience.

  • Cutting, D.R., Karger, D.R., Pedersen, J.O., and Tukey, J.W. 1992. Scatter/gather: A cluster-based approach to browsing large document collections. In Proc. 15th Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval, pp. 318–329.

  • Dhillon, I.S. and Modha, D.S. 2001. Concept decompositions for large sparse text data using clustering. Machine Learning, 42(1):143–175.

  • Domingos, P. and Hulten, G. 2001. A general method for scaling up machine learning algorithms and its application to clustering. In Proc. 18th Intl. Conf. Machine Learning, pp. 106–113.

  • Duda, R.O., Hart, P.E., and Stork, D.G. 2001. Pattern Classification. John Wiley & Sons.

  • Fasulo, D. 1999. An analysis of recent work on clustering. Technical report, University of Washington, Seattle.

  • Feller, W. 1967. An Introduction to Probability Theory and Its Applications. John Wiley & Sons.

  • Ghiasi, S., Srivastava, A., Yang, X., and Sarrafzadeh, M. 2002. Optimal energy aware clustering in sensor networks. Sensors, 2:258–269.

  • Ghosh, J. 2003. Scalable clustering methods for data mining. In Handbook of Data Mining, Nong Ye, (eds), Lawrence Erlbaum, pp. 247–277.

  • Guan, Y., Ghorbani, A., and Belacel, N. 2003. Y-means: A clustering method for intrusion detection. In Proc. CCECE-2003, pp. 1083–1086.

  • Guha, S., Rastogi, R., and Shim, K. 1998. CURE: An efficient clustering algorithm for large databases. In Proc. ACM SIGMOD Intl. Conf. on Management of Data, ACM, New York, pp. 73–84.

  • Gupta, G. and Younis, M. 2003. Load-balanced clustering of wireless networks. In Proc. IEEE Int'l Conf. on Communications, Vol. 3, pp. 1848–1852.

  • Gupta, G.K. and Ghosh, J. 2001. Detecting seasonal and divergent trends and visualization for very high dimensional transactional data. In Proc. 1st SIAM Intl. Conf. on Data Mining.

  • Gusfield, D. and Irving, R.W. 1989. The Stable Marriage Problem: Structure and Algorithms. MIT Press, Cambridge, MA.

  • Han, J., Kamber, M., and Tung, A.K.H. 2001. Spatial clustering methods in data mining: A survey. In Geographic Data Mining and Knowledge Discovery, Taylor and Francis.

  • Jain, A.K., Murty, M.N., and Flynn, P.J. 1999. Data clustering: A review. ACM Computing Surveys, 31(3):264–323.

  • Karypis, G. and Kumar, V. 1998. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing, 20(1):359–392.

  • Kearns, M., Mansour, Y., and Ng, A. 1997. An information-theoretic analysis of hard and soft assignment methods for clustering. In Proc. of the 13th Annual Conference on Uncertainty in Artificial Intelligence (UAI), pp. 282–293.

  • Lynch, P.J. and Horton, S. 2002. Web Style Guide: Basic Design Principles for Creating Web Sites. Yale University Press.

  • MacQueen, J. 1967. Some methods for classification and analysis of multivariate observations. In Proc. 5th Berkeley Symp. Math. Statist. Prob., 1:281–297.

  • Modha, D. and Spangler, S. 2003. Feature weighting in k-means clustering. Machine Learning, 52(3):217–237.

  • Motwani, R. and Raghavan, P. 1995. Randomized Algorithms. Cambridge University Press.

  • Nielsen Marketing Research. 1993. Category Management: Positioning Your Organization to Win. McGraw-Hill.

  • Palmer, C.R. and Faloutsos, C. 1999. Density biased sampling: An improved method for data mining and clustering. Technical report, Carnegie Mellon University.

  • Papoulis, A. 1991. Probability, Random Variables and Stochastic Processes. McGraw Hill.

  • Singh, P. 2005. Personal Communication at KDD05.

  • Strehl, A. and Ghosh, J. 2003. Relationship-based clustering and visualization for high-dimensional data mining. INFORMS Journal on Computing, 15(2):208–230.

  • Tung, A.K.H., Ng, R.T., Lakshmanan, L.V.S., and Han, J. 2001. Constraint-based clustering in large databases. In Proc. Intl. Conf. on Database Theory (ICDT'01).

  • Yang, Y. and Padmanabhan, B. 2003. Segmenting customer transactions using a pattern-based clustering approach. In Proceedings of ICDM, pp. 411–419.

  • Zhang, T., Ramakrishnan, R., and Livny, M. 1996. BIRCH: An efficient data clustering method for very large databases. In Proc. ACM SIGMOD Intl. Conf. on Management of Data, Montreal, ACM, pp. 103–114.

Acknowledgment

The research was supported in part by NSF grants IIS 0325116, IIS 0307792, and an IBM PhD fellowship.

Author information

Corresponding author

Correspondence to Arindam Banerjee.

Cite this article

Banerjee, A., Ghosh, J. Scalable Clustering Algorithms with Balancing Constraints. Data Min Knowl Disc 13, 365–395 (2006). https://doi.org/10.1007/s10618-006-0040-z
