Scalable Clustering Algorithms with Balancing Constraints

Published in Data Mining and Knowledge Discovery.

Abstract

Clustering methods for data-mining problems must be extremely scalable. In addition, several data mining applications demand that the clusters obtained be balanced, i.e., of approximately the same size or importance. In this paper, we propose a general framework for scalable, balanced clustering. The data clustering process is broken down into three steps: sampling of a small representative subset of the points, clustering of the sampled data, and populating the initial clusters with the remaining data followed by refinements. First, we show that a simple uniform sampling from the original data is sufficient to get a representative subset with high probability. While the proposed framework allows a large class of algorithms to be used for clustering the sampled set, we focus on some popular parametric algorithms for ease of exposition. We then present algorithms to populate and refine the clusters. The algorithm for populating the clusters is based on a generalization of the stable marriage problem, whereas the refinement algorithm is a constrained iterative relocation scheme. The complexity of the overall method is O(kN log N) for obtaining k balanced clusters from N data points, which compares favorably with other existing techniques for balanced clustering. In addition to providing balancing guarantees, the clustering performance obtained using the proposed framework is comparable to and often better than the corresponding unconstrained solution. Experimental results on several datasets, including high-dimensional (>20,000) ones, are provided to demonstrate the efficacy of the proposed framework.
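The populate step described above — assigning the remaining points to the initial clusters under balance constraints — can be sketched as a greedy, capacity-constrained assignment. The sketch below is a simplified stand-in for the paper's stable-marriage-based scheme, not the authors' algorithm: point-center pairs are considered in order of increasing distance, and a point joins its closest center that still has room.

```python
import math

def balanced_populate(points, centers, capacity):
    """Greedy capacity-constrained assignment (illustrative sketch).

    Assumes capacity * len(centers) >= len(points), so every point
    can be placed. Returns, for each point, the index of the cluster
    it was assigned to.
    """
    # All (distance, point, center) triples, cheapest first.
    pairs = sorted(
        (math.dist(p, c), i, j)
        for i, p in enumerate(points)
        for j, c in enumerate(centers)
    )
    assignment = [None] * len(points)
    load = [0] * len(centers)
    for _, i, j in pairs:
        # Assign each point once, never exceeding a cluster's capacity.
        if assignment[i] is None and load[j] < capacity:
            assignment[i] = j
            load[j] += 1
    return assignment
```

Unlike plain nearest-center assignment, no cluster can absorb more than `capacity` points, which is the balancing guarantee the framework is after; the paper's refinement step then improves this initial feasible solution by constrained relocations.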

Notes

  1. Note that both KMeans-type partitional algorithms and graph-partitioning approaches can be readily generalized to handle weighted objects (Banerjee et al., 2005b; Strehl and Ghosh, 2003). Our framework therefore extends easily to situations where balancing is desired on a derived quantity such as net revenue per cluster, since such situations can be handled by assigning a corresponding weight to each object.

  2. We used the sprandn function in Matlab to generate the artificial data.
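Note 2's artificial data can be approximated without Matlab. The function below is an illustrative plain-Python analogue of `sprandn` (the name and the dict-of-keys representation are choices made here, not the authors' setup): a sparse m-by-n matrix with approximately `density * m * n` normally distributed nonzeros.

```python
import random

def sprandn(m, n, density, seed=None):
    """Stdlib sketch of Matlab's sprandn: a sparse m-by-n matrix
    stored as a {(row, col): value} dict, with round(density * m * n)
    nonzero entries drawn from a standard normal distribution.
    Assumes 0 <= density <= 1."""
    rng = random.Random(seed)
    nnz = round(density * m * n)
    entries = {}
    # Draw positions until the requested number of distinct nonzeros
    # is reached; duplicates simply overwrite an earlier draw.
    while len(entries) < nnz:
        pos = (rng.randrange(m), rng.randrange(n))
        entries[pos] = rng.gauss(0.0, 1.0)
    return entries
```

In practice `scipy.sparse.random` with normally distributed values would serve the same purpose.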

References

  • Ahalt, S.C., Krishnamurthy, A.K., Chen, P., and Melton, D.E. 1990. Competitive learning algorithms for vector quantization. Neural Networks, 3(3):277–290.

  • Baeza-Yates, R. and Ribeiro-Neto, B. 1999. Modern Information Retrieval. New York: Addison Wesley.

  • Banerjee, A., Dhillon, I., Ghosh, J., and Sra, S. 2005a. Clustering on the unit hypersphere using von Mises-Fisher distributions. Journal of Machine Learning Research, 6:1345–1382.

  • Banerjee, A. and Ghosh, J. 2004. Frequency sensitive competitive learning for balanced clustering on high-dimensional hyperspheres. IEEE Transactions on Neural Networks, 15(3):702–719.

  • Banerjee, A., Merugu, S., Dhillon, I., and Ghosh, J. 2005b. Clustering with Bregman divergences. Journal of Machine Learning Research, 6:1705–1749.

  • Bennett, K.P., Bradley, P.S., and Demiriz, A. 2000. Constrained k-means clustering. Technical Report, Microsoft Research, TR-2000-65.

  • Bradley, P.S., Fayyad, U.M., and Reina, C. 1998a. Scaling clustering algorithms to large databases. In Proc. of the 4th International Conference on Knowledge Discovery and Data Mining (KDD), pp. 9–15.

  • Bradley, P.S., Fayyad, U.M., and Reina, C. 1998b. Scaling EM (Expectation-Maximization) clustering to large databases. Technical report, Microsoft Research.

  • Cover, T.M. and Thomas, J.A. 1991. Elements of Information Theory. Wiley-Interscience.

  • Cutting, D.R., Karger, D.R., Pedersen, J.O., and Tukey, J.W. 1992. Scatter/gather: A cluster-based approach to browsing large document collections. In Proc. 15th Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval, pp. 318–329.

  • Dhillon, I.S. and Modha, D.S. 2001. Concept decompositions for large sparse text data using clustering. Machine Learning, 42(1):143–175.

  • Domingos, P. and Hulten, G. 2001. A general method for scaling up machine learning algorithms and its application to clustering. In Proc. 18th Intl. Conf. Machine Learning, pp. 106–113.

  • Duda, R.O., Hart, P.E., and Stork, D.G. 2001. Pattern Classification. John Wiley & Sons.

  • Fasulo, D. 1999. An analysis of recent work on clustering. Technical report, University of Washington, Seattle.

  • Feller, W. 1967. An Introduction to Probability Theory and Its Applications. John Wiley & Sons.

  • Ghiasi, S., Srivastava, A., Yang, X., and Sarrafzadeh, M. 2002. Optimal energy aware clustering in sensor networks. Sensors, 2:258–269.

  • Ghosh, J. 2003. Scalable clustering methods for data mining. In Handbook of Data Mining, Nong Ye, (eds), Lawrence Erlbaum, pp. 247–277.

  • Guan, Y., Ghorbani, A., and Belacel, N. 2003. Y-means: A clustering method for intrusion detection. In Proc. CCECE-2003, pp. 1083–1086.

  • Guha, S., Rastogi, R., and Shim, K. 1998. CURE: An efficient clustering algorithm for large databases. In Proc. ACM SIGMOD Intl. Conf. on Management of Data, ACM, New York, pp. 73–84.

  • Gupta, G. and Younis, M. 2003. Load-balanced clustering of wireless networks. In Proc. IEEE Int'l Conf. on Communications, Vol. 3, pp. 1848–1852.

  • Gupta, G.K. and Ghosh, J. 2001. Detecting seasonal and divergent trends and visualization for very high dimensional transactional data. In Proc. 1st SIAM Intl. Conf. on Data Mining.

  • Gusfield, D. and Irving, R.W. 1989. The Stable Marriage Problem: Structure and Algorithms. MIT Press, Cambridge, MA.

  • Han, J., Kamber, M., and Tung, A.K.H. 2001. Spatial clustering methods in data mining: A survey. In Geographic Data Mining and Knowledge Discovery, Taylor and Francis.

  • Jain, A.K., Murty, M.N., and Flynn, P.J. 1999. Data clustering: A review. ACM Computing Surveys, 31(3):264–323.

  • Karypis, G. and Kumar, V. 1998. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing, 20(1):359–392.

  • Kearns, M., Mansour, Y., and Ng, A. 1997. An information-theoretic analysis of hard and soft assignment methods for clustering. In Proc. of the 13th Annual Conference on Uncertainty in Artificial Intelligence (UAI), pp. 282–293.

  • Lynch, P.J. and Horton, S. 2002. Web Style Guide: Basic Design Principles for Creating Web Sites. Yale University Press.

  • MacQueen, J. 1967. Some methods for classification and analysis of multivariate observations. In Proc. 5th Berkeley Symp. Math. Statist. Prob., 1:281–297.

  • Modha, D. and Spangler, S. 2003. Feature weighting in k-means clustering. Machine Learning, 52(3):217–237.

  • Motwani, R. and Raghavan, P. 1995. Randomized Algorithms. Cambridge University Press.

  • Nielsen Marketing Research. 1993. Category Management: Positioning Your Organization to Win. McGraw-Hill.

  • Palmer, C.R. and Faloutsos, C. 1999. Density biased sampling: An improved method for data mining and clustering. Technical report, Carnegie Mellon University.

  • Papoulis, A. 1991. Probability, Random Variables and Stochastic Processes. McGraw Hill.

  • Singh, P. 2005. Personal Communication at KDD05.

  • Strehl, A. and Ghosh, J. 2003. Relationship-based clustering and visualization for high-dimensional data mining. INFORMS Journal on Computing, 15(2):208–230.

  • Tung, A.K.H., Ng, R.T., Lakshmanan, L.V.S., and Han, J. 2001. Constraint-based clustering in large databases. In Proc. Intl. Conf. on Database Theory (ICDT'01).

  • Yang, Y. and Padmanabhan, B. 2003. Segmenting customer transactions using a pattern-based clustering approach. In Proceedings of ICDM, pp. 411–419.

  • Zhang, T., Ramakrishnan, R., and Livny, M. 1996. BIRCH: An efficient data clustering method for very large databases. In Proc. ACM SIGMOD Intl. Conf. on Management of Data, Montreal, ACM, pp. 103–114.

Acknowledgment

The research was supported in part by NSF grants IIS 0325116, IIS 0307792, and an IBM PhD fellowship.

Author information

Corresponding author

Correspondence to Arindam Banerjee.

Cite this article

Banerjee, A., Ghosh, J. Scalable Clustering Algorithms with Balancing Constraints. Data Min Knowl Disc 13, 365–395 (2006). https://doi.org/10.1007/s10618-006-0040-z
