Data Mining and Knowledge Discovery, Volume 13, Issue 3, pp 365–395

Scalable Clustering Algorithms with Balancing Constraints

Abstract

Clustering methods for data mining problems must be extremely scalable. In addition, several data mining applications demand that the clusters obtained be balanced, i.e., of approximately the same size or importance. In this paper, we propose a general framework for scalable, balanced clustering. The data clustering process is broken down into three steps: sampling a small representative subset of the points, clustering the sampled data, and populating the initial clusters with the remaining data, followed by refinements. First, we show that simple uniform sampling from the original data is sufficient to obtain a representative subset with high probability. While the proposed framework allows a large class of algorithms to be used for clustering the sampled set, we focus on some popular parametric algorithms for ease of exposition. We then present algorithms to populate and refine the clusters. The algorithm for populating the clusters is based on a generalization of the stable marriage problem, whereas the refinement algorithm is a constrained iterative relocation scheme. The complexity of the overall method is O(kN log N) for obtaining k balanced clusters from N data points, which compares favorably with other existing techniques for balanced clustering. In addition to providing balancing guarantees, the framework achieves clustering performance that is comparable to, and often better than, that of the corresponding unconstrained solution. Experimental results on several datasets, including high-dimensional ones (more than 20,000 dimensions), are provided to demonstrate the efficacy of the proposed framework.
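The three steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the function names (`kmeans`, `populate_balanced`, `balanced_clustering`), the use of plain k-means on the sample, and the use of an upper bound of ceil(N/k) points per cluster as the balancing constraint are all assumptions made for illustration. The populate step uses capacity-constrained deferred acceptance, one well-known generalization of the stable marriage problem, and the refinement step is omitted.

```python
import heapq
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20, seed=0):
    """Plain k-means on a (small) sample; returns k centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            groups[j].append(p)
        # Recompute each center as its group's mean; keep the old center if empty.
        centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[j]
            for j, g in enumerate(groups)
        ]
    return centers


def populate_balanced(points, centers, capacity):
    """Assign every point to a cluster holding at most `capacity` points.

    Points "propose" to clusters from nearest to farthest; a cluster that
    exceeds its capacity rejects the point farthest from its center.
    (Assumed scheme: deferred acceptance with capacities.)
    """
    k = len(centers)
    if k * capacity < len(points):
        raise ValueError("total capacity is smaller than the number of points")
    prefs = [sorted(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
    next_choice = [0] * len(points)
    held = [[] for _ in range(k)]  # per-cluster max-heap of (-distance, point index)
    free = list(range(len(points)))
    while free:
        i = free.pop()
        c = prefs[i][next_choice[i]]
        next_choice[i] += 1
        heapq.heappush(held[c], (-sq_dist(points[i], centers[c]), i))
        if len(held[c]) > capacity:
            _, rejected = heapq.heappop(held[c])  # farthest point is rejected
            free.append(rejected)
    labels = [None] * len(points)
    for c, heap in enumerate(held):
        for _, i in heap:
            labels[i] = c
    return labels


def balanced_clustering(points, k, sample_size, seed=0):
    """Step 1: uniform sample. Step 2: cluster the sample. Step 3: populate."""
    rng = random.Random(seed)
    sample = rng.sample(points, min(sample_size, len(points)))
    centers = kmeans(sample, k, seed=seed)
    capacity = math.ceil(len(points) / k)
    return populate_balanced(points, centers, capacity)
```

Calling `balanced_clustering(points, k=3, sample_size=30)` on a list of 90 coordinate tuples returns one label per point, and because the capacity is then exactly N/k, each of the three clusters receives 30 points. In this sketch every point proposes to each cluster at most once, so the populate step performs at most kN heap operations.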

Keywords

scalable clustering; balanced clustering; constrained clustering; sampling; stable marriage problem; text clustering

Copyright information

© Springer Science+Business Media, Inc. 2006

Authors and Affiliations

  1. Department of Computer Science and Engineering, University of Minnesota, Twin Cities, Minneapolis, USA
  2. Department of Electrical and Computer Engineering, College of Engineering, University of Texas at Austin, Austin, USA
