A Heuristic for Non-convex Variance-Based Clustering Criteria

  • Rodrigo F. Toso
  • Casimir A. Kulikowski
  • Ilya B. Muchnik
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7276)

Abstract

We address the clustering problem in the context of exploratory data analysis, where data sets are investigated under different and ideally contrasting perspectives. In this scenario, where solutions are evaluated by flexible criterion functions, we introduce and evaluate a generalized, efficient version of the incremental one-by-one clustering algorithm of MacQueen (1967). Unlike the widely adopted two-phase algorithm of Lloyd (1957), our approach does not rely on the gradient of the criterion function being optimized, and so it can handle non-convex criteria. In an extensive experimental analysis on real-world data sets with a more flexible, non-convex criterion function, our algorithm produced results considerably better than those obtained with the k-means criterion, making it a valuable tool for exploratory clustering applications.
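
The abstract's central idea is to treat the clustering criterion as a black box: a point is tentatively moved from its cluster to each alternative, and the move is kept only when it improves the criterion value, so no gradient (and hence no convexity assumption) is required. The sketch below is a minimal illustration of such a one-by-one local search, not the authors' implementation; the function names, the random initialization, and the recompute-from-scratch criterion evaluation are assumptions (the paper's efficient version presumably updates the criterion incrementally rather than recomputing it for every candidate move).

```python
import numpy as np

def one_by_one_clustering(X, k, criterion, max_passes=10, seed=0):
    """MacQueen-style one-by-one local search over a black-box criterion.

    Each point is tentatively reassigned to every other cluster; a move is
    kept only if it lowers the criterion value, so no gradient of the
    criterion is ever needed.
    """
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=len(X))    # random initial partition
    best = criterion(X, labels, k)
    for _ in range(max_passes):
        improved = False
        for i in rng.permutation(len(X)):
            keep = labels[i]
            for c in range(k):
                if c == keep:
                    continue
                labels[i] = c                    # tentative move
                val = criterion(X, labels, k)    # full recomputation, for clarity only
                if val < best:
                    best, keep, improved = val, c, True
            labels[i] = keep                     # settle on the best label found
        if not improved:                         # no single move helps: local optimum
            break
    return labels, best

def within_cluster_variance(X, labels, k):
    """The classical k-means sum-of-squares criterion, as a reference example."""
    total = 0.0
    for c in range(k):
        pts = X[labels == c]
        if len(pts) > 0:
            total += ((pts - pts.mean(axis=0)) ** 2).sum()
    return total

# Example: two well-separated 2-D blobs.
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 6.0])
labels, score = one_by_one_clustering(X, k=2, criterion=within_cluster_variance)
print(labels, score)
```

Plugging in `within_cluster_variance` recovers a k-means-style search; any other variance-based criterion with the same signature, convex or not, can be substituted unchanged.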

Keywords

Criterion Function · Decision Boundary · Local Search Procedure · Clustering Criterion · Adjusted Rand Index

References

  1. Asuncion, A., Newman, D.J.: UCI Machine Learning Repository (2009)
  2. Bauman, E.V., Dorofeyuk, A.A.: Variational approach to the problem of automatic classification for a class of additive functionals. Automation and Remote Control 8, 133–141 (1978)
  3. Bock, H.-H.: Origins and extensions of the k-means algorithm in cluster analysis. Electronic Journal for History of Probability and Statistics 4(2) (2008)
  4. Bradley, P.S., Fayyad, U.M.: Refining initial points for k-means clustering. In: Proceedings of the 15th International Conference on Machine Learning, pp. 91–99. Morgan Kaufmann Publishers Inc. (1998)
  5. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification, 2nd edn. Wiley Interscience (2000)
  6. Effros, M., Schulman, L.J.: Deterministic clustering with data nets. Technical Report 04-050, Electronic Colloquium on Computational Complexity (2004)
  7. Hubert, L., Arabie, P.: Comparing partitions. Journal of Classification 2, 193–218 (1985)
  8. Kanungo, T., Mount, D.M., Netanyahu, N.S., Piatko, C.D., Silverman, R., Wu, A.Y.: An efficient k-means clustering algorithm: analysis and implementation. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(7), 881–892 (2002)
  9. Kiseleva, N.E., Muchnik, I.B., Novikov, S.G.: Stratified samples in the problem of representative types. Automation and Remote Control 47, 684–693 (1986)
  10. Likas, A., Vlassis, N., Verbeek, J.J.: The global k-means clustering algorithm. Pattern Recognition 36, 451–461 (2003)
  11. Lloyd, S.P.: Least squares quantization in PCM. Technical report, Bell Telephone Labs Memorandum (1957)
  12. Lytkin, N.I., Kulikowski, C.A., Muchnik, I.B.: Variance-based criteria for clustering and their application to the analysis of management styles of mutual funds based on time series of daily returns. Technical Report 2008-01, DIMACS (2008)
  13. MacQueen, J.: Some methods for classification and analysis of multivariate observations. In: Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 281–297. University of California Press (1967)
  14. Neyman, J.: On the two different aspects of the representative method: the method of stratified sampling and the method of purposive selection. Journal of the Royal Statistical Society 97, 558–625 (1934)
  15. Pelleg, D., Moore, A.: Accelerating exact k-means algorithms with geometric reasoning. In: Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 277–281. ACM (1999)
  16. Pelleg, D., Moore, A.: x-means: Extending k-means with efficient estimation of the number of clusters. In: Proceedings of the 17th International Conference on Machine Learning, pp. 727–734. Morgan Kaufmann Publishers Inc. (2000)
  17. Schulman, L.J.: Clustering for edge-cost minimization. In: Proceedings of the 32nd Annual ACM Symposium on Theory of Computing, pp. 547–555. ACM (2000)
  18. Späth, H.: Cluster Analysis Algorithms for Data Reduction and Classification of Objects. E. Horwood (1980)
  19. Vinh, N.X., Epps, J., Bailey, J.: Information theoretic measures for clusterings comparison: is a correction for chance necessary? In: Proceedings of the 26th Annual International Conference on Machine Learning, pp. 1073–1080. ACM (2009)

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Rodrigo F. Toso¹
  • Casimir A. Kulikowski¹
  • Ilya B. Muchnik²

  1. Department of Computer Science, Rutgers University, Piscataway, USA
  2. DIMACS, Rutgers University, Piscataway, USA
