Artificial Intelligence Review, Volume 41, Issue 1, pp 27–48

Effects of resampling method and adaptation on clustering ensemble efficacy

  • Behrouz Minaei-Bidgoli
  • Hamid Parvin
  • Hamid Alinejad-Rokny
  • Hosein Alizadeh
  • William F. Punch

Abstract

Clustering ensembles combine multiple partitions of data into a single clustering solution of better quality. Inspired by the success of supervised bagging and boosting algorithms, we propose non-adaptive and adaptive resampling schemes for the integration of multiple independent and dependent clusterings. We investigate the effectiveness of bagging techniques, comparing the efficacy of sampling with and without replacement, in conjunction with several consensus algorithms. In our adaptive approach, individual partitions in the ensemble are sequentially generated by clustering specially selected subsamples of the given dataset. The sampling probability for each data point dynamically depends on the consistency of its previous assignments in the ensemble. New subsamples are then drawn to increasingly focus on the problematic regions of the input feature space. A measure of data point clustering consistency is therefore defined to guide this adaptation. Experimental results show improved stability and accuracy for clustering structures obtained via bootstrapping, subsampling, and adaptive techniques. A meaningful consensus partition for an entire set of data points emerges from multiple clusterings of bootstraps and subsamples. Subsamples of small size can reduce computational cost and measurement complexity for many unsupervised data mining tasks with distributed sources of data. This empirical study also compares the performance of adaptive and non-adaptive clustering ensembles using different consensus functions on a number of datasets. By focusing attention on the data points with the least consistent clustering assignments, one can better approximate the inter-cluster boundaries, or at least create diversity in the boundaries, which improves clustering accuracy and convergence speed as a function of the number of partitions in the ensemble. The comparison of adaptive and non-adaptive approaches is a new avenue for research, and this study helps to pave the way for the useful application of distributed data mining methods.
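
The resampling schemes summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the base clusterer (a plain k-means), the consensus function (linking pairs by their co-association rate), the consistency score, the weight formula with its 0.05 floor, and every function and parameter name below are assumptions chosen for illustration. The abstract does not give the paper's own consistency measure, so the score used here is only a hypothetical proxy. As a simplification, the adaptive branch always draws with replacement.

```python
import random
from itertools import combinations


def kmeans(points, k, rng, iters=20):
    """Plain Lloyd's k-means on equal-length tuples; returns one label per point."""
    centers = rng.sample(sorted(set(points)), k)  # k distinct starting centers
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [min(range(k),
                      key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
                  for p in points]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its old center
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def consistency(together, seen, i):
    """Hypothetical proxy: 1.0 if point i's co-association rates are all 0 or 1,
    0.0 if they are all 0.5 (or if i has not been co-sampled with anything yet)."""
    scores = []
    for j in range(len(seen)):
        a, b = min(i, j), max(i, j)
        if a != b and seen[a][b]:
            scores.append(abs(2 * together[a][b] / seen[a][b] - 1))
    return sum(scores) / len(scores) if scores else 0.0


def build_ensemble(data, k, n_partitions, size, rng, replace=True, adaptive=False):
    """Cluster resampled subsets and accumulate pairwise co-association counts."""
    n = len(data)
    together = [[0] * n for _ in range(n)]  # times i and j shared a cluster (i < j)
    seen = [[0] * n for _ in range(n)]      # times i and j shared a sample (i < j)
    cons = [0.0] * n                        # no evidence yet, so weights start uniform
    for _ in range(n_partitions):
        if adaptive:   # weighted draw with replacement, favoring inconsistent points
            idx = rng.choices(range(n), weights=[0.05 + 1 - c for c in cons], k=size)
        elif replace:  # bootstrap: uniform draw with replacement
            idx = rng.choices(range(n), k=size)
        else:          # subsample: uniform draw without replacement
            idx = rng.sample(range(n), size)
        labels = kmeans([data[i] for i in idx], k, rng)
        label_of = dict(zip(idx, labels))  # copies of a point share one label
        for i, j in combinations(sorted(label_of), 2):
            seen[i][j] += 1
            together[i][j] += label_of[i] == label_of[j]
        cons = [consistency(together, seen, i) for i in range(n)]
    return together, seen, cons


def consensus(together, seen, threshold=0.5):
    """Link pairs co-clustered in more than `threshold` of their shared samples,
    then return the connected components as the consensus partition."""
    n = len(seen)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        if seen[i][j] and together[i][j] / seen[i][j] > threshold:
            parent[find(i)] = find(j)
    return [find(i) for i in range(n)]


# Two well-separated 1-D groups, clustered from subsamples drawn without replacement.
rng = random.Random(0)
data = [(x / 10,) for x in range(10)] + [(100 + x / 10,) for x in range(10)]
together, seen, cons = build_ensemble(data, k=2, n_partitions=30, size=14,
                                      rng=rng, replace=False)
labels = consensus(together, seen)
```

Passing `replace=True` switches the same routine to bootstrap samples, and `adaptive=True` reweights each new draw toward points whose co-association rates are still ambiguous. Note that no single subsample covers all 20 points here, yet the accumulated pairwise counts still yield a partition of the whole dataset.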

Keywords

Clustering ensembles · Consensus functions · Distributed data mining · Bootstrap · Subsampling · Adaptive clustering

Copyright information

© Springer Science+Business Media B.V. 2011

Authors and Affiliations

  • Behrouz Minaei-Bidgoli (1)
  • Hamid Parvin (1)
  • Hamid Alinejad-Rokny (2, 4), Email author
  • Hosein Alizadeh (1)
  • William F. Punch (3)

  1. Department of Computer Engineering, Iran University of Science and Technology, Tehran, Iran
  2. Department of Computer Engineering, Science and Research Branch, Islamic Azad University, Tehran, Iran
  3. Department of Computer Science and Engineering, Michigan State University, East Lansing, USA
  4. Ghaemshahr, Iran
