A Family of Two-Dimensional Benchmark Data Sets and Its Application to Comparing Different Cluster Validation Indices

  • Jorge M. Santos
  • Mark Embrechts
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8495)


This paper has two main objectives. The first is to introduce a collection of two-dimensional benchmark data sets with a wide variety of clustering characteristics that are typical of real-world data sets. These simple 2-D data sets allow the user to easily evaluate clustering solutions produced by different clustering algorithms. The second is to evaluate four commonly used cluster validation indices on these 2-D benchmark data sets. It is shown that, even for simple 2-D data sets, there is a large discrepancy in the ideal number of clusters suggested by traditional cluster validation indices. The experiments also suggest that the Dunn index and the gap statistic are the more robust cluster validation indices, even though they still fail to agree with common-sense clustering solutions in more than 50% of the cases.
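To illustrate how a cluster validation index ranks competing partitions of the same 2-D data, the following minimal sketch computes the Dunn index in its basic form: the smallest distance between points of different clusters divided by the largest within-cluster diameter. The toy points, the label vectors, and the function name `dunn_index` are assumptions made for this example; they are not taken from the benchmark collection or from the paper's experiments.

```python
import math
from itertools import combinations


def dunn_index(points, labels):
    """Dunn index: min between-cluster distance / max within-cluster diameter.

    Higher values indicate compact, well-separated clusters.
    """
    # Group the points by their cluster label.
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the Dunn index needs at least two clusters")

    # Largest distance between two points of the same cluster.
    max_diameter = max(
        (math.dist(a, b)
         for members in clusters.values()
         for a, b in combinations(members, 2)),
        default=0.0,
    )
    # Smallest distance between two points of different clusters.
    min_separation = min(
        math.dist(a, b)
        for c1, c2 in combinations(clusters.values(), 2)
        for a in c1
        for b in c2
    )
    return min_separation / max_diameter if max_diameter > 0 else math.inf


# Hypothetical toy data: two well-separated unit squares in the plane.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (6, 6)]
two_clusters = [0, 0, 0, 0, 1, 1, 1, 1]   # one cluster per square
four_clusters = [0, 0, 1, 1, 2, 2, 3, 3]  # each square split in half

score_two = dunn_index(points, two_clusters)
score_four = dunn_index(points, four_clusters)
print(f"Dunn index, k=2: {score_two:.3f}")
print(f"Dunn index, k=4: {score_four:.3f}")
```

On this toy example the index is higher for the two-cluster partition, which matches the visually obvious grouping. Choosing the number of clusters with an index amounts to repeating this comparison over the partitions produced for each candidate value of k and keeping the best-scoring one.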



Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Jorge M. Santos (1, 2)
  • Mark Embrechts (3)
  1. ISEP, School of Engineering, Polytechnic of Porto, Dept. of Mathematics, Portugal
  2. INEB, Biomedical Engineering Institute, Porto, Portugal
  3. Dept. Ind. Systems Eng., Rensselaer Polytechnic Institute, Troy, USA
