Abstract
By analogy with clinical trials, the datasets in a benchmark experiment based on real data can be seen as playing the role of patients, and the compared methods as playing the role of treatments. This view of benchmark experiments, which has already been suggested in the literature, highlights the importance of statistical concepts such as testing, confidence intervals, power calculation, and sampling procedures for the interpretation of benchmarking results. In this paper we propose an application of these concepts to the special case of benchmark experiments comparing clustering algorithms. We present a simple exemplary benchmarking study comparing two classical clustering algorithms on 50 high-dimensional gene expression datasets and discuss the interpretation of its results from a critical statistical perspective. The R code implementing the analyses presented in this paper is freely available from: http://www.ibe.med.uni-muenchen.de/organisation/mitarbeiter/020_professuren/boulesteix/boulesteixhatz.
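The datasets-as-patients view can be made concrete with a small sketch. It is not the authors' R code, which is available at the URL above; it is a minimal Python illustration under stated assumptions. Each dataset is assumed to yield one performance score per method (for example, an agreement index between the clustering and a reference partition), the scores shown are made-up illustrative numbers, and the function name `paired_summary` is hypothetical. The sketch computes the mean per-dataset difference, a paired t statistic, and a simple percentile bootstrap confidence interval over datasets.

```python
import math
import random
import statistics


def paired_summary(scores_a, scores_b, n_boot=2000, alpha=0.05, seed=0):
    """Summarise per-dataset differences between two methods.

    Each dataset contributes one paired observation, much like a patient
    who receives both treatments in a crossover trial.
    """
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long score lists with >= 2 datasets")

    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation

    # Paired t statistic for H0: mean difference = 0 (n - 1 degrees of freedom).
    t_stat = mean_diff / (sd_diff / math.sqrt(n)) if sd_diff > 0 else float("nan")

    # Percentile bootstrap: resample datasets with replacement, as one would
    # resample patients, and take empirical quantiles of the resampled means.
    rng = random.Random(seed)
    boot_means = sorted(
        statistics.mean(rng.choices(diffs, k=n)) for _ in range(n_boot)
    )
    lower = boot_means[int((alpha / 2) * n_boot)]
    upper = boot_means[int((1 - alpha / 2) * n_boot) - 1]

    return {
        "n": n,
        "mean_diff": mean_diff,
        "t_stat": t_stat,
        "ci_lower": lower,
        "ci_upper": upper,
    }


# Hypothetical scores for six datasets (illustrative numbers only).
method_a = [0.62, 0.40, 0.55, 0.31, 0.70, 0.48]
method_b = [0.58, 0.42, 0.47, 0.25, 0.66, 0.49]
result = paired_summary(method_a, method_b)
print(result)
```

The interval here describes uncertainty about the mean difference over the population from which the datasets are assumed to be sampled, which is why the sampling procedure for datasets matters for interpretation. The percentile bootstrap was chosen for brevity; other interval constructions could be substituted.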
Acknowledgements
We thank Sarah Tegenfeldt for language correction and the IFCS Task Force on Benchmarking, in particular Iven van Mechelen, for very fruitful discussions on the topics of our paper.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Boulesteix, AL., Hatz, M. (2017). Benchmarking for Clustering Methods Based on Real Data: A Statistical View. In: Palumbo, F., Montanari, A., Vichi, M. (eds) Data Science. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Cham. https://doi.org/10.1007/978-3-319-55723-6_6
DOI: https://doi.org/10.1007/978-3-319-55723-6_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-55722-9
Online ISBN: 978-3-319-55723-6
eBook Packages: Mathematics and Statistics (R0)