Abstract
Traditional clustering quality indexes (Inertia, DB, …) are known to be method-dependent, and in several cases they fail to estimate the quality of a clustering properly, notably for complex data such as textual data. We therefore propose an alternative approach to clustering quality evaluation based on unsupervised measures of Recall, Precision and F-measure that exploit the descriptors of the data associated with the obtained clusters. Two categories of index are proposed: Macro and Micro indexes. This paper also focuses on the construction of a new cumulative Micro precision index that makes it possible to evaluate the overall quality of a clustering result while clearly distinguishing between homogeneous results and heterogeneous, or degenerated, ones. The behavior of the classical indexes is compared experimentally with that of our new approach on a polythematic dataset of bibliographical references drawn from the PASCAL database.
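The abstract does not give the formulas behind these indexes, so the following is only an illustrative sketch of how descriptor-based, unsupervised Recall, Precision and F-measure might be computed, and how Macro and Micro averaging differ. Everything in it is an assumption rather than the paper's definition: the function name `unsupervised_prf`, the use of binary descriptors, and the rule that a descriptor is "peculiar" to the cluster in which it occurs most often. The cumulative Micro precision index is not sketched here.

```python
from collections import defaultdict


def unsupervised_prf(docs, labels):
    """Return macro and micro (recall, precision, F) for a clustering.

    docs   : list of sets of descriptors (binary presence only, an assumption)
    labels : cluster label of each document, in the same order as docs
    """
    members = defaultdict(list)
    for doc, label in zip(docs, labels):
        members[label].append(doc)

    # count[p][c] = number of documents of cluster c that carry descriptor p
    count = defaultdict(lambda: defaultdict(int))
    for label, cluster_docs in members.items():
        for doc in cluster_docs:
            for p in doc:
                count[p][label] += 1

    # Assumption: a descriptor is "peculiar" to the cluster in which it occurs
    # most often; ties go to the smallest label.
    scores = defaultdict(list)  # cluster -> [(recall, precision), ...]
    for p, per_cluster in count.items():
        best = max(sorted(per_cluster), key=lambda c: per_cluster[c])
        hits = per_cluster[best]
        recall = hits / sum(per_cluster.values())   # share of p's carriers in the cluster
        precision = hits / len(members[best])       # share of the cluster carrying p
        scores[best].append((recall, precision))

    if not scores:
        zero = (0.0, 0.0, 0.0)
        return {"macro": zero, "micro": zero}

    def with_f(recall, precision):
        total = recall + precision
        f = 2 * recall * precision / total if total else 0.0
        return recall, precision, f

    def mean(values):
        values = list(values)
        return sum(values) / len(values)

    # Macro: average within each cluster first, then across clusters
    # (clusters without any peculiar descriptor are skipped).
    macro_r = mean(mean(r for r, _ in s) for s in scores.values())
    macro_p = mean(mean(p for _, p in s) for s in scores.values())

    # Micro: average directly over every (cluster, descriptor) pair.
    flat = [pair for s in scores.values() for pair in s]
    micro_r = mean(r for r, _ in flat)
    micro_p = mean(p for _, p in flat)

    return {"macro": with_f(macro_r, macro_p), "micro": with_f(micro_r, micro_p)}


# Toy usage: four documents described by keywords, split into two clusters.
docs = [{"a", "b"}, {"a"}, {"c"}, {"c", "a"}]
labels = [0, 0, 1, 1]
result = unsupervised_prf(docs, labels)
```

In this sketch the Macro variant gives every cluster the same weight, while the Micro variant gives every peculiar descriptor the same weight, so a clustering with one large heterogeneous cluster and many tiny ones can score differently under the two schemes.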
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Lamirel, JC., Cuxac, P., Mall, R., Safi, G. (2012). A New Efficient and Unbiased Approach for Clustering Quality Evaluation. In: Cao, L., Huang, J.Z., Bailey, J., Koh, Y.S., Luo, J. (eds) New Frontiers in Applied Data Mining. PAKDD 2011. Lecture Notes in Computer Science, vol 7104. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28320-8_18
DOI: https://doi.org/10.1007/978-3-642-28320-8_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-28319-2
Online ISBN: 978-3-642-28320-8
eBook Packages: Computer Science (R0)