Abstract
In partition clustering, it is often impossible to know the desired number of clusters in advance, particularly when clustering large repositories of artefacts such as images. Most partition cardinality estimation methods rely on values set manually by experts, which demand considerable estimation effort without any assurance of accuracy. More generally, it is unrealistic to expect a user, even an expert, to specify the desired number of clusters accurately, because their knowledge of image content does not cover the whole repository. An archivist, however expert, cannot accurately cover millions of images that are updated every day. Faced with millions of images, should the archivist consider 1, 2, 3 or 4 clusters, or 1000, or 10000, or 999999? How should the right number of clusters be defined? It is difficult to give a generic answer to these questions. However, an expert can validate the results of a clustering process. In this paper, we present an approach that automatically estimates the best partition cardinality (the best number of clusters) in the context of content-based access to image repositories. We propose a method that drastically reduces the number of iterations needed to find the best number of clusters. The method rests on two points: reducing the number of iterations of the clustering method, and improving clustering confidence through a global variance ratio criterion.
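To make the idea of choosing a partition cardinality with a variance ratio concrete, here is a minimal sketch. It assumes the criterion is of the Calinski-Harabasz type (between-cluster over within-cluster dispersion, each normalised by its degrees of freedom), which the abstract does not confirm. The helper names (`kmeans`, `variance_ratio`, `estimate_k`), the farthest-point initialisation, and the toy data are all illustrative assumptions. The sketch scans every candidate k exhaustively, so it does not reproduce the iteration-reduction strategy the paper proposes.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of tuples."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def farthest_point_init(points, k):
    """Deterministic seeding (an assumption): start from the first point,
    then repeatedly add the point farthest from its nearest chosen center."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers


def kmeans(points, k, max_iter=100):
    """Plain Lloyd-style k-means; returns (labels, centers)."""
    centers = farthest_point_init(points, k)
    for _ in range(max_iter):
        # Assign each point to its nearest center.
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        # Recompute each center as the mean of its members (keep it if empty).
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            updated.append(centroid(members) if members else centers[j])
        if updated == centers:
            break
        centers = updated
    return labels, centers


def variance_ratio(points, labels, centers):
    """Calinski-Harabasz style ratio: (B / (k - 1)) / (W / (n - k))."""
    n, k = len(points), len(centers)
    overall = centroid(points)
    within = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    between = sum(labels.count(j) * sq_dist(centers[j], overall) for j in range(k))
    if within == 0:
        return float("inf")
    return (between / (k - 1)) / (within / (n - k))


def estimate_k(points, candidates):
    """Score every candidate k and return the best one with all scores."""
    scores = {}
    for k in candidates:
        labels, centers = kmeans(points, k)
        scores[k] = variance_ratio(points, labels, centers)
    return max(scores, key=scores.get), scores


# Toy data (hypothetical): three well-separated 3x3 grids of 2-D points.
offsets = [(dx, dy) for dx in (-0.5, 0.0, 0.5) for dy in (-0.5, 0.0, 0.5)]
blob_centers = [(0.0, 0.0), (10.0, 10.0), (20.0, 0.0)]
points = [(cx + dx, cy + dy) for cx, cy in blob_centers for dx, dy in offsets]

best_k, scores = estimate_k(points, range(2, 7))
print("estimated number of clusters:", best_k)
```

On real image repositories the points would be feature vectors extracted from image content, and the exhaustive scan over candidates is exactly the cost that an iteration-reducing method aims to avoid.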
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Fernandez, G., Djeraba, C. (2003). Partition Cardinality Estimation in Image Repositories. In: Zaïane, O.R., Simoff, S.J., Djeraba, C. (eds) Mining Multimedia and Complex Data. PAKDD 2002. Lecture Notes in Computer Science, vol 2797. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-39666-6_15
DOI: https://doi.org/10.1007/978-3-540-39666-6_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-20305-6
Online ISBN: 978-3-540-39666-6
eBook Packages: Springer Book Archive