Abstract
Clustering is the problem of grouping data based on similarity. While this problem has attracted the attention of many researchers for many years, we are witnessing a resurgence of interest in new clustering techniques. In this paper we discuss some very recent clustering approaches and recount our experience with some of these algorithms. We also present the problem of clustering in the presence of constraints and discuss the issue of clustering validation.
Copyright information
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Zaïane, O.R., Foss, A., Lee, CH., Wang, W. (2002). On Data Clustering Analysis: Scalability, Constraints, and Validation. In: Chen, MS., Yu, P.S., Liu, B. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2002. Lecture Notes in Computer Science, vol 2336. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-47887-6_4
DOI: https://doi.org/10.1007/3-540-47887-6_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-43704-8
Online ISBN: 978-3-540-47887-4
eBook Packages: Springer Book Archive